OpenJudge: A Unified Framework for Holistic Evaluation and Quality Rewards
```shell
# Add to your Claude Code skills
git clone https://github.com/agentscope-ai/OpenJudge
```

Website | Try Online | Documentation | Contributing | 中文
OpenJudge is an open-source evaluation framework for AI applications (e.g., AI agents or chatbots) designed to evaluate quality and drive continuous application optimization.
In practice, application excellence depends on a trustworthy evaluation workflow: Collect test data → Define graders → Run evaluation at scale → Analyze weaknesses → Iterate quickly.
OpenJudge provides ready-to-use graders and supports generating scenario-specific rubrics (as graders), making this workflow simpler, more professional, and easy to integrate into your existing pipeline. It can also convert grading results into reward signals to help you optimize your application.
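As a rough illustration of the reward-signal idea, a grader score can be normalized onto a [0, 1] range before being fed to an optimizer. This is a minimal standalone sketch; the 1-5 score scale and the helper name are illustrative assumptions, not part of the OpenJudge API:

```python
def score_to_reward(score: float, min_score: float = 1.0, max_score: float = 5.0) -> float:
    """Map a grader score (assumed 1-5 scale) onto a [0, 1] reward.

    Hypothetical helper for illustration; OpenJudge's actual
    reward conversion may differ.
    """
    # Clamp out-of-range scores so the reward stays in [0, 1].
    clamped = max(min_score, min(max_score, score))
    return (clamped - min_score) / (max_score - min_score)


print(score_to_reward(4))  # 0.75
```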
Try it now! Visit openjudge.me/app to use graders online, no installation required. Test built-in graders, build custom rubrics, and explore evaluation results directly in your browser.
2026-04-07 - Skill Graders - 5 new LLM-based graders for evaluating AI Agent Skill packages: threat analysis (AITech taxonomy), declaration alignment, completeness, relevance, and design quality. Documentation | Cookbook
2026-03-10 - New Skills - Claude authenticity verification, find skills combo, and more. Browse Skills
2026-02-12 - Reference Hallucination Arena - Benchmark for evaluating LLM academic reference hallucination. Documentation | Leaderboard
2026-01-27 - Paper Review - Automatically review academic papers using LLM-powered evaluation. Documentation
2026-01-27 - OpenJudge UI - A Streamlit-based visual interface for grader testing and Auto Arena. Try Online | Run locally: `streamlit run ui/app.py`
Access 50+ production-ready graders featuring a comprehensive taxonomy, rigorously validated for reliable performance.
Focus: Semantic quality, functional correctness, structural compliance

Key Graders:
- Relevance - Semantic relevance scoring
- Similarity - Text similarity measurement
- Syntax Check - Code syntax validation
- JSON Match - Structure compliance

Focus: Agent lifecycle, tool calling, memory, plan feasibility, trajectory quality

Key Graders:
- Tool Selection - Tool choice accuracy
- Memory - Context preservation
- Plan - Strategy feasibility
- Trajectory - Path optimization

Focus: Image-text coherence, visual generation quality, image helpfulness

Key Graders:
- Image Coherence - Visual-text alignment
- Text-to-Image - Generation quality
- Image Helpfulness - Image contribution

Choose the build method that fits your requirements:
Using mainstream observability platforms like LangSmith or Langfuse? We offer seamless integration to enhance their evaluators and automated evaluation capabilities. We also provide integrations with training frameworks like VERL for RL training. See Integrations for details.
Explore OpenJudge without writing a single line of code. Our online platform at openjudge.me/app lets you:
Tip: Don't want to install anything? Try OpenJudge online and use graders directly in your browser, no setup needed.
```shell
pip install py-openjudge
```
More installation methods and the complete quickstart can be found in the Quickstart Guide.
A simple example to evaluate a single response:
```python
import asyncio

from openjudge.models import OpenAIChatModel
from openjudge.graders.common.relevance import RelevanceGrader


async def main():
    # 1. Create model client
    model = OpenAIChatModel(model="qwen3-32b")

    # 2. Initialize grader
    grader = RelevanceGrader(model=model)

    # 3. Prepare data
    data = {
        "query": "What is machine learning?",
        "response": "Machine learning is a subset of AI that enables computers to learn from data.",
    }

    # 4. Evaluate
    result = await grader.aevaluate(**data)
    print(f"Score: {result.score}")  # Score: 4
    print(f"Reason: {result.reason}")


if __name__ == "__main__":
    asyncio.run(main())
```
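Because `aevaluate` is a coroutine, many items can be graded concurrently with `asyncio.gather`. The sketch below shows only the batching pattern, using a stub grader in place of a real OpenJudge grader; the stub class and its canned scores are assumptions for illustration:

```python
import asyncio
from dataclasses import dataclass


@dataclass
class GradeResult:
    score: int
    reason: str


class StubGrader:
    """Stand-in for an OpenJudge grader; returns a canned result."""

    async def aevaluate(self, query: str, response: str) -> GradeResult:
        await asyncio.sleep(0)  # a real grader awaits an LLM call here
        return GradeResult(score=len(response) % 5 + 1, reason="stub")


async def grade_batch(grader, items):
    # Launch one evaluation per item and await them all concurrently.
    return await asyncio.gather(*(grader.aevaluate(**item) for item in items))


items = [
    {"query": "q1", "response": "a"},
    {"query": "q2", "response": "bb"},
]
results = asyncio.run(grade_batch(StubGrader(), items))
print([r.score for r in results])  # [2, 3]
```

The same pattern should apply to a real grader instance, subject to whatever rate limits the underlying model client enforces.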
Use multiple built-in graders to comprehensively evaluate your LLM application. Explore all built-in graders in the documentation.
Business Scenario: Evaluating an e-commerce customer service agent that handles order inquiries. We assess the agent's performance across three dimensions: relevance, hallucination, and tool selection.
```python
import asyncio

from openjudge.models import OpenAIChatModel
from openjudge.graders.common import RelevanceGrader, HallucinationGrader
from openjudge.graders.agent.tool.tool_selection import ToolSelectionGrader
from openjudge.runner import GradingRunner
from openjudge.runner.aggregator import WeightedSumAggregator
from openjudge.analyzer.statistical import DistributionAnalyzer

TOOL_DEFINITIONS = [
    {"name": "query_order", "description": "Query order status and logistics information", "parame
```