ResearchClawBench: Evaluating AI Agents for Automated Research from Re-Discovery to New-Discovery
```bash
# Add to your Claude Code skills
git clone https://github.com/InternScience/ResearchClawBench
```
Quick Start | Submit Tasks | How It Works | Domains | Leaderboard | Add Your Agent
ResearchClawBench is a benchmark that measures whether AI coding agents can independently conduct scientific research, from reading raw data to producing publication-quality reports, and then rigorously evaluates the results against real human-authored papers.
Unlike benchmarks that test coding ability or factual recall, ResearchClawBench asks: given a curated scientific workspace and the same research goal, can an AI agent arrive at the same (or better) scientific conclusions?
https://github.com/user-attachments/assets/94829265-80a8-4d61-a744-3800603de6d9
Most AI benchmarks evaluate what models know. We evaluate what agents can do.
Every task in ResearchClawBench is built through a rigorous, expert-driven pipeline to ensure scientific validity and reproducibility:
```mermaid
flowchart TD
    A["High-Quality Paper Collection\n(Target Paper)"] --> B["Human Expert Extraction\n(Core Task Instructions)"]
    B --> C["Evaluation Checklist\n(Criteria + Keywords + Weights)"]
    B --> D["Data & Related Work Collection\n(Datasets + Reference Papers)"]
    C --> E["Human Reproduction & Validation\n(Verify checklist is reproducible)"]
    D --> E

    style A fill:#e0f2fe,stroke:#0284c7,stroke-width:2px
    style B fill:#fef3c7,stroke:#f59e0b,stroke-width:2px
    style C fill:#fce7f3,stroke:#ec4899,stroke-width:2px
    style D fill:#f0fdf4,stroke:#22c55e,stroke-width:2px
    style E fill:#f5f3ff,stroke:#8b5cf6,stroke-width:2px
```
1. **High-Quality Paper Collection.** Domain experts select recent, high-impact publications with clear methodology and reproducible results across 10 scientific disciplines.
2. **Expert Task Extraction.** Human experts read each paper and distill the core research task into structured instructions, identifying the key scientific question, input data, and expected outputs.
3. **Checklist Design.** Experts create a fine-grained evaluation checklist with weighted criteria (text and image items), each with specific technical keywords that a judge must verify (see the sketch after this list).
4. **Data & Related Work Collection.** The datasets and related reference materials are curated to form a research workspace for the task.
5. **Human Reproduction & Validation.** Human researchers independently reproduce the paper's results from the provided workspace and instructions, verifying that every checklist item is achievable. This ensures the benchmark is fair and the checklist is grounded in reality.
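For illustration, a single checklist item could be represented along these lines. This is a minimal sketch; the field names (`criterion`, `item_type`, `keywords`, `weight`) are assumptions made for this example, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class ChecklistItem:
    """One weighted criterion the judge verifies in the agent's report.

    Illustrative sketch only; field names are not the benchmark's real schema.
    """
    criterion: str                 # what the report must demonstrate
    item_type: str                 # "text" or "image"
    keywords: list[str] = field(default_factory=list)  # technical terms the judge looks for
    weight: float = 1.0            # contribution to the overall task score

# A hypothetical item for an imaginary task:
example = ChecklistItem(
    criterion="Reports the held-out accuracy of the proposed classifier",
    item_type="text",
    keywords=["accuracy", "test set", "baseline comparison"],
    weight=2.0,
)
```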
ResearchClawBench operates in two distinct stages:
```mermaid
flowchart LR
    subgraph Stage1["Stage 1 — Auto Research"]
        A["Raw Data\n+ Instructions"] --> B["AI Agent\n(autonomous)"]
        B --> C["Code\n+ Figures\n+ Report"]
    end
    subgraph Stage2["Stage 2 — Evaluation"]
        C --> D["LLM Judge"]
        E["Target Paper\n+ Checklist"] --> D
        D --> F["Per-Item Scores\n+ Reasoning"]
    end

    style Stage1 fill:#f0f4ff,stroke:#3b82f6,stroke-width:2px
    style Stage2 fill:#fff7ed,stroke:#f59e0b,stroke-width:2px
```
The AI agent receives a workspace containing raw datasets, reference materials, and task instructions. It must independently:
- Write and run its own code to analyze the provided data
- Generate supporting figures
- Produce a final report (`report/report.md`) with figures, methodology, results, and discussion

No hand-holding. No chain-of-thought hints. The agent works in its own sandboxed workspace with full tool access, just like a real researcher.
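As a rough sanity check of what a finished run should leave behind, the sketch below verifies the expected outputs exist. Only `report/report.md` is named above; the figure check and the example workspace path are illustrative assumptions.

```python
from pathlib import Path

def check_agent_outputs(workspace: str) -> list[str]:
    """Return a list of problems found in an agent's finished workspace."""
    ws = Path(workspace)
    problems = []

    # The benchmark expects the final write-up at report/report.md.
    report = ws / "report" / "report.md"
    if not report.is_file():
        problems.append("missing report/report.md")
    elif not report.read_text(encoding="utf-8").strip():
        problems.append("report/report.md is empty")

    # Hypothetical: expect at least one figure somewhere in the workspace.
    figures = list(ws.glob("**/*.png")) + list(ws.glob("**/*.pdf"))
    if not figures:
        problems.append("no figures found in the workspace")

    return problems

print(check_agent_outputs("runs/example_task"))  # hypothetical run directory
```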
Once the agent finishes, its report is evaluated against the original published paper using a fine-grained checklist. The judge receives the task instructions, the AI report, and the checklist criteria, then scores each item using a dual-mode rubric, as diagrammed below.
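To make the per-item scoring concrete before the diagram, here is a minimal sketch of how judging could be orchestrated. The `call_llm_judge` callable, the prompt wording, and the mode-selection heuristic are all assumptions for illustration, not the benchmark's actual implementation.

```python
def judge_report(instructions: str, report: str, checklist: list[dict],
                 call_llm_judge) -> list[dict]:
    """Score each checklist item against the agent's report.

    `call_llm_judge` is a hypothetical callable wrapping a multimodal LLM
    and returning (score, reasoning); this sketch is illustrative only.
    """
    results = []
    for item in checklist:
        # Mode A (objective) for quantitative criteria,
        # Mode B (subjective) for qualitative/mechanistic ones.
        mode = "objective" if item.get("quantitative") else "subjective"
        prompt = (
            f"Task instructions:\n{instructions}\n\n"
            f"Agent report:\n{report}\n\n"
            f"Criterion ({mode} mode): {item['criterion']}\n"
            f"Keywords to verify: {', '.join(item['keywords'])}\n"
            "Return a score in [0, 1] and a short justification."
        )
        score, reasoning = call_llm_judge(prompt)
        results.append({
            "criterion": item["criterion"],
            "weight": item["weight"],
            "score": score,
            "reasoning": reasoning,
        })
    return results
```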
```mermaid
flowchart TD
subgraph Inputs
I["INSTRUCTIONS.md\n(task background)"]
R["Agent Report\n(text + figures)"]
CL["Checklist\n(from target paper)"]
end
I & R & CL --> J["Multimodal LLM Judge"]
J --> DET{"Determine\nEvaluation Mode"}
DET -->|"Quantitative\nresults"| OBJ["Mode A: Objective\n(Metric Optimization)"]
DET -->|"Qualitative\nreasoning"| SUB["Mode B: Subjective\n(Mechanism Analysis)"]
OBJ --> SO["Score by m