🦞 ResearchClawBench: Evaluating AI Agents for Automated Research from Re-Discovery to New-Discovery
# Add to your Claude Code skills
git clone https://github.com/InternScience/ResearchClawBench
Guides for using AI agent skills like ResearchClawBench.
ResearchClawBench is a benchmark that measures whether AI coding agents can independently conduct scientific research — from reading raw data to producing publication-quality reports — and then rigorously evaluates the results against real human-authored papers.
Unlike benchmarks that test coding ability or factual recall, ResearchClawBench asks: given a curated scientific workspace and the same research goal, can an AI agent arrive at the same (or better) scientific conclusions?
https://github.com/user-attachments/assets/94829265-80a8-4d61-a744-3800603de6d9
Most AI benchmarks evaluate what models know. We evaluate what agents can do.
rcb-eval, a YAML-configured command-line evaluation workflow powered by ResearchHarness, supporting concurrent runs, repeated trials, automatic scoring, and Markdown evaluation reports with per-run and per-task statistics.
agents.json for easy customization.
Every task in ResearchClawBench is built through a rigorous, expert-driven pipeline to ensure scientific validity and reproducibility:
flowchart TD
A["📄 High-Quality Paper Collection\n(Target Paper)"] --> B["🧑‍🔬 Human Expert Extraction\n(Core Task Instructions)"]
B --> C["📋 Evaluation Rubric (Checklist)\n(Criteria + Keywords + Weights)"]
B --> D["📂 Data & Related Work Collection\n(Datasets + Reference Papers)"]
C --> E["✅ Human Reproduction & Validation\n(Verify rubric (checklist) is reproducible)"]
D --> E
style A fill:#e0f2fe,stroke:#0284c7,stroke-width:2px
style B fill:#fef3c7,stroke:#f59e0b,stroke-width:2px
style C fill:#fce7f3,stroke:#ec4899,stroke-width:2px
style D fill:#f0fdf4,stroke:#22c55e,stroke-width:2px
style E fill:#f5f3ff,stroke:#8b5cf6,stroke-width:2px
High-Quality Paper Collection — Domain experts select recent, high-impact publications with clear methodology and reproducible results across 10 scientific disciplines.
Expert Task Extraction — Human experts read each paper and distill the core research task into structured instructions, identifying the key scientific question, input data, and expected outputs.
Rubric (Checklist) Design — Experts create a fine-grained evaluation rubric (checklist) with weighted criteria (text and image items), each with specific technical keywords that a judge must verify.
Data & Related Work Collection — The datasets and related reference materials are curated to form a research workspace for the task.
Human Reproduction & Validation — Human researchers independently reproduce the paper's results from the provided workspace and instructions, verifying that every rubric (checklist) item is achievable. This ensures the benchmark is fair and the rubric (checklist) is grounded in r