ResearchClawBench: Evaluating AI Agents for Automated Research from Re-Discovery to New-Discovery
```bash
# Add to your Claude Code skills
git clone https://github.com/InternScience/ResearchClawBench
```
Quick Start | Submit Tasks | How It Works | Domains | Leaderboard | Add Your Agent
ResearchClawBench is a benchmark that measures whether AI coding agents can independently conduct scientific research, from reading raw data to producing publication-quality reports, and then rigorously evaluates the results against real human-authored papers.
Unlike benchmarks that test coding ability or factual recall, ResearchClawBench asks: given a curated scientific workspace and the same research goal, can an AI agent arrive at the same (or better) scientific conclusions?
https://github.com/user-attachments/assets/94829265-80a8-4d61-a744-3800603de6d9
Most AI benchmarks evaluate what models know. We evaluate what agents can do.
Every task in ResearchClawBench is built through a rigorous, expert-driven pipeline to ensure scientific validity and reproducibility:
```mermaid
flowchart TD
    A["High-Quality Paper Collection\n(Target Paper)"] --> B["Human Expert Extraction\n(Core Task Instructions)"]
    B --> C["Evaluation Checklist\n(Criteria + Keywords + Weights)"]
    B --> D["Data & Related Work Collection\n(Datasets + Reference Papers)"]
    C --> E["Human Reproduction & Validation\n(Verify checklist is reproducible)"]
    D --> E

    style A fill:#e0f2fe,stroke:#0284c7,stroke-width:2px
    style B fill:#fef3c7,stroke:#f59e0b,stroke-width:2px
    style C fill:#fce7f3,stroke:#ec4899,stroke-width:2px
    style D fill:#f0fdf4,stroke:#22c55e,stroke-width:2px
    style E fill:#f5f3ff,stroke:#8b5cf6,stroke-width:2px
```
1. **High-Quality Paper Collection.** Domain experts select recent, high-impact publications with clear methodology and reproducible results across 10 scientific disciplines.
2. **Expert Task Extraction.** Human experts read each paper and distill the core research task into structured instructions, identifying the key scientific question, input data, and expected outputs.
3. **Checklist Design.** Experts create a fine-grained evaluation checklist with weighted criteria (text and image items), each with specific technical keywords that a judge must verify (see the sketch after this list).
4. **Data & Related Work Collection.** The datasets and related reference materials are curated to form a research workspace for the task.
5. **Human Reproduction & Validation.** Human researchers independently reproduce the paper's results from the provided workspace and instructions, verifying that every checklist item is achievable. This ensures the benchmark is fair and the checklist is grounded in reality.
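For illustration, a single checklist item could be represented along these lines. This is a minimal sketch; the field names (`criterion`, `item_type`, `keywords`, `weight`) are assumptions made for this example, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class ChecklistItem:
    """One weighted criterion the judge verifies in the agent's report.

    Illustrative sketch only; field names are not the benchmark's real schema.
    """
    criterion: str                 # what the report must demonstrate
    item_type: str                 # "text" or "image"
    keywords: list[str] = field(default_factory=list)  # technical terms the judge looks for
    weight: float = 1.0            # contribution to the overall task score

# A hypothetical item for an imaginary task:
example = ChecklistItem(
    criterion="Reports the held-out accuracy of the proposed classifier",
    item_type="text",
    keywords=["accuracy", "test set", "baseline comparison"],
    weight=2.0,
)
```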
ResearchClawBench operates in two distinct stages:
```mermaid
flowchart LR
    subgraph Stage1["Stage 1 — Auto Research"]
        A["Raw Data\n+ Instructions"] --> B["AI Agent\n(autonomous)"]
        B --> C["Code\n+ Figures\n+ Report"]
    end
    subgraph Stage2["Stage 2 — Evaluation"]
        C --> D["LLM Judge"]
        E["Target Paper\n+ Checklist"] --> D
        D --> F["Per-Item Scores\n+ Reasoning"]
    end

    style Stage1 fill:#f0f4ff,stroke:#3b82f6,stroke-width:2px
    style Stage2 fill:#fff7ed,stroke:#f59e0b,stroke-width:2px
```
The AI agent receives a workspace containing raw datasets, reference materials, and task instructions. It must independently:
- Write and run its own code to analyze the provided data
- Generate supporting figures
- Produce a final report (`report/report.md`) with figures, methodology, results, and discussion

No hand-holding. No chain-of-thought hints. The agent works in its own sandboxed workspace with full tool access, just like a real researcher.
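As a rough sanity check of what a finished run should leave behind, the sketch below verifies the expected outputs exist. Only `report/report.md` is named above; the figure check and the example workspace path are illustrative assumptions.

```python
from pathlib import Path

def check_agent_outputs(workspace: str) -> list[str]:
    """Return a list of problems found in an agent's finished workspace."""
    ws = Path(workspace)
    problems = []

    # The benchmark expects the final write-up at report/report.md.
    report = ws / "report" / "report.md"
    if not report.is_file():
        problems.append("missing report/report.md")
    elif not report.read_text(encoding="utf-8").strip():
        problems.append("report/report.md is empty")

    # Hypothetical: expect at least one figure somewhere in the workspace.
    figures = list(ws.glob("**/*.png")) + list(ws.glob("**/*.pdf"))
    if not figures:
        problems.append("no figures found in the workspace")

    return problems

print(check_agent_outputs("runs/example_task"))  # hypothetical run directory
```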
Once the agent finishes, its report is evaluated against the original published paper using a fine-grained checklist. The judge receives the task instructions, the AI report, and the checklist criteria, then scores each item using a dual-mode rubric, as diagrammed below.
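To make the per-item scoring concrete before the diagram, here is a minimal sketch of how judging could be orchestrated. The `call_llm_judge` callable, the prompt wording, and the mode-selection heuristic are all assumptions for illustration, not the benchmark's actual implementation.

```python
def judge_report(instructions: str, report: str, checklist: list[dict],
                 call_llm_judge) -> list[dict]:
    """Score each checklist item against the agent's report.

    `call_llm_judge` is a hypothetical callable wrapping a multimodal LLM
    and returning (score, reasoning); this sketch is illustrative only.
    """
    results = []
    for item in checklist:
        # Mode A (objective) for quantitative criteria,
        # Mode B (subjective) for qualitative/mechanistic ones.
        mode = "objective" if item.get("quantitative") else "subjective"
        prompt = (
            f"Task instructions:\n{instructions}\n\n"
            f"Agent report:\n{report}\n\n"
            f"Criterion ({mode} mode): {item['criterion']}\n"
            f"Keywords to verify: {', '.join(item['keywords'])}\n"
            "Return a score in [0, 1] and a short justification."
        )
        score, reasoning = call_llm_judge(prompt)
        results.append({
            "criterion": item["criterion"],
            "weight": item["weight"],
            "score": score,
            "reasoning": reasoning,
        })
    return results
```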
```mermaid
flowchart TD
subgraph Inputs
I["INSTRUCTIONS.md\n(task background)"]
R["Agent Report\n(text + figures)"]
CL["Checklist\n(from target paper)"]
end
I & R & CL --> J["Multimodal LLM Judge"]
J --> DET{"Determine\nEvaluation Mode"}
DET -->|"Quantitative\nresults"| OBJ["Mode A: Objective\n(Metric Optimization)"]
DET -->|"Qualitative\nreasoning"| SUB["Mode B: Subjective\n(Mechanism Analysis)"]
OBJ --> SO["Score by m