EurekAgent: an autonomous research system for metric-driven tasks, built with Claude Code. Define the problem and metric. Get breakthrough results.
# Add to your Claude Code skills
git clone https://github.com/THU-Team-Eureka/EurekAgentEurekAgent is an open-source ai agents skill for AI coding assistants such as Claude Code, Codex CLI, and ChatGPT, built by THU-Team-Eureka. EurekAgent: an autonomous research system for metric-driven tasks, built with Claude Code. Define the problem and metric. Get breakthrough results. It has 55 GitHub stars.
EurekAgent's catalog security scan is still queued. You can run an instant dependency and prompt-injection check now with the "Scan for vulnerabilities" button above.
Clone the repository with "git clone https://github.com/THU-Team-Eureka/EurekAgent" and add it to your Claude Code skills directory (see the Installation section above).
EurekAgent is primarily written in Python. It is open-source under THU-Team-Eureka on GitHub, so you can review or fork the full source.
Yes. SkillsLLM lists many other AI Agents skills you can browse and compare side by side. Open the AI Agents category from the badge at the top of this page, or use the Related Skills and comparison links further down to weigh EurekAgent against similar tools.
No comments yet. Be the first to share your thoughts!
Unlocks once the catalog security scan passes (runs nightly).
The deep catalog scan for this skill is still queued. Run an instant dependency check now instead.
We present EurekAgent, an agent system for metric-driven autonomous scientific discovery. Define your problem and evaluation criteria — EurekAgent coordinates off-the-shelf CLI agents to propose diverse approaches, implement them, run experiments, and iterate. Human intervention is optional but supported at every step.
https://github.com/user-attachments/assets/c5b45b20-7eec-454e-98c3-6880bcec878b
INSTRUCTION.md, SUBMISSION_FORMAT.md, and private evaluate.py as the source of truth.Docker — follow the official guide for your platform. Then add your user to the docker group:
sudo usermod -aG docker $USER
# Check if the user is added to docker group
groups $USER
Node.js 22+ — the agent container is built on the node:22-bookworm image, so install a matching Node.js 22+ runtime on the host as well (from nodejs.org or via nvm) and confirm:
nvm install 22
node --version # must be v22 or newer
EurekAgent drives the experiment loop through Claude Code. It runs both on your host (for the /generate-inputs skill and problem authoring) and inside the agent container (preinstalled by the Docker image below).
a) Install Claude Code on the host (requires Node.js 22+ from Step 2):
npm install -g @anthropic-ai/claude-code
claude --version # sanity check
b) Authenticate and point Claude Code at your model endpoint.
EurekAgent forwards these into the agent container, so configure them once in
~/.claude/settings.json under the "env" block:
{
"env": {
"ANTHROPIC_AUTH_TOKEN": "YOUR_KEY_HERE",
"ANTHROPIC_BASE_URL": "YOUR_BASE_URL_HERE",
"ANTHROPIC_DEFAULT_SONNET_MODEL": "glm-5.1",
"API_TIMEOUT_MS": "3000000"
},
"model": "sonnet"
}
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
source ~/.bashrc
# Clone and enter the project
git clone https://github.com/THU-Team-Eureka/EurekAgent.git && cd EurekAgent
# Install uv-managed Python 3.12
uv python install 3.12.12
docker pull node:22-bookworm
bash docker/build.sh
Verify the image is available:
docker images | grep eureka-agent
If you are behind a proxy or docker pull fails, see the Docker troubleshooting guide.
During a run the agent can search the web for problem context and read live pages. These MCP servers are optional — when absent, the agent falls back to Claude Code's built-in WebSearch. web-search-prime is intended for GLM users; users of other model providers can skip it or configure their preferred search MCP.
a) web-search-prime — structured web search for GLM users only
claude mcp add -s user -t http web-search-prime https://api.z.ai/api/mcp/web_search_prime/mcp --header "Authorization: Bearer YOUR_KEY_HERE"
b) playwright — fetch and read actual webpage content.:
claude mcp add playwright npx @playwright/mcp@latest
npx playwright install chromium # pre-install the headless browser
EurekAgent ships a Playwright config at .claude/playwright-mcp.json (headless Chromium, sandbox flags, timeouts). It is mounted read-only into the agent container automatically — create or edit that file to match your network (e.g. add a proxy) if needed.
bash examples/circle_packing/run.sh
You can use the /generate-inputs skill in Claude Code to interactively generate all required files (INSTRUCTION.md, SUBMISSION_FORMAT.md, evaluate.py, run.sh) from a natural language description of your problem. Just type /generate-inputs and follow the prompts.
Each problem lives in its own directory under examples/. You need the following files:
| File | Purpose | Required? |
|---|---|---|
INSTRUCTION.md |
Problem description for the LLM agent | Yes |
SUBMISSION_FORMAT.md |
JSON schema for candidates + score semantics | Yes |
hidden_eval_dir/evaluate.py |
Private evaluator with grade_submission and is_better |
Yes |
initial.py |
Starting code for the agent | Recommended |
run.sh |
Convenience script to launch a run | Recommended |
The evaluator is the single source of truth for scoring and comparison. It must define two functions:
grade_submission(submission_path: str, context: dict) -> dictCalled by the secure grader server to score a candidate submission.
submission_path: path to the JSON file the agent submittedcontext: dict with workspace_root, approach_id, metadatascore (float): the raw objective value. Do NOT negate. Return the value as-is (e.g., the C5 value for a minimization problem, or sum of radii for a maximization problem).valid (bool): whether the submission is validmessage (str): human-readable feedbackopt_target_met (bool, optional): whether an optimization target was metpublic_metrics (dict, optional): additional metrics for displayfloat("inf") for minimization problems, float("-inf") for maximization, or float("inf") for approach-target problems.is_better(new_score: float, old_score: float) -> boolDefines which score is better. Called by the system to compare scores for ranking, best-result tracking, and display.
True if new_score represents a better result than old_scorereturn new_score < old_scorereturn new_score > old_scorereturn abs(new_score - 3.14159) < abs(old_score - 3.14159)Both functions are required. The system will fail at startup if either is missing.
Must clearly state:
run() functionMust describe:
A convenience script. Must pass at minimum:
--problem: path to INSTRUCTION.md--hidden-eval-dir: path to the directory co