by aisa-group
PostTrainBench measures how well CLI agents like Claude Code or Codex CLI can post-train base LLMs on a single H100 GPU in 10 hours
# Add to your Claude Code skills

```shell
git clone https://github.com/aisa-group/PostTrainBench
```

We introduce PostTrainBench, a benchmark that measures the ability of CLI agents to post-train pre-trained large language models (LLMs). In PostTrainBench, the agent's task is to improve the performance of a base LLM on a given benchmark. The agent is given access to an evaluation script and 10 hours on an H100 GPU. Performance is measured by the benchmark score of the post-trained LLM. This setup naturally evaluates an agent's ability to conduct AI R&D.
**Looking for collaborators!** We are seeking contributors to help expand tasks and agent scaffolds. Substantial contributions can lead to co-authorship on our paper. See Contributing for details.

Scores are weighted averages across 7 benchmarks and 4 models (Qwen3-1.7B, Qwen3-4B, SmolLM3-3B, and Gemma-3-4B). Agents with multiple runs show averaged results.
| Rank | Agent | Scaffold | Avg | AIME 2025 | Arena Hard | BFCL | GPQA | GSM8K | HealthBench | HumanEval |
|---:|---|---|---:|---:|---:|---:|---:|---:|---:|---:|
| - | Instruction Tuned | - | 51.1 | 29.2 | 70.2 | 85.0 | 36.2 | 87.0 | 43.3 | 71.5 |
| 1 | Opus 4.6 | Claude Code | 23.2 | 5.0 | 7.8 | 75.9 | 25.5 | 41.0 | 18.8 | 24.7 |
| 2 | Gemini 3.1 Pro | OpenCode | 22.3 | 3.3 | 5.5 | 82.2 | 17.4 | 33.9 | 17.7 | 42.8 |
| 3 | GPT-5.2 | Codex CLI | 21.5 | 0.8 | 6.4 | 52.5 | 23.7 | 55.9 | 15.8 | 31.4 |
| 4 | GPT 5.1 Codex Max | Codex CLI | 20.2 | 0.3 | 4.0 | 30.8 | 24.0 | 51.6 | 20.3 | 32.7 |
| 5 | Gemini 3 Pro | Gemini CLI | 18.3 | 1.7 | 5.8 | 35.3 | 21.5 | 42.6 | 17.7 | 25.3 |
| 6 | Opus 4.5 | Claude Code | 17.1 | 2.8 | 3.7 | 61.6 | 19.0 | 28.5 | 8.9 | 28.1 |
| 7 | GPT 5.2 Codex | Codex CLI | 16.8 | 0.3 | 2.5 | 40.3 | 24.1 | 37.6 | 11.5 | 23.7 |
| 8 | Sonnet 4.6 | Claude Code | 16.4 | 3.3 | 10.2 | 23.8 | 13.8 | 25.7 | 16.2 | 42.4 |
| 9 | GLM 5 | OpenCode | 13.9 | 0.8 | 4.2 | 21.5 | 15.2 | 40.3 | 14.6 | 17.4 |
| 10 | GPT 5.3 Codex | Codex CLI | 13.8 | 0.3 | 1.0 | 14.8 | 22.8 | 31.7 | 10.2 | 24.0 |
| 11 | Sonnet 4.5 | Claude Code | 9.9 | 0.8 | 1.0 | 1.8 | 14.6 | 30.9 | 5.0 | 23.0 |
| - | Base Model | Zero Shot | 7.5 | 1.7 | 1.3 | 1.5 | 8.5 | 20.4 | 9.5 | 12.8 |
"Instruction Tuned" is not directly comparable since it exceeds the 10h + 1 GPU constraint. See the full interactive leaderboard at posttrainbench.com, which includes OpenCode variants and additional agents.
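To make the aggregation concrete, here is a minimal sketch of how per-agent scores could be averaged across benchmarks and model runs. It assumes uniform weights, which may differ from the official methodology; `leaderboard_avg` is a hypothetical helper, not part of the PostTrainBench codebase.

```python
# Hypothetical sketch: average benchmark scores over multiple
# (model, run) combinations, then over benchmarks.
# Assumption: uniform weights; the official leaderboard weighting may differ.
def leaderboard_avg(runs):
    """runs: list of dicts mapping benchmark name -> score,
    one dict per (model, run) combination."""
    benchmarks = runs[0].keys()
    # Mean score per benchmark across all runs.
    per_benchmark = {
        b: sum(r[b] for r in runs) / len(runs) for b in benchmarks
    }
    # Overall average across benchmarks.
    overall = sum(per_benchmark.values()) / len(per_benchmark)
    return per_benchmark, overall

# Toy example with two runs and two benchmarks:
per_benchmark, overall = leaderboard_avg([
    {"GSM8K": 40.0, "HumanEval": 20.0},
    {"GSM8K": 60.0, "HumanEval": 30.0},
])
print(per_benchmark)  # {'GSM8K': 50.0, 'HumanEval': 25.0}
print(overall)        # 37.5
```

Averaging per benchmark first keeps a benchmark with many runs from dominating the overall score.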
Agents are run through one of 4 CLI scaffolds: Claude Code, Codex CLI, Gemini CLI, and OpenCode.
PostTrainBench includes 7 benchmarks spanning reasoning, tool use, knowledge, math, health, and code: AIME 2025, Arena Hard, BFCL, GPQA, GSM8K, HealthBench, and HumanEval.