PostTrainBench measures how well CLI agents like Claude Code or Codex CLI can post-train base LLMs on a single H100 GPU in 10 hours.
```bash
# Add to your Claude Code skills
git clone https://github.com/aisa-group/PostTrainBench
```

We introduce PostTrainBench, a benchmark that measures the ability of CLI agents to post-train pre-trained large language models (LLMs). In PostTrainBench, the agent's task is to improve the performance of a base LLM on a given benchmark. The agent is given access to an evaluation script and 10 hours on an H100 GPU. Performance is measured by the benchmark score of the post-trained LLM. This setup naturally evaluates an agent's ability to conduct AI R&D.
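To make the setup concrete, the sketch below shows the kind of run an agent might launch inside its 10-hour budget: a short supervised fine-tuning pass on GSM8K-style data. Everything here is illustrative rather than part of the PostTrainBench harness: the use of `trl`, the dataset, the hyperparameters, and the output paths are all hypothetical choices, and in the benchmark the agent itself decides what to train and how.

```python
# Illustrative only: one possible post-training run an agent could choose to launch.
# Assumes a recent trl and datasets install with GSM8K available/cached;
# none of these names or paths come from the PostTrainBench repo.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# GSM8K train split, formatted as plain-text question/answer pairs.
gsm8k = load_dataset("openai/gsm8k", "main", split="train")

def to_text(example):
    return {"text": f"Question: {example['question']}\nAnswer: {example['answer']}"}

train_ds = gsm8k.map(to_text)

trainer = SFTTrainer(
    model="Qwen/Qwen3-1.7B",          # one of the four base models used in the benchmark
    train_dataset=train_ds,
    args=SFTConfig(
        output_dir="outputs/qwen3-1.7b-gsm8k-sft",
        dataset_text_field="text",
        max_steps=500,                 # keep the run well inside the 10 h budget
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=1e-5,
        bf16=True,                     # H100 supports bfloat16
        logging_steps=20,
    ),
)
trainer.train()
trainer.save_model("outputs/qwen3-1.7b-gsm8k-sft")
```

Whatever the agent decides to do, its final checkpoint is scored by the provided evaluation script, and that benchmark score is the agent's result.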
Looking for Collaborators! We are seeking contributors to help expand tasks and agent scaffolds. Substantial contributions can lead to co-authorship on our paper. See Contributing for details.

Benchmark scores are computed after post-training for all rows except the "Base model" row.
All scores are averages over 4 models (Qwen3-1.7B, Qwen3-4B, SmolLM3-3B, and Gemma-3-4B).
| Method | Average Score | AIME 2025 | BFCL | GPQA (Main) | GSM8K | HumanEval |
|---------------------|---------------|-----------|------|-------------|-------|-----------|
| Human Post-Trained* | 61.8 | 29.2 | 85 | 36.2 | 87 | 71.5 |
| gpt-5.1-codex-max | 34.9 | 0.8 | 67 | 29.6 | 44.3 | 32.9 |
| claude opus 4.5 | 20.1 | 3.3 | 40.3 | 6.8 | 26.7 | 23.5 |
| gemini-3-pro | 18 | 0.8 | 16.5 | 19.1 | 30.7 | 23 |
| gpt-5.2 | 17.5 | 0 | 13.5 | 19.9 | 34.4 | 19.5 |
| claude sonnet 4.5 | 14.7 | 0.8 | 1.5 | 14.6 | 33.4 | 23 |
| Base model | 9 | 1.7 | 1.5 | 8.5 | 20.4 | 12.8 |
* "Human Post-Trained" is not directly comparable since it exceeds the 10h + 1 GPU constraint.
Different CLI agents demonstrate varying levels of persistence. Some give up well before the time limit expires.

```bash
# 1. Install requirements (apptainer, fuse-overlayfs)

# 2. Build the container
bash containers/build_container.sh standard

# 3. Download HuggingFace cache
bash containers/download_hf_cache/download_hf_cache.sh

# 4. Set API keys
export OPENAI_API_KEY="your-key"
export ANTHROPIC_API_KEY="your-key"
export GEMINI_API_KEY="your-key"

# 5. Run jobs
bash src/commit_utils/commit.sh
```
Currently, we only support the HTCondor job scheduler. Slurm support is planned.
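For reference, a GPU job request of this shape can also be expressed through HTCondor's Python bindings. The sketch below is only an illustration of asking the scheduler for a single GPU: the executable name, resource numbers, and log paths are made-up placeholders, and the actual submission path in this repo is `src/commit_utils/commit.sh`.

```python
# Hypothetical HTCondor submission via the htcondor Python bindings.
# Placeholder executable and paths; the repo's real mechanism is commit.sh.
import htcondor

job = htcondor.Submit({
    "executable": "run_agent.sh",      # hypothetical wrapper script
    "request_gpus": "1",               # a single H100
    "request_cpus": "8",
    "request_memory": "64GB",
    "output": "logs/job.out",
    "error": "logs/job.err",
    "log": "logs/job.log",
})

schedd = htcondor.Schedd()
result = schedd.submit(job)            # requires the modern (>= 8.9) bindings API
print("Submitted cluster", result.cluster())
```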
| Directory | Description |
|-----------|-------------|
| agents/ | Agent implementations |
| containers/ | Container definition, cache downloads |
| dev_utils/ | Development utility scripts |
| src/ | Main codebase |
| src/commit_utils/ | Job submission utilities (e....