PostTrainBench measures how well CLI agents like Claude Code or Codex CLI can post-train base LLMs on a single H100 GPU in 10 hours.
```bash
# Add to your Claude Code skills
git clone https://github.com/aisa-group/PostTrainBench
```

We introduce PostTrainBench, a benchmark that measures the ability of CLI agents to post-train pre-trained large language models (LLMs). In PostTrainBench, the agent's task is to improve the performance of a base LLM on a given benchmark. The agent is given access to an evaluation script and 10 hours on an H100 GPU. Performance is measured by the benchmark score of the post-trained LLM. This setup naturally evaluates an agent's ability to conduct AI R&D.
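To make the setup concrete, the sketch below shows the kind of run an agent might launch inside its 10-hour budget: a short supervised fine-tuning pass on GSM8K-style data. Everything here is illustrative rather than part of the PostTrainBench harness: the use of `trl`, the dataset, the hyperparameters, and the output paths are all hypothetical choices, and in the benchmark the agent itself decides what to train and how.

```python
# Illustrative only: one possible post-training run an agent could choose to launch.
# Assumes a recent trl and datasets install with GSM8K available/cached;
# none of these names or paths come from the PostTrainBench repo.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# GSM8K train split, formatted as plain-text question/answer pairs.
gsm8k = load_dataset("openai/gsm8k", "main", split="train")

def to_text(example):
    return {"text": f"Question: {example['question']}\nAnswer: {example['answer']}"}

train_ds = gsm8k.map(to_text)

trainer = SFTTrainer(
    model="Qwen/Qwen3-1.7B",          # one of the four base models used in the benchmark
    train_dataset=train_ds,
    args=SFTConfig(
        output_dir="outputs/qwen3-1.7b-gsm8k-sft",
        dataset_text_field="text",
        max_steps=500,                 # keep the run well inside the 10 h budget
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=1e-5,
        bf16=True,                     # H100 supports bfloat16
        logging_steps=20,
    ),
)
trainer.train()
trainer.save_model("outputs/qwen3-1.7b-gsm8k-sft")
```

Whatever the agent decides to do, its final checkpoint is scored by the provided evaluation script, and that benchmark score is the agent's result.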
Looking for Collaborators! We are seeking contributors to help expand tasks and agent scaffolds. Substantial contributions can lead to co-authorship on our paper. See Contributing for details.

Benchmark scores are computed after post-training for all rows except the "Base model" row.
All scores are averages over 4 models (Qwen3-1.7B, Qwen3-4B, SmolLM3-3B, and Gemma-3-4B).
| Method | Average Score | AIME 2025 | BFCL | GPQA (Main) | GSM8K | HumanEval |
|---------------------|---------------|-----------|------|-------------|-------|-----------|
| Human Post-Trained* | 61.8 | 29.2 | 85 | 36.2 | 87 | 71.5 |
| gpt-5.1-codex-max | 34.9 | 0.8 | 67 | 29.6 | 44.3 | 32.9 |
| claude opus 4.5 | 20.1 | 3.3 | 40.3 | 6.8 | 26.7 | 23.5 |
| gemini-3-pro | 18 | 0.8 | 16.5 | 19.1 | 30.7 | 23 |
| gpt-5.2 | 17.5 | 0 | 13.5 | 19.9 | 34.4 | 19.5 |
| claude sonnet 4.5 | 14.7 | 0.8 | 1.5 | 14.6 | 33.4 | 23 |
| Base model | 9 | 1.7 | 1.5 | 8.5 | 20.4 | 12.8 |
* "Human Post-Trained" is not directly comparable since it exceeds the 10h + 1 GPU constraint.
Different CLI agents demonstrate varying levels of persistence. Some give up well before the time limit expires.

```bash
# 1. Install requirements (apptainer, fuse-overlayfs)

# 2. Build the container
bash containers/build_container.sh standard

# 3. Download HuggingFace cache
bash containers/download_hf_cache/download_hf_cache.sh

# 4. Set API keys
export OPENAI_API_KEY="your-key"
export ANTHROPIC_API_KEY="your-key"
export GEMINI_API_KEY="your-key"

# 5. Run jobs
bash src/commit_utils/commit.sh
```
Currently, we only support the HTCondor job scheduler. Slurm support is planned.
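For reference, a GPU job request of this shape can also be expressed through HTCondor's Python bindings. The sketch below is only an illustration of asking the scheduler for a single GPU: the executable name, resource numbers, and log paths are made-up placeholders, and the actual submission path in this repo is `src/commit_utils/commit.sh`.

```python
# Hypothetical HTCondor submission via the htcondor Python bindings.
# Placeholder executable and paths; the repo's real mechanism is commit.sh.
import htcondor

job = htcondor.Submit({
    "executable": "run_agent.sh",      # hypothetical wrapper script
    "request_gpus": "1",               # a single H100
    "request_cpus": "8",
    "request_memory": "64GB",
    "output": "logs/job.out",
    "error": "logs/job.err",
    "log": "logs/job.log",
})

schedd = htcondor.Schedd()
result = schedd.submit(job)            # requires the modern (>= 8.9) bindings API
print("Submitted cluster", result.cluster())
```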
| Directory | Description |
|-----------|-------------|
| agents/ | Agent implementations |
| containers/ | Container definition, cache downloads |
| dev_utils/ | Development utility scripts |
| src/ | Main codebase |
| src/commit_utils/ | Job submission utilities (e....