aisa-group/PostTrainBench

438 stars · Last commit 2026-07-09

Measuring how well CLI agents like Claude Code or Codex CLI can post-train base LLMs on a single H100 GPU in 10 hours

README preview

# PostTrainBench: Can LLM Agents Automate LLM Post-Training?

[![Website](https://img.shields.io/badge/Website-posttrainbench.com-c17d5a)](http://posttrainbench.com/)

We introduce PostTrainBench, a benchmark that measures the ability of CLI agents to post-train pre-trained large language models (LLMs). In PostTrainBench, the agent's task is to improve the performance of a base LLM on a given benchmark. The agent is given access to an evaluation script and 10 hours on an H100 GPU. Performance is measured by the benchmark score of the post-trained LLM. This setup naturally evaluates an agent's ability to conduct AI R&D.

> [!IMPORTANT]
> **Harbor support coming soon!** This repository currently targets our internal HPC cluster (HTCondor). We are adding [Harbor](https://github.com/harbor-framework/harbor) support to make it straightforward to run on rented hardware (e.g., cloud GPUs). See our [PR](https://github.com/aisa-group/PostTrainBench/pull/8).

## Leaderboard

![Main Plot](assets/main_plot_v1.png)

Scores are weighted averages across 7 benchmarks and 4 models (Qwen3-1.7B, Qwen3-4B, SmolLM3-3B, and Gemma-3-4B). Agents with multiple runs show averaged results.

| Rank | Agent | Scaffold | Avg | AIME 2025 | Arena Hard | BFCL | GPQA | GSM8K | HealthBench | HumanEval |
|---:|---|---|---:|---:|---:|---:|---:|---:|---:|---:|
| - | Official Instruct Models | - | 51.1 | 29.2 | 70.2 | 85.0 | 36.2 | 87.0 | 43.3 | 71.5 |
| 1 | Opus 4.6 | Claude Code | 23.2 | 5.0 | 7.8 | 75.9 | 25.5 | 41.0 | 18.8 | 24.7 |
| 2 | Gemini 3.1 Pro | OpenCode | 21.6 | 3.9 | 7.4 | 62.8 | 18.5 | 45.5 | 14.5 | 40.2 |

View full repository on GitHub →