SWE-bench Pro — Model Performance Timeline

Source Tiering Methodology

All benchmark scores in this visualization are assigned a source tier reflecting the reliability and provenance of the data. Tiers can be toggled on or off using the controls above the charts. Tier 4 (third-party estimates) is hidden by default due to lower reliability.

Tier	Description	Examples	Reliability
T1 — Official	Scores from official technical reports, model cards, or peer-reviewed publications by the model's own lab.	OpenAI system cards, Anthropic model reports, Google technical blogs with methodology details	Highest
T2 — Independent	Scores independently reproduced by third parties using documented methodology (same benchmark version, same evaluation harness).	Scale AI SWE-bench Pro leaderboard verified runs, academic reproductions, independent benchmark suites	High
T3 — Blog	Scores from company blog posts or announcements that lack full methodology details or independent verification.	Launch announcements, marketing materials, conference demos with benchmark claims	Medium
T4 — Estimate	Scores from third-party estimates, leaks, unofficial benchmarks, or community-run evaluations without standardised methodology. Hidden by default.	Social media reports, unofficial leaderboards, crowd-sourced evaluations, rumoured scores	Low
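
The tier toggle described above can be sketched in a few lines of Python. This is a minimal illustration, not the visualization's actual code: the names Tier, Score, DEFAULT_ENABLED and visible_scores are hypothetical, and the sample records are made up.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """Source tiers from the table above; a lower number means higher reliability."""
    OFFICIAL = 1      # T1
    INDEPENDENT = 2   # T2
    BLOG = 3          # T3
    ESTIMATE = 4      # T4


# T4 is hidden by default, so the default set holds T1-T3 only.
DEFAULT_ENABLED = frozenset({Tier.OFFICIAL, Tier.INDEPENDENT, Tier.BLOG})


@dataclass(frozen=True)
class Score:
    model: str      # placeholder label, not a real model
    percent: float  # score on the 0-100 scale
    tier: Tier


def visible_scores(scores, enabled=DEFAULT_ENABLED):
    """Return only the scores whose tier is currently toggled on."""
    return [s for s in scores if s.tier in enabled]


# Made-up records for illustration only.
sample = [
    Score("model-a", 40.0, Tier.OFFICIAL),
    Score("model-a", 37.5, Tier.INDEPENDENT),
    Score("model-b", 45.0, Tier.ESTIMATE),
]

default_view = visible_scores(sample)                          # T4 filtered out
all_tiers_view = visible_scores(sample, enabled=frozenset(Tier))  # every tier on
```

With the default set, the T4 record is dropped. Passing every tier brings it back, which mirrors switching the Tier 4 control on.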

Why this matters: Labs often report the benchmarks and evaluation conditions that favour their own models. Official scores (T1) may rely on custom scaffolding, tuned prompts, or multiple attempts, which can push results above what developers see in practice. Independent reproductions (T2) use the same benchmark version and evaluation harness, so they give the most like-for-like comparison between models. All charts use an absolute 0–100% scale so that a truncated axis cannot exaggerate small score differences.
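
To see why the absolute scale matters, the short sketch below computes how much of the axis height a score gap occupies. The function name visual_fraction, the two scores, and the cropped 40–44 axis are all hypothetical values chosen for illustration.

```python
def visual_fraction(score_a, score_b, axis_min=0.0, axis_max=100.0):
    """Fraction of the axis height occupied by the gap between two scores."""
    if axis_max <= axis_min:
        raise ValueError("axis_max must be greater than axis_min")
    return abs(score_a - score_b) / (axis_max - axis_min)


# Two made-up scores, two percentage points apart.
full_axis = visual_fraction(41.0, 43.0)              # absolute 0-100 axis
truncated = visual_fraction(41.0, 43.0, 40.0, 44.0)  # hypothetical cropped axis
```

On the full axis the two-point gap covers 2% of the chart height. On the cropped axis the same gap covers half of it, which makes a small difference look large.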

References

Primary Sources

About SWE-bench Pro

Scale AI Blog — SWE-bench Pro: A New Standard for Evaluating AI Programming

Methodology Notes

SWE-bench Pro uses 1,865 tasks across 41 repositories. Models marked with * were run with the mini-swe-agent harness (uncapped cost, up to 250 turns); all others use the standard SWE-Agent scaffold. Some models were evaluated with a cost cap and a 50-turn limit, as noted in the tooltips. Scores combine Scale AI leaderboard results with official benchmark tables from model providers. Because scaffolds and limits differ significantly, check the source tier and tooltip notes before comparing results.
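
One way to respect these differences is to compare scores only within the same evaluation setup. The sketch below groups runs by harness, turn limit, and cost cap. It assumes a hypothetical Run record and made-up values; it does not reflect the actual data schema behind the charts.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    model: str         # placeholder label, not a real model
    percent: float     # score on the 0-100 scale
    harness: str       # e.g. "mini-swe-agent" or "SWE-Agent"
    turn_cap: int      # maximum number of turns allowed
    cost_capped: bool  # True if a cost limit was applied


def comparable_groups(runs):
    """Group runs that share a harness, turn limit, and cost-cap setting."""
    groups = defaultdict(list)
    for run in runs:
        groups[(run.harness, run.turn_cap, run.cost_capped)].append(run)
    return dict(groups)


# Made-up runs for illustration only.
runs = [
    Run("model-a", 40.0, "mini-swe-agent", 250, False),
    Run("model-b", 30.0, "SWE-Agent", 50, True),
    Run("model-c", 28.0, "SWE-Agent", 50, True),
]

groups = comparable_groups(runs)
```

Here model-b and model-c land in the same group and can be compared directly, while model-a sits alone because it ran under a different harness and limits.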