SWE-bench Pro — Model Performance Timeline

Absolute scale (0–100%). SWE-bench Pro scores with source-tier provenance. May 2024 – June 2026.


Source Tiering Methodology

All benchmark scores in this visualization are assigned a source tier reflecting the reliability and provenance of the data. Tiers can be toggled on or off using the controls above the charts. Tier 4 (third-party estimates) is hidden by default due to lower reliability.

Tier | Description | Examples | Reliability
T1 — Official | Scores from official technical reports, model cards, or peer-reviewed publications by the model's own lab. | OpenAI system cards, Anthropic model reports, Google technical blogs with methodology details | Highest
T2 — Independent | Scores independently reproduced by third parties using documented methodology (same benchmark version, same evaluation harness). | Scale AI SWE-bench Pro leaderboard verified runs, academic reproductions, independent benchmark suites | High
T3 — Blog | Scores from company blog posts or announcements that lack full methodology details or independent verification. | Launch announcements, marketing materials, conference demos with benchmark claims | Medium
T4 — Estimate | Scores from third-party estimates, leaks, unofficial benchmarks, or community-run evaluations without standardised methodology. Hidden by default. | Social media reports, unofficial leaderboards, crowd-sourced evaluations, rumoured scores | Low
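
The tier toggle described above can be sketched as a simple filter. This is a minimal illustration and not the page's actual implementation: the names `SourceTier`, `ScoreEntry`, `DEFAULT_VISIBLE`, and `visible_entries` are hypothetical, and the only behaviour taken from the text is that T4 starts hidden.

```python
from dataclasses import dataclass
from enum import IntEnum


class SourceTier(IntEnum):
    """The four source tiers from the table (names are illustrative)."""
    OFFICIAL = 1     # T1
    INDEPENDENT = 2  # T2
    BLOG = 3         # T3
    ESTIMATE = 4     # T4


# Assumption drawn from the text: every tier except T4 is visible by default.
DEFAULT_VISIBLE = frozenset(
    {SourceTier.OFFICIAL, SourceTier.INDEPENDENT, SourceTier.BLOG}
)


@dataclass(frozen=True)
class ScoreEntry:
    """One plotted point; field names are hypothetical."""
    model: str
    score_pct: float  # absolute score on the 0-100% scale
    tier: SourceTier


def visible_entries(entries, visible_tiers=DEFAULT_VISIBLE):
    """Return only the entries whose tier is currently toggled on."""
    return [e for e in entries if e.tier in visible_tiers]
```

Passing `visible_tiers=frozenset(SourceTier)` would correspond to switching every tier on, including T4.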

Why this matters: Labs frequently cherry-pick favourable benchmarks and evaluation conditions. Official scores (T1) may rely on custom scaffolding, custom prompts, or multiple attempts, any of which can inflate results beyond what developers experience in practice. Independent reproductions (T2) follow a documented, shared methodology and so provide the fairest like-for-like comparison. We use absolute 0–100% scales throughout so that axis scaling cannot visually exaggerate score differences.
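
One way to see why an absolute scale matters is to compare how large a gap looks when the axis starts at zero versus at some higher baseline. The sketch below is illustrative only: the function `drawn_height_ratio` is hypothetical, the scores are placeholders rather than real results, and the assumption that "visual manipulation" means a truncated axis is ours.

```python
def drawn_height_ratio(score_low, score_high, axis_min=0.0):
    """Ratio of drawn heights above the axis baseline for two scores.

    With axis_min=0 the ratio equals the true ratio of the scores.
    Raising axis_min (a truncated axis) makes the same gap look larger.
    """
    if score_low <= axis_min or score_high <= axis_min:
        raise ValueError("both scores must lie above the axis baseline")
    return (score_high - axis_min) / (score_low - axis_min)


# Placeholder scores, not taken from any leaderboard.
absolute = drawn_height_ratio(40.0, 45.0)                   # axis starts at 0%
truncated = drawn_height_ratio(40.0, 45.0, axis_min=38.0)   # axis starts at 38%
```

On the absolute axis the higher point is drawn 1.125 times as tall as the lower one; on the truncated axis the same five-point gap is drawn 3.5 times as tall.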

About SWE-bench Pro

Methodology Notes

SWE-bench Pro uses 1,865 tasks across 41 repositories. Models marked with * use the mini-swe-agent harness (uncapped, 250 turns); all others use the standard SWE-agent scaffold. Some models were evaluated with a cost cap and a 50-turn cap (noted in tooltips). Scores combine Scale AI leaderboard results with official model-provider benchmark tables. Because scaffolds differ significantly, check the source tier and tooltip notes before comparing results.
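
Since scores produced under different scaffolds are not directly comparable, one hedged approach is to group results by evaluation setup before ranking them. The sketch below assumes a hypothetical record layout (`EvalConfig`, `Result`); the harness names and turn caps come from the note above, while reading "uncapped" as "no cost cap" is an assumption, and the scores in the usage example are placeholders.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class EvalConfig(NamedTuple):
    """Evaluation setup; field names are hypothetical."""
    harness: str              # e.g. "mini-swe-agent" or "SWE-agent"
    turn_cap: Optional[int]   # None when the cap is not stated
    cost_capped: bool         # assumption: "uncapped" refers to cost


class Result(NamedTuple):
    model: str
    score_pct: float
    config: EvalConfig


def group_by_config(results):
    """Bucket results so that each bucket shares one evaluation setup."""
    groups = defaultdict(list)
    for result in results:
        groups[result.config].append(result)
    return dict(groups)


def comparable(a, b):
    """Two results are like-for-like only if their setups match exactly."""
    return a.config == b.config
```

A usage example with placeholder models and scores:

```python
mini = EvalConfig("mini-swe-agent", turn_cap=250, cost_capped=False)
capped = EvalConfig("SWE-agent", turn_cap=50, cost_capped=True)

results = [
    Result("placeholder-a", 40.0, mini),
    Result("placeholder-b", 35.0, capped),
    Result("placeholder-c", 42.0, mini),
]
groups = group_by_config(results)  # two buckets: one per setup
```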