Absolute scale (0–100%). SWE-bench Pro scores with source-tier provenance. May 2024 – June 2026.
All benchmark scores in this visualization are assigned a source tier reflecting the reliability and provenance of the data. Tiers can be toggled on or off using the controls above the charts. Tier 4 (third-party estimates) is hidden by default due to lower reliability.
| Tier | Description | Examples | Reliability |
|---|---|---|---|
| T1 — Official | Scores from official technical reports, model cards, or peer-reviewed publications by the model's own lab. | OpenAI system cards, Anthropic model reports, Google technical blogs with methodology details | Highest |
| T2 — Independent | Scores independently reproduced by third parties using documented methodology (same benchmark version, same evaluation harness). | Scale AI SWE-bench Pro leaderboard verified runs, academic reproductions, independent benchmark suites | High |
| T3 — Blog | Scores from company blog posts or announcements that lack full methodology details or independent verification. | Launch announcements, marketing materials, conference demos with benchmark claims | Medium |
| T4 — Estimate | Scores from third-party estimates, leaks, unofficial benchmarks, or community-run evaluations without standardised methodology. Hidden by default. | Social media reports, unofficial leaderboards, crowd-sourced evaluations, rumoured scores | Low |
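The tier toggles described above can be pictured as a simple filter over tagged scores. The following is a minimal sketch, not the visualization's actual code: the names `Tier`, `Score`, `DEFAULT_VISIBLE`, and `visible_scores` are hypothetical, and the sample data is made up purely for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    OFFICIAL = 1     # T1: the lab's own technical report, model card, or paper
    INDEPENDENT = 2  # T2: third-party reproduction with documented methodology
    BLOG = 3         # T3: announcement lacking full methodology
    ESTIMATE = 4     # T4: leaks, unofficial or community-run evaluations


# Assumption: T4 starts switched off, matching "hidden by default" in the table.
DEFAULT_VISIBLE = frozenset({Tier.OFFICIAL, Tier.INDEPENDENT, Tier.BLOG})


@dataclass(frozen=True)
class Score:
    model: str      # hypothetical model name
    percent: float  # score on the absolute 0-100 scale
    tier: Tier


def visible_scores(scores, enabled=DEFAULT_VISIBLE):
    """Return only the scores whose tier is currently toggled on."""
    return [s for s in scores if s.tier in enabled]


# Made-up data, for illustration only.
sample = [
    Score("model-a", 41.0, Tier.OFFICIAL),
    Score("model-b", 37.5, Tier.INDEPENDENT),
    Score("model-c", 52.0, Tier.ESTIMATE),
]

print([s.model for s in visible_scores(sample)])
```

Passing `enabled=frozenset(Tier)` would correspond to switching every tier on, including T4.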
Why this matters: Labs frequently cherry-pick favourable benchmarks and evaluation conditions. Official scores (T1) have the clearest provenance, but they may rely on scaffolding, custom prompts, or multiple attempts that inflate results beyond what developers experience in practice. Independent reproductions (T2) use the same benchmark version and evaluation harness, so they offer the closest like-for-like comparison between models. Every chart uses an absolute 0–100% scale so that a truncated axis cannot visually exaggerate small score differences.
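To see why the axis range matters, compare how tall two bars are drawn relative to each other when the axis starts at zero versus when it is truncated. This is a small arithmetic sketch; the function `bar_ratio` and the two scores are hypothetical and do not come from the charts.

```python
def bar_ratio(high, low, axis_min=0.0):
    """Ratio of drawn bar heights for two scores on an axis starting at axis_min."""
    if low <= axis_min:
        raise ValueError("axis_min must be below both scores")
    return (high - axis_min) / (low - axis_min)


# Hypothetical scores, five percentage points apart.
score_a, score_b = 45.0, 40.0

print(bar_ratio(score_a, score_b))                 # 1.125 on a 0-100 axis
print(bar_ratio(score_a, score_b, axis_min=38.0))  # 3.5 on an axis starting at 38
```

The same five-point gap looks like a 12.5% difference in bar height on the absolute scale, but like a 3.5x difference once the axis starts at 38.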
SWE-bench Pro comprises 1,865 tasks across 41 repositories. Models marked with * use the mini-swe-agent harness (no cost cap, 250-turn limit). All others use the standard SWE-Agent scaffold. Some models were evaluated with a cost cap and a 50-turn limit (noted in tooltips). Scores combine Scale AI leaderboard results with official model-provider benchmark tables. Because scaffolds differ significantly, check the source tier and tooltip notes before comparing any two results.
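One way to respect the scaffold differences is to rank models only against others evaluated under the same conditions. The sketch below groups runs by harness and turn limit before sorting. The `Run` record, its field names, and the sample values are assumptions made for illustration, not the visualization's data model.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Run:
    model: str                # hypothetical model name
    percent: float            # score on the absolute 0-100 scale
    harness: str              # e.g. "mini-swe-agent" or "swe-agent"
    turn_cap: Optional[int]   # None when the limit is not documented


def group_comparable(runs):
    """Group runs by (harness, turn_cap) and sort each group by score, best first."""
    groups = defaultdict(list)
    for run in runs:
        groups[(run.harness, run.turn_cap)].append(run)
    return {
        key: sorted(members, key=lambda r: r.percent, reverse=True)
        for key, members in groups.items()
    }


# Made-up data, for illustration only.
runs = [
    Run("model-a", 41.0, "mini-swe-agent", 250),
    Run("model-b", 44.5, "mini-swe-agent", 250),
    Run("model-c", 23.0, "swe-agent", 50),
]

for key, members in group_comparable(runs).items():
    print(key, [r.model for r in members])
```

Here `model-c` is never ranked against the other two, because its 50-turn run is not directly comparable with a 250-turn run on a different harness.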