SWE-bench Pro Leaderboard (2026): Every Model Score, Claude Fable 5 Leads at 80.3%

Full SWE-bench Pro leaderboard, June 9, 2026. Scale SEAL standardized: GPT-5.4 59.1%, Opus 4.6 51.9%. Vendor-reported: Claude Fable 5 80.3%, Opus 4.8 69.2%. Pro vs Verified deltas, score per dollar, contamination analysis.

June 9, 2026

Three numbers all claim to be the best SWE-bench Pro score: 80.3% (Claude Fable 5, Anthropic's own scaffold), 59.1% (GPT-5.4 xHigh, Scale's standardized SEAL leaderboard), and 47.1% (Opus 4.6, Scale's private commercial set). All three are real. The spread comes from scaffolding and data splits, and most pages quoting a score never say which one they mean.

This page keeps all three views side by side: Scale's standardized public and commercial leaderboards, vendor-reported scores, the Pro-vs-Verified delta per model, and score per dollar of output-token price.

Leaderboard data verified June 9, 2026

SWE-bench Pro: SEAL Leaderboard Top 10 (Public Set)

Scale AI standardized scaffolding, Pass@1, 731 public tasks

  1. GPT-5.4 (xHigh): 59.1%
  2. Muse Spark: 55.0%
  3. Opus 4.6 (thinking): 51.9%
  4. Gemini 3.1 Pro: 46.1%
  5. Opus 4.5: 45.9%
  6. Sonnet 4.5: 43.6%
  7. Gemini 3 Pro: 43.3%
  8. Sonnet 4: 42.7%
  9. GPT-5 (High): 41.8%
  10. GPT-5.2 Codex: 41.0%

Source: Scale AI SEAL Leaderboard, June 9, 2026. Standardized scaffolding; some entries run the mini-swe-agent harness.

1,865 tasks across 41 repositories
Scoring metric: Pass@1
4 languages (Py, Go, TS, JS)
107.4 avg lines changed per task

SWE-bench Pro Leaderboard: Scale SEAL Public Set

Scale AI runs every model through identical scaffolding, which isolates model capability from harness quality. These are the only directly comparable SWE-bench Pro numbers. Scores below are from the public set (731 tasks), Pass@1, as of June 9, 2026.

GPT-5.4 (xHigh) leads at 59.1%, 4.1 points ahead of the new Muse Spark entry and 7.2 ahead of the best Claude run (Opus 4.6 thinking, 51.9%). Confidence intervals are roughly ±3.6 points, so the intervals of every pair of adjacent ranks overlap, the top 3 included.

Rank | Model | Score | 95% CI | Release
1 | GPT-5.4 (xHigh) | 59.1% | ±3.56 | 2026
2 | Muse Spark (new) | 55.0% | ±3.60 | 2026
3 | Claude Opus 4.6 (thinking) | 51.9% | ±3.61 | Feb 2026
4 | Gemini 3.1 Pro (thinking) | 46.1% | ±3.60 | Feb 2026
5 | Claude Opus 4.5 | 45.9% | ±3.60 | Nov 2025
6 | Claude Sonnet 4.5 | 43.6% | ±3.60 | Sep 2025
7 | Gemini 3 Pro (preview) | 43.3% | ±3.60 | 2025
8 | Claude Sonnet 4 | 42.7% | ±3.59 | May 2025
9 | GPT-5 (High) | 41.8% | ±3.49 | Aug 2025
10 | GPT-5.2 Codex | 41.0% | ±3.57 | Jan 2026
11 | Claude Haiku 4.5 | 39.5% | ±3.55 | Oct 2025
12 | Qwen3 Coder 480B (open) | 38.7% | ±3.55 | 2025

Source: Scale AI SEAL Leaderboard, June 9, 2026. Standardized scaffolding; entries marked with an asterisk on Scale's page run the mini-swe-agent harness. Claude Fable 5 (GA June 9, 2026) and Opus 4.8 (May 28, 2026) have no SEAL entries yet.
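Whether two ranks are separable can be checked directly from the table. The sketch below is a minimal illustration, not Scale's methodology: it treats each entry as score ± half-width and tests whether two intervals intersect. The function names are invented for this example.

```python
# (model, score %, 95% CI half-width), copied from the SEAL public-set table
SEAL_PUBLIC = [
    ("GPT-5.4 (xHigh)", 59.1, 3.56),
    ("Muse Spark", 55.0, 3.60),
    ("Claude Opus 4.6 (thinking)", 51.9, 3.61),
    ("Gemini 3.1 Pro (thinking)", 46.1, 3.60),
    ("Claude Opus 4.5", 45.9, 3.60),
    ("Claude Sonnet 4.5", 43.6, 3.60),
    ("Gemini 3 Pro (preview)", 43.3, 3.60),
    ("Claude Sonnet 4", 42.7, 3.59),
    ("GPT-5 (High)", 41.8, 3.49),
    ("GPT-5.2 Codex", 41.0, 3.57),
    ("Claude Haiku 4.5", 39.5, 3.55),
    ("Qwen3 Coder 480B", 38.7, 3.55),
]


def intervals_overlap(a, b):
    """True when the two score ± half-width intervals share at least one point."""
    _, score_a, half_a = a
    _, score_b, half_b = b
    return abs(score_a - score_b) <= half_a + half_b


def adjacent_overlaps(rows):
    """Compare each rank with the rank directly below it."""
    return [
        (upper[0], lower[0], intervals_overlap(upper, lower))
        for upper, lower in zip(rows, rows[1:])
    ]


if __name__ == "__main__":
    for upper, lower, overlap in adjacent_overlaps(SEAL_PUBLIC):
        print(f"{upper} vs {lower}: {'overlap' if overlap else 'separated'}")
```

Interval overlap is a rough screen rather than a significance test, but it is enough to flag which rank differences are too small to act on.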

SWE-bench Pro Commercial Set: Scores on Code No Model Has Seen

The commercial set is 276 tasks from 18 proprietary startup codebases that are not on the public internet. It is the strongest contamination control available, and scores drop hard: every model loses ground versus its public-set number, and the ranking reshuffles.

Rank | Model | Score | 95% CI | Public-Set Score
1 | Claude Opus 4.6 (thinking) | 47.1% | ±6.07 | 51.9%
2 | Muse Spark | 44.7% | ±6.05 | 55.0%
3 | GPT-5.4 (xHigh) | 43.4% | ±6.03 | 59.1%
4 | Gemini 3.1 Pro (thinking) | 32.2% | ±5.69 | 46.1%
5 | GPT-5.2 Codex | 27.7% | ±5.09 | 41.0%
6 | GPT-5.2 | 23.8% | ±5.09 | n/a
7 | Claude Opus 4.5 | 23.4% | ±5.07 | 45.9%
8 | Gemini 3 Pro | 18.0% | ±4.78 | 43.3%
9 | Claude Opus 4.1 | 17.8% | ±4.51 | n/a
10 | GPT-5 | 14.9% | ±4.20 | 41.8%
11 | Gemini 2.5 Pro Preview | 10.1% | ±3.56 | n/a
12 | Claude Sonnet 4 | 9.1% | ±3.39 | 42.7%

Source: Scale AI SEAL Private Leaderboard. Wider confidence intervals reflect the smaller 276-task set.
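To see why a smaller task set widens the interval, here is a sketch that assumes a plain normal-approximation (Wald) interval for a pass rate. That is an assumption made for illustration: this page does not state how Scale computes its intervals, and `wald_half_width` is a name invented here.

```python
import math


def wald_half_width(score_pct, n_tasks, z=1.96):
    """Half-width, in percentage points, of a normal-approximation 95% interval.

    Assumes each task is an independent pass/fail trial (an illustrative
    simplification, not necessarily Scale's method).
    """
    p = score_pct / 100.0
    return 100.0 * z * math.sqrt(p * (1.0 - p) / n_tasks)


if __name__ == "__main__":
    # Same score, two set sizes: the smaller set gives the wider interval.
    for n_tasks in (731, 276):
        print(n_tasks, round(wald_half_width(47.1, n_tasks), 2))
```

Under this assumption, a 59.1% score on 731 tasks comes out close to the published ±3.56. A 47.1% score on 276 tasks comes out at roughly ±5.9, a little narrower than the published ±6.07, so Scale's exact procedure probably differs in its details.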

The reshuffle is the interesting part. GPT-5.4 leads the public set by 4.1 points but falls to third on commercial code. Opus 4.5 drops 22.5 points (45.9% to 23.4%), the largest fall among the top seven; further down the table, GPT-5 loses 26.9 points and Sonnet 4 loses 33.6. Opus 4.6 holds 47.1%, losing only 4.8 points. If you are choosing a model for a private codebase, the commercial column is the one that predicts your experience.
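The public-to-commercial drop for each model can be recomputed from the two score columns of the commercial table. This is a small sketch with illustrative names; models without a public-set score are left out.

```python
# model -> (commercial score %, public-set score %), from the commercial-set table
SCORES = {
    "Claude Opus 4.6 (thinking)": (47.1, 51.9),
    "Muse Spark": (44.7, 55.0),
    "GPT-5.4 (xHigh)": (43.4, 59.1),
    "Gemini 3.1 Pro (thinking)": (32.2, 46.1),
    "GPT-5.2 Codex": (27.7, 41.0),
    "Claude Opus 4.5": (23.4, 45.9),
    "Gemini 3 Pro": (18.0, 43.3),
    "GPT-5": (14.9, 41.8),
    "Claude Sonnet 4": (9.1, 42.7),
}


def drops_by_size(scores):
    """Return (model, points lost from public to commercial), largest drop first."""
    drops = [
        (model, round(public - commercial, 1))
        for model, (commercial, public) in scores.items()
    ]
    return sorted(drops, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for model, lost in drops_by_size(SCORES):
        print(f"{model}: -{lost} pts")
```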

Vendor-Reported SWE-bench Pro Scores: Fable 5 at 80.3%, Opus 4.8 at 69.2%

Labs also publish SWE-bench Pro numbers run on their own agent scaffolds. These are not comparable to SEAL scores: the harness, context retrieval, and turn budgets are tuned per lab. They are comparable to each other within one lab's table. Anthropic's Claude Fable 5 launch table (June 9, 2026):

Model | Score | Output Price
Claude Fable 5 | 80.3% | $50/M tokens
Claude Mythos Preview | 77.8% | n/a
Claude Opus 4.8 | 69.2% | $25/M tokens
GPT-5.5 | 58.6% | $30/M tokens
Gemini 3.1 Pro | 54.2% | $12/M tokens

Source: Anthropic launch benchmarks via Vellum's analysis. Prices from the Anthropic and OpenAI/Google API price lists, June 2026.

The vendor-vs-SEAL gap is consistent: Anthropic reports 69.2% for Opus 4.8 while Scale's best standardized Claude run (Opus 4.6 thinking) scores 51.9%. GPT-5.3-Codex reported 57% at launch on OpenAI's scaffold; its predecessor gpt-5.2-codex scores 41.0% under SEAL. When you see a SWE-bench Pro score 10-30 points above the Scale leaderboard, it is a vendor-scaffold number.

Score per Dollar: SWE-bench Pro Points per $1/M Output Tokens

Benchmark points are not free. Dividing each model's SWE-bench Pro score by its output-token price ($/M) shows where capability is cheap. Haiku 4.5 buys 7.9 points per output dollar. Fable 5, the highest scorer, buys 1.6.

Model | Pro Score | Scaffold | $/M Output | Points per $
Claude Haiku 4.5 | 39.5% | Scale SEAL | $5 | 7.9
GPT-5.4 (xHigh) | 59.1% | Scale SEAL | $15 | 3.9
Gemini 3.1 Pro | 46.1% | Scale SEAL | $12 | 3.8
GPT-5.2 Codex | 41.0% | Scale SEAL | $14 | 2.9
Claude Opus 4.8 | 69.2% | Vendor | $25 | 2.8
Claude Opus 4.6 | 51.9% | Scale SEAL | $25 | 2.1
GPT-5.5 | 58.6% | Vendor | $30 | 2.0
Claude Opus 4.5 | 45.9% | Scale SEAL | $25 | 1.8
Claude Fable 5 | 80.3% | Vendor | $50 | 1.6

Prices: Anthropic, OpenAI, and Google official API price lists, June 2026. Note: Opus 4.7 and later (including Fable 5) use a tokenizer that can produce up to 35% more tokens for the same text than pre-4.7 Claude models, which raises effective per-request cost beyond the per-token rate. Full cost modeling in our LLM cost calculator.
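The points-per-dollar column is a single division, sketched below. The optional `token_inflation` factor is an illustrative way to fold in the tokenizer note: setting it to 1.35 models the worst case ("up to 35% more tokens"), which is an upper bound on cost and not a measured average.

```python
# (model, SWE-bench Pro score %, output price in $ per 1M tokens, newer tokenizer?)
ROWS = [
    ("Claude Haiku 4.5", 39.5, 5.0, False),
    ("GPT-5.4 (xHigh)", 59.1, 15.0, False),
    ("Gemini 3.1 Pro", 46.1, 12.0, False),
    ("GPT-5.2 Codex", 41.0, 14.0, False),
    ("Claude Opus 4.8", 69.2, 25.0, True),
    ("Claude Opus 4.6", 51.9, 25.0, False),
    ("GPT-5.5", 58.6, 30.0, False),
    ("Claude Opus 4.5", 45.9, 25.0, False),
    ("Claude Fable 5", 80.3, 50.0, True),
]

# Worst case from the pricing note: up to 35% more tokens for the same text.
WORST_CASE_TOKEN_INFLATION = 1.35


def points_per_dollar(score_pct, price_per_m, token_inflation=1.0):
    """Benchmark points per $1/M output tokens, optionally inflation-adjusted."""
    return score_pct / (price_per_m * token_inflation)


if __name__ == "__main__":
    for model, score, price, newer_tokenizer in ROWS:
        nominal = points_per_dollar(score, price)
        line = f"{model}: {nominal:.1f} points per $"
        if newer_tokenizer:
            worst = points_per_dollar(score, price, WORST_CASE_TOKEN_INFLATION)
            line += f" (worst case {worst:.2f})"
        print(line)
```

Output-token price is only a proxy for cost per task, since models differ in how many tokens they spend per attempt.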

WarpGrep Impact on SWE-bench Pro (Morph Internal)

Self-reported data

The scores below are from Morph's internal benchmark runs (March 2026), not from the SEAL leaderboard. They show the effect of adding WarpGrep v2 as a search subagent to existing coding agents.

SWE-bench Pro: With vs Without WarpGrep v2

Morph internal benchmarks, public set (731 tasks)

Legend: With WarpGrep v2 / Without WarpGrep

  1. Codex 5.3: 59.1%
  2. MiniMax 2.5: 57.6%
  3. Opus 4.6: 57.5%
WarpGrep v2 adds 2.1-2.2 points to every model tested.

WarpGrep v2 is an RL-trained search subagent that runs in its own context window. It issues up to 8 parallel tool calls per turn and returns only the relevant file spans. The main coding model never sees files WarpGrep rejected, so its context stays clean.

With Opus 4.6, adding WarpGrep v2 cuts cost by 15.6% and time by 28%. The expensive model spends fewer tokens on search and more on code generation. Read how subagents make coding agents faster for the full breakdown.

SWE-bench Verified Leaderboard (June 2026)

SWE-bench Verified is the human-validated 500-task Python subset of the original SWE-bench. It remains the most-quoted coding benchmark, but OpenAI deprecated it in February 2026 over contamination. Scores below are vendor-reported and aggregated by llm-stats.

Rank | Model | Score
1 | Claude Fable 5 | 95.0%
2 | Claude Mythos Preview | 93.9%
3 | Claude Opus 4.8 | 88.6%
4 | Claude Opus 4.7 | 87.6%
5 | Claude Opus 4.5 | 80.9%
6 | Claude Opus 4.6 | 80.8%
7 | DeepSeek-V4-Pro-Max (open) | 80.6%
8 | Gemini 3.1 Pro | 80.6%
9 | MiniMax M3 (open) | 80.5%
10 | Qwen3.7 Max | 80.4%

Source: llm-stats SWE-bench Verified tracker, June 2026. Vendor-reported; harness differences apply. See our full Claude benchmarks page for the rest of the suite.

Note the compression: ranks 5 through 10 span 0.5 points (80.9% to 80.4%). When six models from five labs are effectively tied near 80%, the benchmark has stopped discriminating at the frontier. That saturation, plus contamination, is why Pro exists.

SWE-bench Pro vs Verified: Same Model, Different Score

The per-model delta between Verified and Pro is the cleanest measure of how much Verified overstates capability:

Model | Verified | Pro | Drop | Pro Scaffold
Claude Opus 4.5 | 80.9% | 45.9% | −35.0 pts | Scale SEAL
Gemini 3.1 Pro | 80.6% | 46.1% | −34.5 pts | Scale SEAL
Claude Opus 4.6 | 80.8% | 51.9% | −28.9 pts | Scale SEAL
Claude Opus 4.8 | 88.6% | 69.2% | −19.4 pts | Vendor
Claude Fable 5 | 95.0% | 80.3% | −14.7 pts | Vendor

GPT-5 is the starkest case: it scores 41.8% on Pro's public set and 14.9% on the commercial set, against the 70%+ range its generation posted on Verified. The drop is not the model getting worse. It is the benchmark getting honest.

Dimension | SWE-bench Verified | SWE-bench Pro
Tasks | 500 | 1,865
Repositories | 12 (all Python) | 41 (Python, Go, TS, JS)
Avg lines changed | 11 (median: 4) | 107.4
Avg files changed | ~1 | 4.1
Minimum task size | 161/500 tasks are 1-2 lines | Every task is 10+ lines
Contamination resistance | Low: public Python repos | High: copyleft + proprietary code
Status | Deprecated by OpenAI, Feb 2026 | Active, recommended

Open-Source Models on SWE-bench: DeepSeek V4, GLM-5.1, MiniMax M3, Qwen

Open-weights models now tie Gemini 3.1 Pro on Verified, but their SWE-bench Pro coverage is thin. Status per model, June 9, 2026:

Model | Verified | Pro (Scale SEAL) | Output Price
DeepSeek-V4-Pro-Max | 80.6% | No entry | $0.87/M (V4-Pro API)
MiniMax M3 | 80.5% | No entry | $1.20/M
Qwen3.7 Max | 80.4% | No entry | n/a
Qwen3 Coder 480B | n/a | 38.7% | n/a
GLM-5.1 | n/a | No entry | $4.40/M

Verified scores: llm-stats, June 2026. Pro: Scale SEAL leaderboard. Prices: official DeepSeek, MiniMax, and Z.AI API price lists.

DeepSeek V4 and SWE-bench: neither DeepSeek V4 Flash nor Pro has a Scale SEAL SWE-bench Pro entry as of June 9, 2026. Third-party trackers circulate 55.4% for V4-Pro on vendor-style scaffolds (unverified by Scale). Its strongest verified result is V4-Pro-Max at 80.6% on SWE-bench Verified, the top open-weights score, tied with Gemini 3.1 Pro. V4 is MIT-licensed, 1.6T total / 49B active parameters (Pro) and 284B / 13B (Flash), with API output at $0.28/M (Flash) and $0.87/M (Pro).

GLM-5.1 and SWE-bench Pro: the 58.4% figure circulating for GLM-5.1 is vendor-reported, not a Scale SEAL entry. Scale's standardized leaderboard has no GLM-5 generation entry; the top open-weights entry under SEAL scaffolding remains qwen3-coder-480b-a35b at 38.7%. GLM-5.1 costs $1.40/M input, $4.40/M output on the official Z.AI API. Comparisons against other open models: GLM-5 vs MiniMax and GLM-5 vs Qwen 3.5.

How SWE-bench Pro Works: 1,865 Tasks, 41 Repos, Pass@1

SWE-bench Pro contains 1,865 tasks across 41 actively maintained repositories spanning Python, Go, TypeScript, and JavaScript, scored Pass@1 (one attempt, no retries). Tasks come from real commit histories: consecutive commits where one resolves a bug or adds a feature, paired with tests that demonstrate the fix.

Three Subsets

Public Set (731 tasks)

Tasks from 11 copyleft (GPL) repositories, openly available on HuggingFace. The primary evaluation target for leaderboard submissions.

Commercial Set (276 tasks)

Tasks from 18 proprietary startup codebases, acquired through Scale AI partnerships. Not publicly accessible: the strongest contamination control.

Held-Out Set (858 tasks)

Tasks from 12 repositories reserved for overfitting detection. Scale can release these to verify that public-set gains generalize.
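As a quick consistency check, the three subsets should add up to the headline totals. The sketch below encodes the split as a small data structure; the `Subset` type is invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subset:
    name: str
    tasks: int
    repos: int


SUBSETS = [
    Subset("public", tasks=731, repos=11),
    Subset("commercial", tasks=276, repos=18),
    Subset("held-out", tasks=858, repos=12),
]

TOTAL_TASKS = sum(s.tasks for s in SUBSETS)
TOTAL_REPOS = sum(s.repos for s in SUBSETS)

if __name__ == "__main__":
    for s in SUBSETS:
        share = 100.0 * s.tasks / TOTAL_TASKS
        print(f"{s.name}: {s.tasks} tasks, {s.repos} repos, {share:.1f}% of all tasks")
    print(f"total: {TOTAL_TASKS} tasks across {TOTAL_REPOS} repositories")
```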

Three-Stage Human Augmentation

  1. Problem statement creation: original commit messages and issue discussions are synthesized into clear, structured descriptions
  2. Requirements definition: annotators create specification lists grounded in unit tests and gold patches, detailing expected behavior without prescribing implementation
  3. Interface specification: class and function signatures are documented to prevent false negatives from naming mismatches

Evaluation methodology

Evaluation uses containerized, language-specific environments. Each task must pass fail2pass tests (tests that fail before the fix and pass after, verifying the issue is resolved) and pass2pass tests (existing tests that must keep passing). Gold patches are validated across 3 test runs before inclusion. Copyleft licensing makes the public set legally unattractive as training data, and the commercial set is never published at all.
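The pass criterion described above can be sketched in a few lines. This is an illustration of the fail2pass / pass2pass rule and the Pass@1 metric, not Scale's harness; the function names and the test names in the example are hypothetical.

```python
def is_resolved(test_results, fail2pass, pass2pass):
    """Decide whether a candidate patch resolves a task.

    test_results maps test name -> True if the test passed after the patch.
    Every fail2pass test (the fix works) and every pass2pass test (nothing
    regressed) must pass. A test missing from the results counts as a failure.
    """
    required = set(fail2pass) | set(pass2pass)
    return all(test_results.get(name, False) for name in required)


def pass_at_1(resolved_flags):
    """Percentage of tasks resolved on the single allowed attempt."""
    if not resolved_flags:
        raise ValueError("no tasks evaluated")
    return 100.0 * sum(resolved_flags) / len(resolved_flags)


if __name__ == "__main__":
    # Hypothetical task: one test targets the bug, one guards existing behavior.
    results = {"test_bug_is_fixed": True, "test_existing_feature": False}
    resolved = is_resolved(results, ["test_bug_is_fixed"], ["test_existing_feature"])
    print("resolved:", resolved)  # the regression makes this attempt a failure
    print("pass@1:", pass_at_1([True, False, True, True]))
```

Requiring both test groups is what prevents a patch from scoring by fixing the reported issue while breaking existing behavior.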

Why Scores Are So Much Lower Than Verified

Four factors compound:

  1. Multi-file modifications: Pro tasks touch 4.1 files on average; Verified is mostly single-file.
  2. Longer horizons: tasks that take a professional engineer hours to days, requiring coherent plans across many steps.
  3. Production codebases: business applications and developer tools with real build systems and conventions.
  4. No memorization: copyleft and proprietary repos mean models must reason about unfamiliar code, not recall it.

Failure mode analysis

Scale's trajectory analysis shows where models break: semantic understanding failures (35.9% of Opus 4.1 failures), context overflow (35.6% of Sonnet 4 failures), and tool-use inefficiency (42% of smaller-model failures). Context overflow dominating the strongest models aligns with research showing coding agents spend 60%+ of their time searching for context.

Is SWE-bench Verified Contaminated? Why OpenAI Deprecated It

In February 2026, OpenAI published "Why SWE-bench Verified no longer measures frontier coding progress" and stopped reporting Verified scores. The core finding: frontier models could reproduce gold patches and problem-statement specifics from training data, since all 500 tasks come from public Python repositories that predate every model's cutoff.

Benchmark validity criticism cuts both ways. A widely circulated community analysis claims 68.5% of GPT-5.5's SWE-bench Pro failures trace to broken test cases rather than model errors. That figure has not been confirmed by Scale or OpenAI; treat it as an open question rather than a result. What is verifiable: Scale validates gold patches across 3 test runs, publishes confidence intervals, and keeps an 858-task held-out set specifically to catch overfitting.

Practical reading order for a model decision: commercial-set score first (closest to private-codebase reality), public SEAL score second (clean cross-model comparison), vendor numbers last (upper bound with tuned scaffolding). Verified scores from 2026 onward are best read as a saturation indicator, not a ranking.
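The reading order above amounts to a fallback chain: use the most trustworthy score that exists for a model. Here is a minimal sketch, with dictionary keys and the function name chosen for illustration.

```python
# Most trustworthy source first, per the reading order described above.
READING_ORDER = ("commercial", "public_seal", "vendor")


def decision_score(scores):
    """Return (source, score) for the best available source, or None if none exist."""
    for source in READING_ORDER:
        value = scores.get(source)
        if value is not None:
            return source, value
    return None


if __name__ == "__main__":
    models = {
        "Claude Opus 4.6 (thinking)": {"commercial": 47.1, "public_seal": 51.9},
        "Claude Haiku 4.5": {"public_seal": 39.5},
        "Claude Fable 5": {"vendor": 80.3},
    }
    for name, scores in models.items():
        print(name, decision_score(scores))
```

Because the sources are not comparable with each other, the returned source label matters as much as the number: a vendor-only 80.3% and a commercial-set 47.1% should not be ranked on one axis.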

Frequently Asked Questions

What is SWE-bench Pro?

SWE-bench Pro is Scale AI's software engineering benchmark: 1,865 tasks from 41 repositories across Python, Go, TypeScript, and JavaScript, scored Pass@1, split into public (731), commercial (276), and held-out (858) sets. Tasks average 107.4 changed lines across 4.1 files.

How hard is SWE-bench Pro?

Models lose 15 to 35 points moving from Verified to Pro. Opus 4.5: 80.9% to 45.9%. Gemini 3.1 Pro: 80.6% to 46.1%. The best standardized score as of June 9, 2026 is 59.1% (GPT-5.4 xHigh). On the proprietary commercial set, no model exceeds 47.1%.

What does Claude Fable 5 score on SWE-bench Pro?

80.3%, per Anthropic's launch table (GA June 9, 2026), versus 69.2% for Opus 4.8 and 58.6% for GPT-5.5 in the same vendor-run comparison. Scale's standardized SEAL leaderboard has no Fable 5 entry yet; its top Claude run is Opus 4.6 (thinking) at 51.9%. Fable 5 is priced at $10/M input, $50/M output with a 1M-token context window.

What does Claude Opus 4.8 score on SWE-bench Pro?

69.2% vendor-reported (Anthropic scaffold). Opus 4.8 (released May 28, 2026) also posts 88.6% on SWE-bench Verified and 74.6% on Terminal-Bench 2.1, at $5/M input and $25/M output.

What does GPT-5.3-Codex score on SWE-bench Pro?

OpenAI reported 57% at launch on its own Codex scaffold. Under Scale's standardized scaffolding, the predecessor gpt-5.2-codex scores 41.0% on the public set and 27.7% on the commercial set. gpt-5.3-codex is priced at $1.75/M input, $14/M output.

Does DeepSeek V4 have a SWE-bench Pro score?

No Scale SEAL entry exists for any DeepSeek V4 variant as of June 9, 2026. Third-party trackers report 55.4% for V4-Pro on vendor-style scaffolds (unverified). DeepSeek-V4-Pro-Max scores 80.6% on SWE-bench Verified, the highest open-weights result. Details on the model family: DeepSeek V4.

What is the best open-source model on SWE-bench?

On Verified: DeepSeek-V4-Pro-Max (80.6%), MiniMax M3 (80.5%), Qwen3.7 Max (80.4%). On Scale's standardized SWE-bench Pro leaderboard, the top open-weights entry is qwen3-coder-480b-a35b at 38.7%. GLM-5.1's circulating 58.4% Pro figure is vendor-reported, not a SEAL entry. See best open-source coding models.

Why do vendor scores and Scale SEAL scores differ?

Scale runs every model through identical scaffolding; vendors run tuned agent harnesses. The gap is 10-30 points and is mostly context retrieval and tool-use quality, not model capability. Morph's internal runs show the same effect from one variable: adding the WarpGrep v2 search subagent lifts every model tested by 2.1-2.2 points.

Is SWE-bench Verified still useful?

As a frontier ranking, no: OpenAI deprecated it in February 2026 over confirmed contamination, and ranks 5-10 now sit within 0.5 points of each other. It still separates weak models from strong ones and runs cheaply. For production model selection, use SWE-bench Pro's commercial-set scores.

WarpGrep v2: Search Subagent for SWE-bench Pro

WarpGrep v2 is the RL-trained search subagent that lifted every model it was paired with by 2+ points on SWE-bench Pro in Morph's internal runs. It runs in its own context window, issues up to 8 parallel tool calls per turn, and with Opus 4.6 cut cost by 15.6% and time by 28%. Free for 100k requests, then $1 per 1M.