Relax flaky B200 GSM8K accuracy thresholds#20304
Merged
Conversation
- DeepSeek V3 FP4: 0.935 → 0.93 (observed values 0.9318–0.9507 across 12 runs)
- Eagle DP Attention: 0.64 → 0.62 (observed values 0.630–0.655 across 12 runs)
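For illustration, the relaxed checks might look like the following sketch; the test names and harness structure here are assumptions, not the actual sglang test file:

```python
# Hypothetical sketch of the relaxed GSM8K accuracy checks; the real test
# names and CI harness in sglang may differ.
import unittest


class TestGSM8KThresholds(unittest.TestCase):
    def test_deepseek_v3_fp4(self):
        accuracy = 0.9318  # lowest value observed across 12 runs
        # Threshold relaxed from 0.935 to 0.93.
        self.assertGreater(accuracy, 0.93)

    def test_eagle_dp_attention(self):
        accuracy = 0.630  # lowest value observed across 12 runs
        # Threshold relaxed from 0.64 to 0.62.
        self.assertGreater(accuracy, 0.62)
```

With the old thresholds, the lowest observed values (0.9318 and 0.630) would have passed the first test only narrowly and failed the second outright.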
hnyls2002 (Collaborator) requested changes on Mar 11, 2026:

> Remove this file. You can only put it in the PR's body, not in the repo.
hnyls2002 approved these changes on Mar 11, 2026.
liubiyongge pushed a commit to liubiyongge/sglang that referenced this pull request on Mar 13, 2026 (Co-authored-by: Alison Shao <alisonshao@Mac.attlocal.net>).
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request on Mar 21, 2026 (Co-authored-by: Alison Shao <alisonshao@Mac.attlocal.net>).
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request on Apr 7, 2026 (Co-authored-by: Alison Shao <alisonshao@Mac.attlocal.net>).
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request on Apr 22, 2026 (Co-authored-by: Alison Shao <alisonshao@Mac.attlocal.net>).
Summary
Both tests run in `stage-c-test-4-gpu-b200` and have been causing flaky CI failures due to thresholds set too close to the natural variance of accuracy on a 200-question benchmark.

Data Analysis (~1 month of scheduled CI)
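As a rough sanity check on why the old thresholds were fragile (a back-of-envelope calculation, not taken from the PR): a single run's accuracy on n = 200 questions has binomial standard deviation sqrt(p(1 − p)/n), which is large relative to the old margins.

```python
import math

# Binomial standard deviation of a single run's accuracy on n questions.
def accuracy_stddev(p: float, n: int = 200) -> float:
    return math.sqrt(p * (1 - p) / n)

# DeepSeek V3 FP4: mean accuracy near 0.94 gives sigma of about 0.017,
# so the old 0.935 threshold sat well inside normal run-to-run noise.
print(round(accuracy_stddev(0.94), 4))  # -> 0.0168

# Eagle DP Attention: mean near 0.64 gives sigma of about 0.034.
print(round(accuracy_stddev(0.64), 4))  # -> 0.0339
```

On this estimate, both old thresholds were within roughly one standard deviation of typical runs, so occasional failures were expected even with no regression.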
Analyzed 200 scheduled CI runs (Jan 20 – Mar 11). Extracted accuracy values from B200 shard logs across `stage-c-test-4-gpu-b200 (0)` and `(2)`.

Overview
DeepSeek V3 FP4 GSM8K (120 runs, 486 values, Feb 6 – Mar 11)
Sample failures (click run IDs for logs):
Eagle DP Attention GSM8K (93 data points, 103 runs, Feb 10 – Mar 11)
Tracked across shards — the test ran in shard 2 before ~Mar 6 and moved to shard 0 after.
(`assertGreater` is a strict comparison, so accuracy == 0.64 also fails.)

Runs with failures (accuracy ≤ 0.64):
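A minimal demonstration of that strictness (standalone example, not from the test file):

```python
import unittest

tc = unittest.TestCase()

# assertGreater requires strictly greater: a value just above passes.
tc.assertGreater(0.65, 0.64)

# A run landing exactly on the threshold still counts as a failure.
try:
    tc.assertGreater(0.64, 0.64)
except AssertionError:
    print("accuracy == threshold fails")  # prints "accuracy == threshold fails"
```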
Test plan