Relax flaky B200 GSM8K accuracy thresholds #20304

Merged: hnyls2002 merged 12 commits into main from fix/flaky-b200-thresholds on Mar 11, 2026

@alisonshao (Collaborator) commented Mar 10, 2026

Summary

  • Relax DeepSeek V3 FP4 GSM8K threshold from 0.935 → 0.93 (all 4 test classes)
  • Relax Eagle DP Attention GSM8K threshold from 0.64 → 0.62

Both tests run in stage-c-test-4-gpu-b200 and have been causing flaky CI failures due to thresholds set too close to the natural variance of accuracy on a 200-question benchmark.

Data Analysis (~1 month of scheduled CI)

Analyzed 200 scheduled CI runs (Jan 20 – Mar 11). Extracted accuracy values from B200 shard logs across stage-c-test-4-gpu-b200 (0) and (2).

Overview

[Figure: B200 Accuracy Overview]

DeepSeek V3 FP4 GSM8K (120 runs, 486 values, Feb 6 – Mar 11)

[Figure: FP4 Accuracy Detail]

  • 17/120 runs (14%) had at least one test class ≤ 0.935 (old threshold)
  • Only 6/486 values (1.2%) ≤ 0.93 (new threshold), all from a single transient outlier on Feb 13
  • Range: 0.9083 – 0.9560, mean: 0.9443

Sample failures (click run IDs for logs):

| Date | Run | Min Accuracy | Max Accuracy |
|---|---|---|---|
| Feb 13 | 21969599115 | 0.9121 | 0.9500 |
| Mar 3 | 22602086581 | 0.9333 | 0.9500 |
| Mar 4 | 22649006165 | 0.9325 | 0.9507 |
| Mar 7 | 22804312973 | 0.9333 | 0.9431 |
| Mar 8 | 22820753370 | 0.9318 | 0.9454 |
| Mar 9 | 22841054156 | 0.9340 | 0.9447 |
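
The effect of moving the threshold can be replayed against the sample rows above. This is a minimal sketch, not the CI's actual test code: the names `RUN_MIN_ACCURACY` and `runs_failing` are hypothetical, the values are the Min Accuracy column of the table, and it assumes a run fails whenever its lowest test class is at or below the threshold.

```python
from typing import Dict, List

# Minimum per-run accuracy, copied from the "Sample failures" table.
# Keys are the run IDs; the dictionary name is illustrative only.
RUN_MIN_ACCURACY: Dict[str, float] = {
    "21969599115": 0.9121,
    "22602086581": 0.9333,
    "22649006165": 0.9325,
    "22804312973": 0.9333,
    "22820753370": 0.9318,
    "22841054156": 0.9340,
}


def runs_failing(threshold: float) -> List[str]:
    """Return the run IDs whose minimum accuracy is at or below `threshold`.

    Assumption: the test fails when accuracy <= threshold.
    """
    return [run for run, acc in RUN_MIN_ACCURACY.items() if acc <= threshold]


if __name__ == "__main__":
    for threshold in (0.935, 0.93):
        failing = runs_failing(threshold)
        print(f"threshold {threshold}: {len(failing)} of {len(RUN_MIN_ACCURACY)} sample runs fail")
```

Under these assumptions all six sample runs fail at 0.935, while only the Feb 13 run (21969599115) still fails at 0.93.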

Eagle DP Attention GSM8K (93 data points, 103 runs, Feb 10 – Mar 11)

[Figure: Eagle Accuracy Full History]

Tracked across shards — the test ran in shard 2 before ~Mar 6 and moved to shard 0 after.

  • 12/93 values (13%) ≤ 0.64 (old threshold; assertGreater means == 0.64 also fails)
  • 0/93 values ≤ 0.62 (new threshold)
  • Range: 0.630 – 0.665, mean: 0.652
  • The full range spans just 7 questions on the 200-question GSM8K benchmark, and the old threshold of 0.64 sits inside this noise band, only 2 questions above the observed minimum
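
The question-count arithmetic behind these bullets can be checked with a short sketch. It assumes, as stated above, that the benchmark has 200 questions, so each question is worth 0.005 accuracy. The helper names are hypothetical and not part of the test suite.

```python
NUM_QUESTIONS = 200  # size of the GSM8K subset, per the PR description

# Observed Eagle DP Attention accuracy range, from the bullets above.
OBSERVED_MIN = 0.630
OBSERVED_MAX = 0.665


def questions_correct(accuracy: float) -> int:
    """Convert an accuracy fraction into a count of correct answers."""
    return round(accuracy * NUM_QUESTIONS)


def headroom(threshold: float) -> int:
    """Questions between the threshold and the worst observed run.

    Negative means the worst observed run sits below the threshold.
    """
    return questions_correct(OBSERVED_MIN) - questions_correct(threshold)


# Width of the observed noise band, in questions.
span = questions_correct(OBSERVED_MAX) - questions_correct(OBSERVED_MIN)

if __name__ == "__main__":
    print("span in questions:", span)
    print("headroom at 0.64:", headroom(0.64))
    print("headroom at 0.62:", headroom(0.62))
```

The band is 7 questions wide. The old threshold sits 2 questions above the worst observed run, and the new one sits 2 questions below it.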

Runs with failures (accuracy ≤ 0.64):

| Date | Run | Attempt 1 | Attempt 2 | Status |
|---|---|---|---|---|
| Mar 4 | 22649006165 | 0.640 ❌ | 0.645 ✅ | retried |
| Mar 4 | 22657546003 | 0.635 ❌ | 0.635 ❌ | FAILED |
| Mar 4 | 22668727618 | 0.635 ❌ | 0.640 ❌ | FAILED |
| Mar 5 | 22714949654 | 0.640 ❌ | 0.640 ❌ | FAILED |
| Mar 5 | 22738497684 | 0.635 ❌ | 0.635 ❌ | FAILED |
| Mar 7 | 22798741537 | 0.635 ❌ | 0.630 ❌ | FAILED |
| Mar 8 | 22826748090 | 0.640 ❌ | 0.640 ❌ | FAILED |
| Mar 10 | 22880962430 | 0.640 ❌ | 0.640 ❌ | FAILED |
| Mar 10 | 22901799075 | 0.640 ❌ | 0.640 ❌ | FAILED |
| Mar 10 | 22917535380 | 0.630 ❌ | 0.650 ✅ | retried |
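
The table can be replayed under both thresholds with a small sketch. It is illustrative only: `ATTEMPTS`, `outcome`, and `tally` are hypothetical names, the strict `>` comparison mirrors the `assertGreater` behaviour noted above, and the retry policy (a second attempt counts only if the first fails) is an assumption inferred from the Status column.

```python
from collections import Counter
from typing import List, Tuple

# (attempt 1, attempt 2) accuracies, copied row by row from the table above.
ATTEMPTS: List[Tuple[float, float]] = [
    (0.640, 0.645),
    (0.635, 0.635),
    (0.635, 0.640),
    (0.640, 0.640),
    (0.635, 0.635),
    (0.635, 0.630),
    (0.640, 0.640),
    (0.640, 0.640),
    (0.640, 0.640),
    (0.630, 0.650),
]


def outcome(attempts: Tuple[float, ...], threshold: float) -> str:
    """Classify one run as 'passed', 'retried', or 'FAILED'."""
    for index, accuracy in enumerate(attempts):
        # Strict comparison: accuracy equal to the threshold still fails.
        if accuracy > threshold:
            return "passed" if index == 0 else "retried"
    return "FAILED"


def tally(threshold: float) -> Counter:
    """Count run outcomes across all rows for a given threshold."""
    return Counter(outcome(row, threshold) for row in ATTEMPTS)


if __name__ == "__main__":
    for threshold in (0.64, 0.62):
        print(threshold, dict(tally(threshold)))
```

With the old threshold of 0.64 this reproduces the Status column: 8 runs fail and 2 pass only on retry. With 0.62, all 10 runs pass on the first attempt.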

Test plan

  • Analyzed 93+ Eagle and 486 FP4 data points over about 1 month: the new Eagle threshold sits below all observed values, and the new FP4 threshold sits below all but the 6 values from the Feb 13 outlier
  • No test logic changes — only threshold constants adjusted

- DeepSeek V3 FP4: 0.935 → 0.93 (observed values 0.9318–0.9507 across 12 runs)
- Eagle DP Attention: 0.64 → 0.62 (observed values 0.630–0.655 across 12 runs)

github-actions bot added the documentation label on Mar 11, 2026
Review comment on docs/charts/b200_accuracy_pr20304.png (outdated), from a Collaborator:

Remove this file. You can only put it in the PR's body, not in the repo.

@hnyls2002 hnyls2002 merged commit 7b44bc9 into main Mar 11, 2026
61 of 67 checks passed
@hnyls2002 hnyls2002 deleted the fix/flaky-b200-thresholds branch March 11, 2026 19:35