## Motivation
SGLang CI currently uses static, hardcoded thresholds for its accuracy and performance checks. The goal is to move to regression-based checking: compare each CI run against historical baselines instead of hardcoded numbers.
## Roadmap
- Phase 0: Metrics collection foundation ✅ DONE
- Phase 1: Unify eval & bench entry points 🔄 IN PROGRESS
- Phase 2: Wire `dump_metric()` into all paths ⏳ BLOCKED on Phase 1
- Phase 3: Durable metrics storage ⏳ BLOCKED on Phase 2
- Phase 4: Regression detection logic ⏳ BLOCKED on Phase 3
### Phase 0: Metrics Collection Foundation — ✅ DONE
`dump_metric()` merged (#16064), nightly upload to GH Artifacts (#17696), performance dashboard (#17725).
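For orientation, a minimal sketch of what a call site might look like; the import path and argument names below are assumptions, not the API actually landed in #16064:

```python
# Hypothetical call-site sketch; the real signature landed in #16064
# and may differ. The import path below is an assumption.
from sglang.test.ci_metrics import dump_metric

# Each call records one named metric for the current CI job. The nightly
# workflow (#17696) collects these dumps, uploads them as a GH Artifact,
# and the dashboard (#17725) renders them.
dump_metric("gsm8k_accuracy", 0.79)
dump_metric("eval_latency_s", 412.3)
```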
### Phase 1: Unify Entry Points — 🔄 IN PROGRESS
`dump_metric` only works if callers use it; currently only `run_eval.py` calls it. The fragmented eval/bench systems must be consolidated first: `bench_serving.py` + `NightlyBenchmarkRunner` + `benchmark/` scripts converge into a unified runner, tracked by [Refactor] Benchmark Scripts Refactor #10177.
### Phase 2: Wire `dump_metric` Into All Paths
Once entry points are unified, add `dump_metric()` at each one — accuracy (score, latency) and performance (throughput, TTFT, ITL, E2E latency).
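A sketch of what that wiring could look like at a unified bench entry point; the helper name and result fields are illustrative assumptions about the unified runner's output, not a real interface:

```python
# Illustrative wiring sketch; the helper name and result fields are
# assumptions about the unified runner's output, not a real interface.
from sglang.test.ci_metrics import dump_metric  # assumed path, as above

def report_serving_metrics(result) -> None:
    """Emit one dump_metric() call per tracked performance metric."""
    dump_metric("output_throughput_tok_s", result.output_throughput)
    dump_metric("ttft_ms_median", result.median_ttft_ms)
    dump_metric("itl_ms_median", result.median_itl_ms)
    dump_metric("e2e_latency_ms_median", result.median_e2e_latency_ms)

# Called once at the end of the unified entry point, so every code path
# that produces numbers also persists them.
```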
### Phase 3: Durable Metrics Storage
Metrics must persist across CI runs for baseline comparison. An S3 upload has been attempted (#21057, blocked on AWS credentials). Still to decide: S3 vs GH Artifacts vs a git-based store vs the dashboard DB.
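Whichever backend is chosen, Phase 4 only needs an append-only record per nightly run, keyed by commit, so a baseline can be computed over the last N records. A minimal JSONL sketch (the schema is an assumption, not a decided format):

```python
# Minimal storage sketch: append-only JSONL, one record per nightly run.
# The schema is an assumption, not a decided format.
import json
import time

def append_run_record(path: str, commit: str, metrics: dict) -> None:
    """Append one nightly run's metrics, keyed by commit and timestamp."""
    record = {"commit": commit, "timestamp": int(time.time()), "metrics": metrics}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def load_recent_runs(path: str, n: int) -> list[dict]:
    """Return the last n run records for baseline computation."""
    with open(path) as f:
        records = [json.loads(line) for line in f if line.strip()]
    return records[-n:]
```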
### Phase 4: Regression Detection
An automated `current_metric >= baseline * (1 - tolerance)` check in CI; a sketch follows the list below.
Key decisions TBD:
- Baseline selection (rolling median of last N nightly runs on main?)
- Per-metric tolerance (accuracy ±2%, throughput ±10%, latency ±15%)
- CI action (block merge vs warning comment)
- Roll out on nightly first, then expand to PR CI
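A sketch of how these pieces could combine, assuming the JSONL records from the Phase 3 sketch above and the candidate tolerances from the list; none of these numbers or metric names are final:

```python
# Regression-check sketch, assuming the JSONL records from the Phase 3
# sketch above. Tolerances mirror the candidate values in the list; none
# of these numbers or metric names are final.
from statistics import median

TOLERANCES = {
    "gsm8k_accuracy": 0.02,           # accuracy ±2%
    "output_throughput_tok_s": 0.10,  # throughput ±10%
}

def rolling_baseline(runs: list[dict], metric: str, n: int = 7) -> float:
    """Rolling median of a metric over the last n nightly runs on main."""
    values = [r["metrics"][metric] for r in runs[-n:] if metric in r["metrics"]]
    return median(values)

def passes(current: float, baseline: float, tolerance: float) -> bool:
    """current_metric >= baseline * (1 - tolerance), per the check above."""
    return current >= baseline * (1 - tolerance)
```

As written, the inequality assumes higher-is-better metrics; latency metrics (TTFT, ITL, E2E) need the inverted check, `current <= baseline * (1 + tolerance)`.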
## Related
| # | Title | Phase |
|---|-------|-------|
| #16064 | `dump_metric()` core | 0 ✅ |
| #17696 | Nightly metrics → GH Artifacts | 0 ✅ |
| #17725 | Performance dashboard | 0 ✅ |
| #21046 | Consolidate eval systems | 1 |
| #10177 | Benchmark scripts refactor | 1 |
| #16579 | CI metrics for stage-b jobs | 2 |
| #21057 | S3 upload for nightly metrics | 3 |