[CI Infrastructure] Roadmap: Regression-Based CI Checks #21157

@hnyls2002

Motivation

SGLang CI uses static thresholds for accuracy and performance checks. This causes:

The goal is regression-based checking: compare each CI run against historical baselines instead of hardcoded numbers.

Roadmap

Phase 0: dump_metric() foundation         ✅ DONE
Phase 1: Unify eval & bench entry points   🔄 IN PROGRESS
Phase 2: Wire dump_metric into all paths   ⏳ BLOCKED on Phase 1
Phase 3: Durable metrics storage           ⏳ BLOCKED on Phase 2
Phase 4: Regression detection logic        ⏳ BLOCKED on Phase 3

Phase 0: Metrics Collection Foundation — ✅ DONE

dump_metric() merged (#16064), nightly upload to GH Artifacts (#17696), performance dashboard (#17725).

Phase 1: Unify Entry Points — 🔄 IN PROGRESS

dump_metric only works if callers use it. Currently only run_eval.py calls it. Must consolidate fragmented eval/bench systems first.

Phase 2: Wire dump_metric Into All Paths

Once entry points are unified, add dump_metric() at each one — accuracy (score, latency) and performance (throughput, TTFT, ITL, E2E latency).
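As a rough illustration of what each entry point would emit, the sketch below appends one JSON line per metric. `emit_metric`, `load_metrics`, and the record fields are hypothetical stand-ins for this example only; they do not reflect the actual `dump_metric()` signature or output format.

```python
import json
import time
from pathlib import Path


def emit_metric(path, suite, name, value, unit="", higher_is_better=True):
    """Append one metric record as a JSON line.

    Hypothetical helper for illustration; NOT the real dump_metric() API.
    """
    record = {
        "suite": suite,                        # e.g. an eval or bench entry point
        "name": name,                          # e.g. "score", "throughput", "ttft"
        "value": float(value),
        "unit": unit,
        "higher_is_better": higher_is_better,  # False for latency-style metrics
        "timestamp": time.time(),
    }
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def load_metrics(path):
    """Read back every record written by emit_metric()."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Recording the direction of each metric alongside its value is one possible design choice: it would let a later regression check treat throughput and latency differently without a separate lookup table.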

Phase 3: Durable Metrics Storage

Metrics must persist across CI runs for baseline comparison. S3 upload attempted (#21057, blocked on AWS creds). Need to decide: S3 vs GH Artifacts vs git-based vs dashboard DB.

Phase 4: Regression Detection

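Automated check in CI: current_metric >= baseline * (1 - tolerance) for higher-is-better metrics (accuracy, throughput), and current_metric <= baseline * (1 + tolerance) for lower-is-better metrics (latency).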

Key decisions TBD:

  • Baseline selection (rolling median of last N nightly runs on main?)
  • Per-metric tolerance (accuracy ±2%, throughput ±10%, latency ±15%)
  • CI action (block merge vs warning comment)
  • Roll out on nightly first, then expand to PR CI
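A minimal sketch of how the check could look, assuming a rolling-median baseline and the candidate tolerances listed above. None of these choices are decided; `POLICY`, `rolling_baseline`, `check_regression`, and the default window of 5 are hypothetical names and values used only for illustration.

```python
from statistics import median

# Hypothetical per-metric policy: (relative tolerance, higher_is_better).
# Tolerances mirror the candidate values above; they are not final.
POLICY = {
    "accuracy": (0.02, True),
    "throughput": (0.10, True),
    "latency": (0.15, False),
}


def rolling_baseline(history, window=5):
    """Median of the last `window` values; `history` is ordered oldest first."""
    if not history:
        raise ValueError("no historical runs to compare against")
    return median(history[-window:])


def check_regression(metric, current, history, window=5):
    """Return (passed, baseline) for one metric of the current CI run."""
    tolerance, higher_is_better = POLICY[metric]
    baseline = rolling_baseline(history, window)
    if higher_is_better:
        # Accuracy, throughput: fail when the value drops too far below baseline.
        passed = current >= baseline * (1 - tolerance)
    else:
        # Latency: fail when the value rises too far above baseline.
        passed = current <= baseline * (1 + tolerance)
    return passed, baseline
```

A median is less sensitive than a mean to a single noisy nightly run, which is one argument for it as the baseline. The returned `passed` flag could drive either a blocking failure or a warning comment, depending on the CI action chosen.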

Related

#        Title                              Phase
#16064   dump_metric() core                 0 ✅
#17696   Nightly metrics → GH Artifacts     0 ✅
#17725   Performance dashboard              0 ✅
#21046   Consolidate eval systems           1
#10177   Benchmark scripts refactor         1
#16579   CI metrics for stage-b jobs        2
#21057   S3 upload for nightly metrics      3
