## Motivation
SGLang CI currently uses static, hardcoded thresholds for its accuracy and performance checks. The goal is to move to regression-based checking: compare each CI run against historical baselines instead of hardcoded numbers.
## Roadmap
- Phase 0: Metrics collection foundation ✅ DONE
- Phase 1: Unify eval & bench entry points 🔄 IN PROGRESS
- Phase 2: Wire `dump_metric()` into all paths ⏳ BLOCKED on Phase 1
- Phase 3: Durable metrics storage ⏳ BLOCKED on Phase 2
- Phase 4: Regression detection logic ⏳ BLOCKED on Phase 3
### Phase 0: Metrics Collection Foundation — ✅ DONE
`dump_metric()` merged (#16064), nightly upload to GH Artifacts (#17696), performance dashboard (#17725).
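For orientation, a minimal sketch of what a call site might look like; the import path and argument names below are assumptions, not the API actually landed in #16064:

```python
# Hypothetical call-site sketch; the real signature landed in #16064
# and may differ. The import path below is an assumption.
from sglang.test.ci_metrics import dump_metric

# Each call records one named metric for the current CI job. The nightly
# workflow (#17696) collects these dumps, uploads them as a GH Artifact,
# and the dashboard (#17725) renders them.
dump_metric("gsm8k_accuracy", 0.79)
dump_metric("eval_latency_s", 412.3)
```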
### Phase 1: Unify Entry Points — 🔄 IN PROGRESS
`dump_metric` only works if callers use it; currently only `run_eval.py` calls it. The fragmented eval/bench systems must be consolidated first: `bench_serving.py` + `NightlyBenchmarkRunner` + `benchmark/` scripts converge into a unified runner, tracked by [Refactor] Benchmark Scripts Refactor #10177.
### Phase 2: Wire `dump_metric` Into All Paths
Once entry points are unified, add `dump_metric()` at each one — accuracy (score, latency) and performance (throughput, TTFT, ITL, E2E latency).
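A sketch of what that wiring could look like at a unified bench entry point; the helper name and result fields are illustrative assumptions about the unified runner's output, not a real interface:

```python
# Illustrative wiring sketch; the helper name and result fields are
# assumptions about the unified runner's output, not a real interface.
from sglang.test.ci_metrics import dump_metric  # assumed path, as above

def report_serving_metrics(result) -> None:
    """Emit one dump_metric() call per tracked performance metric."""
    dump_metric("output_throughput_tok_s", result.output_throughput)
    dump_metric("ttft_ms_median", result.median_ttft_ms)
    dump_metric("itl_ms_median", result.median_itl_ms)
    dump_metric("e2e_latency_ms_median", result.median_e2e_latency_ms)

# Called once at the end of the unified entry point, so every code path
# that produces numbers also persists them.
```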
### Phase 3: Durable Metrics Storage
Metrics must persist across CI runs for baseline comparison. An S3 upload has been attempted (#21057, blocked on AWS credentials). Still to decide: S3 vs GH Artifacts vs a git-based store vs the dashboard DB.
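Whichever backend is chosen, Phase 4 only needs an append-only record per nightly run, keyed by commit, so a baseline can be computed over the last N records. A minimal JSONL sketch (the schema is an assumption, not a decided format):

```python
# Minimal storage sketch: append-only JSONL, one record per nightly run.
# The schema is an assumption, not a decided format.
import json
import time

def append_run_record(path: str, commit: str, metrics: dict) -> None:
    """Append one nightly run's metrics, keyed by commit and timestamp."""
    record = {"commit": commit, "timestamp": int(time.time()), "metrics": metrics}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def load_recent_runs(path: str, n: int) -> list[dict]:
    """Return the last n run records for baseline computation."""
    with open(path) as f:
        records = [json.loads(line) for line in f if line.strip()]
    return records[-n:]
```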
### Phase 4: Regression Detection
An automated `current_metric >= baseline * (1 - tolerance)` check in CI; a sketch follows the list below.
Key decisions TBD:
- Baseline selection (rolling median of last N nightly runs on main?)
- Per-metric tolerance (accuracy ±2%, throughput ±10%, latency ±15%)
- CI action (block merge vs warning comment)
- Roll out on nightly first, then expand to PR CI
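A sketch of how these pieces could combine, assuming the JSONL records from the Phase 3 sketch above and the candidate tolerances from the list; none of these numbers or metric names are final:

```python
# Regression-check sketch, assuming the JSONL records from the Phase 3
# sketch above. Tolerances mirror the candidate values in the list; none
# of these numbers or metric names are final.
from statistics import median

TOLERANCES = {
    "gsm8k_accuracy": 0.02,           # accuracy ±2%
    "output_throughput_tok_s": 0.10,  # throughput ±10%
}

def rolling_baseline(runs: list[dict], metric: str, n: int = 7) -> float:
    """Rolling median of a metric over the last n nightly runs on main."""
    values = [r["metrics"][metric] for r in runs[-n:] if metric in r["metrics"]]
    return median(values)

def passes(current: float, baseline: float, tolerance: float) -> bool:
    """current_metric >= baseline * (1 - tolerance), per the check above."""
    return current >= baseline * (1 - tolerance)
```

As written, the inequality assumes higher-is-better metrics; latency metrics (TTFT, ITL, E2E) need the inverted check, `current <= baseline * (1 + tolerance)`.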
## Related
| # | Title | Phase |
|---|-------|-------|
| #16064 | `dump_metric()` core | 0 ✅ |
| #17696 | Nightly metrics → GH Artifacts | 0 ✅ |
| #17725 | Performance dashboard | 0 ✅ |
| #21046 | Consolidate eval systems | 1 |
| #10177 | Benchmark scripts refactor | 1 |
| #16579 | CI metrics for stage-b jobs | 2 |
| #21057 | S3 upload for nightly metrics | 3 |