
[CI Infrastructure] Unified output format for regression check #15996

@harvenstar


Motivation

SGLang's CI system currently lacks systematic performance regression detection. Tests either pass or fail against fixed thresholds, so gradual performance degradation often goes unnoticed until it accumulates into a significant problem. Additionally, some critical tests exhibit flakiness that is difficult to diagnose due to the lack of structured metrics collection.

Concrete examples of the problem:

  1. Flaky test_hicache_variants.py: Cache hit rate fluctuations cause intermittent failures

  2. Flaky test_standalone_speculative_decoding.py: Speculative acceptance rate varies across runs

These issues stem from a fundamental limitation: we cannot distinguish between real regressions and environmental noise without historical performance data.

Background

Current testing approach:

  • Metrics are printed to stdout in inconsistent formats (print vs logger)
  • Test assertions use fixed thresholds (e.g., "accuracy > 0.7")
  • No structured data for trend analysis or automated alerting
  • Difficult to correlate performance changes with code changes

This makes it hard to:

  • Detect gradual performance regressions that stay within thresholds
  • Diagnose root causes of flaky tests (code issue vs infrastructure noise)
  • Track performance improvements from optimizations
  • Make data-driven decisions about test threshold tuning

Proposed Solution

Introduce a standardized metrics collection system with:

  1. Unified output utility: `dump_metric()` function in `python/sglang/test/test_utils.py`
  2. Structured data format: JSONL for machine readability and streaming safety
  3. CI integration: GitHub Actions artifacts + summary tables for visibility

Core Design: dump_metric() Function

```python
from typing import Any


def dump_metric(name: str, value: Any):
    """
    Record a test metric in unified format.

    The function automatically captures context (filename, test case name) and:
    1. Appends to the JSONL file specified by the SGLANG_TEST_METRICS_OUTPUT env var
    2. Writes key metrics to GITHUB_STEP_SUMMARY for PR visibility (if in CI)
    3. Prints human-readable output to stdout

    Args:
        name: Metric identifier (e.g., "spec_decode_acceptance_rate")
        value: Measured value (supports int, float, str)
    """
```

Output format (JSONL, one JSON object per line):

{"filename": "test_standalone_speculative_decoding.py", "test_case": "test_gsm8k", "metric_name": "spec_decode_acceptance_rate", "value": 3.85}
{"filename": "test_hicache_variants.py", "test_case": "test_mmlu", "metric_name": "cache_hit_rate", "value": 0.95}
{"filename": "test_hicache_variants.py", "test_case": "test_mmlu", "metric_name": "mmlu_score", "value": 0.68}

Implementation Details (current plan, v1)

Schema (JSONL)

  • One JSON object per line:
    {"filename": ..., "test_case": ..., "metric_name": ..., "value": ...}
  • Optional fields (not required for v1; see the example after this list):
    • "ts": unix timestamp (float)
    • "labels": dict for low-cardinality tags (e.g., {"model": "...", "tp": 2})
  • "value" supports int/float/str in v1.

Context capture

  • filename: repo-relative path when possible, falling back to the basename.
  • test_case: prefer the pytest node id when available (e.g., via PYTEST_CURRENT_TEST), falling back to inspect.stack() to infer the calling function name (see the sketch after this list).
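A minimal sketch of that fallback chain, assuming pytest's PYTEST_CURRENT_TEST variable and a working directory at the repo root; the helper name is hypothetical:

```python
import inspect
import os


def _capture_context() -> tuple[str, str]:
    """Hypothetical helper: best-effort (filename, test_case) for the caller."""
    caller = inspect.stack()[1]
    try:
        # Relative to the current working directory, assumed to be the repo root in CI.
        filename = os.path.relpath(caller.filename)
    except ValueError:
        filename = os.path.basename(caller.filename)

    node_id = os.environ.get("PYTEST_CURRENT_TEST")
    if node_id:
        # e.g. "test_hicache_variants.py::test_mmlu (call)" -> "test_mmlu"
        test_case = node_id.split(" ")[0].split("::")[-1]
    else:
        # Fallback: name of the function that called dump_metric.
        test_case = caller.function
    return filename, test_case
```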

File I/O strategy

  • Controlled by env var SGLANG_TEST_METRICS_OUTPUT.
    • If set: append JSONL lines to file(s).
    • If not set: stdout only.
  • Robustness: dump_metric must never fail the test. All I/O is wrapped in try/except and degrades gracefully.
  • Concurrency: to avoid multi-process write interleaving, v1 writes per-worker or per-process output (sketched after this list):
    • Example: ${SGLANG_TEST_METRICS_OUTPUT}.${PYTEST_XDIST_WORKER-or-pid}.jsonl
    • CI step merges them into a single JSONL for upload.
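The write path could then look like the following sketch; the helper name is hypothetical and the suffix scheme simply mirrors the example above:

```python
import json
import os


def _append_metric(record: dict) -> None:
    """Hypothetical helper: append one JSONL record, never raising into the test."""
    base = os.environ.get("SGLANG_TEST_METRICS_OUTPUT")
    if not base:
        return  # stdout-only mode when the env var is unset
    # One file per xdist worker (or per process) to avoid interleaved writes.
    suffix = os.environ.get("PYTEST_XDIST_WORKER", str(os.getpid()))
    path = f"{base}.{suffix}.jsonl"
    try:
        with open(path, "a") as f:
            f.write(json.dumps(record) + "\n")
    except OSError:
        pass  # metrics collection must never fail the test
```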

GitHub Actions integration

  • Metrics files are uploaded as artifacts with if: always() so we can inspect metrics even on failures.
  • CI will merge per-worker/per-process JSONL files into a single test_metrics.jsonl before upload (see the merge sketch after this list).
  • Artifact naming (tentative): regression-metrics-${{ github.run_id }}-${{ github.job }}.
  • GitHub summary:
    • v1 uses a small allowlist of "key metrics" per test suite.
    • dump_metric writes these allowlisted metrics to GITHUB_STEP_SUMMARY for PR visibility.
    • For v1, the allowlist will be defined in python/sglang/test/test_utils.py (a hardcoded dict keyed by filename or suite) to minimize call-site changes; see the sketch after this list.
    • Non-allowlisted metrics still go to the JSONL artifacts, not the summary, to avoid noise.
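The v1 allowlist could be as simple as a hardcoded dict in test_utils.py; the metric names below are the ones from the examples above, and the dict name is an assumption:

```python
# Hypothetical v1 allowlist: only these metrics are mirrored to GITHUB_STEP_SUMMARY.
KEY_METRICS_ALLOWLIST = {
    "test_standalone_speculative_decoding.py": {"spec_decode_acceptance_rate"},
    "test_hicache_variants.py": {"cache_hit_rate", "mmlu_score"},
}
```

The merge step before upload could likewise be a few lines of Python run from the workflow; a sketch, assuming the per-worker suffix scheme above (the script location and invocation are not decided yet):

```python
import glob
import os


def merge_metrics(base_path: str, output_path: str = "test_metrics.jsonl") -> None:
    """Concatenate per-worker JSONL files into a single artifact for upload."""
    with open(output_path, "w") as out:
        for part in sorted(glob.glob(f"{base_path}.*.jsonl")):
            with open(part) as f:
                out.write(f.read())


if __name__ == "__main__":
    merge_metrics(os.environ["SGLANG_TEST_METRICS_OUTPUT"])  # set by the CI workflow
```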

Target Test Files

1. Speculative Decoding

2. HiCache Variants

Note: Vision chunked prefill tests have been fixed by @lianmin and are not included in this migration.

Expected Benefits

  1. Regression detection: Historical data enables automated detection of performance degradation trends
  2. Flakiness diagnosis: Rich metrics help distinguish code issues from environmental variance
  3. CI transparency: Key metrics visible in GitHub UI without downloading artifacts
  4. Foundation for monitoring: Sets up infrastructure for future automated alerting

Implementation Roadmap

  1. Implement `dump_metric()` in `test_utils.py` with unit tests
  2. Update CI workflow to set `SGLANG_TEST_METRICS_OUTPUT` and upload artifacts
  3. Migrate `test_standalone_speculative_decoding.py`
  4. Migrate `test_hicache_variants.py`
  5. (Future) Gradually extend to nightly tests and other test suites
