
[CI Infrastructure] Unified output format for regression check #15996

@harvenstar


Motivation

SGLang's CI system currently lacks systematic performance regression detection. Tests either pass or fail against fixed thresholds, so gradual performance degradation often goes unnoticed until it accumulates into a significant problem. Additionally, some critical tests exhibit flakiness that is difficult to diagnose due to the lack of structured metrics collection.

Concrete examples of the problem:

  1. Flaky test_hicache_variants.py: Cache hit rate fluctuations cause intermittent failures

  2. Flaky test_standalone_speculative_decoding.py: Speculative acceptance rate varies across runs

These issues stem from a fundamental limitation: we cannot distinguish between real regressions and environmental noise without historical performance data.

Background

Current testing approach:

  • Metrics are printed to stdout in inconsistent formats (print vs logger)
  • Test assertions use fixed thresholds (e.g., "accuracy > 0.7")
  • No structured data for trend analysis or automated alerting
  • Difficult to correlate performance changes with code changes

This makes it hard to:

  • Detect gradual performance regressions that stay within thresholds
  • Diagnose root causes of flaky tests (code issue vs infrastructure noise)
  • Track performance improvements from optimizations
  • Make data-driven decisions about test threshold tuning

Proposed Solution

Introduce a standardized metrics collection system with:

  1. Unified output utility: `dump_metric()` function in `python/sglang/test/test_utils.py`
  2. Structured data format: JSONL for machine readability and streaming safety
  3. CI integration: GitHub Actions artifacts + summary tables for visibility

Core Design: dump_metric() Function

```python
from typing import Any


def dump_metric(name: str, value: Any):
    """
    Record a test metric in unified format.

    The function automatically captures context (filename, test case name) and:
    1. Appends to the JSONL file specified by the SGLANG_TEST_METRICS_OUTPUT env var
    2. Writes key metrics to GITHUB_STEP_SUMMARY for PR visibility (if in CI)
    3. Prints human-readable output to stdout

    Args:
        name: Metric identifier (e.g., "spec_decode_acceptance_rate")
        value: Measured value (supports int, float, str)
    """
```

Output format (JSONL, one JSON object per line):

{"filename": "test_standalone_speculative_decoding.py", "test_case": "test_gsm8k", "metric_name": "spec_decode_acceptance_rate", "value": 3.85}
{"filename": "test_hicache_variants.py", "test_case": "test_mmlu", "metric_name": "cache_hit_rate", "value": 0.95}
{"filename": "test_hicache_variants.py", "test_case": "test_mmlu", "metric_name": "mmlu_score", "value": 0.68}

Implementation Details (current plan, v1)

Schema (JSONL)

  • One JSON object per line:
    {"filename": ..., "test_case": ..., "metric_name": ..., "value": ...}
  • Optional fields (not required for v1; see the example after this list):
    • "ts": unix timestamp (float)
    • "labels": dict for low-cardinality tags (e.g., {"model": "...", "tp": 2})
  • "value" supports int/float/str in v1.

Context capture

  • filename: repo-relative path when possible, falling back to the basename.
  • test_case: prefer the pytest node id when available (e.g., via PYTEST_CURRENT_TEST), falling back to inspect.stack() to infer the calling function name (see the sketch after this list).
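A minimal sketch of that fallback chain, assuming pytest's PYTEST_CURRENT_TEST variable and a working directory at the repo root; the helper name is hypothetical:

```python
import inspect
import os


def _capture_context() -> tuple[str, str]:
    """Hypothetical helper: best-effort (filename, test_case) for the caller."""
    caller = inspect.stack()[1]
    try:
        # Relative to the current working directory, assumed to be the repo root in CI.
        filename = os.path.relpath(caller.filename)
    except ValueError:
        filename = os.path.basename(caller.filename)

    node_id = os.environ.get("PYTEST_CURRENT_TEST")
    if node_id:
        # e.g. "test_hicache_variants.py::test_mmlu (call)" -> "test_mmlu"
        test_case = node_id.split(" ")[0].split("::")[-1]
    else:
        # Fallback: name of the function that called dump_metric.
        test_case = caller.function
    return filename, test_case
```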

File I/O strategy

  • Controlled by env var SGLANG_TEST_METRICS_OUTPUT.
    • If set: append JSONL lines to file(s).
    • If not set: stdout only.
  • Robustness: dump_metric must never fail the test. All I/O is wrapped in try/except and degrades gracefully.
  • Concurrency: to avoid multi-process write interleaving, v1 writes per-worker or per-process output (sketched after this list):
    • Example: ${SGLANG_TEST_METRICS_OUTPUT}.${PYTEST_XDIST_WORKER-or-pid}.jsonl
    • CI step merges them into a single JSONL for upload.
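The write path could then look like the following sketch; the helper name is hypothetical and the suffix scheme simply mirrors the example above:

```python
import json
import os


def _append_metric(record: dict) -> None:
    """Hypothetical helper: append one JSONL record, never raising into the test."""
    base = os.environ.get("SGLANG_TEST_METRICS_OUTPUT")
    if not base:
        return  # stdout-only mode when the env var is unset
    # One file per xdist worker (or per process) to avoid interleaved writes.
    suffix = os.environ.get("PYTEST_XDIST_WORKER", str(os.getpid()))
    path = f"{base}.{suffix}.jsonl"
    try:
        with open(path, "a") as f:
            f.write(json.dumps(record) + "\n")
    except OSError:
        pass  # metrics collection must never fail the test
```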

GitHub Actions integration

  • Metrics files are uploaded as artifacts with if: always() so we can inspect metrics even on failures.
  • CI will merge per-worker/per-process JSONL files into a single test_metrics.jsonl before upload (see the merge sketch after this list).
  • Artifact naming (tentative): regression-metrics-${{ github.run_id }}-${{ github.job }}.
  • GitHub summary:
    • v1 uses a small allowlist of "key metrics" per test suite.
    • dump_metric writes these allowlisted metrics to GITHUB_STEP_SUMMARY for PR visibility.
    • For v1, the allowlist will be defined in python/sglang/test/test_utils.py (a hardcoded dict keyed by filename or suite) to minimize call-site changes; see the sketch after this list.
    • Non-allowlisted metrics still go to the JSONL artifacts, not the summary, to avoid noise.
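The v1 allowlist could be as simple as a hardcoded dict in test_utils.py; the metric names below are the ones from the examples above, and the dict name is an assumption:

```python
# Hypothetical v1 allowlist: only these metrics are mirrored to GITHUB_STEP_SUMMARY.
KEY_METRICS_ALLOWLIST = {
    "test_standalone_speculative_decoding.py": {"spec_decode_acceptance_rate"},
    "test_hicache_variants.py": {"cache_hit_rate", "mmlu_score"},
}
```

The merge step before upload could likewise be a few lines of Python run from the workflow; a sketch, assuming the per-worker suffix scheme above (the script location and invocation are not decided yet):

```python
import glob
import os


def merge_metrics(base_path: str, output_path: str = "test_metrics.jsonl") -> None:
    """Concatenate per-worker JSONL files into a single artifact for upload."""
    with open(output_path, "w") as out:
        for part in sorted(glob.glob(f"{base_path}.*.jsonl")):
            with open(part) as f:
                out.write(f.read())


if __name__ == "__main__":
    merge_metrics(os.environ["SGLANG_TEST_METRICS_OUTPUT"])  # set by the CI workflow
```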

Target Test Files

1. Speculative Decoding

2. HiCache Variants

Note: Vision chunked prefill tests have been fixed by @lianmin and are not included in this migration.

Expected Benefits

  1. Regression detection: Historical data enables automated detection of performance degradation trends
  2. Flakiness diagnosis: Rich metrics help distinguish code issues from environmental variance
  3. CI transparency: Key metrics visible in GitHub UI without downloading artifacts
  4. Foundation for monitoring: Sets up infrastructure for future automated alerting

Implementation Roadmap

  1. Implement `dump_metric()` in `test_utils.py` with unit tests
  2. Update CI workflow to set `SGLANG_TEST_METRICS_OUTPUT` and upload artifacts
  3. Migrate `test_standalone_speculative_decoding.py`
  4. Migrate `test_hicache_variants.py`
  5. (Future) Gradually extend to nightly tests and other test suites
