
feat: enable unified metrics collection in nightly CI with S3 upload #21057

Open

alisonshao wants to merge 1 commit into main from feat/unified-metrics-s3

Conversation

@alisonshao
Collaborator

Summary

  • Enable dump_metric() JSONL output in 6 nightly jobs by setting the SGLANG_TEST_METRICS_OUTPUT env var
  • Collect JSONL metrics as artifacts and consolidate them in the consolidate-metrics job
  • Add S3 upload script that uploads both consolidated benchmark JSON and test metrics JSONL to s3://rdxa-eng-json-logs-05354378/ci-metrics/nightly/{date}/{run_id}/
  • S3 upload gracefully no-ops when AWS secrets are not configured
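The graceful no-op on missing secrets could look roughly like the following. This is a minimal sketch, not the actual scripts/ci/utils/upload_metrics_to_s3.py: the bucket name, key layout, and env-var names come from this description, while the function names and the use of boto3 are assumptions.

```python
import os
from datetime import datetime, timezone

# Bucket and key layout as described in this PR.
BUCKET = "rdxa-eng-json-logs-05354378"


def build_s3_prefix(run_id, date=None):
    """Build the ci-metrics/nightly/{date}/{run_id}/ key prefix."""
    date = date or datetime.now(timezone.utc).strftime("%Y-%m-%d")
    return f"ci-metrics/nightly/{date}/{run_id}/"


def upload_metrics(files, run_id):
    """Upload files to S3; no-op (return False) when AWS secrets are absent."""
    if not (
        os.environ.get("CI_METRICS_AWS_ACCESS_KEY_ID")
        and os.environ.get("CI_METRICS_AWS_SECRET_ACCESS_KEY")
    ):
        print("AWS secrets not configured; skipping S3 upload")
        return False
    import boto3  # imported lazily so the no-op path needs no AWS SDK installed

    s3 = boto3.client(
        "s3",
        aws_access_key_id=os.environ["CI_METRICS_AWS_ACCESS_KEY_ID"],
        aws_secret_access_key=os.environ["CI_METRICS_AWS_SECRET_ACCESS_KEY"],
    )
    prefix = build_s3_prefix(run_id)
    for path in files:
        s3.upload_file(path, BUCKET, prefix + os.path.basename(path))
    return True
```

Checking the secrets up front (rather than letting boto3 fail) is what makes forks and runs without the repo secrets pass cleanly.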

How it works

The existing dump_metric() function (PR #16064) already writes JSONL when SGLANG_TEST_METRICS_OUTPUT is set, and run_eval.py already calls it for eval scores/latency. This PR simply wires it into the nightly workflow:

  1. Each nightly eval/perf job sets SGLANG_TEST_METRICS_OUTPUT=test-metrics and uploads the resulting JSONL files as artifacts
  2. The consolidate-metrics job downloads all artifacts, concatenates JSONL files, and uploads everything to S3
  3. S3 path: ci-metrics/nightly/{date}/{run_id}/ containing consolidated-metrics.json, test-metrics.jsonl, and run-metadata.json

Blocked on

  • S3 credentials: Waiting on IT to provision long-lived AWS credentials. Need CI_METRICS_AWS_ACCESS_KEY_ID and CI_METRICS_AWS_SECRET_ACCESS_KEY added as GitHub repo secrets. Until then, metrics are collected as GitHub artifacts (90-day retention) but not uploaded to S3.

Jobs affected

  • nightly-test-text-accuracy-2-gpu-h100
  • nightly-test-text-perf-2-gpu-h100
  • nightly-test-vlm-accuracy-2-gpu-h100
  • nightly-test-vlm-perf-2-gpu-h100
  • nightly-test-general-8-gpu-h200
  • nightly-test-general-8-gpu-b200
  • consolidate-metrics

Test plan

  • Verify nightly run produces JSONL artifacts (even without S3 credentials)
  • Once AWS secrets are added, verify S3 upload works

