feat: enable unified metrics collection in nightly CI with S3 upload #21057
Open
alisonshao wants to merge 1 commit into main
- Set SGLANG_TEST_METRICS_OUTPUT env var in 6 nightly jobs (text
accuracy/perf, VLM accuracy/perf, 8-GPU H200, 8-GPU B200) so
dump_metric() writes JSONL files during test execution
- Upload per-job JSONL metrics as artifacts
- Consolidate all JSONL metrics in the consolidate-metrics job
- Add S3 upload script (scripts/ci/utils/upload_metrics_to_s3.py)
that uploads consolidated JSON + JSONL to
s3://rdxa-eng-json-logs-05354378/ci-metrics/nightly/{date}/{run_id}/
- S3 upload gracefully no-ops when AWS secrets are not configured
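
The consolidation and upload-path steps listed above could look roughly like the sketch below. It is an illustration only, not the code in this PR: `consolidate_jsonl` and `s3_prefix` are hypothetical helper names, and the behavior of skipping blank or malformed lines is an assumption rather than something the PR states.

```python
import json
from pathlib import Path


def consolidate_jsonl(artifact_dir: Path, output_file: Path) -> int:
    """Concatenate every *.jsonl file under artifact_dir into output_file.

    Hypothetical helper. Returns the number of records written.
    Blank lines and lines that are not valid JSON are skipped (assumption).
    """
    count = 0
    output_file.parent.mkdir(parents=True, exist_ok=True)
    with output_file.open("w", encoding="utf-8") as out:
        # sorted() gives a deterministic order across per-job artifact folders
        for path in sorted(artifact_dir.rglob("*.jsonl")):
            # Do not read the output file back into itself
            if path.resolve() == output_file.resolve():
                continue
            for line in path.read_text(encoding="utf-8").splitlines():
                line = line.strip()
                if not line:
                    continue
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    continue
                out.write(json.dumps(record) + "\n")
                count += 1
    return count


def s3_prefix(date: str, run_id: str) -> str:
    """Build the key prefix described in the PR: ci-metrics/nightly/{date}/{run_id}/."""
    return f"ci-metrics/nightly/{date}/{run_id}/"
```

Re-serializing each record with `json.dumps` keeps the consolidated file at one JSON object per line even if a job wrote trailing whitespace.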
Waiting on IT to provision long-lived AWS credentials and add
CI_METRICS_AWS_ACCESS_KEY_ID / CI_METRICS_AWS_SECRET_ACCESS_KEY
as GitHub repo secrets. Until then, metrics are still collected
as GitHub artifacts (90-day retention).
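
A minimal sketch of the "gracefully no-ops" behavior, assuming the two secrets reach the script as environment variables with the same names. That mapping is an assumption, and `upload_metrics` is a hypothetical function, not the actual contents of `upload_metrics_to_s3.py`. The `boto3` import is deferred so the no-op path works even where `boto3` is not installed.

```python
import os
from typing import Iterable, Mapping, Optional


def upload_metrics(
    files: Iterable[str],
    bucket: str,
    prefix: str,
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Upload files to S3, or skip quietly when credentials are missing.

    Hypothetical sketch. Returns True if an upload was attempted,
    False if the function no-oped because secrets were not configured.
    """
    env = os.environ if env is None else env
    key_id = env.get("CI_METRICS_AWS_ACCESS_KEY_ID")
    secret = env.get("CI_METRICS_AWS_SECRET_ACCESS_KEY")
    if not key_id or not secret:
        # Unset secrets typically arrive as empty strings in CI, so test for falsy values
        print("AWS credentials not configured; skipping S3 upload")
        return False

    import boto3  # lazy import: only needed when credentials exist

    client = boto3.client(
        "s3",
        aws_access_key_id=key_id,
        aws_secret_access_key=secret,
    )
    for path in files:
        # Object key = prefix + file name, e.g. ci-metrics/nightly/{date}/{run_id}/test-metrics.jsonl
        client.upload_file(str(path), bucket, prefix + os.path.basename(path))
    return True
```

Returning `False` instead of raising lets the workflow step succeed while the secrets are still pending, so the artifact-based collection is unaffected.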
Summary
- Enable dump_metric() JSONL output in 6 nightly jobs by setting the SGLANG_TEST_METRICS_OUTPUT env var
- Consolidate all JSONL metrics in the consolidate-metrics job
- Upload consolidated JSON + JSONL to s3://rdxa-eng-json-logs-05354378/ci-metrics/nightly/{date}/{run_id}/

How it works
The existing dump_metric() function (PR #16064) already writes JSONL when SGLANG_TEST_METRICS_OUTPUT is set, and run_eval.py already calls it for eval scores/latency. This PR wires it into the nightly workflow:
- Each affected nightly job sets SGLANG_TEST_METRICS_OUTPUT=test-metrics and uploads the resulting JSONL files as artifacts
- The consolidate-metrics job downloads all artifacts, concatenates JSONL files, and uploads everything to S3
- S3 layout: ci-metrics/nightly/{date}/{run_id}/ containing consolidated-metrics.json, test-metrics.jsonl, and run-metadata.json

Blocked on
Waiting on IT to provision long-lived AWS credentials, with CI_METRICS_AWS_ACCESS_KEY_ID and CI_METRICS_AWS_SECRET_ACCESS_KEY added as GitHub repo secrets. Until then, metrics are collected as GitHub artifacts (90-day retention) but not uploaded to S3.

Jobs affected
- nightly-test-text-accuracy-2-gpu-h100
- nightly-test-text-perf-2-gpu-h100
- nightly-test-vlm-accuracy-2-gpu-h100
- nightly-test-vlm-perf-2-gpu-h100
- nightly-test-general-8-gpu-h200
- nightly-test-general-8-gpu-b200
- consolidate-metrics

Test plan