Summary
Add a scheduled GitHub Actions workflow that analyzes overnight nightly E2E results and publishes a scorecard to the GitHub Actions job summary. This is the foundation for nightly CI health visibility — no external dependencies required.
Parent tracking issue: #2612
Motivation
The nightly E2E suite runs 17–19 jobs multiple times overnight. The only notification today is notify-on-failure, which creates a GitHub issue on any failure but doesn't summarize trends, identify repeat offenders, or report when things are healthy. The team has to manually inspect runs to understand CI health.
Design
Workflow: .github/workflows/nightly-scorecard.yaml
Triggers:
- schedule: cron at 12:00 UTC daily (8am EDT, 7am EST; GitHub cron schedules run in UTC and do not follow daylight saving time)
- workflow_dispatch for manual testing
Steps:
- Use gh api to fetch the last 24h of nightly-e2e.yaml workflow runs
- For each completed (non-cancelled) run, fetch job-level pass/fail results
- Compute scorecard metrics:
  - Total runs (success / failure / cancelled)
  - Perfect runs count (every job passed)
  - Per-job failure frequency (top flaky jobs are those that failed 2+ times)
  - Trend vs previous day (improving / degrading / stable)
  - Best run of the night (highest pass rate)
- Write formatted scorecard to $GITHUB_STEP_SUMMARY
- If SLACK_WEBHOOK_URL secret exists, post to Slack (graceful no-op if missing)
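The metric definitions in the steps above can be sketched as two pure functions. This is a minimal sketch under stated assumptions, not the final implementation: the input shape (runs with a nested jobs array), the names computeScorecard and trend, and basing the trend on perfect-run counts (as the sample scorecard does) are all illustrative choices. Only jobs whose conclusion is exactly "failure" count toward the flaky list here.

```javascript
// Jobs left out of pass/fail counts, per the implementation notes.
const EXCLUDED_JOBS = new Set(["gpu-e2e", "notify-on-failure"]);

// runs: [{ id, conclusion, jobs: [{ name, conclusion }] }]
// This input shape is an assumption; the field names mirror the
// `name` and `conclusion` fields of the GitHub REST API.
function computeScorecard(runs) {
  const totals = { success: 0, failure: 0, cancelled: 0 };
  const failuresByJob = new Map();
  let perfect = 0;
  let best = null;

  for (const run of runs) {
    if (Object.hasOwn(totals, run.conclusion)) totals[run.conclusion] += 1;
    if (run.conclusion === "cancelled") continue; // no job-level analysis

    const jobs = run.jobs.filter((j) => !EXCLUDED_JOBS.has(j.name));
    const passed = jobs.filter((j) => j.conclusion === "success").length;
    for (const j of jobs) {
      if (j.conclusion === "failure") {
        failuresByJob.set(j.name, (failuresByJob.get(j.name) || 0) + 1);
      }
    }
    // A perfect run is one where every counted job succeeded.
    if (jobs.length > 0 && passed === jobs.length) perfect += 1;

    // Track the run with the highest pass rate (first one wins ties).
    const passRate = jobs.length ? passed / jobs.length : 0;
    if (best === null || passRate > best.passRate) {
      best = { id: run.id, passed, total: jobs.length, passRate };
    }
  }

  // Top flaky jobs: failed in 2+ runs, most failures first.
  const flaky = [...failuresByJob]
    .filter(([, count]) => count >= 2)
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .map(([name, failures]) => ({ name, failures }));

  return { totals, perfect, flaky, best };
}

// Assumed trend rule: compare perfect-run counts day over day.
function trend(todayPerfect, yesterdayPerfect) {
  if (todayPerfect > yesterdayPerfect) return "improving";
  if (todayPerfect < yesterdayPerfect) return "degrading";
  return "stable";
}
```

Keeping the computation free of API calls means it can be exercised with fixture data before the workflow is wired up.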
Scorecard Format
🌅 NemoClaw Nightly Scorecard — Apr 28
Overnight runs: 24 completed
✅ 4 perfect (18/18)
❌ 12 failures
⊘ 8 cancelled
Top flaky jobs (failed 2+ times):
sandbox-operations-e2e — 4 failures
cloud-e2e — 3 failures
issue-2478-crash-loop — 3 failures
Trend: ↗️ Improving (yesterday: 1 perfect → today: 4 perfect)
🔗 https://github.com/NVIDIA/NemoClaw/actions/workflows/nightly-e2e.yaml
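Once a scorecard like the one above has been rendered to a string, publishing it might look like the following. This is a sketch with assumed names (writeStepSummary, postToSlack). It also assumes the workflow maps the SLACK_WEBHOOK_URL secret into the step's environment, since secrets are not exposed to steps automatically, and that the webhook is a Slack incoming webhook accepting a JSON body with a text field.

```javascript
const fs = require("node:fs");

// Appends markdown to the job summary file. GitHub Actions sets
// GITHUB_STEP_SUMMARY to that file's path for each step.
function writeStepSummary(markdown, env = process.env) {
  if (!env.GITHUB_STEP_SUMMARY) return false; // e.g. running locally
  fs.appendFileSync(env.GITHUB_STEP_SUMMARY, markdown + "\n");
  return true;
}

// Posts to Slack only when a webhook URL is configured; otherwise
// it is a graceful no-op. `fetchImpl` is injectable for testing.
async function postToSlack(text, env = process.env, fetchImpl = globalThis.fetch) {
  const url = env.SLACK_WEBHOOK_URL;
  if (!url) return "skipped";
  const res = await fetchImpl(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text }),
  });
  if (!res.ok) throw new Error(`Slack webhook returned ${res.status}`);
  return "posted";
}
```

Whether a failed Slack post should fail the whole job is a design choice. This sketch throws, but catching the error and logging a warning would keep the job summary as the source of truth.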
Implementation Notes
- Use actions/github-script with the Octokit API, or bash + gh api, whichever is simpler
- Fetch runs with: GET /repos/{owner}/{repo}/actions/workflows/nightly-e2e.yaml/runs?created=>={yesterday}
- Fetch jobs per run with: GET /repos/{owner}/{repo}/actions/runs/{run_id}/jobs
- Exclude gpu-e2e (always skipped) and notify-on-failure (meta job) from pass/fail counts
- For trend comparison, fetch the previous day's runs with the same API
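If the actions/github-script route is chosen, the two fetches above could be sketched as follows. The helper names (utcDate, createdFilter, fetchRunsWithJobs) and the returned shape are assumptions. The github argument stands for the Octokit client that actions/github-script injects, and the bounded range used for the previous day is assumed to follow GitHub's search date syntax. Pagination is used because both endpoints return 30 items per page by default. Job exclusions are left to the counting step.

```javascript
const WORKFLOW_ID = "nightly-e2e.yaml";
const DAY_MS = 24 * 60 * 60 * 1000;

// UTC date (YYYY-MM-DD) for `daysAgo` days before `now`.
function utcDate(now, daysAgo) {
  return new Date(now.getTime() - daysAgo * DAY_MS).toISOString().slice(0, 10);
}

// Value for the `created` query parameter: open-ended (">=since")
// or a bounded range ("since..until") for the previous day.
function createdFilter(since, until) {
  return until ? `${since}..${until}` : `>=${since}`;
}

// Returns completed runs with their job-level results.
async function fetchRunsWithJobs(github, { owner, repo }, created) {
  const runs = await github.paginate(github.rest.actions.listWorkflowRuns, {
    owner,
    repo,
    workflow_id: WORKFLOW_ID,
    created,
    per_page: 100,
  });

  const result = [];
  for (const run of runs) {
    if (run.status !== "completed") continue; // still queued or running
    let jobs = [];
    if (run.conclusion !== "cancelled") {
      // Cancelled runs are counted but not analyzed job by job.
      jobs = await github.paginate(github.rest.actions.listJobsForWorkflowRun, {
        owner,
        repo,
        run_id: run.id,
        per_page: 100,
      });
    }
    result.push({
      id: run.id,
      conclusion: run.conclusion,
      jobs: jobs.map((j) => ({ name: j.name, conclusion: j.conclusion })),
    });
  }
  return result;
}
```

Inside a github-script step this might be called as fetchRunsWithJobs(github, context.repo, createdFilter(utcDate(new Date(), 1))), with a second call using a bounded range for the trend comparison.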
Acceptance Criteria
- nightly-scorecard.yaml runs daily at 12:00 UTC via cron
- workflow_dispatch trigger works for manual testing
- gpu-e2e and notify-on-failure excluded from job counts
- Posts to Slack (if SLACK_WEBHOOK_URL exists, graceful no-op otherwise)