feat(ci): nightly E2E scorecard workflow with GitHub Actions summary #2613

@jyaunches

Summary

Add a scheduled GitHub Actions workflow that analyzes the previous night's E2E results and publishes a scorecard to the GitHub Actions job summary. This is the foundation for nightly CI health visibility and requires no external dependencies.

Parent tracking issue: #2612

Motivation

The nightly E2E suite runs 17–19 jobs multiple times overnight. The only notification today is notify-on-failure, which creates a GitHub issue on any failure but doesn't summarize trends, identify repeat offenders, or report when things are healthy. The team has to manually inspect runs to understand CI health.

Design

Workflow: .github/workflows/nightly-scorecard.yaml

Triggers:

  • schedule: cron at 12:00 UTC daily (8am EDT / 7am EST; GitHub cron schedules run in UTC and do not follow daylight saving time)
  • workflow_dispatch for manual testing

Steps:

  1. Use gh api to fetch the last 24h of nightly-e2e.yaml workflow runs
  2. For each completed (non-cancelled) run, fetch job-level pass/fail results
  3. Compute scorecard metrics:
    • Total runs (success / failure / cancelled)
    • Perfect runs count (every job passed)
    • Per-job failure frequency (identify top flaky jobs — failed 2+ times)
    • Trend vs previous day (improving / degrading / stable)
    • Best run of the night (highest pass rate)
  4. Write formatted scorecard to $GITHUB_STEP_SUMMARY
  5. If SLACK_WEBHOOK_URL secret exists, post to Slack (graceful no-op if missing)
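
A minimal Python sketch of step 3, assuming the run and job data has already been fetched and reduced to plain dictionaries. The names `compute_scorecard` and `trend`, the input shape (`conclusion` plus a `jobs` mapping of job name to conclusion), and the choice to count any non-perfect, non-cancelled run as a failure are illustrative assumptions, not part of this issue.

```python
from collections import Counter

# Jobs left out of pass/fail counts, per the implementation notes.
EXCLUDED_JOBS = {"gpu-e2e", "notify-on-failure"}


def compute_scorecard(runs, flaky_threshold=2):
    """Summarize one night of runs.

    Each run is a dict: {"conclusion": str, "jobs": {job_name: conclusion}}.
    """
    cancelled = 0
    perfect = 0
    failures = 0
    job_failures = Counter()
    best_pass_rate = None

    for run in runs:
        if run["conclusion"] == "cancelled":
            cancelled += 1
            continue
        jobs = {n: c for n, c in run["jobs"].items() if n not in EXCLUDED_JOBS}
        if not jobs:
            continue  # nothing to score for this run
        job_failures.update(n for n, c in jobs.items() if c == "failure")
        passed = sum(1 for c in jobs.values() if c == "success")
        rate = passed / len(jobs)
        if best_pass_rate is None or rate > best_pass_rate:
            best_pass_rate = rate
        if passed == len(jobs):
            perfect += 1
        else:
            failures += 1

    flaky = [
        (name, count)
        for name, count in job_failures.most_common()
        if count >= flaky_threshold
    ]
    return {
        "total": len(runs),
        "perfect": perfect,
        "failures": failures,
        "cancelled": cancelled,
        "flaky": flaky,
        "best_pass_rate": best_pass_rate,
    }


def trend(today_perfect, yesterday_perfect):
    """Compare perfect-run counts between two nights."""
    if today_perfect > yesterday_perfect:
        return "improving"
    if today_perfect < yesterday_perfect:
        return "degrading"
    return "stable"
```

An empty list of runs yields zero counts and a `best_pass_rate` of `None`, which keeps the "no nightly runs exist" case from raising.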

Scorecard Format

🌅 NemoClaw Nightly Scorecard — Apr 28

Overnight runs: 24 completed
  ✅ 4 perfect (18/18)
  ❌ 12 failures
  ⊘  8 cancelled

Top flaky jobs (failed 2+ times):
  sandbox-operations-e2e    — 4 failures
  cloud-e2e                 — 3 failures
  issue-2478-crash-loop     — 3 failures

Trend: ↗️ Improving (yesterday: 1 perfect → today: 4 perfect)

🔗 https://github.com/NVIDIA/NemoClaw/actions/workflows/nightly-e2e.yaml
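
One way to publish text like the scorecard above, sketched in Python. On a GitHub Actions runner, `GITHUB_STEP_SUMMARY` holds the path of a file whose Markdown content becomes the job summary. The `write_summary` helper and the decision to wrap the text in a code fence (so the flaky-job columns stay aligned) are assumptions for illustration.

```python
import os


def write_summary(text, env=None):
    """Append text to the job summary file.

    Returns False when GITHUB_STEP_SUMMARY is not set (e.g. a local run).
    """
    env = os.environ if env is None else env
    path = env.get("GITHUB_STEP_SUMMARY")
    if not path:
        return False
    fence = "`" * 3  # Markdown code fence keeps column alignment intact
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(f"{fence}\n{text.rstrip()}\n{fence}\n")
    return True
```

Appending rather than overwriting leaves any summary content written by earlier steps in place.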

Implementation Notes

  • Use actions/github-script with the Octokit API, or bash + gh api — whichever is simpler
  • Fetch runs with: GET /repos/{owner}/{repo}/actions/workflows/nightly-e2e.yaml/runs?created=>={yesterday} (the >= is part of the created value, so URL-encode it when building the query string by hand)
  • Fetch jobs per run with: GET /repos/{owner}/{repo}/actions/runs/{run_id}/jobs
  • Exclude gpu-e2e (always skipped) and notify-on-failure (meta job) from pass/fail counts
  • For trend comparison, fetch the previous day's runs with the same API
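
A hedged Python sketch of how the two query windows (last 24h and the 24h before that) might be built for the `created` filter. The helper names `run_windows` and `runs_path`, the use of full UTC timestamps rather than dates, and the `per_page=100` setting are assumptions; the `>=` and `start..end` forms follow GitHub's date search syntax.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

WORKFLOW_FILE = "nightly-e2e.yaml"


def _stamp(dt):
    """Format a datetime as an ISO 8601 UTC timestamp."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def run_windows(now):
    """Return (today, yesterday) values for the `created` query parameter."""
    day = timedelta(hours=24)
    today = f">={_stamp(now - day)}"
    yesterday = f"{_stamp(now - 2 * day)}..{_stamp(now - day)}"
    return today, yesterday


def runs_path(owner, repo, created):
    """Build the API path for listing workflow runs, with the filter URL-encoded."""
    query = urlencode({"created": created, "per_page": 100})
    return f"/repos/{owner}/{repo}/actions/workflows/{WORKFLOW_FILE}/runs?{query}"
```

Calling `run_windows(datetime.now(timezone.utc))` gives both filters from a single clock reading, so the two windows share a boundary and leave no gap between them.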

Acceptance Criteria

  • nightly-scorecard.yaml runs daily at 12:00 UTC via cron
  • workflow_dispatch trigger works for manual testing
  • Scorecard appears in GitHub Actions job summary
  • Scorecard includes: run counts, perfect run count, top flaky jobs (2+ failures), trend vs yesterday
  • Workflow succeeds even when no nightly runs exist (e.g., first run)
  • gpu-e2e and notify-on-failure excluded from job counts
  • Includes Slack webhook hook point (conditional post if SLACK_WEBHOOK_URL exists, graceful no-op otherwise)
  • No secrets or API keys hardcoded
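
The Slack hook point could look roughly like this in Python. It relies on two facts: an unset secret expands to an empty string in a workflow expression, and Slack incoming webhooks accept a JSON body with a `text` field. The `post_to_slack` name, the injectable `send` parameter, and the 10-second timeout are illustrative assumptions.

```python
import json
import os
import urllib.request


def post_to_slack(text, env=None, send=urllib.request.urlopen):
    """Post text to Slack if SLACK_WEBHOOK_URL is set; otherwise do nothing.

    Returns True on a 2xx response, False when the webhook is not configured.
    """
    env = os.environ if env is None else env
    url = env.get("SLACK_WEBHOOK_URL", "").strip()
    if not url:
        return False  # secret missing or empty: graceful no-op
    request = urllib.request.Request(
        url,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with send(request, timeout=10) as response:
        return 200 <= response.status < 300
```

Reading the URL from the environment at call time keeps the webhook out of the source, in line with the "no secrets hardcoded" criterion.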

Labels

  • area: ci (CI workflows, checks, release automation, or GitHub Actions)
  • area: e2e (End-to-end tests, nightly failures, or validation infrastructure)
