docs: add ADR-003 for scaling KWOK recipe tests by mchmarny · Pull Request #424 · NVIDIA/aicr

mchmarny · 2026-03-18T11:28:27Z

Summary

- Add ADR-003, which proposes a tiered testing strategy for KWOK recipe validation in CI
- The current 36-job matrix grows multiplicatively as services and accelerators are added (~150-200+ jobs projected)
- Proposes three tiers: a generic PR gate (fast), diff-aware accelerator tests (conditional), and the full matrix (merge/nightly)
- Estimated ~70% reduction in PR runner time while preserving full coverage on main
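
As a rough illustration of how the three tiers could map onto workflow triggers, here is a minimal sketch. Only the job names `test-tier1`, `test-tier2`, and `test-tier3` come from the ADR; the cron time, the runner, the choice to run Tier 2 on PRs only, and the placeholder steps are assumptions, and the real overlay discovery and matrix steps are omitted.

```yaml
# Sketch only: trigger-to-tier mapping, not the actual kwok-recipes.yaml.
on:
  pull_request:
  push:
    branches: [main]
  schedule:
    - cron: '0 6 * * *'          # assumed nightly time

jobs:
  test-tier1:                    # Tier 1: fast generic gate (PR + push to main)
    if: github.event_name != 'schedule'
    runs-on: ubuntu-latest
    steps:
      - run: echo "generic overlays"               # placeholder

  test-tier2:                    # Tier 2: diff-aware accelerator tests (assumed PR-only)
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - run: echo "overlays affected by the diff"  # placeholder

  test-tier3:                    # Tier 3: full matrix on merge to main and nightly
    if: github.event_name != 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - run: echo "all overlays"                   # placeholder
```

The point of the split is that only Tier 1 and the conditional Tier 2 sit on the PR critical path, while the expensive full matrix moves to events that do not block contributors.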

Test plan

- Review ADR with team for feedback on tiered approach
- No code changes (design document only)

yuanchen8911

From CodeX

High: ADR claims full coverage on every merge to main, but current concurrency model can cancel in-flight main runs during rapid successive merges. That means Tier 3 is not guaranteed per-merge unless concurrency is adjusted.

aicr/docs/design/003-scaling-recipe-tests.md, line 99 in 95e8507:

- Every merge to `main` (post-merge validation)

aicr/docs/design/003-scaling-recipe-tests.md, line 140 in 95e8507:

- **Full coverage is preserved.** Every overlay is tested on every merge to main

aicr/.github/workflows/kwok-recipes.yaml, line 46 in 95e8507:

concurrency:

Medium: "Only test-tier1 and test-tier2 are required checks" is operationally underspecified for matrix jobs. Matrix check names drift with overlay set changes, so branch protection can become brittle unless you define a stable aggregate required check.

aicr/docs/design/003-scaling-recipe-tests.md, line 129 in 95e8507:

Only `test-tier1` and `test-tier2` are required for PR merge. `test-tier3` is

aicr/docs/design/003-scaling-recipe-tests.md, line 114 in 95e8507:

test-tier1 (PR + push to main)

Low: Context text has minor mismatch with current workflow scope.

- It says PR runs trigger on `recipes/**`, `kwok/**`, or the workflow file, but current triggers also include `.github/actions/kwok-test/**`.
- It describes discovery as cloud-service criteria, but includes kind in the matrix.

aicr/docs/design/003-scaling-recipe-tests.md, line 11 in 95e8507:

with a cloud `service` criteria and creates one parallel GitHub Actions job per overlay.

aicr/docs/design/003-scaling-recipe-tests.md, line 19 in 95e8507:

This runs on every PR that touches `recipes/**`, `kwok/**`, or the workflow itself.

aicr/.github/workflows/kwok-recipes.yaml, line 25 in 95e8507:

- '.github/actions/kwok-test/**'
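
For reference, a path filter covering everything the Low item lists would look roughly like the sketch below. The four path patterns are the ones named in this review; the surrounding `on:` block (event type, absence of branch filters) is an assumption and may not match the real workflow.

```yaml
# Sketch of PR trigger paths; the event configuration around them is assumed.
on:
  pull_request:
    paths:
      - 'recipes/**'
      - 'kwok/**'
      - '.github/actions/kwok-test/**'         # the path missing from the ADR's context text
      - '.github/workflows/kwok-recipes.yaml'  # the workflow itself
```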

Open question:

Should Tier 3 be guaranteed per-main-merge (no cancellation), or is eventual coverage via nightly + latest-main acceptable? The ADR should state this explicitly.

- Tier 3 concurrency: use `cancel-in-progress: false` with a per-SHA concurrency group to guarantee full coverage on every merge
- Required checks: use a stable summary job instead of individual matrix job names to avoid branch protection brittleness
- Context fixes: include `.github/actions/kwok-test/**` in trigger paths, clarify that Kind is included in service discovery

mchmarny · 2026-03-18T18:55:42Z

Thanks for the thorough review — all three points are valid. Pushed updates in b225b49:

- High (Tier 3 concurrency): Added an explicit concurrency policy for Tier 3: `cancel-in-progress: false` with a per-SHA concurrency group, so main runs are never cancelled by subsequent merges. Nightly provides a backstop.
- Medium (required checks): Replaced the per-tier required checks with a single stable `KWOK Test Summary` aggregate job. Individual matrix job names are not required in branch protection, avoiding brittleness as overlays change.
- Low (context accuracy): Added `.github/actions/kwok-test/**` to the trigger path description. Clarified that discovery includes local environments (Kind) alongside cloud services.
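
A stable aggregate check of this kind can be sketched as a single job that depends on the tier jobs and fails if any of them did. The display name `KWOK Test Summary` and the tier job names come from this thread; the job id, the `needs` list, and the runner are assumptions, and this is a fragment that presumes the tier jobs are defined elsewhere in the same workflow.

```yaml
# Fragment: aggregate job only; test-tier1 and test-tier2 are defined elsewhere.
jobs:
  kwok-test-summary:                 # hypothetical job id
    name: KWOK Test Summary          # the one check name listed in branch protection
    if: always()                     # run even when an upstream tier fails
    needs: [test-tier1, test-tier2]  # assumed: the tiers that gate PRs
    runs-on: ubuntu-latest
    steps:
      - name: Fail if any tier failed or was cancelled
        if: contains(needs.*.result, 'failure') || contains(needs.*.result, 'cancelled')
        run: exit 1
```

The `if: always()` matters here: without it the summary job would be skipped when a dependency fails, and a skipped job does not block a merge the way a failed one does.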

Re: open question — the ADR now explicitly states Tier 3 uses cancel-in-progress: false to guarantee per-merge coverage, with nightly as a safety net for operational edge cases.
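
The concurrency policy described here could be expressed along the following lines. The `cancel-in-progress: false` setting and the per-SHA group for main are from this thread; keeping cancellation enabled for PR runs and the exact group naming are assumptions.

```yaml
# Sketch: workflow-level concurrency; PR behavior is an assumed convention.
concurrency:
  # PR runs share one group per ref, so a new push supersedes the old run.
  # Runs on main get a unique group per commit SHA, so merges never collide.
  group: ${{ github.workflow }}-${{ github.event_name == 'pull_request' && github.ref || github.sha }}
  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
```

The per-SHA group is doing real work: with a shared group, `cancel-in-progress: false` alone protects the running job, but a pending run in the same group is still replaced when a newer one is queued, so rapid merges could drop intermediate commits.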

yuanchen8911

/lgtm

xdu31

Scope explicitly to KWOK scheduling tests

The post-merge failure gap is low-risk for KWOK (no real hardware, fast to fix), but this tiered pattern should not be generalized to hardware-dependent validation (NCCL bandwidth, GPU operator health, etc.). KWOK simulates node topology but not GPU hardware, operator pods, or NCCL fabrics. Scoping this out avoids confusion about what "full coverage" means.

Nightly as release qualification gate

Nightly Tier 3 can double as qualification for release candidates. This separates recipe correctness (KWOK — will it schedule?) from runtime correctness (real clusters — does it work?). The release strategy becomes: PR runs Tier 1 + Tier 2, merge runs full KWOK matrix, nightly adds real-cluster validators, and only SHAs where nightly passed are promoted as release candidates.

Equivalence-class grouping deserves a follow-up

At 200+ overlays, even Tier 1 could reach 30-40 jobs. Tracking topology fingerprinting as a future ADR is worthwhile — many overlays differ only in values but schedule identically.

mchmarny · 2026-03-19T10:10:57Z

Addressed all three items from this review:

Scope to KWOK scheduling tests — Added a new "Scope" section explicitly stating this ADR applies only to KWOK scheduling simulation, not hardware-dependent validation (GPU operator health, NCCL bandwidth, real-cluster conformance).
Nightly as release qualification gate — Added a "Release qualification" paragraph to Tier 3 describing how nightly runs double as a qualification gate for release candidates, separating recipe correctness from runtime correctness.
Equivalence-class grouping follow-up — Already tracked in "Alternatives Considered §2" as deferred, with a trigger condition (~200 overlays). No change needed.

yuanchen8911

/lgtm

docs: add ADR-003 for scaling KWOK recipe tests in CI

a171a5e

mchmarny requested a review from a team as a code owner March 18, 2026 11:28

github-actions bot added the area/docs and size/L labels Mar 18, 2026

mchmarny self-assigned this Mar 18, 2026

mchmarny added the enhancement (New feature or request) and do-not-merge (PR should not be merged or auto-closed) labels Mar 18, 2026

mchmarny requested review from iamkhaledh, lalitadithya, lockwobr, xdu31 and yuanchen8911 March 18, 2026 11:35

Merge branch 'main' into feat/adr-003-scaling-recipe-tests

95e8507

yuanchen8911 reviewed Mar 18, 2026

yuanchen8911 self-requested a review March 18, 2026 19:04

yuanchen8911 previously approved these changes Mar 18, 2026

Merge branch 'main' into feat/adr-003-scaling-recipe-tests

e597ea7

mchmarny mentioned this pull request Mar 18, 2026

feat: add Oracle OKE recipe overlays #429

Open

xdu31 reviewed Mar 19, 2026

mchmarny requested review from dims and removed request for iamkhaledh and lockwobr March 19, 2026 10:10

docs: address xdu31 review — scope to KWOK, add release qualification

241f403

mchmarny dismissed yuanchen8911’s stale review via 241f403 March 19, 2026 10:13

lalitadithya approved these changes Mar 19, 2026

Merge branch 'main' into feat/adr-003-scaling-recipe-tests

30ff316

mchmarny merged commit f4d3a66 into main Mar 19, 2026
11 checks passed

mchmarny deleted the feat/adr-003-scaling-recipe-tests branch March 19, 2026 12:47

mchmarny mentioned this pull request Mar 19, 2026

ci(kwok): implement tiered testing strategy per ADR-003 #432

Merged

5 tasks

yuanchen8911 reviewed Mar 19, 2026

xdu31 pushed a commit to xdu31/aicr that referenced this pull request Mar 24, 2026

docs: add ADR-003 for scaling KWOK recipe tests (NVIDIA#424)

f2a06f3

mchmarny merged 6 commits into main from feat/adr-003-scaling-recipe-tests

4 participants
