docs: add ADR-003 for scaling KWOK recipe tests#424
Conversation
There was a problem hiding this comment.
From CodeX
- High: ADR claims full coverage on every merge to
main, but current concurrency model can cancel in-flightmainruns during rapid successive merges. That means Tier 3 is not guaranteed per-merge unless concurrency is adjusted.
aicr/docs/design/003-scaling-recipe-tests.md
Line 140 in 95e8507
aicr/.github/workflows/kwok-recipes.yaml
Line 46 in 95e8507
- Medium: "Only
test-tier1andtest-tier2are required checks" is operationally underspecified for matrix jobs. Matrix check names drift with overlay set changes, so branch protection can become brittle unless you define a stable aggregate required check.
aicr/docs/design/003-scaling-recipe-tests.md
Line 129 in 95e8507
aicr/docs/design/003-scaling-recipe-tests.md
Line 114 in 95e8507
- Low: Context text has minor mismatch with current workflow scope.
- It says PR runs trigger on
recipes/**,kwok/**, or workflow file, but current triggers also include.github/actions/kwok-test/**. - It describes discovery as cloud-service criteria, but includes
kindin the matrix. aicr/.github/workflows/kwok-recipes.yaml
Line 25 in 95e8507
Open question:
- Should Tier 3 be guaranteed per-main-merge (no cancellation), or is eventual coverage via nightly + latest-main acceptable? The ADR should state this explicitly.
- Tier 3 concurrency: use cancel-in-progress: false with per-SHA concurrency group to guarantee full coverage on every merge - Required checks: use stable summary job instead of individual matrix job names to avoid branch protection brittleness - Context fixes: include .github/actions/kwok-test/** in trigger paths, clarify Kind is included in service discovery
|
Thanks for the thorough review — all three points are valid. Pushed updates in b225b49:
Re: open question — the ADR now explicitly states Tier 3 uses |
There was a problem hiding this comment.
- Scope explicitly to KWOK scheduling tests
The post-merge failure gap is low-risk for KWOK (no real hardware, fast to fix), but this tiered pattern should not be generalized to hardware-dependent validation (NCCL bandwidth, GPU operator health, etc.) KWOK simulates node topology but not GPU hardware, operator pods, or NCCL fabrics. Scoping this out avoids confusion about what "full coverage" means.
- Nightly as release qualification gate
Nightly Tier 3 can double as qualification for release candidates. This separates recipe correctness (KWOK — will it schedule?) from runtime correctness (real clusters — does it work?). The release strategy becomes: PR runs Tier 1 + Tier 2, merge runs full KWOK matrix, nightly adds real-cluster validators, and only SHAs where nightly passed are promoted as release candidates.
- Equivalence-class grouping deserves a follow-up
At 200+ overlays, even Tier 1 could reach 30-40 jobs. Tracking topology fingerprinting as a future ADR is worthwhile — many overlays differ only in values but schedule identically.
|
Addressed all three items from this review:
|
Summary
Test plan