Skip to content

docs: add ADR-003 for scaling KWOK recipe tests#424

Merged
mchmarny merged 6 commits intomainfrom
feat/adr-003-scaling-recipe-tests
Mar 19, 2026
Merged

docs: add ADR-003 for scaling KWOK recipe tests#424
mchmarny merged 6 commits intomainfrom
feat/adr-003-scaling-recipe-tests

Conversation

@mchmarny
Copy link
Copy Markdown
Member

@mchmarny mchmarny commented Mar 18, 2026

Summary

  • Add ADR-003 proposing a tiered testing strategy for KWOK recipe validation in CI
  • Current 36-job matrix grows multiplicatively with new services/accelerators (~150-200+ jobs projected)
  • Proposes three tiers: generic PR gate (fast), diff-aware accelerator tests (conditional), full matrix (merge/nightly)
  • Estimated ~70% reduction in PR runner time while preserving full coverage on main

Test plan

  • Review ADR with team for feedback on tiered approach
  • No code changes — design document only

@mchmarny mchmarny requested a review from a team as a code owner March 18, 2026 11:28
@mchmarny mchmarny self-assigned this Mar 18, 2026
@mchmarny mchmarny added enhancement New feature or request do-not-merge PR should not be merged or auto-closed labels Mar 18, 2026
Copy link
Copy Markdown
Contributor

@yuanchen8911 yuanchen8911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From CodeX

  1. High: ADR claims full coverage on every merge to main, but current concurrency model can cancel in-flight main runs during rapid successive merges. That means Tier 3 is not guaranteed per-merge unless concurrency is adjusted.
  1. Medium: "Only test-tier1 and test-tier2 are required checks" is operationally underspecified for matrix jobs. Matrix check names drift with overlay set changes, so branch protection can become brittle unless you define a stable aggregate required check.
  1. Low: Context text has minor mismatch with current workflow scope.

Open question:

  • Should Tier 3 be guaranteed per-main-merge (no cancellation), or is eventual coverage via nightly + latest-main acceptable? The ADR should state this explicitly.

- Tier 3 concurrency: use cancel-in-progress: false with per-SHA
  concurrency group to guarantee full coverage on every merge
- Required checks: use stable summary job instead of individual
  matrix job names to avoid branch protection brittleness
- Context fixes: include .github/actions/kwok-test/** in trigger
  paths, clarify Kind is included in service discovery
@mchmarny
Copy link
Copy Markdown
Member Author

Thanks for the thorough review — all three points are valid. Pushed updates in b225b49:

  1. High (Tier 3 concurrency): Added explicit concurrency policy for Tier 3 — cancel-in-progress: false with a per-SHA concurrency group so main runs are never cancelled by subsequent merges. Nightly provides a backstop.

  2. Medium (required checks): Replaced the per-tier required checks with a single stable KWOK Test Summary aggregate job. Individual matrix job names are not required in branch protection, avoiding brittleness as overlays change.

  3. Low (context accuracy): Added .github/actions/kwok-test/** to the trigger path description. Clarified that discovery includes local environments (Kind) alongside cloud services.

Re: open question — the ADR now explicitly states Tier 3 uses cancel-in-progress: false to guarantee per-merge coverage, with nightly as a safety net for operational edge cases.

@yuanchen8911 yuanchen8911 self-requested a review March 18, 2026 19:04
yuanchen8911
yuanchen8911 previously approved these changes Mar 18, 2026
Copy link
Copy Markdown
Contributor

@yuanchen8911 yuanchen8911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Copy link
Copy Markdown
Contributor

@xdu31 xdu31 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Scope explicitly to KWOK scheduling tests

The post-merge failure gap is low-risk for KWOK (no real hardware, fast to fix), but this tiered pattern should not be generalized to hardware-dependent validation (NCCL bandwidth, GPU operator health, etc.) KWOK simulates node topology but not GPU hardware, operator pods, or NCCL fabrics. Scoping this out avoids confusion about what "full coverage" means.

  1. Nightly as release qualification gate

Nightly Tier 3 can double as qualification for release candidates. This separates recipe correctness (KWOK — will it schedule?) from runtime correctness (real clusters — does it work?). The release strategy becomes: PR runs Tier 1 + Tier 2, merge runs full KWOK matrix, nightly adds real-cluster validators, and only SHAs where nightly passed are promoted as release candidates.

  1. Equivalence-class grouping deserves a follow-up

At 200+ overlays, even Tier 1 could reach 30-40 jobs. Tracking topology fingerprinting as a future ADR is worthwhile — many overlays differ only in values but schedule identically.

@mchmarny mchmarny requested review from dims and removed request for iamkhaledh and lockwobr March 19, 2026 10:10
@mchmarny
Copy link
Copy Markdown
Member Author

Addressed all three items from this review:

  1. Scope to KWOK scheduling tests — Added a new "Scope" section explicitly stating this ADR applies only to KWOK scheduling simulation, not hardware-dependent validation (GPU operator health, NCCL bandwidth, real-cluster conformance).

  2. Nightly as release qualification gate — Added a "Release qualification" paragraph to Tier 3 describing how nightly runs double as a qualification gate for release candidates, separating recipe correctness from runtime correctness.

  3. Equivalence-class grouping follow-up — Already tracked in "Alternatives Considered §2" as deferred, with a trigger condition (~200 overlays). No change needed.

@mchmarny mchmarny merged commit f4d3a66 into main Mar 19, 2026
11 checks passed
@mchmarny mchmarny deleted the feat/adr-003-scaling-recipe-tests branch March 19, 2026 12:47
Copy link
Copy Markdown
Contributor

@yuanchen8911 yuanchen8911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

xdu31 pushed a commit to xdu31/aicr that referenced this pull request Mar 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/docs do-not-merge PR should not be merged or auto-closed enhancement New feature or request size/L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants