Skip to content

ci(kwok): implement tiered testing strategy per ADR-003#432

Merged
mchmarny merged 4 commits intomainfrom
feat/kwok-tiered-testing
Mar 19, 2026
Merged

ci(kwok): implement tiered testing strategy per ADR-003#432
mchmarny merged 4 commits intomainfrom
feat/kwok-tiered-testing

Conversation

@mchmarny
Copy link
Copy Markdown
Member

@mchmarny mchmarny commented Mar 19, 2026

Implements #424 (ADR-003)

Summary

  • Implements ADR-003 tiered KWOK testing strategy to replace flat "test every overlay on every PR" model
  • Splits discovery into three tiers: generic PR gate (Tier 1), diff-aware accelerator tests (Tier 2), and full matrix on merge/nightly (Tier 3)
  • Fixes kube-prometheus-stack emptyDir: null CRD rejection in KWOK clusters by post-processing bundle values

Details

Workflow changes (.github/workflows/kwok-recipes.yaml)

Tier When What Jobs
Tier 1 Every PR + push Generic overlays (no accelerator) ~11
Tier 2 PR only, conditional Diff-affected accelerator overlays 0–25
Tier 3 Push to main + nightly Full overlay matrix 36+

Tier 2 diff-aware selection traces three dependency paths:

  1. Direct overlay file changes
  2. Ancestor overlays in the spec.base chain
  3. Component valuesFile references from the overlay and its base chain

Concurrency: Tier 3 uses per-SHA concurrency group (cancel-in-progress: false) so successive merges to main never cancel in-flight runs. PRs keep cancel-in-progress: true.

Nightly schedule added at 03:00 UTC as a Tier 3 backstop per ADR-003.

KWOK fix (kwok/scripts/validate-scheduling.sh)

Cloud overlays (EKS, AKS) set storageSpec.emptyDir: null + volumeClaimTemplate for PVC storage. The Prometheus CRD rejects emptyDir: null in Kind/KWOK clusters. Added post-bundle processing to restore emptyDir and remove volumeClaimTemplate for KWOK tests.

Impact

  • ~70% reduction in PR runner time (36 → 11 jobs for typical PRs)
  • Full coverage preserved on every merge to main and nightly
  • Branch protection: KWOK Test Summary job aggregates all tiers

Test plan

  • Unit tests pass (make test)
  • YAML lint passes
  • KWOK tested 23/36 recipes across all 4 services (aks, eks, gke, kind) — zero failures
  • Verified emptyDir fix triggers correctly for cloud overlays and skips for kind/base overlays
  • Pre-existing lint failure (gke-tcpxo-networking.md sidebar) confirmed unrelated

Replace flat "test every overlay on every PR" model with tiered strategy:

- Tier 1 (PR gate): generic overlays only (no accelerator), ~11 jobs
- Tier 2 (PR, conditional): diff-aware selection of affected accelerator
  overlays based on changed files, base chain, and component refs
- Tier 3 (merge + nightly): full matrix with per-SHA concurrency group
  to prevent successive merges from canceling in-flight runs

Also fix kube-prometheus-stack storageSpec for KWOK: cloud overlays set
emptyDir: null + volumeClaimTemplate, which the Prometheus CRD rejects
in Kind clusters. Post-process bundle values to restore emptyDir.

Reduces typical PR runner time ~70% (36 → 11 jobs) while preserving
full coverage on every merge to main and nightly schedule.

Implements: docs/design/003-scaling-recipe-tests.md
@mchmarny mchmarny requested review from a team as code owners March 19, 2026 15:02
@mchmarny mchmarny self-assigned this Mar 19, 2026
@mchmarny mchmarny added this to the M2 - KubeCon EU milestone Mar 19, 2026
@mchmarny mchmarny added the ci label Mar 19, 2026
@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 19, 2026

Coverage Report ✅

Metric Value
Coverage 73.5%
Threshold 70%
Status Pass
Coverage Badge
![Coverage](https://img.shields.io/badge/coverage-73.5%25-green)

No Go source files changed in this PR.

@yuanchen8911
Copy link
Copy Markdown
Contributor

Review findings (ordered by severity):

  1. High: PRs that touch recipes/registry.yaml or recipes/overlays/base.yaml can merge without any accelerator coverage.
    In discover, Tier 2 is explicitly skipped for those changes (L116-L121), while Tier 3 is disabled on PRs (L294-L296).
    Impact: high-risk recipe graph changes are only validated post-merge/nightly.

  2. Medium: Nightly schedule currently runs Tier 1 in addition to Tier 3.
    test-tier1 has no event filter (L217-L219), so scheduled runs execute both Tier 1 and Tier 3.
    Impact: duplicate coverage and extra nightly runtime/cost.

  3. Medium: Tier 2 diff computation can silently degrade to “no changed files”.
    git diff --name-only ... || true suppresses failures (L114).
    Impact: if diff resolution fails, Tier 2 may be empty and the PR can pass with missing accelerator checks.

Open question:

  • Is skipping accelerator validation for registry.yaml / base.yaml changes an intentional ADR-003 tradeoff, or should those cases force Tier 2-all (or Tier 3 on PR)?

dims
dims previously approved these changes Mar 19, 2026
Copy link
Copy Markdown
Collaborator

@dims dims left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@xdu31
Copy link
Copy Markdown
Contributor

xdu31 commented Mar 19, 2026

lgtm

…1 on schedule, fail on diff error

- registry.yaml or base.yaml changes now promote all accelerator overlays
  to Tier 2 instead of skipping, ensuring PR-time validation coverage
- Tier 1 skipped on schedule runs (Tier 3 already covers full matrix)
- git diff failure now fails loudly instead of silently producing empty Tier 2
@mchmarny
Copy link
Copy Markdown
Member Author

Thanks for the review — all three points are valid. Addressed in 9ffc89d:

  1. High (registry.yaml/base.yaml → no accelerator coverage): Fixed. When registry.yaml or base.yaml changes, all accelerator overlays are now promoted to Tier 2 instead of being skipped. This ensures PR-time validation for high-risk recipe graph changes.

  2. Medium (duplicate Tier 1 on schedule): Fixed. Tier 1 now skips on schedule events since Tier 3 already runs the full matrix (which is a superset of Tier 1).

  3. Medium (silent diff failure): Fixed. git diff failure now fails the discover job loudly with ::error:: annotation instead of silently producing an empty Tier 2.

To the open question: Skipping accelerator validation for registry.yaml/base.yaml was the ADR-003 tradeoff (Tier 3 catches it post-merge). But the reviewer is right that these are high-risk changes — promoting to Tier 2-all on PR is the safer choice without significant cost increase (25 additional jobs only when base/registry changes, which is infrequent).

Copy link
Copy Markdown
Contributor

@iamkhaledh iamkhaledh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Copy Markdown
Contributor

@yuanchen8911 yuanchen8911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@mchmarny mchmarny merged commit d4e818f into main Mar 19, 2026
33 checks passed
@mchmarny mchmarny deleted the feat/kwok-tiered-testing branch March 19, 2026 17:28
xdu31 pushed a commit to xdu31/aicr that referenced this pull request Mar 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants