ci(kwok): implement tiered testing strategy per ADR-003 #432
Replace flat "test every overlay on every PR" model with tiered strategy:
- Tier 1 (PR gate): generic overlays only (no accelerator), ~11 jobs
- Tier 2 (PR, conditional): diff-aware selection of affected accelerator overlays based on changed files, base chain, and component refs
- Tier 3 (merge + nightly): full matrix with per-SHA concurrency group to prevent successive merges from canceling in-flight runs

Also fix kube-prometheus-stack storageSpec for KWOK: cloud overlays set emptyDir: null + volumeClaimTemplate, which the Prometheus CRD rejects in Kind clusters. Post-process bundle values to restore emptyDir.

Reduces typical PR runner time ~70% (36 → 11 jobs) while preserving full coverage on every merge to main and nightly schedule.

Implements: docs/design/003-scaling-recipe-tests.md
Coverage Report ✅
No Go source files changed in this PR.
Review findings (ordered by severity):
Open question:
lgtm
…1 on schedule, fail on diff error
- registry.yaml or base.yaml changes now promote all accelerator overlays to Tier 2 instead of skipping, ensuring PR-time validation coverage
- Tier 1 skipped on schedule runs (Tier 3 already covers full matrix)
- git diff failure now fails loudly instead of silently producing empty Tier 2
Thanks for the review — all three points are valid. Addressed in 9ffc89d:
To the open question: Skipping accelerator validation for
Implements #424 (ADR-003)
Summary
- Fix `emptyDir: null` CRD rejection in KWOK clusters by post-processing bundle values
Details
Workflow changes (`.github/workflows/kwok-recipes.yaml`)
Tier 2 diff-aware selection traces three dependency paths:
- `spec.base` chain
- `valuesFile` references from the overlay and its base chain
Concurrency: Tier 3 uses a per-SHA concurrency group (`cancel-in-progress: false`) so successive merges to main never cancel in-flight runs. PRs keep `cancel-in-progress: true`.
Nightly schedule added at 03:00 UTC as a Tier 3 backstop per ADR-003.
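The Tier 2 selection described above can be sketched roughly as follows. This is an illustrative Python model, not the workflow's actual implementation: the `Overlay` type, the `select_tier2` and `dependency_files` helpers, and all file paths are hypothetical. It assumes an overlay is affected when any changed file is the overlay itself, a file in its `spec.base` chain, or a referenced values file, and that a change to `registry.yaml` or `base.yaml` promotes every accelerator overlay.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Set

# Assumption: a change to either of these files affects every overlay.
GLOBAL_FILES = {"registry.yaml", "base.yaml"}


@dataclass
class Overlay:
    """Hypothetical, simplified view of one recipe overlay."""
    path: str                              # file that defines the overlay
    base: Optional[str] = None             # overlay named in spec.base
    values_files: List[str] = field(default_factory=list)
    accelerator: bool = False              # Tier 2 only covers these


def dependency_files(name: Optional[str], overlays: Dict[str, Overlay]) -> Set[str]:
    """Collect the overlay file, its base chain, and all valuesFile refs."""
    files: Set[str] = set()
    seen: Set[str] = set()
    while name in overlays and name not in seen:
        seen.add(name)                     # guards against a cyclic base chain
        overlay = overlays[name]
        files.add(overlay.path)
        files.update(overlay.values_files)
        name = overlay.base
    return files


def select_tier2(changed: Iterable[str], overlays: Dict[str, Overlay]) -> List[str]:
    """Return the accelerator overlays affected by the changed files."""
    changed_set = set(changed)
    accel = sorted(n for n, o in overlays.items() if o.accelerator)
    # Global files promote all accelerator overlays instead of skipping.
    if any(f.rsplit("/", 1)[-1] in GLOBAL_FILES for f in changed_set):
        return accel
    return [n for n in accel if dependency_files(n, overlays) & changed_set]


# Made-up example data for illustration only.
OVERLAYS = {
    "eks": Overlay("overlays/eks.yaml", values_files=["values/eks-prometheus.yaml"]),
    "gke": Overlay("overlays/gke.yaml"),
    "eks-h100": Overlay("overlays/eks-h100.yaml", base="eks", accelerator=True),
    "gke-h100": Overlay("overlays/gke-h100.yaml", base="gke", accelerator=True),
}
```

With this example data, a change to `values/eks-prometheus.yaml` selects only `eks-h100`, because that values file is reached through the overlay's base chain, while a docs-only change selects nothing.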
KWOK fix (`kwok/scripts/validate-scheduling.sh`)
Cloud overlays (EKS, AKS) set `storageSpec.emptyDir: null` + `volumeClaimTemplate` for PVC storage. The Prometheus CRD rejects `emptyDir: null` in Kind/KWOK clusters. Added post-bundle processing to restore `emptyDir` and remove `volumeClaimTemplate` for KWOK tests.
Impact
- `KWOK Test Summary` job aggregates all tiers
Test plan
- … (`make test`)
- `emptyDir` fix triggers correctly for cloud overlays and skips for kind/base overlays
- … (`gke-tcpxo-networking.md` sidebar) confirmed unrelated
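For reference, the post-bundle `storageSpec` rewrite described under the KWOK fix could look roughly like the sketch below. The real change lives in a shell script; this Python version is only an illustration. The `restore_empty_dir` helper is hypothetical, the nesting under `prometheus.prometheusSpec.storageSpec` follows the usual kube-prometheus-stack values layout but is an assumption about how the bundle stores it, and restoring `emptyDir` to an empty object is likewise an assumption.

```python
from typing import Any, Dict

# Assumed location of the storage settings inside the bundle values.
STORAGE_PATH = ("prometheus", "prometheusSpec", "storageSpec")


def restore_empty_dir(values: Dict[str, Any]) -> bool:
    """Rewrite a cloud-style storageSpec in place for KWOK.

    Returns True when the values were changed, False when the overlay
    already used emptyDir (kind/base overlays) or has no storageSpec.
    """
    node: Any = values
    for key in STORAGE_PATH:
        if not isinstance(node, dict):
            return False
        node = node.get(key)
    if not isinstance(node, dict):
        return False

    null_empty_dir = "emptyDir" in node and node["emptyDir"] is None
    if not (null_empty_dir or "volumeClaimTemplate" in node):
        return False                       # nothing cloud-specific to undo

    node.pop("volumeClaimTemplate", None)  # no PVCs in KWOK tests
    node["emptyDir"] = {}                  # assumption: default emptyDir volume
    return True
```

Returning a boolean keeps the "triggers for cloud overlays, skips for kind/base overlays" behaviour from the test plan easy to assert in a unit test.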