chore: migrate e2e cluster to KWOK nodes for faster CI #489
Merged
ranrubin
added a commit
to ranrubin/grove
that referenced
this pull request
Mar 16, 2026
… add inline docs
- Integrate changes from PR ai-dynamo#489 (KWOK nodes + Go build caching):
- Add Go module cache to actions/setup-go fallback step
- Add pydantic_settings to Python dep verification
- Add run-e2e-startup-order-full Makefile target (real kubelet workers, no KWOK)
- Refactor run-e2e-mnnvl-full env vars to E2E_CLUSTER__* / E2E_KWOK__* naming
- Switch e2e-cluster-down to use infra-manager.py
- Point startup_ordering matrix entry to new make target
- Add inline docs explaining DinD memory limit rationale in cluster.py
- Expand refreshWorkerNodes doc comment in shared_cluster.go to explain the DinD node instability context that motivated the stale-list fix
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
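The E2E_CLUSTER__* / E2E_KWOK__* naming mentioned in this commit uses a double underscore to separate a settings section from a field. As a rough illustration only, the grouping could be sketched with the standard library as below. The field names (WORKER_NODES, NODES) and the helper `nested_settings` are hypothetical; the repository may instead rely on pydantic_settings for this, which this sketch does not reproduce.

```python
def nested_settings(environ, prefix="E2E_", delimiter="__"):
    """Group PREFIX<SECTION>__<FIELD> variables into {section: {field: value}}.

    Hypothetical helper for illustration; not taken from the repository.
    """
    settings = {}
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue
        section, sep, field = name[len(prefix):].partition(delimiter)
        if not sep or not field:
            continue  # flat variable without a section, e.g. E2E_REGISTRY_PORT
        settings.setdefault(section.lower(), {})[field.lower()] = value
    return settings


example_env = {
    "E2E_CLUSTER__WORKER_NODES": "10",  # hypothetical field name
    "E2E_KWOK__NODES": "0",             # hypothetical field name
    "E2E_REGISTRY_PORT": "5001",        # flat variable, skipped by this helper
    "HOME": "/root",                    # unrelated variable, skipped
}
parsed = nested_settings(example_env)
print(parsed)
```

Values stay as strings here; a real settings layer would also validate and convert types.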
shayasoolin
reviewed
Mar 16, 2026
Replace 30 k3d workers with 2 workers + 30 KWOK fake nodes. Update test pod affinities/tolerations to target KWOK nodes. Add startup-order target with KWOK disabled (real kubelet needed). Disable KWOK for MNNVL tests. Fixes ai-dynamo#488 Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Tests call PrepareForTest(10) which requires exactly 10 schedulable worker nodes. Set WORKER_NODES=10, WORKER_MEMORY=150m, KWOK=0 for the startup-ordering target — real kubelet + minimum resources. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
150Mi per node limits Kai to 1 pod/node (80Mi requests), breaking host-colocation topology constraints and small gang scheduling. 8Gi per node allows multiple pods per KWOK node, matching test expectations. Also enable Go build cache in CI for faster test compilation. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
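The memory arithmetic behind this commit can be checked with a short sketch. It assumes memory is the only limiting resource and that every pod requests 80Mi, as stated above; CPU, pod-count limits, and scheduler overhead are ignored, and the helper name is made up for illustration.

```python
def pods_per_node(node_memory_mib, pod_request_mib):
    """Upper bound on pods per node when only memory requests are considered."""
    return node_memory_mib // pod_request_mib


POD_REQUEST_MIB = 80  # per-pod memory request quoted in the commit message

small = pods_per_node(150, POD_REQUEST_MIB)       # 150Mi per node
large = pods_per_node(8 * 1024, POD_REQUEST_MIB)  # 8Gi per node

print(small, large)
```

With 150Mi a second 80Mi pod no longer fits, which is why co-location constraints could not be satisfied; with 8Gi the memory bound is far above what the tests need.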
Replace startup-order-specific make target with generic real-node cluster targets (e2e-cluster-up-real, run-e2e-real-full) so any test suite can run on either KWOK or real k3d nodes. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
7149333 to
bb05d50
Compare
The node_role.e2e.grove.nvidia.com taint already prevents non-test pods from scheduling on KWOK nodes, making fake-node redundant. Removes the taint from KWOK node creation and 45 toleration blocks across 21 e2e YAML files, making tests cluster-agnostic. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
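For context, a KWOK node that carries only the node_role.e2e.grove.nvidia.com taint might look roughly like the manifest built below. This is a sketch under assumptions, not the contents of kwok.py: the node name pattern and the `type: kwok` label are illustrative, the `kwok.x-k8s.io/node: fake` annotation follows the convention used in KWOK's own examples, and the 4 CPU / 8Gi figures are taken from elsewhere in this PR.

```python
E2E_TAINT_KEY = "node_role.e2e.grove.nvidia.com"


def node_manifest(index):
    """Build an illustrative KWOK Node manifest (hypothetical shape)."""
    resources = {"cpu": "4", "memory": "8Gi", "pods": "110"}
    return {
        "apiVersion": "v1",
        "kind": "Node",
        "metadata": {
            "name": f"kwok-node-{index}",  # hypothetical naming scheme
            "annotations": {"kwok.x-k8s.io/node": "fake"},
            "labels": {"type": "kwok"},  # illustrative label
        },
        "spec": {
            # Single taint; the former fake-node=true:NoSchedule taint is gone.
            "taints": [
                {"key": E2E_TAINT_KEY, "value": "agent", "effect": "NoSchedule"}
            ],
        },
        "status": {"allocatable": dict(resources), "capacity": dict(resources)},
    }


# The one toleration a test pod still needs to land on these nodes.
e2e_toleration = {
    "key": E2E_TAINT_KEY,
    "operator": "Equal",
    "value": "agent",
    "effect": "NoSchedule",
}

manifest = node_manifest(0)
taint_keys = [t["key"] for t in manifest["spec"]["taints"]]
print(taint_keys)
```

Because the same taint is used on real e2e workers, a pod spec with this single toleration should be schedulable on either kind of node, which is the cluster-agnostic property the commit describes.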
Replace hardcoded port 5001 with $(E2E_REGISTRY_PORT), remove no-op inline env vars from sub-make calls, declare E2E_CREATE_FLAGS, and add richer documentation to new real-node targets. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
danbar2
approved these changes
Mar 18, 2026
shayasoolin
approved these changes
Mar 18, 2026
danbar2
pushed a commit
to danbar2/grove
that referenced
this pull request
Mar 18, 2026
enoodle
pushed a commit
to enoodle/grove
that referenced
this pull request
Mar 24, 2026
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
Migrates the e2e test cluster from 30 k3d worker nodes to 2 k3d workers + 30 KWOK fake nodes. Test pods don't require real container execution — they only need to appear as Running/Ready. KWOK simulates pod lifecycle at a fraction of the infrastructure cost.
Benefits:
Performance Results
CI benchmarks — real k3d workers (avg of 4 runs across 3 branches) vs KWOK nodes (avg of 2 runs on this branch):
Changes:
- e2e.yaml): no k3d workers, add KWOK config (30 nodes, 150Mi memory, 4 CPU each)
- node_role.e2e.grove.nvidia.com=agent:NoSchedule. The previously redundant fake-node=true:NoSchedule taint has been removed from KWOK nodes.
- KWOK_FAKE_NODE_TAINT_KEY constant and its usage in kwok.py:node_manifest()
- workload3-6.yaml run on real k3d workers (init-container readiness requires real kubelet). Added run-e2e-startup-order-full Makefile target with E2E_KWOK_NODES=0.
- startup_ordering job uses run-e2e-real-full to ensure no KWOK nodes; same override for run-e2e-mnnvl-full.
Which issue(s) this PR fixes:
Fixes #488
Special notes for your reviewer:
kubernetes.io/zone, kubernetes.io/block, kubernetes.io/rack) via kwok.py:topology_labels(), so TAS tests work unchanged.
Does this PR introduce an API change?
Additional documentation e.g., enhancement proposals, usage docs, etc.:
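As an illustration of the reviewer note on topology labels, the sketch below shows one way 30 fake nodes could be spread over zone, block, and rack labels. It is not the actual kwok.py:topology_labels() implementation: the group sizes are invented, and the label keys are copied as they appear in the note, where their prefix may be truncated.

```python
def topology_labels(index, nodes_per_rack=5, racks_per_block=3, blocks_per_zone=2):
    """Map a node index to zone/block/rack labels (hypothetical layout)."""
    rack = index // nodes_per_rack
    block = rack // racks_per_block
    zone = block // blocks_per_zone
    return {
        "kubernetes.io/zone": f"zone-{zone}",
        "kubernetes.io/block": f"block-{block}",
        "kubernetes.io/rack": f"rack-{rack}",
    }


labels = [topology_labels(i) for i in range(30)]
racks = {entry["kubernetes.io/rack"] for entry in labels}
print(labels[0], labels[29], len(racks))
```

Deriving every level from the node index keeps the hierarchy consistent: nodes in the same rack always share a block and a zone, which is the property topology-aware scheduling tests depend on.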