chore: migrate e2e cluster to KWOK nodes for faster CI#489

Merged
Ronkahn21 merged 8 commits into
ai-dynamo:mainfrom
Ronkahn21:chore/cluster-script-use
Mar 18, 2026

@Ronkahn21 Ronkahn21 commented Mar 15, 2026

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Migrates the e2e test cluster from 30 k3d worker nodes to 2 k3d workers + 30 KWOK fake nodes. Test pods don't require real container execution — they only need to appear as Running/Ready. KWOK simulates pod lifecycle at a fraction of the infrastructure cost.

Benefits:

  • Faster cluster startup: KWOK nodes are created via API in ~2s vs ~60s for k3d workers
  • Lower memory: 0 k3d workers + 30 KWOK nodes vs 30 × 150MB k3d workers (~4.5GB)

Performance Results

CI benchmarks — real k3d workers (avg of 4 runs across 3 branches) vs KWOK nodes (avg of 2 runs on this branch):

| Job | Real nodes (avg) | KWOK (avg) | Speedup |
|---|---|---|---|
| gang_scheduling | 28m | 19m | 32% |
| rolling_updates | 23m | 10m | 57% |
| Topology_Aware_Scheduling | 17m | 8m | 53% |
| startup_ordering | 11m | 6m | 45% |
| cert_management | 9m | 4m | 56% |

  • Wall clock (longest parallel job): 28m → 19m (~32% faster)
  • Memory: ~4.5GB → ~0 (KWOK nodes are API-only, no kubelet/container runtime)
  • Flakiness: Pre-KWOK runs had e2e failures in 5/6 runs (gang_scheduling, rolling_updates most common). Post-KWOK: 0 failures across 2 completed runs
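
As a quick check, the Speedup column can be recomputed from the two duration columns. The sketch below assumes "Speedup" means the percentage of wall-clock time saved, `(before - after) / before`; the dictionary and function names are illustrative and not taken from the repository.

```python
# Job durations in minutes, copied from the benchmark table: (real nodes, KWOK).
DURATIONS_MIN = {
    "gang_scheduling": (28, 19),
    "rolling_updates": (23, 10),
    "Topology_Aware_Scheduling": (17, 8),
    "startup_ordering": (11, 6),
    "cert_management": (9, 4),
}


def time_reduction_pct(before: float, after: float) -> int:
    """Percentage of wall-clock time saved, rounded to the nearest integer."""
    return round(100 * (before - after) / before)


reductions = {
    job: time_reduction_pct(before, after)
    for job, (before, after) in DURATIONS_MIN.items()
}

# Jobs run in parallel, so the CI wall clock is set by the slowest job.
wall_before = max(before for before, _ in DURATIONS_MIN.values())
wall_after = max(after for _, after in DURATIONS_MIN.values())

for job, pct in reductions.items():
    print(f"{job}: {pct}%")
print(f"wall clock: {wall_before}m -> {wall_after}m")
```

Under that reading, every recomputed value matches the table, and the slowest job (gang_scheduling) sets the wall clock both before and after the change.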

Changes:

  1. Cluster config (e2e.yaml): no k3d workers; adds KWOK config (30 nodes, 150Mi memory, 4 CPU each)
  2. Unified node role taint: all test nodes (k3d workers + KWOK) use node_role.e2e.grove.nvidia.com=agent:NoSchedule. The previously redundant fake-node=true:NoSchedule taint has been removed from KWOK nodes.
  3. Python infra_manager: removed KWOK_FAKE_NODE_TAINT_KEY constant and its usage in kwok.py:node_manifest()
  4. Startup-order isolation: workload3-6.yaml run on real k3d workers (init-container readiness requires real kubelet). Added run-e2e-startup-order-full Makefile target with E2E_KWOK_NODES=0.
  5. CI matrix: startup_ordering job uses run-e2e-real-full to ensure no KWOK nodes; same override for run-e2e-mnnvl-full.
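
To make changes 1 to 3 concrete, here is a minimal sketch of what a KWOK node manifest with only the unified taint might look like. It is not the actual `kwok.py:node_manifest()` implementation: the function signature, the node naming scheme, and the `pods` capacity are assumptions. The `kwok.x-k8s.io/node: fake` annotation and `type: kwok` label follow the example in KWOK's documentation, and whether this repository selects nodes the same way is also an assumption.

```python
from typing import Any, Optional

# Taint key, value, and effect as given in the PR description.
NODE_ROLE_TAINT = {
    "key": "node_role.e2e.grove.nvidia.com",
    "value": "agent",
    "effect": "NoSchedule",
}


def node_manifest(
    name: str,
    cpu: str,
    memory: str,
    labels: Optional[dict[str, str]] = None,
) -> dict[str, Any]:
    """Build a Node object for KWOK to manage (hypothetical sketch)."""
    # "pods" is an assumed value; the PR does not state a pod capacity.
    capacity = {"cpu": cpu, "memory": memory, "pods": "110"}
    return {
        "apiVersion": "v1",
        "kind": "Node",
        "metadata": {
            "name": name,
            # Annotation and label follow KWOK's documented example (assumption).
            "annotations": {"kwok.x-k8s.io/node": "fake"},
            "labels": {"type": "kwok", **(labels or {})},
        },
        # Only the unified node-role taint; no separate fake-node taint.
        "spec": {"taints": [dict(NODE_ROLE_TAINT)]},
        "status": {"allocatable": dict(capacity), "capacity": dict(capacity)},
    }


# 30 nodes sized as in change 1 (the node names are hypothetical).
nodes = [node_manifest(f"kwok-node-{i}", cpu="4", memory="150Mi") for i in range(30)]
```

Because each node carries a single taint, a test pod needs only one toleration to land on either a k3d worker or a KWOK node.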

Which issue(s) this PR fixes:

Fixes #488

Special notes for your reviewer:

  • KWOK nodes receive topology labels (kubernetes.io/zone, kubernetes.io/block, kubernetes.io/rack) via kwok.py:topology_labels(), so TAS tests work unchanged.
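
For reviewers unfamiliar with how such labels are typically derived, here is a hypothetical sketch of a `topology_labels()` helper. The three label keys come from the note above; the index-based layout, the default group sizes, and the `zone-N` / `block-N` / `rack-N` value format are assumptions and may differ from what `kwok.py` actually does.

```python
def topology_labels(
    index: int,
    nodes_per_rack: int = 2,
    racks_per_block: int = 2,
    blocks_per_zone: int = 2,
) -> dict[str, str]:
    """Map a node index onto a rack/block/zone hierarchy (assumed layout)."""
    rack = index // nodes_per_rack
    block = rack // racks_per_block
    zone = block // blocks_per_zone
    return {
        "kubernetes.io/zone": f"zone-{zone}",
        "kubernetes.io/block": f"block-{block}",
        "kubernetes.io/rack": f"rack-{rack}",
    }


# Labels for the first few of the 30 KWOK nodes.
for i in range(6):
    print(i, topology_labels(i))
```

With nested integer division, consecutive nodes share a rack, consecutive racks share a block, and so on, which gives topology-aware tests a non-trivial hierarchy to schedule against.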

Does this PR introduce an API change?

NONE

Additional documentation e.g., enhancement proposals, usage docs, etc.:

NONE

@Ronkahn21 Ronkahn21 marked this pull request as ready for review March 15, 2026 15:55
ranrubin added a commit to ranrubin/grove that referenced this pull request Mar 16, 2026
… add inline docs

- Integrate changes from PR ai-dynamo#489 (KWOK nodes + Go build caching):
  - Add Go module cache to actions/setup-go fallback step
  - Add pydantic_settings to Python dep verification
  - Add run-e2e-startup-order-full Makefile target (real kubelet workers, no KWOK)
  - Refactor run-e2e-mnnvl-full env vars to E2E_CLUSTER__* / E2E_KWOK__* naming
  - Switch e2e-cluster-down to use infra-manager.py
  - Point startup_ordering matrix entry to new make target
- Add inline docs explaining DinD memory limit rationale in cluster.py
- Expand refreshWorkerNodes doc comment in shared_cluster.go to explain
  the DinD node instability context that motivated the stale-list fix

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace 30 k3d workers with 2 workers + 30 KWOK fake nodes.
Update test pod affinities/tolerations to target KWOK nodes.
Add startup-order target with KWOK disabled (real kubelet needed).
Disable KWOK for MNNVL tests.

Fixes ai-dynamo#488

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Tests call PrepareForTest(10) which requires exactly 10 schedulable
worker nodes. Set WORKER_NODES=10, WORKER_MEMORY=150m, KWOK=0 for
the startup-ordering target — real kubelet + minimum resources.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
150Mi per node limits Kai to 1 pod/node (80Mi requests), breaking
host-colocation topology constraints and small gang scheduling.
8Gi per node allows multiple pods per KWOK node, matching test expectations.

Also enable Go build cache in CI for faster test compilation.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
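
The memory sizing in the commit message above comes down to integer division. This sketch counts how many pods fit on a node by memory alone, using the 80Mi request figure from the message. It ignores CPU, the node's pod-count capacity, and any system reservations, so it is an upper bound rather than what the scheduler would report; the function name is illustrative.

```python
MIB_PER_GIB = 1024


def pods_per_node(node_memory_mib: int, pod_request_mib: int) -> int:
    """Upper bound on pods per node when memory is the only constraint."""
    return node_memory_mib // pod_request_mib


POD_REQUEST_MIB = 80  # per-pod memory request quoted in the commit message

small = pods_per_node(150, POD_REQUEST_MIB)              # 150Mi node
large = pods_per_node(8 * MIB_PER_GIB, POD_REQUEST_MIB)  # 8Gi node

print(f"150Mi node: {small} pod(s); 8Gi node: up to {large} pods")
```

At 150Mi a second 80Mi pod cannot fit, so constraints that need several pods on one host cannot be satisfied. At 8Gi, memory stops being the limiting factor.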
Replace startup-order-specific make target with generic real-node
cluster targets (e2e-cluster-up-real, run-e2e-real-full) so any test
suite can run on either KWOK or real k3d nodes.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
@Ronkahn21 Ronkahn21 force-pushed the chore/cluster-script-use branch from 7149333 to bb05d50 Compare March 17, 2026 14:00
The node_role.e2e.grove.nvidia.com taint already prevents non-test
pods from scheduling on KWOK nodes, making fake-node redundant.
Removes the taint from KWOK node creation and 45 toleration blocks
across 21 e2e YAML files, making tests cluster-agnostic.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
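
The effect of dropping the extra taint can be illustrated with a simplified model of Kubernetes toleration matching. This is a sketch of the matching rules for `NoSchedule` taints only, not scheduler code: a pod is schedulable on a node only if every taint is matched by at least one toleration. The taint keys and values come from this PR; the helper names are hypothetical.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Simplified Kubernetes rule for whether one toleration matches one taint."""
    # An empty or missing effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # An empty key with operator Exists tolerates every taint.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (
        toleration.get("key") == taint["key"]
        and toleration.get("value", "") == taint.get("value", "")
    )


def schedulable(taints: list, tolerations: list) -> bool:
    """True if every taint is matched by at least one toleration."""
    return all(any(tolerates(tol, t) for tol in tolerations) for t in taints)


ROLE_TAINT = {"key": "node_role.e2e.grove.nvidia.com", "value": "agent", "effect": "NoSchedule"}
FAKE_TAINT = {"key": "fake-node", "value": "true", "effect": "NoSchedule"}

# A test pod that tolerates only the unified node-role taint.
pod_tolerations = [dict(ROLE_TAINT, operator="Equal")]

before = schedulable([ROLE_TAINT, FAKE_TAINT], pod_tolerations)  # KWOK node with both taints
after = schedulable([ROLE_TAINT], pod_tolerations)               # KWOK node after this commit
print(before, after)
```

With both taints present, a pod carrying only the node-role toleration is rejected, which is why every test YAML previously needed a second toleration block. With the single taint, the same pod spec works on k3d workers and KWOK nodes alike.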
Replace hardcoded port 5001 with $(E2E_REGISTRY_PORT), remove no-op
inline env vars from sub-make calls, declare E2E_CREATE_FLAGS, and
add richer documentation to new real-node targets.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
@Ronkahn21 Ronkahn21 merged commit 3d684fe into ai-dynamo:main Mar 18, 2026
22 of 23 checks passed
@Ronkahn21 Ronkahn21 deleted the chore/cluster-script-use branch March 18, 2026 08:36
danbar2 pushed a commit to danbar2/grove that referenced this pull request Mar 18, 2026
enoodle pushed a commit to enoodle/grove that referenced this pull request Mar 24, 2026
Signed-off-by: Erez Freiberger <enoodle@gmail.com>

Development

Successfully merging this pull request may close these issues.

chore: migrate e2e cluster from 30 k3d workers to KWOK nodes for faster CI

3 participants