chore: migrate e2e cluster to KWOK nodes for faster CI by Ronkahn21 · Pull Request #489 · ai-dynamo/grove

Ronkahn21 · 2026-03-15T15:47:22Z

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Migrates the e2e test cluster from 30 k3d worker nodes to 2 k3d workers + 30 KWOK fake nodes. Test pods don't require real container execution — they only need to appear as Running/Ready. KWOK simulates pod lifecycle at a fraction of the infrastructure cost.

Benefits:

Faster cluster startup: KWOK nodes are created via API in ~2s vs ~60s for k3d workers
Lower memory: 0 k3d workers + 30 KWOK nodes vs 30 × 150MB k3d workers (~4.5GB)

Performance Results

CI benchmarks — real k3d workers (avg of 4 runs across 3 branches) vs KWOK nodes (avg of 2 runs on this branch):

Job	Real Nodes (avg)	KWOK (avg)	Speedup
gang_scheduling	28m	19m	32%
rolling_updates	23m	10m	57%
Topology_Aware_Scheduling	17m	8m	53%
startup_ordering	11m	6m	45%
cert_management	9m	4m	56%

Wall clock (longest parallel job): 28m → 19m (~32% faster)
Memory: ~4.5GB → ~0 (KWOK nodes are API-only, no kubelet/container runtime)
Flakiness: Pre-KWOK runs had e2e failures in 5/6 runs (gang_scheduling, rolling_updates most common). Post-KWOK: 0 failures across 2 completed runs
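
The Speedup column reads as the relative reduction in job duration, (real - KWOK) / real. A small sketch that rechecks the table's arithmetic, using only the averages listed above; the function and variable names are illustrative and not part of the repository:

```python
# Average job durations in minutes, copied from the table above: (real nodes, KWOK).
DURATIONS_MIN = {
    "gang_scheduling": (28, 19),
    "rolling_updates": (23, 10),
    "Topology_Aware_Scheduling": (17, 8),
    "startup_ordering": (11, 6),
    "cert_management": (9, 4),
}


def time_reduction_pct(real_min: float, kwok_min: float) -> int:
    """Relative reduction in duration, rounded to the nearest whole percent."""
    return round((real_min - kwok_min) / real_min * 100)


for job, (real, kwok) in DURATIONS_MIN.items():
    print(f"{job}: {real}m -> {kwok}m ({time_reduction_pct(real, kwok)}% less time)")
```

Because the jobs run in parallel, the wall-clock figure follows the slowest job (gang_scheduling), which is why the overall gain is about 32% even though several jobs shrink by more than half.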

Changes:

Cluster config (e2e.yaml): no k3d workers, add KWOK config (30 nodes, 150Mi memory, 4 CPU each)
Unified node role taint: all test nodes (k3d workers + KWOK) use node_role.e2e.grove.nvidia.com=agent:NoSchedule. The previously redundant fake-node=true:NoSchedule taint has been removed from KWOK nodes.
Python infra_manager: removed KWOK_FAKE_NODE_TAINT_KEY constant and its usage in kwok.py:node_manifest()
Startup-order isolation: workload3-6.yaml run on real k3d workers (init-container readiness requires real kubelet). Added run-e2e-startup-order-full Makefile target with E2E_KWOK_NODES=0.
CI matrix: startup_ordering job uses run-e2e-real-full to ensure no KWOK nodes; same override for run-e2e-mnnvl-full.
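
For orientation, a KWOK node carrying only the unified taint might look roughly like the manifest built below. This is a sketch under assumptions, not the actual kwok.py:node_manifest(): the function name, its parameters, the `type: kwok` label, the `pods` capacity, and the `kwok.x-k8s.io/node: fake` annotation (the selector used in KWOK's own examples) are assumed here. Only the taint key, value, and effect and the 4 CPU / 150Mi sizing come from this PR.

```python
import json

# Taint key, value, and effect as described in this PR.
NODE_ROLE_TAINT = {
    "key": "node_role.e2e.grove.nvidia.com",
    "value": "agent",
    "effect": "NoSchedule",
}


def sketch_node_manifest(name: str, cpu: str = "4", memory: str = "150Mi") -> dict:
    """Hypothetical stand-in for kwok.py:node_manifest(); the layout is an assumption."""
    resources = {"cpu": cpu, "memory": memory, "pods": "110"}  # pods value is assumed
    return {
        "apiVersion": "v1",
        "kind": "Node",
        "metadata": {
            "name": name,
            # Annotation used in KWOK's examples to mark nodes it should manage (assumed here).
            "annotations": {"kwok.x-k8s.io/node": "fake"},
            "labels": {"type": "kwok"},
        },
        # Single unified taint; no separate fake-node=true:NoSchedule taint.
        "spec": {"taints": [dict(NODE_ROLE_TAINT)]},
        "status": {"allocatable": dict(resources), "capacity": dict(resources)},
    }


print(json.dumps(sketch_node_manifest("kwok-node-0"), indent=2))
```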

Which issue(s) this PR fixes:

Fixes #488

Special notes for your reviewer:

KWOK nodes receive topology labels (kubernetes.io/zone, kubernetes.io/block, kubernetes.io/rack) via kwok.py:topology_labels(), so TAS tests work unchanged.
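
One way such labels could be derived from a node index is sketched below. The grouping sizes (nodes per rack, racks per block, blocks per zone), the value format, and the function name are hypothetical; only the three label keys are taken from this PR, and the real kwok.py:topology_labels() may work differently.

```python
def sketch_topology_labels(
    index: int,
    nodes_per_rack: int = 2,    # hypothetical grouping sizes
    racks_per_block: int = 3,
    blocks_per_zone: int = 5,
) -> dict[str, str]:
    """Map a node index to zone/block/rack labels by integer division."""
    rack = index // nodes_per_rack
    block = rack // racks_per_block
    zone = block // blocks_per_zone
    return {
        "kubernetes.io/zone": f"zone-{zone}",
        "kubernetes.io/block": f"block-{block}",
        "kubernetes.io/rack": f"rack-{rack}",
    }


# Label all 30 KWOK nodes and show the first and last.
labels = [sketch_topology_labels(i) for i in range(30)]
print(labels[0])
print(labels[29])
```

With these assumed sizes, 30 nodes fall into 15 racks, 5 blocks, and 1 zone, so nodes that share a rack also share a block and a zone, which is the nesting property topology-aware scheduling tests rely on.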

Does this PR introduce an API change?

NONE

Additional documentation e.g., enhancement proposals, usage docs, etc.:

NONE

… add inline docs

Integrate changes from PR ai-dynamo#489 (KWOK nodes + Go build caching):
- Add Go module cache to actions/setup-go fallback step
- Add pydantic_settings to Python dep verification
- Add run-e2e-startup-order-full Makefile target (real kubelet workers, no KWOK)
- Refactor run-e2e-mnnvl-full env vars to E2E_CLUSTER__* / E2E_KWOK__* naming
- Switch e2e-cluster-down to use infra-manager.py
- Point startup_ordering matrix entry to new make target
- Add inline docs explaining DinD memory limit rationale in cluster.py
- Expand refreshWorkerNodes doc comment in shared_cluster.go to explain the DinD node instability context that motivated the stale-list fix

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace 30 k3d workers with 2 workers + 30 KWOK fake nodes. Update test pod affinities/tolerations to target KWOK nodes. Add startup-order target with KWOK disabled (real kubelet needed). Disable KWOK for MNNVL tests. Fixes ai-dynamo#488 Signed-off-by: Ron Kahn <rkahn@nvidia.com>

Tests call PrepareForTest(10) which requires exactly 10 schedulable worker nodes. Set WORKER_NODES=10, WORKER_MEMORY=150m, KWOK=0 for the startup-ordering target — real kubelet + minimum resources. Signed-off-by: Ron Kahn <rkahn@nvidia.com>

150Mi per node limits Kai to 1 pod/node (80Mi requests), breaking host-colocation topology constraints and small gang scheduling. 8Gi per node allows multiple pods per KWOK node, matching test expectations. Also enable Go build cache in CI for faster test compilation. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
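
The memory argument above is simple integer division. A rough sketch, considering memory only and ignoring CPU, pod-count limits, and any reserved capacity (the function name is illustrative):

```python
def pods_that_fit(node_memory_mi: int, pod_request_mi: int) -> int:
    """How many pods with the given memory request fit on one node, by memory alone."""
    return node_memory_mi // pod_request_mi


POD_REQUEST_MI = 80  # per-pod memory request quoted in the commit message

print(pods_that_fit(150, POD_REQUEST_MI))       # 150Mi node
print(pods_that_fit(8 * 1024, POD_REQUEST_MI))  # 8Gi node
```

At 150Mi a second 80Mi pod cannot fit (160Mi > 150Mi), so co-location on a single node is impossible; at 8Gi memory stops being the constraint.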

… testing

Replace startup-order-specific make target with generic real-node cluster targets (e2e-cluster-up-real, run-e2e-real-full) so any test suite can run on either KWOK or real k3d nodes. Signed-off-by: Ron Kahn <rkahn@nvidia.com>

The node_role.e2e.grove.nvidia.com taint already prevents non-test pods from scheduling on KWOK nodes, making fake-node redundant. Removes the taint from KWOK node creation and 45 toleration blocks across 21 e2e YAML files, making tests cluster-agnostic. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
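
Why dropping the second taint makes the test YAML cluster-agnostic can be illustrated with a simplified model of Kubernetes taint/toleration matching. The sketch below covers only the Equal and Exists operators for NoSchedule taints; it is an approximation for illustration, not the scheduler's implementation, and the helper names are hypothetical.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Simplified Kubernetes rule: does this toleration match this taint?"""
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False  # an empty effect on the toleration matches any effect
    key = toleration.get("key", "")
    if toleration.get("operator", "Equal") == "Exists":
        return key == "" or key == taint["key"]  # empty key + Exists matches every taint
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


def schedulable(tolerations: list[dict], taints: list[dict]) -> bool:
    """A pod can land on a node only if every NoSchedule taint is tolerated."""
    return all(
        any(tolerates(tol, taint) for tol in tolerations)
        for taint in taints
        if taint["effect"] == "NoSchedule"
    )


role_taint = {"key": "node_role.e2e.grove.nvidia.com", "value": "agent", "effect": "NoSchedule"}
fake_taint = {"key": "fake-node", "value": "true", "effect": "NoSchedule"}
role_toleration = {"key": "node_role.e2e.grove.nvidia.com", "operator": "Equal",
                   "value": "agent", "effect": "NoSchedule"}

print(schedulable([role_toleration], [role_taint, fake_taint]))  # node with both taints
print(schedulable([role_toleration], [role_taint]))              # node with the unified taint only
```

With both taints on a KWOK node, a pod needed two tolerations to land there; with only the unified taint, the same single toleration works on KWOK nodes and real k3d workers alike.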

Replace hardcoded port 5001 with $(E2E_REGISTRY_PORT), remove no-op inline env vars from sub-make calls, declare E2E_CREATE_FLAGS, and add richer documentation to new real-node targets. Signed-off-by: Ron Kahn <rkahn@nvidia.com>

Signed-off-by: Erez Freiberger <enoodle@gmail.com>

Ronkahn21 marked this pull request as ready for review March 15, 2026 15:55

Ronkahn21 requested review from gflarity, sanjaychatterjee, shayasoolin and unmarshall as code owners March 15, 2026 15:55

shayasoolin reviewed Mar 16, 2026

Comment thread .github/workflows/build-check-test.yaml Outdated

Ronkahn21 added 6 commits March 17, 2026 14:46

chore: update node configurations and tolerations for agent role

839a0d8

chore: increase nodes per block and update cluster initialization for…

df6b05f

… testing

chore: decouple e2e cluster types from test execution

bb05d50

Ronkahn21 force-pushed the chore/cluster-script-use branch from 7149333 to bb05d50 March 17, 2026 14:00

Ronkahn21 added 2 commits March 17, 2026 18:37

danbar2 approved these changes Mar 18, 2026

shayasoolin approved these changes Mar 18, 2026

Ronkahn21 merged commit 3d684fe into ai-dynamo:main Mar 18, 2026
22 of 23 checks passed

Ronkahn21 deleted the chore/cluster-script-use branch March 18, 2026 08:36

danbar2 pushed a commit to danbar2/grove that referenced this pull request Mar 18, 2026

chore: migrate e2e cluster to KWOK nodes for faster CI (ai-dynamo#489)

0f6e1d4

enoodle pushed a commit to enoodle/grove that referenced this pull request Mar 24, 2026

chore: migrate e2e cluster to KWOK nodes for faster CI (ai-dynamo#489)

fad68a2

Signed-off-by: Erez Freiberger <enoodle@gmail.com>

Ronkahn21 merged 8 commits into ai-dynamo:main from Ronkahn21:chore/cluster-script-use
