replace deprecated runner by ranrubin · Pull Request #447 · ai-dynamo/grove

ranrubin · 2026-02-24T10:03:25Z

What type of PR is this?

What this PR does / why we need it:

The runner that grove currently use cpu-amd-m5-2xlarge is on Dynamo's Legacy cluster. That runner is the only runner left on that cluster and is blocking its deletion.

Closes #448

Added prod-grove-e2e-v1 runner to Dynamo's production cluster. The Dockerfile can be found here, and the configuration can be found here.

Because our production runner uses DinD with a k8s native sidecar, I had to make some changes to the test infra:

Memory limits — --agents-memory 150m doesn't work in DinD (broken /proc/meminfo bind-mount). Several tests depend on nodes having limited memory to verify pods stay Pending when resources are insufficient. We now use kubelet system-reserved to achieve the same effective allocatable memory, triggered via a new --dind-memory-mode flag.

Stale node list — With 30 k3d nodes, occasionally one dies mid-run. Our node monitor replaces it, but PrepareForTest still had the old node cached, causing all subsequent tests to fail trying to cordon a deleted node. Fixed by refreshing the node list from the cluster before each test.
Both GS and TAS suites passing on the new runners

Special notes for your reviewer:

Does this PR introduce a API change?

None

Additional documentation e.g., enhancement proposals, usage docs, etc.:

sanjaychatterjee · 2026-02-24T18:59:36Z

@gflarity @danbar2 @Ronkahn21 ptal.

gflarity

What is the memory and CPU of this new runner? The reason we had selected the current one was that it was the only one with the CPU and RAM our tests needed at the time. Seems like these new runners might not have enough RAM for us.

ranrubin · 2026-03-02T19:34:28Z

@gflarity this is a draft pull request, I will address your comments when it's ready for review :)

The new prod-grove-e2e-v1 runner is slower at starting k3d nodes than the old cpu-amd-m5-2xlarge runner. With 30 nodes, only 19 became Ready within the 5-minute kubectl wait timeout, causing E2E cluster-up to fail. Doubled the timeout to 10m to accommodate the new runner's startup speed. Also temporarily limit matrix to Topology_Aware_Scheduling only for validation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

copy-pr-bot · 2026-03-05T11:13:50Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

ranrubin · 2026-03-12T08:02:40Z

What is the memory and CPU of this new runner? The reason we had selected the current one was that it was the only one with the CPU and RAM our tests needed at the time. Seems like these new runners might not have enough RAM for us.

You can see the complete configuration here. I confirmed that the resource suffice

Ronkahn21

Ran CodeRabbit review with project CLAUDE.md rules — two findings in operator/hack/e2e-cluster/create-e2e-cluster.py:

🐛 Bug: Wrong node condition used for readiness check (wait_for_nodes, ~line 375)

conditions[-1].status retrieves the last condition in the array, which is not guaranteed to be the Ready condition. Kubernetes nodes have multiple conditions (MemoryPressure, DiskPressure, PIDPressure, Ready), and their order is not fixed. This could cause nodes to be incorrectly identified as NotReady (or vice versa).

Fix — use JSONPath to explicitly select the Ready condition:

-            "-o", "custom-columns=NAME:.metadata.name,STATUS:.status.conditions[-1].status",
+            "-o", "custom-columns=NAME:.metadata.name,STATUS:.status.conditions[?(@.type=='Ready')].status",

🧹 Unused parameter: config in wait_for_nodes (~line 354)

config: ClusterConfig is declared but never referenced in the function body. Either remove it or use it to scope container restarts to this cluster (by filtering on the k3d-<cluster_name>- node name prefix), which would be safer in environments with multiple clusters running.

Option A — remove it:

-def wait_for_nodes(config: ClusterConfig, max_restart_rounds: int = 2):
+def wait_for_nodes(max_restart_rounds: int = 2):

Option B — use it for safety:

# Only restart containers belonging to this cluster
if not node_name.startswith(f"k3d-{config.cluster_name}-"):
    continue

ranrubin · 2026-03-15T13:58:44Z

Ran CodeRabbit review with project CLAUDE.md rules — two findings in operator/hack/e2e-cluster/create-e2e-cluster.py:

🐛 Bug: Wrong node condition used for readiness check (wait_for_nodes, ~line 375)

conditions[-1].status retrieves the last condition in the array, which is not guaranteed to be the Ready condition. Kubernetes nodes have multiple conditions (MemoryPressure, DiskPressure, PIDPressure, Ready), and their order is not fixed. This could cause nodes to be incorrectly identified as NotReady (or vice versa).

Fix — use JSONPath to explicitly select the Ready condition:
-            "-o", "custom-columns=NAME:.metadata.name,STATUS:.status.conditions[-1].status",
+            "-o", "custom-columns=NAME:.metadata.name,STATUS:.status.conditions[?(@.type=='Ready')].status",
🧹 Unused parameter: config in wait_for_nodes (~line 354)

config: ClusterConfig is declared but never referenced in the function body. Either remove it or use it to scope container restarts to this cluster (by filtering on the k3d-<cluster_name>- node name prefix), which would be safer in environments with multiple clusters running.

Option A — remove it:
-def wait_for_nodes(config: ClusterConfig, max_restart_rounds: int = 2):
+def wait_for_nodes(max_restart_rounds: int = 2):
Option B — use it for safety:
# Only restart containers belonging to this cluster
if not node_name.startswith(f"k3d-{config.cluster_name}-"):
    continue

fixed. A green workflow: https://github.com/ai-dynamo/grove/actions/runs/23110702572?pr=447

… add inline docs - Integrate changes from PR ai-dynamo#489 (KWOK nodes + Go build caching): - Add Go module cache to actions/setup-go fallback step - Add pydantic_settings to Python dep verification - Add run-e2e-startup-order-full Makefile target (real kubelet workers, no KWOK) - Refactor run-e2e-mnnvl-full env vars to E2E_CLUSTER__* / E2E_KWOK__* naming - Switch e2e-cluster-down to use infra-manager.py - Point startup_ordering matrix entry to new make target - Add inline docs explaining DinD memory limit rationale in cluster.py - Expand refreshWorkerNodes doc comment in shared_cluster.go to explain the DinD node instability context that motivated the stale-list fix Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

e2e-cluster-down now calls infra-manager.py, which depends on tenacity (and other packages from operator/hack/requirements.txt). Previously only operator/hack/e2e-cluster/requirements.txt was installed, causing a ModuleNotFoundError on cluster teardown. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Changes were addressed, Goeff is out for the coming weeks

…pport e2e-cluster-up was calling create-e2e-cluster.py (flat E2E_* env vars) while the Makefile intentionally exports E2E_CLUSTER__* format for infra-manager. This caused E2E_CLUSTER__WORKER_NODES=2 to be ignored, falling back to 30 nodes which fails cluster creation in DinD mode. - Point e2e-cluster-up at infra-manager.py setup so all E2E_CLUSTER__*, E2E_CLUSTER__PREPULL_IMAGES, and E2E_KWOK__* exports are honored - Add dind_memory_mode field to ClusterConfig (E2E_CLUSTER__DIND_MEMORY_MODE) - Add --dind-memory-mode flag to infra-manager setup, preserving CI interface - Port DinD kubelet system-reserved logic from create-e2e-cluster.py into infra_manager/cluster.py Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Remove the redundant integer constant DEFAULT_WORKER_MEMORY_MB (which mirrored DEFAULT_WORKER_MEMORY = "150m") and replace it with a parse_memory_mb() function in constants.py. Usage sites in cluster.py now derive the MB value from cfg.worker_memory at runtime (correctly honouring non-default values), and create-e2e-cluster.py uses a local inline parse against DEFAULT_WORKER_MEMORY. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The migration to infra-manager (ab6af81) left wait_for_nodes() as a bare one-shot kubectl wait with no recovery. The original create-e2e-cluster.py detected NotReady nodes after the initial 5m timeout, restarted their Docker containers, and retried up to 2 rounds — handling the DinD flakiness where k3s-agent processes die silently under resource contention. Port that logic to cluster.py so e2e-cluster-up self-heals instead of failing immediately on node-ready timeout. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replacing deprecated Runner and using new infra-manager

Replacing deprecated Runner and using new infra-manager Signed-off-by: Erez Freiberger <enoodle@gmail.com>

replace deprecated runner

aef7fdb

ranrubin requested review from Ronkahn21, gflarity, sanjaychatterjee, shayasoolin and unmarshall as code owners February 24, 2026 10:03

ranrubin marked this pull request as draft February 24, 2026 15:49

ranrubin marked this pull request as ready for review February 25, 2026 12:27

Merge branch 'main' into ci/change-e2e-runner

dc9a0aa

ranrubin marked this pull request as draft February 25, 2026 12:35

ranrubin added 6 commits February 26, 2026 16:38

wip

b7c0858

wip

30b7588

wip

79a5dc6

setting baseline

699b53e

new runner

bedac13

test

a913a93

gflarity previously requested changes Mar 2, 2026

View reviewed changes

Comment thread .github/workflows/build-check-test.yaml Outdated

Comment thread .github/workflows/build-check-test.yaml Outdated

Comment thread operator/hack/e2e-cluster/create-e2e-cluster.py Outdated

test

2395a1d

ranrubin and others added 9 commits March 3, 2026 14:50

Merge branch 'main' into ci/change-e2e-runner

b35e321

one

76ca4fa

one

3132e52

one

d205c0c

one pod per node

0403384

Merge branch 'main' into ci/change-e2e-runner

172b886

original performance

ed5cd5d

ready?

0a283d1

ranrubin added 2 commits March 11, 2026 20:37

ready

eff97d1

ready

7013149

ranrubin marked this pull request as ready for review March 11, 2026 18:53

Merge branch 'main' into ci/change-e2e-runner

64dd9b2

ranrubin requested a review from gflarity March 12, 2026 08:38

Merge branch 'main' into ci/change-e2e-runner

38eff10

Ronkahn21 reviewed Mar 15, 2026

View reviewed changes

ranrubin added 2 commits March 15, 2026 14:49

fix and test

39bc60a

verified

1a2a371

ranrubin and others added 6 commits March 16, 2026 11:47

ready

57c11a1

upgrade k3s and test

bbc9f4d

ready

90b774a

remove E2E_CLUSTER__K3S_IMAGE for run-e2e-mnnvl-full:

15e88c6

Ronkahn21 previously approved these changes Mar 16, 2026

View reviewed changes

ranrubin dismissed Ronkahn21’s stale review via ab6af81 March 16, 2026 13:48

ranrubin and others added 3 commits March 17, 2026 11:19

tidy constants

9570420

Ronkahn21 approved these changes Mar 17, 2026

View reviewed changes

danbar2 approved these changes Mar 17, 2026

View reviewed changes

ranrubin merged commit cc3551e into ai-dynamo:main Mar 17, 2026
11 checks passed

ranrubin deleted the ci/change-e2e-runner branch March 17, 2026 11:19

danbar2 pushed a commit to danbar2/grove that referenced this pull request Mar 18, 2026

replace deprecated runner (ai-dynamo#447)

20e098b

Replacing deprecated Runner and using new infra-manager

enoodle pushed a commit to enoodle/grove that referenced this pull request Mar 24, 2026

replace deprecated runner (ai-dynamo#447)

a0d0638

Replacing deprecated Runner and using new infra-manager Signed-off-by: Erez Freiberger <enoodle@gmail.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

replace deprecated runner#447

replace deprecated runner#447
ranrubin merged 48 commits into
ai-dynamo:mainfrom
ranrubin:ci/change-e2e-runner

ranrubin commented Feb 24, 2026 •

edited

Loading

Uh oh!

sanjaychatterjee commented Feb 24, 2026

Uh oh!

gflarity left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ranrubin commented Mar 2, 2026

Uh oh!

copy-pr-bot Bot commented Mar 5, 2026

Uh oh!

ranrubin commented Mar 12, 2026

Uh oh!

Ronkahn21 left a comment

Uh oh!

ranrubin commented Mar 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

ranrubin commented Feb 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What type of PR is this?

What this PR does / why we need it:

Special notes for your reviewer:

Does this PR introduce a API change?

Additional documentation e.g., enhancement proposals, usage docs, etc.:

Uh oh!

sanjaychatterjee commented Feb 24, 2026

Uh oh!

gflarity left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ranrubin commented Mar 2, 2026

Uh oh!

copy-pr-bot Bot commented Mar 5, 2026

Uh oh!

ranrubin commented Mar 12, 2026

Uh oh!

Ronkahn21 left a comment

Choose a reason for hiding this comment

Uh oh!

ranrubin commented Mar 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

ranrubin commented Feb 24, 2026 •

edited

Loading