Skip to content

replace deprecated runner#447

Merged
ranrubin merged 48 commits into
ai-dynamo:mainfrom
ranrubin:ci/change-e2e-runner
Mar 17, 2026
Merged

replace deprecated runner#447
ranrubin merged 48 commits into
ai-dynamo:mainfrom
ranrubin:ci/change-e2e-runner

Conversation

@ranrubin

@ranrubin ranrubin commented Feb 24, 2026

Copy link
Copy Markdown
Contributor

What type of PR is this?

What this PR does / why we need it:

The runner that grove currently use cpu-amd-m5-2xlarge is on Dynamo's Legacy cluster. That runner is the only runner left on that cluster and is blocking its deletion.

Closes #448

Added prod-grove-e2e-v1 runner to Dynamo's production cluster. The Dockerfile can be found here, and the configuration can be found here.

Because our production runner uses DinD with a k8s native sidecar, I had to make some changes to the test infra:

Memory limits--agents-memory 150m doesn't work in DinD (broken /proc/meminfo bind-mount). Several tests depend on nodes having limited memory to verify pods stay Pending when resources are insufficient. We now use kubelet system-reserved to achieve the same effective allocatable memory, triggered via a new --dind-memory-mode flag.

Stale node list — With 30 k3d nodes, occasionally one dies mid-run. Our node monitor replaces it, but PrepareForTest still had the old node cached, causing all subsequent tests to fail trying to cordon a deleted node. Fixed by refreshing the node list from the cluster before each test.
Both GS and TAS suites passing on the new runners

Special notes for your reviewer:

Does this PR introduce a API change?

None


Additional documentation e.g., enhancement proposals, usage docs, etc.:


@sanjaychatterjee

Copy link
Copy Markdown
Collaborator

@gflarity @danbar2 @Ronkahn21 ptal.

@ranrubin ranrubin marked this pull request as ready for review February 25, 2026 12:27
@ranrubin ranrubin marked this pull request as draft February 25, 2026 12:35

@gflarity gflarity left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the memory and CPU of this new runner? The reason we had selected the current one was that it was the only one with the CPU and RAM our tests needed at the time. Seems like these new runners might not have enough RAM for us.

Comment thread .github/workflows/build-check-test.yaml Outdated
Comment thread .github/workflows/build-check-test.yaml Outdated
Comment thread operator/hack/e2e-cluster/create-e2e-cluster.py Outdated
@ranrubin

ranrubin commented Mar 2, 2026

Copy link
Copy Markdown
Contributor Author

@gflarity this is a draft pull request, I will address your comments when it's ready for review :)

ranrubin and others added 9 commits March 3, 2026 14:50
The new prod-grove-e2e-v1 runner is slower at starting k3d nodes than
the old cpu-amd-m5-2xlarge runner. With 30 nodes, only 19 became Ready
within the 5-minute kubectl wait timeout, causing E2E cluster-up to fail.
Doubled the timeout to 10m to accommodate the new runner's startup speed.

Also temporarily limit matrix to Topology_Aware_Scheduling only for validation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@copy-pr-bot

copy-pr-bot Bot commented Mar 5, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ranrubin ranrubin marked this pull request as ready for review March 11, 2026 18:53
@ranrubin

Copy link
Copy Markdown
Contributor Author

What is the memory and CPU of this new runner? The reason we had selected the current one was that it was the only one with the CPU and RAM our tests needed at the time. Seems like these new runners might not have enough RAM for us.

You can see the complete configuration here. I confirmed that the resource suffice

@ranrubin ranrubin requested a review from gflarity March 12, 2026 08:38

@Ronkahn21 Ronkahn21 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ran CodeRabbit review with project CLAUDE.md rules — two findings in operator/hack/e2e-cluster/create-e2e-cluster.py:


🐛 Bug: Wrong node condition used for readiness check (wait_for_nodes, ~line 375)

conditions[-1].status retrieves the last condition in the array, which is not guaranteed to be the Ready condition. Kubernetes nodes have multiple conditions (MemoryPressure, DiskPressure, PIDPressure, Ready), and their order is not fixed. This could cause nodes to be incorrectly identified as NotReady (or vice versa).

Fix — use JSONPath to explicitly select the Ready condition:

-            "-o", "custom-columns=NAME:.metadata.name,STATUS:.status.conditions[-1].status",
+            "-o", "custom-columns=NAME:.metadata.name,STATUS:.status.conditions[?(@.type=='Ready')].status",

🧹 Unused parameter: config in wait_for_nodes (~line 354)

config: ClusterConfig is declared but never referenced in the function body. Either remove it or use it to scope container restarts to this cluster (by filtering on the k3d-<cluster_name>- node name prefix), which would be safer in environments with multiple clusters running.

Option A — remove it:

-def wait_for_nodes(config: ClusterConfig, max_restart_rounds: int = 2):
+def wait_for_nodes(max_restart_rounds: int = 2):

Option B — use it for safety:

# Only restart containers belonging to this cluster
if not node_name.startswith(f"k3d-{config.cluster_name}-"):
    continue

@ranrubin

Copy link
Copy Markdown
Contributor Author

Ran CodeRabbit review with project CLAUDE.md rules — two findings in operator/hack/e2e-cluster/create-e2e-cluster.py:

🐛 Bug: Wrong node condition used for readiness check (wait_for_nodes, ~line 375)

conditions[-1].status retrieves the last condition in the array, which is not guaranteed to be the Ready condition. Kubernetes nodes have multiple conditions (MemoryPressure, DiskPressure, PIDPressure, Ready), and their order is not fixed. This could cause nodes to be incorrectly identified as NotReady (or vice versa).

Fix — use JSONPath to explicitly select the Ready condition:

-            "-o", "custom-columns=NAME:.metadata.name,STATUS:.status.conditions[-1].status",
+            "-o", "custom-columns=NAME:.metadata.name,STATUS:.status.conditions[?(@.type=='Ready')].status",

🧹 Unused parameter: config in wait_for_nodes (~line 354)

config: ClusterConfig is declared but never referenced in the function body. Either remove it or use it to scope container restarts to this cluster (by filtering on the k3d-<cluster_name>- node name prefix), which would be safer in environments with multiple clusters running.

Option A — remove it:

-def wait_for_nodes(config: ClusterConfig, max_restart_rounds: int = 2):
+def wait_for_nodes(max_restart_rounds: int = 2):

Option B — use it for safety:

# Only restart containers belonging to this cluster
if not node_name.startswith(f"k3d-{config.cluster_name}-"):
    continue

fixed. A green workflow: https://github.com/ai-dynamo/grove/actions/runs/23110702572?pr=447

ranrubin and others added 6 commits March 16, 2026 11:47
… add inline docs

- Integrate changes from PR ai-dynamo#489 (KWOK nodes + Go build caching):
  - Add Go module cache to actions/setup-go fallback step
  - Add pydantic_settings to Python dep verification
  - Add run-e2e-startup-order-full Makefile target (real kubelet workers, no KWOK)
  - Refactor run-e2e-mnnvl-full env vars to E2E_CLUSTER__* / E2E_KWOK__* naming
  - Switch e2e-cluster-down to use infra-manager.py
  - Point startup_ordering matrix entry to new make target
- Add inline docs explaining DinD memory limit rationale in cluster.py
- Expand refreshWorkerNodes doc comment in shared_cluster.go to explain
  the DinD node instability context that motivated the stale-list fix

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
e2e-cluster-down now calls infra-manager.py, which depends on tenacity
(and other packages from operator/hack/requirements.txt). Previously
only operator/hack/e2e-cluster/requirements.txt was installed, causing
a ModuleNotFoundError on cluster teardown.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Ronkahn21
Ronkahn21 previously approved these changes Mar 16, 2026
@Ronkahn21 Ronkahn21 dismissed gflarity’s stale review March 16, 2026 13:34

Changes were addressed, Goeff is out for the coming weeks

…pport

e2e-cluster-up was calling create-e2e-cluster.py (flat E2E_* env vars)
while the Makefile intentionally exports E2E_CLUSTER__* format for
infra-manager. This caused E2E_CLUSTER__WORKER_NODES=2 to be ignored,
falling back to 30 nodes which fails cluster creation in DinD mode.

- Point e2e-cluster-up at infra-manager.py setup so all E2E_CLUSTER__*,
  E2E_CLUSTER__PREPULL_IMAGES, and E2E_KWOK__* exports are honored
- Add dind_memory_mode field to ClusterConfig (E2E_CLUSTER__DIND_MEMORY_MODE)
- Add --dind-memory-mode flag to infra-manager setup, preserving CI interface
- Port DinD kubelet system-reserved logic from create-e2e-cluster.py into
  infra_manager/cluster.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ranrubin and others added 3 commits March 17, 2026 11:19
Remove the redundant integer constant DEFAULT_WORKER_MEMORY_MB (which
mirrored DEFAULT_WORKER_MEMORY = "150m") and replace it with a
parse_memory_mb() function in constants.py. Usage sites in cluster.py
now derive the MB value from cfg.worker_memory at runtime (correctly
honouring non-default values), and create-e2e-cluster.py uses a local
inline parse against DEFAULT_WORKER_MEMORY.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The migration to infra-manager (ab6af81) left wait_for_nodes() as a
bare one-shot kubectl wait with no recovery. The original
create-e2e-cluster.py detected NotReady nodes after the initial 5m
timeout, restarted their Docker containers, and retried up to 2 rounds
— handling the DinD flakiness where k3s-agent processes die silently
under resource contention. Port that logic to cluster.py so e2e-cluster-up
self-heals instead of failing immediately on node-ready timeout.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ranrubin ranrubin merged commit cc3551e into ai-dynamo:main Mar 17, 2026
11 checks passed
@ranrubin ranrubin deleted the ci/change-e2e-runner branch March 17, 2026 11:19
danbar2 pushed a commit to danbar2/grove that referenced this pull request Mar 18, 2026
Replacing deprecated Runner and using new infra-manager
enoodle pushed a commit to enoodle/grove that referenced this pull request Mar 24, 2026
Replacing deprecated Runner and using new infra-manager 
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CI is using a deprecated runner

5 participants