replace deprecated runner#447
Conversation
|
@gflarity @danbar2 @Ronkahn21 ptal. |
gflarity
left a comment
There was a problem hiding this comment.
What is the memory and CPU of this new runner? The reason we had selected the current one was that it was the only one with the CPU and RAM our tests needed at the time. Seems like these new runners might not have enough RAM for us.
|
@gflarity this is a draft pull request, I will address your comments when it's ready for review :) |
The new prod-grove-e2e-v1 runner is slower at starting k3d nodes than the old cpu-amd-m5-2xlarge runner. With 30 nodes, only 19 became Ready within the 5-minute kubectl wait timeout, causing E2E cluster-up to fail. Doubled the timeout to 10m to accommodate the new runner's startup speed. Also temporarily limit matrix to Topology_Aware_Scheduling only for validation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
You can see the complete configuration here. I confirmed that the resource suffice |
Ronkahn21
left a comment
There was a problem hiding this comment.
Ran CodeRabbit review with project CLAUDE.md rules — two findings in operator/hack/e2e-cluster/create-e2e-cluster.py:
🐛 Bug: Wrong node condition used for readiness check (wait_for_nodes, ~line 375)
conditions[-1].status retrieves the last condition in the array, which is not guaranteed to be the Ready condition. Kubernetes nodes have multiple conditions (MemoryPressure, DiskPressure, PIDPressure, Ready), and their order is not fixed. This could cause nodes to be incorrectly identified as NotReady (or vice versa).
Fix — use JSONPath to explicitly select the Ready condition:
- "-o", "custom-columns=NAME:.metadata.name,STATUS:.status.conditions[-1].status",
+ "-o", "custom-columns=NAME:.metadata.name,STATUS:.status.conditions[?(@.type=='Ready')].status",🧹 Unused parameter: config in wait_for_nodes (~line 354)
config: ClusterConfig is declared but never referenced in the function body. Either remove it or use it to scope container restarts to this cluster (by filtering on the k3d-<cluster_name>- node name prefix), which would be safer in environments with multiple clusters running.
Option A — remove it:
-def wait_for_nodes(config: ClusterConfig, max_restart_rounds: int = 2):
+def wait_for_nodes(max_restart_rounds: int = 2):Option B — use it for safety:
# Only restart containers belonging to this cluster
if not node_name.startswith(f"k3d-{config.cluster_name}-"):
continue
fixed. A green workflow: https://github.com/ai-dynamo/grove/actions/runs/23110702572?pr=447 |
… add inline docs - Integrate changes from PR ai-dynamo#489 (KWOK nodes + Go build caching): - Add Go module cache to actions/setup-go fallback step - Add pydantic_settings to Python dep verification - Add run-e2e-startup-order-full Makefile target (real kubelet workers, no KWOK) - Refactor run-e2e-mnnvl-full env vars to E2E_CLUSTER__* / E2E_KWOK__* naming - Switch e2e-cluster-down to use infra-manager.py - Point startup_ordering matrix entry to new make target - Add inline docs explaining DinD memory limit rationale in cluster.py - Expand refreshWorkerNodes doc comment in shared_cluster.go to explain the DinD node instability context that motivated the stale-list fix Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
e2e-cluster-down now calls infra-manager.py, which depends on tenacity (and other packages from operator/hack/requirements.txt). Previously only operator/hack/e2e-cluster/requirements.txt was installed, causing a ModuleNotFoundError on cluster teardown. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Changes were addressed, Goeff is out for the coming weeks
…pport e2e-cluster-up was calling create-e2e-cluster.py (flat E2E_* env vars) while the Makefile intentionally exports E2E_CLUSTER__* format for infra-manager. This caused E2E_CLUSTER__WORKER_NODES=2 to be ignored, falling back to 30 nodes which fails cluster creation in DinD mode. - Point e2e-cluster-up at infra-manager.py setup so all E2E_CLUSTER__*, E2E_CLUSTER__PREPULL_IMAGES, and E2E_KWOK__* exports are honored - Add dind_memory_mode field to ClusterConfig (E2E_CLUSTER__DIND_MEMORY_MODE) - Add --dind-memory-mode flag to infra-manager setup, preserving CI interface - Port DinD kubelet system-reserved logic from create-e2e-cluster.py into infra_manager/cluster.py Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove the redundant integer constant DEFAULT_WORKER_MEMORY_MB (which mirrored DEFAULT_WORKER_MEMORY = "150m") and replace it with a parse_memory_mb() function in constants.py. Usage sites in cluster.py now derive the MB value from cfg.worker_memory at runtime (correctly honouring non-default values), and create-e2e-cluster.py uses a local inline parse against DEFAULT_WORKER_MEMORY. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The migration to infra-manager (ab6af81) left wait_for_nodes() as a bare one-shot kubectl wait with no recovery. The original create-e2e-cluster.py detected NotReady nodes after the initial 5m timeout, restarted their Docker containers, and retried up to 2 rounds — handling the DinD flakiness where k3s-agent processes die silently under resource contention. Port that logic to cluster.py so e2e-cluster-up self-heals instead of failing immediately on node-ready timeout. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replacing deprecated Runner and using new infra-manager
Replacing deprecated Runner and using new infra-manager Signed-off-by: Erez Freiberger <enoodle@gmail.com>
What type of PR is this?
What this PR does / why we need it:
The runner that grove currently use
cpu-amd-m5-2xlargeis on Dynamo's Legacy cluster. That runner is the only runner left on that cluster and is blocking its deletion.Closes #448
Added
prod-grove-e2e-v1runner to Dynamo's production cluster. The Dockerfile can be found here, and the configuration can be found here.Because our production runner uses DinD with a k8s native sidecar, I had to make some changes to the test infra:
Memory limits —
--agents-memory 150mdoesn't work in DinD (broken/proc/meminfobind-mount). Several tests depend on nodes having limited memory to verify pods stay Pending when resources are insufficient. We now use kubeletsystem-reservedto achieve the same effective allocatable memory, triggered via a new--dind-memory-modeflag.Stale node list — With 30 k3d nodes, occasionally one dies mid-run. Our node monitor replaces it, but
PrepareForTeststill had the old node cached, causing all subsequent tests to fail trying to cordon a deleted node. Fixed by refreshing the node list from the cluster before each test.Both GS and TAS suites passing on the new runners
Special notes for your reviewer:
Does this PR introduce a API change?
None
Additional documentation e.g., enhancement proposals, usage docs, etc.: