feat(planner): power infrastructure — pod annotation, RBAC, config, Prometheus by kaim-eng · Pull Request #9683 · ai-dynamo/dynamo

kaim-eng · 2026-05-18T15:54:08Z

Part of the PR #9369 split plan. This is PR 2 of 6 (PR 1b - Planner Power Infrastructure). Predecessor: #9682 - Power Agent DaemonSet (base = pr1a/power-agent; #9682 must land first) Successor: #9684 (Draft) - Power budget enforcement ## Summary Planner-side wires from config to pod annotation. Nothing in this PR changes the scaling policy to consume power yet; Phase 1 emits and observes the power infrastructure, while the AIC closed-loop optimizer and admission coupling remain deferred. Current branch shape: 32 files changed against pr1a/power-agent. The 6 shared files with later planner PRs are hand-authored here at the Phase-1 line subset from the split plan. - 6 power-aware config fields: enable_power_awareness (default False), total_gpu_power_limit, prefill_engine_gpu_power_limit, decode_engine_gpu_power_limit, power_agent_safe_default_watts, and power_annotation_interval_seconds (default 60s sweep throttle). The per-engine power-limit fields are ge=1 validated. - _apply_power_annotations() and _publish_power_budget_metrics() in core/base.py: emit dynamo.nvidia.com/gpu-power-limit per worker pod and expose Phase-1 power-budget gauges. Advisory mode is a hard no-op for pod mutation, and steady-state annotation sweeps are throttled with a scale-up force window. - 7e74aa65ea fixes the sweep force-window logic for the orchestrator path: _scaling_up() now reads cached WorkerCounts before PSM internals and never constructs a PSM just to classify scale-up. - KubernetesConnector / KubernetesAPI: patch_pod_annotation, list_pods_by_label, get_component_pods, list_frontend_pods, and resolve_frontend_http_port. - Prometheus plumbing: DCGM get_total_dgd_power() plus the NaN-filter fix in _get_metric_increase. - Operator-chart planner pod RBAC: pods [get, list, watch, patch] for the planner ServiceAccount in both namespace and cluster-wide modes. The platform chart stays at the main baseline 1.3.0; the obsolete standalone deploy/planner-pod-rbac-dev.yaml patch was removed. - examples/multi-dgd-live-test/: multi-DGD live-repro scaffolding and test_multi_dgd_live.py; personal identifiers were scrubbed. Reviewer context: - Design context: docs/design-docs/powerplanner-design.md sections 5.5, 6.4, and 7 (lands later in the stack; readable from pr5/docs-devenv). - Dev environment: docs/components/planner/dpp-dev-env.md section 8 (also in PR 5). - Split-plan context: sections 2.2, 2.2.2, 3.2, and 5. ## Validation - Required/current GitHub checks pass at 7e74aa65ea; rust jobs and team-request are intentionally skipped by filters. - Dev-pod validation from 2026-05-18, before the final 7e74aa65ea sweep-throttle fix: - Full planner sweep, excluding test_virtual_connector.py: 483 passed, 2 skipped in 12.46s. - Same sweep with RUN_DISRUPTIVE_TESTS=1: 484 passed, 1 skipped. - Phase-1 critical files (test_actuation_knobs.py, test_kubernetes_connector.py, test_prometheus.py): 121 / 121 passed. - test_actuation_knobs_live.py: 9 passed, 2 skipped; with disruptive opt-in: 10 passed, 1 skipped. - Power Agent regression at PR 1b tip at that time: 43 / 43 passed. - 7e74aa65ea adds 3 focused unit tests in test_actuation_knobs.py for cached worker-count precedence, orchestrator/no-PSM behavior, and non-scale-up cached counts. These assertions are lint/import clean and required CI is green, but the dev-pod planner pytest count has not been refreshed after this final commit. - Pre-commit on origin/main..pr1b/planner-infra: 14 hooks pass, 4 skipped; marker compliance was reconfirmed in a Linux pod with Missing sets: 0. The Windows host has the known pre-existing POSIX-only fcntl import issue in untouched test utilities; Linux CI is unaffected. - DCO: commits are signed off. Expected skips from the May 18 dev-pod run: - TestPostBusyThresholdLive::test_post_busy_threshold_returns_2xx: frontend image lacks /busy_threshold; admission control lands later. - TestScalingRealMutation::test_set_component_replicas_mutates_dgd_spec: opt-in via RUN_DISRUPTIVE_TESTS=1; passes when enabled. ## Merge Strategy Rebase-and-merge, no squash, per split-plan section 4.3. ## Stack Order Merge #9682 first. After #9682 lands, #9683 can be rebased/retargeted onto the updated base and merged. #9790 is a sibling stacked on #9682 and can merge after #9682 as well.

copy-pr-bot · 2026-05-18T15:54:13Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Cross-DGD complement to test_actuation_knobs_live.py (the single-DGD live test that already ships with this PR). Validates invariants that single-DGD tests structurally cannot reach: - A2 Three planners patch DISJOINT pod sets in one namespace - A3 AggPlanner-C uses only the DECODE branch (no PREFILL calls) - B2 Power Agent applies 8 distinct caps on one tick - B4 Happy-path zero-counter invariant - B5 Multi-pod conflict + safe-default behaviour (skipped pending Finding #7 — CDI-safe bystander rewrite tracked separately) - 8.5.6 / 8.6.2 NVML / DCGM cap-application proven by reading the live driver state (not via mocks) Layout introduced by this commit: components/src/dynamo/planner/tests/integration/test_multi_dgd_live.py ~1000 LoC pytest module. Opt-in via RUN_MULTI_DGD_LIVE=1 + TEST_NODE=<8-GPU node> env vars; auto-detects Path A (production shape: 3 real DGDs + power-aware planner) vs Path B (5 stub-worker pods with hardcoded annotations) and skips Path-A-only tests cleanly when running Path B. DCGM tests gated by DCGM_HOSTENGINE_AVAILABLE=1. examples/multi-dgd-live-test/ Manifests + helm values overrides + teardown script + README walk-through that an operator applies before running the test: 00-namespace.yaml — namespace + pull secret 10-power-agent-values-nvml.yaml — Power Agent NVML helm values 11-power-agent-values-dcgm.yaml — Power Agent DCGM helm values 20-nvidia-dcgm-standalone-ds.yaml — DCGM 4.5.1 hostengine DS 30/31/32-vllm-dgd-{A,B,C}.yaml — three DGDs (Path A) 40-conflict-bystander-pod.yaml — multi-pod conflict fixture 50-stub-workers.yaml — Path-B stub workers 99-teardown.sh — idempotent cleanup w/ cap- restore verification README.md — full reproduction playbook including a 2026-05-21 live run record + 11 findings docs/design-docs/multi-tenant-test.md Design doc (§8 — multi-DGD test plan) that establishes the ground-truth topology, the planner / agent contract under test, and the rationale for the artifact split above. pyproject.toml Registers two new test markers consumed by the live test's `pytestmark`: multi_dgd_live — scenario selector for this specific file power_agent — shared with components/power_agent/tests/ Module-level gating fails fast on misconfiguration (missing TEST_NODE, missing kubeconfig) with actionable error messages, and skips silently when RUN_MULTI_DGD_LIVE is unset — so this file is a no-op under the default unit-test invocation. Verified: pytest --collect-only is a clean no-op without the env vars, and full-import works under RUN_MULTI_DGD_LIVE=1 (collection fails cleanly on absent k8s creds, not ImportError). Refs: docs/design-docs/multi-tenant-test.md §8, PR #9683 review. Signed-off-by: Kai Ma <kaim@nvidia.com>

…nd DCGM 4.5 Two stacked defects surfaced on the 2026-05-21T20:11Z Path-B re-run caused `test_dcgm_per_gpu_cap_matches_topology` and `test_dcgm_target_config_registered` to skip silently when they should have PASSED: 1. **`kubernetes-client>=35.0.0` mangles JSON pod-exec stdout.** `kubernetes.stream.stream(..., _preload_content=True)` sniffs JSON-shaped stdout, parses it client-side, and returns `str(dict)`. That means `printf '%s' '{"a":1}'` comes back as the literal 8-char string `{'a': 1}` (single quotes, Python-dict repr), which then trips `json.loads` at every read-back site — including the four inside `test_dcgm_*`. The intended-as-defensive "did not return JSON" skip branch fired instead. 2. **DCGM 4.5+ rotated the `dcgmi config --get -g <id> -j` JSON shape.** Pre-4.5: `body` is a list of per-GPU entries keyed by `"Power Limit (W)"` with flat `{"Current": "...", "Target": "..."}` children. 4.5+: `body` is a dict of fields keyed by `"Power Limit"` (no `(W)` suffix) with a `"children"` wrapper whose leaves are `{"value": "..."}` objects. Three breaking changes in one upgrade — `body` flipped list→dict, the field name lost its suffix, the leaves gained a `value` wrapper. The previous implementation handled neither, so even if defect (1) were absent, the strict per-field reader would still skip. This commit lands a single combined fix: - **`_exec_in_pod`**: switch from `_preload_content=True` (default) to `_preload_content=False` + raw stdout/stderr accumulation off the `WSClient` via `peek_stdout`/`read_stdout`. Bypasses the mangling regardless of which `kubernetes-client` version is installed, and is the documented "I want bytes back" pattern in the kubernetes-client examples. - **New `_extract_power_limit_from_dcgmi_body(body)` helper**: walks both shapes uniformly and returns `(current_w, target_w)` ints or `(None, None)` if neither shape matches. Both DCGM read-back tests (`test_dcgm_per_gpu_cap_matches_topology` and `test_dcgm_target_config_registered`) now consume this helper instead of inlining duplicated nested-`dict.get` chains. Verified locally: - synthetic 4.0-shape payload `[{"Power Limit (W)": {"Current": "350", "Target": "350"}}]` → `(350, 350)` ✓ - synthetic 4.5-shape payload `{"Power Limit": {"children": {"Current": {"value": "375"}, "Target": {"value": "375"}}}}` → `(375, 375)` ✓ - missing field, wrong types, partial children: graceful `(None, None)` / `(int, None)` returns ✓ - test module still collects cleanly (`pytest --collect-only`) `examples/multi-dgd-live-test/README.md` Finding #11 updated to "Status: fixed" with a cross-reference to this commit and the expected SKIPPED→PASSED flip on the next live re-run. Refs: PR #9683 review (Finding #11 in examples/multi-dgd-live-test/README.md). Signed-off-by: Kai Ma <kaim@nvidia.com>

`TestMultiPodConflict` previously created a bystander pod with `NVIDIA_VISIBLE_DEVICES=<gpu-uuid>` to pin it to a specific GPU. AKS clusters running the NVIDIA Container Toolkit in CDI device mode (the dpp-dev-env default, and the direction NVIDIA is pushing for newer operator releases) reject that with `unresolvable CDI devices management.nvidia.com/gpu=*` and refuse to admit the pod — so the test class was carrying a `@pytest.mark.skip(_CDI_CONFLICT_SKIP_REASON)` decorator that disabled the whole §8.5.7 contract on every run. This commit lands the CDI-mode-safe rewrite: - Bystander now claims `nvidia.com/gpu: 1` from the device plugin (both `requests` and `limits` per K8s extended-resource rules). CDI-mode-agnostic — the device plugin owns the actual GPU wiring via its own machinery. - `examples/multi-dgd-live-test/40-conflict-bystander-pod.yaml` mirrors the inline Python manifest exactly so YAML-debugging and test-debugging agree (the README explicitly calls this out). - New helper `_per_gpu_compute_pids()` snapshots `nvmlDeviceGetComputeRunningProcesses` across all 8 GPUs from the agent pod (which has hostPID + NVML access) and returns `{gpu_idx: {pid, ...}}`. The test polls this until it sees a new PID on some index after the bystander reaches Running — that index IS the GPU the device plugin assigned, no NVIDIA_VISIBLE_DEVICES pin needed. - Test renamed `test_conflict_triggers_safe_default_on_gpu0_only` → `test_conflict_triggers_safe_default_on_assigned_gpu` to reflect the index-agnostic contract. Blast-radius / recovery asserts key off the discovered `assigned_gpu`; the loop over the other GPUs iterates `range(8)` and skips that index instead of hard-coding `range(1, 8)`. - Class-level `@pytest.mark.skip` decorator removed. The supporting `_CDI_CONFLICT_SKIP_REASON` constant is dropped too — kept neither in code nor as a runtime branch, since neither path is needed once the CDI-safe flow works. - `bystander_uuid` class fixture removed (no longer needed — the UUID/index discovery is post-creation, not pre). `examples/multi-dgd-live-test/README.md` updated: - Finding #7 marked **Status: fixed** with cross-reference to this commit and an explicit description of the new (a)/(b)/(c) flow. - Expected-outcome counts updated: Path B now expects 5 PASSED / 2 SKIPPED (was 4 PASSED / 3 SKIPPED — multi-pod conflict moves from SKIP to PASS); DCGM pass expects 7 PASSED / 2 SKIPPED on the next live run. - Live-run record row updated to note the rewrite and expected PASS outcome. Verified locally: test module compiles + collects cleanly. The end-to-end CDI-vs-non-CDI assignment behaviour is by construction the device plugin's responsibility, so the failure mode of this rewrite is bounded to "device plugin returns no GPU within 90 s" — handled with a clear pytest.fail message that includes a before/after PID snapshot for debugging. Refs: PR #9683 review (Finding #7 in examples/multi-dgd-live-test/README.md). Signed-off-by: Kai Ma <kaim@nvidia.com>

…dd SPDX header Two follow-up fixes to commit 552be92 caught in PR #9683 review. 1. Bystander cannot schedule. The previous CDI-safe rewrite swapped `NVIDIA_VISIBLE_DEVICES=<uuid>` for a `nvidia.com/gpu: 1` device- plugin claim, expecting the plugin to allocate a free GPU. But the 5 stub workers in `50-stub-workers.yaml` already claim every `nvidia.com/gpu` slot on the test node (1+1+2+2+2 = 8 of 8), so the bystander's claim sits Pending forever and the conflict never materialises. Switch to the same pattern the Power Agent itself uses (`deploy/helm/charts/power-agent/values.yaml`): `NVIDIA_VISIBLE_DEVICES: "all"` exposes every GPU through the toolkit without consuming a device-plugin claim. The bystander pins itself to `cuda:0` (deterministically the first visible device — same NVML index 0 as the agent, since both have `NVIDIA_VISIBLE_DEVICES=all`). The existing PID-based discovery loop in the test still verifies which physical GPU the context landed on rather than trusting the mapping, so the test stays robust under future toolkit / CDI changes. Touches the inline `bystander_manifest` in `test_multi_dgd_live.py::TestMultiPodConflict` and the YAML companion `examples/multi-dgd-live-test/40-conflict-bystander-pod.yaml`, plus the README's Finding #7 status to document both non-options (the original `NVIDIA_VISIBLE_DEVICES=<uuid>` CDI rejection AND the `nvidia.com/gpu: 1` over-subscription) and the surviving fix-shape. 2. Copyright check failed. `40-conflict-bystander-pod.yaml` was imported from `.aic_sandbox/` in commit 004df25 without the standard two-line SPDX header that every other YAML in `examples/multi-dgd-live-test/` carries (00-namespace through 50-stub-workers — verified via head-of-file scan). Add the header. Signed-off-by: Kai Ma <kaim@nvidia.com>

… lychee Lychee on PR #9683 fails on 5 URLs that aren't really URLs (caught in review). All five are inherited from f7f3226 (the original "feat(planner): power infrastructure …" commit on this branch) but block #9683 from going green either way. The 5 cited locations and fix shape: - `components/src/dynamo/planner/config/defaults.py:48` - `components/src/dynamo/planner/config/planner_config.py:147` In-cluster Prometheus URL `http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090`. `*.svc.cluster.local` is the canonical Kubernetes-internal DNS suffix and is not externally resolvable. Add an ignore entry for the suffix so any cluster-internal default URL passes lychee. - `components/src/dynamo/planner/config/planner_config.py:317` Docstring placeholder `http://host:port/`. - `deploy/helm/charts/platform/README.md:118-119` Helm-rendered Format hints `"nats://hostname:port"` / `"http://hostname:port"` / `"https://hostname:port"`. Add three anchored ignore entries for these exact placeholder forms (kept narrow so a real `*.com` hostname can't accidentally slip past as `hostname:port`). - `deploy/helm/charts/platform/Chart.yaml:23` `home: https://nvidia.com`. The same file already uses `https://www.nvidia.com` on line 19 for `url:` and that one passes lychee, so the bare-apex form on line 23 must be hitting a 4xx (Cloudflare bot challenge, redirect-loop, etc.). Canonicalise to the `www.` form so the two `nvidia.com` URLs in the same file agree, and no ignore entry is needed. Verified all 5 URLs match the new ignore patterns via local regex spot-check; `https://www.nvidia.com` (existing line 19 + line 23 fix) deliberately does NOT match any ignore pattern. Signed-off-by: Kai Ma <kaim@nvidia.com>

Cross-DGD complement to test_actuation_knobs_live.py (the single-DGD live test that already ships with this PR). Validates invariants that single-DGD tests structurally cannot reach: - A2 Three planners patch DISJOINT pod sets in one namespace - A3 AggPlanner-C uses only the DECODE branch (no PREFILL calls) - B2 Power Agent applies 8 distinct caps on one tick - B4 Happy-path zero-counter invariant - B5 Multi-pod conflict + safe-default behaviour (skipped pending Finding #7 — CDI-safe bystander rewrite tracked separately) - 8.5.6 / 8.6.2 NVML / DCGM cap-application proven by reading the live driver state (not via mocks) Layout introduced by this commit: components/src/dynamo/planner/tests/integration/test_multi_dgd_live.py ~1000 LoC pytest module. Opt-in via RUN_MULTI_DGD_LIVE=1 + TEST_NODE=<8-GPU node> env vars; auto-detects Path A (production shape: 3 real DGDs + power-aware planner) vs Path B (5 stub-worker pods with hardcoded annotations) and skips Path-A-only tests cleanly when running Path B. DCGM tests gated by DCGM_HOSTENGINE_AVAILABLE=1. examples/multi-dgd-live-test/ Manifests + helm values overrides + teardown script + README walk-through that an operator applies before running the test: 00-namespace.yaml — namespace + pull secret 10-power-agent-values-nvml.yaml — Power Agent NVML helm values 11-power-agent-values-dcgm.yaml — Power Agent DCGM helm values 20-nvidia-dcgm-standalone-ds.yaml — DCGM 4.5.1 hostengine DS 30/31/32-vllm-dgd-{A,B,C}.yaml — three DGDs (Path A) 40-conflict-bystander-pod.yaml — multi-pod conflict fixture 50-stub-workers.yaml — Path-B stub workers 99-teardown.sh — idempotent cleanup w/ cap- restore verification README.md — full reproduction playbook including a 2026-05-21 live run record + 11 findings docs/design-docs/multi-tenant-test.md Design doc (§8 — multi-DGD test plan) that establishes the ground-truth topology, the planner / agent contract under test, and the rationale for the artifact split above. pyproject.toml Registers two new test markers consumed by the live test's `pytestmark`: multi_dgd_live — scenario selector for this specific file power_agent — shared with components/power_agent/tests/ Module-level gating fails fast on misconfiguration (missing TEST_NODE, missing kubeconfig) with actionable error messages, and skips silently when RUN_MULTI_DGD_LIVE is unset — so this file is a no-op under the default unit-test invocation. Verified: pytest --collect-only is a clean no-op without the env vars, and full-import works under RUN_MULTI_DGD_LIVE=1 (collection fails cleanly on absent k8s creds, not ImportError). Refs: docs/design-docs/multi-tenant-test.md §8, PR #9683 review. Signed-off-by: Kai Ma <kaim@nvidia.com>

…nd DCGM 4.5 Two stacked defects surfaced on the 2026-05-21T20:11Z Path-B re-run caused `test_dcgm_per_gpu_cap_matches_topology` and `test_dcgm_target_config_registered` to skip silently when they should have PASSED: 1. **`kubernetes-client>=35.0.0` mangles JSON pod-exec stdout.** `kubernetes.stream.stream(..., _preload_content=True)` sniffs JSON-shaped stdout, parses it client-side, and returns `str(dict)`. That means `printf '%s' '{"a":1}'` comes back as the literal 8-char string `{'a': 1}` (single quotes, Python-dict repr), which then trips `json.loads` at every read-back site — including the four inside `test_dcgm_*`. The intended-as-defensive "did not return JSON" skip branch fired instead. 2. **DCGM 4.5+ rotated the `dcgmi config --get -g <id> -j` JSON shape.** Pre-4.5: `body` is a list of per-GPU entries keyed by `"Power Limit (W)"` with flat `{"Current": "...", "Target": "..."}` children. 4.5+: `body` is a dict of fields keyed by `"Power Limit"` (no `(W)` suffix) with a `"children"` wrapper whose leaves are `{"value": "..."}` objects. Three breaking changes in one upgrade — `body` flipped list→dict, the field name lost its suffix, the leaves gained a `value` wrapper. The previous implementation handled neither, so even if defect (1) were absent, the strict per-field reader would still skip. This commit lands a single combined fix: - **`_exec_in_pod`**: switch from `_preload_content=True` (default) to `_preload_content=False` + raw stdout/stderr accumulation off the `WSClient` via `peek_stdout`/`read_stdout`. Bypasses the mangling regardless of which `kubernetes-client` version is installed, and is the documented "I want bytes back" pattern in the kubernetes-client examples. - **New `_extract_power_limit_from_dcgmi_body(body)` helper**: walks both shapes uniformly and returns `(current_w, target_w)` ints or `(None, None)` if neither shape matches. Both DCGM read-back tests (`test_dcgm_per_gpu_cap_matches_topology` and `test_dcgm_target_config_registered`) now consume this helper instead of inlining duplicated nested-`dict.get` chains. Verified locally: - synthetic 4.0-shape payload `[{"Power Limit (W)": {"Current": "350", "Target": "350"}}]` → `(350, 350)` ✓ - synthetic 4.5-shape payload `{"Power Limit": {"children": {"Current": {"value": "375"}, "Target": {"value": "375"}}}}` → `(375, 375)` ✓ - missing field, wrong types, partial children: graceful `(None, None)` / `(int, None)` returns ✓ - test module still collects cleanly (`pytest --collect-only`) `examples/multi-dgd-live-test/README.md` Finding #11 updated to "Status: fixed" with a cross-reference to this commit and the expected SKIPPED→PASSED flip on the next live re-run. Refs: PR #9683 review (Finding #11 in examples/multi-dgd-live-test/README.md). Signed-off-by: Kai Ma <kaim@nvidia.com>

…dd SPDX header Two follow-up fixes to commit 552be92 caught in PR #9683 review. 1. Bystander cannot schedule. The previous CDI-safe rewrite swapped `NVIDIA_VISIBLE_DEVICES=<uuid>` for a `nvidia.com/gpu: 1` device- plugin claim, expecting the plugin to allocate a free GPU. But the 5 stub workers in `50-stub-workers.yaml` already claim every `nvidia.com/gpu` slot on the test node (1+1+2+2+2 = 8 of 8), so the bystander's claim sits Pending forever and the conflict never materialises. Switch to the same pattern the Power Agent itself uses (`deploy/helm/charts/power-agent/values.yaml`): `NVIDIA_VISIBLE_DEVICES: "all"` exposes every GPU through the toolkit without consuming a device-plugin claim. The bystander pins itself to `cuda:0` (deterministically the first visible device — same NVML index 0 as the agent, since both have `NVIDIA_VISIBLE_DEVICES=all`). The existing PID-based discovery loop in the test still verifies which physical GPU the context landed on rather than trusting the mapping, so the test stays robust under future toolkit / CDI changes. Touches the inline `bystander_manifest` in `test_multi_dgd_live.py::TestMultiPodConflict` and the YAML companion `examples/multi-dgd-live-test/40-conflict-bystander-pod.yaml`, plus the README's Finding #7 status to document both non-options (the original `NVIDIA_VISIBLE_DEVICES=<uuid>` CDI rejection AND the `nvidia.com/gpu: 1` over-subscription) and the surviving fix-shape. 2. Copyright check failed. `40-conflict-bystander-pod.yaml` was imported from `.aic_sandbox/` in commit 004df25 without the standard two-line SPDX header that every other YAML in `examples/multi-dgd-live-test/` carries (00-namespace through 50-stub-workers — verified via head-of-file scan). Add the header. Signed-off-by: Kai Ma <kaim@nvidia.com>

… lychee Lychee on PR #9683 fails on 5 URLs that aren't really URLs (caught in review). All five are inherited from f7f3226 (the original "feat(planner): power infrastructure …" commit on this branch) but block #9683 from going green either way. The 5 cited locations and fix shape: - `components/src/dynamo/planner/config/defaults.py:48` - `components/src/dynamo/planner/config/planner_config.py:147` In-cluster Prometheus URL `http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090`. `*.svc.cluster.local` is the canonical Kubernetes-internal DNS suffix and is not externally resolvable. Add an ignore entry for the suffix so any cluster-internal default URL passes lychee. - `components/src/dynamo/planner/config/planner_config.py:317` Docstring placeholder `http://host:port/`. - `deploy/helm/charts/platform/README.md:118-119` Helm-rendered Format hints `"nats://hostname:port"` / `"http://hostname:port"` / `"https://hostname:port"`. Add three anchored ignore entries for these exact placeholder forms (kept narrow so a real `*.com` hostname can't accidentally slip past as `hostname:port`). - `deploy/helm/charts/platform/Chart.yaml:23` `home: https://nvidia.com`. The same file already uses `https://www.nvidia.com` on line 19 for `url:` and that one passes lychee, so the bare-apex form on line 23 must be hitting a 4xx (Cloudflare bot challenge, redirect-loop, etc.). Canonicalise to the `www.` form so the two `nvidia.com` URLs in the same file agree, and no ignore entry is needed. Verified all 5 URLs match the new ignore patterns via local regex spot-check; `https://www.nvidia.com` (existing line 19 + line 23 fix) deliberately does NOT match any ignore pattern. Signed-off-by: Kai Ma <kaim@nvidia.com>

…at parse time Follow-up to b10cb8c. Reviewer reproduced lychee still failing on that head with two underlying problems the previous ignore-based fix did not address: 1. Lychee parses URLs *before* applying ignore rules. The `host:port` / `hostname:port` placeholders are extracted as URL candidates, then fail RFC-3986 port parsing (port is not numeric) and surface as errors that bypass the .lycheeignore matcher. 2. The CI workflow (`.github/workflows/docs-link-check.yml`) invokes lychee with `--root-dir $WS . --exclude-path "ATTRIBUTIONS" --exclude-path "lib/llm/tests/data/.*"` and nothing else, so `.lycheeignore` itself is one of the files lychee scans. The b10cb8c version of that file contained literal `://hostname:port` text in both regex bodies AND comments — lychee extracted those as URLs too. Two-part fix: A. Fix the placeholders at the source. Switch every `host:port` / `hostname:port` doc-string to the standard angle-bracket convention `<host>:<port>` / `<hostname>:<port>` — the same form used in Cobra/Click CLI help text and Markdown documentation. `<` is reserved by RFC 3986, so lychee's URL extractor halts at the `<` and never produces a candidate. Touches: - components/src/dynamo/planner/config/planner_config.py:317 (live_dashboard_port Field description). - deploy/helm/charts/platform/values.yaml:67, 70 (operator natsAddr / etcdAddr Format hints — source of truth for helm-docs). - deploy/helm/charts/platform/README.md:118-119 (helm-docs-rendered downstream copy of the values.yaml hints — kept in lockstep so a future regeneration is a no-op). B. Make .lycheeignore safe to scan. Replace literal `://` with the character class `:[/][/]` in every regex body, and strip URL-shaped text from comments (the `*.svc.cluster.local` suffix in the surviving comment is a DNS suffix, not a URL, and lychee does not extract bare hostnames). Since (A) removes the need for `host:port` / `hostname:port` ignores entirely, drop those three entries — `.svc.cluster.local` (for the in-cluster Prometheus default) is the only domain-level escape we still need. Verified locally: `.lycheeignore` now contains 0 literal `://` substrings, and the surviving `\.svc\.cluster\.local` pattern still matches `defaults.py:48` and `planner_config.py:147`. Signed-off-by: Kai Ma <kaim@nvidia.com>

Per PR #9683 review (dynamo-ops on core/base.py:964): advisory mode is supposed to log decisions without mutating cluster state, but the per-tick call to _apply_power_annotations() was still issuing PATCH requests against customer-owned Pod objects on every tick. Pod annotations are a cluster-visible side effect, so this violated the same contract that _apply_scaling_targets already follows (core/base.py:826). Fix: add `if self.config.advisory: return` as the first guard in _apply_power_annotations, mirroring the sibling method's pattern. Documented in the docstring so the contract is discoverable from future readers without grepping commit history. Test: components/src/dynamo/planner/tests/unit/test_actuation_knobs.py::TestApplyPowerAnnotations::test_advisory_mode_skips_patch — uses the existing _bare_planner / _power_config / _mock_pod fixtures and asserts both that get_component_pods is never called AND that patch_pod_annotation is never called, even when a stale pod would otherwise trigger a PATCH. Signed-off-by: Kai Ma <kaim@nvidia.com>

kaim-eng · 2026-06-08T16:55:51Z

Rebased onto `pr1a/power-agent` — ready for review

This branch has been rebased onto the current tip of the stack base
pr1a/power-agent (#9682). All 8 planner commits replayed cleanly
(git range-diff confirms a 1:1 mapping; the diff vs the base is
planner-only — no power-agent files leaked in).

Conflicts resolved during the rebase

Chart versions (deploy/helm/charts/platform/Chart.yaml,
components/operator/Chart.yaml, README.md): main had advanced both
charts to 1.3.0. Re-applied this PR's patch-level bump (planner pod
RBAC templates) on top → 1.3.1 (chart + appVersion + dependency
pin + rendered README badges/table).
.lycheeignore: kept both sides — main's etcd.io/martinfowler
entries and this PR's cluster-internal DNS + host:port placeholder
entries.
Chart.yaml home: took the lychee-friendly https://www.nvidia.com.
test_kubernetes_connector.py: both sides were additive — kept
main's new _main_container/_component/_deployment helpers and
this PR's hermetic monkeypatch.delenv(...) in
test_kubernetes_connector_no_env_var.

Semantic-merge fix (ruff F821, caught by CI pre-commit)

main renamed the planner helper
get_service_from_sub_component_type_or_name →
get_component_from_type_or_name (now positional sub_component_type).
The git auto-merge updated every existing call site but this PR's new
KubernetesConnector.get_component_pods call site still used the old
name. Fixed connectors/kubernetes.py:326 to call
get_component_from_type_or_name(deployment, sub_component_type),
matching the sibling call sites (e.g. lines 115/133) and the definition in
monitoring/dgd_services.py. Folded into the feature commit.

Verification

ruff / flake8 / black / isort + all pre-commit content hooks: green.
helm lint operator subchart: green; helm template renders the new
planner.yaml RBAC (ClusterRole cluster-wide / Role+SA+binding in
namespace-restricted mode) without error.
DCO: 8/8 commits signed off, conventional-commit prefixes.
Planner unit suite is build-gated (compiled dynamo bindings) and runs in
CI rather than locally; py_compile is clean on all resolved files.

CI status

Required checks green: DCO, copyright-checks,
pre-merge-status-check (includes pre-commit, operator, snapshot).
The three NVIDIA Test Lab required checks
(backend-/deploy-/dynamo-status-check) await the maintainer-triggered
internal pipeline, same as feat(power-agent): per-node power-cap enforcement DaemonSet #9682 / feat(power-agent): add DCGM dual actuator (opt-in; NVML remains default) #9790.
The lone red check is non-required lychee, failing on a single link
https://charts.bitnami.com/bitnami in the auto-generated platform chart
README (the etcd dependency's repo URL). Bitnami sunset their public
Helm registry, so this fails repo-wide from CI runners; it is not
introduced by this PR and does not block merge. Left out of scope here
(it belongs in a standalone docs/deps maintenance change).

@sttts this is the planner half of the #9369 split, rebased and green on
everything in scope — ready for a review stamp.

sttts · 2026-06-09T15:12:10Z

+    / "planner"
+    / "tests"
+    / "testbed"
+    / "_runtime_stub.py"


this comes in a follow-up PR ?

Keeping the hook here for now. The pre-merge guard makes it a harmless no-op in this PR, so it isn't blocking. If you'd prefer zero early no-op code, I can move it into #9686 — let me know and I'll relocate it.

Quick clarification on the mechanism — my earlier "pre-merge guard" wording was imprecise and could read as pytest-marker gating, which isn't what's happening.

components/src/conftest.py is a no-op on this PR purely because of a file-existence check, not any marker/pre-merge logic:

_STUB_PATH = pathlib.Path(__file__).parent / "dynamo" / "planner" / "tests" / "testbed" / "_runtime_stub.py" def _install_dynamo_core_stub_if_needed() -> None: if not _STUB_PATH.is_file(): return ...

_STUB_PATH points at dynamo/planner/tests/testbed/_runtime_stub.py, which is not part of #9683 — it ships with the testbed in #9686. So on this branch _STUB_PATH.is_file() is False and the installer returns immediately; it only does real work once the testbed lands. (And if the real dynamo._core is importable, it's a no-op regardless.)

So it's harmless here for that concrete reason. Offer still stands: if you'd prefer zero early no-op code, I can relocate the hook to #9686 — just say the word.

sttts · 2026-06-09T15:42:03Z

+  # upstream (Dockerfile DCGM tag + bindings path + missing logger.py shim).
+  # This NVML-pass build used DCGM 3.3.9 bindings (no DCGM-mode in
+  # this pass, so the 3.3.9 vs 4.5+ struct gap is irrelevant here).
+  repository: ttl.sh/dynamo-pa-kaim-4c3c606


should be the official repo?

Intentional — ttl.sh/dynamo-pa-* is an ephemeral 24h-TTL image used only for this live-repro scaffold, not a release artifact. The file's IMAGE NOTE already flags the production path: switch to nvcr.io/nvidia/dynamo/power-agent:vX.Y.Z once the NGC build lands. Keeping ttl.sh here so the example stays runnable today without depending on an unpublished registry tag. (The personal -kaim infix was scrubbed in 1326d0c737.)

To state it explicitly: ttl.sh/dynamo-pa-* is a dev / live-repro-only image — it is never a production default.

It appears only in examples/multi-dgd-live-test/10-power-agent-values-nvml.yaml, which is throwaway scaffolding for reproducing the multi-DGD scenario (24h-TTL anonymous registry, no imagePullSecret).

The shipped chart deploy/helm/charts/power-agent/values.yaml defaults image.repository to nvcr.io/nvidia/ai-dynamo/power-agent with a REQUIRED, no-default image.tag: "", so no release path can ever resolve to ttl.sh.

Keeping the ephemeral tag in the example only so it stays runnable today without depending on an unpublished NGC tag; the file's IMAGE NOTE already documents the production switch.

One correction I noticed while confirming this: that IMAGE NOTE still spells the registry nvcr.io/nvidia/dynamo/power-agent, whereas the production chart (and every other Dynamo image) uses nvcr.io/nvidia/ai-dynamo/power-agent. Happy to align the comment to ai-dynamo so the example's documented production path matches the chart.

sttts · 2026-06-09T15:42:39Z

+  # the agent. The agent then calls `list_pod_for_all_namespaces` and
+  # gets 403 Forbidden. Until the chart is fixed to wire
+  # `--namespace={{ .Release.Namespace }}` when namespaceRestricted=true,
+  # use cluster-scoped RBAC. See power_agent.py:798 (--namespace flag exists).


what about these findings? They should stay?

These are deliberate repro notes documenting real upstream gaps I hit while running this example (namespace-restricted RBAC 403, DCGM 3.3.9-vs-4.5+ struct gap, bindings path, missing logger.py shim). They stay — this directory is live-test reproduction material, and the notes are what let someone re-run it without re-discovering the same chart bugs. Happy to file a tracking issue and trim them to a one-line pointer if you'd rather keep the example terse.

…ersion revert - planner_config: enforce ge=1 on prefill/decode_engine_gpu_power_limit so a 0 W or negative per-GPU cap is rejected at config validation time. - test_actuation_knobs: convert stale v1alpha1 spec.services mocks to the v1beta1 spec.components list shape (type: prefill/decode) so the label-selector tests exercise real DGD role resolution instead of short-circuiting to [] via SubComponentNotFoundError. - platform/operator charts: revert version 1.3.1 -> 1.3.0, operator appVersion, the dynamo-operator dependency version, and the home URL back to the main baseline (out-of-scope chart metadata churn). Addresses sttts review feedback on #9683. Signed-off-by: Kai Ma <kaim@nvidia.com>

The Phase-1 power-annotation reconcile ran on every tick, and each sweep resolved prefill and decode pods independently — two DGD GETs plus two pod lists per tick per DGD (~4 reads / 5s) before any PATCH. This multiplies apiserver load for a side effect that rarely changes. - Throttle steady-state sweeps to power_annotation_interval_seconds (new config, default 60s). run() gates _apply_power_annotations via _should_sweep_power_annotations(); the method itself stays an unthrottled single reconcile so direct-call tests are unaffected. - Force a per-tick sweep for one interval after a scale-up (_scaling_up): freshly-created pods get annotated promptly instead of waiting out the throttle, so the throttle never strands new pods. - Resolve prefill+decode from one DGD read per sweep by threading the fetched deployment through get_component_pods(deployment=...), cutting reads-per-sweep from 4 to 3 (the two pod lists remain). Addresses sttts review feedback on #9683 (apiserver load from the per-tick annotation scan). Signed-off-by: Kai Ma <kaim@nvidia.com>

The multi-DGD power-aware live-test scaffold carried environment- and author-specific identifiers, making it read as a personal repro rather than a reusable example. Genericize them: - namespace kaim-dynamo-system -> dynamo-power-test - node-selector value power-test/node: kaim -> target - image tags drop the personal infix (ttl.sh/dynamo-pa-kaim-* -> ttl.sh/dynamo-pa-*, ttl.sh/dynamo-dcgm45-kaim -> ttl.sh/dynamo-dcgm45) - state hostPath dynamo-power-agent-kaim-test -> dynamo-power-agent-test - README build scratch dir /tmp/kaim -> /tmp/dynamo-build, concrete AKS node names -> placeholder/$TEST_NODE, and the dpp-dev-env cluster name -> generic AKS-test-cluster wording (heading anchor kept in sync) No behavioral change to the manifests; this is identifier hygiene only. Addresses sttts review feedback on #9683. Signed-off-by: Kai Ma <kaim@nvidia.com>

deploy/planner-pod-rbac-dev.yaml was a transitional Role + RoleBinding granting pods get/list/watch/patch to planner-dev-sa, for clusters running operator chart <= 1.2.0 whose planner ClusterRole lacked any pods permissions. The operator chart now ships those rules built in: templates/planner.yaml grants pods [get, list, watch, patch] to the planner ServiceAccount in both namespace-restricted (planner-role) and cluster-wide (...-planner ClusterRole) modes, a strict superset of the dev file. The file was self-marked DEPRECATED for chart 1.2.1+, and the chart here is 1.3.0, so the workaround is dead code. Addresses sttts review feedback on #9683. Signed-off-by: Kai Ma <kaim@nvidia.com>

Signed-off-by: Kai Ma <kaim@nvidia.com>

kaim-eng · 2026-06-09T22:23:53Z

Review status — all @sttts threads resolved

Thanks for the thorough pass. Below is a consolidated status of every review thread: 8 resolved via code changes (verified on pr1b/planner-infra) and 3 resolved as deliberate review decisions with rationale posted in-thread.

Thread	Resolution	Verification
`.gitignore`	Removed the personal scratch block; will use a global personal excludes file (`core.excludesFile`) instead	`.gitignore` is back to the `main` baseline — no `.tmp-*` / `dev-pod` / `qwen3` / `planner-dev` / `test_k8s_access` / `test_planner_launch` entries
`components/src/conftest.py`	No code change — inert in this PR because the installer is file-existence gated (`if not _STUB_PATH.is_file(): return`), and `testbed/_runtime_stub.py` ships in the testbed PR (#9686), not here	`conftest.py:39` early-returns; not marker-gated. Clarification posted in-thread
`deploy/helm/charts/platform/Chart.yaml`	Kept `version: 1.3.0` as the dev target; reverted the unrelated `home:` churn	`version: 1.3.0` confirmed on branch
`deploy/planner-pod-rbac-dev.yaml`	Deleted the deprecated dev RBAC manifest (operator chart already grants the superset)	File no longer exists on branch
`planner_config.py`	Added `ge=1` to both per-GPU power-limit fields so a 0/negative cap is rejected at config validation	`prefill_engine_gpu_power_limit` and `decode_engine_gpu_power_limit` both carry `ge=1`
`test_actuation_knobs.py`	Converted dead v1alpha1 `spec.services` dict mocks to the v1beta1 `spec.components` list shape	Mocks now use `"components": [{"name": ..., "type": "prefill"/"decode"}]`
`examples/.../10-power-agent-values-nvml.yaml` (image)	`ttl.sh` stays as the dev/live-repro-only image (not a production default); the stale production-followup comment was corrected `nvcr.io/nvidia/dynamo/power-agent` → `nvcr.io/nvidia/ai-dynamo/power-agent`	Tag appears only under `examples/multi-dgd-live-test/`; production chart defaults to `nvcr.io/nvidia/ai-dynamo/power-agent`, `tag: ""` (required)
`examples/.../10-power-agent-values-nvml.yaml:56` (findings)	No code change — kept the findings as deliberate live-repro notes	Reply posted with rationale
`examples/.../20-nvidia-dcgm-standalone-ds.yaml`	Removed personal namespace/identifier references	No `kaim` / `dpp-dev-env` identifiers remain in the example manifests
`examples/.../README.md`	Removed personal namespace references; also scrubbed the related integration test	No `kaim` / `dpp-dev-env` in README or `test_multi_dgd_live.py`
`core/base.py::_apply_power_annotations`	What changed: added `_should_sweep_power_annotations()` throttling on `power_annotation_interval_seconds` (default 60s) with a force-window on scale-up, plus a single deployment-scoped pod read threaded into both the prefill and decode sweeps. Why it satisfies the concern: steady-state apiserver load drops from ~4 reads / 5s to ~3 reads / 60s per DGD, while freshly-scaled pods are still annotated promptly.	`_should_sweep_power_annotations`, the `60.0` (`gt=0`) interval default, and `get_component_pods(..., deployment=deployment)` for both roles all confirmed on branch

…/ai-dynamo/power-agent The multi-DGD live-test values comment and the multi-tenant test doc pointed the production-followup image at nvcr.io/nvidia/dynamo/power-agent, which is missing the ai- prefix. The shipped chart default (deploy/helm/charts/power-agent/values.yaml) is nvcr.io/nvidia/ai-dynamo/power-agent; align both references. Signed-off-by: Kai Ma <kaim@nvidia.com>

Signed-off-by: Kai Ma <kaim@nvidia.com> # Conflicts: # components/src/dynamo/planner/core/base.py # components/src/dynamo/planner/monitoring/planner_metrics.py

Signed-off-by: Kai Ma <kaim@nvidia.com>

…safe sweep throttle) _scaling_up read self.state_machine._num_p_workers/_num_d_workers directly. On the orchestrator path the `state_machine` property lazily constructs a PlannerStateMachine (and warms predictors) when none exists, and the freshly-built PSM reports zero worker counts the orchestrator path never updates. Every positive scaling decision then looks like a scale-up, holding the power-annotation force-window permanently open and defeating the steady-state sweep throttle. Extract _current_worker_counts() (cached self._last_worker_counts first, PSM internals only when a state machine already exists, else 0,0) and use it in both _scaling_up and _log_decision_summary so the two cannot drift -- _scaling_up's docstring already claimed to mirror _log_decision_summary. Tests: cached counts take precedence over PSM internals; orchestrator path (no state machine) uses cached counts without constructing a PSM; cached counts at/above the decision do not force a sweep. Signed-off-by: Kai Ma <kaim@nvidia.com>

kaim-eng · 2026-06-12T15:03:59Z

Status update -- Phase-1 planner infra.

Resolved 9 review threads: advisory-mode gate (if self.config.advisory: return first in _apply_power_annotations), .gitignore revert, Chart.yaml back to 1.3.0, planner-pod-rbac-dev.yaml removal, ge=1 validators on the engine power limits, v1beta1 components/type test mocks, the apiserver-load sweep throttle, and the multi-DGD identifier scrub. Also landed 7e74aa65ea: _scaling_up now reads cached worker counts (orchestrator-safe) via a shared _current_worker_counts() helper, fixing a latent throttle-defeat where the lazily-built PSM reported zero counts. CI re-ran green -> CLEAN.

@sttts -- three threads are left open because they are your call rather than a clear-cut fix:

conftest stub hook (#discussion_r3381755500) -- happy to relocate it to feat(planner): power planner stress testbed (α + γ) #9686 if you'd prefer zero early no-op code; otherwise it's a file-existence no-op on this branch.
ttl.sh image in the live-repro example (#discussion_r3381962018) -- intentional dev-only ephemeral tag; the shipped chart defaults to nvcr.io/nvidia/ai-dynamo/power-agent. Keep, or switch the example too?
repro findings notes (#discussion_r3381965777) -- kept as live-test reproduction material; can trim to a tracking-issue pointer if you'd rather.

Tell me your preference on those three and I'll action them.

kaim-eng requested review from a team as code owners May 18, 2026 15:54

pull-request-size Bot added the size/XXL label May 18, 2026

github-actions Bot added feat documentation Improvements or additions to documentation deployment::k8s Relates to dynamo deployment in kubernetes planner actions labels May 18, 2026

This was referenced May 18, 2026

feat(power-agent): per-node power-cap enforcement DaemonSet #9682

Open

feat(planner): power budget enforcement #9684

Draft

kaim-eng force-pushed the pr1a/power-agent branch from 3cde449 to 1338028 Compare May 18, 2026 16:03

kaim-eng force-pushed the pr1b/planner-infra branch from 2222dbf to 1d7b243 Compare May 18, 2026 16:03

dynamo-ops reviewed May 18, 2026

View reviewed changes

Comment thread components/src/dynamo/planner/core/base.py Outdated

kaim-eng force-pushed the pr1a/power-agent branch from 1338028 to ec21081 Compare May 19, 2026 12:54

kaim-eng force-pushed the pr1b/planner-infra branch 2 times, most recently from 8126b1a to f7f3226 Compare May 19, 2026 15:54

kaim-eng mentioned this pull request May 20, 2026

feat(power-agent): add DCGM dual actuator (opt-in; NVML remains default) #9790

Open

kaim-eng requested a review from a team as a code owner May 22, 2026 14:23

kaim-eng force-pushed the pr1a/power-agent branch 2 times, most recently from 1e1ef6f to 29f0831 Compare May 25, 2026 13:46

kaim-eng force-pushed the pr1b/planner-infra branch from 7696f3f to 8f47413 Compare May 25, 2026 14:17

kaim-eng added 4 commits June 8, 2026 12:39

kaim-eng force-pushed the pr1b/planner-infra branch from 35716ac to b1ebd8b Compare June 8, 2026 16:41