feat(planner): power infrastructure — pod annotation, RBAC, config, Prometheus#9683
feat(planner): power infrastructure — pod annotation, RBAC, config, Prometheus#9683kaim-eng wants to merge 18 commits into
Conversation
3cde449 to
1338028
Compare
2222dbf to
1d7b243
Compare
1338028 to
ec21081
Compare
8126b1a to
f7f3226
Compare
Cross-DGD complement to test_actuation_knobs_live.py (the single-DGD live
test that already ships with this PR). Validates invariants that
single-DGD tests structurally cannot reach:
- A2 Three planners patch DISJOINT pod sets in one namespace
- A3 AggPlanner-C uses only the DECODE branch (no PREFILL calls)
- B2 Power Agent applies 8 distinct caps on one tick
- B4 Happy-path zero-counter invariant
- B5 Multi-pod conflict + safe-default behaviour (skipped pending
Finding #7 — CDI-safe bystander rewrite tracked separately)
- 8.5.6 / 8.6.2 NVML / DCGM cap-application proven by reading the
live driver state (not via mocks)
Layout introduced by this commit:
components/src/dynamo/planner/tests/integration/test_multi_dgd_live.py
~1000 LoC pytest module. Opt-in via RUN_MULTI_DGD_LIVE=1 +
TEST_NODE=<8-GPU node> env vars; auto-detects Path A (production
shape: 3 real DGDs + power-aware planner) vs Path B (5 stub-worker
pods with hardcoded annotations) and skips Path-A-only tests
cleanly when running Path B. DCGM tests gated by
DCGM_HOSTENGINE_AVAILABLE=1.
examples/multi-dgd-live-test/
Manifests + helm values overrides + teardown script + README
walk-through that an operator applies before running the test:
00-namespace.yaml — namespace + pull secret
10-power-agent-values-nvml.yaml — Power Agent NVML helm values
11-power-agent-values-dcgm.yaml — Power Agent DCGM helm values
20-nvidia-dcgm-standalone-ds.yaml — DCGM 4.5.1 hostengine DS
30/31/32-vllm-dgd-{A,B,C}.yaml — three DGDs (Path A)
40-conflict-bystander-pod.yaml — multi-pod conflict fixture
50-stub-workers.yaml — Path-B stub workers
99-teardown.sh — idempotent cleanup w/ cap-
restore verification
README.md — full reproduction playbook
including a 2026-05-21 live
run record + 11 findings
docs/design-docs/multi-tenant-test.md
Design doc (§8 — multi-DGD test plan) that establishes the
ground-truth topology, the planner / agent contract under test,
and the rationale for the artifact split above.
pyproject.toml
Registers two new test markers consumed by the live test's
`pytestmark`:
multi_dgd_live — scenario selector for this specific file
power_agent — shared with components/power_agent/tests/
Module-level gating fails fast on misconfiguration (missing TEST_NODE,
missing kubeconfig) with actionable error messages, and skips silently
when RUN_MULTI_DGD_LIVE is unset — so this file is a no-op under the
default unit-test invocation. Verified: pytest --collect-only is a
clean no-op without the env vars, and full-import works under
RUN_MULTI_DGD_LIVE=1 (collection fails cleanly on absent k8s creds, not
ImportError).
Refs: docs/design-docs/multi-tenant-test.md §8, PR #9683 review.
Signed-off-by: Kai Ma <kaim@nvidia.com>
…nd DCGM 4.5
Two stacked defects surfaced on the 2026-05-21T20:11Z Path-B re-run
caused `test_dcgm_per_gpu_cap_matches_topology` and
`test_dcgm_target_config_registered` to skip silently when they should
have PASSED:
1. **`kubernetes-client>=35.0.0` mangles JSON pod-exec stdout.**
`kubernetes.stream.stream(..., _preload_content=True)` sniffs
JSON-shaped stdout, parses it client-side, and returns `str(dict)`.
That means `printf '%s' '{"a":1}'` comes back as the literal 8-char
string `{'a': 1}` (single quotes, Python-dict repr), which then
trips `json.loads` at every read-back site — including the four
inside `test_dcgm_*`. The intended-as-defensive
"did not return JSON" skip branch fired instead.
2. **DCGM 4.5+ rotated the `dcgmi config --get -g <id> -j` JSON shape.**
Pre-4.5: `body` is a list of per-GPU entries keyed by `"Power Limit (W)"`
with flat `{"Current": "...", "Target": "..."}` children.
4.5+: `body` is a dict of fields keyed by `"Power Limit"`
(no `(W)` suffix) with a `"children"` wrapper whose leaves
are `{"value": "..."}` objects.
Three breaking changes in one upgrade — `body` flipped list→dict,
the field name lost its suffix, the leaves gained a `value` wrapper.
The previous implementation handled neither, so even if defect (1)
were absent, the strict per-field reader would still skip.
This commit lands a single combined fix:
- **`_exec_in_pod`**: switch from `_preload_content=True` (default) to
`_preload_content=False` + raw stdout/stderr accumulation off the
`WSClient` via `peek_stdout`/`read_stdout`. Bypasses the mangling
regardless of which `kubernetes-client` version is installed, and is
the documented "I want bytes back" pattern in the kubernetes-client
examples.
- **New `_extract_power_limit_from_dcgmi_body(body)` helper**: walks
both shapes uniformly and returns `(current_w, target_w)` ints or
`(None, None)` if neither shape matches. Both DCGM read-back tests
(`test_dcgm_per_gpu_cap_matches_topology` and
`test_dcgm_target_config_registered`) now consume this helper
instead of inlining duplicated nested-`dict.get` chains.
Verified locally:
- synthetic 4.0-shape payload `[{"Power Limit (W)": {"Current": "350",
"Target": "350"}}]` → `(350, 350)` ✓
- synthetic 4.5-shape payload `{"Power Limit": {"children":
{"Current": {"value": "375"}, "Target": {"value": "375"}}}}`
→ `(375, 375)` ✓
- missing field, wrong types, partial children: graceful `(None, None)`
/ `(int, None)` returns ✓
- test module still collects cleanly (`pytest --collect-only`)
`examples/multi-dgd-live-test/README.md` Finding #11 updated to
"Status: fixed" with a cross-reference to this commit and the expected
SKIPPED→PASSED flip on the next live re-run.
Refs: PR #9683 review (Finding #11 in
examples/multi-dgd-live-test/README.md).
Signed-off-by: Kai Ma <kaim@nvidia.com>
`TestMultiPodConflict` previously created a bystander pod with
`NVIDIA_VISIBLE_DEVICES=<gpu-uuid>` to pin it to a specific GPU. AKS
clusters running the NVIDIA Container Toolkit in CDI device mode (the
dpp-dev-env default, and the direction NVIDIA is pushing for newer
operator releases) reject that with
`unresolvable CDI devices management.nvidia.com/gpu=*` and refuse to
admit the pod — so the test class was carrying a
`@pytest.mark.skip(_CDI_CONFLICT_SKIP_REASON)` decorator that disabled
the whole §8.5.7 contract on every run.
This commit lands the CDI-mode-safe rewrite:
- Bystander now claims `nvidia.com/gpu: 1` from the device plugin
(both `requests` and `limits` per K8s extended-resource rules).
CDI-mode-agnostic — the device plugin owns the actual GPU wiring
via its own machinery.
- `examples/multi-dgd-live-test/40-conflict-bystander-pod.yaml`
mirrors the inline Python manifest exactly so YAML-debugging and
test-debugging agree (the README explicitly calls this out).
- New helper `_per_gpu_compute_pids()` snapshots
`nvmlDeviceGetComputeRunningProcesses` across all 8 GPUs from the
agent pod (which has hostPID + NVML access) and returns
`{gpu_idx: {pid, ...}}`. The test polls this until it sees a new
PID on some index after the bystander reaches Running — that
index IS the GPU the device plugin assigned, no NVIDIA_VISIBLE_DEVICES
pin needed.
- Test renamed `test_conflict_triggers_safe_default_on_gpu0_only` →
`test_conflict_triggers_safe_default_on_assigned_gpu` to reflect
the index-agnostic contract. Blast-radius / recovery asserts key
off the discovered `assigned_gpu`; the loop over the other GPUs
iterates `range(8)` and skips that index instead of hard-coding
`range(1, 8)`.
- Class-level `@pytest.mark.skip` decorator removed. The supporting
`_CDI_CONFLICT_SKIP_REASON` constant is dropped too — kept neither
in code nor as a runtime branch, since neither path is needed
once the CDI-safe flow works.
- `bystander_uuid` class fixture removed (no longer needed — the
UUID/index discovery is post-creation, not pre).
`examples/multi-dgd-live-test/README.md` updated:
- Finding #7 marked **Status: fixed** with cross-reference to this
commit and an explicit description of the new (a)/(b)/(c) flow.
- Expected-outcome counts updated: Path B now expects 5 PASSED /
2 SKIPPED (was 4 PASSED / 3 SKIPPED — multi-pod conflict moves
from SKIP to PASS); DCGM pass expects 7 PASSED / 2 SKIPPED on the
next live run.
- Live-run record row updated to note the rewrite and expected
PASS outcome.
Verified locally: test module compiles + collects cleanly. The
end-to-end CDI-vs-non-CDI assignment behaviour is by construction the
device plugin's responsibility, so the failure mode of this rewrite
is bounded to "device plugin returns no GPU within 90 s" — handled
with a clear pytest.fail message that includes a before/after PID
snapshot for debugging.
Refs: PR #9683 review (Finding #7 in
examples/multi-dgd-live-test/README.md).
Signed-off-by: Kai Ma <kaim@nvidia.com>
…dd SPDX header Two follow-up fixes to commit 552be92 caught in PR #9683 review. 1. Bystander cannot schedule. The previous CDI-safe rewrite swapped `NVIDIA_VISIBLE_DEVICES=<uuid>` for a `nvidia.com/gpu: 1` device- plugin claim, expecting the plugin to allocate a free GPU. But the 5 stub workers in `50-stub-workers.yaml` already claim every `nvidia.com/gpu` slot on the test node (1+1+2+2+2 = 8 of 8), so the bystander's claim sits Pending forever and the conflict never materialises. Switch to the same pattern the Power Agent itself uses (`deploy/helm/charts/power-agent/values.yaml`): `NVIDIA_VISIBLE_DEVICES: "all"` exposes every GPU through the toolkit without consuming a device-plugin claim. The bystander pins itself to `cuda:0` (deterministically the first visible device — same NVML index 0 as the agent, since both have `NVIDIA_VISIBLE_DEVICES=all`). The existing PID-based discovery loop in the test still verifies which physical GPU the context landed on rather than trusting the mapping, so the test stays robust under future toolkit / CDI changes. Touches the inline `bystander_manifest` in `test_multi_dgd_live.py::TestMultiPodConflict` and the YAML companion `examples/multi-dgd-live-test/40-conflict-bystander-pod.yaml`, plus the README's Finding #7 status to document both non-options (the original `NVIDIA_VISIBLE_DEVICES=<uuid>` CDI rejection AND the `nvidia.com/gpu: 1` over-subscription) and the surviving fix-shape. 2. Copyright check failed. `40-conflict-bystander-pod.yaml` was imported from `.aic_sandbox/` in commit 004df25 without the standard two-line SPDX header that every other YAML in `examples/multi-dgd-live-test/` carries (00-namespace through 50-stub-workers — verified via head-of-file scan). Add the header. Signed-off-by: Kai Ma <kaim@nvidia.com>
… lychee Lychee on PR #9683 fails on 5 URLs that aren't really URLs (caught in review). All five are inherited from f7f3226 (the original "feat(planner): power infrastructure …" commit on this branch) but block #9683 from going green either way. The 5 cited locations and fix shape: - `components/src/dynamo/planner/config/defaults.py:48` - `components/src/dynamo/planner/config/planner_config.py:147` In-cluster Prometheus URL `http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090`. `*.svc.cluster.local` is the canonical Kubernetes-internal DNS suffix and is not externally resolvable. Add an ignore entry for the suffix so any cluster-internal default URL passes lychee. - `components/src/dynamo/planner/config/planner_config.py:317` Docstring placeholder `http://host:port/`. - `deploy/helm/charts/platform/README.md:118-119` Helm-rendered Format hints `"nats://hostname:port"` / `"http://hostname:port"` / `"https://hostname:port"`. Add three anchored ignore entries for these exact placeholder forms (kept narrow so a real `*.com` hostname can't accidentally slip past as `hostname:port`). - `deploy/helm/charts/platform/Chart.yaml:23` `home: https://nvidia.com`. The same file already uses `https://www.nvidia.com` on line 19 for `url:` and that one passes lychee, so the bare-apex form on line 23 must be hitting a 4xx (Cloudflare bot challenge, redirect-loop, etc.). Canonicalise to the `www.` form so the two `nvidia.com` URLs in the same file agree, and no ignore entry is needed. Verified all 5 URLs match the new ignore patterns via local regex spot-check; `https://www.nvidia.com` (existing line 19 + line 23 fix) deliberately does NOT match any ignore pattern. Signed-off-by: Kai Ma <kaim@nvidia.com>
1e1ef6f to
29f0831
Compare
Cross-DGD complement to test_actuation_knobs_live.py (the single-DGD live
test that already ships with this PR). Validates invariants that
single-DGD tests structurally cannot reach:
- A2 Three planners patch DISJOINT pod sets in one namespace
- A3 AggPlanner-C uses only the DECODE branch (no PREFILL calls)
- B2 Power Agent applies 8 distinct caps on one tick
- B4 Happy-path zero-counter invariant
- B5 Multi-pod conflict + safe-default behaviour (skipped pending
Finding #7 — CDI-safe bystander rewrite tracked separately)
- 8.5.6 / 8.6.2 NVML / DCGM cap-application proven by reading the
live driver state (not via mocks)
Layout introduced by this commit:
components/src/dynamo/planner/tests/integration/test_multi_dgd_live.py
~1000 LoC pytest module. Opt-in via RUN_MULTI_DGD_LIVE=1 +
TEST_NODE=<8-GPU node> env vars; auto-detects Path A (production
shape: 3 real DGDs + power-aware planner) vs Path B (5 stub-worker
pods with hardcoded annotations) and skips Path-A-only tests
cleanly when running Path B. DCGM tests gated by
DCGM_HOSTENGINE_AVAILABLE=1.
examples/multi-dgd-live-test/
Manifests + helm values overrides + teardown script + README
walk-through that an operator applies before running the test:
00-namespace.yaml — namespace + pull secret
10-power-agent-values-nvml.yaml — Power Agent NVML helm values
11-power-agent-values-dcgm.yaml — Power Agent DCGM helm values
20-nvidia-dcgm-standalone-ds.yaml — DCGM 4.5.1 hostengine DS
30/31/32-vllm-dgd-{A,B,C}.yaml — three DGDs (Path A)
40-conflict-bystander-pod.yaml — multi-pod conflict fixture
50-stub-workers.yaml — Path-B stub workers
99-teardown.sh — idempotent cleanup w/ cap-
restore verification
README.md — full reproduction playbook
including a 2026-05-21 live
run record + 11 findings
docs/design-docs/multi-tenant-test.md
Design doc (§8 — multi-DGD test plan) that establishes the
ground-truth topology, the planner / agent contract under test,
and the rationale for the artifact split above.
pyproject.toml
Registers two new test markers consumed by the live test's
`pytestmark`:
multi_dgd_live — scenario selector for this specific file
power_agent — shared with components/power_agent/tests/
Module-level gating fails fast on misconfiguration (missing TEST_NODE,
missing kubeconfig) with actionable error messages, and skips silently
when RUN_MULTI_DGD_LIVE is unset — so this file is a no-op under the
default unit-test invocation. Verified: pytest --collect-only is a
clean no-op without the env vars, and full-import works under
RUN_MULTI_DGD_LIVE=1 (collection fails cleanly on absent k8s creds, not
ImportError).
Refs: docs/design-docs/multi-tenant-test.md §8, PR #9683 review.
Signed-off-by: Kai Ma <kaim@nvidia.com>
7696f3f to
8f47413
Compare
…nd DCGM 4.5
Two stacked defects surfaced on the 2026-05-21T20:11Z Path-B re-run
caused `test_dcgm_per_gpu_cap_matches_topology` and
`test_dcgm_target_config_registered` to skip silently when they should
have PASSED:
1. **`kubernetes-client>=35.0.0` mangles JSON pod-exec stdout.**
`kubernetes.stream.stream(..., _preload_content=True)` sniffs
JSON-shaped stdout, parses it client-side, and returns `str(dict)`.
That means `printf '%s' '{"a":1}'` comes back as the literal 8-char
string `{'a': 1}` (single quotes, Python-dict repr), which then
trips `json.loads` at every read-back site — including the four
inside `test_dcgm_*`. The intended-as-defensive
"did not return JSON" skip branch fired instead.
2. **DCGM 4.5+ rotated the `dcgmi config --get -g <id> -j` JSON shape.**
Pre-4.5: `body` is a list of per-GPU entries keyed by `"Power Limit (W)"`
with flat `{"Current": "...", "Target": "..."}` children.
4.5+: `body` is a dict of fields keyed by `"Power Limit"`
(no `(W)` suffix) with a `"children"` wrapper whose leaves
are `{"value": "..."}` objects.
Three breaking changes in one upgrade — `body` flipped list→dict,
the field name lost its suffix, the leaves gained a `value` wrapper.
The previous implementation handled neither, so even if defect (1)
were absent, the strict per-field reader would still skip.
This commit lands a single combined fix:
- **`_exec_in_pod`**: switch from `_preload_content=True` (default) to
`_preload_content=False` + raw stdout/stderr accumulation off the
`WSClient` via `peek_stdout`/`read_stdout`. Bypasses the mangling
regardless of which `kubernetes-client` version is installed, and is
the documented "I want bytes back" pattern in the kubernetes-client
examples.
- **New `_extract_power_limit_from_dcgmi_body(body)` helper**: walks
both shapes uniformly and returns `(current_w, target_w)` ints or
`(None, None)` if neither shape matches. Both DCGM read-back tests
(`test_dcgm_per_gpu_cap_matches_topology` and
`test_dcgm_target_config_registered`) now consume this helper
instead of inlining duplicated nested-`dict.get` chains.
Verified locally:
- synthetic 4.0-shape payload `[{"Power Limit (W)": {"Current": "350",
"Target": "350"}}]` → `(350, 350)` ✓
- synthetic 4.5-shape payload `{"Power Limit": {"children":
{"Current": {"value": "375"}, "Target": {"value": "375"}}}}`
→ `(375, 375)` ✓
- missing field, wrong types, partial children: graceful `(None, None)`
/ `(int, None)` returns ✓
- test module still collects cleanly (`pytest --collect-only`)
`examples/multi-dgd-live-test/README.md` Finding #11 updated to
"Status: fixed" with a cross-reference to this commit and the expected
SKIPPED→PASSED flip on the next live re-run.
Refs: PR #9683 review (Finding #11 in
examples/multi-dgd-live-test/README.md).
Signed-off-by: Kai Ma <kaim@nvidia.com>
…dd SPDX header Two follow-up fixes to commit 552be92 caught in PR #9683 review. 1. Bystander cannot schedule. The previous CDI-safe rewrite swapped `NVIDIA_VISIBLE_DEVICES=<uuid>` for a `nvidia.com/gpu: 1` device- plugin claim, expecting the plugin to allocate a free GPU. But the 5 stub workers in `50-stub-workers.yaml` already claim every `nvidia.com/gpu` slot on the test node (1+1+2+2+2 = 8 of 8), so the bystander's claim sits Pending forever and the conflict never materialises. Switch to the same pattern the Power Agent itself uses (`deploy/helm/charts/power-agent/values.yaml`): `NVIDIA_VISIBLE_DEVICES: "all"` exposes every GPU through the toolkit without consuming a device-plugin claim. The bystander pins itself to `cuda:0` (deterministically the first visible device — same NVML index 0 as the agent, since both have `NVIDIA_VISIBLE_DEVICES=all`). The existing PID-based discovery loop in the test still verifies which physical GPU the context landed on rather than trusting the mapping, so the test stays robust under future toolkit / CDI changes. Touches the inline `bystander_manifest` in `test_multi_dgd_live.py::TestMultiPodConflict` and the YAML companion `examples/multi-dgd-live-test/40-conflict-bystander-pod.yaml`, plus the README's Finding #7 status to document both non-options (the original `NVIDIA_VISIBLE_DEVICES=<uuid>` CDI rejection AND the `nvidia.com/gpu: 1` over-subscription) and the surviving fix-shape. 2. Copyright check failed. `40-conflict-bystander-pod.yaml` was imported from `.aic_sandbox/` in commit 004df25 without the standard two-line SPDX header that every other YAML in `examples/multi-dgd-live-test/` carries (00-namespace through 50-stub-workers — verified via head-of-file scan). Add the header. Signed-off-by: Kai Ma <kaim@nvidia.com>
… lychee Lychee on PR #9683 fails on 5 URLs that aren't really URLs (caught in review). All five are inherited from f7f3226 (the original "feat(planner): power infrastructure …" commit on this branch) but block #9683 from going green either way. The 5 cited locations and fix shape: - `components/src/dynamo/planner/config/defaults.py:48` - `components/src/dynamo/planner/config/planner_config.py:147` In-cluster Prometheus URL `http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090`. `*.svc.cluster.local` is the canonical Kubernetes-internal DNS suffix and is not externally resolvable. Add an ignore entry for the suffix so any cluster-internal default URL passes lychee. - `components/src/dynamo/planner/config/planner_config.py:317` Docstring placeholder `http://host:port/`. - `deploy/helm/charts/platform/README.md:118-119` Helm-rendered Format hints `"nats://hostname:port"` / `"http://hostname:port"` / `"https://hostname:port"`. Add three anchored ignore entries for these exact placeholder forms (kept narrow so a real `*.com` hostname can't accidentally slip past as `hostname:port`). - `deploy/helm/charts/platform/Chart.yaml:23` `home: https://nvidia.com`. The same file already uses `https://www.nvidia.com` on line 19 for `url:` and that one passes lychee, so the bare-apex form on line 23 must be hitting a 4xx (Cloudflare bot challenge, redirect-loop, etc.). Canonicalise to the `www.` form so the two `nvidia.com` URLs in the same file agree, and no ignore entry is needed. Verified all 5 URLs match the new ignore patterns via local regex spot-check; `https://www.nvidia.com` (existing line 19 + line 23 fix) deliberately does NOT match any ignore pattern. Signed-off-by: Kai Ma <kaim@nvidia.com>
…at parse time Follow-up to b10cb8c. Reviewer reproduced lychee still failing on that head with two underlying problems the previous ignore-based fix did not address: 1. Lychee parses URLs *before* applying ignore rules. The `host:port` / `hostname:port` placeholders are extracted as URL candidates, then fail RFC-3986 port parsing (port is not numeric) and surface as errors that bypass the .lycheeignore matcher. 2. The CI workflow (`.github/workflows/docs-link-check.yml`) invokes lychee with `--root-dir $WS . --exclude-path "ATTRIBUTIONS" --exclude-path "lib/llm/tests/data/.*"` and nothing else, so `.lycheeignore` itself is one of the files lychee scans. The b10cb8c version of that file contained literal `://hostname:port` text in both regex bodies AND comments — lychee extracted those as URLs too. Two-part fix: A. Fix the placeholders at the source. Switch every `host:port` / `hostname:port` doc-string to the standard angle-bracket convention `<host>:<port>` / `<hostname>:<port>` — the same form used in Cobra/Click CLI help text and Markdown documentation. `<` is reserved by RFC 3986, so lychee's URL extractor halts at the `<` and never produces a candidate. Touches: - components/src/dynamo/planner/config/planner_config.py:317 (live_dashboard_port Field description). - deploy/helm/charts/platform/values.yaml:67, 70 (operator natsAddr / etcdAddr Format hints — source of truth for helm-docs). - deploy/helm/charts/platform/README.md:118-119 (helm-docs-rendered downstream copy of the values.yaml hints — kept in lockstep so a future regeneration is a no-op). B. Make .lycheeignore safe to scan. Replace literal `://` with the character class `:[/][/]` in every regex body, and strip URL-shaped text from comments (the `*.svc.cluster.local` suffix in the surviving comment is a DNS suffix, not a URL, and lychee does not extract bare hostnames). Since (A) removes the need for `host:port` / `hostname:port` ignores entirely, drop those three entries — `.svc.cluster.local` (for the in-cluster Prometheus default) is the only domain-level escape we still need. Verified locally: `.lycheeignore` now contains 0 literal `://` substrings, and the surviving `\.svc\.cluster\.local` pattern still matches `defaults.py:48` and `planner_config.py:147`. Signed-off-by: Kai Ma <kaim@nvidia.com>
Per PR #9683 review (dynamo-ops on core/base.py:964): advisory mode is supposed to log decisions without mutating cluster state, but the per-tick call to _apply_power_annotations() was still issuing PATCH requests against customer-owned Pod objects on every tick. Pod annotations are a cluster-visible side effect, so this violated the same contract that _apply_scaling_targets already follows (core/base.py:826). Fix: add `if self.config.advisory: return` as the first guard in _apply_power_annotations, mirroring the sibling method's pattern. Documented in the docstring so the contract is discoverable from future readers without grepping commit history. Test: components/src/dynamo/planner/tests/unit/test_actuation_knobs.py::TestApplyPowerAnnotations::test_advisory_mode_skips_patch — uses the existing _bare_planner / _power_config / _mock_pod fixtures and asserts both that get_component_pods is never called AND that patch_pod_annotation is never called, even when a stale pod would otherwise trigger a PATCH. Signed-off-by: Kai Ma <kaim@nvidia.com>
35716ac to
b1ebd8b
Compare
Rebased onto
|
| / "planner" | ||
| / "tests" | ||
| / "testbed" | ||
| / "_runtime_stub.py" |
There was a problem hiding this comment.
this comes in a follow-up PR ?
There was a problem hiding this comment.
Keeping the hook here for now. The pre-merge guard makes it a harmless no-op in this PR, so it isn't blocking. If you'd prefer zero early no-op code, I can move it into #9686 — let me know and I'll relocate it.
There was a problem hiding this comment.
Quick clarification on the mechanism — my earlier "pre-merge guard" wording was imprecise and could read as pytest-marker gating, which isn't what's happening.
components/src/conftest.py is a no-op on this PR purely because of a file-existence check, not any marker/pre-merge logic:
_STUB_PATH = pathlib.Path(__file__).parent / "dynamo" / "planner" / "tests" / "testbed" / "_runtime_stub.py"
def _install_dynamo_core_stub_if_needed() -> None:
if not _STUB_PATH.is_file():
return
..._STUB_PATH points at dynamo/planner/tests/testbed/_runtime_stub.py, which is not part of #9683 — it ships with the testbed in #9686. So on this branch _STUB_PATH.is_file() is False and the installer returns immediately; it only does real work once the testbed lands. (And if the real dynamo._core is importable, it's a no-op regardless.)
So it's harmless here for that concrete reason. Offer still stands: if you'd prefer zero early no-op code, I can relocate the hook to #9686 — just say the word.
| # upstream (Dockerfile DCGM tag + bindings path + missing logger.py shim). | ||
| # This NVML-pass build used DCGM 3.3.9 bindings (no DCGM-mode in | ||
| # this pass, so the 3.3.9 vs 4.5+ struct gap is irrelevant here). | ||
| repository: ttl.sh/dynamo-pa-kaim-4c3c606 |
There was a problem hiding this comment.
Intentional — ttl.sh/dynamo-pa-* is an ephemeral 24h-TTL image used only for this live-repro scaffold, not a release artifact. The file's IMAGE NOTE already flags the production path: switch to nvcr.io/nvidia/dynamo/power-agent:vX.Y.Z once the NGC build lands. Keeping ttl.sh here so the example stays runnable today without depending on an unpublished registry tag. (The personal -kaim infix was scrubbed in 1326d0c737.)
There was a problem hiding this comment.
To state it explicitly: ttl.sh/dynamo-pa-* is a dev / live-repro-only image — it is never a production default.
- It appears only in
examples/multi-dgd-live-test/10-power-agent-values-nvml.yaml, which is throwaway scaffolding for reproducing the multi-DGD scenario (24h-TTL anonymous registry, noimagePullSecret). - The shipped chart
deploy/helm/charts/power-agent/values.yamldefaultsimage.repositorytonvcr.io/nvidia/ai-dynamo/power-agentwith a REQUIRED, no-defaultimage.tag: "", so no release path can ever resolve tottl.sh.
Keeping the ephemeral tag in the example only so it stays runnable today without depending on an unpublished NGC tag; the file's IMAGE NOTE already documents the production switch.
One correction I noticed while confirming this: that IMAGE NOTE still spells the registry nvcr.io/nvidia/dynamo/power-agent, whereas the production chart (and every other Dynamo image) uses nvcr.io/nvidia/ai-dynamo/power-agent. Happy to align the comment to ai-dynamo so the example's documented production path matches the chart.
| # the agent. The agent then calls `list_pod_for_all_namespaces` and | ||
| # gets 403 Forbidden. Until the chart is fixed to wire | ||
| # `--namespace={{ .Release.Namespace }}` when namespaceRestricted=true, | ||
| # use cluster-scoped RBAC. See power_agent.py:798 (--namespace flag exists). |
There was a problem hiding this comment.
what about these findings? They should stay?
There was a problem hiding this comment.
These are deliberate repro notes documenting real upstream gaps I hit while running this example (namespace-restricted RBAC 403, DCGM 3.3.9-vs-4.5+ struct gap, bindings path, missing logger.py shim). They stay — this directory is live-test reproduction material, and the notes are what let someone re-run it without re-discovering the same chart bugs. Happy to file a tracking issue and trim them to a one-line pointer if you'd rather keep the example terse.
…ersion revert - planner_config: enforce ge=1 on prefill/decode_engine_gpu_power_limit so a 0 W or negative per-GPU cap is rejected at config validation time. - test_actuation_knobs: convert stale v1alpha1 spec.services mocks to the v1beta1 spec.components list shape (type: prefill/decode) so the label-selector tests exercise real DGD role resolution instead of short-circuiting to [] via SubComponentNotFoundError. - platform/operator charts: revert version 1.3.1 -> 1.3.0, operator appVersion, the dynamo-operator dependency version, and the home URL back to the main baseline (out-of-scope chart metadata churn). Addresses sttts review feedback on #9683. Signed-off-by: Kai Ma <kaim@nvidia.com>
The Phase-1 power-annotation reconcile ran on every tick, and each sweep resolved prefill and decode pods independently — two DGD GETs plus two pod lists per tick per DGD (~4 reads / 5s) before any PATCH. This multiplies apiserver load for a side effect that rarely changes. - Throttle steady-state sweeps to power_annotation_interval_seconds (new config, default 60s). run() gates _apply_power_annotations via _should_sweep_power_annotations(); the method itself stays an unthrottled single reconcile so direct-call tests are unaffected. - Force a per-tick sweep for one interval after a scale-up (_scaling_up): freshly-created pods get annotated promptly instead of waiting out the throttle, so the throttle never strands new pods. - Resolve prefill+decode from one DGD read per sweep by threading the fetched deployment through get_component_pods(deployment=...), cutting reads-per-sweep from 4 to 3 (the two pod lists remain). Addresses sttts review feedback on #9683 (apiserver load from the per-tick annotation scan). Signed-off-by: Kai Ma <kaim@nvidia.com>
The multi-DGD power-aware live-test scaffold carried environment- and author-specific identifiers, making it read as a personal repro rather than a reusable example. Genericize them: - namespace kaim-dynamo-system -> dynamo-power-test - node-selector value power-test/node: kaim -> target - image tags drop the personal infix (ttl.sh/dynamo-pa-kaim-* -> ttl.sh/dynamo-pa-*, ttl.sh/dynamo-dcgm45-kaim -> ttl.sh/dynamo-dcgm45) - state hostPath dynamo-power-agent-kaim-test -> dynamo-power-agent-test - README build scratch dir /tmp/kaim -> /tmp/dynamo-build, concrete AKS node names -> placeholder/$TEST_NODE, and the dpp-dev-env cluster name -> generic AKS-test-cluster wording (heading anchor kept in sync) No behavioral change to the manifests; this is identifier hygiene only. Addresses sttts review feedback on #9683. Signed-off-by: Kai Ma <kaim@nvidia.com>
deploy/planner-pod-rbac-dev.yaml was a transitional Role + RoleBinding granting pods get/list/watch/patch to planner-dev-sa, for clusters running operator chart <= 1.2.0 whose planner ClusterRole lacked any pods permissions. The operator chart now ships those rules built in: templates/planner.yaml grants pods [get, list, watch, patch] to the planner ServiceAccount in both namespace-restricted (planner-role) and cluster-wide (...-planner ClusterRole) modes, a strict superset of the dev file. The file was self-marked DEPRECATED for chart 1.2.1+, and the chart here is 1.3.0, so the workaround is dead code. Addresses sttts review feedback on #9683. Signed-off-by: Kai Ma <kaim@nvidia.com>
Signed-off-by: Kai Ma <kaim@nvidia.com>
Signed-off-by: Kai Ma <kaim@nvidia.com>
Review status — all @sttts threads resolvedThanks for the thorough pass. Below is a consolidated status of every review thread: 8 resolved via code changes (verified on
|
…/ai-dynamo/power-agent The multi-DGD live-test values comment and the multi-tenant test doc pointed the production-followup image at nvcr.io/nvidia/dynamo/power-agent, which is missing the ai- prefix. The shipped chart default (deploy/helm/charts/power-agent/values.yaml) is nvcr.io/nvidia/ai-dynamo/power-agent; align both references. Signed-off-by: Kai Ma <kaim@nvidia.com>
Signed-off-by: Kai Ma <kaim@nvidia.com> # Conflicts: # components/src/dynamo/planner/core/base.py # components/src/dynamo/planner/monitoring/planner_metrics.py
Signed-off-by: Kai Ma <kaim@nvidia.com>
…safe sweep throttle) _scaling_up read self.state_machine._num_p_workers/_num_d_workers directly. On the orchestrator path the `state_machine` property lazily constructs a PlannerStateMachine (and warms predictors) when none exists, and the freshly-built PSM reports zero worker counts the orchestrator path never updates. Every positive scaling decision then looks like a scale-up, holding the power-annotation force-window permanently open and defeating the steady-state sweep throttle. Extract _current_worker_counts() (cached self._last_worker_counts first, PSM internals only when a state machine already exists, else 0,0) and use it in both _scaling_up and _log_decision_summary so the two cannot drift -- _scaling_up's docstring already claimed to mirror _log_decision_summary. Tests: cached counts take precedence over PSM internals; orchestrator path (no state machine) uses cached counts without constructing a PSM; cached counts at/above the decision do not force a sweep. Signed-off-by: Kai Ma <kaim@nvidia.com>
|
Status update -- Phase-1 planner infra. Resolved 9 review threads: advisory-mode gate ( @sttts -- three threads are left open because they are your call rather than a clear-cut fix:
Tell me your preference on those three and I'll action them. |
Part of the PR #9369 split plan. This is PR 2 of 6 (PR 1b - Planner Power Infrastructure). Predecessor: #9682 - Power Agent DaemonSet (base =
pr1a/power-agent; #9682 must land first) Successor: #9684 (Draft) - Power budget enforcement ## Summary Planner-side wires from config to pod annotation. Nothing in this PR changes the scaling policy to consume power yet; Phase 1 emits and observes the power infrastructure, while the AIC closed-loop optimizer and admission coupling remain deferred. Current branch shape: 32 files changed againstpr1a/power-agent. The 6 shared files with later planner PRs are hand-authored here at the Phase-1 line subset from the split plan. - 6 power-aware config fields:enable_power_awareness(defaultFalse),total_gpu_power_limit,prefill_engine_gpu_power_limit,decode_engine_gpu_power_limit,power_agent_safe_default_watts, andpower_annotation_interval_seconds(default 60s sweep throttle). The per-engine power-limit fields arege=1validated. -_apply_power_annotations()and_publish_power_budget_metrics()incore/base.py: emitdynamo.nvidia.com/gpu-power-limitper worker pod and expose Phase-1 power-budget gauges. Advisory mode is a hard no-op for pod mutation, and steady-state annotation sweeps are throttled with a scale-up force window. -7e74aa65eafixes the sweep force-window logic for the orchestrator path:_scaling_up()now reads cachedWorkerCountsbefore PSM internals and never constructs a PSM just to classify scale-up. -KubernetesConnector/KubernetesAPI:patch_pod_annotation,list_pods_by_label,get_component_pods,list_frontend_pods, andresolve_frontend_http_port. - Prometheus plumbing: DCGMget_total_dgd_power()plus the NaN-filter fix in_get_metric_increase. - Operator-chart planner pod RBAC:pods [get, list, watch, patch]for the planner ServiceAccount in both namespace and cluster-wide modes. The platform chart stays at themainbaseline1.3.0; the obsolete standalonedeploy/planner-pod-rbac-dev.yamlpatch was removed. -examples/multi-dgd-live-test/: multi-DGD live-repro scaffolding andtest_multi_dgd_live.py; personal identifiers were scrubbed. Reviewer context: - Design context:docs/design-docs/powerplanner-design.mdsections 5.5, 6.4, and 7 (lands later in the stack; readable frompr5/docs-devenv). - Dev environment:docs/components/planner/dpp-dev-env.mdsection 8 (also in PR 5). - Split-plan context: sections 2.2, 2.2.2, 3.2, and 5. ## Validation - Required/current GitHub checks pass at7e74aa65ea; rust jobs and team-request are intentionally skipped by filters. - Dev-pod validation from 2026-05-18, before the final7e74aa65easweep-throttle fix: - Full planner sweep, excludingtest_virtual_connector.py: 483 passed, 2 skipped in 12.46s. - Same sweep withRUN_DISRUPTIVE_TESTS=1: 484 passed, 1 skipped. - Phase-1 critical files (test_actuation_knobs.py,test_kubernetes_connector.py,test_prometheus.py): 121 / 121 passed. -test_actuation_knobs_live.py: 9 passed, 2 skipped; with disruptive opt-in: 10 passed, 1 skipped. - Power Agent regression at PR 1b tip at that time: 43 / 43 passed. -7e74aa65eaadds 3 focused unit tests intest_actuation_knobs.pyfor cached worker-count precedence, orchestrator/no-PSM behavior, and non-scale-up cached counts. These assertions are lint/import clean and required CI is green, but the dev-pod planner pytest count has not been refreshed after this final commit. - Pre-commit onorigin/main..pr1b/planner-infra: 14 hooks pass, 4 skipped; marker compliance was reconfirmed in a Linux pod withMissing sets: 0. The Windows host has the known pre-existing POSIX-onlyfcntlimport issue in untouched test utilities; Linux CI is unaffected. - DCO: commits are signed off. Expected skips from the May 18 dev-pod run: -TestPostBusyThresholdLive::test_post_busy_threshold_returns_2xx: frontend image lacks/busy_threshold; admission control lands later. -TestScalingRealMutation::test_set_component_replicas_mutates_dgd_spec: opt-in viaRUN_DISRUPTIVE_TESTS=1; passes when enabled. ## Merge Strategy Rebase-and-merge, no squash, per split-plan section 4.3. ## Stack Order Merge #9682 first. After #9682 lands, #9683 can be rebased/retargeted onto the updated base and merged. #9790 is a sibling stacked on #9682 and can merge after #9682 as well.