feat(power-agent): add DCGM dual actuator (opt-in; NVML remains default) by kaim-eng · Pull Request #9790 · ai-dynamo/dynamo

kaim-eng · 2026-05-20T16:42:07Z

Part of the PR #9369 split plan.
This is a sibling foundation PR to the planner stack: PR 1b - Power Agent DCGM actuator.

Predecessor: #9682 - Power Agent NVML DaemonSet (base = pr1a/power-agent; #9682 must land first)
Successor: none (sibling to #9683 -> #9687)

Summary

Adds an opt-in DCGM actuator to the Power Agent from #9682. Operators running the GPU Operator with dcgm.enabled=true can set agent.actuator=dcgm so caps are written through the cluster nvidia-dcgm hostengine. Default NVML behavior remains equivalent for operators that do not opt into DCGM, but the implementation is refactored through the new actuator protocol and shared managed-state module.

Current branch shape: 31 files changed against pr1a/power-agent (9,526 insertions / 352 deletions).

Organized as 3 logical changesets, with review-fix commits layered on top:

Design doc: docs/design-docs/power-agent-dual-actuator.md documents the DCGM/NVML decision arc, ACTION_CLEAR mitigation, UUID-keyed identity, LIMIT_DEF vs LIMIT_MAX, Set-then-Enforce-soft behavior, and changelog.
Power Agent implementation:
- components/power_agent/actuator.py: Actuator protocol, NvmlActuator, and DcgmActuator with UUID-keyed DCGM/NVML identity mapping, _apply_cap_inner, and _with_reconnect for DCGM_ST_CONNECTION_NOT_VALID.
- components/power_agent/power_agent.py: reconcile, SIGTERM, and orphan recovery dispatch through the active actuator; adds --actuator, --dcgm-host, --dcgm-port, and --dcgm-enforce; sharpens apply_failures_total semantics and adds dcgm_enforce_failures_total.
- components/power_agent/managed_state.py: single source of truth for mutable managed-GPU sets so __main__ and imported power_agent module copies do not split state.
- components/power_agent/Dockerfile: vendors pydcgm bindings and libdcgm.so from nvcr.io/nvidia/cloud-native/dcgm:4.5.1-1-ubuntu22.04.
- components/power_agent/tests/: 9 new test files covering protocol satisfaction, CLI dispatch, DCGM actuator behavior, orphan recovery, reconcile wiring, SIGTERM dispatch, shared-state identity, and DCGM UUID-identity restore. e2e_actuator_parity.py is under-test scaffolding for real-GPU parity and is excluded from CI by default.
Helm chart:
- Chart.yaml: chart 1.0.0 -> 1.3.0, appVersion 1.0.0 -> 1.2.0.
- values.yaml: adds agent.actuator and agent.dcgm.{host,port,enforce}; defaults keep actuator=nvml and enforce=false.
- templates/daemonset.yaml and templates/dev-pod.yaml: render actuator/DCGM flags and dev-mode script mounts.
- templates/_helpers.tpl: validates actuator, enforce, and image pinning at template time.
- tests/validate_{actuator,enforce,image_tag}_test.yaml: 49 helm-unittest cases (10 actuator, 14 enforce, 25 image_tag).

Reviewer start points:

docs/design-docs/power-agent-dual-actuator.md sections 1, 4, 6, 10, and 13.
components/power_agent/actuator.py, read with design-doc section 6.
components/power_agent/power_agent.py diff versus feat(power-agent): per-node power-cap enforcement DaemonSet #9682, especially reconcile dispatch, metric semantics, and SIGTERM routing.
deploy/helm/charts/power-agent/ values, helpers, templates, and tests.

Validation

Required/current GitHub checks pass at 33198612b9; snapshot/operator/rust jobs and team-request are intentionally skipped by filters.
pytest components/power_agent/tests/: full suite passes locally and required CI is green. The suite has grown past the original 133 tests with shared-state and DCGM UUID-identity regression coverage.
helm unittest deploy/helm/charts/power-agent: 49 passed (10 actuator, 14 enforce, 25 image_tag).
helm template smoke: NVML, DCGM with enforce=true, and invalid enforce=treu render/reject as expected.
Pre-commit on the 31 touched files: isort, black, flake8, codespell, ruff, check-yaml, check-shebang-executable, and mixed-line-ending pass. pytest-marker-report fails only on the Windows dev workstation due to the known pre-existing POSIX-only fcntl import issue; Linux CI is unaffected.
DCO: every commit is signed off.

Merge Strategy

Rebase-and-merge, no squash, after condensing review-fix commits back into the intended docs / code / helm logical split. This preserves a clean 3-commit history in main while retaining the reviewed changes.

Stack Order

Merge #9682 first. #9790 is stacked on pr1a/power-agent and becomes actionable after #9682 lands. #9683 is a sibling on the same base and can merge after #9682 as well.

copy-pr-bot · 2026-05-20T16:42:10Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-05-20T16:49:04Z

Walkthrough

This pull request introduces a complete Power Agent DaemonSet for enforcing per-GPU power caps on Dynamo worker nodes, featuring a pluggable dual-actuator design (NVML default, DCGM opt-in), comprehensive test coverage, and production-ready Helm chart deployment with template-time validation.

Changes

Power Agent Dual-Actuator System

Layer / File(s)	Summary
Actuator Protocol and Implementations `components/power_agent/actuator.py`, `components/power_agent/logger.py`, `components/power_agent/tests/test_actuator_protocol.py`, `components/power_agent/tests/test_actuator_selection.py`, `components/power_agent/tests/test_dcgm_actuator.py`, `components/power_agent/tests/e2e_actuator_parity.py`	Defines `Actuator` Protocol with lifecycle and cap control APIs. Implements `NvmlActuator` with lazy NVML imports and `DcgmActuator` with TCP hostengine connection, stale-handle recovery, and cross-library UUID mapping. Includes protocol conformance tests, factory selection tests, comprehensive DcgmActuator unit tests, and e2e parity harness comparing both actuators against `nvidia-smi` ground truth.
Power Agent Core and Reconciliation `components/power_agent/power_agent.py`, `components/power_agent/tests/test_reconcile_wiring.py`, `components/power_agent/tests/test_shutdown.py`, `components/power_agent/tests/test_orphan_recovery.py`, `components/power_agent/tests/test_managed_state_parser.py`, `components/power_agent/tests/test_multi_pod_policy.py`	Implements `PowerAgent` with periodic GPU reconciliation (pod UID mapping, multi-pod cap policy, actuator-driven application), optional Prometheus metrics, SIGTERM-driven graceful shutdown routing through active actuator, and cold-start orphan recovery using UUID-gated managed GPU state. Tests validate actuator routing contract, shutdown behavior with failure isolation, orphan recovery logic, managed state persistence robustness, and policy fallback metrics semantics.
Container Image and Component Documentation `components/power_agent/Dockerfile`, `components/power_agent/README.md`	Multi-stage Dockerfile supporting both NVML and DCGM with vendored DCGM Python bindings and libraries. Component README documents dual-actuator modes, Helm chart compatibility requirements, and troubleshooting including Prometheus metrics semantics.
Helm Chart Values and Configuration `deploy/helm/charts/power-agent/Chart.yaml`, `deploy/helm/charts/power-agent/values.yaml`, `deploy/helm/charts/power-agent/README.md`	Chart metadata (v1.2.0, appVersion v1.1.0) and comprehensive `values.yaml` with image pinning (tag/digest), actuator selection, DCGM connection options, and dev-mode ConfigMap requirements. Chart README provides production and dev install guidance, image pinning rules, troubleshooting, and cross-writer flapping warnings.
Helm Template Validation and Deployment `deploy/helm/charts/power-agent/templates/_helpers.tpl`, `deploy/helm/charts/power-agent/templates/daemonset.yaml`, `deploy/helm/charts/power-agent/templates/dev-pod.yaml`, `deploy/helm/charts/power-agent/templates/NOTES.txt`	Strict image validation helpers (tag/digest mutual exclusivity, `:latest` rejection, whitespace rejection, SHA-256 format enforcement) and actuator/enforce validators. `imageRef` helper renders canonical OCI references. DaemonSet and dev-Pod templates conditionally inject actuator and DCGM flags. Post-install notes guide mode-specific verification and iteration.
Helm Unittest Test Suites `deploy/helm/charts/power-agent/tests/validate_image_tag_test.yaml`, `deploy/helm/charts/power-agent/tests/validate_actuator_test.yaml`, `deploy/helm/charts/power-agent/tests/validate_enforce_test.yaml`	46 unit test cases validating template-time behavior: image tag/digest mutual exclusivity, `:latest` rejection, whitespace/format validation; actuator selection (nvml default, dcgm opt-in, invalid rejection); dcgm-enforce boolean allowlist (yes/no/on/off/true/false with case normalization, rejection of invalid values, cross-validator behavior for unused modes).
Design Documentation `docs/design-docs/power-agent-helm-chart-plan.md`	Comprehensive plan document updated from draft to implementation record, including v1.1.0 dual-actuator expansion and v1.2.0 image-pinning hardening. Detailed sections on chart schema, validation strategy, template-time validators, installation flows, commit gates, risk analysis, and pre-flight/post-push checklists.

🎯 4 (Complex) | ⏱️ ~60 minutes

Aligns docs/design-docs/power-agent-helm-chart-plan.md with the v1.1.0 chart that PR #9790 ships. The plan was authored for the NVML-only v1.0.0 chart in PR #9682 and went stale once PR #9790 layered agent.actuator + agent.dcgm.* + validateActuator + validateEnforce + the helm-unittest suite on top. Two changelog rows capture the refresh: v1.3 - first pass: 8 reviewer findings closed (3 blocking, 4 major, 2 medium, 1 low) on Status header, values surface, dev ConfigMap recipe, image-tag pinning, helm-unittest gating, internal-dev-doc references, daemonset name, file/LOC accounting. v1.4 - second pass: 5 follow-up findings closed (2 major, 2 medium, 1 low) on Status self-contradiction, stale section 4.2 dev-block comment, overstated helm-unittest coverage prose, unrunnable section 5.4 positive helm template overlays (missing --set image.tag), and the lingering filename reference in the v1.3 changelog. No design reversal: every section 6 decision and the chart shape are unchanged. Only the values surface (extended), helper set (extended with two template-time validators), and validation-gate list (helm-unittest now required) grew. Every claim in the refreshed doc was verified against on-disk state: power_agent.py:706-804 for the 8-flag CLI surface, values.yaml for the dev-block recipe, _helpers.tpl for the five-helper set, and the two tests/validate_*_test.yaml files for the 24 enumerated unittest cases. Signed-off-by: Kai Ma <kaim@nvidia.com>

Aligns docs/design-docs/power-agent-helm-chart-plan.md with the v1.2.0 chart that the preceding commit ships. The plan was authored for the v1.0.0 NVML-only chart in PR #9682, refreshed to v1.1.0 in the PR #9790 dual-actuator work, and went stale again once v1.2.0 landed image.digest + the canonical-OCI imageRef helper on top. Four reviewer findings closed (v1.5 changelog row covers all four): (1) Truncated SHA-256 digests slipped through the v1.1.0 helper - already fixed in the preceding chart commit; this commit propagates the rule into the §4.4 helper snippet and the §5.4 validation gates. (2) Whitespace-padded tags rendered raw image references - same treatment: §4.4 documents the trim-then-reject contract. (3) §4.3 / §4.4 / §5.4 stale on image.digest: - §4.3 CodeRabbit-comment row rewritten to describe both helpers (validateImageTag + imageRef), both fields (tag + digest), the canonical repo@digest form, and the PR9682 follow-up that added the separate field. - §4.4 helper snippet replaced with the actual v1.2.0 validator (full rule set + per-rule rationale comments) plus the new imageRef helper. - §5.4 expected helm-unittest output bumped 24 -> 46 passed with a one-line breakdown of the +22 new cases. (4) §4.1 / §4.2 stale on file count + helper list + values surface: - §4.1 said 13 files (omitted .helmignore and the new validate_image_tag_test.yaml). Corrected to 14 files (7 templates + 3 helm-unittests + 4 root files). The _helpers.tpl helper-list line gained imageRef and chart entries. The daemonset.yaml annotation now mentions it routes through imageRef. LOC estimate bumped ~1,430 -> ~1,700 with a breakdown of what v1.2.0 added on top of v1.0.0 / v1.1.0. - §4.2 values snippet had only image.tag with no image.digest. Added the field with the same per-rule comment block that ships in values.yaml (OCI form, 64-hex requirement, mutex with image.tag, PR9682 rationale). Status header bumped to chart v1.2.0 with a one-line v1.2.0 rationale (additive opt-in field, no breaking change to existing tag-pinned installs). Revision-history table gains the v1.5 entry summarising all four findings; older v1.1 - v1.4 entries untouched. No design reversal: every §6 decision and the chart shape are unchanged. Only the values surface (one additive field), helper set (two helpers: imageRef new, validateImageTag rewritten), and helm-unittest count (24 -> 46) grew. Every claim in the refreshed doc was verified against on-disk state: Chart.yaml for the version bump, values.yaml for the image.digest field comment, _helpers.tpl for both helpers, tests/validate_image_tag_test.yaml for the 22 enumerated cases (verified via `helm unittest`). Signed-off-by: Kai Ma <kaim@nvidia.com>

…ttributes (v1.10 e2e fix) First in-cluster DCGM parity run on an 8xA100 SXM node (PR #9790 e2e) surfaced three production defects in DcgmActuator, all rooted in pydcgm API misuse that the mocked unit suite couldn't catch: 1. `get_uuid` and `_ensure_identity_map._read_dcgm_uuids` read `DCGM_FI_DEV_UUID` via `dcgmEntityGetLatestValues`. That API returns the field cache; on a fresh nv-hostengine 4.5.3 with no companion watcher subscribed via `dcgmWatchFields`, the cache returns `DCGM_STR_BLANK = "<<<NULL>>>"`. The identity map then raised "8 GPU UUID(s) visible to DCGM are not visible to NVML" on every cluster startup. 2. `constraints_w` / `current_w` / `default_w` read the four `DCGM_FI_DEV_POWER_MGMT_LIMIT{,_MIN,_MAX,_DEF}` fields the same way. Every read returned `DCGM_FP64_BLANK = 2^47`, so the clamp in `apply_cap` escalated a requested 250 W up to the blank max and tried to write 140737488355328 W. 3. The workload-power-profile blanking loop in `_apply_cap_inner` reached for `dcgm_structs.DCGM_INT32_BLANK`. The constant actually lives in `dcgmvalue` (`/shared/pydcgm/dcgmvalue.py:17` in the DCGM 4.5.3 apt bindings); pre-v1.10 the first cap write raised `AttributeError: module 'dcgm_structs' has no attribute 'DCGM_INT32_BLANK'`. Fix: - Add a single `_power_limits(gpu_idx)` helper that calls `DcgmSystem.discovery.GetGpuAttributes(gpu_id).powerLimits` - the synchronous device-info API that wraps `dcgmGetDeviceAttributes`. One RPC instead of four, returns a `c_dcgmDevicePowerLimits_v1` with integer-watt fields populated from the hostengine's discovery state, no field-cache dependency. - `constraints_w`, `current_w`, `default_w`, `restore_default` all route through this helper. - `get_uuid` and `_ensure_identity_map._read_dcgm_uuids` now use `GetGpuAttributes(gpu_id).identifiers.uuid`. - `apply_cap` imports `dcgmvalue` and swaps all seven `DCGM_INT32_BLANK` references onto it. Test infrastructure overhaul (`test_dcgm_actuator.py`): - `_make_dcgm_modules` adds a `dcgmvalue` MagicMock with all four blank sentinels; removes `DCGM_INT32_BLANK` from `dcgm_structs` mock. - `_make_gpu_attrs` extended to carry both `.identifiers.uuid` and `.powerLimits.{cur,default,enforced,min,max}PowerLimit`. - `_wire_handle` wires `GetGpuAttributes(gid)` per-gpu_id. - `_seed_constraints_and_uuid` consolidated onto the unified GetGpuAttributes path. - Five `modules["dcgm_agent"].dcgmEntityGetLatestValues.assert_not_called()` regression guards added across the suite - locks in "no more silent field-cache reads of static device info." E2e re-validation: with this fix in place, the same 8xA100 rig reports `PASS: NvmlActuator and DcgmActuator agree on all probes (tolerance +/- 2.0 W)`, exit code 0, with nvidia-smi confirming `apply_cap(250)` -> 250 W and `restore_default()` -> 400 W on every GPU. Doc: `docs/design-docs/power-agent-dual-actuator.md` v1.10 changelog row added; §6.1 / §6.3 / §7 / §12 / §13 patched to match (Protocol-level docstrings for `current_w`/`default_w`, `get_uuid` / `_power_limits` / `restore_default` code samples, `_apply_cap_inner` import + blank-constant sample, attribution table row 2, summary clause, reference-material entries for `pydcgm/dcgmvalue.py` and `GetGpuAttributes`). Local gates: 163/163 power-agent unit tests pass; pre-commit isort/black/flake8/codespell/ruff/EOF/whitespace clean on the three changed files. Broader `tests/` + `pytest-marker-report` blocked by the known Windows-only `fcntl` import in `tests/conftest.py:22`; CI runs those on Linux. Signed-off-by: Kai Ma <kaim@nvidia.com>

Three Dockerfile bugs combined to make the DCGM-mode image unbuildable on a fresh checkout. Fixing any one in isolation leaves the build broken, so they travel together: 1. DCGM_IMAGE default 'nvcr.io/nvidia/cloud-native/dcgm:4.2.3-2-ubuntu22.04' does not exist on NGC (verified 2026-05-21 via 'docker manifest inspect' → 404). Bump to 4.5.1-1-ubuntu22.04, the only resolvable 4.x tag. 2. DCGM 4.5+ relocated python bindings from /usr/local/dcgm/bindings/python3/ to /usr/share/datacenter-gpu-manager-4/bindings/python3/. The previous COPY would silently copy zero files under the new pin. Switch the source path to the 4.5+ location. 3. NGC's DCGM 4.5+ runtime image ships pydcgm with DcgmGroup.py:20 doing 'import logger' — but logger.py lives in DCGM's source tree under testing/python3/ and is NOT packaged. Without a shim every DcgmGroup construction raises ModuleNotFoundError. Add a 10-line stdlib-logging adapter at components/power_agent/logger.py and COPY it into /opt/dcgm/python/logger.py during the runtime stage. This unblocks 'docker build -f components/power_agent/Dockerfile' on a fresh clone (verified locally via 'docker buildx build --build-arg DCGM_IMAGE=...4.5.1-1-ubuntu22.04' against viking-prod-216 on 2026-05-21, image pushed to ttl.sh/dynamo-pa-kaim-dcgm45-v2:24h and used by the Path-B live test on aks-a100b-22138447-vmss000000). Refs: PR #9790 review, Power Agent live-test findings #1/#2/#6. Signed-off-by: Kai Ma <kaim@nvidia.com>

components/power_agent/actuator.py:807 uses dcgm_structs.c_dcgmDeviceConfig_v2, a struct that only exists in DCGM 4.x. With 3.x bindings the agent's DcgmActuator.init() succeeded, opened the hostengine connection, and ran NVML init cleanly — then crashed mid-first-reconcile with 'AttributeError: module dcgm_structs has no attribute c_dcgmDeviceConfig_v2' after some GPUs were already capped. The SIGTERM-restore path won't run when the actuator never finished registering with _active_actuator, so the GPUs are left at custom caps. Add a 7-line hasattr check immediately after the dcgm_structs import in init() that raises a RuntimeError with: - the missing struct name (so the error is grep-able) - the required DCGM major (>=4.0) - the DCGM_IMAGE build-arg + canonical 4.5.1 tag (so operators know exactly what to bump) Guards against accidental Dockerfile regressions to a 3.x base image. Pair with the Dockerfile fix in 4820ca7 (which bumps the default to 4.5.1-1-ubuntu22.04) so the default build path is consistent with the runtime contract. Unit test covers both happy path (DCGM 4.x mock) and 3.x rejection (MagicMock with c_dcgmDeviceConfig_v2 deleted), asserting init fails BEFORE the first pydcgm.DcgmHandle call so misconfigured deployments don't half-init and leave hostengine sockets dangling. Refs: PR #9790 review, Power Agent live-test finding #5. Signed-off-by: Kai Ma <kaim@nvidia.com>

… log, pod-UID dedup, argparse) Folds the four CodeRabbit findings on the foundation PR back into this branch so each fix lives with the code it changes, rather than leaking into the downstream DCGM-actuator PR (#9790): * _load_previously_managed_gpus: catch OSError (not just FileNotFoundError) and validate that the JSON root is a dict and managed_uuids is a list. Malformed state files now log a warning and return an empty set instead of crashing the agent at startup. * _handle_sigterm: replace 'except Exception: pass' on pynvml.nvmlShutdown() with logger.exception so shutdown-time NVML faults appear in pod logs. We still fall through to _shutdown.set() so SIGTERM never hangs the container. * _reconcile_gpu: dedup the (pod_uid, annotation) list by UID before applying multi-pod policy. A single pod with N PIDs on one GPU was being counted as N pods, falsely tripping the multi-pod-conflict branch and the multi_pod_gpu_total metric. * main(): move 'import argparse' to module scope per the project's import-placement convention. Regression coverage: existing components/power_agent/tests/ suite (43 tests) still passes locally; behavior-specific tests for these four fixes already live on PR #9790 and remain there. Signed-off-by: Kai Ma <kaim@nvidia.com>

Aligns docs/design-docs/power-agent-helm-chart-plan.md with the v1.1.0 chart that PR #9790 ships. The plan was authored for the NVML-only v1.0.0 chart in PR #9682 and went stale once PR #9790 layered agent.actuator + agent.dcgm.* + validateActuator + validateEnforce + the helm-unittest suite on top. Two changelog rows capture the refresh: v1.3 - first pass: 8 reviewer findings closed (3 blocking, 4 major, 2 medium, 1 low) on Status header, values surface, dev ConfigMap recipe, image-tag pinning, helm-unittest gating, internal-dev-doc references, daemonset name, file/LOC accounting. v1.4 - second pass: 5 follow-up findings closed (2 major, 2 medium, 1 low) on Status self-contradiction, stale section 4.2 dev-block comment, overstated helm-unittest coverage prose, unrunnable section 5.4 positive helm template overlays (missing --set image.tag), and the lingering filename reference in the v1.3 changelog. No design reversal: every section 6 decision and the chart shape are unchanged. Only the values surface (extended), helper set (extended with two template-time validators), and validation-gate list (helm-unittest now required) grew. Every claim in the refreshed doc was verified against on-disk state: power_agent.py:706-804 for the 8-flag CLI surface, values.yaml for the dev-block recipe, _helpers.tpl for the five-helper set, and the two tests/validate_*_test.yaml files for the 24 enumerated unittest cases. Signed-off-by: Kai Ma <kaim@nvidia.com>

Aligns docs/design-docs/power-agent-helm-chart-plan.md with the v1.2.0 chart that the preceding commit ships. The plan was authored for the v1.0.0 NVML-only chart in PR #9682, refreshed to v1.1.0 in the PR #9790 dual-actuator work, and went stale again once v1.2.0 landed image.digest + the canonical-OCI imageRef helper on top. Four reviewer findings closed (v1.5 changelog row covers all four): (1) Truncated SHA-256 digests slipped through the v1.1.0 helper - already fixed in the preceding chart commit; this commit propagates the rule into the §4.4 helper snippet and the §5.4 validation gates. (2) Whitespace-padded tags rendered raw image references - same treatment: §4.4 documents the trim-then-reject contract. (3) §4.3 / §4.4 / §5.4 stale on image.digest: - §4.3 CodeRabbit-comment row rewritten to describe both helpers (validateImageTag + imageRef), both fields (tag + digest), the canonical repo@digest form, and the PR9682 follow-up that added the separate field. - §4.4 helper snippet replaced with the actual v1.2.0 validator (full rule set + per-rule rationale comments) plus the new imageRef helper. - §5.4 expected helm-unittest output bumped 24 -> 46 passed with a one-line breakdown of the +22 new cases. (4) §4.1 / §4.2 stale on file count + helper list + values surface: - §4.1 said 13 files (omitted .helmignore and the new validate_image_tag_test.yaml). Corrected to 14 files (7 templates + 3 helm-unittests + 4 root files). The _helpers.tpl helper-list line gained imageRef and chart entries. The daemonset.yaml annotation now mentions it routes through imageRef. LOC estimate bumped ~1,430 -> ~1,700 with a breakdown of what v1.2.0 added on top of v1.0.0 / v1.1.0. - §4.2 values snippet had only image.tag with no image.digest. Added the field with the same per-rule comment block that ships in values.yaml (OCI form, 64-hex requirement, mutex with image.tag, PR9682 rationale). Status header bumped to chart v1.2.0 with a one-line v1.2.0 rationale (additive opt-in field, no breaking change to existing tag-pinned installs). Revision-history table gains the v1.5 entry summarising all four findings; older v1.1 - v1.4 entries untouched. No design reversal: every §6 decision and the chart shape are unchanged. Only the values surface (one additive field), helper set (two helpers: imageRef new, validateImageTag rewritten), and helm-unittest count (24 -> 46) grew. Every claim in the refreshed doc was verified against on-disk state: Chart.yaml for the version bump, values.yaml for the image.digest field comment, _helpers.tpl for both helpers, tests/validate_image_tag_test.yaml for the 22 enumerated cases (verified via `helm unittest`). Signed-off-by: Kai Ma <kaim@nvidia.com>

…ttributes (v1.10 e2e fix) First in-cluster DCGM parity run on an 8xA100 SXM node (PR #9790 e2e) surfaced three production defects in DcgmActuator, all rooted in pydcgm API misuse that the mocked unit suite couldn't catch: 1. `get_uuid` and `_ensure_identity_map._read_dcgm_uuids` read `DCGM_FI_DEV_UUID` via `dcgmEntityGetLatestValues`. That API returns the field cache; on a fresh nv-hostengine 4.5.3 with no companion watcher subscribed via `dcgmWatchFields`, the cache returns `DCGM_STR_BLANK = "<<<NULL>>>"`. The identity map then raised "8 GPU UUID(s) visible to DCGM are not visible to NVML" on every cluster startup. 2. `constraints_w` / `current_w` / `default_w` read the four `DCGM_FI_DEV_POWER_MGMT_LIMIT{,_MIN,_MAX,_DEF}` fields the same way. Every read returned `DCGM_FP64_BLANK = 2^47`, so the clamp in `apply_cap` escalated a requested 250 W up to the blank max and tried to write 140737488355328 W. 3. The workload-power-profile blanking loop in `_apply_cap_inner` reached for `dcgm_structs.DCGM_INT32_BLANK`. The constant actually lives in `dcgmvalue` (`/shared/pydcgm/dcgmvalue.py:17` in the DCGM 4.5.3 apt bindings); pre-v1.10 the first cap write raised `AttributeError: module 'dcgm_structs' has no attribute 'DCGM_INT32_BLANK'`. Fix: - Add a single `_power_limits(gpu_idx)` helper that calls `DcgmSystem.discovery.GetGpuAttributes(gpu_id).powerLimits` - the synchronous device-info API that wraps `dcgmGetDeviceAttributes`. One RPC instead of four, returns a `c_dcgmDevicePowerLimits_v1` with integer-watt fields populated from the hostengine's discovery state, no field-cache dependency. - `constraints_w`, `current_w`, `default_w`, `restore_default` all route through this helper. - `get_uuid` and `_ensure_identity_map._read_dcgm_uuids` now use `GetGpuAttributes(gpu_id).identifiers.uuid`. - `apply_cap` imports `dcgmvalue` and swaps all seven `DCGM_INT32_BLANK` references onto it. Test infrastructure overhaul (`test_dcgm_actuator.py`): - `_make_dcgm_modules` adds a `dcgmvalue` MagicMock with all four blank sentinels; removes `DCGM_INT32_BLANK` from `dcgm_structs` mock. - `_make_gpu_attrs` extended to carry both `.identifiers.uuid` and `.powerLimits.{cur,default,enforced,min,max}PowerLimit`. - `_wire_handle` wires `GetGpuAttributes(gid)` per-gpu_id. - `_seed_constraints_and_uuid` consolidated onto the unified GetGpuAttributes path. - Five `modules["dcgm_agent"].dcgmEntityGetLatestValues.assert_not_called()` regression guards added across the suite - locks in "no more silent field-cache reads of static device info." E2e re-validation: with this fix in place, the same 8xA100 rig reports `PASS: NvmlActuator and DcgmActuator agree on all probes (tolerance +/- 2.0 W)`, exit code 0, with nvidia-smi confirming `apply_cap(250)` -> 250 W and `restore_default()` -> 400 W on every GPU. Doc: `docs/design-docs/power-agent-dual-actuator.md` v1.10 changelog row added; §6.1 / §6.3 / §7 / §12 / §13 patched to match (Protocol-level docstrings for `current_w`/`default_w`, `get_uuid` / `_power_limits` / `restore_default` code samples, `_apply_cap_inner` import + blank-constant sample, attribution table row 2, summary clause, reference-material entries for `pydcgm/dcgmvalue.py` and `GetGpuAttributes`). Local gates: 163/163 power-agent unit tests pass; pre-commit isort/black/flake8/codespell/ruff/EOF/whitespace clean on the three changed files. Broader `tests/` + `pytest-marker-report` blocked by the known Windows-only `fcntl` import in `tests/conftest.py:22`; CI runs those on Linux. Signed-off-by: Kai Ma <kaim@nvidia.com>

Three Dockerfile bugs combined to make the DCGM-mode image unbuildable on a fresh checkout. Fixing any one in isolation leaves the build broken, so they travel together: 1. DCGM_IMAGE default 'nvcr.io/nvidia/cloud-native/dcgm:4.2.3-2-ubuntu22.04' does not exist on NGC (verified 2026-05-21 via 'docker manifest inspect' → 404). Bump to 4.5.1-1-ubuntu22.04, the only resolvable 4.x tag. 2. DCGM 4.5+ relocated python bindings from /usr/local/dcgm/bindings/python3/ to /usr/share/datacenter-gpu-manager-4/bindings/python3/. The previous COPY would silently copy zero files under the new pin. Switch the source path to the 4.5+ location. 3. NGC's DCGM 4.5+ runtime image ships pydcgm with DcgmGroup.py:20 doing 'import logger' — but logger.py lives in DCGM's source tree under testing/python3/ and is NOT packaged. Without a shim every DcgmGroup construction raises ModuleNotFoundError. Add a 10-line stdlib-logging adapter at components/power_agent/logger.py and COPY it into /opt/dcgm/python/logger.py during the runtime stage. This unblocks 'docker build -f components/power_agent/Dockerfile' on a fresh clone (verified locally via 'docker buildx build --build-arg DCGM_IMAGE=...4.5.1-1-ubuntu22.04' against viking-prod-216 on 2026-05-21, image pushed to ttl.sh/dynamo-pa-kaim-dcgm45-v2:24h and used by the Path-B live test on aks-a100b-22138447-vmss000000). Refs: PR #9790 review, Power Agent live-test findings #1/#2/#6. Signed-off-by: Kai Ma <kaim@nvidia.com>

components/power_agent/actuator.py:807 uses dcgm_structs.c_dcgmDeviceConfig_v2, a struct that only exists in DCGM 4.x. With 3.x bindings the agent's DcgmActuator.init() succeeded, opened the hostengine connection, and ran NVML init cleanly — then crashed mid-first-reconcile with 'AttributeError: module dcgm_structs has no attribute c_dcgmDeviceConfig_v2' after some GPUs were already capped. The SIGTERM-restore path won't run when the actuator never finished registering with _active_actuator, so the GPUs are left at custom caps. Add a 7-line hasattr check immediately after the dcgm_structs import in init() that raises a RuntimeError with: - the missing struct name (so the error is grep-able) - the required DCGM major (>=4.0) - the DCGM_IMAGE build-arg + canonical 4.5.1 tag (so operators know exactly what to bump) Guards against accidental Dockerfile regressions to a 3.x base image. Pair with the Dockerfile fix in 4820ca7 (which bumps the default to 4.5.1-1-ubuntu22.04) so the default build path is consistent with the runtime contract. Unit test covers both happy path (DCGM 4.x mock) and 3.x rejection (MagicMock with c_dcgmDeviceConfig_v2 deleted), asserting init fails BEFORE the first pydcgm.DcgmHandle call so misconfigured deployments don't half-init and leave hostengine sockets dangling. Refs: PR #9790 review, Power Agent live-test finding #5. Signed-off-by: Kai Ma <kaim@nvidia.com>

kaim-eng · 2026-05-25T14:35:25Z

Rebased onto the refreshed pr1a/power-agent (tip 29f0831) and main (~316 commits forward from the prior base). Force-pushed pr1b/power-agent-dcgm-actuator to abcc922.

What moved:

The four PR feat(power-agent): per-node power-cap enforcement DaemonSet #9682 runtime-hardening fixes (state-file parser, NVML shutdown logging, pod-UID dedup, argparse import scope) were backported to pr1a/power-agent so each fix lives with the code it changes. Replies posted on every PR feat(power-agent): per-node power-cap enforcement DaemonSet #9682 review thread.
The advisory-mode gate on _apply_power_annotations landed on pr1b/planner-infra (feat(planner): power infrastructure — pod annotation, RBAC, config, Prometheus #9683, commit 8f47413) — addresses dynamo-ops's core/base.py:976 finding.
On this PR, commit 36b9350 keeps the actuator-aware refinements: the per-entry isinstance(str) filter on the state parser, the actuator-aware shutdown logging that names DCGM vs NVML in the error message, and the +29 new test cases under components/power_agent/tests/test_managed_state_parser.py / test_reconcile_wiring.py / test_shutdown.py / test_dcgm_actuator.py.

CI rerunning now.

@coderabbitai review

coderabbitai · 2026-05-25T14:35:41Z

@kaim-eng Thanks for the detailed rebase summary! The backport strategy—keeping each hardening fix with the code it touches on pr1a while retaining the actuator-aware refinements and the 29 new test cases here on pr1b—looks like a clean separation. I'll kick off a fresh review of the force-pushed tip now.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)

components/power_agent/tests/test_managed_state_parser.py (1)
85-275: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add required pytest markers (scheduling + GPU + type) to this test module.

The newly added tests are unmarked and don’t satisfy the required test marker policy.

As per coding guidelines: “Every test must have at least one scheduling marker… at least one GPU marker… and at least one type marker…”.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/power_agent/tests/test_managed_state_parser.py` around lines 85 -
275, Import pytest at the top of the test module and add a module-level
pytestmark list that applies the required markers (scheduling + GPU + type) to
all tests, e.g. place `import pytest` and `pytestmark = [pytest.mark.scheduling,
pytest.mark.gpu, pytest.mark.unit]` (or whichever type marker is appropriate)
near the top so TestHappyPath, TestMissingFile, TestOsErrorSiblings,
TestMalformedJson, TestNonObjectRoot, TestManagedUuidsNotAList, and
TestEntryTypeValidation all inherit the markers.
components/power_agent/tests/test_orphan_recovery.py (1)
85-286: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add required pytest markers (scheduling + GPU + type) to this test module.

This suite needs the mandatory test markers before merge.

As per coding guidelines: “Every test must have at least one scheduling marker… at least one GPU marker… and at least one type marker…”.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/power_agent/tests/test_orphan_recovery.py` around lines 85 - 286,
The module lacks required pytest markers (scheduling + GPU + type); add a
module-level pytestmark so every test in this file gets those markers: import
pytest at top of the file and define pytestmark =
[pytest.mark.<scheduling_marker>, pytest.mark.<gpu_marker>,
pytest.mark.<type_marker>] (replace placeholders with the project's chosen
markers), which will apply to the test classes such as TestUuidGating,
TestWorkloadBusySkip, TestCurrentVsDefaultGuard, TestPerGpuExceptionIsolation,
and TestManagedSetPruning.
components/power_agent/tests/test_shutdown.py (1)
46-229: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add required pytest markers (scheduling + GPU + type) to this test module.

This file currently lacks the mandatory test classification markers.

As per coding guidelines: “Every test must have at least one scheduling marker… at least one GPU marker… and at least one type marker…”.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/power_agent/tests/test_shutdown.py` around lines 46 - 229, Add the
required pytest classification markers by defining a module-level pytestmark
list (e.g. pytestmark = [pytest.mark.scheduling, pytest.mark.gpu,
pytest.mark.type]) so every test in this module (including
TestSigtermViaActuator and TestSigtermFallback and their methods like
_handle_sigterm interactions) carries scheduling, GPU and type markers; place
the pytestmark definition near the top of the file (above the test classes) and
import pytest if not already imported.
components/power_agent/tests/test_reconcile_wiring.py (1)
46-326: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add required pytest markers (scheduling + GPU + type) for this module.

These tests currently have none of the required markers.
Suggested patch
+import pytest
+
+pytestmark = [
+    pytest.mark.post_merge,  # or pre_merge/nightly/etc. per intended cadence
+    pytest.mark.gpu_0,
+    pytest.mark.unit,
+]
As per coding guidelines: “Every test must have at least one scheduling marker… at least one GPU marker… and at least one type marker…”.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/power_agent/tests/test_reconcile_wiring.py` around lines 46 - 326,
Add the required pytest markers by importing pytest and applying scheduling,
GPU, and type markers to this test module; either set a module-level pytestmark
list or add decorators to each test class (e.g.
TestReconcileGpuRoutesViaActuator, TestReconcileGpuDedupesByPodUid,
TestReconcileGpuPolicyResolution) to include at least one scheduling marker, one
GPU marker, and one type marker per the guidelines so all tests in this file are
properly marked.

🧹 Nitpick comments (4)

deploy/helm/charts/power-agent/tests/validate_enforce_test.yaml (1)

54-95: ⚡ Quick win

Strengthen enforce-flag assertion coverage for accepted inputs and nvml dead-config checks.

Line 54 onward accepts multiple string forms with only notFailedTemplate, and Line 136 claims “ANY allowlisted value” absence while checking only a subset. Add exact contains/notContains assertions for the remaining allowlisted renderable forms to catch silent drift.

Proposed test hardening pattern

  - it: should accept enforce=1 (string)
    set:
      agent.dcgm.enforce: "1"
    asserts:
      - notFailedTemplate: {}
+      - contains:
+          path: spec.template.spec.containers[0].command
+          content: --dcgm-enforce=1

  - it: should accept enforce=0 (string)
    set:
      agent.dcgm.enforce: "0"
    asserts:
      - notFailedTemplate: {}
+      - contains:
+          path: spec.template.spec.containers[0].command
+          content: --dcgm-enforce=0

  - it: should accept enforce=yes
    set:
      agent.dcgm.enforce: "yes"
    asserts:
      - notFailedTemplate: {}
+      - contains:
+          path: spec.template.spec.containers[0].command
+          content: --dcgm-enforce=yes

      - notContains:
          path: spec.template.spec.containers[0].command
          content: --dcgm-enforce=false
+      - notContains:
+          path: spec.template.spec.containers[0].command
+          content: --dcgm-enforce=1
+      - notContains:
+          path: spec.template.spec.containers[0].command
+          content: --dcgm-enforce=0
+      - notContains:
+          path: spec.template.spec.containers[0].command
+          content: --dcgm-enforce=yes
+      - notContains:
+          path: spec.template.spec.containers[0].command
+          content: --dcgm-enforce=no
+      - notContains:
+          path: spec.template.spec.containers[0].command
+          content: --dcgm-enforce=on
+      - notContains:
+          path: spec.template.spec.containers[0].command
+          content: --dcgm-enforce=off

Also applies to: 136-147

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/helm/charts/power-agent/tests/validate_enforce_test.yaml` around lines
54 - 95, The tests under validate_enforce_test.yaml currently only use
notFailedTemplate for many agent.dcgm.enforce string variants; update each case
(e.g., the tests setting agent.dcgm.enforce to
"1","0","yes","no","on","off","TRUE") to also assert the rendered templates
explicitly include or exclude the expected configuration snippets by adding
contains/notContains assertions for the corresponding normalized values (e.g.,
numeric 1/0, true/false, on/off forms) so silent drift is caught, and apply the
same hardening to the other block referenced (the ANY allowlisted values section
around lines 136-147) to cover all allowed renderable forms and the nvml
dead-config checks rather than relying solely on notFailedTemplate.

components/power_agent/tests/test_reconcile_wiring.py (1)

218-219: ⚡ Quick win

Move function-scope imports to module scope.

from tests.test_multi_pod_policy import _FakeMetrics is imported inside test methods; this should be hoisted to top-level imports.

Suggested patch

 import unittest
 from unittest.mock import MagicMock, patch
 
 import power_agent
 from power_agent import PowerAgent
+from tests.test_multi_pod_policy import _FakeMetrics
@@
-        from tests.test_multi_pod_policy import _FakeMetrics
-
         agent.metrics = _FakeMetrics()
@@
-        from tests.test_multi_pod_policy import _FakeMetrics
-
         agent.metrics = _FakeMetrics()
@@
-        from tests.test_multi_pod_policy import _FakeMetrics
-
         agent.metrics = _FakeMetrics()

As per coding guidelines: “ensure imports are only at module scope (no imports inside functions/classes)”.

Also applies to: 243-244, 269-270

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/power_agent/tests/test_reconcile_wiring.py` around lines 218 -
219, The test module imports `_FakeMetrics` inside test functions; hoist `from
tests.test_multi_pod_policy import _FakeMetrics` to the top-level imports in the
module (module scope) and remove the function-scope imports in the test methods
(including the other occurrences where `_FakeMetrics` is imported inside tests).
Ensure the module-level import is added near the other test imports so all tests
reference the same top-level `_FakeMetrics`.

components/power_agent/tests/test_managed_state_parser.py (1)

60-67: ⚡ Quick win

Hoist logging imports to module scope.

import logging is currently inside methods; move it to top-level imports for consistency with Python guidelines.

As per coding guidelines: “ensure imports are only at module scope (no imports inside functions/classes)”.

Also applies to: 114-121

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/power_agent/tests/test_managed_state_parser.py` around lines 60 -
67, Move the "import logging" statements out of the test functions and into
module-level imports in test_managed_state_parser.py; remove the in-function
imports at the locations that set up captured, handler, and logger (the blocks
that create captured: list[logging.LogRecord], handler = logging.Handler(),
handler.emit = captured.append, handler.setLevel(logging.WARNING), and logger =
logging.getLogger("power_agent")), so the tests use the top-level logging import
consistently (also apply the same change to the other block referenced in the
file).

components/power_agent/tests/e2e_actuator_parity.py (1)

497-500: ⚡ Quick win

Avoid silently swallowing NVML shutdown errors.

This except Exception: pass hides teardown faults and makes parity-run failures hard to diagnose.

Suggested patch

     finally:
         try:
             pynvml.nvmlShutdown()
-        except Exception:
-            pass
+        except Exception as e:
+            print(f"WARNING: pynvml.nvmlShutdown() failed: {e}", file=sys.stderr)

As per coding guidelines: “fail fast (don’t swallow exceptions, catch specific exceptions only, and if catching Exception then log and re-raise)”.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/power_agent/tests/e2e_actuator_parity.py` around lines 497 - 500,
The try/except that currently calls pynvml.nvmlShutdown() and swallows all
exceptions should be changed to catch the specific NVML exception type
(pynvml.NVMLError) instead of Exception, log the error with traceback, and
re-raise so teardown failures are visible; update the block around
pynvml.nvmlShutdown() (replace the bare except Exception: pass) to except
pynvml.NVMLError as e: logger.exception("pynvml.nvmlShutdown failed: %s", e) and
raise, and if a module-level logger isn’t present add logger =
logging.getLogger(__name__) and import logging.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@components/power_agent/actuator.py`:
- Around line 775-785: The except-block around the call to self._apply_cap_inner
should only catch DCGM write-related exceptions instead of all Exception;
replace "except Exception as e:" with a narrow catch such as "except DcgmError
as e" (or the specific DCGM client exception types your project uses, e.g.,
DcgmApiError / DcgmWriteError) and ensure those exception classes are imported,
keep the existing logger.error and metrics.inc for that case, and re-raise any
other unexpected exceptions so programming/runtime bugs are not swallowed by
apply_cap/_apply_cap_inner.
- Around line 539-540: The except block that swallows all exceptions after
calling _handle.Shutdown() must be changed to avoid hiding failures: catch only
expected exceptions or catch Exception but log the exception context before
continuing. Locate the try/except surrounding _handle.Shutdown() in actuator.py
(the block that currently reads "except Exception: pass") and replace the silent
swallow with a logged error using the module logger (include the exception
instance and stacktrace) or re-raise critical exceptions after logging; ensure
the log message mentions "_handle.Shutdown" so it’s easy to find in logs.
- Around line 618-622: The RuntimeError raised in DcgmActuator.list_running_pids
when a GPU UUID is no longer visible should preserve the original KeyError as
the root cause; change the raise so it uses exception chaining (raise
RuntimeError(...) from err) where err is the caught KeyError from the UUID-map
lookup, keeping the message unchanged but appending "from err" to preserve the
original traceback and context for uuid and gpu_idx.

In `@components/power_agent/power_agent.py`:
- Around line 169-177: The current check uses valid = {u for u in uuids if
isinstance(u, str)} which turns uuids into a set and conflates duplicates with
non-string entries; replace this by collecting only string entries without
deduping (e.g., valid_list = [u for u in uuids if isinstance(u, str)]) and
compute non_string_count = sum(1 for u in uuids if not isinstance(u, str)); then
change the logger.warning invocation (the one referencing _MANAGED_STATE_PATH)
to report non_string_count and the number of kept UUIDs (use len(valid_list) or
len(set(valid_list)) depending on whether you intend to report raw entries or
unique UUIDs) instead of comparing len(valid) to len(uuids).

In `@deploy/helm/charts/power-agent/README.md`:
- Around line 49-53: The README currently references a specific chart version
("chart v1.1.0 requires the v1.1.0 image"); update the wording to tie the
requirement to the chart's appVersion instead of a hard-coded chart version:
state that the image tag MUST match the chart's `appVersion`, and explain that
this is necessary because the chart renders the `--actuator` / `--dcgm-*` CLI
flags and `power_agent.py` imports the appVersion-specific `actuator` module;
replace the version-specific sentence with this appVersion-based guidance so
future chart bumps don't require text edits.

In `@deploy/helm/charts/power-agent/templates/_helpers.tpl`:
- Around line 117-119: The helper currently only checks for leading/trailing
whitespace on $tag but allows internal whitespace (e.g., "v1 .1.0") which
produces invalid OCI image references; update the validation in the template
that handles image.tag (the block using $tag and the fail message) to reject any
whitespace characters anywhere in $tag (not just trim differences) by testing
for matches against a whitespace pattern and calling fail with the existing
informative message when any whitespace is found; keep the same fail text and
hint but trigger it when $tag contains internal spaces or other whitespace.

In `@docs/design-docs/power-agent-helm-chart-plan.md`:
- Line 828: The Markdown has unlabeled fenced code blocks that trigger MD040;
update each triple-backtick block in power-agent-helm-chart-plan.md to include a
language identifier (e.g., change ``` to ```text or to a specific language such
as ```yaml, ```bash, or ```json as appropriate) so the linter recognizes the
block language; apply the same change to the other unlabeled fences mentioned in
the comment to resolve MD040.
- Line 29: The heading "### Revision history" is skipping a level; change that
heading to "## Revision history" (or add a parent "##" section before it) so the
document preserves correct heading hierarchy and satisfies MD001; update the
line containing "### Revision history" in
docs/design-docs/power-agent-helm-chart-plan.md accordingly.
- Around line 826-846: Update the stale v1.1.0 summary counts in sections §5.2
and §8 to match the v1.2.0 source-of-truth numbers used elsewhere in this
document: replace "13 files / 2 helm-unittests / 24 passed" with "14 files / 3
helm-unittests / 46 passed" (and any adjacent totals like "Chart-only total" or
LOC if they differ), and make the identical corrections where the same summary
appears around lines 1151-1160 so all references are consistent.

---

Outside diff comments:
In `@components/power_agent/tests/test_managed_state_parser.py`:
- Around line 85-275: Import pytest at the top of the test module and add a
module-level pytestmark list that applies the required markers (scheduling + GPU
+ type) to all tests, e.g. place `import pytest` and `pytestmark =
[pytest.mark.scheduling, pytest.mark.gpu, pytest.mark.unit]` (or whichever type
marker is appropriate) near the top so TestHappyPath, TestMissingFile,
TestOsErrorSiblings, TestMalformedJson, TestNonObjectRoot,
TestManagedUuidsNotAList, and TestEntryTypeValidation all inherit the markers.

In `@components/power_agent/tests/test_orphan_recovery.py`:
- Around line 85-286: The module lacks required pytest markers (scheduling + GPU
+ type); add a module-level pytestmark so every test in this file gets those
markers: import pytest at top of the file and define pytestmark =
[pytest.mark.<scheduling_marker>, pytest.mark.<gpu_marker>,
pytest.mark.<type_marker>] (replace placeholders with the project's chosen
markers), which will apply to the test classes such as TestUuidGating,
TestWorkloadBusySkip, TestCurrentVsDefaultGuard, TestPerGpuExceptionIsolation,
and TestManagedSetPruning.

In `@components/power_agent/tests/test_reconcile_wiring.py`:
- Around line 46-326: Add the required pytest markers by importing pytest and
applying scheduling, GPU, and type markers to this test module; either set a
module-level pytestmark list or add decorators to each test class (e.g.
TestReconcileGpuRoutesViaActuator, TestReconcileGpuDedupesByPodUid,
TestReconcileGpuPolicyResolution) to include at least one scheduling marker, one
GPU marker, and one type marker per the guidelines so all tests in this file are
properly marked.

In `@components/power_agent/tests/test_shutdown.py`:
- Around line 46-229: Add the required pytest classification markers by defining
a module-level pytestmark list (e.g. pytestmark = [pytest.mark.scheduling,
pytest.mark.gpu, pytest.mark.type]) so every test in this module (including
TestSigtermViaActuator and TestSigtermFallback and their methods like
_handle_sigterm interactions) carries scheduling, GPU and type markers; place
the pytestmark definition near the top of the file (above the test classes) and
import pytest if not already imported.

---

Nitpick comments:
In `@components/power_agent/tests/e2e_actuator_parity.py`:
- Around line 497-500: The try/except that currently calls pynvml.nvmlShutdown()
and swallows all exceptions should be changed to catch the specific NVML
exception type (pynvml.NVMLError) instead of Exception, log the error with
traceback, and re-raise so teardown failures are visible; update the block
around pynvml.nvmlShutdown() (replace the bare except Exception: pass) to except
pynvml.NVMLError as e: logger.exception("pynvml.nvmlShutdown failed: %s", e) and
raise, and if a module-level logger isn’t present add logger =
logging.getLogger(__name__) and import logging.

In `@components/power_agent/tests/test_managed_state_parser.py`:
- Around line 60-67: Move the "import logging" statements out of the test
functions and into module-level imports in test_managed_state_parser.py; remove
the in-function imports at the locations that set up captured, handler, and
logger (the blocks that create captured: list[logging.LogRecord], handler =
logging.Handler(), handler.emit = captured.append,
handler.setLevel(logging.WARNING), and logger =
logging.getLogger("power_agent")), so the tests use the top-level logging import
consistently (also apply the same change to the other block referenced in the
file).

In `@components/power_agent/tests/test_reconcile_wiring.py`:
- Around line 218-219: The test module imports `_FakeMetrics` inside test
functions; hoist `from tests.test_multi_pod_policy import _FakeMetrics` to the
top-level imports in the module (module scope) and remove the function-scope
imports in the test methods (including the other occurrences where
`_FakeMetrics` is imported inside tests). Ensure the module-level import is
added near the other test imports so all tests reference the same top-level
`_FakeMetrics`.

In `@deploy/helm/charts/power-agent/tests/validate_enforce_test.yaml`:
- Around line 54-95: The tests under validate_enforce_test.yaml currently only
use notFailedTemplate for many agent.dcgm.enforce string variants; update each
case (e.g., the tests setting agent.dcgm.enforce to
"1","0","yes","no","on","off","TRUE") to also assert the rendered templates
explicitly include or exclude the expected configuration snippets by adding
contains/notContains assertions for the corresponding normalized values (e.g.,
numeric 1/0, true/false, on/off forms) so silent drift is caught, and apply the
same hardening to the other block referenced (the ANY allowlisted values section
around lines 136-147) to cover all allowed renderable forms and the nvml
dead-config checks rather than relying solely on notFailedTemplate.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4b586c90-dca2-4d5f-aaae-4f35d04c75ef

📥 Commits

Reviewing files that changed from the base of the PR and between 29f0831 and abcc922.

📒 Files selected for processing (26)

components/power_agent/Dockerfile
components/power_agent/README.md
components/power_agent/actuator.py
components/power_agent/logger.py
components/power_agent/power_agent.py
components/power_agent/tests/e2e_actuator_parity.py
components/power_agent/tests/test_actuator_protocol.py
components/power_agent/tests/test_actuator_selection.py
components/power_agent/tests/test_dcgm_actuator.py
components/power_agent/tests/test_managed_state_parser.py
components/power_agent/tests/test_multi_pod_policy.py
components/power_agent/tests/test_orphan_recovery.py
components/power_agent/tests/test_reconcile_wiring.py
components/power_agent/tests/test_shutdown.py
deploy/helm/charts/power-agent/Chart.yaml
deploy/helm/charts/power-agent/README.md
deploy/helm/charts/power-agent/templates/NOTES.txt
deploy/helm/charts/power-agent/templates/_helpers.tpl
deploy/helm/charts/power-agent/templates/daemonset.yaml
deploy/helm/charts/power-agent/templates/dev-pod.yaml
deploy/helm/charts/power-agent/tests/validate_actuator_test.yaml
deploy/helm/charts/power-agent/tests/validate_enforce_test.yaml
deploy/helm/charts/power-agent/tests/validate_image_tag_test.yaml
deploy/helm/charts/power-agent/values.yaml
docs/design-docs/power-agent-dual-actuator.md
docs/design-docs/power-agent-helm-chart-plan.md

…dation Chart counterpart to the Actuator Protocol introduced in the previous commit. Bumps Chart.yaml 1.0.0 -> 1.1.0 (minor: new opt-in DCGM actuator on top of byte-identical NVML default). values.yaml additions: - agent.actuator: nvml | dcgm (default nvml - pre-PR1b behaviour byte-for-byte). No auto-detect: the operator declares which actuator their cluster runs based on whether their GPU Operator has dcgm.enabled=true, and the chart renders one DaemonSet with one actuator locked at startup. Eliminates "NVML and DCGM running simultaneously" by construction. - agent.dcgm.host (default nvidia-dcgm.gpu-operator.svc.cluster.local - the GPU Operator's standard hostengine Service DNS name) - agent.dcgm.port (default 5555 - upstream nvidia-dcgm hostPort) - agent.dcgm.enforce (default false - set-and-forget matches NVML semantics; flip to true to register cap as DCGM's target configuration for auto-reapply after GPU reset/reinit) daemonset.yaml: - --actuator=$(agent.actuator) on every DaemonSet command line - {{- if eq .Values.agent.actuator "dcgm" }} block injects --dcgm-host / --dcgm-port / --dcgm-enforce. NVML mode never sees these flags (no dead args). - Wires four template-time validators at the top of the file: validateImageTag, validateMutex, validateActuator, validateEnforce. dev-pod.yaml: - ConfigMap script mount now expects BOTH power_agent.py AND actuator.py (chart v1.1.0+ requirement - a v1.0.0-style ConfigMap with only power_agent.py will ImportError "No module named 'actuator'" at pod start). README + values.yaml comments + NOTES.txt kubectl-create-cm snippets all updated to reflect the two-file layout. - Command line plumbs the same --actuator / --dcgm-* flags as the production DaemonSet. Pre-v1.7 dev pods silently defaulted to NVML regardless of agent.actuator=dcgm (the values were read but never passed to argparse). _helpers.tpl: - validateActuator: strict allowlist "nvml" or "dcgm" (rejects "NVML", "DCGM", "auto" - the v1.2 reversal - and empty/typo inputs). Errors at helm install time, not pod start. - validateEnforce: mirrors power_agent._parse_bool_strict's allowlist (true/1/yes/on/false/0/no/off case-insensitive). Rejects --set agent.dcgm.enforce=treu at template time with a clear errorPattern instead of letting argparse fail at pod start with a less obvious CrashLoopBackOff trace. tests/ (new directory - helm-unittest suite): - validate_actuator_test.yaml: 11 tests covering the happy nvml/dcgm paths, DCGM flag injection (and exclusion under NVML), and the four typo classes (NVML/DCGM uppercase, auto, unknown, empty). - validate_enforce_test.yaml: 13 tests covering all eight _parse_bool_strict allowlist values, uppercase normalization, the canonical "treu" typo class, and the actuator=nvml + bad enforce cross-validator interaction (no validation fires when the flag is dead). - Assertion style: `contains` / `notContains` against the rendered command: YAML list (NOT matchRegex, which helm-unittest 1.1.0 only supports on string paths). Each `contains` check pins the exact rendered flag string, stricter than the regex form and catches flag-value drift. NOTES.txt: - Adds dcgm_enforce_failures_total to the Prometheus metrics listing with semantic disambiguation from apply_failures_total. README updates: - Compact values-table rows for dev.scriptConfigMap (now two-file) and dev.image (DCGM dev-mode override pointer). - DCGM dev-mode prerequisite block: explicit --set dev.image.repository=nvcr.io/nvidia/dynamo/power-agent --set dev.image.tag=v1.1.0 override because the default vllm-runtime:1.0.1 ships pynvml only (no pydcgm / libdcgm.so). Script iteration via the ConfigMap mount still works against the power-agent image because the /scripts mount overrides /app at runtime. - Install-example image.tag pinning bumped 1.0.0 -> 1.1.0 to match Chart.yaml's appVersion (a v1.0.0 image lacks actuator.py and rejects the new --actuator / --dcgm-* flags). Signed-off-by: Kai Ma <kaim@nvidia.com>

Aligns docs/design-docs/power-agent-helm-chart-plan.md with the v1.1.0 chart that PR #9790 ships. The plan was authored for the NVML-only v1.0.0 chart in PR #9682 and went stale once PR #9790 layered agent.actuator + agent.dcgm.* + validateActuator + validateEnforce + the helm-unittest suite on top. Two changelog rows capture the refresh: v1.3 - first pass: 8 reviewer findings closed (3 blocking, 4 major, 2 medium, 1 low) on Status header, values surface, dev ConfigMap recipe, image-tag pinning, helm-unittest gating, internal-dev-doc references, daemonset name, file/LOC accounting. v1.4 - second pass: 5 follow-up findings closed (2 major, 2 medium, 1 low) on Status self-contradiction, stale section 4.2 dev-block comment, overstated helm-unittest coverage prose, unrunnable section 5.4 positive helm template overlays (missing --set image.tag), and the lingering filename reference in the v1.3 changelog. No design reversal: every section 6 decision and the chart shape are unchanged. Only the values surface (extended), helper set (extended with two template-time validators), and validation-gate list (helm-unittest now required) grew. Every claim in the refreshed doc was verified against on-disk state: power_agent.py:706-804 for the 8-flag CLI surface, values.yaml for the dev-block recipe, _helpers.tpl for the five-helper set, and the two tests/validate_*_test.yaml files for the 24 enumerated unittest cases. Signed-off-by: Kai Ma <kaim@nvidia.com>

Addresses five PR9682 CodeRabbit review comments on the Python runtime. Pure hardening: no behaviour change on the happy path, every fix narrows a previously silent / fragile failure mode. power_agent.py - _load_previously_managed_gpus is no longer best-effort. The pre-fix try/except caught only FileNotFoundError + JSONDecodeError and used .get() on whatever json.load returned, so the parser could (a) propagate OSError siblings (PermissionError on a stale host-volume bind, EIO on flaky NFS) up the stack and crash the agent at startup, (b) AttributeError on a non-dict JSON root, or (c) silently iterate characters / TypeError on a non-list managed_uuids value. Replaced with a layered parser that catches OSError, validates the root is a dict, validates managed_uuids is a list, and coerces the entry-is-str type guard at the boundary so downstream "uuid in _previously_managed" checks can't compare bytes/ints against str UUIDs and silently never match. Every malformation logs at WARNING and returns empty rather than raising; orphan recovery silently skips this startup, which is strictly better than CrashLoopBackOff with no caps actuated. - _reconcile_gpu now deduplicates pod_annotations by pod UID before handing them to _resolve_cap_for_gpu. Pre-fix one pod with N GPU processes (the common TP/PP/EP topology, ranks-per-GPU, helper workers, profilers) appeared as N rows; _resolve_cap_for_gpu uses len(pod_annotations) > 1 to detect the multi-pod-per-GPU misconfig, so a one-pod / two-PID GPU spuriously fired the "N pods all agree" WARNING and incorrectly bumped the multi_pod_gpu_total agree label. On a pod whose annotation was missing or invalid, the same code path also took the conflict-resolution branch and applied safe_default even though the operator's intent was unambiguous. - _handle_sigterm no longer swallows actuator.shutdown / nvmlShutdown failures silently. Pre-fix "except Exception: pass" made DCGM hostengine faults and NVML teardown errors at SIGTERM time invisible in pod logs - operators had no way to correlate a failed restart with a stuck hostengine. logger.exception writes the full traceback at ERROR while still proceeding to _shutdown.set() so the run loop unblocks (cleanup is best-effort, the container MUST exit promptly on SIGTERM). - argparse hoisted from inside main() to the module-top imports alongside the other stdlib modules (CodeRabbit review-body nit). actuator.py - DcgmActuator.shutdown applied the same logger.exception treatment to its three silent broad catches (group.Delete, handle.Shutdown, pynvml.nvmlShutdown). Group failures name the gpu_id, hostengine shutdown failures name host:port, NVML teardown names the source - so a DCGM resource leak is no longer invisible. Same non-reraise contract: cleanup is best-effort, callers (power_agent._handle_sigterm) must proceed to _shutdown.set(). Tests added / extended (+29 new cases; total 162 passed): - tests/test_managed_state_parser.py (new, 20 cases): happy path, missing file silent, OSError siblings (PermissionError, IsADirectoryError, NotADirectoryError, EIO), malformed JSON, non-dict root (list/string/int/null), non-list managed_uuids (string/int/dict), mixed-type entries (drop count + WARNING fires once). - tests/test_reconcile_wiring.py::TestReconcileGpuDedupesByPodUid (4 new cases): one-pod/multi-PID counts once, doesn't bump multi-pod counter; two-pods/multi-PID-each genuinely counts as two; two disagreeing pods each multi-PID still resolves to safe_default with conflict counter ticking exactly once. - tests/test_shutdown.py (2 new cases): assertLogs ERROR fires with the original exception message when actuator.shutdown raises; same for the defensive raw-NVML fallback when pynvml.nvmlShutdown raises. - tests/test_dcgm_actuator.py (3 new cases): group.Delete failure logs gpu_id; handle.Shutdown failure logs host:port; nvmlShutdown failure logs the source. Signed-off-by: Kai Ma <kaim@nvidia.com>

@sha256

…ical OCI form Addresses three iterative PR9682 CodeRabbit review findings on the chart's image-pinning surface. Bumps Chart.yaml 1.1.0 -> 1.2.0 (minor per SemVer - image.digest is purely additive; existing tag-pinned installs render byte-identically and need no change). Round 1 (original PR9682 finding): image.tag=latest was rejected only at the raw-manifest layer, not at the chart layer. helm template --set image.tag=latest happily rendered nvcr.io/nvidia/dynamo/ power-agent:latest because validateImageTag only checked "if not .Values.image.tag". Now case-insensitively rejects the literal "latest" (LATEST, Latest, " latest " all fail) while still accepting release tags that contain the substring (v1.0.0-latest-rc). Round 2 (digest path rendered an invalid OCI reference): operators were advised in values.yaml + README + tests to set image.tag= sha256:abc... for digest pinning. This rendered repo:sha256:... - INVALID per the OCI spec, which requires repo@sha256:... for digest form (the second ":" in the rendered string parses as repo + tag "sha256" with a stray suffix; kubelet manifest pulls fail with an opaque error). Three changes close this: - values.yaml: add image.digest field with per-rule comment block explaining the OCI form, the 64-hex requirement, the mutex with image.tag, and the PR #9682 rationale. - _helpers.tpl: split validateImageTag into a full rule set (XOR(tag, digest), whitespace rejection on both, latest rejection on tag, sha256: prefix rejection on tag, exact- 64-hex regex on digest) and add a new imageRef helper that emits {repo}@{digest} when digest is set, else {repo}:{tag}. - daemonset.yaml: image: line now routes through imageRef instead of hard-coding the ":" join, so the canonical OCI form is the only thing that can ship. Round 3 (validator was too lenient): - Truncated SHA-256 digests slipped through: regex was ^sha256:[0-9a-fA-F]{32,}$, so half-digests (a common pattern when an operator copies the first 8 bytes from `docker images --digests` output) rendered as repo@sha256:<truncated>. The kubelet then fails the pull with an opaque manifest-mismatch error. Tightened to {64} exactly (SHA-256 is 32 bytes x 2 nybbles). - Whitespace-padded values rendered raw: --set-string 'image.tag= v1.1.0 ' produced image: "repo: v1.1.0 " because the latest comparison trimmed but the renderer used the raw value. Choice: reject (don't silently normalize) so the operator sees their --set quoting typo. Applies symmetrically to image.digest. tests/validate_image_tag_test.yaml (new, 22 cases - chart helm-unittest count grew 24 -> 46): - Negative: unset (both empty), latest in 4 case variants, whitespace on tag (leading/trailing/both), tag+digest mutex, sha256: prefix on tag, non-digest on digest, 32/63/65-hex digest (off-by-one + half-length), whitespace on digest. - Positive: pinned release tag renders {repo}:{tag}; v1.0.0-latest-rc accepted (substring match guard); 64-hex digest renders the canonical {repo}@sha256:... form; uppercase-hex digest preserved verbatim. README updates: production-install section now shows both tag and digest invocations side by side with a callout that digests live on image.digest, not image.tag. Values table gains the image.digest row. Troubleshooting section adds three new error-message entries covering missing-pin, invalid-digest, and mutex-violation cases. Signed-off-by: Kai Ma <kaim@nvidia.com>

Aligns docs/design-docs/power-agent-helm-chart-plan.md with the v1.2.0 chart that the preceding commit ships. The plan was authored for the v1.0.0 NVML-only chart in PR #9682, refreshed to v1.1.0 in the PR #9790 dual-actuator work, and went stale again once v1.2.0 landed image.digest + the canonical-OCI imageRef helper on top. Four reviewer findings closed (v1.5 changelog row covers all four): (1) Truncated SHA-256 digests slipped through the v1.1.0 helper - already fixed in the preceding chart commit; this commit propagates the rule into the §4.4 helper snippet and the §5.4 validation gates. (2) Whitespace-padded tags rendered raw image references - same treatment: §4.4 documents the trim-then-reject contract. (3) §4.3 / §4.4 / §5.4 stale on image.digest: - §4.3 CodeRabbit-comment row rewritten to describe both helpers (validateImageTag + imageRef), both fields (tag + digest), the canonical repo@digest form, and the PR9682 follow-up that added the separate field. - §4.4 helper snippet replaced with the actual v1.2.0 validator (full rule set + per-rule rationale comments) plus the new imageRef helper. - §5.4 expected helm-unittest output bumped 24 -> 46 passed with a one-line breakdown of the +22 new cases. (4) §4.1 / §4.2 stale on file count + helper list + values surface: - §4.1 said 13 files (omitted .helmignore and the new validate_image_tag_test.yaml). Corrected to 14 files (7 templates + 3 helm-unittests + 4 root files). The _helpers.tpl helper-list line gained imageRef and chart entries. The daemonset.yaml annotation now mentions it routes through imageRef. LOC estimate bumped ~1,430 -> ~1,700 with a breakdown of what v1.2.0 added on top of v1.0.0 / v1.1.0. - §4.2 values snippet had only image.tag with no image.digest. Added the field with the same per-rule comment block that ships in values.yaml (OCI form, 64-hex requirement, mutex with image.tag, PR9682 rationale). Status header bumped to chart v1.2.0 with a one-line v1.2.0 rationale (additive opt-in field, no breaking change to existing tag-pinned installs). Revision-history table gains the v1.5 entry summarising all four findings; older v1.1 - v1.4 entries untouched. No design reversal: every §6 decision and the chart shape are unchanged. Only the values surface (one additive field), helper set (two helpers: imageRef new, validateImageTag rewritten), and helm-unittest count (24 -> 46) grew. Every claim in the refreshed doc was verified against on-disk state: Chart.yaml for the version bump, values.yaml for the image.digest field comment, _helpers.tpl for both helpers, tests/validate_image_tag_test.yaml for the 22 enumerated cases (verified via `helm unittest`). Signed-off-by: Kai Ma <kaim@nvidia.com>

…ttributes (v1.10 e2e fix) First in-cluster DCGM parity run on an 8xA100 SXM node (PR #9790 e2e) surfaced three production defects in DcgmActuator, all rooted in pydcgm API misuse that the mocked unit suite couldn't catch: 1. `get_uuid` and `_ensure_identity_map._read_dcgm_uuids` read `DCGM_FI_DEV_UUID` via `dcgmEntityGetLatestValues`. That API returns the field cache; on a fresh nv-hostengine 4.5.3 with no companion watcher subscribed via `dcgmWatchFields`, the cache returns `DCGM_STR_BLANK = "<<<NULL>>>"`. The identity map then raised "8 GPU UUID(s) visible to DCGM are not visible to NVML" on every cluster startup. 2. `constraints_w` / `current_w` / `default_w` read the four `DCGM_FI_DEV_POWER_MGMT_LIMIT{,_MIN,_MAX,_DEF}` fields the same way. Every read returned `DCGM_FP64_BLANK = 2^47`, so the clamp in `apply_cap` escalated a requested 250 W up to the blank max and tried to write 140737488355328 W. 3. The workload-power-profile blanking loop in `_apply_cap_inner` reached for `dcgm_structs.DCGM_INT32_BLANK`. The constant actually lives in `dcgmvalue` (`/shared/pydcgm/dcgmvalue.py:17` in the DCGM 4.5.3 apt bindings); pre-v1.10 the first cap write raised `AttributeError: module 'dcgm_structs' has no attribute 'DCGM_INT32_BLANK'`. Fix: - Add a single `_power_limits(gpu_idx)` helper that calls `DcgmSystem.discovery.GetGpuAttributes(gpu_id).powerLimits` - the synchronous device-info API that wraps `dcgmGetDeviceAttributes`. One RPC instead of four, returns a `c_dcgmDevicePowerLimits_v1` with integer-watt fields populated from the hostengine's discovery state, no field-cache dependency. - `constraints_w`, `current_w`, `default_w`, `restore_default` all route through this helper. - `get_uuid` and `_ensure_identity_map._read_dcgm_uuids` now use `GetGpuAttributes(gpu_id).identifiers.uuid`. - `apply_cap` imports `dcgmvalue` and swaps all seven `DCGM_INT32_BLANK` references onto it. Test infrastructure overhaul (`test_dcgm_actuator.py`): - `_make_dcgm_modules` adds a `dcgmvalue` MagicMock with all four blank sentinels; removes `DCGM_INT32_BLANK` from `dcgm_structs` mock. - `_make_gpu_attrs` extended to carry both `.identifiers.uuid` and `.powerLimits.{cur,default,enforced,min,max}PowerLimit`. - `_wire_handle` wires `GetGpuAttributes(gid)` per-gpu_id. - `_seed_constraints_and_uuid` consolidated onto the unified GetGpuAttributes path. - Five `modules["dcgm_agent"].dcgmEntityGetLatestValues.assert_not_called()` regression guards added across the suite - locks in "no more silent field-cache reads of static device info." E2e re-validation: with this fix in place, the same 8xA100 rig reports `PASS: NvmlActuator and DcgmActuator agree on all probes (tolerance +/- 2.0 W)`, exit code 0, with nvidia-smi confirming `apply_cap(250)` -> 250 W and `restore_default()` -> 400 W on every GPU. Doc: `docs/design-docs/power-agent-dual-actuator.md` v1.10 changelog row added; §6.1 / §6.3 / §7 / §12 / §13 patched to match (Protocol-level docstrings for `current_w`/`default_w`, `get_uuid` / `_power_limits` / `restore_default` code samples, `_apply_cap_inner` import + blank-constant sample, attribution table row 2, summary clause, reference-material entries for `pydcgm/dcgmvalue.py` and `GetGpuAttributes`). Local gates: 163/163 power-agent unit tests pass; pre-commit isort/black/flake8/codespell/ruff/EOF/whitespace clean on the three changed files. Broader `tests/` + `pytest-marker-report` blocked by the known Windows-only `fcntl` import in `tests/conftest.py:22`; CI runs those on Linux. Signed-off-by: Kai Ma <kaim@nvidia.com>

Three Dockerfile bugs combined to make the DCGM-mode image unbuildable on a fresh checkout. Fixing any one in isolation leaves the build broken, so they travel together: 1. DCGM_IMAGE default 'nvcr.io/nvidia/cloud-native/dcgm:4.2.3-2-ubuntu22.04' does not exist on NGC (verified 2026-05-21 via 'docker manifest inspect' → 404). Bump to 4.5.1-1-ubuntu22.04, the only resolvable 4.x tag. 2. DCGM 4.5+ relocated python bindings from /usr/local/dcgm/bindings/python3/ to /usr/share/datacenter-gpu-manager-4/bindings/python3/. The previous COPY would silently copy zero files under the new pin. Switch the source path to the 4.5+ location. 3. NGC's DCGM 4.5+ runtime image ships pydcgm with DcgmGroup.py:20 doing 'import logger' — but logger.py lives in DCGM's source tree under testing/python3/ and is NOT packaged. Without a shim every DcgmGroup construction raises ModuleNotFoundError. Add a 10-line stdlib-logging adapter at components/power_agent/logger.py and COPY it into /opt/dcgm/python/logger.py during the runtime stage. This unblocks 'docker build -f components/power_agent/Dockerfile' on a fresh clone (verified locally via 'docker buildx build --build-arg DCGM_IMAGE=...4.5.1-1-ubuntu22.04' against viking-prod-216 on 2026-05-21, image pushed to ttl.sh/dynamo-pa-kaim-dcgm45-v2:24h and used by the Path-B live test on aks-a100b-22138447-vmss000000). Refs: PR #9790 review, Power Agent live-test findings #1/#2/#6. Signed-off-by: Kai Ma <kaim@nvidia.com>

components/power_agent/actuator.py:807 uses dcgm_structs.c_dcgmDeviceConfig_v2, a struct that only exists in DCGM 4.x. With 3.x bindings the agent's DcgmActuator.init() succeeded, opened the hostengine connection, and ran NVML init cleanly — then crashed mid-first-reconcile with 'AttributeError: module dcgm_structs has no attribute c_dcgmDeviceConfig_v2' after some GPUs were already capped. The SIGTERM-restore path won't run when the actuator never finished registering with _active_actuator, so the GPUs are left at custom caps. Add a 7-line hasattr check immediately after the dcgm_structs import in init() that raises a RuntimeError with: - the missing struct name (so the error is grep-able) - the required DCGM major (>=4.0) - the DCGM_IMAGE build-arg + canonical 4.5.1 tag (so operators know exactly what to bump) Guards against accidental Dockerfile regressions to a 3.x base image. Pair with the Dockerfile fix in 4820ca7 (which bumps the default to 4.5.1-1-ubuntu22.04) so the default build path is consistent with the runtime contract. Unit test covers both happy path (DCGM 4.x mock) and 3.x rejection (MagicMock with c_dcgmDeviceConfig_v2 deleted), asserting init fails BEFORE the first pydcgm.DcgmHandle call so misconfigured deployments don't half-init and leave hostengine sockets dangling. Refs: PR #9790 review, Power Agent live-test finding #5. Signed-off-by: Kai Ma <kaim@nvidia.com>

Verified each of the 9 fresh CodeRabbit findings on PR #9790 against the actual code (per .cursor/rules/review-feedback.mdc) and accepted all nine. Bundled fix in one commit on top of the previous PR9790 HEAD; no history rewrite. Code (components/power_agent/): * actuator.py:539 — replace silent `except Exception: pass` around the stale-handle Shutdown with `logger.exception(...)` + a comment that reconnect proceeds intentionally on cleanup failure. Continue rather than re-raise: failing reconnect because cleanup of the already-dead handle failed would be strictly worse than logging and moving on. * actuator.py:622 — chain RuntimeError off the originating KeyError via `raise ... from err` so the UUID-map lookup failure stays in the traceback (PEP 3134 / Ruff B904). * actuator.py:777 — narrow `except Exception` in `apply_cap` to `except dcgm_structs.DCGMError`. The broad catch masked binding / programming defects (e.g. the v1.10 `dcgmvalue.DCGM_INT32_BLANK` AttributeError documented in `_apply_cap_inner`) as normal apply failures, which both silenced the bug AND incorrectly ticked `apply_failures_total` — a metric reserved for "the DCGM write itself failed". Lazy-import `dcgm_structs` AFTER the metrics-required check so the constructor path stays runnable on hosts without the DCGM bindings (verified by `test_apply_cap_requires_metrics`). * power_agent.py:170 — compute `invalid_count` separately instead of `len(set) vs len(list)`. The set comprehension deduplicates, so the previous comparison would log "1 non-string entry" when uuids was `["a","a","b"]` (zero invalid entries, just a duplicate). Helm (deploy/helm/charts/power-agent/): * templates/_helpers.tpl:117/129 — replace `ne $tag (trim $tag)` (and same for $digest) with `regexMatch "\\s" ...`. The trim check only caught leading/trailing whitespace; an internal-whitespace tag like `v1 .1.0` (copy/paste typo) slipped through and rendered an invalid OCI image reference, deferring failure from template time to kubelet pull time. CodeRabbit only flagged the tag side, but the same flaw was symmetric on the digest helper, so fix both for parity. * tests/validate_image_tag_test.yaml — update existing "leading/trailing whitespace" assertions to "contains whitespace" (the new helper message), and add four new cases: image.tag with internal whitespace, image.tag with embedded tab, image.digest with internal whitespace, and the existing " latest " ordering pin. Symmetric coverage across both fields. * README.md:49 — rewrite the "image tag must match appVersion" paragraph to anchor on `helm show chart … | grep appVersion` rather than hardcoding "chart v1.1.0 requires the v1.1.0 image". The chart is on v1.2.0 with appVersion 1.1.0; the old wording was already stale and would re-stale on every image bump. Docs (docs/design-docs/power-agent-helm-chart-plan.md): * §0 line 29 — promote `### Revision history` to `## Revision history` so the heading hierarchy goes h1 → h2 (was h1 → h3, MD001). * §5.2 — bump "13 files / 2 helm-unittests" → "14 files / 3 helm-unittests", add the v1.2.0 `validate_image_tag_test.yaml` to the file list, replace the brittle LOC totals with a one-line `wc -l` recipe so the doc stops drifting on every cycle, and tag the three unlabeled fences as ```text``` (MD040). * §8 — bump the "Author all 13 chart files" checklist item to 14 and update the helm-unittest expected count from `24 passed` (v1.1.0) to `46 passed` (v1.2.0), with a note to rerun rather than hardcode. Tests added: * test_apply_cap_propagates_non_dcgm_exception — regression guard for the narrowed apply_cap except: an AttributeError injected on `group.config.Set` MUST propagate, MUST NOT tick `apply_failures_total`, and MUST NOT touch the `_managed_gpu_indices` / `applied_limit_watts` bookkeeping (those run only on the success path). * (Existing `test_apply_cap_dcgm_generic_error_bumps_failure_metric_and_returns` already exercises the "DCGMError still caught" case; both assertions now bracket the narrowed exception contract.) Verification: * pytest components/power_agent/tests/ -q → 170 passed * pre-commit run --files <touched 7 files> → all hooks pass (pytest-marker-report skipped — Windows fcntl issue, runs in Linux CI). * Linter (ReadLints) → no errors on any of the 7 touched files. * The DCGM `import dcgm_structs` is local to `apply_cap` and scoped after the metrics-required check, so `test_apply_cap_requires_metrics` still raises RuntimeError (not ModuleNotFoundError) on hosts without the bindings. Refs: PR #9790 CodeRabbit re-review (2026-05-25). Signed-off-by: Kai Ma <kaim@nvidia.com>

Signed-off-by: Kai Ma <kaim@nvidia.com>

Three high/medium findings from the PR9790 Codex adversarial review, all confirmed against current code (PR9682/PR9683 fixed adjacent issues but did not touch these). power_agent.py - _handle_sigterm now prunes _previously_managed after each successful restore_default and persists the pruned set before shutdown. Pre-fix the SIGTERM handler restored each GPU to default but never updated managed_gpus.json, so on the next startup orphan recovery could clobber a cap applied by another workflow (different DGD, manual nvidia-smi -pl, vendor firmware) on a now-idle GPU still recorded as agent-owned. The actuator branch uses actuator.get_uuid(gpu_idx); the raw-NVML fallback branch uses _nvml_uuid(handle). Failed restores skip the prune (cap may still be live; next-startup orphan recovery is our only chance to reset it). UUID-lookup and persist failures log a WARNING but never crash the SIGTERM handler. - _resolve_cap_for_gpu rewritten so multi-pod GPUs require ALL pods to carry a parseable integer annotation, not just the non-None subset. Pre-fix the resolver filtered None BEFORE computing the agree-set, so a GPU shared by pod A (cap 480) and pod B (no annotation) silently resolved to 480 - pod A's annotation governed pod B's GPU usage, the exact cross-workload policy failure the multi-pod guard exists to contain. Any missing or invalid annotation in the multi-pod case now resolves to safe_default_watts and bumps the conflict counter. Single-pod paths unchanged. - _list_pods_on_node propagates Kubernetes API errors instead of returning []. Pre-fix \"except Exception: return []\" made a transient apiserver outage, RBAC regression, or network blip indistinguishable from a genuinely empty node, silently dropping enforcement. reconcile_once now catches and skips the cycle, leaving previously- applied caps live. New Prometheus counter dynamo_power_agent_k8s_list_failures_total lets operators alert on \">0 over 5m\". Policy choice: skip-cycle (preserve last cap) rather than fail-closed to safe_default; transient hiccups should not cap-down healthy workloads. Tests (+12 new, 1 flipped; 190 total passing): - test_shutdown.py::TestSigtermPrunesManagedGpusState (5 tests): successful prune persists; failed restore retains UUID; UUID-lookup failure is logged and non-fatal; raw-NVML fallback prunes via _nvml_uuid; persist failure does not hang the agent. - test_multi_pod_policy.py: flipped test_mixed_none_and_valid (was asserting buggy 480 inheritance; now asserts SAFE_DEFAULT + conflict); added test_multipod_invalid_annotation_on_one_pod_is_conflict and test_multipod_all_missing_is_conflict. - test_reconcile_wiring.py: TestListPodsOnNodePropagatesErrors (3 tests, namespaced + cluster-wide + happy path); TestReconcileOnceK8sListFailure (2 tests, skip-cycle on failure + per-GPU run on success). Signed-off-by: Kai Ma <kaim@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Rebase follow-up: dev-mode examples, helm-unittest fixtures, and design docs still referenced nvcr.io/nvidia/dynamo/power-agent after #9682 standardised on nvcr.io/nvidia/ai-dynamo/power-agent. Signed-off-by: Kai Ma <kaim@nvidia.com>

kaim-eng · 2026-06-08T16:07:01Z

Rebased onto current `pr1a/power-agent`

This branch was 186 commits stale relative to main via its pr1a base. I've rebased the 13 DCGM-only commits (dc4cc53d63 … 464387cd23) onto the current pr1a/power-agent tip (6cd82e8be1), resolving conflicts in:

components/power_agent/Dockerfile — merged pr1a's DOCKER_PROXY CI wiring with the multi-stage DCGM vendor build (pinned 4.5.1-1-ubuntu22.04, logger.py shim)
docs/design-docs/power-agent-helm-chart-plan.md — combined pr1a's Implemented/build-wiring status with v1.1.0/v1.2.0 dual-actuator + image.digest refresh
deploy/helm/charts/power-agent/{values.yaml,README.md} — kept image.digest hardening on top of pr1a's nvcr.io/nvidia/ai-dynamo/power-agent registry path

Plus one follow-up commit (852ad7ee60) aligning remaining dev-mode examples / helm-unittest fixtures / design docs to the ai-dynamo registry path.

What this PR delivers (review context)

Opt-in DCGM actuation alongside the NVML default from #9682:

Actuator protocol + NvmlActuator / DcgmActuator (components/power_agent/actuator.py)
Helm chart v1.3.0 (appVersion 1.2.0): agent.actuator (nvml|dcgm), agent.dcgm.{host,port,enforce}, image.digest with template-time validators + helm-unittest suite
Design doc: docs/design-docs/power-agent-dual-actuator.md

Verification (local, pre-push)

python3.10 -m pytest components/power_agent/tests/ -q → 190 passed (e2e parity excluded)
helm lint deploy/helm/charts/power-agent --set image.tag=v1.2.0 → 0 failures
helm template … --set agent.actuator=dcgm → renders cleanly
pre-commit content hooks pass; DCO signed on all 14 commits

All 9 CodeRabbit threads are resolved. Requesting a human review + approval so this can merge into pr1a/power-agent after #9682 lands.

Suggested merge order: #9682 → #9790 → #9683.

The docs link check fails on README links to docs.nvidia.com/dynamo/kubernetes-deployment/* (404 from GitHub runners as of 2026-06-08). These are inherited from main, not introduced by the DCGM actuator work, and match the existing .lycheeignore pattern for external docs that fail from CI. Signed-off-by: Kai Ma <kaim@nvidia.com>

…n DCGM restore Graceful SIGTERM restore silently dropped every cap in-cluster: the image runs 'python /app/power_agent.py' (module __main__) while actuator.py reaches the agent via 'import power_agent', so the two module copies each got their own _managed_gpu_indices. The actuator recorded caps into one copy; the SIGTERM handler restored from the other (always empty). Hoist the mutable managed-GPU state (_managed_gpu_indices, _previously_managed, state path) into a standalone managed_state module imported by canonical name from both sites, so all copies converge on one set. Orphan reload now mutates in place instead of rebinding. Revert the __main__ delegation hack. Package managed_state.py into the image COPY and the dev-pod ConfigMap; bump chart 1.2.0 -> 1.3.0 and update README/NOTES/dev-pod docs. Harden DcgmActuator.restore_default against post-reconnect GPU re-enumeration: snapshot the capped UUID at apply time and relocate to its current index before writing default TGP. On a proven index mismatch never fall back to the original (known-wrong) index; relocate, or skip without writing or pruning so cold-start orphan recovery retries. Best-effort restore at the original index only when identity is unreadable (mismatch unproven). Adds regression tests (managed-state sharing, fresh-module-copy aliasing, DCGM relocate/skip/best-effort). Addresses PR #9790 sttts review (module-identity split, DCGM re-enumeration) and follow-up review findings. Signed-off-by: Kai Ma <kaim@nvidia.com>

Signed-off-by: Kai Ma <kaim@nvidia.com> # Conflicts: # components/power_agent/power_agent.py

…ersion 1.2.0 The plan header and layout still described chart version 1.2.0 / appVersion 1.1.0 and a 46-test helm-unittest gate, but PR9790 bumped the chart to 1.3.0 / appVersion 1.2.0 (managed_state.py module) and the helm-unittest suite now reports 49 passing tests. Align the status header, the Chart.yaml layout comment, and the validation-gate count. Signed-off-by: Kai Ma <kaim@nvidia.com>

…ion 1.2.0 The helm-chart-plan design doc still described the chart as version/appVersion 1.1.0 with a 46-test helm-unittest snapshot and pinned every install/validation example to image.tag=v1.1.0, while Chart.yaml is now version 1.3.0 / appVersion 1.2.0 and the suite is 49 tests (verified via `helm unittest deploy/helm/charts/power-agent`). Update the current-state claims (risk-register row 6, section 7.4), the section 5.4 validation recipes, the section 1.4.2 / 5.3.1 install examples, and the test-count snapshot. Also make the validateImageTag helper error text track the chart appVersion (printf "v%s" .Chart.AppVersion) rather than a hardcoded v1.1.0 example, and note managed_state.py in the dev-pod ConfigMap comment. Signed-off-by: Kai Ma <kaim@nvidia.com>

…overage Signed-off-by: Kai Ma <kaim@nvidia.com>

Resolve conflicts in components/power_agent/power_agent.py: - _resolve_cap_for_gpu: keep strict multi-pod policy (PR9790 Codex finding #2) over base's lenient None-filtering docstring/intent; the merged code body enforces the strict policy. - _reconcile_gpu: keep actuator-routed PID read + apply_cap (the DCGM dual-actuator path that is the purpose of this PR) over base's inline pynvml; folded base's persistent-cap rationale into the docstring. release.yml and power-agent/README.md auto-merged. Validated: components/power_agent/tests 206 passed. Signed-off-by: Kai Ma <kaim@nvidia.com>

The file was committed before full CI pre-commit ran (workflows were gated by copy-pr-bot), so black/isort line-wrapping of the PID_TO_UID dict and two list comprehensions was never applied. Pure formatting, no logic change; test_multi_dgd_topology 5 passed. Signed-off-by: Kai Ma <kaim@nvidia.com>

kaim-eng · 2026-06-12T15:04:01Z

Status update -- DCGM dual actuator. All three review threads are addressed and resolved:

lychee coverage narrowed to the 3 exact 404 URLs (anchored, so new/renamed links are still checked);
the __main__-vs-power_agent shared-state cap-leak fixed via managed_state.py (single cached module object);
DCGM re-enumeration handled by a UUID-keyed restore that relocates-or-skips on a proven index mismatch (0b0bca38b3).

CI green (snapshot/operator skip via path filters). Stacked on #9682 -- mergeable once that lands. Description refreshed (DCGM 4.5.1-1, chart 1.3.0/appVersion 1.2.0, 49 helm-unittest cases; NVML default described as behaviourally equivalent rather than byte-identical).

kaim-eng requested review from a team as code owners May 20, 2026 16:42

pull-request-size Bot added the size/XXL label May 20, 2026

github-actions Bot added documentation Improvements or additions to documentation deployment::k8s Relates to dynamo deployment in kubernetes actions labels May 20, 2026

kaim-eng changed the title ~~Pr1b/power agent dcgm actuator~~ feat(power-agent): add DCGM dual actuator (opt-in; NVML remains default) May 20, 2026

kaim-eng changed the base branch from main to pr1a/power-agent May 20, 2026 16:45

github-actions Bot added the feat label May 20, 2026

kaim-eng force-pushed the pr1a/power-agent branch from 0eda1d7 to 1e1ef6f Compare May 25, 2026 13:43

kaim-eng force-pushed the pr1a/power-agent branch from 1e1ef6f to 29f0831 Compare May 25, 2026 13:46

kaim-eng force-pushed the pr1b/power-agent-dcgm-actuator branch from 99138c4 to abcc922 Compare May 25, 2026 14:27

coderabbitai Bot reviewed May 25, 2026

View reviewed changes

kaim-eng and others added 12 commits June 8, 2026 11:58

Add DCGM actuator validation guardrails

609e74c

Signed-off-by: Kai Ma <kaim@nvidia.com>

kaim-eng force-pushed the pr1b/power-agent-dcgm-actuator branch from 464387c to 852ad7e Compare June 8, 2026 16:06

kaim-eng requested a review from sttts June 8, 2026 16:07

kaim-eng mentioned this pull request Jun 8, 2026

feat(planner): power infrastructure — pod annotation, RBAC, config, Prometheus #9683

Open

sttts reviewed Jun 9, 2026

View reviewed changes

Comment thread .lycheeignore Outdated

sttts reviewed Jun 9, 2026

View reviewed changes

Comment thread components/power_agent/actuator.py

sttts reviewed Jun 9, 2026

View reviewed changes

Comment thread components/power_agent/actuator.py

kaim-eng added 3 commits June 9, 2026 21:48

Merge branch 'pr1a/power-agent' into pr1b/power-agent-dcgm-actuator

16e8c84

Signed-off-by: Kai Ma <kaim@nvidia.com> # Conflicts: # components/power_agent/power_agent.py

kaim-eng requested a review from a team as a code owner June 10, 2026 03:00

kaim-eng added 4 commits June 9, 2026 23:33

test(power-agent): add multi-DGD multi-framework reconcile topology c…

ce1a5d3

…overage Signed-off-by: Kai Ma <kaim@nvidia.com>

kaim-eng mentioned this pull request Jun 12, 2026

feat(power-agent): per-node power-cap enforcement DaemonSet #9682

Open

Conversation

kaim-eng commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Validation

Merge Strategy

Stack Order

Uh oh!

copy-pr-bot Bot commented May 20, 2026

Uh oh!

coderabbitai Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Uh oh!

kaim-eng commented May 25, 2026

Uh oh!

coderabbitai Bot commented May 25, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kaim-eng commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rebased onto current pr1a/power-agent

What this PR delivers (review context)

Verification (local, pre-push)

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kaim-eng commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

kaim-eng commented May 20, 2026 •

edited

Loading

coderabbitai Bot commented May 20, 2026 •

edited

Loading

kaim-eng commented Jun 8, 2026 •

edited

Loading

Rebased onto current `pr1a/power-agent`