feat(power-agent): add DCGM dual actuator (opt-in; NVML remains default)#9790
feat(power-agent): add DCGM dual actuator (opt-in; NVML remains default)#9790kaim-eng wants to merge 22 commits into
Conversation
WalkthroughThis pull request introduces a complete Power Agent DaemonSet for enforcing per-GPU power caps on Dynamo worker nodes, featuring a pluggable dual-actuator design (NVML default, DCGM opt-in), comprehensive test coverage, and production-ready Helm chart deployment with template-time validation. ChangesPower Agent Dual-Actuator System
🎯 4 (Complex) | ⏱️ ~60 minutes |
Aligns docs/design-docs/power-agent-helm-chart-plan.md with the v1.1.0 chart that PR #9790 ships. The plan was authored for the NVML-only v1.0.0 chart in PR #9682 and went stale once PR #9790 layered agent.actuator + agent.dcgm.* + validateActuator + validateEnforce + the helm-unittest suite on top. Two changelog rows capture the refresh: v1.3 - first pass: 8 reviewer findings closed (3 blocking, 4 major, 2 medium, 1 low) on Status header, values surface, dev ConfigMap recipe, image-tag pinning, helm-unittest gating, internal-dev-doc references, daemonset name, file/LOC accounting. v1.4 - second pass: 5 follow-up findings closed (2 major, 2 medium, 1 low) on Status self-contradiction, stale section 4.2 dev-block comment, overstated helm-unittest coverage prose, unrunnable section 5.4 positive helm template overlays (missing --set image.tag), and the lingering filename reference in the v1.3 changelog. No design reversal: every section 6 decision and the chart shape are unchanged. Only the values surface (extended), helper set (extended with two template-time validators), and validation-gate list (helm-unittest now required) grew. Every claim in the refreshed doc was verified against on-disk state: power_agent.py:706-804 for the 8-flag CLI surface, values.yaml for the dev-block recipe, _helpers.tpl for the five-helper set, and the two tests/validate_*_test.yaml files for the 24 enumerated unittest cases. Signed-off-by: Kai Ma <kaim@nvidia.com>
Aligns docs/design-docs/power-agent-helm-chart-plan.md with the v1.2.0 chart that the preceding commit ships. The plan was authored for the v1.0.0 NVML-only chart in PR #9682, refreshed to v1.1.0 in the PR #9790 dual-actuator work, and went stale again once v1.2.0 landed image.digest + the canonical-OCI imageRef helper on top. Four reviewer findings closed (v1.5 changelog row covers all four): (1) Truncated SHA-256 digests slipped through the v1.1.0 helper - already fixed in the preceding chart commit; this commit propagates the rule into the §4.4 helper snippet and the §5.4 validation gates. (2) Whitespace-padded tags rendered raw image references - same treatment: §4.4 documents the trim-then-reject contract. (3) §4.3 / §4.4 / §5.4 stale on image.digest: - §4.3 CodeRabbit-comment row rewritten to describe both helpers (validateImageTag + imageRef), both fields (tag + digest), the canonical repo@digest form, and the PR9682 follow-up that added the separate field. - §4.4 helper snippet replaced with the actual v1.2.0 validator (full rule set + per-rule rationale comments) plus the new imageRef helper. - §5.4 expected helm-unittest output bumped 24 -> 46 passed with a one-line breakdown of the +22 new cases. (4) §4.1 / §4.2 stale on file count + helper list + values surface: - §4.1 said 13 files (omitted .helmignore and the new validate_image_tag_test.yaml). Corrected to 14 files (7 templates + 3 helm-unittests + 4 root files). The _helpers.tpl helper-list line gained imageRef and chart entries. The daemonset.yaml annotation now mentions it routes through imageRef. LOC estimate bumped ~1,430 -> ~1,700 with a breakdown of what v1.2.0 added on top of v1.0.0 / v1.1.0. - §4.2 values snippet had only image.tag with no image.digest. Added the field with the same per-rule comment block that ships in values.yaml (OCI form, 64-hex requirement, mutex with image.tag, PR9682 rationale). Status header bumped to chart v1.2.0 with a one-line v1.2.0 rationale (additive opt-in field, no breaking change to existing tag-pinned installs). Revision-history table gains the v1.5 entry summarising all four findings; older v1.1 - v1.4 entries untouched. No design reversal: every §6 decision and the chart shape are unchanged. Only the values surface (one additive field), helper set (two helpers: imageRef new, validateImageTag rewritten), and helm-unittest count (24 -> 46) grew. Every claim in the refreshed doc was verified against on-disk state: Chart.yaml for the version bump, values.yaml for the image.digest field comment, _helpers.tpl for both helpers, tests/validate_image_tag_test.yaml for the 22 enumerated cases (verified via `helm unittest`). Signed-off-by: Kai Ma <kaim@nvidia.com>
…ttributes (v1.10 e2e fix) First in-cluster DCGM parity run on an 8xA100 SXM node (PR #9790 e2e) surfaced three production defects in DcgmActuator, all rooted in pydcgm API misuse that the mocked unit suite couldn't catch: 1. `get_uuid` and `_ensure_identity_map._read_dcgm_uuids` read `DCGM_FI_DEV_UUID` via `dcgmEntityGetLatestValues`. That API returns the field cache; on a fresh nv-hostengine 4.5.3 with no companion watcher subscribed via `dcgmWatchFields`, the cache returns `DCGM_STR_BLANK = "<<<NULL>>>"`. The identity map then raised "8 GPU UUID(s) visible to DCGM are not visible to NVML" on every cluster startup. 2. `constraints_w` / `current_w` / `default_w` read the four `DCGM_FI_DEV_POWER_MGMT_LIMIT{,_MIN,_MAX,_DEF}` fields the same way. Every read returned `DCGM_FP64_BLANK = 2^47`, so the clamp in `apply_cap` escalated a requested 250 W up to the blank max and tried to write 140737488355328 W. 3. The workload-power-profile blanking loop in `_apply_cap_inner` reached for `dcgm_structs.DCGM_INT32_BLANK`. The constant actually lives in `dcgmvalue` (`/shared/pydcgm/dcgmvalue.py:17` in the DCGM 4.5.3 apt bindings); pre-v1.10 the first cap write raised `AttributeError: module 'dcgm_structs' has no attribute 'DCGM_INT32_BLANK'`. Fix: - Add a single `_power_limits(gpu_idx)` helper that calls `DcgmSystem.discovery.GetGpuAttributes(gpu_id).powerLimits` - the synchronous device-info API that wraps `dcgmGetDeviceAttributes`. One RPC instead of four, returns a `c_dcgmDevicePowerLimits_v1` with integer-watt fields populated from the hostengine's discovery state, no field-cache dependency. - `constraints_w`, `current_w`, `default_w`, `restore_default` all route through this helper. - `get_uuid` and `_ensure_identity_map._read_dcgm_uuids` now use `GetGpuAttributes(gpu_id).identifiers.uuid`. - `apply_cap` imports `dcgmvalue` and swaps all seven `DCGM_INT32_BLANK` references onto it. Test infrastructure overhaul (`test_dcgm_actuator.py`): - `_make_dcgm_modules` adds a `dcgmvalue` MagicMock with all four blank sentinels; removes `DCGM_INT32_BLANK` from `dcgm_structs` mock. - `_make_gpu_attrs` extended to carry both `.identifiers.uuid` and `.powerLimits.{cur,default,enforced,min,max}PowerLimit`. - `_wire_handle` wires `GetGpuAttributes(gid)` per-gpu_id. - `_seed_constraints_and_uuid` consolidated onto the unified GetGpuAttributes path. - Five `modules["dcgm_agent"].dcgmEntityGetLatestValues.assert_not_called()` regression guards added across the suite - locks in "no more silent field-cache reads of static device info." E2e re-validation: with this fix in place, the same 8xA100 rig reports `PASS: NvmlActuator and DcgmActuator agree on all probes (tolerance +/- 2.0 W)`, exit code 0, with nvidia-smi confirming `apply_cap(250)` -> 250 W and `restore_default()` -> 400 W on every GPU. Doc: `docs/design-docs/power-agent-dual-actuator.md` v1.10 changelog row added; §6.1 / §6.3 / §7 / §12 / §13 patched to match (Protocol-level docstrings for `current_w`/`default_w`, `get_uuid` / `_power_limits` / `restore_default` code samples, `_apply_cap_inner` import + blank-constant sample, attribution table row 2, summary clause, reference-material entries for `pydcgm/dcgmvalue.py` and `GetGpuAttributes`). Local gates: 163/163 power-agent unit tests pass; pre-commit isort/black/flake8/codespell/ruff/EOF/whitespace clean on the three changed files. Broader `tests/` + `pytest-marker-report` blocked by the known Windows-only `fcntl` import in `tests/conftest.py:22`; CI runs those on Linux. Signed-off-by: Kai Ma <kaim@nvidia.com>
Three Dockerfile bugs combined to make the DCGM-mode image unbuildable on a fresh checkout. Fixing any one in isolation leaves the build broken, so they travel together: 1. DCGM_IMAGE default 'nvcr.io/nvidia/cloud-native/dcgm:4.2.3-2-ubuntu22.04' does not exist on NGC (verified 2026-05-21 via 'docker manifest inspect' → 404). Bump to 4.5.1-1-ubuntu22.04, the only resolvable 4.x tag. 2. DCGM 4.5+ relocated python bindings from /usr/local/dcgm/bindings/python3/ to /usr/share/datacenter-gpu-manager-4/bindings/python3/. The previous COPY would silently copy zero files under the new pin. Switch the source path to the 4.5+ location. 3. NGC's DCGM 4.5+ runtime image ships pydcgm with DcgmGroup.py:20 doing 'import logger' — but logger.py lives in DCGM's source tree under testing/python3/ and is NOT packaged. Without a shim every DcgmGroup construction raises ModuleNotFoundError. Add a 10-line stdlib-logging adapter at components/power_agent/logger.py and COPY it into /opt/dcgm/python/logger.py during the runtime stage. This unblocks 'docker build -f components/power_agent/Dockerfile' on a fresh clone (verified locally via 'docker buildx build --build-arg DCGM_IMAGE=...4.5.1-1-ubuntu22.04' against viking-prod-216 on 2026-05-21, image pushed to ttl.sh/dynamo-pa-kaim-dcgm45-v2:24h and used by the Path-B live test on aks-a100b-22138447-vmss000000). Refs: PR #9790 review, Power Agent live-test findings #1/#2/#6. Signed-off-by: Kai Ma <kaim@nvidia.com>
components/power_agent/actuator.py:807 uses dcgm_structs.c_dcgmDeviceConfig_v2,
a struct that only exists in DCGM 4.x. With 3.x bindings the agent's
DcgmActuator.init() succeeded, opened the hostengine connection, and
ran NVML init cleanly — then crashed mid-first-reconcile with
'AttributeError: module dcgm_structs has no attribute
c_dcgmDeviceConfig_v2' after some GPUs were already capped. The
SIGTERM-restore path won't run when the actuator never finished
registering with _active_actuator, so the GPUs are left at custom caps.
Add a 7-line hasattr check immediately after the dcgm_structs import
in init() that raises a RuntimeError with:
- the missing struct name (so the error is grep-able)
- the required DCGM major (>=4.0)
- the DCGM_IMAGE build-arg + canonical 4.5.1 tag (so operators
know exactly what to bump)
Guards against accidental Dockerfile regressions to a 3.x base image.
Pair with the Dockerfile fix in 4820ca7 (which bumps the default
to 4.5.1-1-ubuntu22.04) so the default build path is consistent with
the runtime contract.
Unit test covers both happy path (DCGM 4.x mock) and 3.x rejection
(MagicMock with c_dcgmDeviceConfig_v2 deleted), asserting init fails
BEFORE the first pydcgm.DcgmHandle call so misconfigured deployments
don't half-init and leave hostengine sockets dangling.
Refs: PR #9790 review, Power Agent live-test finding #5.
Signed-off-by: Kai Ma <kaim@nvidia.com>
0eda1d7 to
1e1ef6f
Compare
… log, pod-UID dedup, argparse) Folds the four CodeRabbit findings on the foundation PR back into this branch so each fix lives with the code it changes, rather than leaking into the downstream DCGM-actuator PR (#9790): * _load_previously_managed_gpus: catch OSError (not just FileNotFoundError) and validate that the JSON root is a dict and managed_uuids is a list. Malformed state files now log a warning and return an empty set instead of crashing the agent at startup. * _handle_sigterm: replace 'except Exception: pass' on pynvml.nvmlShutdown() with logger.exception so shutdown-time NVML faults appear in pod logs. We still fall through to _shutdown.set() so SIGTERM never hangs the container. * _reconcile_gpu: dedup the (pod_uid, annotation) list by UID before applying multi-pod policy. A single pod with N PIDs on one GPU was being counted as N pods, falsely tripping the multi-pod-conflict branch and the multi_pod_gpu_total metric. * main(): move 'import argparse' to module scope per the project's import-placement convention. Regression coverage: existing components/power_agent/tests/ suite (43 tests) still passes locally; behavior-specific tests for these four fixes already live on PR #9790 and remain there. Signed-off-by: Kai Ma <kaim@nvidia.com>
… log, pod-UID dedup, argparse) Folds the four CodeRabbit findings on the foundation PR back into this branch so each fix lives with the code it changes, rather than leaking into the downstream DCGM-actuator PR (#9790): * _load_previously_managed_gpus: catch OSError (not just FileNotFoundError) and validate that the JSON root is a dict and managed_uuids is a list. Malformed state files now log a warning and return an empty set instead of crashing the agent at startup. * _handle_sigterm: replace 'except Exception: pass' on pynvml.nvmlShutdown() with logger.exception so shutdown-time NVML faults appear in pod logs. We still fall through to _shutdown.set() so SIGTERM never hangs the container. * _reconcile_gpu: dedup the (pod_uid, annotation) list by UID before applying multi-pod policy. A single pod with N PIDs on one GPU was being counted as N pods, falsely tripping the multi-pod-conflict branch and the multi_pod_gpu_total metric. * main(): move 'import argparse' to module scope per the project's import-placement convention. Regression coverage: existing components/power_agent/tests/ suite (43 tests) still passes locally; behavior-specific tests for these four fixes already live on PR #9790 and remain there. Signed-off-by: Kai Ma <kaim@nvidia.com>
1e1ef6f to
29f0831
Compare
Aligns docs/design-docs/power-agent-helm-chart-plan.md with the v1.1.0 chart that PR #9790 ships. The plan was authored for the NVML-only v1.0.0 chart in PR #9682 and went stale once PR #9790 layered agent.actuator + agent.dcgm.* + validateActuator + validateEnforce + the helm-unittest suite on top. Two changelog rows capture the refresh: v1.3 - first pass: 8 reviewer findings closed (3 blocking, 4 major, 2 medium, 1 low) on Status header, values surface, dev ConfigMap recipe, image-tag pinning, helm-unittest gating, internal-dev-doc references, daemonset name, file/LOC accounting. v1.4 - second pass: 5 follow-up findings closed (2 major, 2 medium, 1 low) on Status self-contradiction, stale section 4.2 dev-block comment, overstated helm-unittest coverage prose, unrunnable section 5.4 positive helm template overlays (missing --set image.tag), and the lingering filename reference in the v1.3 changelog. No design reversal: every section 6 decision and the chart shape are unchanged. Only the values surface (extended), helper set (extended with two template-time validators), and validation-gate list (helm-unittest now required) grew. Every claim in the refreshed doc was verified against on-disk state: power_agent.py:706-804 for the 8-flag CLI surface, values.yaml for the dev-block recipe, _helpers.tpl for the five-helper set, and the two tests/validate_*_test.yaml files for the 24 enumerated unittest cases. Signed-off-by: Kai Ma <kaim@nvidia.com>
Aligns docs/design-docs/power-agent-helm-chart-plan.md with the v1.2.0 chart that the preceding commit ships. The plan was authored for the v1.0.0 NVML-only chart in PR #9682, refreshed to v1.1.0 in the PR #9790 dual-actuator work, and went stale again once v1.2.0 landed image.digest + the canonical-OCI imageRef helper on top. Four reviewer findings closed (v1.5 changelog row covers all four): (1) Truncated SHA-256 digests slipped through the v1.1.0 helper - already fixed in the preceding chart commit; this commit propagates the rule into the §4.4 helper snippet and the §5.4 validation gates. (2) Whitespace-padded tags rendered raw image references - same treatment: §4.4 documents the trim-then-reject contract. (3) §4.3 / §4.4 / §5.4 stale on image.digest: - §4.3 CodeRabbit-comment row rewritten to describe both helpers (validateImageTag + imageRef), both fields (tag + digest), the canonical repo@digest form, and the PR9682 follow-up that added the separate field. - §4.4 helper snippet replaced with the actual v1.2.0 validator (full rule set + per-rule rationale comments) plus the new imageRef helper. - §5.4 expected helm-unittest output bumped 24 -> 46 passed with a one-line breakdown of the +22 new cases. (4) §4.1 / §4.2 stale on file count + helper list + values surface: - §4.1 said 13 files (omitted .helmignore and the new validate_image_tag_test.yaml). Corrected to 14 files (7 templates + 3 helm-unittests + 4 root files). The _helpers.tpl helper-list line gained imageRef and chart entries. The daemonset.yaml annotation now mentions it routes through imageRef. LOC estimate bumped ~1,430 -> ~1,700 with a breakdown of what v1.2.0 added on top of v1.0.0 / v1.1.0. - §4.2 values snippet had only image.tag with no image.digest. Added the field with the same per-rule comment block that ships in values.yaml (OCI form, 64-hex requirement, mutex with image.tag, PR9682 rationale). Status header bumped to chart v1.2.0 with a one-line v1.2.0 rationale (additive opt-in field, no breaking change to existing tag-pinned installs). Revision-history table gains the v1.5 entry summarising all four findings; older v1.1 - v1.4 entries untouched. No design reversal: every §6 decision and the chart shape are unchanged. Only the values surface (one additive field), helper set (two helpers: imageRef new, validateImageTag rewritten), and helm-unittest count (24 -> 46) grew. Every claim in the refreshed doc was verified against on-disk state: Chart.yaml for the version bump, values.yaml for the image.digest field comment, _helpers.tpl for both helpers, tests/validate_image_tag_test.yaml for the 22 enumerated cases (verified via `helm unittest`). Signed-off-by: Kai Ma <kaim@nvidia.com>
99138c4 to
abcc922
Compare
…ttributes (v1.10 e2e fix) First in-cluster DCGM parity run on an 8xA100 SXM node (PR #9790 e2e) surfaced three production defects in DcgmActuator, all rooted in pydcgm API misuse that the mocked unit suite couldn't catch: 1. `get_uuid` and `_ensure_identity_map._read_dcgm_uuids` read `DCGM_FI_DEV_UUID` via `dcgmEntityGetLatestValues`. That API returns the field cache; on a fresh nv-hostengine 4.5.3 with no companion watcher subscribed via `dcgmWatchFields`, the cache returns `DCGM_STR_BLANK = "<<<NULL>>>"`. The identity map then raised "8 GPU UUID(s) visible to DCGM are not visible to NVML" on every cluster startup. 2. `constraints_w` / `current_w` / `default_w` read the four `DCGM_FI_DEV_POWER_MGMT_LIMIT{,_MIN,_MAX,_DEF}` fields the same way. Every read returned `DCGM_FP64_BLANK = 2^47`, so the clamp in `apply_cap` escalated a requested 250 W up to the blank max and tried to write 140737488355328 W. 3. The workload-power-profile blanking loop in `_apply_cap_inner` reached for `dcgm_structs.DCGM_INT32_BLANK`. The constant actually lives in `dcgmvalue` (`/shared/pydcgm/dcgmvalue.py:17` in the DCGM 4.5.3 apt bindings); pre-v1.10 the first cap write raised `AttributeError: module 'dcgm_structs' has no attribute 'DCGM_INT32_BLANK'`. Fix: - Add a single `_power_limits(gpu_idx)` helper that calls `DcgmSystem.discovery.GetGpuAttributes(gpu_id).powerLimits` - the synchronous device-info API that wraps `dcgmGetDeviceAttributes`. One RPC instead of four, returns a `c_dcgmDevicePowerLimits_v1` with integer-watt fields populated from the hostengine's discovery state, no field-cache dependency. - `constraints_w`, `current_w`, `default_w`, `restore_default` all route through this helper. - `get_uuid` and `_ensure_identity_map._read_dcgm_uuids` now use `GetGpuAttributes(gpu_id).identifiers.uuid`. - `apply_cap` imports `dcgmvalue` and swaps all seven `DCGM_INT32_BLANK` references onto it. Test infrastructure overhaul (`test_dcgm_actuator.py`): - `_make_dcgm_modules` adds a `dcgmvalue` MagicMock with all four blank sentinels; removes `DCGM_INT32_BLANK` from `dcgm_structs` mock. - `_make_gpu_attrs` extended to carry both `.identifiers.uuid` and `.powerLimits.{cur,default,enforced,min,max}PowerLimit`. - `_wire_handle` wires `GetGpuAttributes(gid)` per-gpu_id. - `_seed_constraints_and_uuid` consolidated onto the unified GetGpuAttributes path. - Five `modules["dcgm_agent"].dcgmEntityGetLatestValues.assert_not_called()` regression guards added across the suite - locks in "no more silent field-cache reads of static device info." E2e re-validation: with this fix in place, the same 8xA100 rig reports `PASS: NvmlActuator and DcgmActuator agree on all probes (tolerance +/- 2.0 W)`, exit code 0, with nvidia-smi confirming `apply_cap(250)` -> 250 W and `restore_default()` -> 400 W on every GPU. Doc: `docs/design-docs/power-agent-dual-actuator.md` v1.10 changelog row added; §6.1 / §6.3 / §7 / §12 / §13 patched to match (Protocol-level docstrings for `current_w`/`default_w`, `get_uuid` / `_power_limits` / `restore_default` code samples, `_apply_cap_inner` import + blank-constant sample, attribution table row 2, summary clause, reference-material entries for `pydcgm/dcgmvalue.py` and `GetGpuAttributes`). Local gates: 163/163 power-agent unit tests pass; pre-commit isort/black/flake8/codespell/ruff/EOF/whitespace clean on the three changed files. Broader `tests/` + `pytest-marker-report` blocked by the known Windows-only `fcntl` import in `tests/conftest.py:22`; CI runs those on Linux. Signed-off-by: Kai Ma <kaim@nvidia.com>
Three Dockerfile bugs combined to make the DCGM-mode image unbuildable on a fresh checkout. Fixing any one in isolation leaves the build broken, so they travel together: 1. DCGM_IMAGE default 'nvcr.io/nvidia/cloud-native/dcgm:4.2.3-2-ubuntu22.04' does not exist on NGC (verified 2026-05-21 via 'docker manifest inspect' → 404). Bump to 4.5.1-1-ubuntu22.04, the only resolvable 4.x tag. 2. DCGM 4.5+ relocated python bindings from /usr/local/dcgm/bindings/python3/ to /usr/share/datacenter-gpu-manager-4/bindings/python3/. The previous COPY would silently copy zero files under the new pin. Switch the source path to the 4.5+ location. 3. NGC's DCGM 4.5+ runtime image ships pydcgm with DcgmGroup.py:20 doing 'import logger' — but logger.py lives in DCGM's source tree under testing/python3/ and is NOT packaged. Without a shim every DcgmGroup construction raises ModuleNotFoundError. Add a 10-line stdlib-logging adapter at components/power_agent/logger.py and COPY it into /opt/dcgm/python/logger.py during the runtime stage. This unblocks 'docker build -f components/power_agent/Dockerfile' on a fresh clone (verified locally via 'docker buildx build --build-arg DCGM_IMAGE=...4.5.1-1-ubuntu22.04' against viking-prod-216 on 2026-05-21, image pushed to ttl.sh/dynamo-pa-kaim-dcgm45-v2:24h and used by the Path-B live test on aks-a100b-22138447-vmss000000). Refs: PR #9790 review, Power Agent live-test findings #1/#2/#6. Signed-off-by: Kai Ma <kaim@nvidia.com>
components/power_agent/actuator.py:807 uses dcgm_structs.c_dcgmDeviceConfig_v2,
a struct that only exists in DCGM 4.x. With 3.x bindings the agent's
DcgmActuator.init() succeeded, opened the hostengine connection, and
ran NVML init cleanly — then crashed mid-first-reconcile with
'AttributeError: module dcgm_structs has no attribute
c_dcgmDeviceConfig_v2' after some GPUs were already capped. The
SIGTERM-restore path won't run when the actuator never finished
registering with _active_actuator, so the GPUs are left at custom caps.
Add a 7-line hasattr check immediately after the dcgm_structs import
in init() that raises a RuntimeError with:
- the missing struct name (so the error is grep-able)
- the required DCGM major (>=4.0)
- the DCGM_IMAGE build-arg + canonical 4.5.1 tag (so operators
know exactly what to bump)
Guards against accidental Dockerfile regressions to a 3.x base image.
Pair with the Dockerfile fix in 4820ca7 (which bumps the default
to 4.5.1-1-ubuntu22.04) so the default build path is consistent with
the runtime contract.
Unit test covers both happy path (DCGM 4.x mock) and 3.x rejection
(MagicMock with c_dcgmDeviceConfig_v2 deleted), asserting init fails
BEFORE the first pydcgm.DcgmHandle call so misconfigured deployments
don't half-init and leave hostengine sockets dangling.
Refs: PR #9790 review, Power Agent live-test finding #5.
Signed-off-by: Kai Ma <kaim@nvidia.com>
|
Rebased onto the refreshed What moved:
CI rerunning now. @coderabbitai review |
|
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
components/power_agent/tests/test_managed_state_parser.py (1)
85-275:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAdd required pytest markers (scheduling + GPU + type) to this test module.
The newly added tests are unmarked and don’t satisfy the required test marker policy.
As per coding guidelines: “Every test must have at least one scheduling marker… at least one GPU marker… and at least one type marker…”.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/power_agent/tests/test_managed_state_parser.py` around lines 85 - 275, Import pytest at the top of the test module and add a module-level pytestmark list that applies the required markers (scheduling + GPU + type) to all tests, e.g. place `import pytest` and `pytestmark = [pytest.mark.scheduling, pytest.mark.gpu, pytest.mark.unit]` (or whichever type marker is appropriate) near the top so TestHappyPath, TestMissingFile, TestOsErrorSiblings, TestMalformedJson, TestNonObjectRoot, TestManagedUuidsNotAList, and TestEntryTypeValidation all inherit the markers.components/power_agent/tests/test_orphan_recovery.py (1)
85-286:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAdd required pytest markers (scheduling + GPU + type) to this test module.
This suite needs the mandatory test markers before merge.
As per coding guidelines: “Every test must have at least one scheduling marker… at least one GPU marker… and at least one type marker…”.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/power_agent/tests/test_orphan_recovery.py` around lines 85 - 286, The module lacks required pytest markers (scheduling + GPU + type); add a module-level pytestmark so every test in this file gets those markers: import pytest at top of the file and define pytestmark = [pytest.mark.<scheduling_marker>, pytest.mark.<gpu_marker>, pytest.mark.<type_marker>] (replace placeholders with the project's chosen markers), which will apply to the test classes such as TestUuidGating, TestWorkloadBusySkip, TestCurrentVsDefaultGuard, TestPerGpuExceptionIsolation, and TestManagedSetPruning.components/power_agent/tests/test_shutdown.py (1)
46-229:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAdd required pytest markers (scheduling + GPU + type) to this test module.
This file currently lacks the mandatory test classification markers.
As per coding guidelines: “Every test must have at least one scheduling marker… at least one GPU marker… and at least one type marker…”.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/power_agent/tests/test_shutdown.py` around lines 46 - 229, Add the required pytest classification markers by defining a module-level pytestmark list (e.g. pytestmark = [pytest.mark.scheduling, pytest.mark.gpu, pytest.mark.type]) so every test in this module (including TestSigtermViaActuator and TestSigtermFallback and their methods like _handle_sigterm interactions) carries scheduling, GPU and type markers; place the pytestmark definition near the top of the file (above the test classes) and import pytest if not already imported.components/power_agent/tests/test_reconcile_wiring.py (1)
46-326:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAdd required pytest markers (scheduling + GPU + type) for this module.
These tests currently have none of the required markers.
Suggested patch
+import pytest + +pytestmark = [ + pytest.mark.post_merge, # or pre_merge/nightly/etc. per intended cadence + pytest.mark.gpu_0, + pytest.mark.unit, +]As per coding guidelines: “Every test must have at least one scheduling marker… at least one GPU marker… and at least one type marker…”.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/power_agent/tests/test_reconcile_wiring.py` around lines 46 - 326, Add the required pytest markers by importing pytest and applying scheduling, GPU, and type markers to this test module; either set a module-level pytestmark list or add decorators to each test class (e.g. TestReconcileGpuRoutesViaActuator, TestReconcileGpuDedupesByPodUid, TestReconcileGpuPolicyResolution) to include at least one scheduling marker, one GPU marker, and one type marker per the guidelines so all tests in this file are properly marked.
🧹 Nitpick comments (4)
deploy/helm/charts/power-agent/tests/validate_enforce_test.yaml (1)
54-95: ⚡ Quick winStrengthen enforce-flag assertion coverage for accepted inputs and nvml dead-config checks.
Line 54 onward accepts multiple string forms with only
notFailedTemplate, and Line 136 claims “ANY allowlisted value” absence while checking only a subset. Add exactcontains/notContainsassertions for the remaining allowlisted renderable forms to catch silent drift.Proposed test hardening pattern
- it: should accept enforce=1 (string) set: agent.dcgm.enforce: "1" asserts: - notFailedTemplate: {} + - contains: + path: spec.template.spec.containers[0].command + content: --dcgm-enforce=1 - it: should accept enforce=0 (string) set: agent.dcgm.enforce: "0" asserts: - notFailedTemplate: {} + - contains: + path: spec.template.spec.containers[0].command + content: --dcgm-enforce=0 - it: should accept enforce=yes set: agent.dcgm.enforce: "yes" asserts: - notFailedTemplate: {} + - contains: + path: spec.template.spec.containers[0].command + content: --dcgm-enforce=yes- notContains: path: spec.template.spec.containers[0].command content: --dcgm-enforce=false + - notContains: + path: spec.template.spec.containers[0].command + content: --dcgm-enforce=1 + - notContains: + path: spec.template.spec.containers[0].command + content: --dcgm-enforce=0 + - notContains: + path: spec.template.spec.containers[0].command + content: --dcgm-enforce=yes + - notContains: + path: spec.template.spec.containers[0].command + content: --dcgm-enforce=no + - notContains: + path: spec.template.spec.containers[0].command + content: --dcgm-enforce=on + - notContains: + path: spec.template.spec.containers[0].command + content: --dcgm-enforce=offAlso applies to: 136-147
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/helm/charts/power-agent/tests/validate_enforce_test.yaml` around lines 54 - 95, The tests under validate_enforce_test.yaml currently only use notFailedTemplate for many agent.dcgm.enforce string variants; update each case (e.g., the tests setting agent.dcgm.enforce to "1","0","yes","no","on","off","TRUE") to also assert the rendered templates explicitly include or exclude the expected configuration snippets by adding contains/notContains assertions for the corresponding normalized values (e.g., numeric 1/0, true/false, on/off forms) so silent drift is caught, and apply the same hardening to the other block referenced (the ANY allowlisted values section around lines 136-147) to cover all allowed renderable forms and the nvml dead-config checks rather than relying solely on notFailedTemplate.components/power_agent/tests/test_reconcile_wiring.py (1)
218-219: ⚡ Quick winMove function-scope imports to module scope.
from tests.test_multi_pod_policy import _FakeMetricsis imported inside test methods; this should be hoisted to top-level imports.Suggested patch
import unittest from unittest.mock import MagicMock, patch import power_agent from power_agent import PowerAgent +from tests.test_multi_pod_policy import _FakeMetrics @@ - from tests.test_multi_pod_policy import _FakeMetrics - agent.metrics = _FakeMetrics() @@ - from tests.test_multi_pod_policy import _FakeMetrics - agent.metrics = _FakeMetrics() @@ - from tests.test_multi_pod_policy import _FakeMetrics - agent.metrics = _FakeMetrics()As per coding guidelines: “ensure imports are only at module scope (no imports inside functions/classes)”.
Also applies to: 243-244, 269-270
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/power_agent/tests/test_reconcile_wiring.py` around lines 218 - 219, The test module imports `_FakeMetrics` inside test functions; hoist `from tests.test_multi_pod_policy import _FakeMetrics` to the top-level imports in the module (module scope) and remove the function-scope imports in the test methods (including the other occurrences where `_FakeMetrics` is imported inside tests). Ensure the module-level import is added near the other test imports so all tests reference the same top-level `_FakeMetrics`.components/power_agent/tests/test_managed_state_parser.py (1)
60-67: ⚡ Quick winHoist
loggingimports to module scope.
import loggingis currently inside methods; move it to top-level imports for consistency with Python guidelines.As per coding guidelines: “ensure imports are only at module scope (no imports inside functions/classes)”.
Also applies to: 114-121
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/power_agent/tests/test_managed_state_parser.py` around lines 60 - 67, Move the "import logging" statements out of the test functions and into module-level imports in test_managed_state_parser.py; remove the in-function imports at the locations that set up captured, handler, and logger (the blocks that create captured: list[logging.LogRecord], handler = logging.Handler(), handler.emit = captured.append, handler.setLevel(logging.WARNING), and logger = logging.getLogger("power_agent")), so the tests use the top-level logging import consistently (also apply the same change to the other block referenced in the file).components/power_agent/tests/e2e_actuator_parity.py (1)
497-500: ⚡ Quick winAvoid silently swallowing NVML shutdown errors.
This
except Exception: passhides teardown faults and makes parity-run failures hard to diagnose.Suggested patch
finally: try: pynvml.nvmlShutdown() - except Exception: - pass + except Exception as e: + print(f"WARNING: pynvml.nvmlShutdown() failed: {e}", file=sys.stderr)As per coding guidelines: “fail fast (don’t swallow exceptions, catch specific exceptions only, and if catching
Exceptionthen log and re-raise)”.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/power_agent/tests/e2e_actuator_parity.py` around lines 497 - 500, The try/except that currently calls pynvml.nvmlShutdown() and swallows all exceptions should be changed to catch the specific NVML exception type (pynvml.NVMLError) instead of Exception, log the error with traceback, and re-raise so teardown failures are visible; update the block around pynvml.nvmlShutdown() (replace the bare except Exception: pass) to except pynvml.NVMLError as e: logger.exception("pynvml.nvmlShutdown failed: %s", e) and raise, and if a module-level logger isn’t present add logger = logging.getLogger(__name__) and import logging.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@components/power_agent/actuator.py`:
- Around line 775-785: The except-block around the call to self._apply_cap_inner
should only catch DCGM write-related exceptions instead of all Exception;
replace "except Exception as e:" with a narrow catch such as "except DcgmError
as e" (or the specific DCGM client exception types your project uses, e.g.,
DcgmApiError / DcgmWriteError) and ensure those exception classes are imported,
keep the existing logger.error and metrics.inc for that case, and re-raise any
other unexpected exceptions so programming/runtime bugs are not swallowed by
apply_cap/_apply_cap_inner.
- Around line 539-540: The except block that swallows all exceptions after
calling _handle.Shutdown() must be changed to avoid hiding failures: catch only
expected exceptions or catch Exception but log the exception context before
continuing. Locate the try/except surrounding _handle.Shutdown() in actuator.py
(the block that currently reads "except Exception: pass") and replace the silent
swallow with a logged error using the module logger (include the exception
instance and stacktrace) or re-raise critical exceptions after logging; ensure
the log message mentions "_handle.Shutdown" so it’s easy to find in logs.
- Around line 618-622: The RuntimeError raised in DcgmActuator.list_running_pids
when a GPU UUID is no longer visible should preserve the original KeyError as
the root cause; change the raise so it uses exception chaining (raise
RuntimeError(...) from err) where err is the caught KeyError from the UUID-map
lookup, keeping the message unchanged but appending "from err" to preserve the
original traceback and context for uuid and gpu_idx.
In `@components/power_agent/power_agent.py`:
- Around line 169-177: The current check uses valid = {u for u in uuids if
isinstance(u, str)} which turns uuids into a set and conflates duplicates with
non-string entries; replace this by collecting only string entries without
deduping (e.g., valid_list = [u for u in uuids if isinstance(u, str)]) and
compute non_string_count = sum(1 for u in uuids if not isinstance(u, str)); then
change the logger.warning invocation (the one referencing _MANAGED_STATE_PATH)
to report non_string_count and the number of kept UUIDs (use len(valid_list) or
len(set(valid_list)) depending on whether you intend to report raw entries or
unique UUIDs) instead of comparing len(valid) to len(uuids).
In `@deploy/helm/charts/power-agent/README.md`:
- Around line 49-53: The README currently references a specific chart version
("chart v1.1.0 requires the v1.1.0 image"); update the wording to tie the
requirement to the chart's appVersion instead of a hard-coded chart version:
state that the image tag MUST match the chart's `appVersion`, and explain that
this is necessary because the chart renders the `--actuator` / `--dcgm-*` CLI
flags and `power_agent.py` imports the appVersion-specific `actuator` module;
replace the version-specific sentence with this appVersion-based guidance so
future chart bumps don't require text edits.
In `@deploy/helm/charts/power-agent/templates/_helpers.tpl`:
- Around line 117-119: The helper currently only checks for leading/trailing
whitespace on $tag but allows internal whitespace (e.g., "v1 .1.0") which
produces invalid OCI image references; update the validation in the template
that handles image.tag (the block using $tag and the fail message) to reject any
whitespace characters anywhere in $tag (not just trim differences) by testing
for matches against a whitespace pattern and calling fail with the existing
informative message when any whitespace is found; keep the same fail text and
hint but trigger it when $tag contains internal spaces or other whitespace.
In `@docs/design-docs/power-agent-helm-chart-plan.md`:
- Line 828: The Markdown has unlabeled fenced code blocks that trigger MD040;
update each triple-backtick block in power-agent-helm-chart-plan.md to include a
language identifier (e.g., change ``` to ```text or to a specific language such
as ```yaml, ```bash, or ```json as appropriate) so the linter recognizes the
block language; apply the same change to the other unlabeled fences mentioned in
the comment to resolve MD040.
- Line 29: The heading "### Revision history" is skipping a level; change that
heading to "## Revision history" (or add a parent "##" section before it) so the
document preserves correct heading hierarchy and satisfies MD001; update the
line containing "### Revision history" in
docs/design-docs/power-agent-helm-chart-plan.md accordingly.
- Around line 826-846: Update the stale v1.1.0 summary counts in sections §5.2
and §8 to match the v1.2.0 source-of-truth numbers used elsewhere in this
document: replace "13 files / 2 helm-unittests / 24 passed" with "14 files / 3
helm-unittests / 46 passed" (and any adjacent totals like "Chart-only total" or
LOC if they differ), and make the identical corrections where the same summary
appears around lines 1151-1160 so all references are consistent.
---
Outside diff comments:
In `@components/power_agent/tests/test_managed_state_parser.py`:
- Around line 85-275: Import pytest at the top of the test module and add a
module-level pytestmark list that applies the required markers (scheduling + GPU
+ type) to all tests, e.g. place `import pytest` and `pytestmark =
[pytest.mark.scheduling, pytest.mark.gpu, pytest.mark.unit]` (or whichever type
marker is appropriate) near the top so TestHappyPath, TestMissingFile,
TestOsErrorSiblings, TestMalformedJson, TestNonObjectRoot,
TestManagedUuidsNotAList, and TestEntryTypeValidation all inherit the markers.
In `@components/power_agent/tests/test_orphan_recovery.py`:
- Around line 85-286: The module lacks required pytest markers (scheduling + GPU
+ type); add a module-level pytestmark so every test in this file gets those
markers: import pytest at top of the file and define pytestmark =
[pytest.mark.<scheduling_marker>, pytest.mark.<gpu_marker>,
pytest.mark.<type_marker>] (replace placeholders with the project's chosen
markers), which will apply to the test classes such as TestUuidGating,
TestWorkloadBusySkip, TestCurrentVsDefaultGuard, TestPerGpuExceptionIsolation,
and TestManagedSetPruning.
In `@components/power_agent/tests/test_reconcile_wiring.py`:
- Around line 46-326: Add the required pytest markers by importing pytest and
applying scheduling, GPU, and type markers to this test module; either set a
module-level pytestmark list or add decorators to each test class (e.g.
TestReconcileGpuRoutesViaActuator, TestReconcileGpuDedupesByPodUid,
TestReconcileGpuPolicyResolution) to include at least one scheduling marker, one
GPU marker, and one type marker per the guidelines so all tests in this file are
properly marked.
In `@components/power_agent/tests/test_shutdown.py`:
- Around line 46-229: Add the required pytest classification markers by defining
a module-level pytestmark list (e.g. pytestmark = [pytest.mark.scheduling,
pytest.mark.gpu, pytest.mark.type]) so every test in this module (including
TestSigtermViaActuator and TestSigtermFallback and their methods like
_handle_sigterm interactions) carries scheduling, GPU and type markers; place
the pytestmark definition near the top of the file (above the test classes) and
import pytest if not already imported.
---
Nitpick comments:
In `@components/power_agent/tests/e2e_actuator_parity.py`:
- Around line 497-500: The try/except that currently calls pynvml.nvmlShutdown()
and swallows all exceptions should be changed to catch the specific NVML
exception type (pynvml.NVMLError) instead of Exception, log the error with
traceback, and re-raise so teardown failures are visible; update the block
around pynvml.nvmlShutdown() (replace the bare except Exception: pass) to except
pynvml.NVMLError as e: logger.exception("pynvml.nvmlShutdown failed: %s", e) and
raise, and if a module-level logger isn’t present add logger =
logging.getLogger(__name__) and import logging.
In `@components/power_agent/tests/test_managed_state_parser.py`:
- Around line 60-67: Move the "import logging" statements out of the test
functions and into module-level imports in test_managed_state_parser.py; remove
the in-function imports at the locations that set up captured, handler, and
logger (the blocks that create captured: list[logging.LogRecord], handler =
logging.Handler(), handler.emit = captured.append,
handler.setLevel(logging.WARNING), and logger =
logging.getLogger("power_agent")), so the tests use the top-level logging import
consistently (also apply the same change to the other block referenced in the
file).
In `@components/power_agent/tests/test_reconcile_wiring.py`:
- Around line 218-219: The test module imports `_FakeMetrics` inside test
functions; hoist `from tests.test_multi_pod_policy import _FakeMetrics` to the
top-level imports in the module (module scope) and remove the function-scope
imports in the test methods (including the other occurrences where
`_FakeMetrics` is imported inside tests). Ensure the module-level import is
added near the other test imports so all tests reference the same top-level
`_FakeMetrics`.
In `@deploy/helm/charts/power-agent/tests/validate_enforce_test.yaml`:
- Around line 54-95: The tests under validate_enforce_test.yaml currently only
use notFailedTemplate for many agent.dcgm.enforce string variants; update each
case (e.g., the tests setting agent.dcgm.enforce to
"1","0","yes","no","on","off","TRUE") to also assert the rendered templates
explicitly include or exclude the expected configuration snippets by adding
contains/notContains assertions for the corresponding normalized values (e.g.,
numeric 1/0, true/false, on/off forms) so silent drift is caught, and apply the
same hardening to the other block referenced (the ANY allowlisted values section
around lines 136-147) to cover all allowed renderable forms and the nvml
dead-config checks rather than relying solely on notFailedTemplate.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 4b586c90-dca2-4d5f-aaae-4f35d04c75ef
📒 Files selected for processing (26)
components/power_agent/Dockerfilecomponents/power_agent/README.mdcomponents/power_agent/actuator.pycomponents/power_agent/logger.pycomponents/power_agent/power_agent.pycomponents/power_agent/tests/e2e_actuator_parity.pycomponents/power_agent/tests/test_actuator_protocol.pycomponents/power_agent/tests/test_actuator_selection.pycomponents/power_agent/tests/test_dcgm_actuator.pycomponents/power_agent/tests/test_managed_state_parser.pycomponents/power_agent/tests/test_multi_pod_policy.pycomponents/power_agent/tests/test_orphan_recovery.pycomponents/power_agent/tests/test_reconcile_wiring.pycomponents/power_agent/tests/test_shutdown.pydeploy/helm/charts/power-agent/Chart.yamldeploy/helm/charts/power-agent/README.mddeploy/helm/charts/power-agent/templates/NOTES.txtdeploy/helm/charts/power-agent/templates/_helpers.tpldeploy/helm/charts/power-agent/templates/daemonset.yamldeploy/helm/charts/power-agent/templates/dev-pod.yamldeploy/helm/charts/power-agent/tests/validate_actuator_test.yamldeploy/helm/charts/power-agent/tests/validate_enforce_test.yamldeploy/helm/charts/power-agent/tests/validate_image_tag_test.yamldeploy/helm/charts/power-agent/values.yamldocs/design-docs/power-agent-dual-actuator.mddocs/design-docs/power-agent-helm-chart-plan.md
…dation
Chart counterpart to the Actuator Protocol introduced in the previous
commit. Bumps Chart.yaml 1.0.0 -> 1.1.0 (minor: new opt-in DCGM
actuator on top of byte-identical NVML default).
values.yaml additions:
- agent.actuator: nvml | dcgm (default nvml - pre-PR1b behaviour
byte-for-byte). No auto-detect: the operator declares which
actuator their cluster runs based on whether their GPU Operator
has dcgm.enabled=true, and the chart renders one DaemonSet with
one actuator locked at startup. Eliminates "NVML and DCGM running
simultaneously" by construction.
- agent.dcgm.host (default nvidia-dcgm.gpu-operator.svc.cluster.local
- the GPU Operator's standard hostengine Service DNS name)
- agent.dcgm.port (default 5555 - upstream nvidia-dcgm hostPort)
- agent.dcgm.enforce (default false - set-and-forget matches NVML
semantics; flip to true to register cap as DCGM's target
configuration for auto-reapply after GPU reset/reinit)
daemonset.yaml:
- --actuator=$(agent.actuator) on every DaemonSet command line
- {{- if eq .Values.agent.actuator "dcgm" }} block injects
--dcgm-host / --dcgm-port / --dcgm-enforce. NVML mode never sees
these flags (no dead args).
- Wires four template-time validators at the top of the file:
validateImageTag, validateMutex, validateActuator, validateEnforce.
dev-pod.yaml:
- ConfigMap script mount now expects BOTH power_agent.py AND
actuator.py (chart v1.1.0+ requirement - a v1.0.0-style ConfigMap
with only power_agent.py will ImportError "No module named
'actuator'" at pod start). README + values.yaml comments + NOTES.txt
kubectl-create-cm snippets all updated to reflect the two-file
layout.
- Command line plumbs the same --actuator / --dcgm-* flags as the
production DaemonSet. Pre-v1.7 dev pods silently defaulted to
NVML regardless of agent.actuator=dcgm (the values were read but
never passed to argparse).
_helpers.tpl:
- validateActuator: strict allowlist "nvml" or "dcgm" (rejects
"NVML", "DCGM", "auto" - the v1.2 reversal - and empty/typo
inputs). Errors at helm install time, not pod start.
- validateEnforce: mirrors power_agent._parse_bool_strict's
allowlist (true/1/yes/on/false/0/no/off case-insensitive).
Rejects --set agent.dcgm.enforce=treu at template time with a
clear errorPattern instead of letting argparse fail at pod start
with a less obvious CrashLoopBackOff trace.
tests/ (new directory - helm-unittest suite):
- validate_actuator_test.yaml: 11 tests covering the happy
nvml/dcgm paths, DCGM flag injection (and exclusion under NVML),
and the four typo classes (NVML/DCGM uppercase, auto, unknown,
empty).
- validate_enforce_test.yaml: 13 tests covering all eight
_parse_bool_strict allowlist values, uppercase normalization,
the canonical "treu" typo class, and the actuator=nvml + bad
enforce cross-validator interaction (no validation fires when
the flag is dead).
- Assertion style: `contains` / `notContains` against the rendered
command: YAML list (NOT matchRegex, which helm-unittest 1.1.0
only supports on string paths). Each `contains` check pins the
exact rendered flag string, stricter than the regex form and
catches flag-value drift.
NOTES.txt:
- Adds dcgm_enforce_failures_total to the Prometheus metrics
listing with semantic disambiguation from apply_failures_total.
README updates:
- Compact values-table rows for dev.scriptConfigMap (now
two-file) and dev.image (DCGM dev-mode override pointer).
- DCGM dev-mode prerequisite block: explicit --set
dev.image.repository=nvcr.io/nvidia/dynamo/power-agent --set
dev.image.tag=v1.1.0 override because the default
vllm-runtime:1.0.1 ships pynvml only (no pydcgm / libdcgm.so).
Script iteration via the ConfigMap mount still works against
the power-agent image because the /scripts mount overrides
/app at runtime.
- Install-example image.tag pinning bumped 1.0.0 -> 1.1.0 to
match Chart.yaml's appVersion (a v1.0.0 image lacks actuator.py
and rejects the new --actuator / --dcgm-* flags).
Signed-off-by: Kai Ma <kaim@nvidia.com>
Aligns docs/design-docs/power-agent-helm-chart-plan.md with the v1.1.0 chart that PR #9790 ships. The plan was authored for the NVML-only v1.0.0 chart in PR #9682 and went stale once PR #9790 layered agent.actuator + agent.dcgm.* + validateActuator + validateEnforce + the helm-unittest suite on top. Two changelog rows capture the refresh: v1.3 - first pass: 8 reviewer findings closed (3 blocking, 4 major, 2 medium, 1 low) on Status header, values surface, dev ConfigMap recipe, image-tag pinning, helm-unittest gating, internal-dev-doc references, daemonset name, file/LOC accounting. v1.4 - second pass: 5 follow-up findings closed (2 major, 2 medium, 1 low) on Status self-contradiction, stale section 4.2 dev-block comment, overstated helm-unittest coverage prose, unrunnable section 5.4 positive helm template overlays (missing --set image.tag), and the lingering filename reference in the v1.3 changelog. No design reversal: every section 6 decision and the chart shape are unchanged. Only the values surface (extended), helper set (extended with two template-time validators), and validation-gate list (helm-unittest now required) grew. Every claim in the refreshed doc was verified against on-disk state: power_agent.py:706-804 for the 8-flag CLI surface, values.yaml for the dev-block recipe, _helpers.tpl for the five-helper set, and the two tests/validate_*_test.yaml files for the 24 enumerated unittest cases. Signed-off-by: Kai Ma <kaim@nvidia.com>
Addresses five PR9682 CodeRabbit review comments on the Python runtime. Pure hardening: no behaviour change on the happy path, every fix narrows a previously silent / fragile failure mode. power_agent.py - _load_previously_managed_gpus is no longer best-effort. The pre-fix try/except caught only FileNotFoundError + JSONDecodeError and used .get() on whatever json.load returned, so the parser could (a) propagate OSError siblings (PermissionError on a stale host-volume bind, EIO on flaky NFS) up the stack and crash the agent at startup, (b) AttributeError on a non-dict JSON root, or (c) silently iterate characters / TypeError on a non-list managed_uuids value. Replaced with a layered parser that catches OSError, validates the root is a dict, validates managed_uuids is a list, and coerces the entry-is-str type guard at the boundary so downstream "uuid in _previously_managed" checks can't compare bytes/ints against str UUIDs and silently never match. Every malformation logs at WARNING and returns empty rather than raising; orphan recovery silently skips this startup, which is strictly better than CrashLoopBackOff with no caps actuated. - _reconcile_gpu now deduplicates pod_annotations by pod UID before handing them to _resolve_cap_for_gpu. Pre-fix one pod with N GPU processes (the common TP/PP/EP topology, ranks-per-GPU, helper workers, profilers) appeared as N rows; _resolve_cap_for_gpu uses len(pod_annotations) > 1 to detect the multi-pod-per-GPU misconfig, so a one-pod / two-PID GPU spuriously fired the "N pods all agree" WARNING and incorrectly bumped the multi_pod_gpu_total agree label. On a pod whose annotation was missing or invalid, the same code path also took the conflict-resolution branch and applied safe_default even though the operator's intent was unambiguous. - _handle_sigterm no longer swallows actuator.shutdown / nvmlShutdown failures silently. Pre-fix "except Exception: pass" made DCGM hostengine faults and NVML teardown errors at SIGTERM time invisible in pod logs - operators had no way to correlate a failed restart with a stuck hostengine. logger.exception writes the full traceback at ERROR while still proceeding to _shutdown.set() so the run loop unblocks (cleanup is best-effort, the container MUST exit promptly on SIGTERM). - argparse hoisted from inside main() to the module-top imports alongside the other stdlib modules (CodeRabbit review-body nit). actuator.py - DcgmActuator.shutdown applied the same logger.exception treatment to its three silent broad catches (group.Delete, handle.Shutdown, pynvml.nvmlShutdown). Group failures name the gpu_id, hostengine shutdown failures name host:port, NVML teardown names the source - so a DCGM resource leak is no longer invisible. Same non-reraise contract: cleanup is best-effort, callers (power_agent._handle_sigterm) must proceed to _shutdown.set(). Tests added / extended (+29 new cases; total 162 passed): - tests/test_managed_state_parser.py (new, 20 cases): happy path, missing file silent, OSError siblings (PermissionError, IsADirectoryError, NotADirectoryError, EIO), malformed JSON, non-dict root (list/string/int/null), non-list managed_uuids (string/int/dict), mixed-type entries (drop count + WARNING fires once). - tests/test_reconcile_wiring.py::TestReconcileGpuDedupesByPodUid (4 new cases): one-pod/multi-PID counts once, doesn't bump multi-pod counter; two-pods/multi-PID-each genuinely counts as two; two disagreeing pods each multi-PID still resolves to safe_default with conflict counter ticking exactly once. - tests/test_shutdown.py (2 new cases): assertLogs ERROR fires with the original exception message when actuator.shutdown raises; same for the defensive raw-NVML fallback when pynvml.nvmlShutdown raises. - tests/test_dcgm_actuator.py (3 new cases): group.Delete failure logs gpu_id; handle.Shutdown failure logs host:port; nvmlShutdown failure logs the source. Signed-off-by: Kai Ma <kaim@nvidia.com>
…ical OCI form
Addresses three iterative PR9682 CodeRabbit review findings on the
chart's image-pinning surface. Bumps Chart.yaml 1.1.0 -> 1.2.0 (minor
per SemVer - image.digest is purely additive; existing tag-pinned
installs render byte-identically and need no change).
Round 1 (original PR9682 finding): image.tag=latest was rejected only
at the raw-manifest layer, not at the chart layer. helm template
--set image.tag=latest happily rendered nvcr.io/nvidia/dynamo/
power-agent:latest because validateImageTag only checked
"if not .Values.image.tag". Now case-insensitively rejects the literal
"latest" (LATEST, Latest, " latest " all fail) while still accepting
release tags that contain the substring (v1.0.0-latest-rc).
Round 2 (digest path rendered an invalid OCI reference): operators
were advised in values.yaml + README + tests to set image.tag=
sha256:abc... for digest pinning. This rendered repo:sha256:... -
INVALID per the OCI spec, which requires repo@sha256:... for digest
form (the second ":" in the rendered string parses as repo + tag
"sha256" with a stray suffix; kubelet manifest pulls fail with an
opaque error). Three changes close this:
- values.yaml: add image.digest field with per-rule comment
block explaining the OCI form, the 64-hex requirement, the
mutex with image.tag, and the PR #9682 rationale.
- _helpers.tpl: split validateImageTag into a full rule set
(XOR(tag, digest), whitespace rejection on both, latest
rejection on tag, sha256: prefix rejection on tag, exact-
64-hex regex on digest) and add a new imageRef helper that
emits {repo}@{digest} when digest is set, else {repo}:{tag}.
- daemonset.yaml: image: line now routes through imageRef
instead of hard-coding the ":" join, so the canonical OCI
form is the only thing that can ship.
Round 3 (validator was too lenient):
- Truncated SHA-256 digests slipped through: regex was
^sha256:[0-9a-fA-F]{32,}$, so half-digests (a common pattern
when an operator copies the first 8 bytes from
`docker images --digests` output) rendered as
repo@sha256:<truncated>. The kubelet then fails the pull with
an opaque manifest-mismatch error. Tightened to {64} exactly
(SHA-256 is 32 bytes x 2 nybbles).
- Whitespace-padded values rendered raw: --set-string
'image.tag= v1.1.0 ' produced image: "repo: v1.1.0 " because
the latest comparison trimmed but the renderer used the raw
value. Choice: reject (don't silently normalize) so the
operator sees their --set quoting typo. Applies symmetrically
to image.digest.
tests/validate_image_tag_test.yaml (new, 22 cases - chart
helm-unittest count grew 24 -> 46):
- Negative: unset (both empty), latest in 4 case variants,
whitespace on tag (leading/trailing/both), tag+digest mutex,
sha256: prefix on tag, non-digest on digest, 32/63/65-hex digest
(off-by-one + half-length), whitespace on digest.
- Positive: pinned release tag renders {repo}:{tag};
v1.0.0-latest-rc accepted (substring match guard); 64-hex digest
renders the canonical {repo}@sha256:... form; uppercase-hex
digest preserved verbatim.
README updates: production-install section now shows both tag and
digest invocations side by side with a callout that digests live
on image.digest, not image.tag. Values table gains the image.digest
row. Troubleshooting section adds three new error-message entries
covering missing-pin, invalid-digest, and mutex-violation cases.
Signed-off-by: Kai Ma <kaim@nvidia.com>
Aligns docs/design-docs/power-agent-helm-chart-plan.md with the v1.2.0 chart that the preceding commit ships. The plan was authored for the v1.0.0 NVML-only chart in PR #9682, refreshed to v1.1.0 in the PR #9790 dual-actuator work, and went stale again once v1.2.0 landed image.digest + the canonical-OCI imageRef helper on top. Four reviewer findings closed (v1.5 changelog row covers all four): (1) Truncated SHA-256 digests slipped through the v1.1.0 helper - already fixed in the preceding chart commit; this commit propagates the rule into the §4.4 helper snippet and the §5.4 validation gates. (2) Whitespace-padded tags rendered raw image references - same treatment: §4.4 documents the trim-then-reject contract. (3) §4.3 / §4.4 / §5.4 stale on image.digest: - §4.3 CodeRabbit-comment row rewritten to describe both helpers (validateImageTag + imageRef), both fields (tag + digest), the canonical repo@digest form, and the PR9682 follow-up that added the separate field. - §4.4 helper snippet replaced with the actual v1.2.0 validator (full rule set + per-rule rationale comments) plus the new imageRef helper. - §5.4 expected helm-unittest output bumped 24 -> 46 passed with a one-line breakdown of the +22 new cases. (4) §4.1 / §4.2 stale on file count + helper list + values surface: - §4.1 said 13 files (omitted .helmignore and the new validate_image_tag_test.yaml). Corrected to 14 files (7 templates + 3 helm-unittests + 4 root files). The _helpers.tpl helper-list line gained imageRef and chart entries. The daemonset.yaml annotation now mentions it routes through imageRef. LOC estimate bumped ~1,430 -> ~1,700 with a breakdown of what v1.2.0 added on top of v1.0.0 / v1.1.0. - §4.2 values snippet had only image.tag with no image.digest. Added the field with the same per-rule comment block that ships in values.yaml (OCI form, 64-hex requirement, mutex with image.tag, PR9682 rationale). Status header bumped to chart v1.2.0 with a one-line v1.2.0 rationale (additive opt-in field, no breaking change to existing tag-pinned installs). Revision-history table gains the v1.5 entry summarising all four findings; older v1.1 - v1.4 entries untouched. No design reversal: every §6 decision and the chart shape are unchanged. Only the values surface (one additive field), helper set (two helpers: imageRef new, validateImageTag rewritten), and helm-unittest count (24 -> 46) grew. Every claim in the refreshed doc was verified against on-disk state: Chart.yaml for the version bump, values.yaml for the image.digest field comment, _helpers.tpl for both helpers, tests/validate_image_tag_test.yaml for the 22 enumerated cases (verified via `helm unittest`). Signed-off-by: Kai Ma <kaim@nvidia.com>
…ttributes (v1.10 e2e fix) First in-cluster DCGM parity run on an 8xA100 SXM node (PR #9790 e2e) surfaced three production defects in DcgmActuator, all rooted in pydcgm API misuse that the mocked unit suite couldn't catch: 1. `get_uuid` and `_ensure_identity_map._read_dcgm_uuids` read `DCGM_FI_DEV_UUID` via `dcgmEntityGetLatestValues`. That API returns the field cache; on a fresh nv-hostengine 4.5.3 with no companion watcher subscribed via `dcgmWatchFields`, the cache returns `DCGM_STR_BLANK = "<<<NULL>>>"`. The identity map then raised "8 GPU UUID(s) visible to DCGM are not visible to NVML" on every cluster startup. 2. `constraints_w` / `current_w` / `default_w` read the four `DCGM_FI_DEV_POWER_MGMT_LIMIT{,_MIN,_MAX,_DEF}` fields the same way. Every read returned `DCGM_FP64_BLANK = 2^47`, so the clamp in `apply_cap` escalated a requested 250 W up to the blank max and tried to write 140737488355328 W. 3. The workload-power-profile blanking loop in `_apply_cap_inner` reached for `dcgm_structs.DCGM_INT32_BLANK`. The constant actually lives in `dcgmvalue` (`/shared/pydcgm/dcgmvalue.py:17` in the DCGM 4.5.3 apt bindings); pre-v1.10 the first cap write raised `AttributeError: module 'dcgm_structs' has no attribute 'DCGM_INT32_BLANK'`. Fix: - Add a single `_power_limits(gpu_idx)` helper that calls `DcgmSystem.discovery.GetGpuAttributes(gpu_id).powerLimits` - the synchronous device-info API that wraps `dcgmGetDeviceAttributes`. One RPC instead of four, returns a `c_dcgmDevicePowerLimits_v1` with integer-watt fields populated from the hostengine's discovery state, no field-cache dependency. - `constraints_w`, `current_w`, `default_w`, `restore_default` all route through this helper. - `get_uuid` and `_ensure_identity_map._read_dcgm_uuids` now use `GetGpuAttributes(gpu_id).identifiers.uuid`. - `apply_cap` imports `dcgmvalue` and swaps all seven `DCGM_INT32_BLANK` references onto it. Test infrastructure overhaul (`test_dcgm_actuator.py`): - `_make_dcgm_modules` adds a `dcgmvalue` MagicMock with all four blank sentinels; removes `DCGM_INT32_BLANK` from `dcgm_structs` mock. - `_make_gpu_attrs` extended to carry both `.identifiers.uuid` and `.powerLimits.{cur,default,enforced,min,max}PowerLimit`. - `_wire_handle` wires `GetGpuAttributes(gid)` per-gpu_id. - `_seed_constraints_and_uuid` consolidated onto the unified GetGpuAttributes path. - Five `modules["dcgm_agent"].dcgmEntityGetLatestValues.assert_not_called()` regression guards added across the suite - locks in "no more silent field-cache reads of static device info." E2e re-validation: with this fix in place, the same 8xA100 rig reports `PASS: NvmlActuator and DcgmActuator agree on all probes (tolerance +/- 2.0 W)`, exit code 0, with nvidia-smi confirming `apply_cap(250)` -> 250 W and `restore_default()` -> 400 W on every GPU. Doc: `docs/design-docs/power-agent-dual-actuator.md` v1.10 changelog row added; §6.1 / §6.3 / §7 / §12 / §13 patched to match (Protocol-level docstrings for `current_w`/`default_w`, `get_uuid` / `_power_limits` / `restore_default` code samples, `_apply_cap_inner` import + blank-constant sample, attribution table row 2, summary clause, reference-material entries for `pydcgm/dcgmvalue.py` and `GetGpuAttributes`). Local gates: 163/163 power-agent unit tests pass; pre-commit isort/black/flake8/codespell/ruff/EOF/whitespace clean on the three changed files. Broader `tests/` + `pytest-marker-report` blocked by the known Windows-only `fcntl` import in `tests/conftest.py:22`; CI runs those on Linux. Signed-off-by: Kai Ma <kaim@nvidia.com>
Three Dockerfile bugs combined to make the DCGM-mode image unbuildable on a fresh checkout. Fixing any one in isolation leaves the build broken, so they travel together: 1. DCGM_IMAGE default 'nvcr.io/nvidia/cloud-native/dcgm:4.2.3-2-ubuntu22.04' does not exist on NGC (verified 2026-05-21 via 'docker manifest inspect' → 404). Bump to 4.5.1-1-ubuntu22.04, the only resolvable 4.x tag. 2. DCGM 4.5+ relocated python bindings from /usr/local/dcgm/bindings/python3/ to /usr/share/datacenter-gpu-manager-4/bindings/python3/. The previous COPY would silently copy zero files under the new pin. Switch the source path to the 4.5+ location. 3. NGC's DCGM 4.5+ runtime image ships pydcgm with DcgmGroup.py:20 doing 'import logger' — but logger.py lives in DCGM's source tree under testing/python3/ and is NOT packaged. Without a shim every DcgmGroup construction raises ModuleNotFoundError. Add a 10-line stdlib-logging adapter at components/power_agent/logger.py and COPY it into /opt/dcgm/python/logger.py during the runtime stage. This unblocks 'docker build -f components/power_agent/Dockerfile' on a fresh clone (verified locally via 'docker buildx build --build-arg DCGM_IMAGE=...4.5.1-1-ubuntu22.04' against viking-prod-216 on 2026-05-21, image pushed to ttl.sh/dynamo-pa-kaim-dcgm45-v2:24h and used by the Path-B live test on aks-a100b-22138447-vmss000000). Refs: PR #9790 review, Power Agent live-test findings #1/#2/#6. Signed-off-by: Kai Ma <kaim@nvidia.com>
components/power_agent/actuator.py:807 uses dcgm_structs.c_dcgmDeviceConfig_v2,
a struct that only exists in DCGM 4.x. With 3.x bindings the agent's
DcgmActuator.init() succeeded, opened the hostengine connection, and
ran NVML init cleanly — then crashed mid-first-reconcile with
'AttributeError: module dcgm_structs has no attribute
c_dcgmDeviceConfig_v2' after some GPUs were already capped. The
SIGTERM-restore path won't run when the actuator never finished
registering with _active_actuator, so the GPUs are left at custom caps.
Add a 7-line hasattr check immediately after the dcgm_structs import
in init() that raises a RuntimeError with:
- the missing struct name (so the error is grep-able)
- the required DCGM major (>=4.0)
- the DCGM_IMAGE build-arg + canonical 4.5.1 tag (so operators
know exactly what to bump)
Guards against accidental Dockerfile regressions to a 3.x base image.
Pair with the Dockerfile fix in 4820ca7 (which bumps the default
to 4.5.1-1-ubuntu22.04) so the default build path is consistent with
the runtime contract.
Unit test covers both happy path (DCGM 4.x mock) and 3.x rejection
(MagicMock with c_dcgmDeviceConfig_v2 deleted), asserting init fails
BEFORE the first pydcgm.DcgmHandle call so misconfigured deployments
don't half-init and leave hostengine sockets dangling.
Refs: PR #9790 review, Power Agent live-test finding #5.
Signed-off-by: Kai Ma <kaim@nvidia.com>
Verified each of the 9 fresh CodeRabbit findings on PR #9790 against the actual code (per .cursor/rules/review-feedback.mdc) and accepted all nine. Bundled fix in one commit on top of the previous PR9790 HEAD; no history rewrite. Code (components/power_agent/): * actuator.py:539 — replace silent `except Exception: pass` around the stale-handle Shutdown with `logger.exception(...)` + a comment that reconnect proceeds intentionally on cleanup failure. Continue rather than re-raise: failing reconnect because cleanup of the already-dead handle failed would be strictly worse than logging and moving on. * actuator.py:622 — chain RuntimeError off the originating KeyError via `raise ... from err` so the UUID-map lookup failure stays in the traceback (PEP 3134 / Ruff B904). * actuator.py:777 — narrow `except Exception` in `apply_cap` to `except dcgm_structs.DCGMError`. The broad catch masked binding / programming defects (e.g. the v1.10 `dcgmvalue.DCGM_INT32_BLANK` AttributeError documented in `_apply_cap_inner`) as normal apply failures, which both silenced the bug AND incorrectly ticked `apply_failures_total` — a metric reserved for "the DCGM write itself failed". Lazy-import `dcgm_structs` AFTER the metrics-required check so the constructor path stays runnable on hosts without the DCGM bindings (verified by `test_apply_cap_requires_metrics`). * power_agent.py:170 — compute `invalid_count` separately instead of `len(set) vs len(list)`. The set comprehension deduplicates, so the previous comparison would log "1 non-string entry" when uuids was `["a","a","b"]` (zero invalid entries, just a duplicate). Helm (deploy/helm/charts/power-agent/): * templates/_helpers.tpl:117/129 — replace `ne $tag (trim $tag)` (and same for $digest) with `regexMatch "\\s" ...`. The trim check only caught leading/trailing whitespace; an internal-whitespace tag like `v1 .1.0` (copy/paste typo) slipped through and rendered an invalid OCI image reference, deferring failure from template time to kubelet pull time. CodeRabbit only flagged the tag side, but the same flaw was symmetric on the digest helper, so fix both for parity. * tests/validate_image_tag_test.yaml — update existing "leading/trailing whitespace" assertions to "contains whitespace" (the new helper message), and add four new cases: image.tag with internal whitespace, image.tag with embedded tab, image.digest with internal whitespace, and the existing " latest " ordering pin. Symmetric coverage across both fields. * README.md:49 — rewrite the "image tag must match appVersion" paragraph to anchor on `helm show chart … | grep appVersion` rather than hardcoding "chart v1.1.0 requires the v1.1.0 image". The chart is on v1.2.0 with appVersion 1.1.0; the old wording was already stale and would re-stale on every image bump. Docs (docs/design-docs/power-agent-helm-chart-plan.md): * §0 line 29 — promote `### Revision history` to `## Revision history` so the heading hierarchy goes h1 → h2 (was h1 → h3, MD001). * §5.2 — bump "13 files / 2 helm-unittests" → "14 files / 3 helm-unittests", add the v1.2.0 `validate_image_tag_test.yaml` to the file list, replace the brittle LOC totals with a one-line `wc -l` recipe so the doc stops drifting on every cycle, and tag the three unlabeled fences as ```text``` (MD040). * §8 — bump the "Author all 13 chart files" checklist item to 14 and update the helm-unittest expected count from `24 passed` (v1.1.0) to `46 passed` (v1.2.0), with a note to rerun rather than hardcode. Tests added: * test_apply_cap_propagates_non_dcgm_exception — regression guard for the narrowed apply_cap except: an AttributeError injected on `group.config.Set` MUST propagate, MUST NOT tick `apply_failures_total`, and MUST NOT touch the `_managed_gpu_indices` / `applied_limit_watts` bookkeeping (those run only on the success path). * (Existing `test_apply_cap_dcgm_generic_error_bumps_failure_metric_and_returns` already exercises the "DCGMError still caught" case; both assertions now bracket the narrowed exception contract.) Verification: * pytest components/power_agent/tests/ -q → 170 passed * pre-commit run --files <touched 7 files> → all hooks pass (pytest-marker-report skipped — Windows fcntl issue, runs in Linux CI). * Linter (ReadLints) → no errors on any of the 7 touched files. * The DCGM `import dcgm_structs` is local to `apply_cap` and scoped after the metrics-required check, so `test_apply_cap_requires_metrics` still raises RuntimeError (not ModuleNotFoundError) on hosts without the bindings. Refs: PR #9790 CodeRabbit re-review (2026-05-25). Signed-off-by: Kai Ma <kaim@nvidia.com>
Signed-off-by: Kai Ma <kaim@nvidia.com>
Three high/medium findings from the PR9790 Codex adversarial review, all confirmed against current code (PR9682/PR9683 fixed adjacent issues but did not touch these). power_agent.py - _handle_sigterm now prunes _previously_managed after each successful restore_default and persists the pruned set before shutdown. Pre-fix the SIGTERM handler restored each GPU to default but never updated managed_gpus.json, so on the next startup orphan recovery could clobber a cap applied by another workflow (different DGD, manual nvidia-smi -pl, vendor firmware) on a now-idle GPU still recorded as agent-owned. The actuator branch uses actuator.get_uuid(gpu_idx); the raw-NVML fallback branch uses _nvml_uuid(handle). Failed restores skip the prune (cap may still be live; next-startup orphan recovery is our only chance to reset it). UUID-lookup and persist failures log a WARNING but never crash the SIGTERM handler. - _resolve_cap_for_gpu rewritten so multi-pod GPUs require ALL pods to carry a parseable integer annotation, not just the non-None subset. Pre-fix the resolver filtered None BEFORE computing the agree-set, so a GPU shared by pod A (cap 480) and pod B (no annotation) silently resolved to 480 - pod A's annotation governed pod B's GPU usage, the exact cross-workload policy failure the multi-pod guard exists to contain. Any missing or invalid annotation in the multi-pod case now resolves to safe_default_watts and bumps the conflict counter. Single-pod paths unchanged. - _list_pods_on_node propagates Kubernetes API errors instead of returning []. Pre-fix \"except Exception: return []\" made a transient apiserver outage, RBAC regression, or network blip indistinguishable from a genuinely empty node, silently dropping enforcement. reconcile_once now catches and skips the cycle, leaving previously- applied caps live. New Prometheus counter dynamo_power_agent_k8s_list_failures_total lets operators alert on \">0 over 5m\". Policy choice: skip-cycle (preserve last cap) rather than fail-closed to safe_default; transient hiccups should not cap-down healthy workloads. Tests (+12 new, 1 flipped; 190 total passing): - test_shutdown.py::TestSigtermPrunesManagedGpusState (5 tests): successful prune persists; failed restore retains UUID; UUID-lookup failure is logged and non-fatal; raw-NVML fallback prunes via _nvml_uuid; persist failure does not hang the agent. - test_multi_pod_policy.py: flipped test_mixed_none_and_valid (was asserting buggy 480 inheritance; now asserts SAFE_DEFAULT + conflict); added test_multipod_invalid_annotation_on_one_pod_is_conflict and test_multipod_all_missing_is_conflict. - test_reconcile_wiring.py: TestListPodsOnNodePropagatesErrors (3 tests, namespaced + cluster-wide + happy path); TestReconcileOnceK8sListFailure (2 tests, skip-cycle on failure + per-GPU run on success). Signed-off-by: Kai Ma <kaim@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rebase follow-up: dev-mode examples, helm-unittest fixtures, and design docs still referenced nvcr.io/nvidia/dynamo/power-agent after #9682 standardised on nvcr.io/nvidia/ai-dynamo/power-agent. Signed-off-by: Kai Ma <kaim@nvidia.com>
464387c to
852ad7e
Compare
Rebased onto current
|
The docs link check fails on README links to docs.nvidia.com/dynamo/kubernetes-deployment/* (404 from GitHub runners as of 2026-06-08). These are inherited from main, not introduced by the DCGM actuator work, and match the existing .lycheeignore pattern for external docs that fail from CI. Signed-off-by: Kai Ma <kaim@nvidia.com>
…n DCGM restore Graceful SIGTERM restore silently dropped every cap in-cluster: the image runs 'python /app/power_agent.py' (module __main__) while actuator.py reaches the agent via 'import power_agent', so the two module copies each got their own _managed_gpu_indices. The actuator recorded caps into one copy; the SIGTERM handler restored from the other (always empty). Hoist the mutable managed-GPU state (_managed_gpu_indices, _previously_managed, state path) into a standalone managed_state module imported by canonical name from both sites, so all copies converge on one set. Orphan reload now mutates in place instead of rebinding. Revert the __main__ delegation hack. Package managed_state.py into the image COPY and the dev-pod ConfigMap; bump chart 1.2.0 -> 1.3.0 and update README/NOTES/dev-pod docs. Harden DcgmActuator.restore_default against post-reconnect GPU re-enumeration: snapshot the capped UUID at apply time and relocate to its current index before writing default TGP. On a proven index mismatch never fall back to the original (known-wrong) index; relocate, or skip without writing or pruning so cold-start orphan recovery retries. Best-effort restore at the original index only when identity is unreadable (mismatch unproven). Adds regression tests (managed-state sharing, fresh-module-copy aliasing, DCGM relocate/skip/best-effort). Addresses PR #9790 sttts review (module-identity split, DCGM re-enumeration) and follow-up review findings. Signed-off-by: Kai Ma <kaim@nvidia.com>
Signed-off-by: Kai Ma <kaim@nvidia.com> # Conflicts: # components/power_agent/power_agent.py
…ersion 1.2.0 The plan header and layout still described chart version 1.2.0 / appVersion 1.1.0 and a 46-test helm-unittest gate, but PR9790 bumped the chart to 1.3.0 / appVersion 1.2.0 (managed_state.py module) and the helm-unittest suite now reports 49 passing tests. Align the status header, the Chart.yaml layout comment, and the validation-gate count. Signed-off-by: Kai Ma <kaim@nvidia.com>
…ion 1.2.0 The helm-chart-plan design doc still described the chart as version/appVersion 1.1.0 with a 46-test helm-unittest snapshot and pinned every install/validation example to image.tag=v1.1.0, while Chart.yaml is now version 1.3.0 / appVersion 1.2.0 and the suite is 49 tests (verified via `helm unittest deploy/helm/charts/power-agent`). Update the current-state claims (risk-register row 6, section 7.4), the section 5.4 validation recipes, the section 1.4.2 / 5.3.1 install examples, and the test-count snapshot. Also make the validateImageTag helper error text track the chart appVersion (printf "v%s" .Chart.AppVersion) rather than a hardcoded v1.1.0 example, and note managed_state.py in the dev-pod ConfigMap comment. Signed-off-by: Kai Ma <kaim@nvidia.com>
…overage Signed-off-by: Kai Ma <kaim@nvidia.com>
Resolve conflicts in components/power_agent/power_agent.py: - _resolve_cap_for_gpu: keep strict multi-pod policy (PR9790 Codex finding #2) over base's lenient None-filtering docstring/intent; the merged code body enforces the strict policy. - _reconcile_gpu: keep actuator-routed PID read + apply_cap (the DCGM dual-actuator path that is the purpose of this PR) over base's inline pynvml; folded base's persistent-cap rationale into the docstring. release.yml and power-agent/README.md auto-merged. Validated: components/power_agent/tests 206 passed. Signed-off-by: Kai Ma <kaim@nvidia.com>
The file was committed before full CI pre-commit ran (workflows were gated by copy-pr-bot), so black/isort line-wrapping of the PID_TO_UID dict and two list comprehensions was never applied. Pure formatting, no logic change; test_multi_dgd_topology 5 passed. Signed-off-by: Kai Ma <kaim@nvidia.com>
|
Status update -- DCGM dual actuator. All three review threads are addressed and resolved:
CI green ( |
Part of the PR #9369 split plan.
This is a sibling foundation PR to the planner stack: PR 1b - Power Agent DCGM actuator.
Predecessor: #9682 - Power Agent NVML DaemonSet (base =
pr1a/power-agent; #9682 must land first)Successor: none (sibling to #9683 -> #9687)
Summary
Adds an opt-in DCGM actuator to the Power Agent from #9682. Operators running the GPU Operator with
dcgm.enabled=truecan setagent.actuator=dcgmso caps are written through the clusternvidia-dcgmhostengine. Default NVML behavior remains equivalent for operators that do not opt into DCGM, but the implementation is refactored through the new actuator protocol and shared managed-state module.Current branch shape: 31 files changed against
pr1a/power-agent(9,526 insertions / 352 deletions).Organized as 3 logical changesets, with review-fix commits layered on top:
docs/design-docs/power-agent-dual-actuator.mddocuments the DCGM/NVML decision arc, ACTION_CLEAR mitigation, UUID-keyed identity, LIMIT_DEF vs LIMIT_MAX, Set-then-Enforce-soft behavior, and changelog.components/power_agent/actuator.py:Actuatorprotocol,NvmlActuator, andDcgmActuatorwith UUID-keyed DCGM/NVML identity mapping,_apply_cap_inner, and_with_reconnectforDCGM_ST_CONNECTION_NOT_VALID.components/power_agent/power_agent.py: reconcile, SIGTERM, and orphan recovery dispatch through the active actuator; adds--actuator,--dcgm-host,--dcgm-port, and--dcgm-enforce; sharpensapply_failures_totalsemantics and addsdcgm_enforce_failures_total.components/power_agent/managed_state.py: single source of truth for mutable managed-GPU sets so__main__and importedpower_agentmodule copies do not split state.components/power_agent/Dockerfile: vendors pydcgm bindings andlibdcgm.sofromnvcr.io/nvidia/cloud-native/dcgm:4.5.1-1-ubuntu22.04.components/power_agent/tests/: 9 new test files covering protocol satisfaction, CLI dispatch, DCGM actuator behavior, orphan recovery, reconcile wiring, SIGTERM dispatch, shared-state identity, and DCGM UUID-identity restore.e2e_actuator_parity.pyis under-test scaffolding for real-GPU parity and is excluded from CI by default.Chart.yaml: chart1.0.0 -> 1.3.0, appVersion1.0.0 -> 1.2.0.values.yaml: addsagent.actuatorandagent.dcgm.{host,port,enforce}; defaults keepactuator=nvmlandenforce=false.templates/daemonset.yamlandtemplates/dev-pod.yaml: render actuator/DCGM flags and dev-mode script mounts.templates/_helpers.tpl: validates actuator, enforce, and image pinning at template time.tests/validate_{actuator,enforce,image_tag}_test.yaml: 49 helm-unittest cases (10 actuator, 14 enforce, 25 image_tag).Reviewer start points:
docs/design-docs/power-agent-dual-actuator.mdsections 1, 4, 6, 10, and 13.components/power_agent/actuator.py, read with design-doc section 6.components/power_agent/power_agent.pydiff versus feat(power-agent): per-node power-cap enforcement DaemonSet #9682, especially reconcile dispatch, metric semantics, and SIGTERM routing.deploy/helm/charts/power-agent/values, helpers, templates, and tests.Validation
33198612b9; snapshot/operator/rust jobs and team-request are intentionally skipped by filters.pytest components/power_agent/tests/: full suite passes locally and required CI is green. The suite has grown past the original 133 tests with shared-state and DCGM UUID-identity regression coverage.helm unittest deploy/helm/charts/power-agent: 49 passed (10 actuator, 14 enforce, 25 image_tag).helm templatesmoke: NVML, DCGM withenforce=true, and invalidenforce=treurender/reject as expected.pytest-marker-reportfails only on the Windows dev workstation due to the known pre-existing POSIX-onlyfcntlimport issue; Linux CI is unaffected.Merge Strategy
Rebase-and-merge, no squash, after condensing review-fix commits back into the intended docs / code / helm logical split. This preserves a clean 3-commit history in
mainwhile retaining the reviewed changes.Stack Order
Merge #9682 first. #9790 is stacked on
pr1a/power-agentand becomes actionable after #9682 lands. #9683 is a sibling on the same base and can merge after #9682 as well.