Skip to content

feat(power-agent): add DCGM dual actuator (opt-in; NVML remains default)#9790

Open
kaim-eng wants to merge 22 commits into
pr1a/power-agentfrom
pr1b/power-agent-dcgm-actuator
Open

feat(power-agent): add DCGM dual actuator (opt-in; NVML remains default)#9790
kaim-eng wants to merge 22 commits into
pr1a/power-agentfrom
pr1b/power-agent-dcgm-actuator

Conversation

@kaim-eng

@kaim-eng kaim-eng commented May 20, 2026

Copy link
Copy Markdown

Part of the PR #9369 split plan.
This is a sibling foundation PR to the planner stack: PR 1b - Power Agent DCGM actuator.

Predecessor: #9682 - Power Agent NVML DaemonSet (base = pr1a/power-agent; #9682 must land first)
Successor: none (sibling to #9683 -> #9687)

Summary

Adds an opt-in DCGM actuator to the Power Agent from #9682. Operators running the GPU Operator with dcgm.enabled=true can set agent.actuator=dcgm so caps are written through the cluster nvidia-dcgm hostengine. Default NVML behavior remains equivalent for operators that do not opt into DCGM, but the implementation is refactored through the new actuator protocol and shared managed-state module.

Current branch shape: 31 files changed against pr1a/power-agent (9,526 insertions / 352 deletions).

Organized as 3 logical changesets, with review-fix commits layered on top:

  1. Design doc: docs/design-docs/power-agent-dual-actuator.md documents the DCGM/NVML decision arc, ACTION_CLEAR mitigation, UUID-keyed identity, LIMIT_DEF vs LIMIT_MAX, Set-then-Enforce-soft behavior, and changelog.
  2. Power Agent implementation:
    • components/power_agent/actuator.py: Actuator protocol, NvmlActuator, and DcgmActuator with UUID-keyed DCGM/NVML identity mapping, _apply_cap_inner, and _with_reconnect for DCGM_ST_CONNECTION_NOT_VALID.
    • components/power_agent/power_agent.py: reconcile, SIGTERM, and orphan recovery dispatch through the active actuator; adds --actuator, --dcgm-host, --dcgm-port, and --dcgm-enforce; sharpens apply_failures_total semantics and adds dcgm_enforce_failures_total.
    • components/power_agent/managed_state.py: single source of truth for mutable managed-GPU sets so __main__ and imported power_agent module copies do not split state.
    • components/power_agent/Dockerfile: vendors pydcgm bindings and libdcgm.so from nvcr.io/nvidia/cloud-native/dcgm:4.5.1-1-ubuntu22.04.
    • components/power_agent/tests/: 9 new test files covering protocol satisfaction, CLI dispatch, DCGM actuator behavior, orphan recovery, reconcile wiring, SIGTERM dispatch, shared-state identity, and DCGM UUID-identity restore. e2e_actuator_parity.py is under-test scaffolding for real-GPU parity and is excluded from CI by default.
  3. Helm chart:
    • Chart.yaml: chart 1.0.0 -> 1.3.0, appVersion 1.0.0 -> 1.2.0.
    • values.yaml: adds agent.actuator and agent.dcgm.{host,port,enforce}; defaults keep actuator=nvml and enforce=false.
    • templates/daemonset.yaml and templates/dev-pod.yaml: render actuator/DCGM flags and dev-mode script mounts.
    • templates/_helpers.tpl: validates actuator, enforce, and image pinning at template time.
    • tests/validate_{actuator,enforce,image_tag}_test.yaml: 49 helm-unittest cases (10 actuator, 14 enforce, 25 image_tag).

Reviewer start points:

  • docs/design-docs/power-agent-dual-actuator.md sections 1, 4, 6, 10, and 13.
  • components/power_agent/actuator.py, read with design-doc section 6.
  • components/power_agent/power_agent.py diff versus feat(power-agent): per-node power-cap enforcement DaemonSet #9682, especially reconcile dispatch, metric semantics, and SIGTERM routing.
  • deploy/helm/charts/power-agent/ values, helpers, templates, and tests.

Validation

  • Required/current GitHub checks pass at 33198612b9; snapshot/operator/rust jobs and team-request are intentionally skipped by filters.
  • pytest components/power_agent/tests/: full suite passes locally and required CI is green. The suite has grown past the original 133 tests with shared-state and DCGM UUID-identity regression coverage.
  • helm unittest deploy/helm/charts/power-agent: 49 passed (10 actuator, 14 enforce, 25 image_tag).
  • helm template smoke: NVML, DCGM with enforce=true, and invalid enforce=treu render/reject as expected.
  • Pre-commit on the 31 touched files: isort, black, flake8, codespell, ruff, check-yaml, check-shebang-executable, and mixed-line-ending pass. pytest-marker-report fails only on the Windows dev workstation due to the known pre-existing POSIX-only fcntl import issue; Linux CI is unaffected.
  • DCO: every commit is signed off.

Merge Strategy

Rebase-and-merge, no squash, after condensing review-fix commits back into the intended docs / code / helm logical split. This preserves a clean 3-commit history in main while retaining the reviewed changes.

Stack Order

Merge #9682 first. #9790 is stacked on pr1a/power-agent and becomes actionable after #9682 lands. #9683 is a sibling on the same base and can merge after #9682 as well.

@kaim-eng kaim-eng requested review from a team as code owners May 20, 2026 16:42
@copy-pr-bot

copy-pr-bot Bot commented May 20, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added documentation Improvements or additions to documentation deployment::k8s Relates to dynamo deployment in kubernetes actions labels May 20, 2026
@kaim-eng kaim-eng changed the title Pr1b/power agent dcgm actuator feat(power-agent): add DCGM dual actuator (opt-in; NVML remains default) May 20, 2026
@kaim-eng kaim-eng changed the base branch from main to pr1a/power-agent May 20, 2026 16:45
@github-actions github-actions Bot added the feat label May 20, 2026
@coderabbitai

coderabbitai Bot commented May 20, 2026

Copy link
Copy Markdown
Contributor

Walkthrough

This pull request introduces a complete Power Agent DaemonSet for enforcing per-GPU power caps on Dynamo worker nodes, featuring a pluggable dual-actuator design (NVML default, DCGM opt-in), comprehensive test coverage, and production-ready Helm chart deployment with template-time validation.

Changes

Power Agent Dual-Actuator System

Layer / File(s) Summary
Actuator Protocol and Implementations
components/power_agent/actuator.py, components/power_agent/logger.py, components/power_agent/tests/test_actuator_protocol.py, components/power_agent/tests/test_actuator_selection.py, components/power_agent/tests/test_dcgm_actuator.py, components/power_agent/tests/e2e_actuator_parity.py
Defines Actuator Protocol with lifecycle and cap control APIs. Implements NvmlActuator with lazy NVML imports and DcgmActuator with TCP hostengine connection, stale-handle recovery, and cross-library UUID mapping. Includes protocol conformance tests, factory selection tests, comprehensive DcgmActuator unit tests, and e2e parity harness comparing both actuators against nvidia-smi ground truth.
Power Agent Core and Reconciliation
components/power_agent/power_agent.py, components/power_agent/tests/test_reconcile_wiring.py, components/power_agent/tests/test_shutdown.py, components/power_agent/tests/test_orphan_recovery.py, components/power_agent/tests/test_managed_state_parser.py, components/power_agent/tests/test_multi_pod_policy.py
Implements PowerAgent with periodic GPU reconciliation (pod UID mapping, multi-pod cap policy, actuator-driven application), optional Prometheus metrics, SIGTERM-driven graceful shutdown routing through active actuator, and cold-start orphan recovery using UUID-gated managed GPU state. Tests validate actuator routing contract, shutdown behavior with failure isolation, orphan recovery logic, managed state persistence robustness, and policy fallback metrics semantics.
Container Image and Component Documentation
components/power_agent/Dockerfile, components/power_agent/README.md
Multi-stage Dockerfile supporting both NVML and DCGM with vendored DCGM Python bindings and libraries. Component README documents dual-actuator modes, Helm chart compatibility requirements, and troubleshooting including Prometheus metrics semantics.
Helm Chart Values and Configuration
deploy/helm/charts/power-agent/Chart.yaml, deploy/helm/charts/power-agent/values.yaml, deploy/helm/charts/power-agent/README.md
Chart metadata (v1.2.0, appVersion v1.1.0) and comprehensive values.yaml with image pinning (tag/digest), actuator selection, DCGM connection options, and dev-mode ConfigMap requirements. Chart README provides production and dev install guidance, image pinning rules, troubleshooting, and cross-writer flapping warnings.
Helm Template Validation and Deployment
deploy/helm/charts/power-agent/templates/_helpers.tpl, deploy/helm/charts/power-agent/templates/daemonset.yaml, deploy/helm/charts/power-agent/templates/dev-pod.yaml, deploy/helm/charts/power-agent/templates/NOTES.txt
Strict image validation helpers (tag/digest mutual exclusivity, :latest rejection, whitespace rejection, SHA-256 format enforcement) and actuator/enforce validators. imageRef helper renders canonical OCI references. DaemonSet and dev-Pod templates conditionally inject actuator and DCGM flags. Post-install notes guide mode-specific verification and iteration.
Helm Unittest Test Suites
deploy/helm/charts/power-agent/tests/validate_image_tag_test.yaml, deploy/helm/charts/power-agent/tests/validate_actuator_test.yaml, deploy/helm/charts/power-agent/tests/validate_enforce_test.yaml
46 unit test cases validating template-time behavior: image tag/digest mutual exclusivity, :latest rejection, whitespace/format validation; actuator selection (nvml default, dcgm opt-in, invalid rejection); dcgm-enforce boolean allowlist (yes/no/on/off/true/false with case normalization, rejection of invalid values, cross-validator behavior for unused modes).
Design Documentation
docs/design-docs/power-agent-helm-chart-plan.md
Comprehensive plan document updated from draft to implementation record, including v1.1.0 dual-actuator expansion and v1.2.0 image-pinning hardening. Detailed sections on chart schema, validation strategy, template-time validators, installation flows, commit gates, risk analysis, and pre-flight/post-push checklists.

🎯 4 (Complex) | ⏱️ ~60 minutes

kaim-eng added a commit that referenced this pull request May 20, 2026
Aligns docs/design-docs/power-agent-helm-chart-plan.md with the v1.1.0
chart that PR #9790 ships. The plan was authored for the NVML-only
v1.0.0 chart in PR #9682 and went stale once PR #9790 layered
agent.actuator + agent.dcgm.* + validateActuator + validateEnforce +
the helm-unittest suite on top.

Two changelog rows capture the refresh:
  v1.3 - first pass: 8 reviewer findings closed (3 blocking,
                     4 major, 2 medium, 1 low) on Status header,
                     values surface, dev ConfigMap recipe, image-tag
                     pinning, helm-unittest gating, internal-dev-doc
                     references, daemonset name, file/LOC accounting.
  v1.4 - second pass: 5 follow-up findings closed (2 major,
                     2 medium, 1 low) on Status self-contradiction,
                     stale section 4.2 dev-block comment, overstated
                     helm-unittest coverage prose, unrunnable
                     section 5.4 positive helm template overlays
                     (missing --set image.tag), and the lingering
                     filename reference in the v1.3 changelog.

No design reversal: every section 6 decision and the chart shape are
unchanged. Only the values surface (extended), helper set (extended
with two template-time validators), and validation-gate list
(helm-unittest now required) grew. Every claim in the refreshed
doc was verified against on-disk state: power_agent.py:706-804
for the 8-flag CLI surface, values.yaml for the dev-block recipe,
_helpers.tpl for the five-helper set, and the two
tests/validate_*_test.yaml files for the 24 enumerated unittest
cases.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 20, 2026
Aligns docs/design-docs/power-agent-helm-chart-plan.md with the
v1.2.0 chart that the preceding commit ships. The plan was authored
for the v1.0.0 NVML-only chart in PR #9682, refreshed to v1.1.0 in
the PR #9790 dual-actuator work, and went stale again once v1.2.0
landed image.digest + the canonical-OCI imageRef helper on top.

Four reviewer findings closed (v1.5 changelog row covers all four):

(1) Truncated SHA-256 digests slipped through the v1.1.0 helper -
already fixed in the preceding chart commit; this commit propagates
the rule into the §4.4 helper snippet and the §5.4 validation gates.

(2) Whitespace-padded tags rendered raw image references - same
treatment: §4.4 documents the trim-then-reject contract.

(3) §4.3 / §4.4 / §5.4 stale on image.digest:
  - §4.3 CodeRabbit-comment row rewritten to describe both
    helpers (validateImageTag + imageRef), both fields
    (tag + digest), the canonical repo@digest form, and the
    PR9682 follow-up that added the separate field.
  - §4.4 helper snippet replaced with the actual v1.2.0
    validator (full rule set + per-rule rationale comments)
    plus the new imageRef helper.
  - §5.4 expected helm-unittest output bumped 24 -> 46 passed
    with a one-line breakdown of the +22 new cases.

(4) §4.1 / §4.2 stale on file count + helper list + values surface:
  - §4.1 said 13 files (omitted .helmignore and the new
    validate_image_tag_test.yaml). Corrected to 14 files (7
    templates + 3 helm-unittests + 4 root files). The _helpers.tpl
    helper-list line gained imageRef and chart entries. The
    daemonset.yaml annotation now mentions it routes through
    imageRef. LOC estimate bumped ~1,430 -> ~1,700 with a
    breakdown of what v1.2.0 added on top of v1.0.0 / v1.1.0.
  - §4.2 values snippet had only image.tag with no image.digest.
    Added the field with the same per-rule comment block that
    ships in values.yaml (OCI form, 64-hex requirement, mutex
    with image.tag, PR9682 rationale).

Status header bumped to chart v1.2.0 with a one-line v1.2.0
rationale (additive opt-in field, no breaking change to existing
tag-pinned installs). Revision-history table gains the v1.5 entry
summarising all four findings; older v1.1 - v1.4 entries
untouched.

No design reversal: every §6 decision and the chart shape are
unchanged. Only the values surface (one additive field), helper
set (two helpers: imageRef new, validateImageTag rewritten), and
helm-unittest count (24 -> 46) grew. Every claim in the refreshed
doc was verified against on-disk state: Chart.yaml for the
version bump, values.yaml for the image.digest field comment,
_helpers.tpl for both helpers, tests/validate_image_tag_test.yaml
for the 22 enumerated cases (verified via `helm unittest`).

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 20, 2026
…ttributes (v1.10 e2e fix)

First in-cluster DCGM parity run on an 8xA100 SXM node (PR #9790
e2e) surfaced three production defects in DcgmActuator, all rooted
in pydcgm API misuse that the mocked unit suite couldn't catch:

1. `get_uuid` and `_ensure_identity_map._read_dcgm_uuids` read
   `DCGM_FI_DEV_UUID` via `dcgmEntityGetLatestValues`. That API
   returns the field cache; on a fresh nv-hostengine 4.5.3 with
   no companion watcher subscribed via `dcgmWatchFields`, the
   cache returns `DCGM_STR_BLANK = "<<<NULL>>>"`. The identity
   map then raised "8 GPU UUID(s) visible to DCGM are not visible
   to NVML" on every cluster startup.

2. `constraints_w` / `current_w` / `default_w` read the four
   `DCGM_FI_DEV_POWER_MGMT_LIMIT{,_MIN,_MAX,_DEF}` fields the
   same way. Every read returned `DCGM_FP64_BLANK = 2^47`, so
   the clamp in `apply_cap` escalated a requested 250 W up to
   the blank max and tried to write 140737488355328 W.

3. The workload-power-profile blanking loop in `_apply_cap_inner`
   reached for `dcgm_structs.DCGM_INT32_BLANK`. The constant
   actually lives in `dcgmvalue` (`/shared/pydcgm/dcgmvalue.py:17`
   in the DCGM 4.5.3 apt bindings); pre-v1.10 the first cap write
   raised `AttributeError: module 'dcgm_structs' has no attribute
   'DCGM_INT32_BLANK'`.

Fix:

- Add a single `_power_limits(gpu_idx)` helper that calls
  `DcgmSystem.discovery.GetGpuAttributes(gpu_id).powerLimits` -
  the synchronous device-info API that wraps
  `dcgmGetDeviceAttributes`. One RPC instead of four, returns a
  `c_dcgmDevicePowerLimits_v1` with integer-watt fields populated
  from the hostengine's discovery state, no field-cache dependency.
- `constraints_w`, `current_w`, `default_w`, `restore_default` all
  route through this helper.
- `get_uuid` and `_ensure_identity_map._read_dcgm_uuids` now use
  `GetGpuAttributes(gpu_id).identifiers.uuid`.
- `apply_cap` imports `dcgmvalue` and swaps all seven
  `DCGM_INT32_BLANK` references onto it.

Test infrastructure overhaul (`test_dcgm_actuator.py`):

- `_make_dcgm_modules` adds a `dcgmvalue` MagicMock with all four
  blank sentinels; removes `DCGM_INT32_BLANK` from `dcgm_structs`
  mock.
- `_make_gpu_attrs` extended to carry both `.identifiers.uuid` and
  `.powerLimits.{cur,default,enforced,min,max}PowerLimit`.
- `_wire_handle` wires `GetGpuAttributes(gid)` per-gpu_id.
- `_seed_constraints_and_uuid` consolidated onto the unified
  GetGpuAttributes path.
- Five `modules["dcgm_agent"].dcgmEntityGetLatestValues.assert_not_called()`
  regression guards added across the suite - locks in "no more
  silent field-cache reads of static device info."

E2e re-validation: with this fix in place, the same 8xA100 rig
reports `PASS: NvmlActuator and DcgmActuator agree on all probes
(tolerance +/- 2.0 W)`, exit code 0, with nvidia-smi confirming
`apply_cap(250)` -> 250 W and `restore_default()` -> 400 W on
every GPU.

Doc: `docs/design-docs/power-agent-dual-actuator.md` v1.10
changelog row added; §6.1 / §6.3 / §7 / §12 / §13 patched to
match (Protocol-level docstrings for `current_w`/`default_w`,
`get_uuid` / `_power_limits` / `restore_default` code samples,
`_apply_cap_inner` import + blank-constant sample, attribution
table row 2, summary clause, reference-material entries for
`pydcgm/dcgmvalue.py` and `GetGpuAttributes`).

Local gates: 163/163 power-agent unit tests pass; pre-commit
isort/black/flake8/codespell/ruff/EOF/whitespace clean on the
three changed files. Broader `tests/` + `pytest-marker-report`
blocked by the known Windows-only `fcntl` import in
`tests/conftest.py:22`; CI runs those on Linux.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 22, 2026
Three Dockerfile bugs combined to make the DCGM-mode image unbuildable

on a fresh checkout. Fixing any one in isolation leaves the build broken,

so they travel together:

1. DCGM_IMAGE default 'nvcr.io/nvidia/cloud-native/dcgm:4.2.3-2-ubuntu22.04'

   does not exist on NGC (verified 2026-05-21 via 'docker manifest inspect'

   → 404). Bump to 4.5.1-1-ubuntu22.04, the only resolvable 4.x tag.

2. DCGM 4.5+ relocated python bindings from /usr/local/dcgm/bindings/python3/

   to /usr/share/datacenter-gpu-manager-4/bindings/python3/. The previous

   COPY would silently copy zero files under the new pin. Switch the source

   path to the 4.5+ location.

3. NGC's DCGM 4.5+ runtime image ships pydcgm with DcgmGroup.py:20 doing

   'import logger' — but logger.py lives in DCGM's source tree under

   testing/python3/ and is NOT packaged. Without a shim every DcgmGroup

   construction raises ModuleNotFoundError. Add a 10-line stdlib-logging

   adapter at components/power_agent/logger.py and COPY it into

   /opt/dcgm/python/logger.py during the runtime stage.

This unblocks 'docker build -f components/power_agent/Dockerfile' on a

fresh clone (verified locally via 'docker buildx build --build-arg

DCGM_IMAGE=...4.5.1-1-ubuntu22.04' against viking-prod-216 on 2026-05-21,

image pushed to ttl.sh/dynamo-pa-kaim-dcgm45-v2:24h and used by the

Path-B live test on aks-a100b-22138447-vmss000000).

Refs: PR #9790 review, Power Agent live-test findings #1/#2/#6.
Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 22, 2026
components/power_agent/actuator.py:807 uses dcgm_structs.c_dcgmDeviceConfig_v2,

a struct that only exists in DCGM 4.x. With 3.x bindings the agent's

DcgmActuator.init() succeeded, opened the hostengine connection, and

ran NVML init cleanly — then crashed mid-first-reconcile with

'AttributeError: module dcgm_structs has no attribute

c_dcgmDeviceConfig_v2' after some GPUs were already capped. The

SIGTERM-restore path won't run when the actuator never finished

registering with _active_actuator, so the GPUs are left at custom caps.

Add a 7-line hasattr check immediately after the dcgm_structs import

in init() that raises a RuntimeError with:

  - the missing struct name (so the error is grep-able)

  - the required DCGM major (>=4.0)

  - the DCGM_IMAGE build-arg + canonical 4.5.1 tag (so operators

    know exactly what to bump)

Guards against accidental Dockerfile regressions to a 3.x base image.

Pair with the Dockerfile fix in 4820ca7 (which bumps the default

to 4.5.1-1-ubuntu22.04) so the default build path is consistent with

the runtime contract.

Unit test covers both happy path (DCGM 4.x mock) and 3.x rejection

(MagicMock with c_dcgmDeviceConfig_v2 deleted), asserting init fails

BEFORE the first pydcgm.DcgmHandle call so misconfigured deployments

don't half-init and leave hostengine sockets dangling.

Refs: PR #9790 review, Power Agent live-test finding #5.
Signed-off-by: Kai Ma <kaim@nvidia.com>
@kaim-eng kaim-eng force-pushed the pr1a/power-agent branch from 0eda1d7 to 1e1ef6f Compare May 25, 2026 13:43
kaim-eng added a commit that referenced this pull request May 25, 2026
… log, pod-UID dedup, argparse)

Folds the four CodeRabbit findings on the foundation PR back into this branch so each fix lives with the code it changes, rather than leaking into the downstream DCGM-actuator PR (#9790):

* _load_previously_managed_gpus: catch OSError (not just FileNotFoundError) and validate that the JSON root is a dict and managed_uuids is a list. Malformed state files now log a warning and return an empty set instead of crashing the agent at startup.

* _handle_sigterm: replace 'except Exception: pass' on pynvml.nvmlShutdown() with logger.exception so shutdown-time NVML faults appear in pod logs. We still fall through to _shutdown.set() so SIGTERM never hangs the container.

* _reconcile_gpu: dedup the (pod_uid, annotation) list by UID before applying multi-pod policy. A single pod with N PIDs on one GPU was being counted as N pods, falsely tripping the multi-pod-conflict branch and the multi_pod_gpu_total metric.

* main(): move 'import argparse' to module scope per the project's import-placement convention.

Regression coverage: existing components/power_agent/tests/ suite (43 tests) still passes locally; behavior-specific tests for these four fixes already live on PR #9790 and remain there.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 25, 2026
… log, pod-UID dedup, argparse)

Folds the four CodeRabbit findings on the foundation PR back into this branch so each fix lives with the code it changes, rather than leaking into the downstream DCGM-actuator PR (#9790):

* _load_previously_managed_gpus: catch OSError (not just FileNotFoundError) and validate that the JSON root is a dict and managed_uuids is a list. Malformed state files now log a warning and return an empty set instead of crashing the agent at startup.

* _handle_sigterm: replace 'except Exception: pass' on pynvml.nvmlShutdown() with logger.exception so shutdown-time NVML faults appear in pod logs. We still fall through to _shutdown.set() so SIGTERM never hangs the container.

* _reconcile_gpu: dedup the (pod_uid, annotation) list by UID before applying multi-pod policy. A single pod with N PIDs on one GPU was being counted as N pods, falsely tripping the multi-pod-conflict branch and the multi_pod_gpu_total metric.

* main(): move 'import argparse' to module scope per the project's import-placement convention.

Regression coverage: existing components/power_agent/tests/ suite (43 tests) still passes locally; behavior-specific tests for these four fixes already live on PR #9790 and remain there.

Signed-off-by: Kai Ma <kaim@nvidia.com>
@kaim-eng kaim-eng force-pushed the pr1a/power-agent branch from 1e1ef6f to 29f0831 Compare May 25, 2026 13:46
kaim-eng added a commit that referenced this pull request May 25, 2026
Aligns docs/design-docs/power-agent-helm-chart-plan.md with the v1.1.0
chart that PR #9790 ships. The plan was authored for the NVML-only
v1.0.0 chart in PR #9682 and went stale once PR #9790 layered
agent.actuator + agent.dcgm.* + validateActuator + validateEnforce +
the helm-unittest suite on top.

Two changelog rows capture the refresh:
  v1.3 - first pass: 8 reviewer findings closed (3 blocking,
                     4 major, 2 medium, 1 low) on Status header,
                     values surface, dev ConfigMap recipe, image-tag
                     pinning, helm-unittest gating, internal-dev-doc
                     references, daemonset name, file/LOC accounting.
  v1.4 - second pass: 5 follow-up findings closed (2 major,
                     2 medium, 1 low) on Status self-contradiction,
                     stale section 4.2 dev-block comment, overstated
                     helm-unittest coverage prose, unrunnable
                     section 5.4 positive helm template overlays
                     (missing --set image.tag), and the lingering
                     filename reference in the v1.3 changelog.

No design reversal: every section 6 decision and the chart shape are
unchanged. Only the values surface (extended), helper set (extended
with two template-time validators), and validation-gate list
(helm-unittest now required) grew. Every claim in the refreshed
doc was verified against on-disk state: power_agent.py:706-804
for the 8-flag CLI surface, values.yaml for the dev-block recipe,
_helpers.tpl for the five-helper set, and the two
tests/validate_*_test.yaml files for the 24 enumerated unittest
cases.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 25, 2026
Aligns docs/design-docs/power-agent-helm-chart-plan.md with the
v1.2.0 chart that the preceding commit ships. The plan was authored
for the v1.0.0 NVML-only chart in PR #9682, refreshed to v1.1.0 in
the PR #9790 dual-actuator work, and went stale again once v1.2.0
landed image.digest + the canonical-OCI imageRef helper on top.

Four reviewer findings closed (v1.5 changelog row covers all four):

(1) Truncated SHA-256 digests slipped through the v1.1.0 helper -
already fixed in the preceding chart commit; this commit propagates
the rule into the §4.4 helper snippet and the §5.4 validation gates.

(2) Whitespace-padded tags rendered raw image references - same
treatment: §4.4 documents the trim-then-reject contract.

(3) §4.3 / §4.4 / §5.4 stale on image.digest:
  - §4.3 CodeRabbit-comment row rewritten to describe both
    helpers (validateImageTag + imageRef), both fields
    (tag + digest), the canonical repo@digest form, and the
    PR9682 follow-up that added the separate field.
  - §4.4 helper snippet replaced with the actual v1.2.0
    validator (full rule set + per-rule rationale comments)
    plus the new imageRef helper.
  - §5.4 expected helm-unittest output bumped 24 -> 46 passed
    with a one-line breakdown of the +22 new cases.

(4) §4.1 / §4.2 stale on file count + helper list + values surface:
  - §4.1 said 13 files (omitted .helmignore and the new
    validate_image_tag_test.yaml). Corrected to 14 files (7
    templates + 3 helm-unittests + 4 root files). The _helpers.tpl
    helper-list line gained imageRef and chart entries. The
    daemonset.yaml annotation now mentions it routes through
    imageRef. LOC estimate bumped ~1,430 -> ~1,700 with a
    breakdown of what v1.2.0 added on top of v1.0.0 / v1.1.0.
  - §4.2 values snippet had only image.tag with no image.digest.
    Added the field with the same per-rule comment block that
    ships in values.yaml (OCI form, 64-hex requirement, mutex
    with image.tag, PR9682 rationale).

Status header bumped to chart v1.2.0 with a one-line v1.2.0
rationale (additive opt-in field, no breaking change to existing
tag-pinned installs). Revision-history table gains the v1.5 entry
summarising all four findings; older v1.1 - v1.4 entries
untouched.

No design reversal: every §6 decision and the chart shape are
unchanged. Only the values surface (one additive field), helper
set (two helpers: imageRef new, validateImageTag rewritten), and
helm-unittest count (24 -> 46) grew. Every claim in the refreshed
doc was verified against on-disk state: Chart.yaml for the
version bump, values.yaml for the image.digest field comment,
_helpers.tpl for both helpers, tests/validate_image_tag_test.yaml
for the 22 enumerated cases (verified via `helm unittest`).

Signed-off-by: Kai Ma <kaim@nvidia.com>
@kaim-eng kaim-eng force-pushed the pr1b/power-agent-dcgm-actuator branch from 99138c4 to abcc922 Compare May 25, 2026 14:27
kaim-eng added a commit that referenced this pull request May 25, 2026
…ttributes (v1.10 e2e fix)

First in-cluster DCGM parity run on an 8xA100 SXM node (PR #9790
e2e) surfaced three production defects in DcgmActuator, all rooted
in pydcgm API misuse that the mocked unit suite couldn't catch:

1. `get_uuid` and `_ensure_identity_map._read_dcgm_uuids` read
   `DCGM_FI_DEV_UUID` via `dcgmEntityGetLatestValues`. That API
   returns the field cache; on a fresh nv-hostengine 4.5.3 with
   no companion watcher subscribed via `dcgmWatchFields`, the
   cache returns `DCGM_STR_BLANK = "<<<NULL>>>"`. The identity
   map then raised "8 GPU UUID(s) visible to DCGM are not visible
   to NVML" on every cluster startup.

2. `constraints_w` / `current_w` / `default_w` read the four
   `DCGM_FI_DEV_POWER_MGMT_LIMIT{,_MIN,_MAX,_DEF}` fields the
   same way. Every read returned `DCGM_FP64_BLANK = 2^47`, so
   the clamp in `apply_cap` escalated a requested 250 W up to
   the blank max and tried to write 140737488355328 W.

3. The workload-power-profile blanking loop in `_apply_cap_inner`
   reached for `dcgm_structs.DCGM_INT32_BLANK`. The constant
   actually lives in `dcgmvalue` (`/shared/pydcgm/dcgmvalue.py:17`
   in the DCGM 4.5.3 apt bindings); pre-v1.10 the first cap write
   raised `AttributeError: module 'dcgm_structs' has no attribute
   'DCGM_INT32_BLANK'`.

Fix:

- Add a single `_power_limits(gpu_idx)` helper that calls
  `DcgmSystem.discovery.GetGpuAttributes(gpu_id).powerLimits` -
  the synchronous device-info API that wraps
  `dcgmGetDeviceAttributes`. One RPC instead of four, returns a
  `c_dcgmDevicePowerLimits_v1` with integer-watt fields populated
  from the hostengine's discovery state, no field-cache dependency.
- `constraints_w`, `current_w`, `default_w`, `restore_default` all
  route through this helper.
- `get_uuid` and `_ensure_identity_map._read_dcgm_uuids` now use
  `GetGpuAttributes(gpu_id).identifiers.uuid`.
- `apply_cap` imports `dcgmvalue` and swaps all seven
  `DCGM_INT32_BLANK` references onto it.

Test infrastructure overhaul (`test_dcgm_actuator.py`):

- `_make_dcgm_modules` adds a `dcgmvalue` MagicMock with all four
  blank sentinels; removes `DCGM_INT32_BLANK` from `dcgm_structs`
  mock.
- `_make_gpu_attrs` extended to carry both `.identifiers.uuid` and
  `.powerLimits.{cur,default,enforced,min,max}PowerLimit`.
- `_wire_handle` wires `GetGpuAttributes(gid)` per-gpu_id.
- `_seed_constraints_and_uuid` consolidated onto the unified
  GetGpuAttributes path.
- Five `modules["dcgm_agent"].dcgmEntityGetLatestValues.assert_not_called()`
  regression guards added across the suite - locks in "no more
  silent field-cache reads of static device info."

E2e re-validation: with this fix in place, the same 8xA100 rig
reports `PASS: NvmlActuator and DcgmActuator agree on all probes
(tolerance +/- 2.0 W)`, exit code 0, with nvidia-smi confirming
`apply_cap(250)` -> 250 W and `restore_default()` -> 400 W on
every GPU.

Doc: `docs/design-docs/power-agent-dual-actuator.md` v1.10
changelog row added; §6.1 / §6.3 / §7 / §12 / §13 patched to
match (Protocol-level docstrings for `current_w`/`default_w`,
`get_uuid` / `_power_limits` / `restore_default` code samples,
`_apply_cap_inner` import + blank-constant sample, attribution
table row 2, summary clause, reference-material entries for
`pydcgm/dcgmvalue.py` and `GetGpuAttributes`).

Local gates: 163/163 power-agent unit tests pass; pre-commit
isort/black/flake8/codespell/ruff/EOF/whitespace clean on the
three changed files. Broader `tests/` + `pytest-marker-report`
blocked by the known Windows-only `fcntl` import in
`tests/conftest.py:22`; CI runs those on Linux.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 25, 2026
Three Dockerfile bugs combined to make the DCGM-mode image unbuildable

on a fresh checkout. Fixing any one in isolation leaves the build broken,

so they travel together:

1. DCGM_IMAGE default 'nvcr.io/nvidia/cloud-native/dcgm:4.2.3-2-ubuntu22.04'

   does not exist on NGC (verified 2026-05-21 via 'docker manifest inspect'

   → 404). Bump to 4.5.1-1-ubuntu22.04, the only resolvable 4.x tag.

2. DCGM 4.5+ relocated python bindings from /usr/local/dcgm/bindings/python3/

   to /usr/share/datacenter-gpu-manager-4/bindings/python3/. The previous

   COPY would silently copy zero files under the new pin. Switch the source

   path to the 4.5+ location.

3. NGC's DCGM 4.5+ runtime image ships pydcgm with DcgmGroup.py:20 doing

   'import logger' — but logger.py lives in DCGM's source tree under

   testing/python3/ and is NOT packaged. Without a shim every DcgmGroup

   construction raises ModuleNotFoundError. Add a 10-line stdlib-logging

   adapter at components/power_agent/logger.py and COPY it into

   /opt/dcgm/python/logger.py during the runtime stage.

This unblocks 'docker build -f components/power_agent/Dockerfile' on a

fresh clone (verified locally via 'docker buildx build --build-arg

DCGM_IMAGE=...4.5.1-1-ubuntu22.04' against viking-prod-216 on 2026-05-21,

image pushed to ttl.sh/dynamo-pa-kaim-dcgm45-v2:24h and used by the

Path-B live test on aks-a100b-22138447-vmss000000).

Refs: PR #9790 review, Power Agent live-test findings #1/#2/#6.
Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 25, 2026
components/power_agent/actuator.py:807 uses dcgm_structs.c_dcgmDeviceConfig_v2,

a struct that only exists in DCGM 4.x. With 3.x bindings the agent's

DcgmActuator.init() succeeded, opened the hostengine connection, and

ran NVML init cleanly — then crashed mid-first-reconcile with

'AttributeError: module dcgm_structs has no attribute

c_dcgmDeviceConfig_v2' after some GPUs were already capped. The

SIGTERM-restore path won't run when the actuator never finished

registering with _active_actuator, so the GPUs are left at custom caps.

Add a 7-line hasattr check immediately after the dcgm_structs import

in init() that raises a RuntimeError with:

  - the missing struct name (so the error is grep-able)

  - the required DCGM major (>=4.0)

  - the DCGM_IMAGE build-arg + canonical 4.5.1 tag (so operators

    know exactly what to bump)

Guards against accidental Dockerfile regressions to a 3.x base image.

Pair with the Dockerfile fix in 4820ca7 (which bumps the default

to 4.5.1-1-ubuntu22.04) so the default build path is consistent with

the runtime contract.

Unit test covers both happy path (DCGM 4.x mock) and 3.x rejection

(MagicMock with c_dcgmDeviceConfig_v2 deleted), asserting init fails

BEFORE the first pydcgm.DcgmHandle call so misconfigured deployments

don't half-init and leave hostengine sockets dangling.

Refs: PR #9790 review, Power Agent live-test finding #5.
Signed-off-by: Kai Ma <kaim@nvidia.com>
@kaim-eng

Copy link
Copy Markdown
Author

Rebased onto the refreshed pr1a/power-agent (tip 29f0831) and main (~316 commits forward from the prior base). Force-pushed pr1b/power-agent-dcgm-actuator to abcc922.

What moved:

CI rerunning now.

@coderabbitai review

@coderabbitai

coderabbitai Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor

@kaim-eng Thanks for the detailed rebase summary! The backport strategy—keeping each hardening fix with the code it touches on pr1a while retaining the actuator-aware refinements and the 29 new test cases here on pr1b—looks like a clean separation. I'll kick off a fresh review of the force-pushed tip now.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
components/power_agent/tests/test_managed_state_parser.py (1)

85-275: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add required pytest markers (scheduling + GPU + type) to this test module.

The newly added tests are unmarked and don’t satisfy the required test marker policy.

As per coding guidelines: “Every test must have at least one scheduling marker… at least one GPU marker… and at least one type marker…”.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/power_agent/tests/test_managed_state_parser.py` around lines 85 -
275, Import pytest at the top of the test module and add a module-level
pytestmark list that applies the required markers (scheduling + GPU + type) to
all tests, e.g. place `import pytest` and `pytestmark = [pytest.mark.scheduling,
pytest.mark.gpu, pytest.mark.unit]` (or whichever type marker is appropriate)
near the top so TestHappyPath, TestMissingFile, TestOsErrorSiblings,
TestMalformedJson, TestNonObjectRoot, TestManagedUuidsNotAList, and
TestEntryTypeValidation all inherit the markers.
components/power_agent/tests/test_orphan_recovery.py (1)

85-286: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add required pytest markers (scheduling + GPU + type) to this test module.

This suite needs the mandatory test markers before merge.

As per coding guidelines: “Every test must have at least one scheduling marker… at least one GPU marker… and at least one type marker…”.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/power_agent/tests/test_orphan_recovery.py` around lines 85 - 286,
The module lacks required pytest markers (scheduling + GPU + type); add a
module-level pytestmark so every test in this file gets those markers: import
pytest at top of the file and define pytestmark =
[pytest.mark.<scheduling_marker>, pytest.mark.<gpu_marker>,
pytest.mark.<type_marker>] (replace placeholders with the project's chosen
markers), which will apply to the test classes such as TestUuidGating,
TestWorkloadBusySkip, TestCurrentVsDefaultGuard, TestPerGpuExceptionIsolation,
and TestManagedSetPruning.
components/power_agent/tests/test_shutdown.py (1)

46-229: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add required pytest markers (scheduling + GPU + type) to this test module.

This file currently lacks the mandatory test classification markers.

As per coding guidelines: “Every test must have at least one scheduling marker… at least one GPU marker… and at least one type marker…”.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/power_agent/tests/test_shutdown.py` around lines 46 - 229, Add the
required pytest classification markers by defining a module-level pytestmark
list (e.g. pytestmark = [pytest.mark.scheduling, pytest.mark.gpu,
pytest.mark.type]) so every test in this module (including
TestSigtermViaActuator and TestSigtermFallback and their methods like
_handle_sigterm interactions) carries scheduling, GPU and type markers; place
the pytestmark definition near the top of the file (above the test classes) and
import pytest if not already imported.
components/power_agent/tests/test_reconcile_wiring.py (1)

46-326: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add required pytest markers (scheduling + GPU + type) for this module.

These tests currently have none of the required markers.

Suggested patch
+import pytest
+
+pytestmark = [
+    pytest.mark.post_merge,  # or pre_merge/nightly/etc. per intended cadence
+    pytest.mark.gpu_0,
+    pytest.mark.unit,
+]

As per coding guidelines: “Every test must have at least one scheduling marker… at least one GPU marker… and at least one type marker…”.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/power_agent/tests/test_reconcile_wiring.py` around lines 46 - 326,
Add the required pytest markers by importing pytest and applying scheduling,
GPU, and type markers to this test module; either set a module-level pytestmark
list or add decorators to each test class (e.g.
TestReconcileGpuRoutesViaActuator, TestReconcileGpuDedupesByPodUid,
TestReconcileGpuPolicyResolution) to include at least one scheduling marker, one
GPU marker, and one type marker per the guidelines so all tests in this file are
properly marked.
🧹 Nitpick comments (4)
deploy/helm/charts/power-agent/tests/validate_enforce_test.yaml (1)

54-95: ⚡ Quick win

Strengthen enforce-flag assertion coverage for accepted inputs and nvml dead-config checks.

Line 54 onward accepts multiple string forms with only notFailedTemplate, and Line 136 claims “ANY allowlisted value” absence while checking only a subset. Add exact contains/notContains assertions for the remaining allowlisted renderable forms to catch silent drift.

Proposed test hardening pattern
  - it: should accept enforce=1 (string)
    set:
      agent.dcgm.enforce: "1"
    asserts:
      - notFailedTemplate: {}
+      - contains:
+          path: spec.template.spec.containers[0].command
+          content: --dcgm-enforce=1

  - it: should accept enforce=0 (string)
    set:
      agent.dcgm.enforce: "0"
    asserts:
      - notFailedTemplate: {}
+      - contains:
+          path: spec.template.spec.containers[0].command
+          content: --dcgm-enforce=0

  - it: should accept enforce=yes
    set:
      agent.dcgm.enforce: "yes"
    asserts:
      - notFailedTemplate: {}
+      - contains:
+          path: spec.template.spec.containers[0].command
+          content: --dcgm-enforce=yes
      - notContains:
          path: spec.template.spec.containers[0].command
          content: --dcgm-enforce=false
+      - notContains:
+          path: spec.template.spec.containers[0].command
+          content: --dcgm-enforce=1
+      - notContains:
+          path: spec.template.spec.containers[0].command
+          content: --dcgm-enforce=0
+      - notContains:
+          path: spec.template.spec.containers[0].command
+          content: --dcgm-enforce=yes
+      - notContains:
+          path: spec.template.spec.containers[0].command
+          content: --dcgm-enforce=no
+      - notContains:
+          path: spec.template.spec.containers[0].command
+          content: --dcgm-enforce=on
+      - notContains:
+          path: spec.template.spec.containers[0].command
+          content: --dcgm-enforce=off

Also applies to: 136-147

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/helm/charts/power-agent/tests/validate_enforce_test.yaml` around lines
54 - 95, The tests under validate_enforce_test.yaml currently only use
notFailedTemplate for many agent.dcgm.enforce string variants; update each case
(e.g., the tests setting agent.dcgm.enforce to
"1","0","yes","no","on","off","TRUE") to also assert the rendered templates
explicitly include or exclude the expected configuration snippets by adding
contains/notContains assertions for the corresponding normalized values (e.g.,
numeric 1/0, true/false, on/off forms) so silent drift is caught, and apply the
same hardening to the other block referenced (the ANY allowlisted values section
around lines 136-147) to cover all allowed renderable forms and the nvml
dead-config checks rather than relying solely on notFailedTemplate.
components/power_agent/tests/test_reconcile_wiring.py (1)

218-219: ⚡ Quick win

Move function-scope imports to module scope.

from tests.test_multi_pod_policy import _FakeMetrics is imported inside test methods; this should be hoisted to top-level imports.

Suggested patch
 import unittest
 from unittest.mock import MagicMock, patch
 
 import power_agent
 from power_agent import PowerAgent
+from tests.test_multi_pod_policy import _FakeMetrics
@@
-        from tests.test_multi_pod_policy import _FakeMetrics
-
         agent.metrics = _FakeMetrics()
@@
-        from tests.test_multi_pod_policy import _FakeMetrics
-
         agent.metrics = _FakeMetrics()
@@
-        from tests.test_multi_pod_policy import _FakeMetrics
-
         agent.metrics = _FakeMetrics()

As per coding guidelines: “ensure imports are only at module scope (no imports inside functions/classes)”.

Also applies to: 243-244, 269-270

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/power_agent/tests/test_reconcile_wiring.py` around lines 218 -
219, The test module imports `_FakeMetrics` inside test functions; hoist `from
tests.test_multi_pod_policy import _FakeMetrics` to the top-level imports in the
module (module scope) and remove the function-scope imports in the test methods
(including the other occurrences where `_FakeMetrics` is imported inside tests).
Ensure the module-level import is added near the other test imports so all tests
reference the same top-level `_FakeMetrics`.
components/power_agent/tests/test_managed_state_parser.py (1)

60-67: ⚡ Quick win

Hoist logging imports to module scope.

import logging is currently inside methods; move it to top-level imports for consistency with Python guidelines.

As per coding guidelines: “ensure imports are only at module scope (no imports inside functions/classes)”.

Also applies to: 114-121

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/power_agent/tests/test_managed_state_parser.py` around lines 60 -
67, Move the "import logging" statements out of the test functions and into
module-level imports in test_managed_state_parser.py; remove the in-function
imports at the locations that set up captured, handler, and logger (the blocks
that create captured: list[logging.LogRecord], handler = logging.Handler(),
handler.emit = captured.append, handler.setLevel(logging.WARNING), and logger =
logging.getLogger("power_agent")), so the tests use the top-level logging import
consistently (also apply the same change to the other block referenced in the
file).
components/power_agent/tests/e2e_actuator_parity.py (1)

497-500: ⚡ Quick win

Avoid silently swallowing NVML shutdown errors.

This except Exception: pass hides teardown faults and makes parity-run failures hard to diagnose.

Suggested patch
     finally:
         try:
             pynvml.nvmlShutdown()
-        except Exception:
-            pass
+        except Exception as e:
+            print(f"WARNING: pynvml.nvmlShutdown() failed: {e}", file=sys.stderr)

As per coding guidelines: “fail fast (don’t swallow exceptions, catch specific exceptions only, and if catching Exception then log and re-raise)”.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/power_agent/tests/e2e_actuator_parity.py` around lines 497 - 500,
The try/except that currently calls pynvml.nvmlShutdown() and swallows all
exceptions should be changed to catch the specific NVML exception type
(pynvml.NVMLError) instead of Exception, log the error with traceback, and
re-raise so teardown failures are visible; update the block around
pynvml.nvmlShutdown() (replace the bare except Exception: pass) to except
pynvml.NVMLError as e: logger.exception("pynvml.nvmlShutdown failed: %s", e) and
raise, and if a module-level logger isn’t present add logger =
logging.getLogger(__name__) and import logging.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@components/power_agent/actuator.py`:
- Around line 775-785: The except-block around the call to self._apply_cap_inner
should only catch DCGM write-related exceptions instead of all Exception;
replace "except Exception as e:" with a narrow catch such as "except DcgmError
as e" (or the specific DCGM client exception types your project uses, e.g.,
DcgmApiError / DcgmWriteError) and ensure those exception classes are imported,
keep the existing logger.error and metrics.inc for that case, and re-raise any
other unexpected exceptions so programming/runtime bugs are not swallowed by
apply_cap/_apply_cap_inner.
- Around line 539-540: The except block that swallows all exceptions after
calling _handle.Shutdown() must be changed to avoid hiding failures: catch only
expected exceptions or catch Exception but log the exception context before
continuing. Locate the try/except surrounding _handle.Shutdown() in actuator.py
(the block that currently reads "except Exception: pass") and replace the silent
swallow with a logged error using the module logger (include the exception
instance and stacktrace) or re-raise critical exceptions after logging; ensure
the log message mentions "_handle.Shutdown" so it’s easy to find in logs.
- Around line 618-622: The RuntimeError raised in DcgmActuator.list_running_pids
when a GPU UUID is no longer visible should preserve the original KeyError as
the root cause; change the raise so it uses exception chaining (raise
RuntimeError(...) from err) where err is the caught KeyError from the UUID-map
lookup, keeping the message unchanged but appending "from err" to preserve the
original traceback and context for uuid and gpu_idx.

In `@components/power_agent/power_agent.py`:
- Around line 169-177: The current check uses valid = {u for u in uuids if
isinstance(u, str)} which turns uuids into a set and conflates duplicates with
non-string entries; replace this by collecting only string entries without
deduping (e.g., valid_list = [u for u in uuids if isinstance(u, str)]) and
compute non_string_count = sum(1 for u in uuids if not isinstance(u, str)); then
change the logger.warning invocation (the one referencing _MANAGED_STATE_PATH)
to report non_string_count and the number of kept UUIDs (use len(valid_list) or
len(set(valid_list)) depending on whether you intend to report raw entries or
unique UUIDs) instead of comparing len(valid) to len(uuids).

In `@deploy/helm/charts/power-agent/README.md`:
- Around line 49-53: The README currently references a specific chart version
("chart v1.1.0 requires the v1.1.0 image"); update the wording to tie the
requirement to the chart's appVersion instead of a hard-coded chart version:
state that the image tag MUST match the chart's `appVersion`, and explain that
this is necessary because the chart renders the `--actuator` / `--dcgm-*` CLI
flags and `power_agent.py` imports the appVersion-specific `actuator` module;
replace the version-specific sentence with this appVersion-based guidance so
future chart bumps don't require text edits.

In `@deploy/helm/charts/power-agent/templates/_helpers.tpl`:
- Around line 117-119: The helper currently only checks for leading/trailing
whitespace on $tag but allows internal whitespace (e.g., "v1 .1.0") which
produces invalid OCI image references; update the validation in the template
that handles image.tag (the block using $tag and the fail message) to reject any
whitespace characters anywhere in $tag (not just trim differences) by testing
for matches against a whitespace pattern and calling fail with the existing
informative message when any whitespace is found; keep the same fail text and
hint but trigger it when $tag contains internal spaces or other whitespace.

In `@docs/design-docs/power-agent-helm-chart-plan.md`:
- Line 828: The Markdown has unlabeled fenced code blocks that trigger MD040;
update each triple-backtick block in power-agent-helm-chart-plan.md to include a
language identifier (e.g., change ``` to ```text or to a specific language such
as ```yaml, ```bash, or ```json as appropriate) so the linter recognizes the
block language; apply the same change to the other unlabeled fences mentioned in
the comment to resolve MD040.
- Line 29: The heading "### Revision history" is skipping a level; change that
heading to "## Revision history" (or add a parent "##" section before it) so the
document preserves correct heading hierarchy and satisfies MD001; update the
line containing "### Revision history" in
docs/design-docs/power-agent-helm-chart-plan.md accordingly.
- Around line 826-846: Update the stale v1.1.0 summary counts in sections §5.2
and §8 to match the v1.2.0 source-of-truth numbers used elsewhere in this
document: replace "13 files / 2 helm-unittests / 24 passed" with "14 files / 3
helm-unittests / 46 passed" (and any adjacent totals like "Chart-only total" or
LOC if they differ), and make the identical corrections where the same summary
appears around lines 1151-1160 so all references are consistent.

---

Outside diff comments:
In `@components/power_agent/tests/test_managed_state_parser.py`:
- Around line 85-275: Import pytest at the top of the test module and add a
module-level pytestmark list that applies the required markers (scheduling + GPU
+ type) to all tests, e.g. place `import pytest` and `pytestmark =
[pytest.mark.scheduling, pytest.mark.gpu, pytest.mark.unit]` (or whichever type
marker is appropriate) near the top so TestHappyPath, TestMissingFile,
TestOsErrorSiblings, TestMalformedJson, TestNonObjectRoot,
TestManagedUuidsNotAList, and TestEntryTypeValidation all inherit the markers.

In `@components/power_agent/tests/test_orphan_recovery.py`:
- Around line 85-286: The module lacks required pytest markers (scheduling + GPU
+ type); add a module-level pytestmark so every test in this file gets those
markers: import pytest at top of the file and define pytestmark =
[pytest.mark.<scheduling_marker>, pytest.mark.<gpu_marker>,
pytest.mark.<type_marker>] (replace placeholders with the project's chosen
markers), which will apply to the test classes such as TestUuidGating,
TestWorkloadBusySkip, TestCurrentVsDefaultGuard, TestPerGpuExceptionIsolation,
and TestManagedSetPruning.

In `@components/power_agent/tests/test_reconcile_wiring.py`:
- Around line 46-326: Add the required pytest markers by importing pytest and
applying scheduling, GPU, and type markers to this test module; either set a
module-level pytestmark list or add decorators to each test class (e.g.
TestReconcileGpuRoutesViaActuator, TestReconcileGpuDedupesByPodUid,
TestReconcileGpuPolicyResolution) to include at least one scheduling marker, one
GPU marker, and one type marker per the guidelines so all tests in this file are
properly marked.

In `@components/power_agent/tests/test_shutdown.py`:
- Around line 46-229: Add the required pytest classification markers by defining
a module-level pytestmark list (e.g. pytestmark = [pytest.mark.scheduling,
pytest.mark.gpu, pytest.mark.type]) so every test in this module (including
TestSigtermViaActuator and TestSigtermFallback and their methods like
_handle_sigterm interactions) carries scheduling, GPU and type markers; place
the pytestmark definition near the top of the file (above the test classes) and
import pytest if not already imported.

---

Nitpick comments:
In `@components/power_agent/tests/e2e_actuator_parity.py`:
- Around line 497-500: The try/except that currently calls pynvml.nvmlShutdown()
and swallows all exceptions should be changed to catch the specific NVML
exception type (pynvml.NVMLError) instead of Exception, log the error with
traceback, and re-raise so teardown failures are visible; update the block
around pynvml.nvmlShutdown() (replace the bare except Exception: pass) to except
pynvml.NVMLError as e: logger.exception("pynvml.nvmlShutdown failed: %s", e) and
raise, and if a module-level logger isn’t present add logger =
logging.getLogger(__name__) and import logging.

In `@components/power_agent/tests/test_managed_state_parser.py`:
- Around line 60-67: Move the "import logging" statements out of the test
functions and into module-level imports in test_managed_state_parser.py; remove
the in-function imports at the locations that set up captured, handler, and
logger (the blocks that create captured: list[logging.LogRecord], handler =
logging.Handler(), handler.emit = captured.append,
handler.setLevel(logging.WARNING), and logger =
logging.getLogger("power_agent")), so the tests use the top-level logging import
consistently (also apply the same change to the other block referenced in the
file).

In `@components/power_agent/tests/test_reconcile_wiring.py`:
- Around line 218-219: The test module imports `_FakeMetrics` inside test
functions; hoist `from tests.test_multi_pod_policy import _FakeMetrics` to the
top-level imports in the module (module scope) and remove the function-scope
imports in the test methods (including the other occurrences where
`_FakeMetrics` is imported inside tests). Ensure the module-level import is
added near the other test imports so all tests reference the same top-level
`_FakeMetrics`.

In `@deploy/helm/charts/power-agent/tests/validate_enforce_test.yaml`:
- Around line 54-95: The tests under validate_enforce_test.yaml currently only
use notFailedTemplate for many agent.dcgm.enforce string variants; update each
case (e.g., the tests setting agent.dcgm.enforce to
"1","0","yes","no","on","off","TRUE") to also assert the rendered templates
explicitly include or exclude the expected configuration snippets by adding
contains/notContains assertions for the corresponding normalized values (e.g.,
numeric 1/0, true/false, on/off forms) so silent drift is caught, and apply the
same hardening to the other block referenced (the ANY allowlisted values section
around lines 136-147) to cover all allowed renderable forms and the nvml
dead-config checks rather than relying solely on notFailedTemplate.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4b586c90-dca2-4d5f-aaae-4f35d04c75ef

📥 Commits

Reviewing files that changed from the base of the PR and between 29f0831 and abcc922.

📒 Files selected for processing (26)
  • components/power_agent/Dockerfile
  • components/power_agent/README.md
  • components/power_agent/actuator.py
  • components/power_agent/logger.py
  • components/power_agent/power_agent.py
  • components/power_agent/tests/e2e_actuator_parity.py
  • components/power_agent/tests/test_actuator_protocol.py
  • components/power_agent/tests/test_actuator_selection.py
  • components/power_agent/tests/test_dcgm_actuator.py
  • components/power_agent/tests/test_managed_state_parser.py
  • components/power_agent/tests/test_multi_pod_policy.py
  • components/power_agent/tests/test_orphan_recovery.py
  • components/power_agent/tests/test_reconcile_wiring.py
  • components/power_agent/tests/test_shutdown.py
  • deploy/helm/charts/power-agent/Chart.yaml
  • deploy/helm/charts/power-agent/README.md
  • deploy/helm/charts/power-agent/templates/NOTES.txt
  • deploy/helm/charts/power-agent/templates/_helpers.tpl
  • deploy/helm/charts/power-agent/templates/daemonset.yaml
  • deploy/helm/charts/power-agent/templates/dev-pod.yaml
  • deploy/helm/charts/power-agent/tests/validate_actuator_test.yaml
  • deploy/helm/charts/power-agent/tests/validate_enforce_test.yaml
  • deploy/helm/charts/power-agent/tests/validate_image_tag_test.yaml
  • deploy/helm/charts/power-agent/values.yaml
  • docs/design-docs/power-agent-dual-actuator.md
  • docs/design-docs/power-agent-helm-chart-plan.md

Comment thread components/power_agent/actuator.py Outdated
Comment thread components/power_agent/actuator.py Outdated
Comment thread components/power_agent/actuator.py
Comment thread components/power_agent/power_agent.py
Comment thread deploy/helm/charts/power-agent/README.md Outdated
Comment thread deploy/helm/charts/power-agent/templates/_helpers.tpl Outdated
Comment thread docs/design-docs/power-agent-helm-chart-plan.md Outdated
Comment thread docs/design-docs/power-agent-helm-chart-plan.md Outdated
Comment thread docs/design-docs/power-agent-helm-chart-plan.md Outdated
kaim-eng and others added 12 commits June 8, 2026 11:58
…dation

Chart counterpart to the Actuator Protocol introduced in the previous
commit. Bumps Chart.yaml 1.0.0 -> 1.1.0 (minor: new opt-in DCGM
actuator on top of byte-identical NVML default).

values.yaml additions:
- agent.actuator: nvml | dcgm (default nvml - pre-PR1b behaviour
  byte-for-byte). No auto-detect: the operator declares which
  actuator their cluster runs based on whether their GPU Operator
  has dcgm.enabled=true, and the chart renders one DaemonSet with
  one actuator locked at startup. Eliminates "NVML and DCGM running
  simultaneously" by construction.
- agent.dcgm.host (default nvidia-dcgm.gpu-operator.svc.cluster.local
  - the GPU Operator's standard hostengine Service DNS name)
- agent.dcgm.port (default 5555 - upstream nvidia-dcgm hostPort)
- agent.dcgm.enforce (default false - set-and-forget matches NVML
  semantics; flip to true to register cap as DCGM's target
  configuration for auto-reapply after GPU reset/reinit)

daemonset.yaml:
- --actuator=$(agent.actuator) on every DaemonSet command line
- {{- if eq .Values.agent.actuator "dcgm" }} block injects
  --dcgm-host / --dcgm-port / --dcgm-enforce. NVML mode never sees
  these flags (no dead args).
- Wires four template-time validators at the top of the file:
  validateImageTag, validateMutex, validateActuator, validateEnforce.

dev-pod.yaml:
- ConfigMap script mount now expects BOTH power_agent.py AND
  actuator.py (chart v1.1.0+ requirement - a v1.0.0-style ConfigMap
  with only power_agent.py will ImportError "No module named
  'actuator'" at pod start). README + values.yaml comments + NOTES.txt
  kubectl-create-cm snippets all updated to reflect the two-file
  layout.
- Command line plumbs the same --actuator / --dcgm-* flags as the
  production DaemonSet. Pre-v1.7 dev pods silently defaulted to
  NVML regardless of agent.actuator=dcgm (the values were read but
  never passed to argparse).

_helpers.tpl:
- validateActuator: strict allowlist "nvml" or "dcgm" (rejects
  "NVML", "DCGM", "auto" - the v1.2 reversal - and empty/typo
  inputs). Errors at helm install time, not pod start.
- validateEnforce: mirrors power_agent._parse_bool_strict's
  allowlist (true/1/yes/on/false/0/no/off case-insensitive).
  Rejects --set agent.dcgm.enforce=treu at template time with a
  clear errorPattern instead of letting argparse fail at pod start
  with a less obvious CrashLoopBackOff trace.

tests/ (new directory - helm-unittest suite):
- validate_actuator_test.yaml: 11 tests covering the happy
  nvml/dcgm paths, DCGM flag injection (and exclusion under NVML),
  and the four typo classes (NVML/DCGM uppercase, auto, unknown,
  empty).
- validate_enforce_test.yaml: 13 tests covering all eight
  _parse_bool_strict allowlist values, uppercase normalization,
  the canonical "treu" typo class, and the actuator=nvml + bad
  enforce cross-validator interaction (no validation fires when
  the flag is dead).
- Assertion style: `contains` / `notContains` against the rendered
  command: YAML list (NOT matchRegex, which helm-unittest 1.1.0
  only supports on string paths). Each `contains` check pins the
  exact rendered flag string, stricter than the regex form and
  catches flag-value drift.

NOTES.txt:
- Adds dcgm_enforce_failures_total to the Prometheus metrics
  listing with semantic disambiguation from apply_failures_total.

README updates:
- Compact values-table rows for dev.scriptConfigMap (now
  two-file) and dev.image (DCGM dev-mode override pointer).
- DCGM dev-mode prerequisite block: explicit --set
  dev.image.repository=nvcr.io/nvidia/dynamo/power-agent --set
  dev.image.tag=v1.1.0 override because the default
  vllm-runtime:1.0.1 ships pynvml only (no pydcgm / libdcgm.so).
  Script iteration via the ConfigMap mount still works against
  the power-agent image because the /scripts mount overrides
  /app at runtime.
- Install-example image.tag pinning bumped 1.0.0 -> 1.1.0 to
  match Chart.yaml's appVersion (a v1.0.0 image lacks actuator.py
  and rejects the new --actuator / --dcgm-* flags).

Signed-off-by: Kai Ma <kaim@nvidia.com>
Aligns docs/design-docs/power-agent-helm-chart-plan.md with the v1.1.0
chart that PR #9790 ships. The plan was authored for the NVML-only
v1.0.0 chart in PR #9682 and went stale once PR #9790 layered
agent.actuator + agent.dcgm.* + validateActuator + validateEnforce +
the helm-unittest suite on top.

Two changelog rows capture the refresh:
  v1.3 - first pass: 8 reviewer findings closed (3 blocking,
                     4 major, 2 medium, 1 low) on Status header,
                     values surface, dev ConfigMap recipe, image-tag
                     pinning, helm-unittest gating, internal-dev-doc
                     references, daemonset name, file/LOC accounting.
  v1.4 - second pass: 5 follow-up findings closed (2 major,
                     2 medium, 1 low) on Status self-contradiction,
                     stale section 4.2 dev-block comment, overstated
                     helm-unittest coverage prose, unrunnable
                     section 5.4 positive helm template overlays
                     (missing --set image.tag), and the lingering
                     filename reference in the v1.3 changelog.

No design reversal: every section 6 decision and the chart shape are
unchanged. Only the values surface (extended), helper set (extended
with two template-time validators), and validation-gate list
(helm-unittest now required) grew. Every claim in the refreshed
doc was verified against on-disk state: power_agent.py:706-804
for the 8-flag CLI surface, values.yaml for the dev-block recipe,
_helpers.tpl for the five-helper set, and the two
tests/validate_*_test.yaml files for the 24 enumerated unittest
cases.

Signed-off-by: Kai Ma <kaim@nvidia.com>
Addresses five PR9682 CodeRabbit review comments on the Python
runtime. Pure hardening: no behaviour change on the happy path,
every fix narrows a previously silent / fragile failure mode.

power_agent.py
- _load_previously_managed_gpus is no longer best-effort. The
  pre-fix try/except caught only FileNotFoundError + JSONDecodeError
  and used .get() on whatever json.load returned, so the parser
  could (a) propagate OSError siblings (PermissionError on a stale
  host-volume bind, EIO on flaky NFS) up the stack and crash the
  agent at startup, (b) AttributeError on a non-dict JSON root, or
  (c) silently iterate characters / TypeError on a non-list
  managed_uuids value. Replaced with a layered parser that catches
  OSError, validates the root is a dict, validates managed_uuids is
  a list, and coerces the entry-is-str type guard at the boundary
  so downstream "uuid in _previously_managed" checks can't compare
  bytes/ints against str UUIDs and silently never match. Every
  malformation logs at WARNING and returns empty rather than
  raising; orphan recovery silently skips this startup, which is
  strictly better than CrashLoopBackOff with no caps actuated.
- _reconcile_gpu now deduplicates pod_annotations by pod UID before
  handing them to _resolve_cap_for_gpu. Pre-fix one pod with N GPU
  processes (the common TP/PP/EP topology, ranks-per-GPU, helper
  workers, profilers) appeared as N rows; _resolve_cap_for_gpu uses
  len(pod_annotations) > 1 to detect the multi-pod-per-GPU misconfig,
  so a one-pod / two-PID GPU spuriously fired the "N pods all agree"
  WARNING and incorrectly bumped the multi_pod_gpu_total agree label.
  On a pod whose annotation was missing or invalid, the same code
  path also took the conflict-resolution branch and applied
  safe_default even though the operator's intent was unambiguous.
- _handle_sigterm no longer swallows actuator.shutdown / nvmlShutdown
  failures silently. Pre-fix "except Exception: pass" made DCGM
  hostengine faults and NVML teardown errors at SIGTERM time
  invisible in pod logs - operators had no way to correlate a
  failed restart with a stuck hostengine. logger.exception writes
  the full traceback at ERROR while still proceeding to
  _shutdown.set() so the run loop unblocks (cleanup is best-effort,
  the container MUST exit promptly on SIGTERM).
- argparse hoisted from inside main() to the module-top imports
  alongside the other stdlib modules (CodeRabbit review-body nit).

actuator.py
- DcgmActuator.shutdown applied the same logger.exception
  treatment to its three silent broad catches (group.Delete,
  handle.Shutdown, pynvml.nvmlShutdown). Group failures name the
  gpu_id, hostengine shutdown failures name host:port, NVML
  teardown names the source - so a DCGM resource leak is no
  longer invisible. Same non-reraise contract: cleanup is
  best-effort, callers (power_agent._handle_sigterm) must proceed
  to _shutdown.set().

Tests added / extended (+29 new cases; total 162 passed):
- tests/test_managed_state_parser.py (new, 20 cases): happy path,
  missing file silent, OSError siblings (PermissionError,
  IsADirectoryError, NotADirectoryError, EIO), malformed JSON,
  non-dict root (list/string/int/null), non-list managed_uuids
  (string/int/dict), mixed-type entries (drop count + WARNING
  fires once).
- tests/test_reconcile_wiring.py::TestReconcileGpuDedupesByPodUid
  (4 new cases): one-pod/multi-PID counts once, doesn't bump
  multi-pod counter; two-pods/multi-PID-each genuinely counts as
  two; two disagreeing pods each multi-PID still resolves to
  safe_default with conflict counter ticking exactly once.
- tests/test_shutdown.py (2 new cases): assertLogs ERROR fires
  with the original exception message when actuator.shutdown
  raises; same for the defensive raw-NVML fallback when
  pynvml.nvmlShutdown raises.
- tests/test_dcgm_actuator.py (3 new cases): group.Delete failure
  logs gpu_id; handle.Shutdown failure logs host:port; nvmlShutdown
  failure logs the source.

Signed-off-by: Kai Ma <kaim@nvidia.com>
…ical OCI form

Addresses three iterative PR9682 CodeRabbit review findings on the
chart's image-pinning surface. Bumps Chart.yaml 1.1.0 -> 1.2.0 (minor
per SemVer - image.digest is purely additive; existing tag-pinned
installs render byte-identically and need no change).

Round 1 (original PR9682 finding): image.tag=latest was rejected only
at the raw-manifest layer, not at the chart layer. helm template
--set image.tag=latest happily rendered nvcr.io/nvidia/dynamo/
power-agent:latest because validateImageTag only checked
"if not .Values.image.tag". Now case-insensitively rejects the literal
"latest" (LATEST, Latest, " latest " all fail) while still accepting
release tags that contain the substring (v1.0.0-latest-rc).

Round 2 (digest path rendered an invalid OCI reference): operators
were advised in values.yaml + README + tests to set image.tag=
sha256:abc... for digest pinning. This rendered repo:sha256:... -
INVALID per the OCI spec, which requires repo@sha256:... for digest
form (the second ":" in the rendered string parses as repo + tag
"sha256" with a stray suffix; kubelet manifest pulls fail with an
opaque error). Three changes close this:
  - values.yaml: add image.digest field with per-rule comment
    block explaining the OCI form, the 64-hex requirement, the
    mutex with image.tag, and the PR #9682 rationale.
  - _helpers.tpl: split validateImageTag into a full rule set
    (XOR(tag, digest), whitespace rejection on both, latest
    rejection on tag, sha256: prefix rejection on tag, exact-
    64-hex regex on digest) and add a new imageRef helper that
    emits {repo}@{digest} when digest is set, else {repo}:{tag}.
  - daemonset.yaml: image: line now routes through imageRef
    instead of hard-coding the ":" join, so the canonical OCI
    form is the only thing that can ship.

Round 3 (validator was too lenient):
  - Truncated SHA-256 digests slipped through: regex was
    ^sha256:[0-9a-fA-F]{32,}$, so half-digests (a common pattern
    when an operator copies the first 8 bytes from
    `docker images --digests` output) rendered as
    repo@sha256:<truncated>. The kubelet then fails the pull with
    an opaque manifest-mismatch error. Tightened to {64} exactly
    (SHA-256 is 32 bytes x 2 nybbles).
  - Whitespace-padded values rendered raw: --set-string
    'image.tag= v1.1.0 ' produced image: "repo: v1.1.0 " because
    the latest comparison trimmed but the renderer used the raw
    value. Choice: reject (don't silently normalize) so the
    operator sees their --set quoting typo. Applies symmetrically
    to image.digest.

tests/validate_image_tag_test.yaml (new, 22 cases - chart
helm-unittest count grew 24 -> 46):
- Negative: unset (both empty), latest in 4 case variants,
  whitespace on tag (leading/trailing/both), tag+digest mutex,
  sha256: prefix on tag, non-digest on digest, 32/63/65-hex digest
  (off-by-one + half-length), whitespace on digest.
- Positive: pinned release tag renders {repo}:{tag};
  v1.0.0-latest-rc accepted (substring match guard); 64-hex digest
  renders the canonical {repo}@sha256:... form; uppercase-hex
  digest preserved verbatim.

README updates: production-install section now shows both tag and
digest invocations side by side with a callout that digests live
on image.digest, not image.tag. Values table gains the image.digest
row. Troubleshooting section adds three new error-message entries
covering missing-pin, invalid-digest, and mutex-violation cases.

Signed-off-by: Kai Ma <kaim@nvidia.com>
Aligns docs/design-docs/power-agent-helm-chart-plan.md with the
v1.2.0 chart that the preceding commit ships. The plan was authored
for the v1.0.0 NVML-only chart in PR #9682, refreshed to v1.1.0 in
the PR #9790 dual-actuator work, and went stale again once v1.2.0
landed image.digest + the canonical-OCI imageRef helper on top.

Four reviewer findings closed (v1.5 changelog row covers all four):

(1) Truncated SHA-256 digests slipped through the v1.1.0 helper -
already fixed in the preceding chart commit; this commit propagates
the rule into the §4.4 helper snippet and the §5.4 validation gates.

(2) Whitespace-padded tags rendered raw image references - same
treatment: §4.4 documents the trim-then-reject contract.

(3) §4.3 / §4.4 / §5.4 stale on image.digest:
  - §4.3 CodeRabbit-comment row rewritten to describe both
    helpers (validateImageTag + imageRef), both fields
    (tag + digest), the canonical repo@digest form, and the
    PR9682 follow-up that added the separate field.
  - §4.4 helper snippet replaced with the actual v1.2.0
    validator (full rule set + per-rule rationale comments)
    plus the new imageRef helper.
  - §5.4 expected helm-unittest output bumped 24 -> 46 passed
    with a one-line breakdown of the +22 new cases.

(4) §4.1 / §4.2 stale on file count + helper list + values surface:
  - §4.1 said 13 files (omitted .helmignore and the new
    validate_image_tag_test.yaml). Corrected to 14 files (7
    templates + 3 helm-unittests + 4 root files). The _helpers.tpl
    helper-list line gained imageRef and chart entries. The
    daemonset.yaml annotation now mentions it routes through
    imageRef. LOC estimate bumped ~1,430 -> ~1,700 with a
    breakdown of what v1.2.0 added on top of v1.0.0 / v1.1.0.
  - §4.2 values snippet had only image.tag with no image.digest.
    Added the field with the same per-rule comment block that
    ships in values.yaml (OCI form, 64-hex requirement, mutex
    with image.tag, PR9682 rationale).

Status header bumped to chart v1.2.0 with a one-line v1.2.0
rationale (additive opt-in field, no breaking change to existing
tag-pinned installs). Revision-history table gains the v1.5 entry
summarising all four findings; older v1.1 - v1.4 entries
untouched.

No design reversal: every §6 decision and the chart shape are
unchanged. Only the values surface (one additive field), helper
set (two helpers: imageRef new, validateImageTag rewritten), and
helm-unittest count (24 -> 46) grew. Every claim in the refreshed
doc was verified against on-disk state: Chart.yaml for the
version bump, values.yaml for the image.digest field comment,
_helpers.tpl for both helpers, tests/validate_image_tag_test.yaml
for the 22 enumerated cases (verified via `helm unittest`).

Signed-off-by: Kai Ma <kaim@nvidia.com>
…ttributes (v1.10 e2e fix)

First in-cluster DCGM parity run on an 8xA100 SXM node (PR #9790
e2e) surfaced three production defects in DcgmActuator, all rooted
in pydcgm API misuse that the mocked unit suite couldn't catch:

1. `get_uuid` and `_ensure_identity_map._read_dcgm_uuids` read
   `DCGM_FI_DEV_UUID` via `dcgmEntityGetLatestValues`. That API
   returns the field cache; on a fresh nv-hostengine 4.5.3 with
   no companion watcher subscribed via `dcgmWatchFields`, the
   cache returns `DCGM_STR_BLANK = "<<<NULL>>>"`. The identity
   map then raised "8 GPU UUID(s) visible to DCGM are not visible
   to NVML" on every cluster startup.

2. `constraints_w` / `current_w` / `default_w` read the four
   `DCGM_FI_DEV_POWER_MGMT_LIMIT{,_MIN,_MAX,_DEF}` fields the
   same way. Every read returned `DCGM_FP64_BLANK = 2^47`, so
   the clamp in `apply_cap` escalated a requested 250 W up to
   the blank max and tried to write 140737488355328 W.

3. The workload-power-profile blanking loop in `_apply_cap_inner`
   reached for `dcgm_structs.DCGM_INT32_BLANK`. The constant
   actually lives in `dcgmvalue` (`/shared/pydcgm/dcgmvalue.py:17`
   in the DCGM 4.5.3 apt bindings); pre-v1.10 the first cap write
   raised `AttributeError: module 'dcgm_structs' has no attribute
   'DCGM_INT32_BLANK'`.

Fix:

- Add a single `_power_limits(gpu_idx)` helper that calls
  `DcgmSystem.discovery.GetGpuAttributes(gpu_id).powerLimits` -
  the synchronous device-info API that wraps
  `dcgmGetDeviceAttributes`. One RPC instead of four, returns a
  `c_dcgmDevicePowerLimits_v1` with integer-watt fields populated
  from the hostengine's discovery state, no field-cache dependency.
- `constraints_w`, `current_w`, `default_w`, `restore_default` all
  route through this helper.
- `get_uuid` and `_ensure_identity_map._read_dcgm_uuids` now use
  `GetGpuAttributes(gpu_id).identifiers.uuid`.
- `apply_cap` imports `dcgmvalue` and swaps all seven
  `DCGM_INT32_BLANK` references onto it.

Test infrastructure overhaul (`test_dcgm_actuator.py`):

- `_make_dcgm_modules` adds a `dcgmvalue` MagicMock with all four
  blank sentinels; removes `DCGM_INT32_BLANK` from `dcgm_structs`
  mock.
- `_make_gpu_attrs` extended to carry both `.identifiers.uuid` and
  `.powerLimits.{cur,default,enforced,min,max}PowerLimit`.
- `_wire_handle` wires `GetGpuAttributes(gid)` per-gpu_id.
- `_seed_constraints_and_uuid` consolidated onto the unified
  GetGpuAttributes path.
- Five `modules["dcgm_agent"].dcgmEntityGetLatestValues.assert_not_called()`
  regression guards added across the suite - locks in "no more
  silent field-cache reads of static device info."

E2e re-validation: with this fix in place, the same 8xA100 rig
reports `PASS: NvmlActuator and DcgmActuator agree on all probes
(tolerance +/- 2.0 W)`, exit code 0, with nvidia-smi confirming
`apply_cap(250)` -> 250 W and `restore_default()` -> 400 W on
every GPU.

Doc: `docs/design-docs/power-agent-dual-actuator.md` v1.10
changelog row added; §6.1 / §6.3 / §7 / §12 / §13 patched to
match (Protocol-level docstrings for `current_w`/`default_w`,
`get_uuid` / `_power_limits` / `restore_default` code samples,
`_apply_cap_inner` import + blank-constant sample, attribution
table row 2, summary clause, reference-material entries for
`pydcgm/dcgmvalue.py` and `GetGpuAttributes`).

Local gates: 163/163 power-agent unit tests pass; pre-commit
isort/black/flake8/codespell/ruff/EOF/whitespace clean on the
three changed files. Broader `tests/` + `pytest-marker-report`
blocked by the known Windows-only `fcntl` import in
`tests/conftest.py:22`; CI runs those on Linux.

Signed-off-by: Kai Ma <kaim@nvidia.com>
Three Dockerfile bugs combined to make the DCGM-mode image unbuildable

on a fresh checkout. Fixing any one in isolation leaves the build broken,

so they travel together:

1. DCGM_IMAGE default 'nvcr.io/nvidia/cloud-native/dcgm:4.2.3-2-ubuntu22.04'

   does not exist on NGC (verified 2026-05-21 via 'docker manifest inspect'

   → 404). Bump to 4.5.1-1-ubuntu22.04, the only resolvable 4.x tag.

2. DCGM 4.5+ relocated python bindings from /usr/local/dcgm/bindings/python3/

   to /usr/share/datacenter-gpu-manager-4/bindings/python3/. The previous

   COPY would silently copy zero files under the new pin. Switch the source

   path to the 4.5+ location.

3. NGC's DCGM 4.5+ runtime image ships pydcgm with DcgmGroup.py:20 doing

   'import logger' — but logger.py lives in DCGM's source tree under

   testing/python3/ and is NOT packaged. Without a shim every DcgmGroup

   construction raises ModuleNotFoundError. Add a 10-line stdlib-logging

   adapter at components/power_agent/logger.py and COPY it into

   /opt/dcgm/python/logger.py during the runtime stage.

This unblocks 'docker build -f components/power_agent/Dockerfile' on a

fresh clone (verified locally via 'docker buildx build --build-arg

DCGM_IMAGE=...4.5.1-1-ubuntu22.04' against viking-prod-216 on 2026-05-21,

image pushed to ttl.sh/dynamo-pa-kaim-dcgm45-v2:24h and used by the

Path-B live test on aks-a100b-22138447-vmss000000).

Refs: PR #9790 review, Power Agent live-test findings #1/#2/#6.
Signed-off-by: Kai Ma <kaim@nvidia.com>
components/power_agent/actuator.py:807 uses dcgm_structs.c_dcgmDeviceConfig_v2,

a struct that only exists in DCGM 4.x. With 3.x bindings the agent's

DcgmActuator.init() succeeded, opened the hostengine connection, and

ran NVML init cleanly — then crashed mid-first-reconcile with

'AttributeError: module dcgm_structs has no attribute

c_dcgmDeviceConfig_v2' after some GPUs were already capped. The

SIGTERM-restore path won't run when the actuator never finished

registering with _active_actuator, so the GPUs are left at custom caps.

Add a 7-line hasattr check immediately after the dcgm_structs import

in init() that raises a RuntimeError with:

  - the missing struct name (so the error is grep-able)

  - the required DCGM major (>=4.0)

  - the DCGM_IMAGE build-arg + canonical 4.5.1 tag (so operators

    know exactly what to bump)

Guards against accidental Dockerfile regressions to a 3.x base image.

Pair with the Dockerfile fix in 4820ca7 (which bumps the default

to 4.5.1-1-ubuntu22.04) so the default build path is consistent with

the runtime contract.

Unit test covers both happy path (DCGM 4.x mock) and 3.x rejection

(MagicMock with c_dcgmDeviceConfig_v2 deleted), asserting init fails

BEFORE the first pydcgm.DcgmHandle call so misconfigured deployments

don't half-init and leave hostengine sockets dangling.

Refs: PR #9790 review, Power Agent live-test finding #5.
Signed-off-by: Kai Ma <kaim@nvidia.com>
Verified each of the 9 fresh CodeRabbit findings on PR #9790 against
the actual code (per .cursor/rules/review-feedback.mdc) and accepted
all nine. Bundled fix in one commit on top of the previous PR9790
HEAD; no history rewrite.

Code (components/power_agent/):
  * actuator.py:539 — replace silent `except Exception: pass` around
    the stale-handle Shutdown with `logger.exception(...)` + a
    comment that reconnect proceeds intentionally on cleanup
    failure. Continue rather than re-raise: failing reconnect
    because cleanup of the already-dead handle failed would be
    strictly worse than logging and moving on.
  * actuator.py:622 — chain RuntimeError off the originating
    KeyError via `raise ... from err` so the UUID-map lookup
    failure stays in the traceback (PEP 3134 / Ruff B904).
  * actuator.py:777 — narrow `except Exception` in `apply_cap` to
    `except dcgm_structs.DCGMError`. The broad catch masked
    binding / programming defects (e.g. the v1.10
    `dcgmvalue.DCGM_INT32_BLANK` AttributeError documented in
    `_apply_cap_inner`) as normal apply failures, which both
    silenced the bug AND incorrectly ticked
    `apply_failures_total` — a metric reserved for "the DCGM
    write itself failed". Lazy-import `dcgm_structs` AFTER the
    metrics-required check so the constructor path stays runnable
    on hosts without the DCGM bindings (verified by
    `test_apply_cap_requires_metrics`).
  * power_agent.py:170 — compute `invalid_count` separately
    instead of `len(set) vs len(list)`. The set comprehension
    deduplicates, so the previous comparison would log
    "1 non-string entry" when uuids was `["a","a","b"]` (zero
    invalid entries, just a duplicate).

Helm (deploy/helm/charts/power-agent/):
  * templates/_helpers.tpl:117/129 — replace
    `ne $tag (trim $tag)` (and same for $digest) with
    `regexMatch "\\s" ...`. The trim check only caught
    leading/trailing whitespace; an internal-whitespace tag like
    `v1 .1.0` (copy/paste typo) slipped through and rendered an
    invalid OCI image reference, deferring failure from template
    time to kubelet pull time. CodeRabbit only flagged the tag
    side, but the same flaw was symmetric on the digest helper,
    so fix both for parity.
  * tests/validate_image_tag_test.yaml — update existing
    "leading/trailing whitespace" assertions to "contains
    whitespace" (the new helper message), and add four new cases:
    image.tag with internal whitespace, image.tag with embedded
    tab, image.digest with internal whitespace, and the existing
    " latest " ordering pin. Symmetric coverage across both
    fields.
  * README.md:49 — rewrite the "image tag must match appVersion"
    paragraph to anchor on `helm show chart … | grep appVersion`
    rather than hardcoding "chart v1.1.0 requires the v1.1.0
    image". The chart is on v1.2.0 with appVersion 1.1.0; the
    old wording was already stale and would re-stale on every
    image bump.

Docs (docs/design-docs/power-agent-helm-chart-plan.md):
  * §0 line 29 — promote `### Revision history` to
    `## Revision history` so the heading hierarchy goes h1 → h2
    (was h1 → h3, MD001).
  * §5.2 — bump "13 files / 2 helm-unittests" → "14 files /
    3 helm-unittests", add the v1.2.0
    `validate_image_tag_test.yaml` to the file list, replace
    the brittle LOC totals with a one-line `wc -l` recipe so
    the doc stops drifting on every cycle, and tag the three
    unlabeled fences as ```text``` (MD040).
  * §8 — bump the "Author all 13 chart files" checklist item
    to 14 and update the helm-unittest expected count from
    `24 passed` (v1.1.0) to `46 passed` (v1.2.0), with a note
    to rerun rather than hardcode.

Tests added:
  * test_apply_cap_propagates_non_dcgm_exception — regression
    guard for the narrowed apply_cap except: an AttributeError
    injected on `group.config.Set` MUST propagate, MUST NOT
    tick `apply_failures_total`, and MUST NOT touch the
    `_managed_gpu_indices` / `applied_limit_watts` bookkeeping
    (those run only on the success path).
  * (Existing `test_apply_cap_dcgm_generic_error_bumps_failure_metric_and_returns`
    already exercises the "DCGMError still caught" case; both
    assertions now bracket the narrowed exception contract.)

Verification:
  * pytest components/power_agent/tests/ -q → 170 passed
  * pre-commit run --files <touched 7 files> → all hooks pass
    (pytest-marker-report skipped — Windows fcntl issue, runs
    in Linux CI).
  * Linter (ReadLints) → no errors on any of the 7 touched
    files.
  * The DCGM `import dcgm_structs` is local to `apply_cap` and
    scoped after the metrics-required check, so
    `test_apply_cap_requires_metrics` still raises
    RuntimeError (not ModuleNotFoundError) on hosts without
    the bindings.

Refs: PR #9790 CodeRabbit re-review (2026-05-25).
Signed-off-by: Kai Ma <kaim@nvidia.com>
Signed-off-by: Kai Ma <kaim@nvidia.com>
Three high/medium findings from the PR9790 Codex adversarial review,
all confirmed against current code (PR9682/PR9683 fixed adjacent issues
but did not touch these).

power_agent.py
- _handle_sigterm now prunes _previously_managed after each successful
  restore_default and persists the pruned set before shutdown. Pre-fix
  the SIGTERM handler restored each GPU to default but never updated
  managed_gpus.json, so on the next startup orphan recovery could
  clobber a cap applied by another workflow (different DGD, manual
  nvidia-smi -pl, vendor firmware) on a now-idle GPU still recorded as
  agent-owned. The actuator branch uses actuator.get_uuid(gpu_idx);
  the raw-NVML fallback branch uses _nvml_uuid(handle). Failed
  restores skip the prune (cap may still be live; next-startup orphan
  recovery is our only chance to reset it). UUID-lookup and persist
  failures log a WARNING but never crash the SIGTERM handler.

- _resolve_cap_for_gpu rewritten so multi-pod GPUs require ALL pods to
  carry a parseable integer annotation, not just the non-None subset.
  Pre-fix the resolver filtered None BEFORE computing the agree-set,
  so a GPU shared by pod A (cap 480) and pod B (no annotation) silently
  resolved to 480 - pod A's annotation governed pod B's GPU usage, the
  exact cross-workload policy failure the multi-pod guard exists to
  contain. Any missing or invalid annotation in the multi-pod case now
  resolves to safe_default_watts and bumps the conflict counter.
  Single-pod paths unchanged.

- _list_pods_on_node propagates Kubernetes API errors instead of
  returning []. Pre-fix \"except Exception: return []\" made a transient
  apiserver outage, RBAC regression, or network blip indistinguishable
  from a genuinely empty node, silently dropping enforcement.
  reconcile_once now catches and skips the cycle, leaving previously-
  applied caps live. New Prometheus counter
  dynamo_power_agent_k8s_list_failures_total lets operators alert on
  \">0 over 5m\". Policy choice: skip-cycle (preserve last cap) rather
  than fail-closed to safe_default; transient hiccups should not
  cap-down healthy workloads.

Tests (+12 new, 1 flipped; 190 total passing):
- test_shutdown.py::TestSigtermPrunesManagedGpusState (5 tests):
  successful prune persists; failed restore retains UUID; UUID-lookup
  failure is logged and non-fatal; raw-NVML fallback prunes via
  _nvml_uuid; persist failure does not hang the agent.
- test_multi_pod_policy.py: flipped test_mixed_none_and_valid (was
  asserting buggy 480 inheritance; now asserts SAFE_DEFAULT +
  conflict); added test_multipod_invalid_annotation_on_one_pod_is_conflict
  and test_multipod_all_missing_is_conflict.
- test_reconcile_wiring.py: TestListPodsOnNodePropagatesErrors
  (3 tests, namespaced + cluster-wide + happy path);
  TestReconcileOnceK8sListFailure (2 tests, skip-cycle on failure +
  per-GPU run on success).

Signed-off-by: Kai Ma <kaim@nvidia.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rebase follow-up: dev-mode examples, helm-unittest fixtures, and design
docs still referenced nvcr.io/nvidia/dynamo/power-agent after #9682
standardised on nvcr.io/nvidia/ai-dynamo/power-agent.

Signed-off-by: Kai Ma <kaim@nvidia.com>
@kaim-eng kaim-eng force-pushed the pr1b/power-agent-dcgm-actuator branch from 464387c to 852ad7e Compare June 8, 2026 16:06
@kaim-eng

kaim-eng commented Jun 8, 2026

Copy link
Copy Markdown
Author

Rebased onto current pr1a/power-agent

This branch was 186 commits stale relative to main via its pr1a base. I've rebased the 13 DCGM-only commits (dc4cc53d63464387cd23) onto the current pr1a/power-agent tip (6cd82e8be1), resolving conflicts in:

  • components/power_agent/Dockerfile — merged pr1a's DOCKER_PROXY CI wiring with the multi-stage DCGM vendor build (pinned 4.5.1-1-ubuntu22.04, logger.py shim)
  • docs/design-docs/power-agent-helm-chart-plan.md — combined pr1a's Implemented/build-wiring status with v1.1.0/v1.2.0 dual-actuator + image.digest refresh
  • deploy/helm/charts/power-agent/{values.yaml,README.md} — kept image.digest hardening on top of pr1a's nvcr.io/nvidia/ai-dynamo/power-agent registry path

Plus one follow-up commit (852ad7ee60) aligning remaining dev-mode examples / helm-unittest fixtures / design docs to the ai-dynamo registry path.

What this PR delivers (review context)

Opt-in DCGM actuation alongside the NVML default from #9682:

  • Actuator protocol + NvmlActuator / DcgmActuator (components/power_agent/actuator.py)
  • Helm chart v1.3.0 (appVersion 1.2.0): agent.actuator (nvml|dcgm), agent.dcgm.{host,port,enforce}, image.digest with template-time validators + helm-unittest suite
  • Design doc: docs/design-docs/power-agent-dual-actuator.md

Verification (local, pre-push)

  • python3.10 -m pytest components/power_agent/tests/ -q190 passed (e2e parity excluded)
  • helm lint deploy/helm/charts/power-agent --set image.tag=v1.2.0 → 0 failures
  • helm template … --set agent.actuator=dcgm → renders cleanly
  • pre-commit content hooks pass; DCO signed on all 14 commits

All 9 CodeRabbit threads are resolved. Requesting a human review + approval so this can merge into pr1a/power-agent after #9682 lands.

Suggested merge order: #9682#9790#9683.

@kaim-eng kaim-eng requested a review from sttts June 8, 2026 16:07
The docs link check fails on README links to
docs.nvidia.com/dynamo/kubernetes-deployment/* (404 from GitHub
runners as of 2026-06-08). These are inherited from main, not
introduced by the DCGM actuator work, and match the existing
.lycheeignore pattern for external docs that fail from CI.

Signed-off-by: Kai Ma <kaim@nvidia.com>
Comment thread .lycheeignore Outdated
Comment thread components/power_agent/actuator.py
Comment thread components/power_agent/actuator.py
kaim-eng added 3 commits June 9, 2026 21:48
…n DCGM restore

Graceful SIGTERM restore silently dropped every cap in-cluster: the image runs 'python /app/power_agent.py' (module __main__) while actuator.py reaches the agent via 'import power_agent', so the two module copies each got their own _managed_gpu_indices. The actuator recorded caps into one copy; the SIGTERM handler restored from the other (always empty).

Hoist the mutable managed-GPU state (_managed_gpu_indices, _previously_managed, state path) into a standalone managed_state module imported by canonical name from both sites, so all copies converge on one set. Orphan reload now mutates in place instead of rebinding. Revert the __main__ delegation hack. Package managed_state.py into the image COPY and the dev-pod ConfigMap; bump chart 1.2.0 -> 1.3.0 and update README/NOTES/dev-pod docs.

Harden DcgmActuator.restore_default against post-reconnect GPU re-enumeration: snapshot the capped UUID at apply time and relocate to its current index before writing default TGP. On a proven index mismatch never fall back to the original (known-wrong) index; relocate, or skip without writing or pruning so cold-start orphan recovery retries. Best-effort restore at the original index only when identity is unreadable (mismatch unproven).

Adds regression tests (managed-state sharing, fresh-module-copy aliasing, DCGM relocate/skip/best-effort). Addresses PR #9790 sttts review (module-identity split, DCGM re-enumeration) and follow-up review findings.

Signed-off-by: Kai Ma <kaim@nvidia.com>
Signed-off-by: Kai Ma <kaim@nvidia.com>

# Conflicts:
#	components/power_agent/power_agent.py
…ersion 1.2.0

The plan header and layout still described chart version 1.2.0 /
appVersion 1.1.0 and a 46-test helm-unittest gate, but PR9790 bumped the
chart to 1.3.0 / appVersion 1.2.0 (managed_state.py module) and the
helm-unittest suite now reports 49 passing tests. Align the status
header, the Chart.yaml layout comment, and the validation-gate count.

Signed-off-by: Kai Ma <kaim@nvidia.com>
@kaim-eng kaim-eng requested a review from a team as a code owner June 10, 2026 03:00
kaim-eng added 4 commits June 9, 2026 23:33
…ion 1.2.0

The helm-chart-plan design doc still described the chart as
version/appVersion 1.1.0 with a 46-test helm-unittest snapshot and
pinned every install/validation example to image.tag=v1.1.0, while
Chart.yaml is now version 1.3.0 / appVersion 1.2.0 and the suite is
49 tests (verified via `helm unittest deploy/helm/charts/power-agent`).
Update the current-state claims (risk-register row 6, section 7.4), the
section 5.4 validation recipes, the section 1.4.2 / 5.3.1 install
examples, and the test-count snapshot.

Also make the validateImageTag helper error text track the chart
appVersion (printf "v%s" .Chart.AppVersion) rather than a hardcoded
v1.1.0 example, and note managed_state.py in the dev-pod ConfigMap
comment.

Signed-off-by: Kai Ma <kaim@nvidia.com>
…overage

Signed-off-by: Kai Ma <kaim@nvidia.com>
Resolve conflicts in components/power_agent/power_agent.py:
- _resolve_cap_for_gpu: keep strict multi-pod policy (PR9790 Codex
  finding #2) over base's lenient None-filtering docstring/intent; the
  merged code body enforces the strict policy.
- _reconcile_gpu: keep actuator-routed PID read + apply_cap (the DCGM
  dual-actuator path that is the purpose of this PR) over base's inline
  pynvml; folded base's persistent-cap rationale into the docstring.

release.yml and power-agent/README.md auto-merged.
Validated: components/power_agent/tests 206 passed.

Signed-off-by: Kai Ma <kaim@nvidia.com>
The file was committed before full CI pre-commit ran (workflows were
gated by copy-pr-bot), so black/isort line-wrapping of the PID_TO_UID
dict and two list comprehensions was never applied. Pure formatting,
no logic change; test_multi_dgd_topology 5 passed.

Signed-off-by: Kai Ma <kaim@nvidia.com>
@kaim-eng

Copy link
Copy Markdown
Author

Status update -- DCGM dual actuator. All three review threads are addressed and resolved:

  • lychee coverage narrowed to the 3 exact 404 URLs (anchored, so new/renamed links are still checked);
  • the __main__-vs-power_agent shared-state cap-leak fixed via managed_state.py (single cached module object);
  • DCGM re-enumeration handled by a UUID-keyed restore that relocates-or-skips on a proven index mismatch (0b0bca38b3).

CI green (snapshot/operator skip via path filters). Stacked on #9682 -- mergeable once that lands. Description refreshed (DCGM 4.5.1-1, chart 1.3.0/appVersion 1.2.0, 49 helm-unittest cases; NVML default described as behaviourally equivalent rather than byte-identical).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

actions deployment::k8s Relates to dynamo deployment in kubernetes documentation Improvements or additions to documentation feat size/XXL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants