feat(power-agent): per-node power-cap enforcement DaemonSet by kaim-eng · Pull Request #9682 · ai-dynamo/dynamo

kaim-eng · 2026-05-18T15:53:54Z

Part of the PR #9369 split plan.
This is PR 1 of 6 (PR 1a - Power Agent).

Predecessor: none (base = main)
Successor: #9683 (stacked on pr1a/power-agent; #9682 must land first)

Summary

Standalone Power Agent DaemonSet for per-node GPU power-cap enforcement, packaged as a Helm chart. The PR is reviewable in isolation from planner logic and has zero import dynamo.planner.* coupling.

Current branch shape: 26 files changed: net-new components/power_agent/ code and deploy/helm/charts/power-agent/ chart files, plus CI plumbing edits to 5 existing .github/ files.

components/power_agent/power_agent.py (566 LOC): NVML clamp helper, cgroup parser, multi-pod policy, SIGTERM restore, fail-safe reconcile (_list_pods_on_node -> Optional[list]), and UUID-gated orphan-cap restoration.
components/power_agent/Dockerfile: slim Python runtime; libnvidia-ml.so is injected at runtime by runtimeClassName: nvidia.
components/power_agent/tests/: 47 unit tests covering apply_cap, cgroup parsing, multi-pod policy, reconcile fail-safe, and shutdown.
deploy/helm/charts/power-agent/: Helm chart with DaemonSet, dev pod, ServiceAccount, and RBAC. RBAC renders namespaces through {{ .Release.Namespace }} and supports cluster-scoped or namespace-restricted mode.
.github/: power_agent path filter plus dedicated power-agent image build/push wiring in PR, post-merge, and release workflows.

Reviewer context:

Design context: docs/design-docs/powerplanner-design.md section 7 (lands later in the stack; readable from pr5/docs-devenv).
Planner contract: the only planner-facing coupling is the dynamo.nvidia.com/gpu-power-limit pod annotation.
Split-plan context: sections 2.1 and 2.1.2.

Validation

Required/current GitHub checks pass at the current tip; rust jobs and team-request are intentionally skipped by filters.
components/power_agent/tests/: 47 unit tests in-tree. The last full dev-pod run before the reconcile fail-safe additions was 43 passed on 2026-05-18; CI is green at the current tip.
Pre-commit on origin/main..pr1a/power-agent: 14 hooks pass, 4 skipped. pytest-marker-report was reconfirmed in a Linux pod with Missing sets: 0. The Windows host has the known pre-existing POSIX-only fcntl import issue in untouched test utilities; Linux CI is unaffected.
DCO: commits are signed off.

Merge Strategy

Rebase-and-merge, no squash, preserving the single PR commit in main history per split-plan section 4.3.

Stack Order

Merge #9682 first. #9683 and #9790 are both stacked on pr1a/power-agent and become actionable after #9682 lands.

copy-pr-bot · 2026-05-18T15:53:58Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-05-18T15:59:24Z

Walkthrough

This PR introduces a complete Kubernetes DaemonSet agent that enforces per-GPU NVML power caps on worker nodes. The daemon reconciles every 15 seconds, mapping running GPU processes to pod UIDs via cgroup parsing, reading power-limit annotations, resolving multi-pod conflicts, and applying caps via NVML with Prometheus metrics and graceful shutdown support.

Changes

Power Agent: GPU NVML Power Cap Enforcement DaemonSet

Layer / File(s)	Summary
Module foundation and utilities `components/power_agent/power_agent.py` (lines 1–213)	Module header, optional lazy imports for NVML/Kubernetes/Prometheus with fallbacks, logging, constants, cgroup parsing regex for extracting pod UIDs from `/proc/{pid}/cgroup`, atomic JSON persistence for managed GPU state, UUID normalization for pynvml version compatibility, and Prometheus metrics infrastructure with noop fallback when unavailable.
NVML cap application and multi-pod policy `components/power_agent/power_agent.py` (lines 220–404), `components/power_agent/tests/test_apply_cap.py`, `components/power_agent/tests/test_multi_pod_policy.py`	NVML helper functions to clamp watts to device min/max constraints and apply power caps with logging and metric updates; multi-pod-per-GPU resolution policy that applies an agreed cap when all pods match, otherwise applies safe default on conflicts or missing/invalid annotations, incrementing error counters; unit tests validating clamping boundaries, NVML error tolerance, UUID handling (`bytes` vs `str`), and all policy scenarios (single pod, agreement, conflict, invalid annotations, mixed None/valid).
Shutdown and startup recovery `components/power_agent/power_agent.py` (lines 280–337), `components/power_agent/tests/test_shutdown.py`	SIGTERM/SIGINT handler restores managed GPU power limits to NVML defaults and signals main loop exit; startup orphan recovery loads persisted GPU UUIDs, checks idle status, restores default caps for idle GPUs, and updates persisted state; unit tests validating per-GPU restoration, graceful shutdown with no managed GPUs, and NVML error handling.
PowerAgent orchestration `components/power_agent/power_agent.py` (lines 411–577)	PowerAgent class initializes NVML/Kubernetes clients, loads orphan state, lists node pods, maps running PIDs to pod UIDs, builds UID-to-annotation mappings, resolves GPU cap via policy, applies caps via NVML, and persists managed state; blocking reconcile loop runs every 15 seconds; CLI main() entry point accepts `--safe-default-watts`, optional node/namespace scope, and Prometheus port, then starts daemon.
Cgroup parser validation `components/power_agent/tests/test_cgroup_parser.py`	Unit tests for cgroup UID extraction across cgroup v1/v2 with systemd and cgroupfs layouts, QoS classes (Guaranteed/Burstable/BestEffort), and non-Kubernetes inputs; validates OSError handling and first-match-wins behavior when multiple cgroup lines exist.
Kubernetes deployment `deploy/power_agent/rbac.yaml`, `deploy/power_agent/daemonset.yaml`, `deploy/power_agent/dev-pod.yaml`	RBAC ServiceAccount, ClusterRole with pod read-access, and ClusterRoleBinding; production DaemonSet runs on GPU-labeled nodes with hostPID, privileged container, nvidia runtime, environment variables for safe default (500 W) and node name, resource limits, and volume mounts for `/proc` (read-only) and persisted state at `/var/lib/dynamo-power-agent`; development Pod harness for manual testing and validation on live clusters.
Documentation and CI integration `components/power_agent/README.md`, `.github/filters.yaml`	README documents Power Agent's 15-second reconciliation, cgroup UID extraction, annotation reading, NVML power cap enforcement, troubleshooting steps (annotations, logs), Prometheus metrics (applied limit, conflict/safe-default/failure counters), and shutdown/startup semantics; CI filter update triggers planner jobs on power_agent component and deployment changes.

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 27.38% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely describes the main change: introduction of a per-node power-cap enforcement DaemonSet for the Power Agent.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	PR description is comprehensive with clear overview, detailed summary of changes, specific file guidance, related issues reference, and explicit stack ordering context.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 5

🧹 Nitpick comments (1)

components/power_agent/power_agent.py (1)
537-539: ⚡ Quick win

Move argparse import to module scope.

Line 538 imports inside main(), which violates the repo’s Python import placement rule.
Proposed fix
+import argparse
 import json
 import logging
@@
 def main() -> None:
-    import argparse
-
     parser = argparse.ArgumentParser(description="Dynamo Power Agent DaemonSet")
As per coding guidelines: "Keep all imports at the top of each file (flag any import inside functions/classes)."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/power_agent/power_agent.py` around lines 537 - 539, The argparse
import is inside the main() function; move the import to module scope by adding
"import argparse" at the top of the file and removing the in-function import in
main(), ensuring any type or usage remains correct (adjust other imports if
needed) so linting and the repo rule about top-level imports are satisfied.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@components/power_agent/power_agent.py`:
- Around line 119-124: The function _load_previously_managed_gpus should
defensively handle malformed or unreadable state files: when reading
_MANAGED_STATE_PATH, catch OSError in addition to
FileNotFoundError/JSONDecodeError, load the JSON into a variable (e.g., data =
json.load(f)), verify isinstance(data, dict) before calling
data.get("managed_uuids", []), and if the root is not a dict (or managed_uuids
is not an iterable of strings) return an empty set; ensure any
TypeError/ValueError from unexpected types is handled and results in returning
set() rather than letting startup crash.
- Around line 294-297: Replace the broad except that swallows errors from the
pynvml.nvmlShutdown() call with a targeted handler: catch pynvml.NVMLError (or
the specific pynvml error class) as e, log the exception with the module/class
logger (e.g., logger.exception("NVML shutdown failed: %s", e) or
logging.exception(...)) and then re-raise the error to avoid silent failures;
modify the try/except around pynvml.nvmlShutdown() accordingly so only
NVML-specific errors are caught, logged, and propagated.
- Around line 494-501: The loop that builds pod_annotations appends one entry
per GPU process, causing pods with multiple processes to be counted multiple
times; change the logic in the procs loop (using _extract_pod_uid_from_cgroup
and uid_to_annotation) to deduplicate by pod UID before applying multi-pod
policy — e.g., track seen UIDs (or build a uid->annotation map) and only append
one (uid, annotation) pair per unique UID so each pod is counted once when
evaluating multi-pod warnings/metrics.

In `@deploy/power_agent/daemonset.yaml`:
- Around line 16-35: The DaemonSet metadata.namespace is hardcoded to "default",
which causes mismatches with the RBAC/service account namespace; update the
manifest to use the parameterized namespace variable (e.g.
${POWER_AGENT_NAMESPACE}) instead of "default" and ensure the ServiceAccount
referenced by spec.serviceAccountName ("dynamo-power-agent") is created/bound in
that same parameterized namespace so RBAC bindings resolve correctly; locate
metadata.name ("dynamo-power-agent"), metadata.namespace, and
spec.serviceAccountName in the template to make the change.
- Around line 55-58: Replace the mutable image tag
"nvcr.io/nvidia/dynamo/power-agent:latest" with an immutable reference (a
specific version tag or an image digest), e.g.
"nvcr.io/nvidia/dynamo/power-agent:vX.Y.Z" or
"nvcr.io/nvidia/dynamo/power-agent@sha256:<digest>", so deployments are
reproducible; keep or verify imagePullPolicy (e.g. IfNotPresent) is appropriate
for pinned images and update the inline comment to note the image is
intentionally pinned.

---

Nitpick comments:
In `@components/power_agent/power_agent.py`:
- Around line 537-539: The argparse import is inside the main() function; move
the import to module scope by adding "import argparse" at the top of the file
and removing the in-function import in main(), ensuring any type or usage
remains correct (adjust other imports if needed) so linting and the repo rule
about top-level imports are satisfied.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f9faddf8-30e0-4a7b-9055-b45244b2a979

📥 Commits

Reviewing files that changed from the base of the PR and between f20ff4e and 3cde449.

📒 Files selected for processing (11)

.github/filters.yaml
components/power_agent/README.md
components/power_agent/power_agent.py
components/power_agent/tests/__init__.py
components/power_agent/tests/test_apply_cap.py
components/power_agent/tests/test_cgroup_parser.py
components/power_agent/tests/test_multi_pod_policy.py
components/power_agent/tests/test_shutdown.py
deploy/power_agent/daemonset.yaml
deploy/power_agent/dev-pod.yaml
deploy/power_agent/rbac.yaml

Folds deploy/power_agent/{daemonset,rbac,dev-pod}.yaml into a single Helm chart at deploy/helm/charts/power-agent/, resolving the three CodeRabbit findings on PR #9682: * hardcoded metadata.namespace=default -> {{ .Release.Namespace }} * mutable image :latest -> required image.tag with fail-fast validator * `${POWER_AGENT_NAMESPACE}` envsubst placeholder -> native Helm templating The chart supports three deployment shapes selectable via values: production DaemonSet (default, cluster-wide RBAC), namespace-restricted production (Role+RoleBinding), and an in-cluster dev-iteration Pod mounting power_agent.py from a ConfigMap. Three template-time validators reject foot-guns at install time: empty image.tag, mutex violations between daemonset.enabled and dev.enabled, and dev mode without a pinned dev.nodeName. Dev mode also automatically forces namespace-scoped RBAC (least privilege), leveraging power_agent.py's --namespace flag. Design rationale, scope decisions, and review-feedback responses are captured in docs/design-docs/power-agent-helm-chart-plan.md, committed alongside the chart. components/power_agent/README.md flips its install recipe to ``helm install``, and the planner CI filter (.github/filters.yaml) is retargeted from deploy/power_agent/** to deploy/helm/charts/power-agent/**. Two examples/deployments/powerplanner/*.yaml header references live on PR #9687 and will be updated during that PR's cascade rebase per plan section 5.3. Validated locally: helm lint -> 0 errors helm template (3 positive exercises) -> expected resources helm template (3 negative exercises) -> expected fail-fast errors components/power_agent/tests/ -> 43/43 passed .github/scripts/test-filters.js -> 20/20 passed pre-commit (cross-cutting hooks) -> all applicable passed Signed-off-by: Kai Ma <kaim@nvidia.com>

Updates the three reference sites in examples/deployments/powerplanner/ that previously instructed users to ``kubectl apply -f deploy/power_agent/...``, flipping them to the new Helm chart at deploy/helm/charts/power-agent/ that landed in PR #9682: * disagg-power-aware.yaml header recipe * README.md Prerequisites + verify section * MULTI_DGD.md file index Also updates the verify-pods label selector from the legacy ``app=dynamo-power-agent`` to the chart-emitted ``app.kubernetes.io/name=power-agent``. Design docs (powerplanner-design.md, power-agent-dcgm-actuator.md, pr9369-split-plan.md, power-agent-helm-chart-plan.md) still reference the old paths in their historical / architectural narrative sections, which is intentional -- those describe the pre-chart state and the transition rationale, not current deployment instructions. Part of the PR #9369 cascade following PR #9682''s Helm chart landing. Signed-off-by: Kai Ma <kaim@nvidia.com>

Aligns docs/design-docs/power-agent-helm-chart-plan.md with the v1.1.0 chart that PR #9790 ships. The plan was authored for the NVML-only v1.0.0 chart in PR #9682 and went stale once PR #9790 layered agent.actuator + agent.dcgm.* + validateActuator + validateEnforce + the helm-unittest suite on top. Two changelog rows capture the refresh: v1.3 - first pass: 8 reviewer findings closed (3 blocking, 4 major, 2 medium, 1 low) on Status header, values surface, dev ConfigMap recipe, image-tag pinning, helm-unittest gating, internal-dev-doc references, daemonset name, file/LOC accounting. v1.4 - second pass: 5 follow-up findings closed (2 major, 2 medium, 1 low) on Status self-contradiction, stale section 4.2 dev-block comment, overstated helm-unittest coverage prose, unrunnable section 5.4 positive helm template overlays (missing --set image.tag), and the lingering filename reference in the v1.3 changelog. No design reversal: every section 6 decision and the chart shape are unchanged. Only the values surface (extended), helper set (extended with two template-time validators), and validation-gate list (helm-unittest now required) grew. Every claim in the refreshed doc was verified against on-disk state: power_agent.py:706-804 for the 8-flag CLI surface, values.yaml for the dev-block recipe, _helpers.tpl for the five-helper set, and the two tests/validate_*_test.yaml files for the 24 enumerated unittest cases. Signed-off-by: Kai Ma <kaim@nvidia.com>

@sha256

…ical OCI form Addresses three iterative PR9682 CodeRabbit review findings on the chart's image-pinning surface. Bumps Chart.yaml 1.1.0 -> 1.2.0 (minor per SemVer - image.digest is purely additive; existing tag-pinned installs render byte-identically and need no change). Round 1 (original PR9682 finding): image.tag=latest was rejected only at the raw-manifest layer, not at the chart layer. helm template --set image.tag=latest happily rendered nvcr.io/nvidia/dynamo/ power-agent:latest because validateImageTag only checked "if not .Values.image.tag". Now case-insensitively rejects the literal "latest" (LATEST, Latest, " latest " all fail) while still accepting release tags that contain the substring (v1.0.0-latest-rc). Round 2 (digest path rendered an invalid OCI reference): operators were advised in values.yaml + README + tests to set image.tag= sha256:abc... for digest pinning. This rendered repo:sha256:... - INVALID per the OCI spec, which requires repo@sha256:... for digest form (the second ":" in the rendered string parses as repo + tag "sha256" with a stray suffix; kubelet manifest pulls fail with an opaque error). Three changes close this: - values.yaml: add image.digest field with per-rule comment block explaining the OCI form, the 64-hex requirement, the mutex with image.tag, and the PR #9682 rationale. - _helpers.tpl: split validateImageTag into a full rule set (XOR(tag, digest), whitespace rejection on both, latest rejection on tag, sha256: prefix rejection on tag, exact- 64-hex regex on digest) and add a new imageRef helper that emits {repo}@{digest} when digest is set, else {repo}:{tag}. - daemonset.yaml: image: line now routes through imageRef instead of hard-coding the ":" join, so the canonical OCI form is the only thing that can ship. Round 3 (validator was too lenient): - Truncated SHA-256 digests slipped through: regex was ^sha256:[0-9a-fA-F]{32,}$, so half-digests (a common pattern when an operator copies the first 8 bytes from `docker images --digests` output) rendered as repo@sha256:<truncated>. The kubelet then fails the pull with an opaque manifest-mismatch error. Tightened to {64} exactly (SHA-256 is 32 bytes x 2 nybbles). - Whitespace-padded values rendered raw: --set-string 'image.tag= v1.1.0 ' produced image: "repo: v1.1.0 " because the latest comparison trimmed but the renderer used the raw value. Choice: reject (don't silently normalize) so the operator sees their --set quoting typo. Applies symmetrically to image.digest. tests/validate_image_tag_test.yaml (new, 22 cases - chart helm-unittest count grew 24 -> 46): - Negative: unset (both empty), latest in 4 case variants, whitespace on tag (leading/trailing/both), tag+digest mutex, sha256: prefix on tag, non-digest on digest, 32/63/65-hex digest (off-by-one + half-length), whitespace on digest. - Positive: pinned release tag renders {repo}:{tag}; v1.0.0-latest-rc accepted (substring match guard); 64-hex digest renders the canonical {repo}@sha256:... form; uppercase-hex digest preserved verbatim. README updates: production-install section now shows both tag and digest invocations side by side with a callout that digests live on image.digest, not image.tag. Values table gains the image.digest row. Troubleshooting section adds three new error-message entries covering missing-pin, invalid-digest, and mutex-violation cases. Signed-off-by: Kai Ma <kaim@nvidia.com>

Aligns docs/design-docs/power-agent-helm-chart-plan.md with the v1.2.0 chart that the preceding commit ships. The plan was authored for the v1.0.0 NVML-only chart in PR #9682, refreshed to v1.1.0 in the PR #9790 dual-actuator work, and went stale again once v1.2.0 landed image.digest + the canonical-OCI imageRef helper on top. Four reviewer findings closed (v1.5 changelog row covers all four): (1) Truncated SHA-256 digests slipped through the v1.1.0 helper - already fixed in the preceding chart commit; this commit propagates the rule into the §4.4 helper snippet and the §5.4 validation gates. (2) Whitespace-padded tags rendered raw image references - same treatment: §4.4 documents the trim-then-reject contract. (3) §4.3 / §4.4 / §5.4 stale on image.digest: - §4.3 CodeRabbit-comment row rewritten to describe both helpers (validateImageTag + imageRef), both fields (tag + digest), the canonical repo@digest form, and the PR9682 follow-up that added the separate field. - §4.4 helper snippet replaced with the actual v1.2.0 validator (full rule set + per-rule rationale comments) plus the new imageRef helper. - §5.4 expected helm-unittest output bumped 24 -> 46 passed with a one-line breakdown of the +22 new cases. (4) §4.1 / §4.2 stale on file count + helper list + values surface: - §4.1 said 13 files (omitted .helmignore and the new validate_image_tag_test.yaml). Corrected to 14 files (7 templates + 3 helm-unittests + 4 root files). The _helpers.tpl helper-list line gained imageRef and chart entries. The daemonset.yaml annotation now mentions it routes through imageRef. LOC estimate bumped ~1,430 -> ~1,700 with a breakdown of what v1.2.0 added on top of v1.0.0 / v1.1.0. - §4.2 values snippet had only image.tag with no image.digest. Added the field with the same per-rule comment block that ships in values.yaml (OCI form, 64-hex requirement, mutex with image.tag, PR9682 rationale). Status header bumped to chart v1.2.0 with a one-line v1.2.0 rationale (additive opt-in field, no breaking change to existing tag-pinned installs). Revision-history table gains the v1.5 entry summarising all four findings; older v1.1 - v1.4 entries untouched. No design reversal: every §6 decision and the chart shape are unchanged. Only the values surface (one additive field), helper set (two helpers: imageRef new, validateImageTag rewritten), and helm-unittest count (24 -> 46) grew. Every claim in the refreshed doc was verified against on-disk state: Chart.yaml for the version bump, values.yaml for the image.digest field comment, _helpers.tpl for both helpers, tests/validate_image_tag_test.yaml for the 22 enumerated cases (verified via `helm unittest`). Signed-off-by: Kai Ma <kaim@nvidia.com>

When rbac.namespaceRestricted=true the chart renders a namespace-scoped Role + RoleBinding (templates/role.yaml + rolebinding.yaml) instead of the cluster-scoped pair, but the DaemonSet's container command never passes --namespace to the agent. power_agent.py:438-446 therefore takes its else branch — list_pod_for_all_namespaces — which the namespace-only token rejects with 403 Forbidden on every reconcile. Reconciles never advance and no caps are ever applied (silent failure: the agent doesn't exit; pods are just never enumerated). The --namespace CLI flag itself already exists (power_agent.py:542) and the list_namespaced_pod branch (:439-442) works correctly — the dev-pod template (templates/dev-pod.yaml:57) has been wiring this all along via --namespace=$POD_NAMESPACE. This commit extends the same downward-API POD_NAMESPACE env-var pattern to the production DaemonSet, gated on the existing power-agent.effectiveNamespaceRestricted helper so both: - production mode with rbac.namespaceRestricted=true - dev mode (forces effective=true unless overridden) produce a DaemonSet whose argv matches the RBAC scope its token holds. Verified by helm template against both modes: - default (false) → no --namespace, no POD_NAMESPACE, ClusterRole - true → --namespace=$(POD_NAMESPACE), POD_NAMESPACE env via downward API, Role + RoleBinding - helm lint passes in both modes Refs: PR #9682 review, Power Agent live-test finding #3. Signed-off-by: Kai Ma <kaim@nvidia.com>

Steady-state RSS of the agent (pynvml + kubernetes client + prometheus exporter) is ~80 MiB on an idle 8-GPU node — the 128Mi default left only ~48 MiB of headroom, breaking under both routine inspection (`kubectl exec --container=power-agent -- python -c "..."` re-imports the kubernetes client in the same cgroup; observed RSS spike > 128 MiB, two OOM-kills during 2026-05-21 live testing on AKS dpp-dev-env) and transient peaks in multi-pod conflict resolution where the UID → annotation dict + PID → UUID map are both held in memory simultaneously (linear in pod count per node). Bump: limits.memory: 128Mi → 256Mi (~3× steady-state RSS headroom) requests.memory: 64Mi → 96Mi (more honest scheduling floor) CPU envelope unchanged — agent reconciles every 15 s and the heavy calls (pynvml + list_pod_for_all_namespaces) are well under 200m on fleet inspection (median: 12 m, p99: 78 m). Fleet cost: at 1000 GPU nodes the extra 128 MiB per node is 125 GiB, which is negligible compared to the operational cost of a silently OOM-killed power-cap controller (the failure mode is a node whose caps slowly drift back to safeDefaultWatts on the next reconcile that doesn't fit the budget, with no obvious alert signal — the agent's Pod restart counter is the only indicator). Verified: helm template + helm lint clean. Refs: PR #9682 review, Power Agent live-test finding #4. Signed-off-by: Kai Ma <kaim@nvidia.com>

Folds deploy/power_agent/{daemonset,rbac,dev-pod}.yaml into a single Helm chart at deploy/helm/charts/power-agent/, resolving the three CodeRabbit findings on PR #9682: * hardcoded metadata.namespace=default -> {{ .Release.Namespace }} * mutable image :latest -> required image.tag with fail-fast validator * `${POWER_AGENT_NAMESPACE}` envsubst placeholder -> native Helm templating The chart supports three deployment shapes selectable via values: production DaemonSet (default, cluster-wide RBAC), namespace-restricted production (Role+RoleBinding), and an in-cluster dev-iteration Pod mounting power_agent.py from a ConfigMap. Three template-time validators reject foot-guns at install time: empty image.tag, mutex violations between daemonset.enabled and dev.enabled, and dev mode without a pinned dev.nodeName. Dev mode also automatically forces namespace-scoped RBAC (least privilege), leveraging power_agent.py's --namespace flag. Design rationale, scope decisions, and review-feedback responses are captured in docs/design-docs/power-agent-helm-chart-plan.md, committed alongside the chart. components/power_agent/README.md flips its install recipe to ``helm install``, and the planner CI filter (.github/filters.yaml) is retargeted from deploy/power_agent/** to deploy/helm/charts/power-agent/**. Two examples/deployments/powerplanner/*.yaml header references live on PR #9687 and will be updated during that PR's cascade rebase per plan section 5.3. Validated locally: helm lint -> 0 errors helm template (3 positive exercises) -> expected resources helm template (3 negative exercises) -> expected fail-fast errors components/power_agent/tests/ -> 43/43 passed .github/scripts/test-filters.js -> 20/20 passed pre-commit (cross-cutting hooks) -> all applicable passed Signed-off-by: Kai Ma <kaim@nvidia.com>

When rbac.namespaceRestricted=true the chart renders a namespace-scoped Role + RoleBinding (templates/role.yaml + rolebinding.yaml) instead of the cluster-scoped pair, but the DaemonSet's container command never passes --namespace to the agent. power_agent.py:438-446 therefore takes its else branch — list_pod_for_all_namespaces — which the namespace-only token rejects with 403 Forbidden on every reconcile. Reconciles never advance and no caps are ever applied (silent failure: the agent doesn't exit; pods are just never enumerated). The --namespace CLI flag itself already exists (power_agent.py:542) and the list_namespaced_pod branch (:439-442) works correctly — the dev-pod template (templates/dev-pod.yaml:57) has been wiring this all along via --namespace=$POD_NAMESPACE. This commit extends the same downward-API POD_NAMESPACE env-var pattern to the production DaemonSet, gated on the existing power-agent.effectiveNamespaceRestricted helper so both: - production mode with rbac.namespaceRestricted=true - dev mode (forces effective=true unless overridden) produce a DaemonSet whose argv matches the RBAC scope its token holds. Verified by helm template against both modes: - default (false) → no --namespace, no POD_NAMESPACE, ClusterRole - true → --namespace=$(POD_NAMESPACE), POD_NAMESPACE env via downward API, Role + RoleBinding - helm lint passes in both modes Refs: PR #9682 review, Power Agent live-test finding #3. Signed-off-by: Kai Ma <kaim@nvidia.com>

Steady-state RSS of the agent (pynvml + kubernetes client + prometheus exporter) is ~80 MiB on an idle 8-GPU node — the 128Mi default left only ~48 MiB of headroom, breaking under both routine inspection (`kubectl exec --container=power-agent -- python -c "..."` re-imports the kubernetes client in the same cgroup; observed RSS spike > 128 MiB, two OOM-kills during 2026-05-21 live testing on AKS dpp-dev-env) and transient peaks in multi-pod conflict resolution where the UID → annotation dict + PID → UUID map are both held in memory simultaneously (linear in pod count per node). Bump: limits.memory: 128Mi → 256Mi (~3× steady-state RSS headroom) requests.memory: 64Mi → 96Mi (more honest scheduling floor) CPU envelope unchanged — agent reconciles every 15 s and the heavy calls (pynvml + list_pod_for_all_namespaces) are well under 200m on fleet inspection (median: 12 m, p99: 78 m). Fleet cost: at 1000 GPU nodes the extra 128 MiB per node is 125 GiB, which is negligible compared to the operational cost of a silently OOM-killed power-cap controller (the failure mode is a node whose caps slowly drift back to safeDefaultWatts on the next reconcile that doesn't fit the budget, with no obvious alert signal — the agent's Pod restart counter is the only indicator). Verified: helm template + helm lint clean. Refs: PR #9682 review, Power Agent live-test finding #4. Signed-off-by: Kai Ma <kaim@nvidia.com>

… log, pod-UID dedup, argparse) Folds the four CodeRabbit findings on the foundation PR back into this branch so each fix lives with the code it changes, rather than leaking into the downstream DCGM-actuator PR (#9790): * _load_previously_managed_gpus: catch OSError (not just FileNotFoundError) and validate that the JSON root is a dict and managed_uuids is a list. Malformed state files now log a warning and return an empty set instead of crashing the agent at startup. * _handle_sigterm: replace 'except Exception: pass' on pynvml.nvmlShutdown() with logger.exception so shutdown-time NVML faults appear in pod logs. We still fall through to _shutdown.set() so SIGTERM never hangs the container. * _reconcile_gpu: dedup the (pod_uid, annotation) list by UID before applying multi-pod policy. A single pod with N PIDs on one GPU was being counted as N pods, falsely tripping the multi-pod-conflict branch and the multi_pod_gpu_total metric. * main(): move 'import argparse' to module scope per the project's import-placement convention. Regression coverage: existing components/power_agent/tests/ suite (43 tests) still passes locally; behavior-specific tests for these four fixes already live on PR #9790 and remain there. Signed-off-by: Kai Ma <kaim@nvidia.com>

@sttts

The Helm chart's values.yaml points image.repository at nvcr.io/nvidia/dynamo/power-agent, but nothing in the tree built or published that image (PR #9682 review by @sttts: "where is it built?"). Add components/power_agent/Dockerfile (single-stage python:3.11-slim, NVML injected at runtime via runtimeClassName: nvidia, deps pynvml / kubernetes / prometheus-client) and a dedicated power-agent build job in pr.yaml + post-merge-ci.yml, gated on a new power_agent changed-files filter, following the snapshot-agent precedent. Signed-off-by: Kai Ma <kaim@nvidia.com>

@sttts

The design-doc header still read "Draft v1 - no code authored yet" (PR #9682 review by @sttts: "needs update?"), but the chart, templates, and power_agent.py are all in the tree on this PR. Flip Status to "Implemented", point at the shipped chart (v1.0.0) and the new Dockerfile / power-agent CI job, and add a v1.3 revision-history row. Signed-off-by: Kai Ma <kaim@nvidia.com>

@sttts

values.yaml image.repository pointed at nvcr.io/nvidia/dynamo/power-agent, but every other Dynamo image (vllm-runtime, sglang-runtime, frontend, dynamo-planner, snapshot-agent) is published under nvcr.io/nvidia/ai-dynamo/. The dynamo/ path was an unintentional divergence (PR #9682 review by @sttts). Align values.yaml, the chart README values table, and the design-doc values-surface block to nvcr.io/nvidia/ai-dynamo/power-agent. Signed-off-by: Kai Ma <kaim@nvidia.com>

Adds the Power Agent DaemonSet (privileged, hostPID) that watches pods for the dynamo.nvidia.com/gpu-power-limit annotation, parses /proc/{pid}/cgroup to recover pod UIDs (handles all K8s QoS x cgroup driver x CRI runtime variants), and calls nvmlDeviceSetPowerManagementLimit per physical GPU. Includes NVML cap clamping to SKU-specific [min_w, max_w] bounds, UUID-gated orphan-cap restoration on startup, SIGTERM-handler default- TGP restore, and a multi-pod-per-GPU policy that falls back to power_agent_safe_default_watts on conflict. 43 unit tests covering apply_cap, cgroup_parser, multi_pod_policy, shutdown. Standalone component - no coupling to planner internals. CI filter (.github/filters.yaml) extends the planner group to cover the new component + deploy paths so changes here trip the planner job set. Part of the PR #9369 split (PR 1a of 6). See docs/design-docs/pr9369-split-plan.md. Signed-off-by: Kai Ma <kaim@nvidia.com>

Folds deploy/power_agent/{daemonset,rbac,dev-pod}.yaml into a single Helm chart at deploy/helm/charts/power-agent/, resolving the three CodeRabbit findings on PR #9682: * hardcoded metadata.namespace=default -> {{ .Release.Namespace }} * mutable image :latest -> required image.tag with fail-fast validator * `${POWER_AGENT_NAMESPACE}` envsubst placeholder -> native Helm templating The chart supports three deployment shapes selectable via values: production DaemonSet (default, cluster-wide RBAC), namespace-restricted production (Role+RoleBinding), and an in-cluster dev-iteration Pod mounting power_agent.py from a ConfigMap. Three template-time validators reject foot-guns at install time: empty image.tag, mutex violations between daemonset.enabled and dev.enabled, and dev mode without a pinned dev.nodeName. Dev mode also automatically forces namespace-scoped RBAC (least privilege), leveraging power_agent.py's --namespace flag. Design rationale, scope decisions, and review-feedback responses are captured in docs/design-docs/power-agent-helm-chart-plan.md, committed alongside the chart. components/power_agent/README.md flips its install recipe to ``helm install``, and the planner CI filter (.github/filters.yaml) is retargeted from deploy/power_agent/** to deploy/helm/charts/power-agent/**. Two examples/deployments/powerplanner/*.yaml header references live on PR #9687 and will be updated during that PR's cascade rebase per plan section 5.3. Validated locally: helm lint -> 0 errors helm template (3 positive exercises) -> expected resources helm template (3 negative exercises) -> expected fail-fast errors components/power_agent/tests/ -> 43/43 passed .github/scripts/test-filters.js -> 20/20 passed pre-commit (cross-cutting hooks) -> all applicable passed Signed-off-by: Kai Ma <kaim@nvidia.com>

@grahamking

…S.txt copyright-checks (regex requiring `#`-prefixed SPDX lines) rejected the Go-template-comment header in _helpers.tpl and the missing header in NOTES.txt on commit 8e490a7. Switch both to the same 2-line `#`-prefixed form already used by the other 9 chart files, aligning with @grahamking's PR review preference (minimal SPDX, no Apache boilerplate). For NOTES.txt this adds two visible lines to the post-`helm install` banner, matching the project convention established by deploy/helm/charts/platform/templates/NOTES.txt. For _helpers.tpl the header sits outside any `{{- define -}}` block, so it is never invoked at render time and is invisible to consumers despite not being inside a Go template comment. Re-validated locally: helm lint -> 0 errors helm template (default mode) -> 4 expected kinds pre-commit on the two touched files -> all applicable hooks pass Signed-off-by: Kai Ma <kaim@nvidia.com>

When rbac.namespaceRestricted=true the chart renders a namespace-scoped Role + RoleBinding (templates/role.yaml + rolebinding.yaml) instead of the cluster-scoped pair, but the DaemonSet's container command never passes --namespace to the agent. power_agent.py:438-446 therefore takes its else branch — list_pod_for_all_namespaces — which the namespace-only token rejects with 403 Forbidden on every reconcile. Reconciles never advance and no caps are ever applied (silent failure: the agent doesn't exit; pods are just never enumerated). The --namespace CLI flag itself already exists (power_agent.py:542) and the list_namespaced_pod branch (:439-442) works correctly — the dev-pod template (templates/dev-pod.yaml:57) has been wiring this all along via --namespace=$POD_NAMESPACE. This commit extends the same downward-API POD_NAMESPACE env-var pattern to the production DaemonSet, gated on the existing power-agent.effectiveNamespaceRestricted helper so both: - production mode with rbac.namespaceRestricted=true - dev mode (forces effective=true unless overridden) produce a DaemonSet whose argv matches the RBAC scope its token holds. Verified by helm template against both modes: - default (false) → no --namespace, no POD_NAMESPACE, ClusterRole - true → --namespace=$(POD_NAMESPACE), POD_NAMESPACE env via downward API, Role + RoleBinding - helm lint passes in both modes Refs: PR #9682 review, Power Agent live-test finding #3. Signed-off-by: Kai Ma <kaim@nvidia.com>

Steady-state RSS of the agent (pynvml + kubernetes client + prometheus exporter) is ~80 MiB on an idle 8-GPU node — the 128Mi default left only ~48 MiB of headroom, breaking under both routine inspection (`kubectl exec --container=power-agent -- python -c "..."` re-imports the kubernetes client in the same cgroup; observed RSS spike > 128 MiB, two OOM-kills during 2026-05-21 live testing on AKS dpp-dev-env) and transient peaks in multi-pod conflict resolution where the UID → annotation dict + PID → UUID map are both held in memory simultaneously (linear in pod count per node). Bump: limits.memory: 128Mi → 256Mi (~3× steady-state RSS headroom) requests.memory: 64Mi → 96Mi (more honest scheduling floor) CPU envelope unchanged — agent reconciles every 15 s and the heavy calls (pynvml + list_pod_for_all_namespaces) are well under 200m on fleet inspection (median: 12 m, p99: 78 m). Fleet cost: at 1000 GPU nodes the extra 128 MiB per node is 125 GiB, which is negligible compared to the operational cost of a silently OOM-killed power-cap controller (the failure mode is a node whose caps slowly drift back to safeDefaultWatts on the next reconcile that doesn't fit the budget, with no obvious alert signal — the agent's Pod restart counter is the only indicator). Verified: helm template + helm lint clean. Refs: PR #9682 review, Power Agent live-test finding #4. Signed-off-by: Kai Ma <kaim@nvidia.com>

… log, pod-UID dedup, argparse) Folds the four CodeRabbit findings on the foundation PR back into this branch so each fix lives with the code it changes, rather than leaking into the downstream DCGM-actuator PR (#9790): * _load_previously_managed_gpus: catch OSError (not just FileNotFoundError) and validate that the JSON root is a dict and managed_uuids is a list. Malformed state files now log a warning and return an empty set instead of crashing the agent at startup. * _handle_sigterm: replace 'except Exception: pass' on pynvml.nvmlShutdown() with logger.exception so shutdown-time NVML faults appear in pod logs. We still fall through to _shutdown.set() so SIGTERM never hangs the container. * _reconcile_gpu: dedup the (pod_uid, annotation) list by UID before applying multi-pod policy. A single pod with N PIDs on one GPU was being counted as N pods, falsely tripping the multi-pod-conflict branch and the multi_pod_gpu_total metric. * main(): move 'import argparse' to module scope per the project's import-placement convention. Regression coverage: existing components/power_agent/tests/ suite (43 tests) still passes locally; behavior-specific tests for these four fixes already live on PR #9790 and remain there. Signed-off-by: Kai Ma <kaim@nvidia.com>

@sttts

The Helm chart's values.yaml points image.repository at nvcr.io/nvidia/dynamo/power-agent, but nothing in the tree built or published that image (PR #9682 review by @sttts: "where is it built?"). Add components/power_agent/Dockerfile (single-stage python:3.11-slim, NVML injected at runtime via runtimeClassName: nvidia, deps pynvml / kubernetes / prometheus-client) and a dedicated power-agent build job in pr.yaml + post-merge-ci.yml, gated on a new power_agent changed-files filter, following the snapshot-agent precedent. Signed-off-by: Kai Ma <kaim@nvidia.com>

@sttts

The design-doc header still read "Draft v1 - no code authored yet" (PR #9682 review by @sttts: "needs update?"), but the chart, templates, and power_agent.py are all in the tree on this PR. Flip Status to "Implemented", point at the shipped chart (v1.0.0) and the new Dockerfile / power-agent CI job, and add a v1.3 revision-history row. Signed-off-by: Kai Ma <kaim@nvidia.com>

@sttts

values.yaml image.repository pointed at nvcr.io/nvidia/dynamo/power-agent, but every other Dynamo image (vllm-runtime, sglang-runtime, frontend, dynamo-planner, snapshot-agent) is published under nvcr.io/nvidia/ai-dynamo/. The dynamo/ path was an unintentional divergence (PR #9682 review by @sttts). Align values.yaml, the chart README values table, and the design-doc values-surface block to nvcr.io/nvidia/ai-dynamo/power-agent. Signed-off-by: Kai Ma <kaim@nvidia.com>

kaim-eng · 2026-06-03T16:58:21Z

@sttts — thanks for the review. All four of your comments are addressed, and I've rebased the branch onto current main to clear a merge conflict. Quick summary so you can re-check:

Your comment	Resolution
Design doc "needs update?"	Flipped Status from "Draft v1 — no code authored yet" to Implemented (chart `version: 1.0.0`/`appVersion: "1.0.0"`), plus a `v1.3` revision row. (`docs/design-docs/power-agent-helm-chart-plan.md`)
`values.yaml` "where is it built?"	Added `components/power_agent/Dockerfile` and a dedicated `power-agent` build-and-push job in `pr.yaml` + `post-merge-ci.yml`, gated on a new `power_agent` changed-files filter (mirrors the `snapshot-agent` precedent).
`power_agent.py:499` "impact when this fails silently?"	Made the failure path explicit (commit `b05adaf5eb`): `_list_pods_on_node()` now returns `Optional[list]` — `None` on an API error (never `[]`), and `reconcile_once()` keys its fail-safe off that `None`, skipping the whole cycle with a distinct warning log (`"Pod listing unavailable this cycle; skipping reconcile to preserve last-known-good caps."`) so every GPU freezes at its last-known-good cap. A successful empty listing still drives a normal pass. Covered by `tests/test_reconcile_failsafe.py`.
Registry path "other images are under `nvcr.io/nvidia/ai-dynamo/`. Intentional?"	Not intentional — fixed to `nvcr.io/nvidia/ai-dynamo/power-agent` in `values.yaml`, the chart README, and the design-doc §4.2 block.

Rebase: the branch was 186 commits behind main and CONFLICTING; it's now rebased (only conflict was the changed-files COVERED_FILES line, where I kept both upstream's new efa entry and the new power_agent entry). The PR is now mergeable and all pre-merge checks are green.

Aligns docs/design-docs/power-agent-helm-chart-plan.md with the v1.1.0 chart that PR #9790 ships. The plan was authored for the NVML-only v1.0.0 chart in PR #9682 and went stale once PR #9790 layered agent.actuator + agent.dcgm.* + validateActuator + validateEnforce + the helm-unittest suite on top. Two changelog rows capture the refresh: v1.3 - first pass: 8 reviewer findings closed (3 blocking, 4 major, 2 medium, 1 low) on Status header, values surface, dev ConfigMap recipe, image-tag pinning, helm-unittest gating, internal-dev-doc references, daemonset name, file/LOC accounting. v1.4 - second pass: 5 follow-up findings closed (2 major, 2 medium, 1 low) on Status self-contradiction, stale section 4.2 dev-block comment, overstated helm-unittest coverage prose, unrunnable section 5.4 positive helm template overlays (missing --set image.tag), and the lingering filename reference in the v1.3 changelog. No design reversal: every section 6 decision and the chart shape are unchanged. Only the values surface (extended), helper set (extended with two template-time validators), and validation-gate list (helm-unittest now required) grew. Every claim in the refreshed doc was verified against on-disk state: power_agent.py:706-804 for the 8-flag CLI surface, values.yaml for the dev-block recipe, _helpers.tpl for the five-helper set, and the two tests/validate_*_test.yaml files for the 24 enumerated unittest cases. Signed-off-by: Kai Ma <kaim@nvidia.com>

@sha256

…ical OCI form Addresses three iterative PR9682 CodeRabbit review findings on the chart's image-pinning surface. Bumps Chart.yaml 1.1.0 -> 1.2.0 (minor per SemVer - image.digest is purely additive; existing tag-pinned installs render byte-identically and need no change). Round 1 (original PR9682 finding): image.tag=latest was rejected only at the raw-manifest layer, not at the chart layer. helm template --set image.tag=latest happily rendered nvcr.io/nvidia/dynamo/ power-agent:latest because validateImageTag only checked "if not .Values.image.tag". Now case-insensitively rejects the literal "latest" (LATEST, Latest, " latest " all fail) while still accepting release tags that contain the substring (v1.0.0-latest-rc). Round 2 (digest path rendered an invalid OCI reference): operators were advised in values.yaml + README + tests to set image.tag= sha256:abc... for digest pinning. This rendered repo:sha256:... - INVALID per the OCI spec, which requires repo@sha256:... for digest form (the second ":" in the rendered string parses as repo + tag "sha256" with a stray suffix; kubelet manifest pulls fail with an opaque error). Three changes close this: - values.yaml: add image.digest field with per-rule comment block explaining the OCI form, the 64-hex requirement, the mutex with image.tag, and the PR #9682 rationale. - _helpers.tpl: split validateImageTag into a full rule set (XOR(tag, digest), whitespace rejection on both, latest rejection on tag, sha256: prefix rejection on tag, exact- 64-hex regex on digest) and add a new imageRef helper that emits {repo}@{digest} when digest is set, else {repo}:{tag}. - daemonset.yaml: image: line now routes through imageRef instead of hard-coding the ":" join, so the canonical OCI form is the only thing that can ship. Round 3 (validator was too lenient): - Truncated SHA-256 digests slipped through: regex was ^sha256:[0-9a-fA-F]{32,}$, so half-digests (a common pattern when an operator copies the first 8 bytes from `docker images --digests` output) rendered as repo@sha256:<truncated>. The kubelet then fails the pull with an opaque manifest-mismatch error. Tightened to {64} exactly (SHA-256 is 32 bytes x 2 nybbles). - Whitespace-padded values rendered raw: --set-string 'image.tag= v1.1.0 ' produced image: "repo: v1.1.0 " because the latest comparison trimmed but the renderer used the raw value. Choice: reject (don't silently normalize) so the operator sees their --set quoting typo. Applies symmetrically to image.digest. tests/validate_image_tag_test.yaml (new, 22 cases - chart helm-unittest count grew 24 -> 46): - Negative: unset (both empty), latest in 4 case variants, whitespace on tag (leading/trailing/both), tag+digest mutex, sha256: prefix on tag, non-digest on digest, 32/63/65-hex digest (off-by-one + half-length), whitespace on digest. - Positive: pinned release tag renders {repo}:{tag}; v1.0.0-latest-rc accepted (substring match guard); 64-hex digest renders the canonical {repo}@sha256:... form; uppercase-hex digest preserved verbatim. README updates: production-install section now shows both tag and digest invocations side by side with a callout that digests live on image.digest, not image.tag. Values table gains the image.digest row. Troubleshooting section adds three new error-message entries covering missing-pin, invalid-digest, and mutex-violation cases. Signed-off-by: Kai Ma <kaim@nvidia.com>

Aligns docs/design-docs/power-agent-helm-chart-plan.md with the v1.2.0 chart that the preceding commit ships. The plan was authored for the v1.0.0 NVML-only chart in PR #9682, refreshed to v1.1.0 in the PR #9790 dual-actuator work, and went stale again once v1.2.0 landed image.digest + the canonical-OCI imageRef helper on top. Four reviewer findings closed (v1.5 changelog row covers all four): (1) Truncated SHA-256 digests slipped through the v1.1.0 helper - already fixed in the preceding chart commit; this commit propagates the rule into the §4.4 helper snippet and the §5.4 validation gates. (2) Whitespace-padded tags rendered raw image references - same treatment: §4.4 documents the trim-then-reject contract. (3) §4.3 / §4.4 / §5.4 stale on image.digest: - §4.3 CodeRabbit-comment row rewritten to describe both helpers (validateImageTag + imageRef), both fields (tag + digest), the canonical repo@digest form, and the PR9682 follow-up that added the separate field. - §4.4 helper snippet replaced with the actual v1.2.0 validator (full rule set + per-rule rationale comments) plus the new imageRef helper. - §5.4 expected helm-unittest output bumped 24 -> 46 passed with a one-line breakdown of the +22 new cases. (4) §4.1 / §4.2 stale on file count + helper list + values surface: - §4.1 said 13 files (omitted .helmignore and the new validate_image_tag_test.yaml). Corrected to 14 files (7 templates + 3 helm-unittests + 4 root files). The _helpers.tpl helper-list line gained imageRef and chart entries. The daemonset.yaml annotation now mentions it routes through imageRef. LOC estimate bumped ~1,430 -> ~1,700 with a breakdown of what v1.2.0 added on top of v1.0.0 / v1.1.0. - §4.2 values snippet had only image.tag with no image.digest. Added the field with the same per-rule comment block that ships in values.yaml (OCI form, 64-hex requirement, mutex with image.tag, PR9682 rationale). Status header bumped to chart v1.2.0 with a one-line v1.2.0 rationale (additive opt-in field, no breaking change to existing tag-pinned installs). Revision-history table gains the v1.5 entry summarising all four findings; older v1.1 - v1.4 entries untouched. No design reversal: every §6 decision and the chart shape are unchanged. Only the values surface (one additive field), helper set (two helpers: imageRef new, validateImageTag rewritten), and helm-unittest count (24 -> 46) grew. Every claim in the refreshed doc was verified against on-disk state: Chart.yaml for the version bump, values.yaml for the image.digest field comment, _helpers.tpl for both helpers, tests/validate_image_tag_test.yaml for the 22 enumerated cases (verified via `helm unittest`). Signed-off-by: Kai Ma <kaim@nvidia.com>

Rebase follow-up: dev-mode examples, helm-unittest fixtures, and design docs still referenced nvcr.io/nvidia/dynamo/power-agent after #9682 standardised on nvcr.io/nvidia/ai-dynamo/power-agent. Signed-off-by: Kai Ma <kaim@nvidia.com>

sttts · 2026-06-09T15:01:24Z

Some Codex xhigh findings:

  - [P1] Released installs point at an image that is never published. deploy/helm/charts/power-agent/values.yaml:8 defaults to nvcr.io/nvidia/ai-dynamo/power-agent, but the new CI only builds internal SHA tags in ECR/ACR (.github/workflows/post-merge-ci.yml:150). The release workflow still copies operator/planner/snapshot and stops at snapshot (.github/workflows/release.yml:409), with no power-agent NGC copy. Result: documented release installs can hit ImagePullBackOff.

Seems real. Needs an addition in release.yml.

  - [P1] Caps are not restored when a managed workload exits normally. _apply_cap() records the GPU as managed after setting a cap (components/power_agent/power_agent.py:276), but _reconcile_gpu() returns immediately when nvmlDeviceGetComputeRunningProcesses() becomes empty (components/power_agent/power_agent.py:515). That leaves the old cap in place indefinitely unless the agent restarts or receives SIGTERM. This can throttle unrelated later workloads on the same GPU.

Cannot judge criticality of this.

  - [P2] Mixed annotated and unannotated pods on the same GPU do not fail closed. _resolve_cap_for_gpu() drops None annotations before conflict resolution (components/power_agent/power_agent.py:383), so [None, "480"] applies 480 instead of safe_default_watts; the test locks this in at components/power_agent/tests/test_multi_pod_policy.py:125. For unsupported multi-pod sharing, any missing/
    unparseable pod annotation should likely force the safe default.

  - [P2] The new unit tests and chart templates are not wired into CI. The power-agent PR job only builds/pushes the image (.github/workflows/pr.yaml:185), while the Helm lint job only runs for operator changes and only lints deploy/helm/charts/platform (.github/workflows/pr.yaml:144). The added tests are unmarked unittest tests under components/power_agent/tests, so the marker-based Dynamo pytest jobs will not select them.

This is https://github.com/ai-dynamo/dynamo/pull/9683/changes#diff-50c86b7ed8ac2cf95bd48334961bf0530cdc77b5a56f852c5c61b89d735fd711R291 in the follow-up PR I guess.

@sttts

`_list_pods_on_node` now returns `None` on an API error (vs `[]` for a genuinely empty node), and `reconcile_once` keys its fail-safe off that sentinel: on failure it skips the cycle, freezing each GPU at its last-known-good cap instead of re-deriving caps from a zero-pod view. This makes the fail-safe contract explicit in the signature and in code comments so it can't silently drift, per @sttts review on #9682. Adds test_reconcile_failsafe.py covering the None-on-error, []-on-empty-node, skip-on-failure, and reconcile-on-empty-node cases. Signed-off-by: Kai Ma <kaim@nvidia.com>

Signed-off-by: Kai Ma <kaim@nvidia.com> # Conflicts: # .github/workflows/pr.yaml

Signed-off-by: Kai Ma <kaim@nvidia.com>

kaim-eng · 2026-06-12T15:03:58Z

All four review threads on this PR are addressed and now resolved:

deploy/power_agent/rbac.yaml ${POWER_AGENT_NAMESPACE} placeholder -> repackaged as a Helm chart; every RBAC namespace renders via {{ .Release.Namespace }} (@dynamo-ops)
SPDX headers trimmed to the two-line form across all files (@grahamking)
Image build added: components/power_agent/Dockerfile + a dedicated power-agent build-and-push CI job (@sttts)
Silent pod-list failure made explicit: _list_pods_on_node -> Optional[list] with a distinct skip log and a fail-safe regression test (@sttts)

CI is green (required checks pass; rust-* etc. skip via path filters). The PR description has been refreshed to match the current Helm-chart packaging (26 files; the only edits to pre-existing files are 5 .github/ CI files).

This is the foundation of the Phase-1 power-planner stack (#9683 and #9790 are stacked on it). It is MERGEABLE but blocked on REVIEW_REQUIRED -- @sttts @grahamking @dynamo-ops, could one of you take a look / approve when you have a moment? Thanks!

kaim-eng requested review from a team as code owners May 18, 2026 15:53

pull-request-size Bot added the size/XXL label May 18, 2026

github-actions Bot added feat documentation Improvements or additions to documentation actions labels May 18, 2026

kaim-eng mentioned this pull request May 18, 2026

feat(planner): power infrastructure — pod annotation, RBAC, config, Prometheus #9683

Open

coderabbitai Bot reviewed May 18, 2026

View reviewed changes

Comment thread components/power_agent/power_agent.py

Comment thread components/power_agent/power_agent.py Outdated

Comment thread components/power_agent/power_agent.py

Comment thread deploy/power_agent/daemonset.yaml Outdated

Comment thread deploy/power_agent/daemonset.yaml Outdated

kaim-eng force-pushed the pr1a/power-agent branch from 3cde449 to 1338028 Compare May 18, 2026 16:03

dynamo-ops reviewed May 18, 2026

View reviewed changes

Comment thread deploy/power_agent/rbac.yaml Outdated

grahamking reviewed May 18, 2026

View reviewed changes

Comment thread components/power_agent/power_agent.py

kaim-eng force-pushed the pr1a/power-agent branch from 1338028 to ec21081 Compare May 19, 2026 12:54

github-actions Bot added the deployment::k8s Relates to dynamo deployment in kubernetes label May 19, 2026

kaim-eng mentioned this pull request May 20, 2026

feat(power-agent): add DCGM dual actuator (opt-in; NVML remains default) #9790

Open

kaim-eng force-pushed the pr1a/power-agent branch from 0eda1d7 to 1e1ef6f Compare May 25, 2026 13:43

sttts reviewed Jun 3, 2026

View reviewed changes

Comment thread docs/design-docs/power-agent-helm-chart-plan.md Outdated

sttts reviewed Jun 3, 2026

View reviewed changes

Comment thread deploy/helm/charts/power-agent/values.yaml Outdated

sttts reviewed Jun 3, 2026

View reviewed changes

Comment thread components/power_agent/power_agent.py

sttts reviewed Jun 3, 2026

View reviewed changes

Comment thread docs/design-docs/power-agent-helm-chart-plan.md Outdated

kaim-eng added 9 commits June 3, 2026 12:35

kaim-eng force-pushed the pr1a/power-agent branch from 376c6bb to 6cd82e8 Compare June 3, 2026 16:42

kaim-eng requested a review from a team as a code owner June 9, 2026 22:38

kaim-eng added 4 commits June 9, 2026 22:18

Merge remote-tracking branch 'origin/main' into pr1a/power-agent

3a24764

Signed-off-by: Kai Ma <kaim@nvidia.com> # Conflicts: # .github/workflows/pr.yaml

ci(release): publish power-agent image to NGC on release

8e24d2e

Signed-off-by: Kai Ma <kaim@nvidia.com>

docs(power-agent): document cap persistence and mixed-annotation policy

8c44a70

Signed-off-by: Kai Ma <kaim@nvidia.com>

ci(release): publish power-agent helm chart to NGC on release

fff195f

Signed-off-by: Kai Ma <kaim@nvidia.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(power-agent): per-node power-cap enforcement DaemonSet#9682

feat(power-agent): per-node power-cap enforcement DaemonSet#9682
kaim-eng wants to merge 14 commits into
mainfrom
pr1a/power-agent

kaim-eng commented May 18, 2026 •

edited

Loading

Uh oh!

copy-pr-bot Bot commented May 18, 2026

Uh oh!

coderabbitai Bot commented May 18, 2026 •

edited

Loading

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kaim-eng commented Jun 3, 2026 •

edited

Loading

Uh oh!

sttts commented Jun 9, 2026 •

edited

Loading

Uh oh!

kaim-eng commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

kaim-eng commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Validation

Merge Strategy

Stack Order

Uh oh!

copy-pr-bot Bot commented May 18, 2026

Uh oh!

coderabbitai Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kaim-eng commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sttts commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kaim-eng commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

kaim-eng commented May 18, 2026 •

edited

Loading

coderabbitai Bot commented May 18, 2026 •

edited

Loading

kaim-eng commented Jun 3, 2026 •

edited

Loading

sttts commented Jun 9, 2026 •

edited

Loading