feat(power-agent): per-node power-cap enforcement DaemonSet#9682
feat(power-agent): per-node power-cap enforcement DaemonSet#9682kaim-eng wants to merge 14 commits into
Conversation
WalkthroughThis PR introduces a complete Kubernetes DaemonSet agent that enforces per-GPU NVML power caps on worker nodes. The daemon reconciles every 15 seconds, mapping running GPU processes to pod UIDs via cgroup parsing, reading power-limit annotations, resolving multi-pod conflicts, and applying caps via NVML with Prometheus metrics and graceful shutdown support. ChangesPower Agent: GPU NVML Power Cap Enforcement DaemonSet
🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (1)
components/power_agent/power_agent.py (1)
537-539: ⚡ Quick winMove
argparseimport to module scope.Line 538 imports inside
main(), which violates the repo’s Python import placement rule.As per coding guidelines: "Keep all imports at the top of each file (flag any import inside functions/classes)."Proposed fix
+import argparse import json import logging @@ def main() -> None: - import argparse - parser = argparse.ArgumentParser(description="Dynamo Power Agent DaemonSet")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/power_agent/power_agent.py` around lines 537 - 539, The argparse import is inside the main() function; move the import to module scope by adding "import argparse" at the top of the file and removing the in-function import in main(), ensuring any type or usage remains correct (adjust other imports if needed) so linting and the repo rule about top-level imports are satisfied.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@components/power_agent/power_agent.py`:
- Around line 119-124: The function _load_previously_managed_gpus should
defensively handle malformed or unreadable state files: when reading
_MANAGED_STATE_PATH, catch OSError in addition to
FileNotFoundError/JSONDecodeError, load the JSON into a variable (e.g., data =
json.load(f)), verify isinstance(data, dict) before calling
data.get("managed_uuids", []), and if the root is not a dict (or managed_uuids
is not an iterable of strings) return an empty set; ensure any
TypeError/ValueError from unexpected types is handled and results in returning
set() rather than letting startup crash.
- Around line 294-297: Replace the broad except that swallows errors from the
pynvml.nvmlShutdown() call with a targeted handler: catch pynvml.NVMLError (or
the specific pynvml error class) as e, log the exception with the module/class
logger (e.g., logger.exception("NVML shutdown failed: %s", e) or
logging.exception(...)) and then re-raise the error to avoid silent failures;
modify the try/except around pynvml.nvmlShutdown() accordingly so only
NVML-specific errors are caught, logged, and propagated.
- Around line 494-501: The loop that builds pod_annotations appends one entry
per GPU process, causing pods with multiple processes to be counted multiple
times; change the logic in the procs loop (using _extract_pod_uid_from_cgroup
and uid_to_annotation) to deduplicate by pod UID before applying multi-pod
policy — e.g., track seen UIDs (or build a uid->annotation map) and only append
one (uid, annotation) pair per unique UID so each pod is counted once when
evaluating multi-pod warnings/metrics.
In `@deploy/power_agent/daemonset.yaml`:
- Around line 16-35: The DaemonSet metadata.namespace is hardcoded to "default",
which causes mismatches with the RBAC/service account namespace; update the
manifest to use the parameterized namespace variable (e.g.
${POWER_AGENT_NAMESPACE}) instead of "default" and ensure the ServiceAccount
referenced by spec.serviceAccountName ("dynamo-power-agent") is created/bound in
that same parameterized namespace so RBAC bindings resolve correctly; locate
metadata.name ("dynamo-power-agent"), metadata.namespace, and
spec.serviceAccountName in the template to make the change.
- Around line 55-58: Replace the mutable image tag
"nvcr.io/nvidia/dynamo/power-agent:latest" with an immutable reference (a
specific version tag or an image digest), e.g.
"nvcr.io/nvidia/dynamo/power-agent:vX.Y.Z" or
"nvcr.io/nvidia/dynamo/power-agent@sha256:<digest>", so deployments are
reproducible; keep or verify imagePullPolicy (e.g. IfNotPresent) is appropriate
for pinned images and update the inline comment to note the image is
intentionally pinned.
---
Nitpick comments:
In `@components/power_agent/power_agent.py`:
- Around line 537-539: The argparse import is inside the main() function; move
the import to module scope by adding "import argparse" at the top of the file
and removing the in-function import in main(), ensuring any type or usage
remains correct (adjust other imports if needed) so linting and the repo rule
about top-level imports are satisfied.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f9faddf8-30e0-4a7b-9055-b45244b2a979
📒 Files selected for processing (11)
.github/filters.yamlcomponents/power_agent/README.mdcomponents/power_agent/power_agent.pycomponents/power_agent/tests/__init__.pycomponents/power_agent/tests/test_apply_cap.pycomponents/power_agent/tests/test_cgroup_parser.pycomponents/power_agent/tests/test_multi_pod_policy.pycomponents/power_agent/tests/test_shutdown.pydeploy/power_agent/daemonset.yamldeploy/power_agent/dev-pod.yamldeploy/power_agent/rbac.yaml
3cde449 to
1338028
Compare
1338028 to
ec21081
Compare
Folds deploy/power_agent/{daemonset,rbac,dev-pod}.yaml into a single
Helm chart at deploy/helm/charts/power-agent/, resolving the three
CodeRabbit findings on PR #9682:
* hardcoded metadata.namespace=default -> {{ .Release.Namespace }}
* mutable image :latest -> required image.tag with fail-fast validator
* `${POWER_AGENT_NAMESPACE}` envsubst placeholder -> native Helm templating
The chart supports three deployment shapes selectable via values:
production DaemonSet (default, cluster-wide RBAC), namespace-restricted
production (Role+RoleBinding), and an in-cluster dev-iteration Pod
mounting power_agent.py from a ConfigMap. Three template-time validators
reject foot-guns at install time: empty image.tag, mutex violations
between daemonset.enabled and dev.enabled, and dev mode without a
pinned dev.nodeName. Dev mode also automatically forces namespace-scoped
RBAC (least privilege), leveraging power_agent.py's --namespace flag.
Design rationale, scope decisions, and review-feedback responses are
captured in docs/design-docs/power-agent-helm-chart-plan.md, committed
alongside the chart.
components/power_agent/README.md flips its install recipe to
``helm install``, and the planner CI filter (.github/filters.yaml) is
retargeted from deploy/power_agent/** to deploy/helm/charts/power-agent/**.
Two examples/deployments/powerplanner/*.yaml header references live on
PR #9687 and will be updated during that PR's cascade rebase per plan
section 5.3.
Validated locally:
helm lint -> 0 errors
helm template (3 positive exercises) -> expected resources
helm template (3 negative exercises) -> expected fail-fast errors
components/power_agent/tests/ -> 43/43 passed
.github/scripts/test-filters.js -> 20/20 passed
pre-commit (cross-cutting hooks) -> all applicable passed
Signed-off-by: Kai Ma <kaim@nvidia.com>
Updates the three reference sites in examples/deployments/powerplanner/ that previously instructed users to ``kubectl apply -f deploy/power_agent/...``, flipping them to the new Helm chart at deploy/helm/charts/power-agent/ that landed in PR #9682: * disagg-power-aware.yaml header recipe * README.md Prerequisites + verify section * MULTI_DGD.md file index Also updates the verify-pods label selector from the legacy ``app=dynamo-power-agent`` to the chart-emitted ``app.kubernetes.io/name=power-agent``. Design docs (powerplanner-design.md, power-agent-dcgm-actuator.md, pr9369-split-plan.md, power-agent-helm-chart-plan.md) still reference the old paths in their historical / architectural narrative sections, which is intentional -- those describe the pre-chart state and the transition rationale, not current deployment instructions. Part of the PR #9369 cascade following PR #9682''s Helm chart landing. Signed-off-by: Kai Ma <kaim@nvidia.com>
Updates the three reference sites in examples/deployments/powerplanner/ that previously instructed users to ``kubectl apply -f deploy/power_agent/...``, flipping them to the new Helm chart at deploy/helm/charts/power-agent/ that landed in PR #9682: * disagg-power-aware.yaml header recipe * README.md Prerequisites + verify section * MULTI_DGD.md file index Also updates the verify-pods label selector from the legacy ``app=dynamo-power-agent`` to the chart-emitted ``app.kubernetes.io/name=power-agent``. Design docs (powerplanner-design.md, power-agent-dcgm-actuator.md, pr9369-split-plan.md, power-agent-helm-chart-plan.md) still reference the old paths in their historical / architectural narrative sections, which is intentional -- those describe the pre-chart state and the transition rationale, not current deployment instructions. Part of the PR #9369 cascade following PR #9682''s Helm chart landing. Signed-off-by: Kai Ma <kaim@nvidia.com>
Aligns docs/design-docs/power-agent-helm-chart-plan.md with the v1.1.0 chart that PR #9790 ships. The plan was authored for the NVML-only v1.0.0 chart in PR #9682 and went stale once PR #9790 layered agent.actuator + agent.dcgm.* + validateActuator + validateEnforce + the helm-unittest suite on top. Two changelog rows capture the refresh: v1.3 - first pass: 8 reviewer findings closed (3 blocking, 4 major, 2 medium, 1 low) on Status header, values surface, dev ConfigMap recipe, image-tag pinning, helm-unittest gating, internal-dev-doc references, daemonset name, file/LOC accounting. v1.4 - second pass: 5 follow-up findings closed (2 major, 2 medium, 1 low) on Status self-contradiction, stale section 4.2 dev-block comment, overstated helm-unittest coverage prose, unrunnable section 5.4 positive helm template overlays (missing --set image.tag), and the lingering filename reference in the v1.3 changelog. No design reversal: every section 6 decision and the chart shape are unchanged. Only the values surface (extended), helper set (extended with two template-time validators), and validation-gate list (helm-unittest now required) grew. Every claim in the refreshed doc was verified against on-disk state: power_agent.py:706-804 for the 8-flag CLI surface, values.yaml for the dev-block recipe, _helpers.tpl for the five-helper set, and the two tests/validate_*_test.yaml files for the 24 enumerated unittest cases. Signed-off-by: Kai Ma <kaim@nvidia.com>
…ical OCI form
Addresses three iterative PR9682 CodeRabbit review findings on the
chart's image-pinning surface. Bumps Chart.yaml 1.1.0 -> 1.2.0 (minor
per SemVer - image.digest is purely additive; existing tag-pinned
installs render byte-identically and need no change).
Round 1 (original PR9682 finding): image.tag=latest was rejected only
at the raw-manifest layer, not at the chart layer. helm template
--set image.tag=latest happily rendered nvcr.io/nvidia/dynamo/
power-agent:latest because validateImageTag only checked
"if not .Values.image.tag". Now case-insensitively rejects the literal
"latest" (LATEST, Latest, " latest " all fail) while still accepting
release tags that contain the substring (v1.0.0-latest-rc).
Round 2 (digest path rendered an invalid OCI reference): operators
were advised in values.yaml + README + tests to set image.tag=
sha256:abc... for digest pinning. This rendered repo:sha256:... -
INVALID per the OCI spec, which requires repo@sha256:... for digest
form (the second ":" in the rendered string parses as repo + tag
"sha256" with a stray suffix; kubelet manifest pulls fail with an
opaque error). Three changes close this:
- values.yaml: add image.digest field with per-rule comment
block explaining the OCI form, the 64-hex requirement, the
mutex with image.tag, and the PR #9682 rationale.
- _helpers.tpl: split validateImageTag into a full rule set
(XOR(tag, digest), whitespace rejection on both, latest
rejection on tag, sha256: prefix rejection on tag, exact-
64-hex regex on digest) and add a new imageRef helper that
emits {repo}@{digest} when digest is set, else {repo}:{tag}.
- daemonset.yaml: image: line now routes through imageRef
instead of hard-coding the ":" join, so the canonical OCI
form is the only thing that can ship.
Round 3 (validator was too lenient):
- Truncated SHA-256 digests slipped through: regex was
^sha256:[0-9a-fA-F]{32,}$, so half-digests (a common pattern
when an operator copies the first 8 bytes from
`docker images --digests` output) rendered as
repo@sha256:<truncated>. The kubelet then fails the pull with
an opaque manifest-mismatch error. Tightened to {64} exactly
(SHA-256 is 32 bytes x 2 nybbles).
- Whitespace-padded values rendered raw: --set-string
'image.tag= v1.1.0 ' produced image: "repo: v1.1.0 " because
the latest comparison trimmed but the renderer used the raw
value. Choice: reject (don't silently normalize) so the
operator sees their --set quoting typo. Applies symmetrically
to image.digest.
tests/validate_image_tag_test.yaml (new, 22 cases - chart
helm-unittest count grew 24 -> 46):
- Negative: unset (both empty), latest in 4 case variants,
whitespace on tag (leading/trailing/both), tag+digest mutex,
sha256: prefix on tag, non-digest on digest, 32/63/65-hex digest
(off-by-one + half-length), whitespace on digest.
- Positive: pinned release tag renders {repo}:{tag};
v1.0.0-latest-rc accepted (substring match guard); 64-hex digest
renders the canonical {repo}@sha256:... form; uppercase-hex
digest preserved verbatim.
README updates: production-install section now shows both tag and
digest invocations side by side with a callout that digests live
on image.digest, not image.tag. Values table gains the image.digest
row. Troubleshooting section adds three new error-message entries
covering missing-pin, invalid-digest, and mutex-violation cases.
Signed-off-by: Kai Ma <kaim@nvidia.com>
Aligns docs/design-docs/power-agent-helm-chart-plan.md with the v1.2.0 chart that the preceding commit ships. The plan was authored for the v1.0.0 NVML-only chart in PR #9682, refreshed to v1.1.0 in the PR #9790 dual-actuator work, and went stale again once v1.2.0 landed image.digest + the canonical-OCI imageRef helper on top. Four reviewer findings closed (v1.5 changelog row covers all four): (1) Truncated SHA-256 digests slipped through the v1.1.0 helper - already fixed in the preceding chart commit; this commit propagates the rule into the §4.4 helper snippet and the §5.4 validation gates. (2) Whitespace-padded tags rendered raw image references - same treatment: §4.4 documents the trim-then-reject contract. (3) §4.3 / §4.4 / §5.4 stale on image.digest: - §4.3 CodeRabbit-comment row rewritten to describe both helpers (validateImageTag + imageRef), both fields (tag + digest), the canonical repo@digest form, and the PR9682 follow-up that added the separate field. - §4.4 helper snippet replaced with the actual v1.2.0 validator (full rule set + per-rule rationale comments) plus the new imageRef helper. - §5.4 expected helm-unittest output bumped 24 -> 46 passed with a one-line breakdown of the +22 new cases. (4) §4.1 / §4.2 stale on file count + helper list + values surface: - §4.1 said 13 files (omitted .helmignore and the new validate_image_tag_test.yaml). Corrected to 14 files (7 templates + 3 helm-unittests + 4 root files). The _helpers.tpl helper-list line gained imageRef and chart entries. The daemonset.yaml annotation now mentions it routes through imageRef. LOC estimate bumped ~1,430 -> ~1,700 with a breakdown of what v1.2.0 added on top of v1.0.0 / v1.1.0. - §4.2 values snippet had only image.tag with no image.digest. Added the field with the same per-rule comment block that ships in values.yaml (OCI form, 64-hex requirement, mutex with image.tag, PR9682 rationale). Status header bumped to chart v1.2.0 with a one-line v1.2.0 rationale (additive opt-in field, no breaking change to existing tag-pinned installs). Revision-history table gains the v1.5 entry summarising all four findings; older v1.1 - v1.4 entries untouched. No design reversal: every §6 decision and the chart shape are unchanged. Only the values surface (one additive field), helper set (two helpers: imageRef new, validateImageTag rewritten), and helm-unittest count (24 -> 46) grew. Every claim in the refreshed doc was verified against on-disk state: Chart.yaml for the version bump, values.yaml for the image.digest field comment, _helpers.tpl for both helpers, tests/validate_image_tag_test.yaml for the 22 enumerated cases (verified via `helm unittest`). Signed-off-by: Kai Ma <kaim@nvidia.com>
When rbac.namespaceRestricted=true the chart renders a namespace-scoped
Role + RoleBinding (templates/role.yaml + rolebinding.yaml) instead of
the cluster-scoped pair, but the DaemonSet's container command never
passes --namespace to the agent. power_agent.py:438-446 therefore takes
its else branch — list_pod_for_all_namespaces — which the namespace-only
token rejects with 403 Forbidden on every reconcile. Reconciles never
advance and no caps are ever applied (silent failure: the agent doesn't
exit; pods are just never enumerated).
The --namespace CLI flag itself already exists (power_agent.py:542) and
the list_namespaced_pod branch (:439-442) works correctly — the dev-pod
template (templates/dev-pod.yaml:57) has been wiring this all along via
--namespace=$POD_NAMESPACE. This commit extends the same downward-API
POD_NAMESPACE env-var pattern to the production DaemonSet, gated on the
existing power-agent.effectiveNamespaceRestricted helper so both:
- production mode with rbac.namespaceRestricted=true
- dev mode (forces effective=true unless overridden)
produce a DaemonSet whose argv matches the RBAC scope its token holds.
Verified by helm template against both modes:
- default (false) → no --namespace, no POD_NAMESPACE, ClusterRole
- true → --namespace=$(POD_NAMESPACE), POD_NAMESPACE env
via downward API, Role + RoleBinding
- helm lint passes in both modes
Refs: PR #9682 review, Power Agent live-test finding #3.
Signed-off-by: Kai Ma <kaim@nvidia.com>
Steady-state RSS of the agent (pynvml + kubernetes client + prometheus exporter) is ~80 MiB on an idle 8-GPU node — the 128Mi default left only ~48 MiB of headroom, breaking under both routine inspection (`kubectl exec --container=power-agent -- python -c "..."` re-imports the kubernetes client in the same cgroup; observed RSS spike > 128 MiB, two OOM-kills during 2026-05-21 live testing on AKS dpp-dev-env) and transient peaks in multi-pod conflict resolution where the UID → annotation dict + PID → UUID map are both held in memory simultaneously (linear in pod count per node). Bump: limits.memory: 128Mi → 256Mi (~3× steady-state RSS headroom) requests.memory: 64Mi → 96Mi (more honest scheduling floor) CPU envelope unchanged — agent reconciles every 15 s and the heavy calls (pynvml + list_pod_for_all_namespaces) are well under 200m on fleet inspection (median: 12 m, p99: 78 m). Fleet cost: at 1000 GPU nodes the extra 128 MiB per node is 125 GiB, which is negligible compared to the operational cost of a silently OOM-killed power-cap controller (the failure mode is a node whose caps slowly drift back to safeDefaultWatts on the next reconcile that doesn't fit the budget, with no obvious alert signal — the agent's Pod restart counter is the only indicator). Verified: helm template + helm lint clean. Refs: PR #9682 review, Power Agent live-test finding #4. Signed-off-by: Kai Ma <kaim@nvidia.com>
Folds deploy/power_agent/{daemonset,rbac,dev-pod}.yaml into a single
Helm chart at deploy/helm/charts/power-agent/, resolving the three
CodeRabbit findings on PR #9682:
* hardcoded metadata.namespace=default -> {{ .Release.Namespace }}
* mutable image :latest -> required image.tag with fail-fast validator
* `${POWER_AGENT_NAMESPACE}` envsubst placeholder -> native Helm templating
The chart supports three deployment shapes selectable via values:
production DaemonSet (default, cluster-wide RBAC), namespace-restricted
production (Role+RoleBinding), and an in-cluster dev-iteration Pod
mounting power_agent.py from a ConfigMap. Three template-time validators
reject foot-guns at install time: empty image.tag, mutex violations
between daemonset.enabled and dev.enabled, and dev mode without a
pinned dev.nodeName. Dev mode also automatically forces namespace-scoped
RBAC (least privilege), leveraging power_agent.py's --namespace flag.
Design rationale, scope decisions, and review-feedback responses are
captured in docs/design-docs/power-agent-helm-chart-plan.md, committed
alongside the chart.
components/power_agent/README.md flips its install recipe to
``helm install``, and the planner CI filter (.github/filters.yaml) is
retargeted from deploy/power_agent/** to deploy/helm/charts/power-agent/**.
Two examples/deployments/powerplanner/*.yaml header references live on
PR #9687 and will be updated during that PR's cascade rebase per plan
section 5.3.
Validated locally:
helm lint -> 0 errors
helm template (3 positive exercises) -> expected resources
helm template (3 negative exercises) -> expected fail-fast errors
components/power_agent/tests/ -> 43/43 passed
.github/scripts/test-filters.js -> 20/20 passed
pre-commit (cross-cutting hooks) -> all applicable passed
Signed-off-by: Kai Ma <kaim@nvidia.com>
When rbac.namespaceRestricted=true the chart renders a namespace-scoped
Role + RoleBinding (templates/role.yaml + rolebinding.yaml) instead of
the cluster-scoped pair, but the DaemonSet's container command never
passes --namespace to the agent. power_agent.py:438-446 therefore takes
its else branch — list_pod_for_all_namespaces — which the namespace-only
token rejects with 403 Forbidden on every reconcile. Reconciles never
advance and no caps are ever applied (silent failure: the agent doesn't
exit; pods are just never enumerated).
The --namespace CLI flag itself already exists (power_agent.py:542) and
the list_namespaced_pod branch (:439-442) works correctly — the dev-pod
template (templates/dev-pod.yaml:57) has been wiring this all along via
--namespace=$POD_NAMESPACE. This commit extends the same downward-API
POD_NAMESPACE env-var pattern to the production DaemonSet, gated on the
existing power-agent.effectiveNamespaceRestricted helper so both:
- production mode with rbac.namespaceRestricted=true
- dev mode (forces effective=true unless overridden)
produce a DaemonSet whose argv matches the RBAC scope its token holds.
Verified by helm template against both modes:
- default (false) → no --namespace, no POD_NAMESPACE, ClusterRole
- true → --namespace=$(POD_NAMESPACE), POD_NAMESPACE env
via downward API, Role + RoleBinding
- helm lint passes in both modes
Refs: PR #9682 review, Power Agent live-test finding #3.
Signed-off-by: Kai Ma <kaim@nvidia.com>
Steady-state RSS of the agent (pynvml + kubernetes client + prometheus exporter) is ~80 MiB on an idle 8-GPU node — the 128Mi default left only ~48 MiB of headroom, breaking under both routine inspection (`kubectl exec --container=power-agent -- python -c "..."` re-imports the kubernetes client in the same cgroup; observed RSS spike > 128 MiB, two OOM-kills during 2026-05-21 live testing on AKS dpp-dev-env) and transient peaks in multi-pod conflict resolution where the UID → annotation dict + PID → UUID map are both held in memory simultaneously (linear in pod count per node). Bump: limits.memory: 128Mi → 256Mi (~3× steady-state RSS headroom) requests.memory: 64Mi → 96Mi (more honest scheduling floor) CPU envelope unchanged — agent reconciles every 15 s and the heavy calls (pynvml + list_pod_for_all_namespaces) are well under 200m on fleet inspection (median: 12 m, p99: 78 m). Fleet cost: at 1000 GPU nodes the extra 128 MiB per node is 125 GiB, which is negligible compared to the operational cost of a silently OOM-killed power-cap controller (the failure mode is a node whose caps slowly drift back to safeDefaultWatts on the next reconcile that doesn't fit the budget, with no obvious alert signal — the agent's Pod restart counter is the only indicator). Verified: helm template + helm lint clean. Refs: PR #9682 review, Power Agent live-test finding #4. Signed-off-by: Kai Ma <kaim@nvidia.com>
0eda1d7 to
1e1ef6f
Compare
… log, pod-UID dedup, argparse) Folds the four CodeRabbit findings on the foundation PR back into this branch so each fix lives with the code it changes, rather than leaking into the downstream DCGM-actuator PR (#9790): * _load_previously_managed_gpus: catch OSError (not just FileNotFoundError) and validate that the JSON root is a dict and managed_uuids is a list. Malformed state files now log a warning and return an empty set instead of crashing the agent at startup. * _handle_sigterm: replace 'except Exception: pass' on pynvml.nvmlShutdown() with logger.exception so shutdown-time NVML faults appear in pod logs. We still fall through to _shutdown.set() so SIGTERM never hangs the container. * _reconcile_gpu: dedup the (pod_uid, annotation) list by UID before applying multi-pod policy. A single pod with N PIDs on one GPU was being counted as N pods, falsely tripping the multi-pod-conflict branch and the multi_pod_gpu_total metric. * main(): move 'import argparse' to module scope per the project's import-placement convention. Regression coverage: existing components/power_agent/tests/ suite (43 tests) still passes locally; behavior-specific tests for these four fixes already live on PR #9790 and remain there. Signed-off-by: Kai Ma <kaim@nvidia.com>
The Helm chart's values.yaml points image.repository at nvcr.io/nvidia/dynamo/power-agent, but nothing in the tree built or published that image (PR #9682 review by @sttts: "where is it built?"). Add components/power_agent/Dockerfile (single-stage python:3.11-slim, NVML injected at runtime via runtimeClassName: nvidia, deps pynvml / kubernetes / prometheus-client) and a dedicated power-agent build job in pr.yaml + post-merge-ci.yml, gated on a new power_agent changed-files filter, following the snapshot-agent precedent. Signed-off-by: Kai Ma <kaim@nvidia.com>
The design-doc header still read "Draft v1 - no code authored yet" (PR #9682 review by @sttts: "needs update?"), but the chart, templates, and power_agent.py are all in the tree on this PR. Flip Status to "Implemented", point at the shipped chart (v1.0.0) and the new Dockerfile / power-agent CI job, and add a v1.3 revision-history row. Signed-off-by: Kai Ma <kaim@nvidia.com>
values.yaml image.repository pointed at nvcr.io/nvidia/dynamo/power-agent, but every other Dynamo image (vllm-runtime, sglang-runtime, frontend, dynamo-planner, snapshot-agent) is published under nvcr.io/nvidia/ai-dynamo/. The dynamo/ path was an unintentional divergence (PR #9682 review by @sttts). Align values.yaml, the chart README values table, and the design-doc values-surface block to nvcr.io/nvidia/ai-dynamo/power-agent. Signed-off-by: Kai Ma <kaim@nvidia.com>
Adds the Power Agent DaemonSet (privileged, hostPID) that watches pods
for the dynamo.nvidia.com/gpu-power-limit annotation, parses
/proc/{pid}/cgroup to recover pod UIDs (handles all K8s QoS x cgroup
driver x CRI runtime variants), and calls
nvmlDeviceSetPowerManagementLimit per physical GPU.
Includes NVML cap clamping to SKU-specific [min_w, max_w] bounds,
UUID-gated orphan-cap restoration on startup, SIGTERM-handler default-
TGP restore, and a multi-pod-per-GPU policy that falls back to
power_agent_safe_default_watts on conflict.
43 unit tests covering apply_cap, cgroup_parser, multi_pod_policy,
shutdown. Standalone component - no coupling to planner internals.
CI filter (.github/filters.yaml) extends the planner group to cover the
new component + deploy paths so changes here trip the planner job set.
Part of the PR #9369 split (PR 1a of 6). See docs/design-docs/pr9369-split-plan.md.
Signed-off-by: Kai Ma <kaim@nvidia.com>
Folds deploy/power_agent/{daemonset,rbac,dev-pod}.yaml into a single
Helm chart at deploy/helm/charts/power-agent/, resolving the three
CodeRabbit findings on PR #9682:
* hardcoded metadata.namespace=default -> {{ .Release.Namespace }}
* mutable image :latest -> required image.tag with fail-fast validator
* `${POWER_AGENT_NAMESPACE}` envsubst placeholder -> native Helm templating
The chart supports three deployment shapes selectable via values:
production DaemonSet (default, cluster-wide RBAC), namespace-restricted
production (Role+RoleBinding), and an in-cluster dev-iteration Pod
mounting power_agent.py from a ConfigMap. Three template-time validators
reject foot-guns at install time: empty image.tag, mutex violations
between daemonset.enabled and dev.enabled, and dev mode without a
pinned dev.nodeName. Dev mode also automatically forces namespace-scoped
RBAC (least privilege), leveraging power_agent.py's --namespace flag.
Design rationale, scope decisions, and review-feedback responses are
captured in docs/design-docs/power-agent-helm-chart-plan.md, committed
alongside the chart.
components/power_agent/README.md flips its install recipe to
``helm install``, and the planner CI filter (.github/filters.yaml) is
retargeted from deploy/power_agent/** to deploy/helm/charts/power-agent/**.
Two examples/deployments/powerplanner/*.yaml header references live on
PR #9687 and will be updated during that PR's cascade rebase per plan
section 5.3.
Validated locally:
helm lint -> 0 errors
helm template (3 positive exercises) -> expected resources
helm template (3 negative exercises) -> expected fail-fast errors
components/power_agent/tests/ -> 43/43 passed
.github/scripts/test-filters.js -> 20/20 passed
pre-commit (cross-cutting hooks) -> all applicable passed
Signed-off-by: Kai Ma <kaim@nvidia.com>
…S.txt copyright-checks (regex requiring `#`-prefixed SPDX lines) rejected the Go-template-comment header in _helpers.tpl and the missing header in NOTES.txt on commit 8e490a7. Switch both to the same 2-line `#`-prefixed form already used by the other 9 chart files, aligning with @grahamking's PR review preference (minimal SPDX, no Apache boilerplate). For NOTES.txt this adds two visible lines to the post-`helm install` banner, matching the project convention established by deploy/helm/charts/platform/templates/NOTES.txt. For _helpers.tpl the header sits outside any `{{- define -}}` block, so it is never invoked at render time and is invisible to consumers despite not being inside a Go template comment. Re-validated locally: helm lint -> 0 errors helm template (default mode) -> 4 expected kinds pre-commit on the two touched files -> all applicable hooks pass Signed-off-by: Kai Ma <kaim@nvidia.com>
When rbac.namespaceRestricted=true the chart renders a namespace-scoped
Role + RoleBinding (templates/role.yaml + rolebinding.yaml) instead of
the cluster-scoped pair, but the DaemonSet's container command never
passes --namespace to the agent. power_agent.py:438-446 therefore takes
its else branch — list_pod_for_all_namespaces — which the namespace-only
token rejects with 403 Forbidden on every reconcile. Reconciles never
advance and no caps are ever applied (silent failure: the agent doesn't
exit; pods are just never enumerated).
The --namespace CLI flag itself already exists (power_agent.py:542) and
the list_namespaced_pod branch (:439-442) works correctly — the dev-pod
template (templates/dev-pod.yaml:57) has been wiring this all along via
--namespace=$POD_NAMESPACE. This commit extends the same downward-API
POD_NAMESPACE env-var pattern to the production DaemonSet, gated on the
existing power-agent.effectiveNamespaceRestricted helper so both:
- production mode with rbac.namespaceRestricted=true
- dev mode (forces effective=true unless overridden)
produce a DaemonSet whose argv matches the RBAC scope its token holds.
Verified by helm template against both modes:
- default (false) → no --namespace, no POD_NAMESPACE, ClusterRole
- true → --namespace=$(POD_NAMESPACE), POD_NAMESPACE env
via downward API, Role + RoleBinding
- helm lint passes in both modes
Refs: PR #9682 review, Power Agent live-test finding #3.
Signed-off-by: Kai Ma <kaim@nvidia.com>
Steady-state RSS of the agent (pynvml + kubernetes client + prometheus exporter) is ~80 MiB on an idle 8-GPU node — the 128Mi default left only ~48 MiB of headroom, breaking under both routine inspection (`kubectl exec --container=power-agent -- python -c "..."` re-imports the kubernetes client in the same cgroup; observed RSS spike > 128 MiB, two OOM-kills during 2026-05-21 live testing on AKS dpp-dev-env) and transient peaks in multi-pod conflict resolution where the UID → annotation dict + PID → UUID map are both held in memory simultaneously (linear in pod count per node). Bump: limits.memory: 128Mi → 256Mi (~3× steady-state RSS headroom) requests.memory: 64Mi → 96Mi (more honest scheduling floor) CPU envelope unchanged — agent reconciles every 15 s and the heavy calls (pynvml + list_pod_for_all_namespaces) are well under 200m on fleet inspection (median: 12 m, p99: 78 m). Fleet cost: at 1000 GPU nodes the extra 128 MiB per node is 125 GiB, which is negligible compared to the operational cost of a silently OOM-killed power-cap controller (the failure mode is a node whose caps slowly drift back to safeDefaultWatts on the next reconcile that doesn't fit the budget, with no obvious alert signal — the agent's Pod restart counter is the only indicator). Verified: helm template + helm lint clean. Refs: PR #9682 review, Power Agent live-test finding #4. Signed-off-by: Kai Ma <kaim@nvidia.com>
… log, pod-UID dedup, argparse) Folds the four CodeRabbit findings on the foundation PR back into this branch so each fix lives with the code it changes, rather than leaking into the downstream DCGM-actuator PR (#9790): * _load_previously_managed_gpus: catch OSError (not just FileNotFoundError) and validate that the JSON root is a dict and managed_uuids is a list. Malformed state files now log a warning and return an empty set instead of crashing the agent at startup. * _handle_sigterm: replace 'except Exception: pass' on pynvml.nvmlShutdown() with logger.exception so shutdown-time NVML faults appear in pod logs. We still fall through to _shutdown.set() so SIGTERM never hangs the container. * _reconcile_gpu: dedup the (pod_uid, annotation) list by UID before applying multi-pod policy. A single pod with N PIDs on one GPU was being counted as N pods, falsely tripping the multi-pod-conflict branch and the multi_pod_gpu_total metric. * main(): move 'import argparse' to module scope per the project's import-placement convention. Regression coverage: existing components/power_agent/tests/ suite (43 tests) still passes locally; behavior-specific tests for these four fixes already live on PR #9790 and remain there. Signed-off-by: Kai Ma <kaim@nvidia.com>
The Helm chart's values.yaml points image.repository at nvcr.io/nvidia/dynamo/power-agent, but nothing in the tree built or published that image (PR #9682 review by @sttts: "where is it built?"). Add components/power_agent/Dockerfile (single-stage python:3.11-slim, NVML injected at runtime via runtimeClassName: nvidia, deps pynvml / kubernetes / prometheus-client) and a dedicated power-agent build job in pr.yaml + post-merge-ci.yml, gated on a new power_agent changed-files filter, following the snapshot-agent precedent. Signed-off-by: Kai Ma <kaim@nvidia.com>
The design-doc header still read "Draft v1 - no code authored yet" (PR #9682 review by @sttts: "needs update?"), but the chart, templates, and power_agent.py are all in the tree on this PR. Flip Status to "Implemented", point at the shipped chart (v1.0.0) and the new Dockerfile / power-agent CI job, and add a v1.3 revision-history row. Signed-off-by: Kai Ma <kaim@nvidia.com>
values.yaml image.repository pointed at nvcr.io/nvidia/dynamo/power-agent, but every other Dynamo image (vllm-runtime, sglang-runtime, frontend, dynamo-planner, snapshot-agent) is published under nvcr.io/nvidia/ai-dynamo/. The dynamo/ path was an unintentional divergence (PR #9682 review by @sttts). Align values.yaml, the chart README values table, and the design-doc values-surface block to nvcr.io/nvidia/ai-dynamo/power-agent. Signed-off-by: Kai Ma <kaim@nvidia.com>
376c6bb to
6cd82e8
Compare
|
@sttts — thanks for the review. All four of your comments are addressed, and I've rebased the branch onto current
Rebase: the branch was 186 commits behind |
Aligns docs/design-docs/power-agent-helm-chart-plan.md with the v1.1.0 chart that PR #9790 ships. The plan was authored for the NVML-only v1.0.0 chart in PR #9682 and went stale once PR #9790 layered agent.actuator + agent.dcgm.* + validateActuator + validateEnforce + the helm-unittest suite on top. Two changelog rows capture the refresh: v1.3 - first pass: 8 reviewer findings closed (3 blocking, 4 major, 2 medium, 1 low) on Status header, values surface, dev ConfigMap recipe, image-tag pinning, helm-unittest gating, internal-dev-doc references, daemonset name, file/LOC accounting. v1.4 - second pass: 5 follow-up findings closed (2 major, 2 medium, 1 low) on Status self-contradiction, stale section 4.2 dev-block comment, overstated helm-unittest coverage prose, unrunnable section 5.4 positive helm template overlays (missing --set image.tag), and the lingering filename reference in the v1.3 changelog. No design reversal: every section 6 decision and the chart shape are unchanged. Only the values surface (extended), helper set (extended with two template-time validators), and validation-gate list (helm-unittest now required) grew. Every claim in the refreshed doc was verified against on-disk state: power_agent.py:706-804 for the 8-flag CLI surface, values.yaml for the dev-block recipe, _helpers.tpl for the five-helper set, and the two tests/validate_*_test.yaml files for the 24 enumerated unittest cases. Signed-off-by: Kai Ma <kaim@nvidia.com>
…ical OCI form
Addresses three iterative PR9682 CodeRabbit review findings on the
chart's image-pinning surface. Bumps Chart.yaml 1.1.0 -> 1.2.0 (minor
per SemVer - image.digest is purely additive; existing tag-pinned
installs render byte-identically and need no change).
Round 1 (original PR9682 finding): image.tag=latest was rejected only
at the raw-manifest layer, not at the chart layer. helm template
--set image.tag=latest happily rendered nvcr.io/nvidia/dynamo/
power-agent:latest because validateImageTag only checked
"if not .Values.image.tag". Now case-insensitively rejects the literal
"latest" (LATEST, Latest, " latest " all fail) while still accepting
release tags that contain the substring (v1.0.0-latest-rc).
Round 2 (digest path rendered an invalid OCI reference): operators
were advised in values.yaml + README + tests to set image.tag=
sha256:abc... for digest pinning. This rendered repo:sha256:... -
INVALID per the OCI spec, which requires repo@sha256:... for digest
form (the second ":" in the rendered string parses as repo + tag
"sha256" with a stray suffix; kubelet manifest pulls fail with an
opaque error). Three changes close this:
- values.yaml: add image.digest field with per-rule comment
block explaining the OCI form, the 64-hex requirement, the
mutex with image.tag, and the PR #9682 rationale.
- _helpers.tpl: split validateImageTag into a full rule set
(XOR(tag, digest), whitespace rejection on both, latest
rejection on tag, sha256: prefix rejection on tag, exact-
64-hex regex on digest) and add a new imageRef helper that
emits {repo}@{digest} when digest is set, else {repo}:{tag}.
- daemonset.yaml: image: line now routes through imageRef
instead of hard-coding the ":" join, so the canonical OCI
form is the only thing that can ship.
Round 3 (validator was too lenient):
- Truncated SHA-256 digests slipped through: regex was
^sha256:[0-9a-fA-F]{32,}$, so half-digests (a common pattern
when an operator copies the first 8 bytes from
`docker images --digests` output) rendered as
repo@sha256:<truncated>. The kubelet then fails the pull with
an opaque manifest-mismatch error. Tightened to {64} exactly
(SHA-256 is 32 bytes x 2 nybbles).
- Whitespace-padded values rendered raw: --set-string
'image.tag= v1.1.0 ' produced image: "repo: v1.1.0 " because
the latest comparison trimmed but the renderer used the raw
value. Choice: reject (don't silently normalize) so the
operator sees their --set quoting typo. Applies symmetrically
to image.digest.
tests/validate_image_tag_test.yaml (new, 22 cases - chart
helm-unittest count grew 24 -> 46):
- Negative: unset (both empty), latest in 4 case variants,
whitespace on tag (leading/trailing/both), tag+digest mutex,
sha256: prefix on tag, non-digest on digest, 32/63/65-hex digest
(off-by-one + half-length), whitespace on digest.
- Positive: pinned release tag renders {repo}:{tag};
v1.0.0-latest-rc accepted (substring match guard); 64-hex digest
renders the canonical {repo}@sha256:... form; uppercase-hex
digest preserved verbatim.
README updates: production-install section now shows both tag and
digest invocations side by side with a callout that digests live
on image.digest, not image.tag. Values table gains the image.digest
row. Troubleshooting section adds three new error-message entries
covering missing-pin, invalid-digest, and mutex-violation cases.
Signed-off-by: Kai Ma <kaim@nvidia.com>
Aligns docs/design-docs/power-agent-helm-chart-plan.md with the v1.2.0 chart that the preceding commit ships. The plan was authored for the v1.0.0 NVML-only chart in PR #9682, refreshed to v1.1.0 in the PR #9790 dual-actuator work, and went stale again once v1.2.0 landed image.digest + the canonical-OCI imageRef helper on top. Four reviewer findings closed (v1.5 changelog row covers all four): (1) Truncated SHA-256 digests slipped through the v1.1.0 helper - already fixed in the preceding chart commit; this commit propagates the rule into the §4.4 helper snippet and the §5.4 validation gates. (2) Whitespace-padded tags rendered raw image references - same treatment: §4.4 documents the trim-then-reject contract. (3) §4.3 / §4.4 / §5.4 stale on image.digest: - §4.3 CodeRabbit-comment row rewritten to describe both helpers (validateImageTag + imageRef), both fields (tag + digest), the canonical repo@digest form, and the PR9682 follow-up that added the separate field. - §4.4 helper snippet replaced with the actual v1.2.0 validator (full rule set + per-rule rationale comments) plus the new imageRef helper. - §5.4 expected helm-unittest output bumped 24 -> 46 passed with a one-line breakdown of the +22 new cases. (4) §4.1 / §4.2 stale on file count + helper list + values surface: - §4.1 said 13 files (omitted .helmignore and the new validate_image_tag_test.yaml). Corrected to 14 files (7 templates + 3 helm-unittests + 4 root files). The _helpers.tpl helper-list line gained imageRef and chart entries. The daemonset.yaml annotation now mentions it routes through imageRef. LOC estimate bumped ~1,430 -> ~1,700 with a breakdown of what v1.2.0 added on top of v1.0.0 / v1.1.0. - §4.2 values snippet had only image.tag with no image.digest. Added the field with the same per-rule comment block that ships in values.yaml (OCI form, 64-hex requirement, mutex with image.tag, PR9682 rationale). Status header bumped to chart v1.2.0 with a one-line v1.2.0 rationale (additive opt-in field, no breaking change to existing tag-pinned installs). Revision-history table gains the v1.5 entry summarising all four findings; older v1.1 - v1.4 entries untouched. No design reversal: every §6 decision and the chart shape are unchanged. Only the values surface (one additive field), helper set (two helpers: imageRef new, validateImageTag rewritten), and helm-unittest count (24 -> 46) grew. Every claim in the refreshed doc was verified against on-disk state: Chart.yaml for the version bump, values.yaml for the image.digest field comment, _helpers.tpl for both helpers, tests/validate_image_tag_test.yaml for the 22 enumerated cases (verified via `helm unittest`). Signed-off-by: Kai Ma <kaim@nvidia.com>
Rebase follow-up: dev-mode examples, helm-unittest fixtures, and design docs still referenced nvcr.io/nvidia/dynamo/power-agent after #9682 standardised on nvcr.io/nvidia/ai-dynamo/power-agent. Signed-off-by: Kai Ma <kaim@nvidia.com>
|
Some Codex xhigh findings: Seems real. Needs an addition in release.yml. Cannot judge criticality of this. This is https://github.com/ai-dynamo/dynamo/pull/9683/changes#diff-50c86b7ed8ac2cf95bd48334961bf0530cdc77b5a56f852c5c61b89d735fd711R291 in the follow-up PR I guess. |
`_list_pods_on_node` now returns `None` on an API error (vs `[]` for a genuinely empty node), and `reconcile_once` keys its fail-safe off that sentinel: on failure it skips the cycle, freezing each GPU at its last-known-good cap instead of re-deriving caps from a zero-pod view. This makes the fail-safe contract explicit in the signature and in code comments so it can't silently drift, per @sttts review on #9682. Adds test_reconcile_failsafe.py covering the None-on-error, []-on-empty-node, skip-on-failure, and reconcile-on-empty-node cases. Signed-off-by: Kai Ma <kaim@nvidia.com>
Signed-off-by: Kai Ma <kaim@nvidia.com> # Conflicts: # .github/workflows/pr.yaml
Signed-off-by: Kai Ma <kaim@nvidia.com>
Signed-off-by: Kai Ma <kaim@nvidia.com>
Signed-off-by: Kai Ma <kaim@nvidia.com>
|
All four review threads on this PR are addressed and now resolved:
CI is green (required checks pass; This is the foundation of the Phase-1 power-planner stack (#9683 and #9790 are stacked on it). It is |
Part of the PR #9369 split plan.
This is PR 1 of 6 (PR 1a - Power Agent).
Predecessor: none (base =
main)Successor: #9683 (stacked on
pr1a/power-agent; #9682 must land first)Summary
Standalone Power Agent DaemonSet for per-node GPU power-cap enforcement, packaged as a Helm chart. The PR is reviewable in isolation from planner logic and has zero
import dynamo.planner.*coupling.Current branch shape: 26 files changed: net-new
components/power_agent/code anddeploy/helm/charts/power-agent/chart files, plus CI plumbing edits to 5 existing.github/files.components/power_agent/power_agent.py(566 LOC): NVML clamp helper, cgroup parser, multi-pod policy, SIGTERM restore, fail-safe reconcile (_list_pods_on_node -> Optional[list]), and UUID-gated orphan-cap restoration.components/power_agent/Dockerfile: slim Python runtime;libnvidia-ml.sois injected at runtime byruntimeClassName: nvidia.components/power_agent/tests/: 47 unit tests covering apply_cap, cgroup parsing, multi-pod policy, reconcile fail-safe, and shutdown.deploy/helm/charts/power-agent/: Helm chart with DaemonSet, dev pod, ServiceAccount, and RBAC. RBAC renders namespaces through{{ .Release.Namespace }}and supports cluster-scoped or namespace-restricted mode..github/:power_agentpath filter plus dedicated power-agent image build/push wiring in PR, post-merge, and release workflows.Reviewer context:
docs/design-docs/powerplanner-design.mdsection 7 (lands later in the stack; readable frompr5/docs-devenv).dynamo.nvidia.com/gpu-power-limitpod annotation.Validation
components/power_agent/tests/: 47 unit tests in-tree. The last full dev-pod run before the reconcile fail-safe additions was 43 passed on 2026-05-18; CI is green at the current tip.origin/main..pr1a/power-agent: 14 hooks pass, 4 skipped.pytest-marker-reportwas reconfirmed in a Linux pod withMissing sets: 0. The Windows host has the known pre-existing POSIX-onlyfcntlimport issue in untouched test utilities; Linux CI is unaffected.Merge Strategy
Rebase-and-merge, no squash, preserving the single PR commit in
mainhistory per split-plan section 4.3.Stack Order
Merge #9682 first. #9683 and #9790 are both stacked on
pr1a/power-agentand become actionable after #9682 lands.