Skip to content

feat(power-agent): per-node power-cap enforcement DaemonSet#9682

Open
kaim-eng wants to merge 14 commits into
mainfrom
pr1a/power-agent
Open

feat(power-agent): per-node power-cap enforcement DaemonSet#9682
kaim-eng wants to merge 14 commits into
mainfrom
pr1a/power-agent

Conversation

@kaim-eng

@kaim-eng kaim-eng commented May 18, 2026

Copy link
Copy Markdown

Part of the PR #9369 split plan.
This is PR 1 of 6 (PR 1a - Power Agent).

Predecessor: none (base = main)
Successor: #9683 (stacked on pr1a/power-agent; #9682 must land first)

Summary

Standalone Power Agent DaemonSet for per-node GPU power-cap enforcement, packaged as a Helm chart. The PR is reviewable in isolation from planner logic and has zero import dynamo.planner.* coupling.

Current branch shape: 26 files changed: net-new components/power_agent/ code and deploy/helm/charts/power-agent/ chart files, plus CI plumbing edits to 5 existing .github/ files.

  • components/power_agent/power_agent.py (566 LOC): NVML clamp helper, cgroup parser, multi-pod policy, SIGTERM restore, fail-safe reconcile (_list_pods_on_node -> Optional[list]), and UUID-gated orphan-cap restoration.
  • components/power_agent/Dockerfile: slim Python runtime; libnvidia-ml.so is injected at runtime by runtimeClassName: nvidia.
  • components/power_agent/tests/: 47 unit tests covering apply_cap, cgroup parsing, multi-pod policy, reconcile fail-safe, and shutdown.
  • deploy/helm/charts/power-agent/: Helm chart with DaemonSet, dev pod, ServiceAccount, and RBAC. RBAC renders namespaces through {{ .Release.Namespace }} and supports cluster-scoped or namespace-restricted mode.
  • .github/: power_agent path filter plus dedicated power-agent image build/push wiring in PR, post-merge, and release workflows.

Reviewer context:

  • Design context: docs/design-docs/powerplanner-design.md section 7 (lands later in the stack; readable from pr5/docs-devenv).
  • Planner contract: the only planner-facing coupling is the dynamo.nvidia.com/gpu-power-limit pod annotation.
  • Split-plan context: sections 2.1 and 2.1.2.

Validation

  • Required/current GitHub checks pass at the current tip; rust jobs and team-request are intentionally skipped by filters.
  • components/power_agent/tests/: 47 unit tests in-tree. The last full dev-pod run before the reconcile fail-safe additions was 43 passed on 2026-05-18; CI is green at the current tip.
  • Pre-commit on origin/main..pr1a/power-agent: 14 hooks pass, 4 skipped. pytest-marker-report was reconfirmed in a Linux pod with Missing sets: 0. The Windows host has the known pre-existing POSIX-only fcntl import issue in untouched test utilities; Linux CI is unaffected.
  • DCO: commits are signed off.

Merge Strategy

Rebase-and-merge, no squash, preserving the single PR commit in main history per split-plan section 4.3.

Stack Order

Merge #9682 first. #9683 and #9790 are both stacked on pr1a/power-agent and become actionable after #9682 lands.

@kaim-eng kaim-eng requested review from a team as code owners May 18, 2026 15:53
@copy-pr-bot

copy-pr-bot Bot commented May 18, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented May 18, 2026

Copy link
Copy Markdown
Contributor

Walkthrough

This PR introduces a complete Kubernetes DaemonSet agent that enforces per-GPU NVML power caps on worker nodes. The daemon reconciles every 15 seconds, mapping running GPU processes to pod UIDs via cgroup parsing, reading power-limit annotations, resolving multi-pod conflicts, and applying caps via NVML with Prometheus metrics and graceful shutdown support.

Changes

Power Agent: GPU NVML Power Cap Enforcement DaemonSet

Layer / File(s) Summary
Module foundation and utilities
components/power_agent/power_agent.py (lines 1–213)
Module header, optional lazy imports for NVML/Kubernetes/Prometheus with fallbacks, logging, constants, cgroup parsing regex for extracting pod UIDs from /proc/{pid}/cgroup, atomic JSON persistence for managed GPU state, UUID normalization for pynvml version compatibility, and Prometheus metrics infrastructure with noop fallback when unavailable.
NVML cap application and multi-pod policy
components/power_agent/power_agent.py (lines 220–404), components/power_agent/tests/test_apply_cap.py, components/power_agent/tests/test_multi_pod_policy.py
NVML helper functions to clamp watts to device min/max constraints and apply power caps with logging and metric updates; multi-pod-per-GPU resolution policy that applies an agreed cap when all pods match, otherwise applies safe default on conflicts or missing/invalid annotations, incrementing error counters; unit tests validating clamping boundaries, NVML error tolerance, UUID handling (bytes vs str), and all policy scenarios (single pod, agreement, conflict, invalid annotations, mixed None/valid).
Shutdown and startup recovery
components/power_agent/power_agent.py (lines 280–337), components/power_agent/tests/test_shutdown.py
SIGTERM/SIGINT handler restores managed GPU power limits to NVML defaults and signals main loop exit; startup orphan recovery loads persisted GPU UUIDs, checks idle status, restores default caps for idle GPUs, and updates persisted state; unit tests validating per-GPU restoration, graceful shutdown with no managed GPUs, and NVML error handling.
PowerAgent orchestration
components/power_agent/power_agent.py (lines 411–577)
PowerAgent class initializes NVML/Kubernetes clients, loads orphan state, lists node pods, maps running PIDs to pod UIDs, builds UID-to-annotation mappings, resolves GPU cap via policy, applies caps via NVML, and persists managed state; blocking reconcile loop runs every 15 seconds; CLI main() entry point accepts --safe-default-watts, optional node/namespace scope, and Prometheus port, then starts daemon.
Cgroup parser validation
components/power_agent/tests/test_cgroup_parser.py
Unit tests for cgroup UID extraction across cgroup v1/v2 with systemd and cgroupfs layouts, QoS classes (Guaranteed/Burstable/BestEffort), and non-Kubernetes inputs; validates OSError handling and first-match-wins behavior when multiple cgroup lines exist.
Kubernetes deployment
deploy/power_agent/rbac.yaml, deploy/power_agent/daemonset.yaml, deploy/power_agent/dev-pod.yaml
RBAC ServiceAccount, ClusterRole with pod read-access, and ClusterRoleBinding; production DaemonSet runs on GPU-labeled nodes with hostPID, privileged container, nvidia runtime, environment variables for safe default (500 W) and node name, resource limits, and volume mounts for /proc (read-only) and persisted state at /var/lib/dynamo-power-agent; development Pod harness for manual testing and validation on live clusters.
Documentation and CI integration
components/power_agent/README.md, .github/filters.yaml
README documents Power Agent's 15-second reconciliation, cgroup UID extraction, annotation reading, NVML power cap enforcement, troubleshooting steps (annotations, logs), Prometheus metrics (applied limit, conflict/safe-default/failure counters), and shutdown/startup semantics; CI filter update triggers planner jobs on power_agent component and deployment changes.

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 27.38% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main change: introduction of a per-node power-cap enforcement DaemonSet for the Power Agent.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed PR description is comprehensive with clear overview, detailed summary of changes, specific file guidance, related issues reference, and explicit stack ordering context.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (1)
components/power_agent/power_agent.py (1)

537-539: ⚡ Quick win

Move argparse import to module scope.

Line 538 imports inside main(), which violates the repo’s Python import placement rule.

Proposed fix
+import argparse
 import json
 import logging
@@
 def main() -> None:
-    import argparse
-
     parser = argparse.ArgumentParser(description="Dynamo Power Agent DaemonSet")
As per coding guidelines: "Keep all imports at the top of each file (flag any import inside functions/classes)."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/power_agent/power_agent.py` around lines 537 - 539, The argparse
import is inside the main() function; move the import to module scope by adding
"import argparse" at the top of the file and removing the in-function import in
main(), ensuring any type or usage remains correct (adjust other imports if
needed) so linting and the repo rule about top-level imports are satisfied.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@components/power_agent/power_agent.py`:
- Around line 119-124: The function _load_previously_managed_gpus should
defensively handle malformed or unreadable state files: when reading
_MANAGED_STATE_PATH, catch OSError in addition to
FileNotFoundError/JSONDecodeError, load the JSON into a variable (e.g., data =
json.load(f)), verify isinstance(data, dict) before calling
data.get("managed_uuids", []), and if the root is not a dict (or managed_uuids
is not an iterable of strings) return an empty set; ensure any
TypeError/ValueError from unexpected types is handled and results in returning
set() rather than letting startup crash.
- Around line 294-297: Replace the broad except that swallows errors from the
pynvml.nvmlShutdown() call with a targeted handler: catch pynvml.NVMLError (or
the specific pynvml error class) as e, log the exception with the module/class
logger (e.g., logger.exception("NVML shutdown failed: %s", e) or
logging.exception(...)) and then re-raise the error to avoid silent failures;
modify the try/except around pynvml.nvmlShutdown() accordingly so only
NVML-specific errors are caught, logged, and propagated.
- Around line 494-501: The loop that builds pod_annotations appends one entry
per GPU process, causing pods with multiple processes to be counted multiple
times; change the logic in the procs loop (using _extract_pod_uid_from_cgroup
and uid_to_annotation) to deduplicate by pod UID before applying multi-pod
policy — e.g., track seen UIDs (or build a uid->annotation map) and only append
one (uid, annotation) pair per unique UID so each pod is counted once when
evaluating multi-pod warnings/metrics.

In `@deploy/power_agent/daemonset.yaml`:
- Around line 16-35: The DaemonSet metadata.namespace is hardcoded to "default",
which causes mismatches with the RBAC/service account namespace; update the
manifest to use the parameterized namespace variable (e.g.
${POWER_AGENT_NAMESPACE}) instead of "default" and ensure the ServiceAccount
referenced by spec.serviceAccountName ("dynamo-power-agent") is created/bound in
that same parameterized namespace so RBAC bindings resolve correctly; locate
metadata.name ("dynamo-power-agent"), metadata.namespace, and
spec.serviceAccountName in the template to make the change.
- Around line 55-58: Replace the mutable image tag
"nvcr.io/nvidia/dynamo/power-agent:latest" with an immutable reference (a
specific version tag or an image digest), e.g.
"nvcr.io/nvidia/dynamo/power-agent:vX.Y.Z" or
"nvcr.io/nvidia/dynamo/power-agent@sha256:<digest>", so deployments are
reproducible; keep or verify imagePullPolicy (e.g. IfNotPresent) is appropriate
for pinned images and update the inline comment to note the image is
intentionally pinned.

---

Nitpick comments:
In `@components/power_agent/power_agent.py`:
- Around line 537-539: The argparse import is inside the main() function; move
the import to module scope by adding "import argparse" at the top of the file
and removing the in-function import in main(), ensuring any type or usage
remains correct (adjust other imports if needed) so linting and the repo rule
about top-level imports are satisfied.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f9faddf8-30e0-4a7b-9055-b45244b2a979

📥 Commits

Reviewing files that changed from the base of the PR and between f20ff4e and 3cde449.

📒 Files selected for processing (11)
  • .github/filters.yaml
  • components/power_agent/README.md
  • components/power_agent/power_agent.py
  • components/power_agent/tests/__init__.py
  • components/power_agent/tests/test_apply_cap.py
  • components/power_agent/tests/test_cgroup_parser.py
  • components/power_agent/tests/test_multi_pod_policy.py
  • components/power_agent/tests/test_shutdown.py
  • deploy/power_agent/daemonset.yaml
  • deploy/power_agent/dev-pod.yaml
  • deploy/power_agent/rbac.yaml

Comment thread components/power_agent/power_agent.py
Comment thread components/power_agent/power_agent.py Outdated
Comment thread components/power_agent/power_agent.py
Comment thread deploy/power_agent/daemonset.yaml Outdated
Comment thread deploy/power_agent/daemonset.yaml Outdated
@kaim-eng kaim-eng force-pushed the pr1a/power-agent branch from 3cde449 to 1338028 Compare May 18, 2026 16:03
Comment thread deploy/power_agent/rbac.yaml Outdated
Comment thread components/power_agent/power_agent.py
@kaim-eng kaim-eng force-pushed the pr1a/power-agent branch from 1338028 to ec21081 Compare May 19, 2026 12:54
kaim-eng added a commit that referenced this pull request May 19, 2026
Folds deploy/power_agent/{daemonset,rbac,dev-pod}.yaml into a single
Helm chart at deploy/helm/charts/power-agent/, resolving the three
CodeRabbit findings on PR #9682:

  * hardcoded metadata.namespace=default -> {{ .Release.Namespace }}
  * mutable image :latest -> required image.tag with fail-fast validator
  * `${POWER_AGENT_NAMESPACE}` envsubst placeholder -> native Helm templating

The chart supports three deployment shapes selectable via values:
production DaemonSet (default, cluster-wide RBAC), namespace-restricted
production (Role+RoleBinding), and an in-cluster dev-iteration Pod
mounting power_agent.py from a ConfigMap. Three template-time validators
reject foot-guns at install time: empty image.tag, mutex violations
between daemonset.enabled and dev.enabled, and dev mode without a
pinned dev.nodeName. Dev mode also automatically forces namespace-scoped
RBAC (least privilege), leveraging power_agent.py's --namespace flag.

Design rationale, scope decisions, and review-feedback responses are
captured in docs/design-docs/power-agent-helm-chart-plan.md, committed
alongside the chart.

components/power_agent/README.md flips its install recipe to
``helm install``, and the planner CI filter (.github/filters.yaml) is
retargeted from deploy/power_agent/** to deploy/helm/charts/power-agent/**.
Two examples/deployments/powerplanner/*.yaml header references live on
PR #9687 and will be updated during that PR's cascade rebase per plan
section 5.3.

Validated locally:
  helm lint                                -> 0 errors
  helm template (3 positive exercises)     -> expected resources
  helm template (3 negative exercises)     -> expected fail-fast errors
  components/power_agent/tests/            -> 43/43 passed
  .github/scripts/test-filters.js          -> 20/20 passed
  pre-commit (cross-cutting hooks)         -> all applicable passed

Signed-off-by: Kai Ma <kaim@nvidia.com>
@github-actions github-actions Bot added the deployment::k8s Relates to dynamo deployment in kubernetes label May 19, 2026
kaim-eng added a commit that referenced this pull request May 19, 2026
Updates the three reference sites in examples/deployments/powerplanner/
that previously instructed users to ``kubectl apply -f deploy/power_agent/...``,
flipping them to the new Helm chart at deploy/helm/charts/power-agent/
that landed in PR #9682:

  * disagg-power-aware.yaml header recipe
  * README.md Prerequisites + verify section
  * MULTI_DGD.md file index

Also updates the verify-pods label selector from the legacy
``app=dynamo-power-agent`` to the chart-emitted
``app.kubernetes.io/name=power-agent``.

Design docs (powerplanner-design.md, power-agent-dcgm-actuator.md,
pr9369-split-plan.md, power-agent-helm-chart-plan.md) still reference
the old paths in their historical / architectural narrative sections,
which is intentional -- those describe the pre-chart state and the
transition rationale, not current deployment instructions.

Part of the PR #9369 cascade following PR #9682''s Helm chart landing.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 19, 2026
Updates the three reference sites in examples/deployments/powerplanner/
that previously instructed users to ``kubectl apply -f deploy/power_agent/...``,
flipping them to the new Helm chart at deploy/helm/charts/power-agent/
that landed in PR #9682:

  * disagg-power-aware.yaml header recipe
  * README.md Prerequisites + verify section
  * MULTI_DGD.md file index

Also updates the verify-pods label selector from the legacy
``app=dynamo-power-agent`` to the chart-emitted
``app.kubernetes.io/name=power-agent``.

Design docs (powerplanner-design.md, power-agent-dcgm-actuator.md,
pr9369-split-plan.md, power-agent-helm-chart-plan.md) still reference
the old paths in their historical / architectural narrative sections,
which is intentional -- those describe the pre-chart state and the
transition rationale, not current deployment instructions.

Part of the PR #9369 cascade following PR #9682''s Helm chart landing.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 20, 2026
Aligns docs/design-docs/power-agent-helm-chart-plan.md with the v1.1.0
chart that PR #9790 ships. The plan was authored for the NVML-only
v1.0.0 chart in PR #9682 and went stale once PR #9790 layered
agent.actuator + agent.dcgm.* + validateActuator + validateEnforce +
the helm-unittest suite on top.

Two changelog rows capture the refresh:
  v1.3 - first pass: 8 reviewer findings closed (3 blocking,
                     4 major, 2 medium, 1 low) on Status header,
                     values surface, dev ConfigMap recipe, image-tag
                     pinning, helm-unittest gating, internal-dev-doc
                     references, daemonset name, file/LOC accounting.
  v1.4 - second pass: 5 follow-up findings closed (2 major,
                     2 medium, 1 low) on Status self-contradiction,
                     stale section 4.2 dev-block comment, overstated
                     helm-unittest coverage prose, unrunnable
                     section 5.4 positive helm template overlays
                     (missing --set image.tag), and the lingering
                     filename reference in the v1.3 changelog.

No design reversal: every section 6 decision and the chart shape are
unchanged. Only the values surface (extended), helper set (extended
with two template-time validators), and validation-gate list
(helm-unittest now required) grew. Every claim in the refreshed
doc was verified against on-disk state: power_agent.py:706-804
for the 8-flag CLI surface, values.yaml for the dev-block recipe,
_helpers.tpl for the five-helper set, and the two
tests/validate_*_test.yaml files for the 24 enumerated unittest
cases.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 20, 2026
…ical OCI form

Addresses three iterative PR9682 CodeRabbit review findings on the
chart's image-pinning surface. Bumps Chart.yaml 1.1.0 -> 1.2.0 (minor
per SemVer - image.digest is purely additive; existing tag-pinned
installs render byte-identically and need no change).

Round 1 (original PR9682 finding): image.tag=latest was rejected only
at the raw-manifest layer, not at the chart layer. helm template
--set image.tag=latest happily rendered nvcr.io/nvidia/dynamo/
power-agent:latest because validateImageTag only checked
"if not .Values.image.tag". Now case-insensitively rejects the literal
"latest" (LATEST, Latest, " latest " all fail) while still accepting
release tags that contain the substring (v1.0.0-latest-rc).

Round 2 (digest path rendered an invalid OCI reference): operators
were advised in values.yaml + README + tests to set image.tag=
sha256:abc... for digest pinning. This rendered repo:sha256:... -
INVALID per the OCI spec, which requires repo@sha256:... for digest
form (the second ":" in the rendered string parses as repo + tag
"sha256" with a stray suffix; kubelet manifest pulls fail with an
opaque error). Three changes close this:
  - values.yaml: add image.digest field with per-rule comment
    block explaining the OCI form, the 64-hex requirement, the
    mutex with image.tag, and the PR #9682 rationale.
  - _helpers.tpl: split validateImageTag into a full rule set
    (XOR(tag, digest), whitespace rejection on both, latest
    rejection on tag, sha256: prefix rejection on tag, exact-
    64-hex regex on digest) and add a new imageRef helper that
    emits {repo}@{digest} when digest is set, else {repo}:{tag}.
  - daemonset.yaml: image: line now routes through imageRef
    instead of hard-coding the ":" join, so the canonical OCI
    form is the only thing that can ship.

Round 3 (validator was too lenient):
  - Truncated SHA-256 digests slipped through: regex was
    ^sha256:[0-9a-fA-F]{32,}$, so half-digests (a common pattern
    when an operator copies the first 8 bytes from
    `docker images --digests` output) rendered as
    repo@sha256:<truncated>. The kubelet then fails the pull with
    an opaque manifest-mismatch error. Tightened to {64} exactly
    (SHA-256 is 32 bytes x 2 nybbles).
  - Whitespace-padded values rendered raw: --set-string
    'image.tag= v1.1.0 ' produced image: "repo: v1.1.0 " because
    the latest comparison trimmed but the renderer used the raw
    value. Choice: reject (don't silently normalize) so the
    operator sees their --set quoting typo. Applies symmetrically
    to image.digest.

tests/validate_image_tag_test.yaml (new, 22 cases - chart
helm-unittest count grew 24 -> 46):
- Negative: unset (both empty), latest in 4 case variants,
  whitespace on tag (leading/trailing/both), tag+digest mutex,
  sha256: prefix on tag, non-digest on digest, 32/63/65-hex digest
  (off-by-one + half-length), whitespace on digest.
- Positive: pinned release tag renders {repo}:{tag};
  v1.0.0-latest-rc accepted (substring match guard); 64-hex digest
  renders the canonical {repo}@sha256:... form; uppercase-hex
  digest preserved verbatim.

README updates: production-install section now shows both tag and
digest invocations side by side with a callout that digests live
on image.digest, not image.tag. Values table gains the image.digest
row. Troubleshooting section adds three new error-message entries
covering missing-pin, invalid-digest, and mutex-violation cases.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 20, 2026
Aligns docs/design-docs/power-agent-helm-chart-plan.md with the
v1.2.0 chart that the preceding commit ships. The plan was authored
for the v1.0.0 NVML-only chart in PR #9682, refreshed to v1.1.0 in
the PR #9790 dual-actuator work, and went stale again once v1.2.0
landed image.digest + the canonical-OCI imageRef helper on top.

Four reviewer findings closed (v1.5 changelog row covers all four):

(1) Truncated SHA-256 digests slipped through the v1.1.0 helper -
already fixed in the preceding chart commit; this commit propagates
the rule into the §4.4 helper snippet and the §5.4 validation gates.

(2) Whitespace-padded tags rendered raw image references - same
treatment: §4.4 documents the trim-then-reject contract.

(3) §4.3 / §4.4 / §5.4 stale on image.digest:
  - §4.3 CodeRabbit-comment row rewritten to describe both
    helpers (validateImageTag + imageRef), both fields
    (tag + digest), the canonical repo@digest form, and the
    PR9682 follow-up that added the separate field.
  - §4.4 helper snippet replaced with the actual v1.2.0
    validator (full rule set + per-rule rationale comments)
    plus the new imageRef helper.
  - §5.4 expected helm-unittest output bumped 24 -> 46 passed
    with a one-line breakdown of the +22 new cases.

(4) §4.1 / §4.2 stale on file count + helper list + values surface:
  - §4.1 said 13 files (omitted .helmignore and the new
    validate_image_tag_test.yaml). Corrected to 14 files (7
    templates + 3 helm-unittests + 4 root files). The _helpers.tpl
    helper-list line gained imageRef and chart entries. The
    daemonset.yaml annotation now mentions it routes through
    imageRef. LOC estimate bumped ~1,430 -> ~1,700 with a
    breakdown of what v1.2.0 added on top of v1.0.0 / v1.1.0.
  - §4.2 values snippet had only image.tag with no image.digest.
    Added the field with the same per-rule comment block that
    ships in values.yaml (OCI form, 64-hex requirement, mutex
    with image.tag, PR9682 rationale).

Status header bumped to chart v1.2.0 with a one-line v1.2.0
rationale (additive opt-in field, no breaking change to existing
tag-pinned installs). Revision-history table gains the v1.5 entry
summarising all four findings; older v1.1 - v1.4 entries
untouched.

No design reversal: every §6 decision and the chart shape are
unchanged. Only the values surface (one additive field), helper
set (two helpers: imageRef new, validateImageTag rewritten), and
helm-unittest count (24 -> 46) grew. Every claim in the refreshed
doc was verified against on-disk state: Chart.yaml for the
version bump, values.yaml for the image.digest field comment,
_helpers.tpl for both helpers, tests/validate_image_tag_test.yaml
for the 22 enumerated cases (verified via `helm unittest`).

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 22, 2026
When rbac.namespaceRestricted=true the chart renders a namespace-scoped
Role + RoleBinding (templates/role.yaml + rolebinding.yaml) instead of
the cluster-scoped pair, but the DaemonSet's container command never
passes --namespace to the agent. power_agent.py:438-446 therefore takes
its else branch — list_pod_for_all_namespaces — which the namespace-only
token rejects with 403 Forbidden on every reconcile. Reconciles never
advance and no caps are ever applied (silent failure: the agent doesn't
exit; pods are just never enumerated).

The --namespace CLI flag itself already exists (power_agent.py:542) and
the list_namespaced_pod branch (:439-442) works correctly — the dev-pod
template (templates/dev-pod.yaml:57) has been wiring this all along via
--namespace=$POD_NAMESPACE. This commit extends the same downward-API
POD_NAMESPACE env-var pattern to the production DaemonSet, gated on the
existing power-agent.effectiveNamespaceRestricted helper so both:

  - production mode with rbac.namespaceRestricted=true
  - dev mode (forces effective=true unless overridden)

produce a DaemonSet whose argv matches the RBAC scope its token holds.

Verified by helm template against both modes:
  - default (false) → no --namespace, no POD_NAMESPACE, ClusterRole
  - true            → --namespace=$(POD_NAMESPACE), POD_NAMESPACE env
                      via downward API, Role + RoleBinding
  - helm lint passes in both modes

Refs: PR #9682 review, Power Agent live-test finding #3.
Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 22, 2026
Steady-state RSS of the agent (pynvml + kubernetes client + prometheus
exporter) is ~80 MiB on an idle 8-GPU node — the 128Mi default left
only ~48 MiB of headroom, breaking under both routine inspection
(`kubectl exec --container=power-agent -- python -c "..."` re-imports
the kubernetes client in the same cgroup; observed RSS spike > 128 MiB,
two OOM-kills during 2026-05-21 live testing on AKS dpp-dev-env) and
transient peaks in multi-pod conflict resolution where the UID →
annotation dict + PID → UUID map are both held in memory simultaneously
(linear in pod count per node).

Bump:
  limits.memory:   128Mi → 256Mi   (~3× steady-state RSS headroom)
  requests.memory:  64Mi →  96Mi   (more honest scheduling floor)

CPU envelope unchanged — agent reconciles every 15 s and the heavy
calls (pynvml + list_pod_for_all_namespaces) are well under 200m on
fleet inspection (median: 12 m, p99: 78 m).

Fleet cost: at 1000 GPU nodes the extra 128 MiB per node is 125 GiB,
which is negligible compared to the operational cost of a silently
OOM-killed power-cap controller (the failure mode is a node whose
caps slowly drift back to safeDefaultWatts on the next reconcile that
doesn't fit the budget, with no obvious alert signal — the agent's
Pod restart counter is the only indicator).

Verified: helm template + helm lint clean.

Refs: PR #9682 review, Power Agent live-test finding #4.
Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 25, 2026
Folds deploy/power_agent/{daemonset,rbac,dev-pod}.yaml into a single
Helm chart at deploy/helm/charts/power-agent/, resolving the three
CodeRabbit findings on PR #9682:

  * hardcoded metadata.namespace=default -> {{ .Release.Namespace }}
  * mutable image :latest -> required image.tag with fail-fast validator
  * `${POWER_AGENT_NAMESPACE}` envsubst placeholder -> native Helm templating

The chart supports three deployment shapes selectable via values:
production DaemonSet (default, cluster-wide RBAC), namespace-restricted
production (Role+RoleBinding), and an in-cluster dev-iteration Pod
mounting power_agent.py from a ConfigMap. Three template-time validators
reject foot-guns at install time: empty image.tag, mutex violations
between daemonset.enabled and dev.enabled, and dev mode without a
pinned dev.nodeName. Dev mode also automatically forces namespace-scoped
RBAC (least privilege), leveraging power_agent.py's --namespace flag.

Design rationale, scope decisions, and review-feedback responses are
captured in docs/design-docs/power-agent-helm-chart-plan.md, committed
alongside the chart.

components/power_agent/README.md flips its install recipe to
``helm install``, and the planner CI filter (.github/filters.yaml) is
retargeted from deploy/power_agent/** to deploy/helm/charts/power-agent/**.
Two examples/deployments/powerplanner/*.yaml header references live on
PR #9687 and will be updated during that PR's cascade rebase per plan
section 5.3.

Validated locally:
  helm lint                                -> 0 errors
  helm template (3 positive exercises)     -> expected resources
  helm template (3 negative exercises)     -> expected fail-fast errors
  components/power_agent/tests/            -> 43/43 passed
  .github/scripts/test-filters.js          -> 20/20 passed
  pre-commit (cross-cutting hooks)         -> all applicable passed

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 25, 2026
When rbac.namespaceRestricted=true the chart renders a namespace-scoped
Role + RoleBinding (templates/role.yaml + rolebinding.yaml) instead of
the cluster-scoped pair, but the DaemonSet's container command never
passes --namespace to the agent. power_agent.py:438-446 therefore takes
its else branch — list_pod_for_all_namespaces — which the namespace-only
token rejects with 403 Forbidden on every reconcile. Reconciles never
advance and no caps are ever applied (silent failure: the agent doesn't
exit; pods are just never enumerated).

The --namespace CLI flag itself already exists (power_agent.py:542) and
the list_namespaced_pod branch (:439-442) works correctly — the dev-pod
template (templates/dev-pod.yaml:57) has been wiring this all along via
--namespace=$POD_NAMESPACE. This commit extends the same downward-API
POD_NAMESPACE env-var pattern to the production DaemonSet, gated on the
existing power-agent.effectiveNamespaceRestricted helper so both:

  - production mode with rbac.namespaceRestricted=true
  - dev mode (forces effective=true unless overridden)

produce a DaemonSet whose argv matches the RBAC scope its token holds.

Verified by helm template against both modes:
  - default (false) → no --namespace, no POD_NAMESPACE, ClusterRole
  - true            → --namespace=$(POD_NAMESPACE), POD_NAMESPACE env
                      via downward API, Role + RoleBinding
  - helm lint passes in both modes

Refs: PR #9682 review, Power Agent live-test finding #3.
Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 25, 2026
Steady-state RSS of the agent (pynvml + kubernetes client + prometheus
exporter) is ~80 MiB on an idle 8-GPU node — the 128Mi default left
only ~48 MiB of headroom, breaking under both routine inspection
(`kubectl exec --container=power-agent -- python -c "..."` re-imports
the kubernetes client in the same cgroup; observed RSS spike > 128 MiB,
two OOM-kills during 2026-05-21 live testing on AKS dpp-dev-env) and
transient peaks in multi-pod conflict resolution where the UID →
annotation dict + PID → UUID map are both held in memory simultaneously
(linear in pod count per node).

Bump:
  limits.memory:   128Mi → 256Mi   (~3× steady-state RSS headroom)
  requests.memory:  64Mi →  96Mi   (more honest scheduling floor)

CPU envelope unchanged — agent reconciles every 15 s and the heavy
calls (pynvml + list_pod_for_all_namespaces) are well under 200m on
fleet inspection (median: 12 m, p99: 78 m).

Fleet cost: at 1000 GPU nodes the extra 128 MiB per node is 125 GiB,
which is negligible compared to the operational cost of a silently
OOM-killed power-cap controller (the failure mode is a node whose
caps slowly drift back to safeDefaultWatts on the next reconcile that
doesn't fit the budget, with no obvious alert signal — the agent's
Pod restart counter is the only indicator).

Verified: helm template + helm lint clean.

Refs: PR #9682 review, Power Agent live-test finding #4.
Signed-off-by: Kai Ma <kaim@nvidia.com>
@kaim-eng kaim-eng force-pushed the pr1a/power-agent branch from 0eda1d7 to 1e1ef6f Compare May 25, 2026 13:43
kaim-eng added a commit that referenced this pull request May 25, 2026
… log, pod-UID dedup, argparse)

Folds the four CodeRabbit findings on the foundation PR back into this branch so each fix lives with the code it changes, rather than leaking into the downstream DCGM-actuator PR (#9790):

* _load_previously_managed_gpus: catch OSError (not just FileNotFoundError) and validate that the JSON root is a dict and managed_uuids is a list. Malformed state files now log a warning and return an empty set instead of crashing the agent at startup.

* _handle_sigterm: replace 'except Exception: pass' on pynvml.nvmlShutdown() with logger.exception so shutdown-time NVML faults appear in pod logs. We still fall through to _shutdown.set() so SIGTERM never hangs the container.

* _reconcile_gpu: dedup the (pod_uid, annotation) list by UID before applying multi-pod policy. A single pod with N PIDs on one GPU was being counted as N pods, falsely tripping the multi-pod-conflict branch and the multi_pod_gpu_total metric.

* main(): move 'import argparse' to module scope per the project's import-placement convention.

Regression coverage: existing components/power_agent/tests/ suite (43 tests) still passes locally; behavior-specific tests for these four fixes already live on PR #9790 and remain there.

Signed-off-by: Kai Ma <kaim@nvidia.com>
Comment thread docs/design-docs/power-agent-helm-chart-plan.md Outdated
Comment thread deploy/helm/charts/power-agent/values.yaml Outdated
Comment thread components/power_agent/power_agent.py
Comment thread docs/design-docs/power-agent-helm-chart-plan.md Outdated
kaim-eng added a commit that referenced this pull request Jun 3, 2026
The Helm chart's values.yaml points image.repository at
nvcr.io/nvidia/dynamo/power-agent, but nothing in the tree built or
published that image (PR #9682 review by @sttts: "where is it built?").

Add components/power_agent/Dockerfile (single-stage python:3.11-slim,
NVML injected at runtime via runtimeClassName: nvidia, deps pynvml /
kubernetes / prometheus-client) and a dedicated power-agent build job
in pr.yaml + post-merge-ci.yml, gated on a new power_agent changed-files
filter, following the snapshot-agent precedent.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request Jun 3, 2026
The design-doc header still read "Draft v1 - no code authored yet"
(PR #9682 review by @sttts: "needs update?"), but the chart, templates,
and power_agent.py are all in the tree on this PR. Flip Status to
"Implemented", point at the shipped chart (v1.0.0) and the new
Dockerfile / power-agent CI job, and add a v1.3 revision-history row.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request Jun 3, 2026
values.yaml image.repository pointed at nvcr.io/nvidia/dynamo/power-agent,
but every other Dynamo image (vllm-runtime, sglang-runtime, frontend,
dynamo-planner, snapshot-agent) is published under
nvcr.io/nvidia/ai-dynamo/. The dynamo/ path was an unintentional
divergence (PR #9682 review by @sttts). Align values.yaml, the chart
README values table, and the design-doc values-surface block to
nvcr.io/nvidia/ai-dynamo/power-agent.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added 9 commits June 3, 2026 12:35
Adds the Power Agent DaemonSet (privileged, hostPID) that watches pods
for the dynamo.nvidia.com/gpu-power-limit annotation, parses
/proc/{pid}/cgroup to recover pod UIDs (handles all K8s QoS x cgroup
driver x CRI runtime variants), and calls
nvmlDeviceSetPowerManagementLimit per physical GPU.

Includes NVML cap clamping to SKU-specific [min_w, max_w] bounds,
UUID-gated orphan-cap restoration on startup, SIGTERM-handler default-
TGP restore, and a multi-pod-per-GPU policy that falls back to
power_agent_safe_default_watts on conflict.

43 unit tests covering apply_cap, cgroup_parser, multi_pod_policy,
shutdown. Standalone component - no coupling to planner internals.

CI filter (.github/filters.yaml) extends the planner group to cover the
new component + deploy paths so changes here trip the planner job set.

Part of the PR #9369 split (PR 1a of 6). See docs/design-docs/pr9369-split-plan.md.

Signed-off-by: Kai Ma <kaim@nvidia.com>
Folds deploy/power_agent/{daemonset,rbac,dev-pod}.yaml into a single
Helm chart at deploy/helm/charts/power-agent/, resolving the three
CodeRabbit findings on PR #9682:

  * hardcoded metadata.namespace=default -> {{ .Release.Namespace }}
  * mutable image :latest -> required image.tag with fail-fast validator
  * `${POWER_AGENT_NAMESPACE}` envsubst placeholder -> native Helm templating

The chart supports three deployment shapes selectable via values:
production DaemonSet (default, cluster-wide RBAC), namespace-restricted
production (Role+RoleBinding), and an in-cluster dev-iteration Pod
mounting power_agent.py from a ConfigMap. Three template-time validators
reject foot-guns at install time: empty image.tag, mutex violations
between daemonset.enabled and dev.enabled, and dev mode without a
pinned dev.nodeName. Dev mode also automatically forces namespace-scoped
RBAC (least privilege), leveraging power_agent.py's --namespace flag.

Design rationale, scope decisions, and review-feedback responses are
captured in docs/design-docs/power-agent-helm-chart-plan.md, committed
alongside the chart.

components/power_agent/README.md flips its install recipe to
``helm install``, and the planner CI filter (.github/filters.yaml) is
retargeted from deploy/power_agent/** to deploy/helm/charts/power-agent/**.
Two examples/deployments/powerplanner/*.yaml header references live on
PR #9687 and will be updated during that PR's cascade rebase per plan
section 5.3.

Validated locally:
  helm lint                                -> 0 errors
  helm template (3 positive exercises)     -> expected resources
  helm template (3 negative exercises)     -> expected fail-fast errors
  components/power_agent/tests/            -> 43/43 passed
  .github/scripts/test-filters.js          -> 20/20 passed
  pre-commit (cross-cutting hooks)         -> all applicable passed

Signed-off-by: Kai Ma <kaim@nvidia.com>
…S.txt

copyright-checks (regex requiring `#`-prefixed SPDX lines) rejected the
Go-template-comment header in _helpers.tpl and the missing header in
NOTES.txt on commit 8e490a7. Switch both to the same 2-line
`#`-prefixed form already used by the other 9 chart files, aligning
with @grahamking's PR review preference (minimal SPDX, no Apache
boilerplate).

For NOTES.txt this adds two visible lines to the post-`helm install`
banner, matching the project convention established by
deploy/helm/charts/platform/templates/NOTES.txt.

For _helpers.tpl the header sits outside any `{{- define -}}` block, so
it is never invoked at render time and is invisible to consumers
despite not being inside a Go template comment.

Re-validated locally:
  helm lint                              -> 0 errors
  helm template (default mode)           -> 4 expected kinds
  pre-commit on the two touched files    -> all applicable hooks pass

Signed-off-by: Kai Ma <kaim@nvidia.com>
When rbac.namespaceRestricted=true the chart renders a namespace-scoped
Role + RoleBinding (templates/role.yaml + rolebinding.yaml) instead of
the cluster-scoped pair, but the DaemonSet's container command never
passes --namespace to the agent. power_agent.py:438-446 therefore takes
its else branch — list_pod_for_all_namespaces — which the namespace-only
token rejects with 403 Forbidden on every reconcile. Reconciles never
advance and no caps are ever applied (silent failure: the agent doesn't
exit; pods are just never enumerated).

The --namespace CLI flag itself already exists (power_agent.py:542) and
the list_namespaced_pod branch (:439-442) works correctly — the dev-pod
template (templates/dev-pod.yaml:57) has been wiring this all along via
--namespace=$POD_NAMESPACE. This commit extends the same downward-API
POD_NAMESPACE env-var pattern to the production DaemonSet, gated on the
existing power-agent.effectiveNamespaceRestricted helper so both:

  - production mode with rbac.namespaceRestricted=true
  - dev mode (forces effective=true unless overridden)

produce a DaemonSet whose argv matches the RBAC scope its token holds.

Verified by helm template against both modes:
  - default (false) → no --namespace, no POD_NAMESPACE, ClusterRole
  - true            → --namespace=$(POD_NAMESPACE), POD_NAMESPACE env
                      via downward API, Role + RoleBinding
  - helm lint passes in both modes

Refs: PR #9682 review, Power Agent live-test finding #3.
Signed-off-by: Kai Ma <kaim@nvidia.com>
Steady-state RSS of the agent (pynvml + kubernetes client + prometheus
exporter) is ~80 MiB on an idle 8-GPU node — the 128Mi default left
only ~48 MiB of headroom, breaking under both routine inspection
(`kubectl exec --container=power-agent -- python -c "..."` re-imports
the kubernetes client in the same cgroup; observed RSS spike > 128 MiB,
two OOM-kills during 2026-05-21 live testing on AKS dpp-dev-env) and
transient peaks in multi-pod conflict resolution where the UID →
annotation dict + PID → UUID map are both held in memory simultaneously
(linear in pod count per node).

Bump:
  limits.memory:   128Mi → 256Mi   (~3× steady-state RSS headroom)
  requests.memory:  64Mi →  96Mi   (more honest scheduling floor)

CPU envelope unchanged — agent reconciles every 15 s and the heavy
calls (pynvml + list_pod_for_all_namespaces) are well under 200m on
fleet inspection (median: 12 m, p99: 78 m).

Fleet cost: at 1000 GPU nodes the extra 128 MiB per node is 125 GiB,
which is negligible compared to the operational cost of a silently
OOM-killed power-cap controller (the failure mode is a node whose
caps slowly drift back to safeDefaultWatts on the next reconcile that
doesn't fit the budget, with no obvious alert signal — the agent's
Pod restart counter is the only indicator).

Verified: helm template + helm lint clean.

Refs: PR #9682 review, Power Agent live-test finding #4.
Signed-off-by: Kai Ma <kaim@nvidia.com>
… log, pod-UID dedup, argparse)

Folds the four CodeRabbit findings on the foundation PR back into this branch so each fix lives with the code it changes, rather than leaking into the downstream DCGM-actuator PR (#9790):

* _load_previously_managed_gpus: catch OSError (not just FileNotFoundError) and validate that the JSON root is a dict and managed_uuids is a list. Malformed state files now log a warning and return an empty set instead of crashing the agent at startup.

* _handle_sigterm: replace 'except Exception: pass' on pynvml.nvmlShutdown() with logger.exception so shutdown-time NVML faults appear in pod logs. We still fall through to _shutdown.set() so SIGTERM never hangs the container.

* _reconcile_gpu: dedup the (pod_uid, annotation) list by UID before applying multi-pod policy. A single pod with N PIDs on one GPU was being counted as N pods, falsely tripping the multi-pod-conflict branch and the multi_pod_gpu_total metric.

* main(): move 'import argparse' to module scope per the project's import-placement convention.

Regression coverage: existing components/power_agent/tests/ suite (43 tests) still passes locally; behavior-specific tests for these four fixes already live on PR #9790 and remain there.

Signed-off-by: Kai Ma <kaim@nvidia.com>
The Helm chart's values.yaml points image.repository at
nvcr.io/nvidia/dynamo/power-agent, but nothing in the tree built or
published that image (PR #9682 review by @sttts: "where is it built?").

Add components/power_agent/Dockerfile (single-stage python:3.11-slim,
NVML injected at runtime via runtimeClassName: nvidia, deps pynvml /
kubernetes / prometheus-client) and a dedicated power-agent build job
in pr.yaml + post-merge-ci.yml, gated on a new power_agent changed-files
filter, following the snapshot-agent precedent.

Signed-off-by: Kai Ma <kaim@nvidia.com>
The design-doc header still read "Draft v1 - no code authored yet"
(PR #9682 review by @sttts: "needs update?"), but the chart, templates,
and power_agent.py are all in the tree on this PR. Flip Status to
"Implemented", point at the shipped chart (v1.0.0) and the new
Dockerfile / power-agent CI job, and add a v1.3 revision-history row.

Signed-off-by: Kai Ma <kaim@nvidia.com>
values.yaml image.repository pointed at nvcr.io/nvidia/dynamo/power-agent,
but every other Dynamo image (vllm-runtime, sglang-runtime, frontend,
dynamo-planner, snapshot-agent) is published under
nvcr.io/nvidia/ai-dynamo/. The dynamo/ path was an unintentional
divergence (PR #9682 review by @sttts). Align values.yaml, the chart
README values table, and the design-doc values-surface block to
nvcr.io/nvidia/ai-dynamo/power-agent.

Signed-off-by: Kai Ma <kaim@nvidia.com>
@kaim-eng kaim-eng force-pushed the pr1a/power-agent branch from 376c6bb to 6cd82e8 Compare June 3, 2026 16:42
@kaim-eng

kaim-eng commented Jun 3, 2026

Copy link
Copy Markdown
Author

@sttts — thanks for the review. All four of your comments are addressed, and I've rebased the branch onto current main to clear a merge conflict. Quick summary so you can re-check:

Your comment Resolution
Design doc "needs update?" Flipped Status from "Draft v1 — no code authored yet" to Implemented (chart version: 1.0.0/appVersion: "1.0.0"), plus a v1.3 revision row. (docs/design-docs/power-agent-helm-chart-plan.md)
values.yaml "where is it built?" Added components/power_agent/Dockerfile and a dedicated power-agent build-and-push job in pr.yaml + post-merge-ci.yml, gated on a new power_agent changed-files filter (mirrors the snapshot-agent precedent).
power_agent.py:499 "impact when this fails silently?" Made the failure path explicit (commit b05adaf5eb): _list_pods_on_node() now returns Optional[list]None on an API error (never []), and reconcile_once() keys its fail-safe off that None, skipping the whole cycle with a distinct warning log ("Pod listing unavailable this cycle; skipping reconcile to preserve last-known-good caps.") so every GPU freezes at its last-known-good cap. A successful empty listing still drives a normal pass. Covered by tests/test_reconcile_failsafe.py.
Registry path "other images are under nvcr.io/nvidia/ai-dynamo/. Intentional?" Not intentional — fixed to nvcr.io/nvidia/ai-dynamo/power-agent in values.yaml, the chart README, and the design-doc §4.2 block.

Rebase: the branch was 186 commits behind main and CONFLICTING; it's now rebased (only conflict was the changed-files COVERED_FILES line, where I kept both upstream's new efa entry and the new power_agent entry). The PR is now mergeable and all pre-merge checks are green.

kaim-eng added a commit that referenced this pull request Jun 8, 2026
Aligns docs/design-docs/power-agent-helm-chart-plan.md with the v1.1.0
chart that PR #9790 ships. The plan was authored for the NVML-only
v1.0.0 chart in PR #9682 and went stale once PR #9790 layered
agent.actuator + agent.dcgm.* + validateActuator + validateEnforce +
the helm-unittest suite on top.

Two changelog rows capture the refresh:
  v1.3 - first pass: 8 reviewer findings closed (3 blocking,
                     4 major, 2 medium, 1 low) on Status header,
                     values surface, dev ConfigMap recipe, image-tag
                     pinning, helm-unittest gating, internal-dev-doc
                     references, daemonset name, file/LOC accounting.
  v1.4 - second pass: 5 follow-up findings closed (2 major,
                     2 medium, 1 low) on Status self-contradiction,
                     stale section 4.2 dev-block comment, overstated
                     helm-unittest coverage prose, unrunnable
                     section 5.4 positive helm template overlays
                     (missing --set image.tag), and the lingering
                     filename reference in the v1.3 changelog.

No design reversal: every section 6 decision and the chart shape are
unchanged. Only the values surface (extended), helper set (extended
with two template-time validators), and validation-gate list
(helm-unittest now required) grew. Every claim in the refreshed
doc was verified against on-disk state: power_agent.py:706-804
for the 8-flag CLI surface, values.yaml for the dev-block recipe,
_helpers.tpl for the five-helper set, and the two
tests/validate_*_test.yaml files for the 24 enumerated unittest
cases.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request Jun 8, 2026
…ical OCI form

Addresses three iterative PR9682 CodeRabbit review findings on the
chart's image-pinning surface. Bumps Chart.yaml 1.1.0 -> 1.2.0 (minor
per SemVer - image.digest is purely additive; existing tag-pinned
installs render byte-identically and need no change).

Round 1 (original PR9682 finding): image.tag=latest was rejected only
at the raw-manifest layer, not at the chart layer. helm template
--set image.tag=latest happily rendered nvcr.io/nvidia/dynamo/
power-agent:latest because validateImageTag only checked
"if not .Values.image.tag". Now case-insensitively rejects the literal
"latest" (LATEST, Latest, " latest " all fail) while still accepting
release tags that contain the substring (v1.0.0-latest-rc).

Round 2 (digest path rendered an invalid OCI reference): operators
were advised in values.yaml + README + tests to set image.tag=
sha256:abc... for digest pinning. This rendered repo:sha256:... -
INVALID per the OCI spec, which requires repo@sha256:... for digest
form (the second ":" in the rendered string parses as repo + tag
"sha256" with a stray suffix; kubelet manifest pulls fail with an
opaque error). Three changes close this:
  - values.yaml: add image.digest field with per-rule comment
    block explaining the OCI form, the 64-hex requirement, the
    mutex with image.tag, and the PR #9682 rationale.
  - _helpers.tpl: split validateImageTag into a full rule set
    (XOR(tag, digest), whitespace rejection on both, latest
    rejection on tag, sha256: prefix rejection on tag, exact-
    64-hex regex on digest) and add a new imageRef helper that
    emits {repo}@{digest} when digest is set, else {repo}:{tag}.
  - daemonset.yaml: image: line now routes through imageRef
    instead of hard-coding the ":" join, so the canonical OCI
    form is the only thing that can ship.

Round 3 (validator was too lenient):
  - Truncated SHA-256 digests slipped through: regex was
    ^sha256:[0-9a-fA-F]{32,}$, so half-digests (a common pattern
    when an operator copies the first 8 bytes from
    `docker images --digests` output) rendered as
    repo@sha256:<truncated>. The kubelet then fails the pull with
    an opaque manifest-mismatch error. Tightened to {64} exactly
    (SHA-256 is 32 bytes x 2 nybbles).
  - Whitespace-padded values rendered raw: --set-string
    'image.tag= v1.1.0 ' produced image: "repo: v1.1.0 " because
    the latest comparison trimmed but the renderer used the raw
    value. Choice: reject (don't silently normalize) so the
    operator sees their --set quoting typo. Applies symmetrically
    to image.digest.

tests/validate_image_tag_test.yaml (new, 22 cases - chart
helm-unittest count grew 24 -> 46):
- Negative: unset (both empty), latest in 4 case variants,
  whitespace on tag (leading/trailing/both), tag+digest mutex,
  sha256: prefix on tag, non-digest on digest, 32/63/65-hex digest
  (off-by-one + half-length), whitespace on digest.
- Positive: pinned release tag renders {repo}:{tag};
  v1.0.0-latest-rc accepted (substring match guard); 64-hex digest
  renders the canonical {repo}@sha256:... form; uppercase-hex
  digest preserved verbatim.

README updates: production-install section now shows both tag and
digest invocations side by side with a callout that digests live
on image.digest, not image.tag. Values table gains the image.digest
row. Troubleshooting section adds three new error-message entries
covering missing-pin, invalid-digest, and mutex-violation cases.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request Jun 8, 2026
Aligns docs/design-docs/power-agent-helm-chart-plan.md with the
v1.2.0 chart that the preceding commit ships. The plan was authored
for the v1.0.0 NVML-only chart in PR #9682, refreshed to v1.1.0 in
the PR #9790 dual-actuator work, and went stale again once v1.2.0
landed image.digest + the canonical-OCI imageRef helper on top.

Four reviewer findings closed (v1.5 changelog row covers all four):

(1) Truncated SHA-256 digests slipped through the v1.1.0 helper -
already fixed in the preceding chart commit; this commit propagates
the rule into the §4.4 helper snippet and the §5.4 validation gates.

(2) Whitespace-padded tags rendered raw image references - same
treatment: §4.4 documents the trim-then-reject contract.

(3) §4.3 / §4.4 / §5.4 stale on image.digest:
  - §4.3 CodeRabbit-comment row rewritten to describe both
    helpers (validateImageTag + imageRef), both fields
    (tag + digest), the canonical repo@digest form, and the
    PR9682 follow-up that added the separate field.
  - §4.4 helper snippet replaced with the actual v1.2.0
    validator (full rule set + per-rule rationale comments)
    plus the new imageRef helper.
  - §5.4 expected helm-unittest output bumped 24 -> 46 passed
    with a one-line breakdown of the +22 new cases.

(4) §4.1 / §4.2 stale on file count + helper list + values surface:
  - §4.1 said 13 files (omitted .helmignore and the new
    validate_image_tag_test.yaml). Corrected to 14 files (7
    templates + 3 helm-unittests + 4 root files). The _helpers.tpl
    helper-list line gained imageRef and chart entries. The
    daemonset.yaml annotation now mentions it routes through
    imageRef. LOC estimate bumped ~1,430 -> ~1,700 with a
    breakdown of what v1.2.0 added on top of v1.0.0 / v1.1.0.
  - §4.2 values snippet had only image.tag with no image.digest.
    Added the field with the same per-rule comment block that
    ships in values.yaml (OCI form, 64-hex requirement, mutex
    with image.tag, PR9682 rationale).

Status header bumped to chart v1.2.0 with a one-line v1.2.0
rationale (additive opt-in field, no breaking change to existing
tag-pinned installs). Revision-history table gains the v1.5 entry
summarising all four findings; older v1.1 - v1.4 entries
untouched.

No design reversal: every §6 decision and the chart shape are
unchanged. Only the values surface (one additive field), helper
set (two helpers: imageRef new, validateImageTag rewritten), and
helm-unittest count (24 -> 46) grew. Every claim in the refreshed
doc was verified against on-disk state: Chart.yaml for the
version bump, values.yaml for the image.digest field comment,
_helpers.tpl for both helpers, tests/validate_image_tag_test.yaml
for the 22 enumerated cases (verified via `helm unittest`).

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request Jun 8, 2026
Rebase follow-up: dev-mode examples, helm-unittest fixtures, and design
docs still referenced nvcr.io/nvidia/dynamo/power-agent after #9682
standardised on nvcr.io/nvidia/ai-dynamo/power-agent.

Signed-off-by: Kai Ma <kaim@nvidia.com>
@sttts

sttts commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Some Codex xhigh findings:

  - [P1] Released installs point at an image that is never published. deploy/helm/charts/power-agent/values.yaml:8 defaults to nvcr.io/nvidia/ai-dynamo/power-agent, but the new CI only builds internal SHA tags in ECR/ACR (.github/workflows/post-merge-ci.yml:150). The release workflow still copies operator/planner/snapshot and stops at snapshot (.github/workflows/release.yml:409), with no power-agent NGC copy. Result: documented release installs can hit ImagePullBackOff.

Seems real. Needs an addition in release.yml.

  - [P1] Caps are not restored when a managed workload exits normally. _apply_cap() records the GPU as managed after setting a cap (components/power_agent/power_agent.py:276), but _reconcile_gpu() returns immediately when nvmlDeviceGetComputeRunningProcesses() becomes empty (components/power_agent/power_agent.py:515). That leaves the old cap in place indefinitely unless the agent restarts or receives SIGTERM. This can throttle unrelated later workloads on the same GPU.

Cannot judge criticality of this.

  - [P2] Mixed annotated and unannotated pods on the same GPU do not fail closed. _resolve_cap_for_gpu() drops None annotations before conflict resolution (components/power_agent/power_agent.py:383), so [None, "480"] applies 480 instead of safe_default_watts; the test locks this in at components/power_agent/tests/test_multi_pod_policy.py:125. For unsupported multi-pod sharing, any missing/
    unparseable pod annotation should likely force the safe default.
  - [P2] The new unit tests and chart templates are not wired into CI. The power-agent PR job only builds/pushes the image (.github/workflows/pr.yaml:185), while the Helm lint job only runs for operator changes and only lints deploy/helm/charts/platform (.github/workflows/pr.yaml:144). The added tests are unmarked unittest tests under components/power_agent/tests, so the marker-based Dynamo pytest jobs will not select them.

This is https://github.com/ai-dynamo/dynamo/pull/9683/changes#diff-50c86b7ed8ac2cf95bd48334961bf0530cdc77b5a56f852c5c61b89d735fd711R291 in the follow-up PR I guess.

`_list_pods_on_node` now returns `None` on an API error (vs `[]` for a
genuinely empty node), and `reconcile_once` keys its fail-safe off that
sentinel: on failure it skips the cycle, freezing each GPU at its
last-known-good cap instead of re-deriving caps from a zero-pod view.

This makes the fail-safe contract explicit in the signature and in code
comments so it can't silently drift, per @sttts review on #9682. Adds
test_reconcile_failsafe.py covering the None-on-error,
[]-on-empty-node, skip-on-failure, and reconcile-on-empty-node cases.

Signed-off-by: Kai Ma <kaim@nvidia.com>
@kaim-eng kaim-eng requested a review from a team as a code owner June 9, 2026 22:38
kaim-eng added 4 commits June 9, 2026 22:18
Signed-off-by: Kai Ma <kaim@nvidia.com>

# Conflicts:
#	.github/workflows/pr.yaml
Signed-off-by: Kai Ma <kaim@nvidia.com>
Signed-off-by: Kai Ma <kaim@nvidia.com>
@kaim-eng

Copy link
Copy Markdown
Author

All four review threads on this PR are addressed and now resolved:

  • deploy/power_agent/rbac.yaml ${POWER_AGENT_NAMESPACE} placeholder -> repackaged as a Helm chart; every RBAC namespace renders via {{ .Release.Namespace }} (@dynamo-ops)
  • SPDX headers trimmed to the two-line form across all files (@grahamking)
  • Image build added: components/power_agent/Dockerfile + a dedicated power-agent build-and-push CI job (@sttts)
  • Silent pod-list failure made explicit: _list_pods_on_node -> Optional[list] with a distinct skip log and a fail-safe regression test (@sttts)

CI is green (required checks pass; rust-* etc. skip via path filters). The PR description has been refreshed to match the current Helm-chart packaging (26 files; the only edits to pre-existing files are 5 .github/ CI files).

This is the foundation of the Phase-1 power-planner stack (#9683 and #9790 are stacked on it). It is MERGEABLE but blocked on REVIEW_REQUIRED -- @sttts @grahamking @dynamo-ops, could one of you take a look / approve when you have a moment? Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

actions deployment::k8s Relates to dynamo deployment in kubernetes documentation Improvements or additions to documentation feat size/XXL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants