docs(planner): power planner design docs + dev environment by kaim-eng · Pull Request #9687 · ai-dynamo/dynamo

kaim-eng · 2026-05-18T15:55:09Z

Part of the PR #9369 split plan.
This is PR 6 of 6 (PR 5 — Documentation + Dev Environment + Examples + Tools). Held in Draft per plan §4.5.

Predecessor: #9686 — Stress testbed
Successor: none (final PR)

Scope

No production code changes. Purely additive: design docs, testbed design doc, dev-environment guide, dev-pod manifests, example DGD configs, operator scripts, and standalone analysis tools.

~7,000 lines · 25 files.

- docs/design-docs/powerplanner-design.md (2,111 LOC): full power-planner design
- docs/design-docs/powerplanner-testbed-design.md (2,187 LOC): α + γ testbed design
- docs/components/planner/dpp-dev-env.md (675 LOC): dev-environment setup + one-shot sweep
- deploy/planner/dev/: Dockerfile, dev-pod manifest, qwen3 quickstart DGD
- examples/deployments/powerplanner/: README, PIPECLEAN, MULTI_DGD, two DGD YAMLs, verify_poweraware.bash
- scripts/: 7 inspection scripts for DCGM, frontend, live metrics, k8s access
- tools/: compute_power_budget.py, integrate_aic_power_data.py, validate_aic_power_integration.py
- test_planner_power_launch.py (58 LOC): repo-root smoke launcher
- .github/filters.yaml: planner CI filter, cumulative final state

Plan §2.6 explains why all three design docs ship together in PR 5. This is a deliberate choice: the docs are internally consistent only at PR 5's tip, because they cross-reference Phase-3 algorithms that exist only at this tip.

Reviewer onboarding

Docs-focused review; no algorithm or code-review context-switching expected
Plan section: §2.6 (this PR)

Tests at this tip

No code changes → same as PR 4 tip: 739 passed, 6 skipped, 0 failures (planner 696 / power_agent 43)
Measured 2026-05-18 on dev pod against rebased tip cec5681f15

Merge strategy

Rebase-and-merge (no squash). After this merges, the cumulative diff matches kaim/power-planner byte-for-byte.

copy-pr-bot · 2026-05-18T15:55:13Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Folds deploy/power_agent/{daemonset,rbac,dev-pod}.yaml into a single Helm chart at deploy/helm/charts/power-agent/, resolving the three CodeRabbit findings on PR #9682:

* hardcoded metadata.namespace=default -> {{ .Release.Namespace }}
* mutable image :latest -> required image.tag with fail-fast validator
* `${POWER_AGENT_NAMESPACE}` envsubst placeholder -> native Helm templating

The chart supports three deployment shapes selectable via values: production DaemonSet (default, cluster-wide RBAC), namespace-restricted production (Role+RoleBinding), and an in-cluster dev-iteration Pod mounting power_agent.py from a ConfigMap.

Three template-time validators reject foot-guns at install time: empty image.tag, mutex violations between daemonset.enabled and dev.enabled, and dev mode without a pinned dev.nodeName. Dev mode also automatically forces namespace-scoped RBAC (least privilege), leveraging power_agent.py's --namespace flag.

Design rationale, scope decisions, and review-feedback responses are captured in docs/design-docs/power-agent-helm-chart-plan.md, committed alongside the chart. components/power_agent/README.md flips its install recipe to ``helm install``, and the planner CI filter (.github/filters.yaml) is retargeted from deploy/power_agent/** to deploy/helm/charts/power-agent/**.

Two examples/deployments/powerplanner/*.yaml header references live on PR #9687 and will be updated during that PR's cascade rebase per plan section 5.3.

Validated locally:

* helm lint -> 0 errors
* helm template (3 positive exercises) -> expected resources
* helm template (3 negative exercises) -> expected fail-fast errors
* components/power_agent/tests/ -> 43/43 passed
* .github/scripts/test-filters.js -> 20/20 passed
* pre-commit (cross-cutting hooks) -> all applicable passed

Signed-off-by: Kai Ma <kaim@nvidia.com>
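The three fail-fast checks described in this commit can be illustrated outside Helm. The sketch below is not the chart's template code; it is a minimal Python model of the same rules, assuming a values layout with `image.tag`, `daemonset.enabled` (defaulting to true, since the DaemonSet is the default shape), `dev.enabled`, and `dev.nodeName`. The function name `validate_values` and the error strings are hypothetical.

```python
from typing import Any


def validate_values(values: dict[str, Any]) -> list[str]:
    """Return fail-fast errors for a values mapping; an empty list means valid.

    Hypothetical model of the three validators named in the commit message.
    The real chart enforces these at template time, not in Python.
    """
    errors: list[str] = []
    image = values.get("image", {})
    # Assumption: the production DaemonSet is enabled unless switched off.
    daemonset_enabled = values.get("daemonset", {}).get("enabled", True)
    dev = values.get("dev", {})
    dev_enabled = dev.get("enabled", False)

    # 1. An empty image tag is rejected, so no mutable default is used.
    if not image.get("tag"):
        errors.append("image.tag is required")
    # 2. The DaemonSet shape and the dev Pod shape are mutually exclusive.
    if daemonset_enabled and dev_enabled:
        errors.append("daemonset.enabled and dev.enabled are mutually exclusive")
    # 3. Dev mode must be pinned to a specific node.
    if dev_enabled and not dev.get("nodeName"):
        errors.append("dev.enabled requires dev.nodeName")
    return errors
```

Whether the chart also rejects the case where both `daemonset.enabled` and `dev.enabled` are false is not stated in the commit message, so the sketch leaves that case valid.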

The two ``[Appendix C.10]``/``[Appendix D.7]`` links in the testbed README pointed at ``docs/design-docs/powerplanner-testbed-design.md``, which is introduced by PR #9687 and does not yet exist on this branch. The Docs link check (lychee) has therefore failed on PR #9686 since the PR was opened on 2026-05-18 -- a cross-PR forward reference baked into the original PR #9369 split.

Convert both occurrences from ``[text](relative-path)`` syntax to plain backticked text references. The information value (appendix numbers + target file) is preserved; lychee no longer treats them as candidate links to resolve. Once both PRs land on ``main`` the file resolves naturally and reviewers can grep for the path.

No code or test change.

Cascade-affected branches: this commit lives on pr4/testbed; pr5/docs-devenv will rebase onto the new tip.

Signed-off-by: Kai Ma <kaim@nvidia.com>
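The link-to-text conversion described in this commit was presumably done by hand for two occurrences. As an illustration only, the same rewrite could be sketched in Python as follows. The helper name `delink` and the exact output format (link text and target each wrapped in backticks) are assumptions, not what the commit necessarily produced.

```python
import re

# Matches [text](target) where target is not an absolute URL, a pure anchor,
# or a mailto link, i.e. a relative path that a link checker would resolve.
RELATIVE_LINK = re.compile(r"\[([^\]]+)\]\((?!https?://|#|mailto:)([^)\s]+)\)")


def delink(markdown: str, target_suffix: str) -> str:
    """Replace relative links whose file path ends with target_suffix by
    plain backticked text, keeping both the link text and the target."""

    def replace(match: re.Match) -> str:
        text, target = match.group(1), match.group(2)
        path = target.split("#")[0]  # ignore any fragment when comparing
        if not path.endswith(target_suffix):
            return match.group(0)  # leave unrelated links untouched
        return f"`{text}` (`{target}`)"

    return RELATIVE_LINK.sub(replace, markdown)
```

Because the result no longer uses `[text](path)` syntax, a Markdown link checker has nothing to resolve, while the appendix number and the target file remain visible and greppable.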

Adds documentation, dev environment, example DGD configs, operator scripts, and standalone analysis tools.

Design documentation (~5,000 lines):

- docs/design-docs/powerplanner-design.md (2,111 lines): full design with Phase 1-5 layered roadmap, 5.3 correction-coefficient mechanics, 5.6 drift detection + hysteresis, 5.7 admission control, 6.5 Power Agent fail-closed cold-start cap, section 8 failure-mode catalog (14 modes).
- docs/design-docs/powerplanner-testbed-design.md (2,187 lines): testbed design with alpha and gamma scenario classes, fake-driver contracts, replay-overlay timing semantics.
- docs/components/planner/dpp-dev-env.md (675 lines): dev-pod bring-up + smoke-test recipes against viking-prod and umb-b200 cluster hardware.

Dev environment + example DGDs:

- deploy/planner/dev/Dockerfile.planner-dev + planner-dev-pod.yaml + qwen3-quickstart-dgd.yaml: bring-your-own-source planner dev pod for live-cluster iteration without rebuilds.
- examples/deployments/powerplanner/: README + PIPECLEAN + MULTI_DGD guides + disagg-power-aware.yaml + disagg-conservative-cold-start.yaml + verify_poweraware.bash (operator smoke-test).

Inspection + analysis tooling:

- scripts/dev/test_k8s_access.py and scripts/inspect_*.py: dev-time helpers for DCGM attribution, frontend port resolution, planner DCGM query methods, worker /metrics endpoints, live planner metrics.
- tools/compute_power_budget.py + integrate_aic_power_data.py + validate_aic_power_integration.py: standalone analysis utilities for sizing the budget, ingesting AIC offline data, and validating the integration end-to-end.
- test_planner_power_launch.py: root-level pipeclean launcher.

CI filter (.github/filters.yaml) accumulates the remaining planner-side paths so changes here trigger the planner job set (cumulative with PR 1a and PR 1b additions; the final file matches kaim/power-planner byte-for-byte).

No production code changes; this PR is purely additive documentation, deployment manifests, and operator tooling. All planner/power_agent tests at PR 4 tip remain green (no test additions in this PR).

Part of the PR #9369 split (PR 5 of 6, final). See docs/design-docs/pr9369-split-plan.md.

Signed-off-by: Kai Ma <kaim@nvidia.com>

Updates the three reference sites in examples/deployments/powerplanner/ that previously instructed users to ``kubectl apply -f deploy/power_agent/...``, flipping them to the new Helm chart at deploy/helm/charts/power-agent/ that landed in PR #9682:

* disagg-power-aware.yaml header recipe
* README.md Prerequisites + verify section
* MULTI_DGD.md file index

Also updates the verify-pods label selector from the legacy ``app=dynamo-power-agent`` to the chart-emitted ``app.kubernetes.io/name=power-agent``.

Design docs (powerplanner-design.md, power-agent-dcgm-actuator.md, pr9369-split-plan.md, power-agent-helm-chart-plan.md) still reference the old paths in their historical / architectural narrative sections, which is intentional -- those describe the pre-chart state and the transition rationale, not current deployment instructions.

Part of the PR #9369 cascade following PR #9682's Helm chart landing.

Signed-off-by: Kai Ma <kaim@nvidia.com>

pull-request-size Bot added the size/XXL label May 18, 2026

github-actions Bot added the docs, documentation (Improvements or additions to documentation), and actions labels May 18, 2026

kaim-eng mentioned this pull request May 18, 2026

feat(planner): power planner stress testbed (α + γ) #9686

Draft

kaim-eng force-pushed the pr4/testbed branch from 6995868 to 3fa0915 Compare May 18, 2026 16:05

kaim-eng force-pushed the pr5/docs-devenv branch 2 times, most recently from dc9e109 to ca97b7a Compare May 19, 2026 12:58

kaim-eng force-pushed the pr4/testbed branch from 3fa0915 to 1d1b5b9 Compare May 19, 2026 12:58

kaim-eng force-pushed the pr4/testbed branch from 1d1b5b9 to da36145 Compare May 19, 2026 15:56

kaim-eng force-pushed the pr5/docs-devenv branch from ca97b7a to c63a96b Compare May 19, 2026 17:53

kaim-eng added 2 commits May 19, 2026 14:17

kaim-eng force-pushed the pr5/docs-devenv branch from c63a96b to 5426c75 Compare May 19, 2026 18:18

kaim-eng mentioned this pull request May 20, 2026

feat(power-agent): add DCGM dual actuator (opt-in; NVML remains default) #9790

Open

kaim-eng wants to merge 2 commits into pr4/testbed from pr5/docs-devenv
