docs(planner): power planner design docs + dev environment#9687
Draft
kaim-eng wants to merge 2 commits into
dc9e109 to ca97b7a
kaim-eng added a commit that referenced this pull request on May 19, 2026
Folds deploy/power_agent/{daemonset,rbac,dev-pod}.yaml into a single
Helm chart at deploy/helm/charts/power-agent/, resolving the three
CodeRabbit findings on PR #9682:

  * hardcoded metadata.namespace=default -> {{ .Release.Namespace }}
  * mutable image :latest -> required image.tag with fail-fast validator
  * `${POWER_AGENT_NAMESPACE}` envsubst placeholder -> native Helm templating

The chart supports three deployment shapes selectable via values:
production DaemonSet (default, cluster-wide RBAC), namespace-restricted
production (Role+RoleBinding), and an in-cluster dev-iteration Pod
mounting power_agent.py from a ConfigMap. Three template-time validators
reject foot-guns at install time: empty image.tag, mutex violations
between daemonset.enabled and dev.enabled, and dev mode without a
pinned dev.nodeName. Dev mode also automatically forces namespace-scoped
RBAC (least privilege), leveraging power_agent.py's --namespace flag.

Design rationale, scope decisions, and review-feedback responses are
captured in docs/design-docs/power-agent-helm-chart-plan.md, committed
alongside the chart.

components/power_agent/README.md flips its install recipe to
``helm install``, and the planner CI filter (.github/filters.yaml) is
retargeted from deploy/power_agent/** to deploy/helm/charts/power-agent/**.
Two examples/deployments/powerplanner/*.yaml header references live on
PR #9687 and will be updated during that PR's cascade rebase per plan
section 5.3.

Validated locally:

  helm lint                               -> 0 errors
  helm template (3 positive exercises)    -> expected resources
  helm template (3 negative exercises)    -> expected fail-fast errors
  components/power_agent/tests/           -> 43/43 passed
  .github/scripts/test-filters.js         -> 20/20 passed
  pre-commit (cross-cutting hooks)        -> all applicable passed

Signed-off-by: Kai Ma <kaim@nvidia.com>
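
The three install-time checks described in this commit message can be pictured as a small decision function. The sketch below is Python, not the chart's actual Helm template code, and is only an illustration under stated assumptions: the value keys image.tag, daemonset.enabled, dev.enabled and dev.nodeName come from the commit message, while the function name, the error strings, and the `rbac.namespaced` key are hypothetical. It also assumes that "mutex violation" means both modes enabled at once.

```python
from typing import Any


def validate_values(values: dict[str, Any]) -> dict[str, Any]:
    """Illustrative mirror of the three fail-fast checks (not the chart's code)."""
    image = values.get("image", {})
    daemonset = values.get("daemonset", {})
    dev = values.get("dev", {})

    # Check 1: image.tag must be set explicitly, so no implicit :latest.
    if not image.get("tag"):
        raise ValueError("image.tag is required")

    # Check 2: daemonset.enabled and dev.enabled must not both be true
    # (assumed reading of "mutex violation").
    if daemonset.get("enabled") and dev.get("enabled"):
        raise ValueError("daemonset.enabled and dev.enabled are mutually exclusive")

    # Check 3: dev mode needs a pinned node.
    if dev.get("enabled") and not dev.get("nodeName"):
        raise ValueError("dev.enabled requires dev.nodeName")

    # Dev mode forces namespace-scoped RBAC. "rbac.namespaced" is a
    # hypothetical key standing in for however the chart selects the
    # namespace-restricted production shape.
    namespaced = bool(dev.get("enabled")) or bool(
        values.get("rbac", {}).get("namespaced")
    )
    return {"namespaced_rbac": namespaced}
```

A dev-mode values set with a tag and a node name passes and reports namespace-scoped RBAC; dropping dev.nodeName or leaving image.tag empty raises before anything would be rendered.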
kaim-eng added a commit that referenced this pull request on May 19, 2026
The two ``[Appendix C.10]``/``[Appendix D.7]`` links in the testbed
README pointed at ``docs/design-docs/powerplanner-testbed-design.md``,
which is introduced by PR #9687 and does not yet exist on this branch.
The Docs link check (lychee) has therefore failed on PR #9686 since the
PR was opened on 2026-05-18 -- a cross-PR forward reference baked into
the original PR #9369 split.

Convert both occurrences from ``[text](relative-path)`` syntax to plain
backticked text references. The information value (appendix numbers +
target file) is preserved; lychee no longer treats them as candidate
links to resolve. Once both PRs land on ``main`` the file resolves
naturally and reviewers can grep for the path.

No code or test change.

Cascade-affected branches: this commit lives on pr4/testbed;
pr5/docs-devenv will rebase onto the new tip.

Signed-off-by: Kai Ma <kaim@nvidia.com>
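
The link-to-text conversion described in this commit message can be sketched with a regular expression. This is only an illustration of the idea; the commit does not say whether the change was made by hand or by script. The function name `delink`, the exact output wording, and the assumption that the link target is the bare file path (no leading `../` and no `#anchor`) are all hypothetical.

```python
import re

# Inline Markdown links of the form [text](target). The lookbehind skips
# image links such as ![alt](src).
LINK_RE = re.compile(r"(?<!!)\[([^\]]+)\]\(([^)\s]+)\)")


def delink(markdown: str, target_suffix: str) -> str:
    """Turn links whose target ends with target_suffix into backticked text."""

    def repl(match: re.Match) -> str:
        text, target = match.group(1), match.group(2)
        if not target.endswith(target_suffix):
            return match.group(0)  # leave unrelated links untouched
        # Keep the information value: link text plus target path, both as
        # plain backticked text that a link checker will not try to resolve.
        return f"`{text}` in `{target}`"

    return LINK_RE.sub(repl, markdown)


line = "See [Appendix C.10](docs/design-docs/powerplanner-testbed-design.md)."
print(delink(line, "powerplanner-testbed-design.md"))
```

Only links pointing at the not-yet-existing file are rewritten, so other links in the README stay checkable.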
Adds documentation, dev environment, example DGD configs, operator
scripts, and standalone analysis tools.

Design documentation (~5,000 lines):
- docs/design-docs/powerplanner-design.md (2,111 lines): full design
  with Phase 1-5 layered roadmap, 5.3 correction-coefficient mechanics,
  5.6 drift detection + hysteresis, 5.7 admission control, 6.5 Power
  Agent fail-closed cold-start cap, section 8 failure-mode catalog
  (14 modes).
- docs/design-docs/powerplanner-testbed-design.md (2,187 lines):
  testbed design with alpha and gamma scenario classes, fake-driver
  contracts, replay-overlay timing semantics.
- docs/components/planner/dpp-dev-env.md (675 lines): dev-pod bring-up
  + smoke-test recipes against viking-prod and umb-b200 cluster
  hardware.

Dev environment + example DGDs:
- deploy/planner/dev/Dockerfile.planner-dev + planner-dev-pod.yaml +
  qwen3-quickstart-dgd.yaml: bring-your-own-source planner dev pod for
  live-cluster iteration without rebuilds.
- examples/deployments/powerplanner/: README + PIPECLEAN + MULTI_DGD
  guides + disagg-power-aware.yaml +
  disagg-conservative-cold-start.yaml + verify_poweraware.bash
  (operator smoke-test).

Inspection + analysis tooling:
- scripts/dev/test_k8s_access.py and scripts/inspect_*.py: dev-time
  helpers for DCGM attribution, frontend port resolution, planner DCGM
  query methods, worker /metrics endpoints, live planner metrics.
- tools/compute_power_budget.py + integrate_aic_power_data.py +
  validate_aic_power_integration.py: standalone analysis utilities for
  sizing the budget, ingesting AIC offline data, and validating the
  integration end-to-end.
- test_planner_power_launch.py: root-level pipeclean launcher.

CI filter (.github/filters.yaml) accumulates the remaining planner-side
paths so changes here trigger the planner job set (cumulative with PR 1a
and PR 1b additions; the final file matches kaim/power-planner
byte-for-byte).

No production code changes; this PR is purely additive documentation,
deployment manifests, and operator tooling.

All planner/power_agent tests at PR 4 tip remain green (no test
additions in this PR).

Part of the PR #9369 split (PR 5 of 6, final). See
docs/design-docs/pr9369-split-plan.md.

Signed-off-by: Kai Ma <kaim@nvidia.com>
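
How a path filter decides whether a changed file triggers the planner job set can be approximated with glob matching. The sketch below uses Python's `fnmatch`, which is an assumption: the real filter engine behind .github/filters.yaml is not named in this PR and may treat `**` and `/` differently. The two patterns are the ones quoted in the Helm chart commit message; the changed-file paths and the function name `triggers` are hypothetical.

```python
from fnmatch import fnmatchcase

# Pattern the planner filter was retargeted to, and the legacy pattern it
# replaced (both quoted in the Helm chart commit message).
CHART_PATTERNS = ["deploy/helm/charts/power-agent/**"]
LEGACY_PATTERNS = ["deploy/power_agent/**"]


def triggers(changed_path: str, patterns: list[str]) -> bool:
    """Return True if the changed path matches any filter pattern."""
    # fnmatch's "*" also matches "/", so "**" here behaves like
    # "anything below this directory". Real CI glob engines may differ.
    return any(fnmatchcase(changed_path, pattern) for pattern in patterns)


# Hypothetical changed files, for illustration only.
new_file = "deploy/helm/charts/power-agent/templates/daemonset.yaml"
old_file = "deploy/power_agent/daemonset.yaml"

print(triggers(new_file, CHART_PATTERNS))  # chart file matches the new pattern
print(triggers(old_file, CHART_PATTERNS))  # legacy path no longer matches
```

A check of this kind is presumably what .github/scripts/test-filters.js exercises in its 20 cases, although that script's contents are not shown here.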
Updates the three reference sites in examples/deployments/powerplanner/
that previously instructed users to
``kubectl apply -f deploy/power_agent/...``, flipping them to the new
Helm chart at deploy/helm/charts/power-agent/ that landed in PR #9682:

  * disagg-power-aware.yaml header recipe
  * README.md Prerequisites + verify section
  * MULTI_DGD.md file index

Also updates the verify-pods label selector from the legacy
``app=dynamo-power-agent`` to the chart-emitted
``app.kubernetes.io/name=power-agent``.

Design docs (powerplanner-design.md, power-agent-dcgm-actuator.md,
pr9369-split-plan.md, power-agent-helm-chart-plan.md) still reference
the old paths in their historical / architectural narrative sections,
which is intentional -- those describe the pre-chart state and the
transition rationale, not current deployment instructions.

Part of the PR #9369 cascade following PR #9682's Helm chart landing.

Signed-off-by: Kai Ma <kaim@nvidia.com>
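
Why the verify step needed the new selector can be shown with a toy equality-based label matcher. This is a simplified sketch, not Kubernetes' selector implementation: it handles only comma-separated `key=value` terms, and it assumes the chart-emitted pods carry only the `app.kubernetes.io/name=power-agent` label and not the legacy `app` label.

```python
def matches(labels: dict[str, str], selector: str) -> bool:
    """Equality-based selector: every comma-separated key=value must match."""
    for term in selector.split(","):
        key, _, value = term.strip().partition("=")
        if labels.get(key) != value:
            return False
    return True


# Labels assumed for a pod rendered by the new chart (illustrative).
chart_pod_labels = {"app.kubernetes.io/name": "power-agent"}

print(matches(chart_pod_labels, "app=dynamo-power-agent"))
print(matches(chart_pod_labels, "app.kubernetes.io/name=power-agent"))
```

Under that assumption the legacy selector finds no pods, so a verify script still using it would report the agent as missing even when it is running.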
kaim-eng added a commit that referenced this pull request on May 25, 2026
kaim-eng added a commit that referenced this pull request on Jun 3, 2026
Part of the PR #9369 split plan.
This is PR 6 of 6 (PR 5 — Documentation + Dev Environment + Examples + Tools). Held in Draft per plan §4.5.
Predecessor: #9686 — Stress testbed
Successor: none (final PR)
Scope
No production code changes. Purely additive: design docs, testbed design doc, dev-environment guide, dev-pod manifests, example DGD configs, operator scripts, and standalone analysis tools.
~7,000 lines · 25 files.
- docs/design-docs/powerplanner-design.md (2,111 LOC): full power-planner design
- docs/design-docs/powerplanner-testbed-design.md (2,187 LOC): α + γ testbed design
- docs/components/planner/dpp-dev-env.md (675 LOC): dev-environment setup + one-shot sweep
- deploy/planner/dev/: Dockerfile, dev-pod manifest, qwen3 quickstart DGD
- examples/deployments/powerplanner/: README, PIPECLEAN, MULTI_DGD, two DGD YAMLs, verify_poweraware.bash
- scripts/: 7 inspection scripts for DCGM, frontend, live metrics, k8s access
- tools/: compute_power_budget.py, integrate_aic_power_data.py, validate_aic_power_integration.py
- test_planner_power_launch.py (58 LOC): repo-root smoke launcher
- .github/filters.yaml: planner CI filter cumulative final state
Reviewer onboarding
Tests at this tip
cec5681f15
Merge strategy
Rebase-and-merge (no squash). After this merges, the cumulative diff matches kaim/power-planner byte-for-byte.
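
One way to check a byte-for-byte claim like this is to hash every file in two exported trees and compare the results. The sketch below is an assumption about how such a check could be done, not the procedure used for this PR. It expects two plain directory exports without `.git` metadata, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def tree_digest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its bytes."""
    digests: dict[str, str] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def trees_match(a: Path, b: Path) -> bool:
    """True only if both trees hold the same files with identical contents."""
    return tree_digest(a) == tree_digest(b)
```

Inside a real checkout, `git diff --quiet <merged-tip> kaim/power-planner` answers the same question more directly: it exits with status 0 when the two trees are identical.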