docs(planner): power planner design docs + dev environment#9687

Draft
kaim-eng wants to merge 2 commits into pr4/testbed from pr5/docs-devenv

kaim-eng commented May 18, 2026

Part of the PR #9369 split plan.
This is the sixth and final PR of six, numbered PR 5 in the plan (Documentation + Dev Environment + Examples + Tools). Held in Draft per plan §4.5.

Predecessor: #9686 — Stress testbed
Successor: none (final PR)

Scope

No production code changes. Purely additive: design docs, testbed design doc, dev-environment guide, dev-pod manifests, example DGD configs, operator scripts, and standalone analysis tools.

~7,000 lines · 25 files.

  • docs/design-docs/powerplanner-design.md (2,111 LOC) — full power-planner design
  • docs/design-docs/powerplanner-testbed-design.md (2,187 LOC) — α + γ testbed design
  • docs/components/planner/dpp-dev-env.md (675 LOC) — dev-environment setup + one-shot sweep
  • deploy/planner/dev/ — Dockerfile, dev-pod manifest, qwen3 quickstart DGD
  • examples/deployments/powerplanner/ — README, PIPECLEAN, MULTI_DGD, two DGD YAMLs, verify_poweraware.bash
  • scripts/ — 7 inspection scripts for DCGM, frontend, live metrics, k8s access
  • tools/compute_power_budget.py, integrate_aic_power_data.py, validate_aic_power_integration.py
  • test_planner_power_launch.py (58 LOC) — repo-root smoke launcher
  • .github/filters.yaml — planner CI filter cumulative final state

Plan §2.6 explains why all three design docs ship together in PR 5. The choice is deliberate: the docs are internally consistent only at PR 5's tip, because they cross-reference Phase-3 algorithms that exist only at this tip.

Reviewer onboarding

  • Docs-focused review; no algorithm or code-review context-switching expected
  • Plan section: §2.6 (this PR)

Tests at this tip

  • No code changes → same as PR 4 tip: 739 passed, 6 skipped, 0 failures (planner 696 / power_agent 43)
  • Measured 2026-05-18 on dev pod against rebased tip cec5681f15

Merge strategy

Rebase-and-merge (no squash). After this merges, the cumulative diff matches kaim/power-planner byte-for-byte.

copy-pr-bot Bot commented May 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions Bot added the docs, documentation, and actions labels May 18, 2026
kaim-eng force-pushed the pr5/docs-devenv branch 2 times, most recently from dc9e109 to ca97b7a May 19, 2026 12:58
kaim-eng added a commit that referenced this pull request May 19, 2026
Folds deploy/power_agent/{daemonset,rbac,dev-pod}.yaml into a single
Helm chart at deploy/helm/charts/power-agent/, resolving the three
CodeRabbit findings on PR #9682:

  * hardcoded metadata.namespace=default -> {{ .Release.Namespace }}
  * mutable image :latest -> required image.tag with fail-fast validator
  * `${POWER_AGENT_NAMESPACE}` envsubst placeholder -> native Helm templating

The chart supports three deployment shapes selectable via values:
production DaemonSet (default, cluster-wide RBAC), namespace-restricted
production (Role+RoleBinding), and an in-cluster dev-iteration Pod
mounting power_agent.py from a ConfigMap. Three template-time validators
reject foot-guns at install time: empty image.tag, mutex violations
between daemonset.enabled and dev.enabled, and dev mode without a
pinned dev.nodeName. Dev mode also automatically forces namespace-scoped
RBAC (least privilege), leveraging power_agent.py's --namespace flag.
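
The three install-time checks described above can be mirrored in a few lines of Python, which may help when reasoning about which value combinations the chart rejects. This is only a sketch: the real checks live in Helm templates, and the function name `validate_values`, the dict layout, and the error strings are assumptions made for illustration. "Mutex violation" is read here as "both enabled at once"; whether the chart also rejects the case where neither is enabled is not stated in the commit message.

```python
def validate_values(values: dict) -> list[str]:
    """Return a list of problems; an empty list means the values pass.

    Hypothetical mirror of the chart's three template-time validators.
    """
    errors = []
    image = values.get("image", {})
    daemonset = values.get("daemonset", {})
    dev = values.get("dev", {})

    # 1. image.tag must be set explicitly (no implicit :latest).
    if not image.get("tag"):
        errors.append("image.tag is required")

    # 2. daemonset.enabled and dev.enabled must not both be true.
    if daemonset.get("enabled") and dev.get("enabled"):
        errors.append("daemonset.enabled and dev.enabled are mutually exclusive")

    # 3. Dev mode must be pinned to a specific node.
    if dev.get("enabled") and not dev.get("nodeName"):
        errors.append("dev.enabled requires dev.nodeName")

    return errors


# Illustrative values only; the tag "1.2.3" is a placeholder.
good = {"image": {"tag": "1.2.3"}, "daemonset": {"enabled": True}, "dev": {"enabled": False}}
bad = {"image": {"tag": ""}, "daemonset": {"enabled": True}, "dev": {"enabled": True}}
```

With these inputs, `validate_values(good)` returns an empty list, while `validate_values(bad)` reports all three problems at once rather than stopping at the first.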

Design rationale, scope decisions, and review-feedback responses are
captured in docs/design-docs/power-agent-helm-chart-plan.md, committed
alongside the chart.

components/power_agent/README.md flips its install recipe to
``helm install``, and the planner CI filter (.github/filters.yaml) is
retargeted from deploy/power_agent/** to deploy/helm/charts/power-agent/**.
Two examples/deployments/powerplanner/*.yaml header references live on
PR #9687 and will be updated during that PR's cascade rebase per plan
section 5.3.
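
The effect of retargeting the CI filter can be illustrated with a small glob check. This sketch uses Python's `fnmatch`, whose `*` also matches `/`, as a stand-in; the matcher actually used by the CI filter may follow different glob rules, and `values.yaml` is an assumed file name inside the chart.

```python
from fnmatch import fnmatchcase

OLD_PATTERN = "deploy/power_agent/**"
NEW_PATTERN = "deploy/helm/charts/power-agent/**"


def triggers(pattern: str, changed_path: str) -> bool:
    # fnmatch's "*" crosses "/" boundaries, so "**" here simply means
    # "anything below this directory".
    return fnmatchcase(changed_path, pattern)


chart_file = "deploy/helm/charts/power-agent/values.yaml"  # assumed chart file
legacy_file = "deploy/power_agent/daemonset.yaml"  # pre-chart manifest
```

Under these assumptions, a change to `chart_file` matches only the new pattern, which is why leaving the filter on the old path would stop chart edits from triggering the planner jobs.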

Validated locally:
  helm lint                                -> 0 errors
  helm template (3 positive exercises)     -> expected resources
  helm template (3 negative exercises)     -> expected fail-fast errors
  components/power_agent/tests/            -> 43/43 passed
  .github/scripts/test-filters.js          -> 20/20 passed
  pre-commit (cross-cutting hooks)         -> all applicable passed

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 19, 2026
The two ``[Appendix C.10]``/``[Appendix D.7]`` links in the testbed
README pointed at ``docs/design-docs/powerplanner-testbed-design.md``,
which is introduced by PR #9687 and does not yet exist on this branch.
The Docs link check (lychee) has therefore failed on PR #9686 since the
PR was opened on 2026-05-18 -- a cross-PR forward reference baked into
the original PR #9369 split.

Convert both occurrences from ``[text](relative-path)`` syntax to plain
backticked text references. The information value (appendix numbers +
target file) is preserved; lychee no longer treats them as candidate
links to resolve. Once both PRs land on ``main`` the file resolves
naturally and reviewers can grep for the path.
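
The conversion described above could be scripted roughly as follows. This is a sketch, not the change as committed: the exact output format the commit chose is not shown here, so the `text` (`path`) layout, the `delink` helper name, and the sample sentence are assumptions.

```python
import re

# Matches [text](target) where the target is not an absolute http(s) URL.
RELATIVE_LINK = re.compile(r"\[([^\]]+)\]\((?!https?://)([^)\s]+)\)")


def delink(markdown: str) -> str:
    """Rewrite relative Markdown links as backticked text so a link checker skips them."""
    return RELATIVE_LINK.sub(r"`\1` (`\2`)", markdown)


before = "See [Appendix C.10](docs/design-docs/powerplanner-testbed-design.md)."
after = delink(before)
```

Absolute URLs are deliberately left alone, since only forward references to files that do not yet exist on the branch need to be hidden from the checker.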

No code or test change. Cascade-affected branches: this commit lives on
pr4/testbed; pr5/docs-devenv will rebase onto the new tip.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added 2 commits May 19, 2026 14:17
Adds documentation, dev environment, example DGD configs, operator
scripts, and standalone analysis tools.

Design documentation (~5,000 lines):
- docs/design-docs/powerplanner-design.md (2,111 lines): full design
  with Phase 1-5 layered roadmap, section 5.3 correction-coefficient
  mechanics, section 5.6 drift detection + hysteresis, section 5.7
  admission control, section 6.5 Power Agent fail-closed cold-start
  cap, and the section 8 failure-mode catalog (14 modes).
- docs/design-docs/powerplanner-testbed-design.md (2,187 lines):
  testbed design with alpha and gamma scenario classes, fake-driver
  contracts, replay-overlay timing semantics.
- docs/components/planner/dpp-dev-env.md (675 lines): dev-pod
  bring-up + smoke-test recipes against viking-prod and umb-b200
  cluster hardware.

Dev environment + example DGDs:
- deploy/planner/dev/Dockerfile.planner-dev + planner-dev-pod.yaml +
  qwen3-quickstart-dgd.yaml: bring-your-own-source planner dev pod for
  live-cluster iteration without rebuilds.
- examples/deployments/powerplanner/: README + PIPECLEAN + MULTI_DGD
  guides + disagg-power-aware.yaml + disagg-conservative-cold-start.yaml
  + verify_poweraware.bash (operator smoke-test).

Inspection + analysis tooling:
- scripts/dev/test_k8s_access.py and scripts/inspect_*.py: dev-time
  helpers for DCGM attribution, frontend port resolution, planner DCGM
  query methods, worker /metrics endpoints, live planner metrics.
- tools/compute_power_budget.py + integrate_aic_power_data.py +
  validate_aic_power_integration.py: standalone analysis utilities
  for sizing the budget, ingesting AIC offline data, and validating
  the integration end-to-end.
- test_planner_power_launch.py: root-level pipeclean launcher.

CI filter (.github/filters.yaml) accumulates the remaining planner-side
paths so changes here trigger the planner job set (cumulative with PR 1a
and PR 1b additions; the final file matches kaim/power-planner
byte-for-byte).

No production code changes; this PR is purely additive documentation,
deployment manifests, and operator tooling. All planner/power_agent
tests at PR 4 tip remain green (no test additions in this PR).

Part of the PR #9369 split (PR 5 of 6, final). See
docs/design-docs/pr9369-split-plan.md.

Signed-off-by: Kai Ma <kaim@nvidia.com>

Updates the three reference sites in examples/deployments/powerplanner/
that previously instructed users to ``kubectl apply -f deploy/power_agent/...``,
flipping them to the new Helm chart at deploy/helm/charts/power-agent/
that landed in PR #9682:

  * disagg-power-aware.yaml header recipe
  * README.md Prerequisites + verify section
  * MULTI_DGD.md file index

Also updates the verify-pods label selector from the legacy
``app=dynamo-power-agent`` to the chart-emitted
``app.kubernetes.io/name=power-agent``.
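
Why the selector had to change can be shown with a minimal equality-selector matcher. This is an illustrative sketch only: the real verification runs through `kubectl` in a bash script, and the helper names, the namespace `dynamo`, and the sample pod labels are assumptions.

```python
LEGACY_SELECTOR = "app=dynamo-power-agent"
CHART_SELECTOR = "app.kubernetes.io/name=power-agent"


def parse_selector(selector: str) -> dict[str, str]:
    """Parse a comma-separated equality selector such as 'a=b,c=d'."""
    return dict(part.split("=", 1) for part in selector.split(","))


def matches(pod_labels: dict[str, str], selector: str) -> bool:
    """True when every key=value pair in the selector is present on the pod."""
    return all(pod_labels.get(k) == v for k, v in parse_selector(selector).items())


def verify_pods_command(namespace: str, selector: str = CHART_SELECTOR) -> list[str]:
    """Build (but do not run) the kubectl invocation for listing agent pods."""
    return ["kubectl", "get", "pods", "-n", namespace, "-l", selector]


# Labels as a chart-deployed pod might carry them (assumed for illustration).
chart_pod_labels = {"app.kubernetes.io/name": "power-agent"}
command = verify_pods_command("dynamo")  # "dynamo" is a placeholder namespace
```

A pod carrying only the chart-emitted label is not matched by the legacy selector, so a verify step left on `app=dynamo-power-agent` would report no pods even when the agent is healthy.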

Design docs (powerplanner-design.md, power-agent-dcgm-actuator.md,
pr9369-split-plan.md, power-agent-helm-chart-plan.md) still reference
the old paths in their historical / architectural narrative sections,
which is intentional -- those describe the pre-chart state and the
transition rationale, not current deployment instructions.

Part of the PR #9369 cascade following PR #9682's Helm chart landing.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 25, 2026
kaim-eng added a commit that referenced this pull request Jun 3, 2026
Labels

actions docs documentation Improvements or additions to documentation size/XXL
