Skip to content

feat(planner): power-aware planner — Phase 1+2+3 + Phase 4 pipeclean#9369

Open
kaim-eng wants to merge 11 commits into
mainfrom
kaim/power-planner
Open

feat(planner): power-aware planner — Phase 1+2+3 + Phase 4 pipeclean#9369
kaim-eng wants to merge 11 commits into
mainfrom
kaim/power-planner

Conversation

@kaim-eng

@kaim-eng kaim-eng commented May 10, 2026

Copy link
Copy Markdown

Adds power-aware intelligence to the Dynamo planner across three layers, plus tooling that pipecleans the Phase 4 selection path with currently available single-TDP AIC data. Reimplementation of unmerged PR #5280 against post-refactor ToT (planner/utils/ no longer exists; configuration is Pydantic; layout is config/, connectors/, core/, monitoring/).

Full design: docs/design-docs/powerplanner-design.md
Repro / dev environment: docs/components/planner/dpp-dev-env.md

What's in this PR

Phase 1 — infrastructure

  • Power Agent DaemonSet: components/power_agent/{power_agent.py,tests/} Watches pod annotations, maps cgroup→pod_uid→GPU via NVML, calls nvmlDeviceSetPowerManagementLimit on a 15 s reconciliation loop. Multi-pod-per-GPU policy, fail-closed safe-default, SIGTERM restore.
  • Pod annotation actuation: connectors/kubernetes.py writes dynamo.nvidia.com/gpu-power-limit on worker pods.
  • K8s manifests: deploy/power_agent/{daemonset,rbac}.yaml.
  • Operator chart fix: ClusterRole now grants pods permissions (deploy/helm/charts/platform/components/operator/templates/planner.yaml, chart bumped to 1.2.1). Older clusters can apply deploy/planner-pod-rbac-dev.yaml as a temporary patch.

Phase 2 — budget enforcement

  • _apply_power_budget() in core/state_machine.py clamps replica scaling within a watt budget. New PlannerConfig fields: enable_power_awareness, total_gpu_power_limit (required when awareness enabled — validator enforces, no silent default), prefill/decode_engine_gpu_power_limit, power_agent_safe_default_watts.

Phase 3 — AIC optimizer

  • monitoring/aic_power_optimizer.py: AIConfigurator sweep picks (replica, power) configs that maximise throughput within the watt budget. Per-component EMA correction (c_ttft, c_itl, c_power_p, c_power_d for disagg; c_power_agg for agg) closes the loop on silicon-vs-AIC drift. SLA gate, budget gate, capacity-exceeded re-optimize, auto-disable on consecutive failures.
  • core/base.py: NativePlannerBase plumbing (admission threshold push, named-port resolution, AIC integration).

Phase 4 — pipeclean (using existing single-TDP data)

  • tools/integrate_aic_power_data.py: copies measured H200/B200 power data into an AIC checkout and reinstalls.
  • tools/validate_aic_power_integration.py: asserts estimate_perf() returns power_w in [100, 710] W (not 0.0, not 700.0 TDP fallback).
  • tools/compute_power_budget.py: rack-capacity → safe-budget helper.
  • examples/deployments/powerplanner/: PIPECLEAN.md runbook, disagg-{power-aware,conservative-cold-start}.yaml, verify_poweraware.bash (8-section health check inc. Phase 4 preview), MULTI_DGD.md operator playbook, README.md.

Bug fixes (found by live integration tests, fixed in this patchset) -------------------------------------------------------------------

  • Operator chart 1.2.0 ClusterRole had no pods rules → planner couldn't list workers or patch annotations. Fixed in chart template; workaround manifest provided for clusters not yet upgraded.
  • Stale label selectors in connectors/kubernetes.py: dynamo-graph-deploymentdynamo-graph-deployment-name, dynamo-servicedynamo-component/dynamo-component-type.
  • DCGM queries used pod=~ (matched the DCGM exporter pod) and the wrong namespace; corrected to exported_pod=~ + exported_namespace=<k8s-namespace> and pod regex now matches the operator's <dgd>-<replica-idx>-<service-key>-... format (monitoring/traffic_metrics.py).
  • Frontend-source metric methods returned NaN on quiet clusters (0/0 in increase(_sum)/increase(_count)); router path had a NaN guard, frontend didn't — added matching math.isnan filter.

Tests

Verified 2026-05-10 on a clean from-scratch repro on a live Azure AKS dev cluster (qwen3-quickstart DGD, single dev namespace):

Suite Pass Skip Fail


planner/tests/unit/ 465 0 0
planner/tests/integration/test_aic_power_e2e_sim 15 0 0
planner/tests/integration/test_aic_power_optimizer 34 0 0
planner/tests/integration/test_metric_paths_live 22-23 3-2 0
planner/tests/integration/test_actuation_knobs_live 10-11 1-0 0
power_agent/tests/ 43 0 0


Total (cold deploy / after sanity chat completion) 590-591 4-3 0

Documented skips:

  • test_frontend_metric_series_exists — passes after any traffic.
  • TestDirectRouterMetricsClientLive::* — LocalRouter doesn't expose dynamo_frontend_worker_* gauges (open-question feat: OAI compatible endpoints for TRTLLM #14 in design doc).
  • TestScalingRealMutation — opt-in disruptive test, gated by RUN_DISRUPTIVE_TESTS=1 (passes when enabled).

Repro recipe: docs/components/planner/dpp-dev-env.md
"From-Scratch Repro Script" section.

Compatibility

Power awareness is opt-in (enable_power_awareness=False by default). AIC optimizer is opt-in (enable_aic_optimizer=False by default). No existing planner behavior changes when both flags are off.

Phase 4 (multi-power-level AIC sweep) and Phase 5 (silicon validation) remain follow-up work; this PR delivers Phases 1–3 plus the Phase 4 pipeclean code path so it can be exercised before AIC ships the full sweep API.

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

Release Notes

  • New Features

    • Added Power Agent DaemonSet for enforcing GPU power limits on Dynamo worker pods via Kubernetes annotations.
    • Introduced power-aware planner with GPU power budgeting and replica scaling constraints.
    • Added AIC closed-loop optimizer for dynamic power and performance reconfiguration.
    • Implemented admission control coupling for frontend threshold management.
    • Added tools for power budget computation and AIC power-data integration.
  • Documentation

    • Added developer environment setup guide, deployment playbooks, and verification scripts for power-aware scenarios.
  • Deployment

    • Updated Helm charts to version 1.2.1 with planner pod-patching RBAC permissions.
    • Added example DynamoGraphDeployment manifests for power-aware configurations.

Review Change Stack

@kaim-eng kaim-eng requested review from a team as code owners May 10, 2026 13:45
@copy-pr-bot

copy-pr-bot Bot commented May 10, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added feat documentation Improvements or additions to documentation deployment::k8s Relates to dynamo deployment in kubernetes planner labels May 10, 2026
@coderabbitai

coderabbitai Bot commented May 10, 2026

Copy link
Copy Markdown
Contributor

Walkthrough

This pull request implements end-to-end power-aware GPU resource management for the Dynamo planner across three phases: (1) a privileged DaemonSet that enforces per-pod NVML power caps via cgroup-based pod mapping, (2) static planner budget enforcement, and (3) a closed-loop AIC optimizer with EMA-based power coefficient correction. It includes comprehensive tests, Kubernetes manifests, deployment examples, and tooling.

Changes

Power-Aware GPU Resource Management

Layer / File(s) Summary
Configuration & Schema
components/src/dynamo/planner/config/defaults.py, components/src/dynamo/planner/config/planner_config.py, components/src/dynamo/planner/core/types.py
SLAPlannerDefaults and PlannerConfig extended with power-awareness flags, GPU power budgets, AIC optimizer parameters, and admission control settings. TrafficObservation gains AIC signal fields (TTFT/ITL seconds, token throughput, scheduled tokens).
Power Agent Daemon
components/power_agent/README.md, components/power_agent/power_agent.py
DaemonSet that maps node PIDs to K8s pods via /proc/<pid>/cgroup parsing (v1/v2, systemd/cgroupfs), reads per-pod dynamo.nvidia.com/gpu-power-limit annotations, applies NVML power caps with safe-default fallback for conflicts, restores defaults on SIGTERM, and persists managed GPU UUIDs for orphan recovery.
Power Agent Tests
components/power_agent/tests/*
Unit tests covering cgroup parsing (6 QoS/driver combinations), annotation mapping, SKU clamping, apply-cap behavior with metrics, multi-pod policy (single/agree/conflict/parse-failure), and graceful shutdown with NVML error resilience.
Planner Budget Enforcement
components/src/dynamo/planner/core/state_machine.py, components/src/dynamo/planner/core/load_scaling.py, components/src/dynamo/planner/core/throughput_scaling.py, components/src/dynamo/planner/core/budget.py
_apply_power_budget() conditionally caps replica counts when total projected power exceeds limit, proportionally allocates prefill/decode, and prevents decode upscaling. Budget constant defined for pod annotation key.
Planner AIC Integration
components/src/dynamo/planner/core/base.py
Initializes optional AICPowerOptimizer at startup, enriches traffic observations with token scheduling and TTFT/ITL in seconds, applies per-pod power annotations, publishes power budget metrics, collects observed power from Prometheus, updates AIC correction coefficients, triggers re-optimizations on drift, and fans out admission /busy_threshold POSTs to frontend pods.
AIC Closed-Loop Optimizer
components/src/dynamo/planner/monitoring/aic_power_optimizer.py
Manages EMA-smoothed TTFT/ITL/power coefficients with traffic/token gating, detects SLA drift with hysteresis and rate limiting, sweeps replica counts under power budget, derives admission thresholds, warns on throughput regression, and implements fail-closed startup + fail-open runtime with auto-disable after consecutive failures.
Kubernetes Integration
components/src/dynamo/planner/connectors/kubernetes.py, components/src/dynamo/planner/connectors/kubernetes_api.py
Methods to discover component pods by label, list frontend pods, resolve named http container ports with fallback, patch pod annotations, and POST admission thresholds to frontend pod IPs via httpx.
Metrics & Monitoring
components/src/dynamo/planner/monitoring/planner_metrics.py, components/src/dynamo/planner/monitoring/traffic_metrics.py
Prometheus gauges for power budget/utilization, AIC coefficients (per-side and agg), implied admission thresholds, counters for failures/regressions. DCGM power query methods for total DGD power and per-component per-GPU power.
Planner Unit Tests
components/src/dynamo/planner/tests/unit/test_*.py
Tests for scaling targets, pod annotation patching with no-op/disabled checks, busy threshold JSON construction, Kubernetes connector frontend discovery/port resolution, DCGM power queries, and state machine budget fitting.
Planner Integration Tests
components/src/dynamo/planner/tests/integration/test_*.py
AIC optimizer happy path (replicas, power caps, latency/throughput), budget constraints, failure handling, EMA correction gating, drift detection. E2E simulations with synthetic H200 power levels. Live-cluster tests for annotation round-trip, pod discovery, admission POST, and metric path validation.
Profiler Extensions
components/src/dynamo/profiler/utils/aic_dataframe.py
build_prefill_row and build_decode_row now accept power_w parameter (measured from NVML) instead of hardcoded 0.0.
Helm Charts & RBAC
deploy/helm/charts/platform/*
Version bumps to 1.2.1, extended planner Role/ClusterRole with core pods permissions (get/list/watch/patch) for pod actuation.
Power Agent Manifests
deploy/power_agent/daemonset.yaml, deploy/power_agent/rbac.yaml, deploy/planner-pod-rbac-dev.yaml
DaemonSet running on GPU nodes (hostPID, nvidia runtime, tolerations), mounts /proc read-only and /var/lib/dynamo-power-agent for state, passes SAFE_DEFAULT_WATTS and node name. RBAC for ServiceAccount with pod read + patch. Dev RBAC workaround for operator ≤1.2.0.
Documentation
docs/components/planner/dpp-dev-env.md, examples/deployments/powerplanner/*
End-to-end dev environment setup on AKS, multi-DGD deployment playbook with power accounting, cold-start coefficient recommendations, E2E pipeclean runbook, README with quick-start and budget formula.
Deployment Examples
examples/deployments/powerplanner/disagg-*.yaml, examples/deployments/powerplanner/verify_poweraware.bash
Two disaggregated profiles (standard and conservative cold-start) with inline planner config enabling power/AIC. Bash smoke test validating DaemonSet readiness, pod annotations, power metrics, AIC coefficients, and Phase 4 preview.
Tooling
tools/compute_power_budget.py, tools/integrate_aic_power_data.py, tools/validate_aic_power_integration.py
CLI to compute total_gpu_power_limit from rack capacity/headroom/overhead, integrate GPU power benchmark files into AIC, validate AIC estimator returns non-zero power within expected ranges and optimizer uses real power (not TDP fallback).

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 17

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
components/src/dynamo/planner/tests/integration/test_aic_power_optimizer.py (1)

1-799: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix formatting and linting errors before merge.

The file has formatting issues and linting violations that must be addressed:

  1. Formatting: Run ruff format to fix whitespace and line wrapping.
  2. Linting (manual fixes required):
    • Replace ambiguous × character with x in comments at lines 329, 708 and docstring at line 696 (RUF002, RUF003).
    • Prefix unused variable n_p with underscore at line 742: _n_p, n_d = ... (RUF059).
Ruff violations detected
RUF002 at line 696: Docstring contains ambiguous × (MULTIPLICATION SIGN)
RUF003 at lines 329, 708: Comment contains ambiguous × (MULTIPLICATION SIGN)  
RUF059 at line 742: Unpacked variable `n_p` is never used
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/tests/integration/test_aic_power_optimizer.py`
around lines 1 - 799, Run automatic formatting (ruff format) to fix
whitespace/line-wrapping, then manually fix three lint issues: replace the
Unicode multiplication sign '×' with ASCII 'x' in the explanatory
comments/docstring around the budget and throughput calculations (references:
the multi-replica budget comment in TestOptimizeBudgetConstraint, the
docstring/throughput comment in TestThroughputRegressionWarning and the comment
in TestSweepReplicas) and make the unused unpacked variable in
TestSweepReplicas.test_no_budget_uses_max_replicas a named-unused by prefixing
it with an underscore (change n_p, n_d = opt._sweep_replicas(...) to _n_p, n_d =
...). These changes address RUF002/RUF003 (ambiguous ×) and RUF059 (unused
variable) while keeping function/class names like _sweep_replicas,
TestSweepReplicas, TestOptimizeBudgetConstraint, and
TestThroughputRegressionWarning locators unchanged.
components/src/dynamo/planner/tests/unit/test_prometheus.py (1)

1-1078: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Address pre-commit formatting failures.

The file has formatting issues flagged by ruff. Run ruff format test_prometheus.py to apply reformatting and ruff check --fix test_prometheus.py to fix lint violations.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/tests/unit/test_prometheus.py` around lines 1 -
1078, The test file fails ruff pre-commit checks; run an automatic reformat and
fixer (e.g. ruff format followed by ruff check --fix) on the file and commit the
changes so tests and lint pass; if applying edits manually, ensure formatting
and lint rules are respected across imports and function/class definitions
(examples to inspect: PrometheusAPIClient, DirectRouterMetricsClient,
TestPrometheusAPIClientRouterSource, and
test_frontend_metric_container_with_nan_value) and re-run ruff to confirm no
remaining violations.
components/src/dynamo/planner/tests/integration/test_aic_power_e2e_sim.py (1)

1-711: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Run ruff format to fix spacing and line-wrapping issues.

The code has formatting inconsistencies in comment alignment and line lengths. Run ruff format test_aic_power_e2e_sim.py to resolve the spacing before comments (e.g., 50.0, # 50.0, # ) and line-wrapping issues.

Formatting changes needed
-H200_PREFILL_SERVING_W = H200_PREFILL_P90_W   # 603 W/GPU at heavy prefill load
-H200_DECODE_SERVING_W = H200_DECODE_P90_W     # 558 W/GPU at heavy decode load
+H200_PREFILL_SERVING_W = H200_PREFILL_P90_W  # 603 W/GPU at heavy prefill load
+H200_DECODE_SERVING_W = H200_DECODE_P90_W  # 558 W/GPU at heavy decode load
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/tests/integration/test_aic_power_e2e_sim.py`
around lines 1 - 711, The file has ruff-formatting issues (misaligned inline
comments and long wrapped lines) across the test module (e.g., around numeric
literals and comment spacing such as "50.0,   #"), so run an automatic formatter
and fix spacing/line-wrapping: run `ruff format` (or your repo's formatter) on
this test file and ensure inline comments use a single space after the comma and
are properly wrapped; focus on symbols/classes to inspect after formatting like
TestFeedbackLoopH200.test_ema_converges_from_cold_start, _POWER_LEVELS entries,
_make_config, _make_traffic, and TestPhase4SyntheticMultiLevel methods to verify
comment alignment and line lengths are corrected.
components/src/dynamo/planner/tests/unit/test_kubernetes_connector.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Run the code formatter before merge.

The pre-commit check failed due to formatting issues. Run ruff format to fix line wrapping and assertion formatting.

#!/bin/bash
# Check current formatting diff
cd components/src/dynamo/planner
ruff format --diff tests/unit/test_kubernetes_connector.py
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/tests/unit/test_kubernetes_connector.py` at
line 1, Run the project's formatter on the test file to fix line wrapping and
assertion formatting: execute ruff format (e.g., `ruff format
tests/unit/test_kubernetes_connector.py` from the components/src/dynamo/planner
directory) to update formatting in test_kubernetes_connector.py so the
pre-commit checks pass; re-run the tests or pre-commit to confirm no remaining
style diffs.
🟡 Minor comments (7)
components/src/dynamo/planner/config/defaults.py-81-103 (1)

81-103: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Per-engine power-limit defaults contradict the stated "no placeholder integers" philosophy.

The comment at lines 82–84 explicitly argues that defaults are None "so the type itself signals 'operator must set'", yet prefill_engine_gpu_power_limit and decode_engine_gpu_power_limit are hardcoded to 300 W. With enable_power_awareness=True, these silently cap every prefill/decode GPU to 300 W regardless of SKU (≈43% of an H200 SXM's 700 W TDP), which would dramatically and silently regress throughput while the operator believes they are within budget. The validator in planner_config.py only enforces total_gpu_power_limit and power_agent_safe_default_watts, so this slips through.

Either make these Optional[int] = None and add a validator, or derive them from total_gpu_power_limit / gpu_count so the operator only needs to set one number.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/config/defaults.py` around lines 81 - 103, The
defaults prefill_engine_gpu_power_limit and decode_engine_gpu_power_limit
contradict the "no placeholder integers" policy and silently cap GPUs when
enable_power_awareness=True; change both to Optional[int]=None (or compute them
from total_gpu_power_limit divided by gpu_count) and update the
planner_config.py validation logic to require/derive per-engine limits when
enable_power_awareness is true (i.e., enforce that either per-engine limits are
set or automatically compute them from total_gpu_power_limit and known GPU
count) while keeping total_gpu_power_limit and power_agent_safe_default_watts
validation intact.
components/src/dynamo/planner/tests/unit/test_state_machine.py-1356-1358 (1)

1356-1358: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Rename the unused unpack target to keep pre-commit green.

num_p is never read in this test, and Ruff is already flagging it. Renaming it to _num_p avoids the RUF059 failure without changing the assertion.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/tests/unit/test_state_machine.py` around lines
1356 - 1358, The test assigns the tuple return from sm._apply_global_budget to
num_p and num_d but never uses num_p; rename the unused unpack target to _num_p
(i.e., change "num_p, num_d = sm._apply_global_budget(5, 3)" to "_num_p, num_d =
sm._apply_global_budget(5, 3)") to satisfy Ruff/RUF059 while leaving the
assertion against num_d unchanged.
tools/validate_aic_power_integration.py-137-137 (1)

137-137: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Drop the stray f prefixes to unblock Ruff.

These strings have no placeholders, and pre-commit is already reporting the resulting F541 errors here.

Also applies to: 167-167, 305-305

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/validate_aic_power_integration.py` at line 137, Calls to _section are
using f-string prefixes for plain string literals (e.g., _section(f"  prefill
estimate (isl=1024, tp=8)")) which triggers Ruff F541; remove the leading f so
the calls use normal string literals (e.g., _section("  prefill estimate
(isl=1024, tp=8)")). Apply the same change to the other occurrences of
_section(f"...") in the file (the other similar call sites that currently use an
f-prefix).
components/src/dynamo/planner/tests/unit/test_actuation_knobs.py-1-1 (1)

1-1: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Run the code formatter before merge.

The pre-commit check failed because the code needs formatting. Run ruff format on this file to fix line wrapping issues.

#!/bin/bash
# Check current formatting diff
cd components/src/dynamo/planner
ruff format --diff tests/unit/test_actuation_knobs.py
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/tests/unit/test_actuation_knobs.py` at line 1,
The file tests/unit/test_actuation_knobs.py needs to be auto-formatted to
satisfy pre-commit checks; run the project formatter (ruff format) on that file
(e.g., ruff format
components/src/dynamo/planner/tests/unit/test_actuation_knobs.py) to fix line
wrapping and style issues, then stage the updated file so the pre-commit check
passes.
examples/deployments/powerplanner/disagg-conservative-cold-start.yaml-94-94 (1)

94-94: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Correct the model ID to include the "Meta-" prefix.

The model ID meta-llama/Llama-3-8B-Instruct is incorrect. The official HuggingFace model is meta-llama/Meta-Llama-3-8B-Instruct. Using the wrong ID will cause model download failures at runtime.

Also applies to: line 114

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/deployments/powerplanner/disagg-conservative-cold-start.yaml` at
line 94, The model ID strings are incorrect: replace every occurrence of
"meta-llama/Llama-3-8B-Instruct" with the official HuggingFace ID
"meta-llama/Meta-Llama-3-8B-Instruct" (e.g., the value used in the model field
in the YAML), and update the second occurrence noted in the review as well;
ensure both instances exactly match "meta-llama/Meta-Llama-3-8B-Instruct" so the
runtime can successfully download the model.
components/src/dynamo/planner/tests/integration/test_actuation_knobs_live.py-305-313 (1)

305-313: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Resolve the frontend port from the pod instead of forcing 8000.

Line 308 hardcodes the default frontend port, so this live test will false-fail on deployments that override the http port. Use the same per-pod port resolution helper as the runtime path instead of assuming 8000.

As per coding guidelines, "Don’t hardcode ports in tests (no literal 8000/8081/...; use provided dynamic port fixtures/helpers)."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/tests/integration/test_actuation_knobs_live.py`
around lines 305 - 313, The test currently hardcodes port 8000 when calling
connector.post_busy_threshold (pod, model, port=8000) which breaks deployments
that override the http port; replace the literal with the per-pod frontend-port
resolver used by the runtime path (call it to compute port_from_pod =
resolve_frontend_port(pod) or the existing helper in your test runtime
utilities) and pass that variable into connector.post_busy_threshold (ensure you
await or import the helper if async). This fixes the test by dynamically
resolving the correct frontend port for the given pod instead of assuming 8000.
components/src/dynamo/planner/tests/integration/test_actuation_knobs_live.py-571-582 (1)

571-582: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Replace time.sleep(2) with await asyncio.sleep(2) to avoid blocking the event loop.

This async test uses await calls, and time.sleep(2) blocks the entire asyncio event loop, which can interfere with async cleanup and make the test flaky. Import asyncio at the module level and use await asyncio.sleep(2) instead.

Suggested change
-            time.sleep(2)  # let the API server settle
+            await asyncio.sleep(2)  # let the API server settle
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/tests/integration/test_actuation_knobs_live.py`
around lines 571 - 582, The test is blocking the asyncio event loop by calling
time.sleep(2); replace that call with await asyncio.sleep(2) and add an import
asyncio at module level. Locate the snippet where
connector.set_component_replicas(...) is called (in the test function that uses
await) and change the synchronous sleep to the async sleep so the event loop
isn’t blocked and async cleanup can run correctly.
🧹 Nitpick comments (9)
components/power_agent/power_agent.py (1)

489-490: 💤 Low value

Move argparse import to module top per Python guidelines.

 import json
 import logging
 import os
 import re
 import signal
 import sys
 import threading
 import time
 from typing import Optional
+import argparse
 def main() -> None:
-    import argparse
-
     parser = argparse.ArgumentParser(description="Dynamo Power Agent DaemonSet")

As per coding guidelines (.ai/python-guidelines.md): "flag any import inside function/method/class bodies (keep imports at module top)".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/power_agent/power_agent.py` around lines 489 - 490, The import of
argparse is inside the main() function; move the import to the module top to
follow guidelines. Remove the inline "import argparse" from def main() and add
"import argparse" alongside other top-level imports so that the module-level
imports include argparse and main() simply uses it.
components/power_agent/tests/test_multi_pod_policy.py (1)

66-126: ⚡ Quick win

Coverage looks good for the policy itself; consider adding a same-pod-multi-PID case.

The cases here cover _resolve_cap_for_gpu thoroughly. To complete the matrix and lock down the deduplication concern raised against power_agent.py:449-455, please add a test that asserts a single pod producing two PIDs on the same GPU resolves to the pod's annotation without firing multi_pod_agree:

def test_same_pod_multi_pid_not_multi_pod(self):
    # After the dedup fix, two PIDs from one pod -> 1 entry, no warning.
    cap = _resolve_cap_for_gpu(
        0, [("uid-1", "480")], SAFE_DEFAULT, self.m
    )
    self.assertEqual(cap, 480)
    self.assertEqual(self.m.multi_pod_agree, 0)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/power_agent/tests/test_multi_pod_policy.py` around lines 66 - 126,
Add a new unit test method test_same_pod_multi_pid_not_multi_pod to
TestMultiPodPolicy that calls _resolve_cap_for_gpu with a single pod UID and a
valid annotation (use SAFE_DEFAULT and the fixture self.m) and asserts the
returned cap equals the annotated value and that self.m.multi_pod_agree remains
0; reference _resolve_cap_for_gpu, SAFE_DEFAULT and the metric multi_pod_agree
so the test verifies same-pod/multiple-PID deduplication (after the fix around
power_agent.py:449-455) does not trigger the multi-pod agreement metric.
components/src/dynamo/planner/monitoring/planner_metrics.py (1)

189-213: ⚡ Quick win

Keep the new AIC metrics under the dynamo_planner_* namespace.

Everything else in this registry uses PREFIX, but these counters/gauges switch to dynamo_aic_*. That splits one feature area across two metric namespaces and makes prefix-based dashboards/alerts miss the new series.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/monitoring/planner_metrics.py` around lines 189
- 213, The new AIC metric names use the "dynamo_aic_*" prefix but must live
under the existing PREFIX namespace; update the metric name arguments for
aic_correction_pegged_total, aic_consecutive_failures,
aic_optimizer_exceptions_total, aic_optimizer_disabled_reason, and
aic_throughput_regression_total to use PREFIX (e.g. PREFIX +
"aic_correction_pegged_total" etc.) instead of hardcoded "dynamo_aic_*" so they
appear under the dynamo_planner_* namespace; keep the same help strings,
labelnames, and metric types unchanged.
components/src/dynamo/planner/tests/integration/test_aic_power_optimizer.py (1)

742-742: ⚡ Quick win

Unused variable in test assertion.

The variable n_p is unpacked but never used in this test. Either use it in an assertion or prefix with _ to indicate it's intentionally unused.

♻️ Suggested fix
-        n_p, n_d = opt._sweep_replicas(1, 5, 3200, 2400, None)
+        _, n_d = opt._sweep_replicas(1, 5, 3200, 2400, None)
         assert n_d == 5
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/tests/integration/test_aic_power_optimizer.py`
at line 742, The test unpacks n_p and n_d from opt._sweep_replicas(1, 5, 3200,
2400, None) but never uses n_p; update the test to either assert something about
n_p (for example expected value) or mark it as intentionally unused by renaming
it to _n_p (or simply _), e.g., change the unpacking from "n_p, n_d =
opt._sweep_replicas(...)" to "_, n_d = opt._sweep_replicas(...)" or "_n_p, n_d =
opt._sweep_replicas(...)" so the linter/test clearly reflects that n_p is not
used.
examples/deployments/powerplanner/README.md (1)

52-54: 💤 Low value

Optional: Specify language for fenced code block.

The budget formula code block could use a language specifier for consistency (e.g., text or plaintext), though it's not critical for a mathematical formula.

♻️ Optional improvement
-```
+```text
 total_gpu_power_limit = (rack_capacity_W × headroom_factor) − non_gpu_overhead
 ```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/deployments/powerplanner/README.md` around lines 52 - 54, The fenced
code block containing the budget formula currently has no language specified;
update the opening fence from ``` to a language label (e.g., ```text or
```plaintext) so the block becomes ```text (or ```plaintext) followed by the
same formula `total_gpu_power_limit = (rack_capacity_W × headroom_factor) −
non_gpu_overhead` and a closing ```, ensuring consistency and clearer rendering
in README.md.
deploy/power_agent/daemonset.yaml (1)

93-99: ⚡ Quick win

Verify resource limits are appropriate for production.

The agent requests 50m CPU / 64Mi memory with limits of 200m CPU / 128Mi memory. Confirm these values are validated against:

  • Typical NVML API latency (should be low)
  • Kubernetes API list/watch overhead per 15s reconcile loop
  • Number of GPUs per node (more GPUs = more NVML calls)

Consider monitoring container_memory_working_set_bytes and rate(container_cpu_usage_seconds_total) in production to validate these limits don't cause OOM or CPU throttling on large nodes (e.g., 8× H200 SXM).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/power_agent/daemonset.yaml` around lines 93 - 99, The current
resources.requests and resources.limits in daemonset.yaml (cpu: "50m"/"200m",
memory: "64Mi"/"128Mi") are likely too low for production; validate and raise
them based on measured NVML API latency, 15s controller reconcile list/watch
overhead, and max GPUs per node by load-testing with representative nodes (e.g.,
8× H200 SXM). Measure container_memory_working_set_bytes and
rate(container_cpu_usage_seconds_total) under stress to find steady-state and
peak usage, then update resources.requests to cover steady-state usage and
resources.limits to provide safe headroom (avoid OOM/throttling), and iterate:
simulate NVML-heavy loops, reconcile cycles, and multiple GPUs, adjust values in
daemonset.yaml accordingly and add monitoring alerts if working_set or CPU rate
approaches limits.
components/src/dynamo/planner/tests/integration/test_aic_power_e2e_sim.py (1)

486-489: 💤 Low value

Class attribute dict comprehension triggers Ruff warning.

Ruff flags this as a mutable class attribute (RUF012), though it's effectively a constant. Consider moving it outside the class or converting to a module-level constant to silence the warning.

♻️ Optional refactor
+# Per-replica wattage for each level (power_w × 8 GPUs)
+_WATTS_PER_REPLICA = {
+    lvl: _POWER_LEVELS[lvl]["prefill_w"] * H200_GPUS_PER_ENGINE
+    for lvl in _POWER_LEVELS
+}
+
 class TestPhase4SyntheticMultiLevel:
-    # Per-replica wattage for each level (power_w × 8 GPUs)
-    _WATTS_PER_REPLICA = {
-        lvl: _POWER_LEVELS[lvl]["prefill_w"] * H200_GPUS_PER_ENGINE
-        for lvl in _POWER_LEVELS
-    }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/tests/integration/test_aic_power_e2e_sim.py`
around lines 486 - 489, Move the mutable dict comprehension _WATTS_PER_REPLICA
out of the class and make it a module-level constant to avoid the Ruff RUF012
warning: compute the dict using the existing symbols _POWER_LEVELS and
H200_GPUS_PER_ENGINE at module scope (e.g., next to _POWER_LEVELS) and replace
the class attribute reference with that module-level constant so the class no
longer defines a mutable attribute.
components/src/dynamo/planner/tests/unit/test_prometheus.py (1)

896-896: ⚡ Quick win

Prefix unused unpacked variables with underscore.

Multiple test methods unpack tuples but don't use all variables. Prefix unused variables with _ to indicate this is intentional.

♻️ Suggested fixes
 # Line 896
-        recent, per_worker_avg, cluster_avg = result
+        recent, _per_worker_avg, _cluster_avg = result
         assert recent["w0"]["active_prefill_tokens"] == pytest.approx(50.0)

 # Line 904
-        recent, per_worker_avg, cluster_avg = direct_client.get_recent_and_averaged_metrics("prefill")
+        _recent, per_worker_avg, _cluster_avg = direct_client.get_recent_and_averaged_metrics("prefill")
         assert per_worker_avg["w0"]["active_prefill_tokens"] == pytest.approx(50.0)

 # Line 927
-        _, per_worker_avg, cluster_avg = direct_client.get_recent_and_averaged_metrics("prefill")
+        _, _per_worker_avg, cluster_avg = direct_client.get_recent_and_averaged_metrics("prefill")
         assert cluster_avg["active_prefill_tokens"] == pytest.approx(20.0)

Also applies to: 904-904, 927-927

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/tests/unit/test_prometheus.py` at line 896, In
components/src/dynamo/planner/tests/unit/test_prometheus.py there are tuple
unpackings like "recent, per_worker_avg, cluster_avg = result" where some
unpacked variables are unused; change the unused names to be prefixed with an
underscore (e.g., "recent, _per_worker_avg, _cluster_avg = result" or replace
with "_" placeholders) so the intent is explicit. Update the other similar
unpackings in the same file (the occurrences that unpack into recent,
per_worker_avg, cluster_avg) to follow the same underscore-prefixed naming
convention.
components/src/dynamo/planner/connectors/kubernetes.py (1)

306-307: ⚡ Quick win

Avoid broad exception catching.

The broad except Exception: can mask unexpected errors (e.g., KeyError, AttributeError from malformed deployment dicts). Catch specific exceptions (PlannerError) to make the error-handling intent explicit.

🔍 Catch specific exceptions
         try:
             service = get_service_from_sub_component_type_or_name(
                 deployment, sub_component_type=sub_component_type
             )
-        except Exception:
+        except PlannerError:
             return []

Based on learnings: Python guidelines require catching specific exceptions rather than broad Exception to avoid swallowing unexpected errors.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/connectors/kubernetes.py` around lines 306 -
307, The broad except Exception: in connectors/kubernetes.py masks unexpected
errors; change it to catch the specific PlannerError (e.g., use except
PlannerError as e:) and return [] for that case, while letting any other
exceptions propagate (or re-raise them) so KeyError/AttributeError aren’t
swallowed; also ensure PlannerError is imported where the except block resides
and preserve the original return [] behavior for the PlannerError path.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@components/power_agent/power_agent.py`:
- Around line 449-455: _reconcile_gpu currently appends one (uid, annotation)
tuple per PID causing duplicate entries for multi-PID pods; change the logic to
deduplicate by pod UID before calling _resolve_cap_for_gpu (e.g., build a set or
map of seen UIDs using _extract_pod_uid_from_cgroup and uid_to_annotation, then
produce a single (uid, annotation) entry per UID into pod_annotations) so
_resolve_cap_for_gpu sees unique pods; update references to pod_annotations and
keep uid_to_annotation lookup as before; add a regression test in
test_multi_pod_policy.py that covers "single pod, two PIDs" to prevent
regressions.

In `@components/power_agent/tests/__init__.py`:
- Line 1: Add the standard repository SPDX header to the new package module
components.power_agent.tests.__init__.py; open the empty __init__.py and prepend
the project's copyright/SPDX license header block (the same header used across
other top-level Python files in the repo) so the file passes copyright-checks.

In `@components/power_agent/tests/test_apply_cap.py`:
- Around line 17-24: The test module imports power_agent eagerly which can raise
ImportError for optional NVML/K8s deps and break test collection; wrap the
import block that pulls in power_agent and symbols (POWER_ANNOTATION_KEY,
PowerAgent, _apply_cap, _clamp_to_constraints, _nvml_uuid) in a try/except
ImportError and in the except call pytest.skip("optional dependency missing:
<package>", allow_module_level=True) at module scope (ensure pytest is imported
at top-level so skip can be called).

In `@components/src/dynamo/planner/connectors/kubernetes.py`:
- Around line 384-390: Move the httpx import to the module top and remove the
try/except ImportError block in the admission-control POST handling code: add a
plain "import httpx" at the top of
components/src/dynamo/planner/connectors/kubernetes.py, delete the in-function
try/except that currently raises RuntimeError for missing httpx, and rely on the
module-level import to surface ImportError early; update any references in the
admission-control POST support code to use httpx directly (no fallback or
conditional logic needed).

In `@components/src/dynamo/planner/core/base.py`:
- Around line 753-767: The planner's advisory/dry-run flag isn't respected:
update _apply_power_annotations to bail out when self.config.advisory is True
(in addition to existing checks) so it never PATCHes pod annotations in advisory
mode; likewise locate the autoset fanout logic that posts /busy_threshold (the
code around the second occurrence noted) and add the same advisory guard to skip
POSTing frontend thresholds when self.config.advisory is True. Ensure both
guards check self.config.advisory before making any Kubernetes PATCH or HTTP
POST calls so advisory mode performs no writes.
- Around line 946-976: The current fanout to self.connector.post_busy_threshold
can leave some frontend pods with old thresholds and the planner continuing;
after computing failed (pods with Exception) you must treat any partial failure
as a hard failure: increment the admission_partial_success_total metric as done,
log the error, then raise an exception (e.g., RuntimeError with the failed pod
names and original results) or otherwise abort/return an error so the planner
does not proceed; ensure this change is applied in the same block that builds
results/pods (refer to self.connector.post_busy_threshold, the failed list,
self.prometheus_metrics.admission_partial_success_total, and logger.error) so
partial success stops the planner instead of being treated as success.

In `@components/src/dynamo/planner/monitoring/aic_power_optimizer.py`:
- Around line 534-545: The current branch returns (min_ep, min_ep) when
remaining_after_min_p <= 0 which can still exceed the budget once dynamic power
(d_watts) is considered; update this branch to check whether min_ep * (p_watts +
d_watts) <= budget before returning min_ep and if that check fails, log an error
via logger.error (including budget, min_ep, p_watts, d_watts) and fail closed by
returning a safe zero/NO_OP result (e.g., (0, 0)) or raising a clear exception
instead of returning an over-budget config; change the code around the variables
budget, min_ep, p_watts, d_watts and the logger.warning call accordingly.

In `@components/src/dynamo/planner/monitoring/planner_metrics.py`:
- Around line 244-247: The metric variable admission_partial_success_total is
misnamed vs its help text; either rename the Counter to
admission_partial_failure_total (and update all references/usages of
admission_partial_success_total) so the metric name matches that it counts
frontend POST failures, or keep the variable name but change the Counter help
string to describe "Cumulative count of frontend POST partial successes" if it
truly counts successes; update the f-string name
(f"{PREFIX}_admission_partial_failure_total") and any code that reads or exports
this metric (search for admission_partial_success_total) to maintain
consistency.

In `@components/src/dynamo/planner/monitoring/traffic_metrics.py`:
- Around line 404-414: The try/except in get_total_dgd_power (and the sibling
power helper around lines 445–457) is catching Exception broadly and swallowing
errors by logging and returning None; change it to either catch only expected
parsing/Prometheus errors (e.g., ValueError, TypeError, KeyError, or the
specific exception type thrown by self.prom.custom_query) and handle those by
logging and returning None, or if you must catch Exception, log the error with
logger.exception/ logger.error and then re-raise so unexpected failures
propagate; locate the calls to self.prom.custom_query and the surrounding
logger.debug call in get_total_dgd_power and the other helper and apply the
narrowed exception list or add a re-raise after logging.

In
`@components/src/dynamo/planner/tests/integration/test_actuation_knobs_live.py`:
- Line 75: The module-level pytestmark currently sets pytest.mark.integration
and pytest.mark.gpu_0 but lacks a timeout; update the pytestmark list (the
pytestmark variable) to also include a pytest.mark.timeout(...) marker (e.g.,
pytest.mark.timeout(120) or another appropriate upper bound >30s) so the entire
live suite has a hard timeout and network/polling calls fail fast; ensure you
import pytest if not already and add the timeout marker to the existing
pytestmark list rather than adding per-test decorators.

In `@components/src/dynamo/planner/tests/integration/test_metric_paths_live.py`:
- Line 73: The module-level pytestmarks lack a timeout; update the pytestmark
variable in test_metric_paths_live.py (the existing pytestmark =
[pytest.mark.integration, pytest.mark.gpu_0]) to include a timeout marker such
as pytest.mark.timeout(300) (or another appropriate value >=30s) so long-running
in-cluster Prometheus/K8s calls will fail instead of hanging; ensure pytest is
imported if not already.

In `@deploy/planner-pod-rbac-dev.yaml`:
- Around line 37-39: The RoleBinding subject for the ServiceAccount
"planner-dev-sa" is missing the required namespace field; update the RoleBinding
subjects block to add "namespace: <namespace>" for the ServiceAccount subject
(or wire a templating/Kustomize variable) so the subject entry for
planner-dev-sa includes its namespace and the binding correctly grants
permissions.

In `@deploy/power_agent/rbac.yaml`:
- Around line 9-14: The docs show an apply path that leaves the
ClusterRoleBinding.subjects[0].namespace set to the literal placeholder
`${POWER_AGENT_NAMESPACE}`, which breaks the binding because kubectl does not
perform variable substitution; update the README snippet to either remove the
misleading "kubectl apply -n <your-namespace> -f deploy/power_agent/rbac.yaml"
path or change it to explicitly instruct the user to substitute the placeholder
(e.g., edit the ServiceAccount and ClusterRoleBinding.subjects[0].namespace or
use envsubst) so the ServiceAccount namespace and
ClusterRoleBinding.subjects[0].namespace match.

In `@tools/compute_power_budget.py`:
- Line 148: The print statement currently uses an unnecessary f-string prefix
("print(f\"    (= 70% of TDP; adjust for your model's idle power profile)\")")
which triggers Ruff F541; change the call to a plain string literal by removing
the leading "f" so the statement is normal (e.g., update the print call in
compute_power_budget.py where that string is printed). Ensure no other f-prefix
remains on that exact print invocation.
- Around line 316-336: Validate the numeric CLI inputs before building
SizingInput and calling compute_budget: check that args.nodes and
args.gpus_per_node are positive integers (greater than 0) so total_gpus =
args.nodes * args.gpus_per_node > 0, and also validate args.rack_capacity and
gpu_tdp (and any other numeric args used in SizingInput such as
args.non_gpu_overhead and elements of args.dgd_gpus) are positive where
applicable; if any validation fails, print a clear stderr error and exit
non-zero. Locate the creation of total_gpus and the SizingInput construction in
compute_power_budget.py (symbols: total_gpus, SizingInput) and add these
pre-checks before calling compute_budget() to avoid division-by-zero or
meaningless budgets.

In `@tools/integrate_aic_power_data.py`:
- Around line 230-231: After resolving args.h200_data and args.b200_data into
h200_data and b200_data, validate that any non-None Path actually exists (and is
a directory) and fail fast if not: if h200_data is not None and not
h200_data.exists()/is_dir() (same for b200_data) log/raise a clear error and
exit with non-zero status (or call the argparse parser.error) so a mistyped root
produces a hard failure instead of silent "nothing to copy"; update the
validation near the h200_data/b200_data assignment to perform these checks and
produce a descriptive message including the offending path variable name.

In `@tools/validate_aic_power_integration.py`:
- Around line 20-21: The tight-budget test never runs because the script only
calls the optimizer with a generous hardcoded cap (80_000); add an explicit
tight-budget run that sets the budget to just under one replica's power and
assert the optimizer returns fewer replicas than requested. Concretely: compute
the single-replica draw (use the same attribute/func the script uses to estimate
replica power, e.g., replica_power_watts or equivalent), call the optimizer (the
existing optimizer/optimize/run_optimizer call) with budget =
single_replica_power - 1, and add an assertion that the returned replica count <
requested_max_replicas (mirror the style used elsewhere in the file). Ensure
this new check is executed (not behind a dead branch) alongside the existing
generous-cap test (the one using 80_000).

---

Outside diff comments:
In `@components/src/dynamo/planner/tests/integration/test_aic_power_e2e_sim.py`:
- Around line 1-711: The file has ruff-formatting issues (misaligned inline
comments and long wrapped lines) across the test module (e.g., around numeric
literals and comment spacing such as "50.0,   #"), so run an automatic formatter
and fix spacing/line-wrapping: run `ruff format` (or your repo's formatter) on
this test file and ensure inline comments use a single space after the comma and
are properly wrapped; focus on symbols/classes to inspect after formatting like
TestFeedbackLoopH200.test_ema_converges_from_cold_start, _POWER_LEVELS entries,
_make_config, _make_traffic, and TestPhase4SyntheticMultiLevel methods to verify
comment alignment and line lengths are corrected.

In `@components/src/dynamo/planner/tests/integration/test_aic_power_optimizer.py`:
- Around line 1-799: Run automatic formatting (ruff format) to fix
whitespace/line-wrapping, then manually fix three lint issues: replace the
Unicode multiplication sign '×' with ASCII 'x' in the explanatory
comments/docstring around the budget and throughput calculations (references:
the multi-replica budget comment in TestOptimizeBudgetConstraint, the
docstring/throughput comment in TestThroughputRegressionWarning and the comment
in TestSweepReplicas) and make the unused unpacked variable in
TestSweepReplicas.test_no_budget_uses_max_replicas a named-unused by prefixing
it with an underscore (change n_p, n_d = opt._sweep_replicas(...) to _n_p, n_d =
...). These changes address RUF002/RUF003 (ambiguous ×) and RUF059 (unused
variable) while keeping function/class names like _sweep_replicas,
TestSweepReplicas, TestOptimizeBudgetConstraint, and
TestThroughputRegressionWarning locators unchanged.

In `@components/src/dynamo/planner/tests/unit/test_kubernetes_connector.py`:
- Line 1: Run the project's formatter on the test file to fix line wrapping and
assertion formatting: execute ruff format (e.g., `ruff format
tests/unit/test_kubernetes_connector.py` from the components/src/dynamo/planner
directory) to update formatting in test_kubernetes_connector.py so the
pre-commit checks pass; re-run the tests or pre-commit to confirm no remaining
style diffs.

In `@components/src/dynamo/planner/tests/unit/test_prometheus.py`:
- Around line 1-1078: The test file fails ruff pre-commit checks; run an
automatic reformat and fixer (e.g. ruff format followed by ruff check --fix) on
the file and commit the changes so tests and lint pass; if applying edits
manually, ensure formatting and lint rules are respected across imports and
function/class definitions (examples to inspect: PrometheusAPIClient,
DirectRouterMetricsClient, TestPrometheusAPIClientRouterSource, and
test_frontend_metric_container_with_nan_value) and re-run ruff to confirm no
remaining violations.

---

Minor comments:
In `@components/src/dynamo/planner/config/defaults.py`:
- Around line 81-103: The defaults prefill_engine_gpu_power_limit and
decode_engine_gpu_power_limit contradict the "no placeholder integers" policy
and silently cap GPUs when enable_power_awareness=True; change both to
Optional[int]=None (or compute them from total_gpu_power_limit divided by
gpu_count) and update the planner_config.py validation logic to require/derive
per-engine limits when enable_power_awareness is true (i.e., enforce that either
per-engine limits are set or automatically compute them from
total_gpu_power_limit and known GPU count) while keeping total_gpu_power_limit
and power_agent_safe_default_watts validation intact.

In
`@components/src/dynamo/planner/tests/integration/test_actuation_knobs_live.py`:
- Around line 305-313: The test currently hardcodes port 8000 when calling
connector.post_busy_threshold (pod, model, port=8000) which breaks deployments
that override the http port; replace the literal with the per-pod frontend-port
resolver used by the runtime path (call it to compute port_from_pod =
resolve_frontend_port(pod) or the existing helper in your test runtime
utilities) and pass that variable into connector.post_busy_threshold (ensure you
await or import the helper if async). This fixes the test by dynamically
resolving the correct frontend port for the given pod instead of assuming 8000.
- Around line 571-582: The test is blocking the asyncio event loop by calling
time.sleep(2); replace that call with await asyncio.sleep(2) and add an import
asyncio at module level. Locate the snippet where
connector.set_component_replicas(...) is called (in the test function that uses
await) and change the synchronous sleep to the async sleep so the event loop
isn’t blocked and async cleanup can run correctly.

In `@components/src/dynamo/planner/tests/unit/test_actuation_knobs.py`:
- Line 1: The file tests/unit/test_actuation_knobs.py needs to be auto-formatted
to satisfy pre-commit checks; run the project formatter (ruff format) on that
file (e.g., ruff format
components/src/dynamo/planner/tests/unit/test_actuation_knobs.py) to fix line
wrapping and style issues, then stage the updated file so the pre-commit check
passes.

In `@components/src/dynamo/planner/tests/unit/test_state_machine.py`:
- Around line 1356-1358: The test assigns the tuple return from
sm._apply_global_budget to num_p and num_d but never uses num_p; rename the
unused unpack target to _num_p (i.e., change "num_p, num_d =
sm._apply_global_budget(5, 3)" to "_num_p, num_d = sm._apply_global_budget(5,
3)") to satisfy Ruff/RUF059 while leaving the assertion against num_d unchanged.

In `@examples/deployments/powerplanner/disagg-conservative-cold-start.yaml`:
- Line 94: The model ID strings are incorrect: replace every occurrence of
"meta-llama/Llama-3-8B-Instruct" with the official HuggingFace ID
"meta-llama/Meta-Llama-3-8B-Instruct" (e.g., the value used in the model field
in the YAML), and update the second occurrence noted in the review as well;
ensure both instances exactly match "meta-llama/Meta-Llama-3-8B-Instruct" so the
runtime can successfully download the model.

In `@tools/validate_aic_power_integration.py`:
- Line 137: Calls to _section are using f-string prefixes for plain string
literals (e.g., _section(f"  prefill estimate (isl=1024, tp=8)")) which triggers
Ruff F541; remove the leading f so the calls use normal string literals (e.g.,
_section("  prefill estimate (isl=1024, tp=8)")). Apply the same change to the
other occurrences of _section(f"...") in the file (the other similar call sites
that currently use an f-prefix).

---

Nitpick comments:
In `@components/power_agent/power_agent.py`:
- Around line 489-490: The import of argparse is inside the main() function;
move the import to the module top to follow guidelines. Remove the inline
"import argparse" from def main() and add "import argparse" alongside other
top-level imports so that the module-level imports include argparse and main()
simply uses it.

In `@components/power_agent/tests/test_multi_pod_policy.py`:
- Around line 66-126: Add a new unit test method
test_same_pod_multi_pid_not_multi_pod to TestMultiPodPolicy that calls
_resolve_cap_for_gpu with a single pod UID and a valid annotation (use
SAFE_DEFAULT and the fixture self.m) and asserts the returned cap equals the
annotated value and that self.m.multi_pod_agree remains 0; reference
_resolve_cap_for_gpu, SAFE_DEFAULT and the metric multi_pod_agree so the test
verifies same-pod/multiple-PID deduplication (after the fix around
power_agent.py:449-455) does not trigger the multi-pod agreement metric.

In `@components/src/dynamo/planner/connectors/kubernetes.py`:
- Around line 306-307: The broad except Exception: in connectors/kubernetes.py
masks unexpected errors; change it to catch the specific PlannerError (e.g., use
except PlannerError as e:) and return [] for that case, while letting any other
exceptions propagate (or re-raise them) so KeyError/AttributeError aren’t
swallowed; also ensure PlannerError is imported where the except block resides
and preserve the original return [] behavior for the PlannerError path.

In `@components/src/dynamo/planner/monitoring/planner_metrics.py`:
- Around line 189-213: The new AIC metric names use the "dynamo_aic_*" prefix
but must live under the existing PREFIX namespace; update the metric name
arguments for aic_correction_pegged_total, aic_consecutive_failures,
aic_optimizer_exceptions_total, aic_optimizer_disabled_reason, and
aic_throughput_regression_total to use PREFIX (e.g. PREFIX +
"aic_correction_pegged_total" etc.) instead of hardcoded "dynamo_aic_*" so they
appear under the dynamo_planner_* namespace; keep the same help strings,
labelnames, and metric types unchanged.

In `@components/src/dynamo/planner/tests/integration/test_aic_power_e2e_sim.py`:
- Around line 486-489: Move the mutable dict comprehension _WATTS_PER_REPLICA
out of the class and make it a module-level constant to avoid the Ruff RUF012
warning: compute the dict using the existing symbols _POWER_LEVELS and
H200_GPUS_PER_ENGINE at module scope (e.g., next to _POWER_LEVELS) and replace
the class attribute reference with that module-level constant so the class no
longer defines a mutable attribute.

In `@components/src/dynamo/planner/tests/integration/test_aic_power_optimizer.py`:
- Line 742: The test unpacks n_p and n_d from opt._sweep_replicas(1, 5, 3200,
2400, None) but never uses n_p; update the test to either assert something about
n_p (for example expected value) or mark it as intentionally unused by renaming
it to _n_p (or simply _), e.g., change the unpacking from "n_p, n_d =
opt._sweep_replicas(...)" to "_, n_d = opt._sweep_replicas(...)" or "_n_p, n_d =
opt._sweep_replicas(...)" so the linter/test clearly reflects that n_p is not
used.

In `@components/src/dynamo/planner/tests/unit/test_prometheus.py`:
- Line 896: In components/src/dynamo/planner/tests/unit/test_prometheus.py there
are tuple unpackings like "recent, per_worker_avg, cluster_avg = result" where
some unpacked variables are unused; change the unused names to be prefixed with
an underscore (e.g., "recent, _per_worker_avg, _cluster_avg = result" or replace
with "_" placeholders) so the intent is explicit. Update the other similar
unpackings in the same file (the occurrences that unpack into recent,
per_worker_avg, cluster_avg) to follow the same underscore-prefixed naming
convention.

In `@deploy/power_agent/daemonset.yaml`:
- Around line 93-99: The current resources.requests and resources.limits in
daemonset.yaml (cpu: "50m"/"200m", memory: "64Mi"/"128Mi") are likely too low
for production; validate and raise them based on measured NVML API latency, 15s
controller reconcile list/watch overhead, and max GPUs per node by load-testing
with representative nodes (e.g., 8× H200 SXM). Measure
container_memory_working_set_bytes and rate(container_cpu_usage_seconds_total)
under stress to find steady-state and peak usage, then update resources.requests
to cover steady-state usage and resources.limits to provide safe headroom (avoid
OOM/throttling), and iterate: simulate NVML-heavy loops, reconcile cycles, and
multiple GPUs, adjust values in daemonset.yaml accordingly and add monitoring
alerts if working_set or CPU rate approaches limits.

In `@examples/deployments/powerplanner/README.md`:
- Around line 52-54: The fenced code block containing the budget formula
currently has no language specified; update the opening fence from ``` to a
language label (e.g., ```text or ```plaintext) so the block becomes ```text (or
```plaintext) followed by the same formula `total_gpu_power_limit =
(rack_capacity_W × headroom_factor) − non_gpu_overhead` and a closing ```,
ensuring consistency and clearer rendering in README.md.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a7e07320-b76e-4140-a984-1251cbfdf781

📥 Commits

Reviewing files that changed from the base of the PR and between 9db7176 and 3c0e9a7.

📒 Files selected for processing (47)
  • components/power_agent/README.md
  • components/power_agent/power_agent.py
  • components/power_agent/tests/__init__.py
  • components/power_agent/tests/test_apply_cap.py
  • components/power_agent/tests/test_cgroup_parser.py
  • components/power_agent/tests/test_multi_pod_policy.py
  • components/power_agent/tests/test_shutdown.py
  • components/src/dynamo/planner/config/defaults.py
  • components/src/dynamo/planner/config/planner_config.py
  • components/src/dynamo/planner/connectors/kubernetes.py
  • components/src/dynamo/planner/connectors/kubernetes_api.py
  • components/src/dynamo/planner/core/base.py
  • components/src/dynamo/planner/core/budget.py
  • components/src/dynamo/planner/core/load_scaling.py
  • components/src/dynamo/planner/core/state_machine.py
  • components/src/dynamo/planner/core/throughput_scaling.py
  • components/src/dynamo/planner/core/types.py
  • components/src/dynamo/planner/monitoring/aic_power_optimizer.py
  • components/src/dynamo/planner/monitoring/planner_metrics.py
  • components/src/dynamo/planner/monitoring/traffic_metrics.py
  • components/src/dynamo/planner/tests/integration/test_actuation_knobs_live.py
  • components/src/dynamo/planner/tests/integration/test_aic_power_e2e_sim.py
  • components/src/dynamo/planner/tests/integration/test_aic_power_optimizer.py
  • components/src/dynamo/planner/tests/integration/test_metric_paths_live.py
  • components/src/dynamo/planner/tests/unit/test_actuation_knobs.py
  • components/src/dynamo/planner/tests/unit/test_kubernetes_connector.py
  • components/src/dynamo/planner/tests/unit/test_prometheus.py
  • components/src/dynamo/planner/tests/unit/test_state_machine.py
  • components/src/dynamo/profiler/utils/aic_dataframe.py
  • deploy/helm/charts/platform/Chart.yaml
  • deploy/helm/charts/platform/README.md
  • deploy/helm/charts/platform/components/operator/Chart.yaml
  • deploy/helm/charts/platform/components/operator/templates/planner.yaml
  • deploy/planner-pod-rbac-dev.yaml
  • deploy/power_agent/daemonset.yaml
  • deploy/power_agent/rbac.yaml
  • docs/components/planner/dpp-dev-env.md
  • docs/design-docs/powerplanner-design.md
  • examples/deployments/powerplanner/MULTI_DGD.md
  • examples/deployments/powerplanner/PIPECLEAN.md
  • examples/deployments/powerplanner/README.md
  • examples/deployments/powerplanner/disagg-conservative-cold-start.yaml
  • examples/deployments/powerplanner/disagg-power-aware.yaml
  • examples/deployments/powerplanner/verify_poweraware.bash
  • tools/compute_power_budget.py
  • tools/integrate_aic_power_data.py
  • tools/validate_aic_power_integration.py

Comment on lines +449 to +455
pod_annotations: list[tuple[str, Optional[str]]] = []
for proc in procs:
uid = _extract_pod_uid_from_cgroup(proc.pid)
if uid is None:
continue # non-K8s process — skip
if uid in uid_to_annotation:
pod_annotations.append((uid, uid_to_annotation[uid]))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Same-pod multi-PID workloads are incorrectly flagged as multi-pod-per-GPU.

_reconcile_gpu appends one entry to pod_annotations per PID rather than per pod UID. NVML's nvmlDeviceGetComputeRunningProcesses returns every CUDA-context process on the GPU, so a single pod with multiple GPU-bound processes (TP workers, MPS clients, MPI ranks, child workers spawned by serving stacks) yields N tuples with the same UID and same annotation. _resolve_cap_for_gpu then sees len(pod_annotations) > 1, all unique == 1 → emits the WARN "multi-pod-per-GPU is unsupported topology" and increments multi_pod_gpu_total{disposition="agree"} on every reconcile (every 15 s).

The applied cap is still correct (values agree), but the warnings/metrics are false positives and contradict the policy comment at lines 319–325. Deduplicate by UID first:

🩹 Proposed fix
-        pod_annotations: list[tuple[str, Optional[str]]] = []
-        for proc in procs:
-            uid = _extract_pod_uid_from_cgroup(proc.pid)
-            if uid is None:
-                continue  # non-K8s process — skip
-            if uid in uid_to_annotation:
-                pod_annotations.append((uid, uid_to_annotation[uid]))
+        # Deduplicate by pod UID — multiple PIDs from the same pod (TP workers,
+        # MPI ranks, MPS clients) must count as one pod, not N.
+        seen_uids: set[str] = set()
+        pod_annotations: list[tuple[str, Optional[str]]] = []
+        for proc in procs:
+            uid = _extract_pod_uid_from_cgroup(proc.pid)
+            if uid is None or uid in seen_uids:
+                continue  # non-K8s process or already-counted pod
+            seen_uids.add(uid)
+            if uid in uid_to_annotation:
+                pod_annotations.append((uid, uid_to_annotation[uid]))

A regression test for "single pod, two PIDs" in test_multi_pod_policy.py would lock this in.

Based on retrieved learning from PR 5280: "Either enforce an explicit one-to-one mapping with clear guards and comments, or extend the implementation to correctly handle multiple pods per GPU".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/power_agent/power_agent.py` around lines 449 - 455, _reconcile_gpu
currently appends one (uid, annotation) tuple per PID causing duplicate entries
for multi-PID pods; change the logic to deduplicate by pod UID before calling
_resolve_cap_for_gpu (e.g., build a set or map of seen UIDs using
_extract_pod_uid_from_cgroup and uid_to_annotation, then produce a single (uid,
annotation) entry per UID into pod_annotations) so _resolve_cap_for_gpu sees
unique pods; update references to pod_annotations and keep uid_to_annotation
lookup as before; add a regression test in test_multi_pod_policy.py that covers
"single pod, two PIDs" to prevent regressions.

@@ -0,0 +1 @@

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add the standard SPDX header to this new package file.

copyright-checks is already failing on this file, so the empty __init__.py still needs the repo header before merge.

🧰 Tools
🪛 GitHub Actions: Copyright Checks / 0_copyright-checks.txt

[error] 1-1: Copyright checkers detected invalid/missing header: components/power_agent/tests/init.py

🪛 GitHub Actions: Copyright Checks / copyright-checks

[error] 1-1: Copyright check failed: invalid/missing header.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/power_agent/tests/__init__.py` at line 1, Add the standard
repository SPDX header to the new package module
components.power_agent.tests.__init__.py; open the empty __init__.py and prepend
the project's copyright/SPDX license header block (the same header used across
other top-level Python files in the repo) so the file passes copyright-checks.

Comment on lines +17 to +24
import power_agent
from power_agent import (
POWER_ANNOTATION_KEY,
PowerAgent,
_apply_cap,
_clamp_to_constraints,
_nvml_uuid,
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard the module import so optional NVML/K8s deps don't break collection.

This imports power_agent eagerly, so any missing optional dependency will fail test collection instead of skipping the module. Wrap the import in try/except ImportError and call pytest.skip(..., allow_module_level=True) at module scope. As per coding guidelines, **/test_*.py: Use try/except ImportError to skip test collection when optional packages are absent with pytest.skip(..., allow_module_level=True).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/power_agent/tests/test_apply_cap.py` around lines 17 - 24, The
test module imports power_agent eagerly which can raise ImportError for optional
NVML/K8s deps and break test collection; wrap the import block that pulls in
power_agent and symbols (POWER_ANNOTATION_KEY, PowerAgent, _apply_cap,
_clamp_to_constraints, _nvml_uuid) in a try/except ImportError and in the except
call pytest.skip("optional dependency missing: <package>",
allow_module_level=True) at module scope (ensure pytest is imported at top-level
so skip can be called).

Comment on lines +384 to +390
try:
import httpx
except ImportError as exc:
raise RuntimeError(
"httpx is required for admission-control POST support. "
"Install it with: pip install httpx"
) from exc

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Move the httpx import to module top.

This import is inside the function body, which violates the Python guidelines. The guidelines state: "flag any import inside function/method/class bodies (keep imports at module top)" and "try/except ImportError around imports must be avoided unless the dependency is truly optional and a fallback/None path exists."

Here, the function raises RuntimeError when httpx is missing, so there's no fallback—httpx is a hard requirement for this code path. Move the import to the top of the file.

📦 Move import to module top
 import json
 import logging
 import os
 from typing import Optional
+
+try:
+    import httpx
+except ImportError:
+    httpx = None

 from dynamo.planner.config.defaults import SubComponentType, TargetReplica

Then in the function:

     ) -> None:
         """POST /busy_threshold to a single frontend pod (§6.7).
 
         Raises on HTTP error or network failure — callers should gather with
         ``return_exceptions=True`` and inspect the results.
         Per-pod timeout is 5 s (frontend is in-cluster; long timeouts mask
         saturation events).
         """
-        try:
-            import httpx
-        except ImportError as exc:
+        if httpx is None:
             raise RuntimeError(
                 "httpx is required for admission-control POST support. "
                 "Install it with: pip install httpx"
-            ) from exc
+            )

Based on learnings: Python guidelines require imports at module top unless truly optional with a fallback/None path. As per coding guidelines, this pattern avoids dual execution modes and makes dependencies explicit.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/connectors/kubernetes.py` around lines 384 -
390, Move the httpx import to the module top and remove the try/except
ImportError block in the admission-control POST handling code: add a plain
"import httpx" at the top of
components/src/dynamo/planner/connectors/kubernetes.py, delete the in-function
try/except that currently raises RuntimeError for missing httpx, and rely on the
module-level import to surface ImportError early; update any references in the
admission-control POST support code to use httpx directly (no fallback or
conditional logic needed).

Comment on lines +753 to +767
async def _apply_power_annotations(self) -> None:
"""Annotate worker pods with per-GPU power limits (Phase 1).

Reads the actual annotation from each Pod object returned by
get_component_pods(). Only PATCHes when annotation is missing or wrong.
K8s is the source of truth — no local cache.

Called after _apply_effects() on every tick so newly-created pods
pick up the cap within one tick, and AIC-driven cap changes (Phase 3+)
propagate to all running pods automatically.
"""
if not self.config.enable_power_awareness:
return
if not isinstance(self.connector, KubernetesConnector):
return

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Advisory mode still writes to the cluster through the new power/admission paths.

_apply_scaling_targets() honors self.config.advisory, but _apply_power_annotations() and the autoset /busy_threshold fanout do not. With advisory=True and power awareness enabled, the planner will still PATCH pod annotations and POST frontend thresholds, which breaks the dry-run contract and changes live behavior.

Suggested change
     async def _apply_power_annotations(self) -> None:
         """Annotate worker pods with per-GPU power limits (Phase 1).
@@
-        if not self.config.enable_power_awareness:
+        if self.config.advisory or not self.config.enable_power_awareness:
             return
         if not isinstance(self.connector, KubernetesConnector):
             return
@@
-        if self.config.admission_mode == "autoset" and isinstance(
-            self.connector, KubernetesConnector
-        ):
+        if (
+            not self.config.advisory
+            and self.config.admission_mode == "autoset"
+            and isinstance(self.connector, KubernetesConnector)
+        ):
             await self._fanout_busy_threshold(new_config)

Also applies to: 899-902

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/core/base.py` around lines 753 - 767, The
planner's advisory/dry-run flag isn't respected: update _apply_power_annotations
to bail out when self.config.advisory is True (in addition to existing checks)
so it never PATCHes pod annotations in advisory mode; likewise locate the
autoset fanout logic that posts /busy_threshold (the code around the second
occurrence noted) and add the same advisory guard to skip POSTing frontend
thresholds when self.config.advisory is True. Ensure both guards check
self.config.advisory before making any Kubernetes PATCH or HTTP POST calls so
advisory mode performs no writes.

Comment on lines +9 to +14
# Apply with an explicit namespace flag:
# kubectl apply -n <your-namespace> -f deploy/power_agent/rbac.yaml
#
# The ClusterRoleBinding subject below references that same namespace via
# `${POWER_AGENT_NAMESPACE}` — substitute it manually or pipe through envsubst:
# POWER_AGENT_NAMESPACE=<your-namespace> envsubst < rbac.yaml | kubectl apply -f -

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

cat -n deploy/power_agent/rbac.yaml

Repository: ai-dynamo/dynamo

Length of output: 1758


The plain kubectl apply -n ... path leaves the binding broken because it doesn't substitute the placeholder.

When you run kubectl apply -n <your-namespace> -f deploy/power_agent/rbac.yaml, the ServiceAccount is created in the specified namespace via the -n flag, but the ClusterRoleBinding.subjects[0].namespace field (line 50) retains the literal string ${POWER_AGENT_NAMESPACE}. Since kubectl does not perform variable substitution, the binding references a non-existent namespace and the pod-read permissions never take effect. Only the envsubst path (lines 13–14) substitutes the placeholder correctly. Recommend either: (1) removing the first documentation path, or (2) updating it to clarify that manual substitution of line 50 is required before applying.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/power_agent/rbac.yaml` around lines 9 - 14, The docs show an apply
path that leaves the ClusterRoleBinding.subjects[0].namespace set to the literal
placeholder `${POWER_AGENT_NAMESPACE}`, which breaks the binding because kubectl
does not perform variable substitution; update the README snippet to either
remove the misleading "kubectl apply -n <your-namespace> -f
deploy/power_agent/rbac.yaml" path or change it to explicitly instruct the user
to substitute the placeholder (e.g., edit the ServiceAccount and
ClusterRoleBinding.subjects[0].namespace or use envsubst) so the ServiceAccount
namespace and ClusterRoleBinding.subjects[0].namespace match.

Comment thread tools/compute_power_budget.py Outdated
print()
print(" Additional planner settings to review:")
print(f" power_agent_safe_default_watts: {math.ceil(self.gpu_tdp_w * 0.70)}")
print(f" (= 70% of TDP; adjust for your model's idle power profile)")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remove the stray f prefix on Line 148.

Ruff is flagging this as F541, so the file will keep failing pre-commit until that string is a plain literal again.

🧰 Tools
🪛 Ruff (0.15.12)

[error] 148-148: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/compute_power_budget.py` at line 148, The print statement currently
uses an unnecessary f-string prefix ("print(f\"    (= 70% of TDP; adjust for
your model's idle power profile)\")") which triggers Ruff F541; change the call
to a plain string literal by removing the leading "f" so the statement is normal
(e.g., update the print call in compute_power_budget.py where that string is
printed). Ensure no other f-prefix remains on that exact print invocation.

Comment on lines +316 to +336
total_gpus = args.nodes * args.gpus_per_node
non_gpu_overhead = (
args.non_gpu_overhead
if args.non_gpu_overhead is not None
else args.nodes * _NON_GPU_OVERHEAD_PER_NODE_W
)

if not (0 < args.headroom_factor <= 1.0):
print(
f"ERROR: --headroom-factor must be in (0, 1]. Got {args.headroom_factor}.",
file=sys.stderr,
)
sys.exit(2)

inp = SizingInput(
rack_capacity_w=args.rack_capacity,
headroom_factor=args.headroom_factor,
non_gpu_overhead_w=non_gpu_overhead,
total_gpus=total_gpus,
gpu_tdp_w=gpu_tdp,
dgd_gpu_counts=args.dgd_gpus or [],

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate the numeric CLI inputs before computing the budget.

argparse accepts 0 and negatives for these fields. With --nodes 0 / --gpus-per-node 0 the tool can emit a meaningless budget for zero GPUs, and in multi-DGD mode it can fall into gpus / inp.total_gpus inside compute_budget().

Suggested fix
     total_gpus = args.nodes * args.gpus_per_node
     non_gpu_overhead = (
         args.non_gpu_overhead
         if args.non_gpu_overhead is not None
         else args.nodes * _NON_GPU_OVERHEAD_PER_NODE_W
     )
+
+    if args.nodes <= 0 or args.gpus_per_node <= 0:
+        print("ERROR: --nodes and --gpus-per-node must be > 0.", file=sys.stderr)
+        sys.exit(2)
+    if gpu_tdp <= 0:
+        print("ERROR: --gpu-tdp must be > 0.", file=sys.stderr)
+        sys.exit(2)
+    if non_gpu_overhead < 0:
+        print("ERROR: --non-gpu-overhead must be >= 0.", file=sys.stderr)
+        sys.exit(2)
+    if any(g <= 0 for g in (args.dgd_gpus or [])):
+        print("ERROR: every --dgd-gpus entry must be > 0.", file=sys.stderr)
+        sys.exit(2)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/compute_power_budget.py` around lines 316 - 336, Validate the numeric
CLI inputs before building SizingInput and calling compute_budget: check that
args.nodes and args.gpus_per_node are positive integers (greater than 0) so
total_gpus = args.nodes * args.gpus_per_node > 0, and also validate
args.rack_capacity and gpu_tdp (and any other numeric args used in SizingInput
such as args.non_gpu_overhead and elements of args.dgd_gpus) are positive where
applicable; if any validation fails, print a clear stderr error and exit
non-zero. Locate the creation of total_gpus and the SizingInput construction in
compute_power_budget.py (symbols: total_gpus, SizingInput) and add these
pre-checks before calling compute_budget() to avoid division-by-zero or
meaningless budgets.

Comment on lines +230 to +231
h200_data = args.h200_data.resolve() if args.h200_data else None
b200_data = args.b200_data.resolve() if args.b200_data else None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject nonexistent data roots before starting integration.

A typo in --h200-data or --b200-data currently degrades into a stream of per-subdir warnings and a zero-file success exit. That makes a bad input path look like “nothing to copy” instead of a hard failure.

Suggested fix
     h200_data = args.h200_data.resolve() if args.h200_data else None
     b200_data = args.b200_data.resolve() if args.b200_data else None
+
+    for flag, path in (("--h200-data", h200_data), ("--b200-data", b200_data)):
+        if path is not None and not path.is_dir():
+            print(f"ERROR: {flag} directory not found: {path}", file=sys.stderr)
+            sys.exit(1)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/integrate_aic_power_data.py` around lines 230 - 231, After resolving
args.h200_data and args.b200_data into h200_data and b200_data, validate that
any non-None Path actually exists (and is a directory) and fail fast if not: if
h200_data is not None and not h200_data.exists()/is_dir() (same for b200_data)
log/raise a clear error and exit with non-zero status (or call the argparse
parser.error) so a mistyped root produces a hard failure instead of silent
"nothing to copy"; update the validation near the h200_data/b200_data assignment
to perform these checks and produce a descriptive message including the
offending path variable name.

Comment on lines +20 to +21
5. Budget constraint: with a budget tighter than 1 full replica's power draw,
the optimizer correctly returns fewer replicas than the requested max.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

The advertised tight-budget validation never actually runs.

The docstring says this script verifies the optimizer under a budget tighter than one full replica, but the only optimizer run uses an intentionally generous 80_000 W cap and never asserts replica reduction. This can pass even when the tight-budget path is broken.

Also applies to: 231-236, 272-290

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/validate_aic_power_integration.py` around lines 20 - 21, The
tight-budget test never runs because the script only calls the optimizer with a
generous hardcoded cap (80_000); add an explicit tight-budget run that sets the
budget to just under one replica's power and assert the optimizer returns fewer
replicas than requested. Concretely: compute the single-replica draw (use the
same attribute/func the script uses to estimate replica power, e.g.,
replica_power_watts or equivalent), call the optimizer (the existing
optimizer/optimize/run_optimizer call) with budget = single_replica_power - 1,
and add an assertion that the returned replica count < requested_max_replicas
(mirror the style used elsewhere in the file). Ensure this new check is executed
(not behind a dead branch) alongside the existing generous-cap test (the one
using 80_000).

@grahamking grahamking left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if you intended to send an 11,000+ line PR, but that's what this is. Please split this into separate far smaller pieces.

Ideally start with an Issue to set the context and attach your agent's plan (if using one).

@kaim-eng kaim-eng force-pushed the kaim/power-planner branch 3 times, most recently from 218306c to 1ac66af Compare May 12, 2026 20:37
kaim-eng and others added 11 commits May 18, 2026 08:56
Adds power-aware intelligence to the Dynamo planner across three layers,
plus tooling that pipecleans the Phase 4 selection path with currently
available single-TDP AIC data. Reimplementation of unmerged PR #5280
against post-refactor ToT (planner/utils/ no longer exists; configuration
is Pydantic; layout is config/, connectors/, core/, monitoring/).

Full design: docs/design-docs/powerplanner-design.md
Repro / dev environment: docs/components/planner/dpp-dev-env.md

What's in this PR
-----------------

Phase 1 — infrastructure
  - Power Agent DaemonSet: components/power_agent/{power_agent.py,tests/}
    Watches pod annotations, maps cgroup→pod_uid→GPU via NVML, calls
    nvmlDeviceSetPowerManagementLimit on a 15 s reconciliation loop.
    Multi-pod-per-GPU policy, fail-closed safe-default, SIGTERM restore.
  - Pod annotation actuation: connectors/kubernetes.py writes
    dynamo.nvidia.com/gpu-power-limit on worker pods.
  - K8s manifests: deploy/power_agent/{daemonset,rbac}.yaml.
  - Operator chart fix: ClusterRole now grants `pods` permissions
    (deploy/helm/charts/platform/components/operator/templates/planner.yaml,
    chart bumped to 1.2.1). Older clusters can apply
    deploy/planner-pod-rbac-dev.yaml as a temporary patch.

Phase 2 — budget enforcement
  - _apply_power_budget() in core/state_machine.py clamps replica
    scaling within a watt budget. New PlannerConfig fields:
    enable_power_awareness, total_gpu_power_limit (required when
    awareness enabled — validator enforces, no silent default),
    prefill/decode_engine_gpu_power_limit, power_agent_safe_default_watts.

Phase 3 — AIC optimizer
  - monitoring/aic_power_optimizer.py: AIConfigurator sweep picks
    (replica, power) configs that maximise throughput within the watt
    budget. Per-component EMA correction (c_ttft, c_itl, c_power_p,
    c_power_d for disagg; c_power_agg for agg) closes the loop on
    silicon-vs-AIC drift. SLA gate, budget gate, capacity-exceeded
    re-optimize, auto-disable on consecutive failures.
  - core/base.py: NativePlannerBase plumbing (admission threshold push,
    named-port resolution, AIC integration).

Phase 4 — pipeclean (using existing single-TDP data)
  - tools/integrate_aic_power_data.py: copies measured H200/B200 power
    data into an AIC checkout and reinstalls.
  - tools/validate_aic_power_integration.py: asserts estimate_perf()
    returns power_w in [100, 710] W (not 0.0, not 700.0 TDP fallback).
  - tools/compute_power_budget.py: rack-capacity → safe-budget helper.
  - examples/deployments/powerplanner/: PIPECLEAN.md runbook,
    disagg-{power-aware,conservative-cold-start}.yaml,
    verify_poweraware.bash (8-section health check inc. Phase 4 preview),
    MULTI_DGD.md operator playbook, README.md.

Bug fixes (found by live integration tests, fixed in this patchset)
-------------------------------------------------------------------
  - Operator chart 1.2.0 ClusterRole had no `pods` rules → planner
    couldn't list workers or patch annotations. Fixed in chart template;
    workaround manifest provided for clusters not yet upgraded.
  - Stale label selectors in connectors/kubernetes.py:
    `dynamo-graph-deployment` → `dynamo-graph-deployment-name`,
    `dynamo-service` → `dynamo-component`/`dynamo-component-type`.
  - DCGM queries used `pod=~` (matched the DCGM exporter pod) and the
    wrong namespace; corrected to `exported_pod=~` +
    `exported_namespace=<k8s-namespace>` and pod regex now matches the
    operator's `<dgd>-<replica-idx>-<service-key>-...` format
    (monitoring/traffic_metrics.py).
  - Frontend-source metric methods returned NaN on quiet clusters
    (0/0 in `increase(_sum)/increase(_count)`); router path had a NaN
    guard, frontend didn't — added matching `math.isnan` filter.

Tests
-----

Verified 2026-05-10 on a clean from-scratch repro on a live Azure AKS
dev cluster (qwen3-quickstart DGD, single dev namespace):

  Suite                                                  Pass  Skip  Fail
  ----------------------------------------------------   ----  ----  ----
  planner/tests/unit/                                     465     0     0
  planner/tests/integration/test_aic_power_e2e_sim       15     0     0
  planner/tests/integration/test_aic_power_optimizer     34     0     0
  planner/tests/integration/test_metric_paths_live       22-23  3-2    0
  planner/tests/integration/test_actuation_knobs_live    10-11  1-0    0
  power_agent/tests/                                      43     0     0
  ----------------------------------------------------   ----  ----  ----
  Total (cold deploy / after sanity chat completion)    590-591  4-3    0

Documented skips:
  - test_frontend_metric_series_exists — passes after any traffic.
  - TestDirectRouterMetricsClientLive::* — LocalRouter doesn't expose
    dynamo_frontend_worker_* gauges (open-question #14 in design doc).
  - TestScalingRealMutation — opt-in disruptive test, gated by
    RUN_DISRUPTIVE_TESTS=1 (passes when enabled).

Repro recipe: docs/components/planner/dpp-dev-env.md
              "From-Scratch Repro Script" section.

Compatibility
-------------
Power awareness is opt-in (enable_power_awareness=False by default).
AIC optimizer is opt-in (enable_aic_optimizer=False by default).
No existing planner behavior changes when both flags are off.

Phase 4 (multi-power-level AIC sweep) and Phase 5 (silicon validation)
remain follow-up work; this PR delivers Phases 1–3 plus the Phase 4
pipeclean code path so it can be exercised before AIC ships the full
sweep API.

Signed-off-by: Kai Ma <kaim@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Kai Ma <kaim@nvidia.com>
Adds the synthetic-metrics stress testbed for the Power Planner state
machine + AIC optimizer, plus the small source-side hooks that let the
testbed swap in fake AIC and clamp tick-time monotonicity.

Testbed (alpha + gamma classes, no GPU / no live cluster required):
- 27 alpha-class scenarios driven by parameterized fakes (FakeAIC,
  FakeActuator, FakePrometheus, FakePlannerMetrics) wired against the
  real PlannerStateMachine + AICPowerOptimizer
- 3 gamma-class scenarios that overlay synthetic power on the real
  dynamo-mocker scheduler + PlannerReplayBridge for realism back-stop
- Self-consistency + overlay determinism tests
- YAML scenario catalog (A/B/C/D/E/F/G families) with shared _base/ and
  systems/ definitions; loadable gate prevents silent schema drift
- Auto-skips gamma when the local Python wheel ships only the testbed
  stub (no Rust mocker), so the same suite is runnable on Windows
  laptops AND inside the dev pod
- Verified: 82 passed, 5 skipped on Windows; 86 passed, 1 skipped on
  the dev pod against the older create_disagg bridge

Source enablement:
- AICPowerOptimizer: optional _aic_estimator_factory injection point
  for testbed substitution; production path unchanged
- ReplayPlannerAdapter: clamp now_s = max(tick.at_s, bridge_now_s) so
  sparse traces do not wedge throughput-tick scheduling

Misc:
- components/src/conftest.py: top-of-src dynamo._core stub installer
  (needed before any dynamo.* package init runs)
- scripts/inspect_dcgm_qwen3.py, scripts/inspect_live_metrics.py:
  Phase 4 cluster-side metric probes
- test_planner_power_launch.py: Phase 4 advisory-mode launch script
  used to validate the planner -> annotation -> NVML enforcement chain
- docs/design-docs/powerplanner-testbed-design.md: full testbed design
  + defect/fix history through v2.3 (dev-pod gamma validation)

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Kai Ma <kaim@nvidia.com>
Pre-Phase-5 ("hardware validation" per powerplanner-design.md §11) housekeeping:
makes the dev environment shareable across teammates by removing personal
identifiers, parameterizing all dev-pod / probe references, and folding 11
review fixes into the three Phase 1-4 design documents.

Dev-env hardening
-----------------
* Personal namespace (`kaim-dynamo-system*`) and pinned cluster node ID
  (`aks-a100a-36888584-vmss000002`) removed from every dev-env asset:
  - Root-level `dev-pod.yaml`, `qwen3-quickstart-dgd.yaml`, and
    `Dockerfile.planner-dev` moved to `deploy/planner/dev/` with
    `${NS}` / `${DGD}` / `${DYN_NS}` envsubst placeholders and inline
    usage instructions.
  - Root-level `test_k8s_access.py` moved to `scripts/dev/` and reads
    `DYN_PARENT_DGD_K8S_NAMESPACE` (or `POD_NAMESPACE`) at runtime.
  - 5 `scripts/inspect_*.py` cluster probes parameterized via
    `DYN_PARENT_DGD_K8S_NAMESPACE`; failure mode is loud (SystemExit)
    rather than a hard-coded namespace.
  - `deploy/power_agent/dev-pod.yaml`: `nodeName` switched to
    `<GPU_NODE_NAME>` placeholder with a `kubectl get pods ...`
    one-liner showing how to discover the right node.
* `.gitignore` hardened to enforce the existing `.tmp-*` "intentionally
  not committed" convention (matches `examples/deployments/powerplanner/
  .tmp-gp-minimal.yaml`) and to block the four common root-level
  personal-scratch files from sneaking back in via `git add .`.

dpp-dev-env.md updates
----------------------
* All 10 path references rewritten to point at the new
  `deploy/planner/dev/` and `scripts/dev/` homes.
* New §5 ("Deploy the Dev Pod") subsection documenting the
  `${NS}` / `${DGD}` / `${DYN_NS}` placeholder workflow with both an
  `envsubst` (Linux/WSL) path and a Windows edit-in-place path.
* Quick Deploy Checklist filename corrected from the stale-DGDR
  `qwen3-quickstart.yaml` to `qwen3-quickstart-dgd.yaml`, plus a
  cross-reference to the One-Time Setup §4 warning.
* DGD-ready wait command standardized to the programmatic
  `kubectl wait --for=jsonpath='{.status.state}'=successful` form
  in both §4 and the Checklist (removes the manual-`-w` divergence).

Design-doc review pass (powerplanner-design.md, powerplanner-testbed-design.md)
-------------------------------------------------------------------------------
* powerplanner-design.md
  - Header `Status` flipped from `Draft` to `Validated (Phases 1-3 -
    590/4 cold + 86/1 testbed; see §9.0 / §9.0.1). Phases 4-5 still
    draft.` with last-validation date.
  - §3.1: added an explicit note that `aic_interpolation` and `mode`
    are pre-existing PlannerConfig fields (owned by
    `monitoring/aic_interpolation.py`) and intentionally absent from
    the Names Registry.
  - §5.7 / §6.7: reworded the cross-reference to failure mode #5 so
    it correctly points at the *throughput regression* case rather
    than reading as a blanket "config revert".
  - §6.5 pseudocode: replaced module-level `self.device_count` with
    `pynvml.nvmlDeviceGetCount()` (the original was syntactically
    incorrect outside the daemon class).
  - §13 Open Question #11: corrected the duplicated
    `scheduled_decode_kv_tokens` typo so the agg-mode gate now reads
    `scheduled_prefill_tokens + scheduled_decode_kv_tokens` matching
    §5.3.
* powerplanner-testbed-design.md
  - §7: added a "Numeric-suffix convention" note explaining the
    intentional D21/E21 ID collision (IDs unique by filename + tuple,
    not renumbered).
  - §11: corrected "Six guards" -> "Seven guards" to match the seven
    items actually listed.
  - §5.2: typo `MNB` -> `MNBT (max_num_batched_tokens)`.
  - §C.14: verbatim test-output `30 PASSED` -> `31 PASSED` (1 wrapper
    + 30 parametrized) so the listing matches the actual run.

Verification
------------
* Testbed (alpha + gamma): 82 passed, 5 skipped (matches the documented
  Windows baseline; gamma auto-skipped without the Rust mocker).
* AIC no-cluster integration (test_aic_power_optimizer.py +
  test_aic_power_e2e_sim.py): 49 passed.
* Unit tests: 456 passed; the 9 remaining failures are pre-existing
  Windows-only environment limitations (cp1252 codec, `os.killpg`
  POSIX-only, missing `filterpy`) confirmed via `git stash` to be
  untouched by this commit.
* All touched .py files: `python3.10 -m py_compile` clean.
* All touched .yaml files: `yaml.safe_load_all` clean.
* `ReadLints` over all 12 touched files: no errors.
* Final `rg "kaim|aks-a100a-36888584"` outside committed
  `tests/fault_tolerance/...` (pre-existing, separate component) and
  `examples/.../.tmp-gp-minimal.yaml` (now gitignored): zero hits.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Kai Ma <kaim@nvidia.com>
Three minor issues surfaced by an end-to-end run of the optimizer
against real H200/B200 power-enabled AIC data (Llama-3.1-8B,
ISL=2048, OSL=256):

1. Non-physical AIC power_w. AIC's per-kernel 3D-cubic interpolator
   extrapolates `generation_attention` to ~1563 W per op on H200
   vLLM 0.19.1 at batch >= 256, producing an aggregate power_w of
   1275 W against a 700 W TDP and a 721 W measured peak. Cap_d was
   ending up at 1276 W, which NVML would silently clamp to TDP --
   leaving the planner's stated cap fictional and pegging the EMA at
   0.5 within a few ticks. AICPowerOptimizer now clamps the value
   used for cap derivation at TDP*1.1; the *raw* value is preserved
   on PowerAwareConfig as the EMA denominator so c_power_*
   coefficients visibly converge below 1.0 whenever AIC over-predicts,
   giving operators an independent signal alongside the new
   `aic_power_w_clamped_total{side}` counter. New design-doc §8 row 14
   wires the cascade with the existing §6 EMA-peg signal. 5 new
   integration tests (TestAICPowerWClamp): non-physical clamp, raw
   value preserved on config, side-selective clamp, threshold
   boundary, agg-mode averaged path. The B200 TRT-LLM 1.3.0rc6 data
   stays cold on this clamp because its lattice is ~10x denser than
   the H200 sandbox -- confirms this is a data-density issue, not a
   fundamental AIC bug.

2. Wrong runtime documentation. optimize docstring (and
   powerplanner-design.md sec 6.4 + sec 10 row 12) claimed ~6 s per
   sweep. Measured 2026-05-11 against real sandboxes: cold first
   call is 3.5 s (H200 vLLM) / 5.9 s (B200 TRT-LLM), dominated by
   perf-database load from disk; warm calls are 10-15 ms once
   databases_cache is populated. Closed-loop CPU duty cycle at the
   default aic_reoptimize_interval=300s is ~0.004 percent, not the
   ~2 percent the docstring implied. Docs updated to distinguish
   cold vs warm and to clarify that the hysteresis guard exists to
   suppress downstream PATCH/POST fanout, not AIC CPU.

3. No regression coverage against real AIC data. New opt-in testbed
   class wires the AIC sandbox into pytest via the AIC_SANDBOX_DIR
   env var (marker real_aic). 11 tests parametrized over every
   (system, backend) tuple present in the sandbox: smoke + clamp
   contract + multi-tick EMA drift loop (well-calibrated -> 1.0,
   over-predict -> pegs at 0.5, under-predict -> ~1.5, hysteresis
   fires at exactly aic_drift_consecutive_ticks, healthy load
   produces zero spurious sweeps). Module-level skip when the env
   var is unset, keeping the default test surface unchanged.

Test impact:
- tests/integration/test_aic_power_optimizer.py: 34 -> 39, all pass
- tests/testbed/tests/: 55 -> 66 with sandbox mounted (11 new
  real_aic tests); 55 + 3 skips without (1 module skip + 2 gamma
  skips), default test surface unchanged.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Kai Ma <kaim@nvidia.com>
…bits)

Mechanical post-rebase cleanup surfaced by `pre-commit run` on PR 9369:

- isort/black reformat across 47 power-planner files

- ruff F841: drop dead assignments (kwargs/sweep_fired/c_ttft_before_stable/clamp_threshold)

- ruff F821: add missing TYPE_CHECKING imports for StructuredAssertion, ExprAssertion, NoiseModel, TickHistory in testbed forward-refs

- ruff E402: `# noqa: E402` on intentional late imports after sys.path / stub install (testbed/conftest.py, test_planner_power_launch.py)

- chmod +x: 5 scripts with shebangs (power_agent.py, compute/integrate/validate_aic_power tools, verify_poweraware.bash)

No behavior changes.

Signed-off-by: Kai Ma <kaim@nvidia.com>
…proportional_clamp_pair

PR 9369 was based on a pre-#9294 _apply_global_budget that had an inline
ceiling-shrink path with min(num_d, ...) to prevent decode up-scaling
beyond original demand (Open Question #2).

The rebase took main's #9294 refactor that delegates to the new
budget.proportional_clamp_pair helper. That helper distributes the
remaining budget proportionally and can over-allocate decode in the
asymmetric (p_gpu != d_gpu) case.

Re-introduce the clamp at the call site rather than in the shared helper,
so we don't change the contract for other callers (e.g. SLA-mode
scale-down consolidation, also using the helper).

Fixes: test_decode_never_exceeds_input_when_prefill_binding
Signed-off-by: Kai Ma <kaim@nvidia.com>
…ebase

Main's #9246 (chore: suffix duration config fields with units) renamed
PlannerConfig.ttft -> ttft_ms and PlannerConfig.itl -> itl_ms. Construction
keeps working via pydantic's validation_alias=AliasChoices("ttft_ms", "ttft"),
so PlannerConfig(ttft=..., itl=...) keyword construction is still accepted,
but Python attribute access must use the new names.

PR 9369 was based on the pre-rename code and read the SLAs via .ttft / .itl
on PlannerConfig instances:

  - aic_power_optimizer.py: 2 reads (single-engine feasibility check) +
    2 reads (should_reoptimize SLA-violation check)
  - test_aic_real_data.py: 1 read (synthesizing 2x-SLA traffic)

All 4 + 1 sites updated. Surfaced as AttributeError when running the
26 alpha-class testbed scenarios (test_scenarios.py::test_alpha[*]).

Fixes: 26 testbed alpha-scenario failures, 1 test_self_consistency alpha test
Signed-off-by: Kai Ma <kaim@nvidia.com>
Rebase onto main pulled in PR #9246's PlannerConfig field renames
(	tft->	tft_ms, itl->itl_ms, 	hroughput_adjustment_interval->
	hroughput_adjustment_interval_seconds,
load_adjustment_interval->load_adjustment_interval_seconds).
Two of the four ms-suffix accesses were updated in the prior fixup,
but power_aware_replay_adapter.py still accessed the old
	hroughput_adjustment_interval attribute. Pydantic's AliasChoices
only covers constructor input, not attribute access — so all four
gamma-class testbed tests failed with::

    AttributeError: 'PlannerConfig' object has no attribute
    'throughput_adjustment_interval'. Did you mean:
    'throughput_adjustment_interval_seconds'?

This patch renames the attribute access at line 111 to the new
canonical field name. Constructor sites still work via AliasChoices.

Signed-off-by: Kai Ma <kaim@nvidia.com>
…uced suites

Add pytestmark with pre_merge/unit-or-integration/gpu_0/planner markers to seven testbed and live-integration test files introduced by this PR. The pytest-marker-report pre-commit hook (newly enforced on main) requires every test to declare a Lifecycle, Test Type, and Hardware category; without these markers it fails with 'Missing sets: 122' across all PR9369 tests. Marker assignments: testbed unit suites (test_fakes, test_overlay, test_scenarios_loadable) get unit+planner; testbed scenario drivers (test_scenarios, test_self_consistency) get integration+planner since they run full planner closed-loop simulations; live K8s tests (test_metric_paths_live, test_actuation_knobs_live) already had integration+gpu_0 but lacked a Lifecycle marker ? added pre_merge (they skip cleanly when no cluster) and planner.

Signed-off-by: Kai Ma <kaim@nvidia.com>
Adds the 2-line NVIDIA SPDX header to 49 files introduced by the power-aware
planner work (testbed scenario YAMLs, testbed Python suites, dev/inspect
scripts, and root-level pipeclean launcher) so they pass copyright-checks.

Extends the planner filter group in .github/filters.yaml to cover the new
power_agent component, planner dev manifests, dev scripts under scripts/dev
and scripts/inspect_*, and the tools/*power*.py helpers, so changed-files
no longer reports them as uncovered.

Signed-off-by: Kai Ma <kaim@nvidia.com>
Main PR #9558 (`feat: introduce RoutingConstraints + worker taints`)
added `from dynamo._core import RoutingConstraints` to
`dynamo.llm.__init__`. Without this symbol in the testbed runtime
stub, every planner unit test that transitively imports
`dynamo.common.utils.endpoint_types` (-> `dynamo.llm`) fails at
collection time on dev machines that don't have the maturin-built
native binding.

Add `RoutingConstraints` to the stubbed class list in
`tests/testbed/_runtime_stub.py`. Idempotent, doesn't affect CI or
the dev pod (real `dynamo._core` is loaded there).

Signed-off-by: Kai Ma <kaim@nvidia.com>
@kaim-eng kaim-eng force-pushed the kaim/power-planner branch from 1ac66af to cec5681 Compare May 18, 2026 13:53
kaim-eng added a commit that referenced this pull request May 18, 2026
Adds the Power Agent DaemonSet (privileged, hostPID) that watches pods
for the dynamo.nvidia.com/gpu-power-limit annotation, parses
/proc/{pid}/cgroup to recover pod UIDs (handles all K8s QoS x cgroup
driver x CRI runtime variants), and calls
nvmlDeviceSetPowerManagementLimit per physical GPU.

Includes NVML cap clamping to SKU-specific [min_w, max_w] bounds,
UUID-gated orphan-cap restoration on startup, SIGTERM-handler default-
TGP restore, and a multi-pod-per-GPU policy that falls back to
power_agent_safe_default_watts on conflict.

43 unit tests covering apply_cap, cgroup_parser, multi_pod_policy,
shutdown. Standalone component - no coupling to planner internals.

CI filter (.github/filters.yaml) extends the planner group to cover the
new component + deploy paths so changes here trip the planner job set.

Part of the PR #9369 split (PR 1a of 6). See docs/design-docs/pr9369-split-plan.md.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 18, 2026
…rometheus

Adds the planner-side mechanism for power-aware operation:
- 5 PlannerConfig fields (enable_power_awareness defaults False) with
  required-when-enabled validator for total_gpu_power_limit and
  power_agent_safe_default_watts
- _apply_power_annotations() reconciliation loop on every tick,
  gated on the feature flag, with POWER_ANNOTATION_KEY constant
- _publish_power_budget_metrics() dashboard gauges (static config,
  not DCGM, so it remains valid when attribution drops)
- CoreV1Api + patch_pod_annotation + list_pods_by_label on KubernetesAPI
- get_component_pods + list_frontend_pods + resolve_frontend_http_port
  on KubernetesConnector
- get_total_dgd_power DCGM Prometheus query + NaN-filter fix on the
  frontend-source path
- 3 planner_metrics gauges: power_budget_total_watts,
  power_projected_watts, power_budget_utilization
- Operator Helm chart bump 1.2.0 -> 1.2.1 with planner pod-RBAC patch
  (pods: get/list/watch/patch) plus deploy/planner-pod-rbac-dev.yaml
  for non-operator development clusters
- repo-shared components/src/conftest.py (planner pkg conftest)
- 623-line unit suite (test_actuation_knobs.py) + 567-line live
  integration suite (test_actuation_knobs_live.py) covering every
  actuation knob this PR adds
- Unit test additions on test_kubernetes_connector.py (~119 lines)
  and test_prometheus.py (NaN tests + TestPowerAwareDcgmQueries
  get_total_dgd_power subset, ~100 lines)
- CI filter (.github/filters.yaml) extends the planner group with
  components/src/conftest.py

Power awareness is off by default. Budget enforcement
(_apply_power_budget) and the AIC closed loop arrive in PR 2 and PR 3.

Part of the PR #9369 split (PR 1b of 6). Predecessor PR 1a (power-agent
component) must merge first or land in parallel. See
docs/design-docs/pr9369-split-plan.md.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 18, 2026
Adds _apply_power_budget() to PlannerStateMachine, wired into both
load-scaling and throughput-scaling paths (one-line call site each).
Budget enforcement runs only when enable_power_awareness=True and
total_gpu_power_limit is set, so it is a no-op on existing deployments.

Includes a no-upscale-decode invariant in _apply_global_budget: when
proportional_clamp_pair (from main's budget helper, see PR #9296) would
allocate decode beyond original demand because prefill is the binding
constraint, decode is clamped back to min(num_d, min_endpoint). This
preserves the original behavioural contract from the pre-rebase inline
formula while delegating to main's shared clamp.

68-line TestApplyGlobalBudgetNoUpscale suite (5 tests) pins the
no-upscale invariant.

Part of the PR #9369 split (PR 2 of 6). See docs/design-docs/pr9369-split-plan.md.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 18, 2026
Adds the correction-coefficient EMA engine that observes live
TTFT/ITL/power, detects drift against the planner's expected operating
point, and re-invokes the AIC sweep to recompute (cap_p, cap_d, n_p,
n_d) plus implied frontend admission thresholds.

Components:
- AICPowerOptimizer (monitoring/aic_power_optimizer.py): correction
  coefficients (c_ttft, c_itl, c_power_p, c_power_d, c_power_agg) with
  EMA smoothing, drift detection with hysteresis, optimize() sweep
  driver, defensive power_w clamp at nameplate TDP (guards against
  AIC's non-physical interpolator extrapolations at sparse data-grid
  corners), and the aic_to_planner_cap bridge.
- core/base.py Phase-3 additions: _aic_optimizer field + TYPE_CHECKING
  import; AIC startup sweep in _async_init(); _apply_aic_config()
  (caps + replica targets + drift-ref pin + admission gauges + POST
  fanout); _fanout_busy_threshold() with B6 absolute-prefill defense;
  _min_prefill_max_num_batched_tokens() helper; AIC tick-loop block
  in run() with disagg vs agg branches and conditional re-sweep;
  traffic observation enrichment with AIC-driven fields.
- 5 new TrafficObservation fields (core/types.py): ttft_avg, itl_avg,
  total_tokens_per_s, scheduled_prefill_tokens, scheduled_decode_kv_tokens.
- 13 AIC config fields + 3 admission config fields on PlannerConfig
  with validator (enable_aic_optimizer requires enable_power_awareness
  AND aic_interpolation).
- get_avg_per_gpu_power_by_component DCGM query on PrometheusAPIClient
  and supporting documentation on DirectRouterMetricsClient (callable
  surface unchanged).
- 9 AIC gauges/counters + 5 admission metrics on PlannerPrometheusMetrics.

Test suites (~2,000 lines new tests):
- 967-line test_aic_power_optimizer.py: 34 unit-grade tests covering
  EMA mechanics, drift detection, sweep failure modes, and the
  power_w clamp.
- 712-line test_aic_power_e2e_sim.py: 15 closed-loop integration tests
  with the fake-AIC + fake-Prometheus harness.
- 558-line test_metric_paths_live.py: live-cluster validation of
  DCGM per-component queries.
- 505 additional lines on test_prometheus.py: TestMetricsIsValid,
  TestDirectRouterMetricsClient*, TestGetAvgRequestDurationRouterSource,
  TestGetAvgKvHitRateEdgeCases, plus the get_avg_per_gpu_power_by_component
  half of TestPowerAwareDcgmQueries.
- profiler/utils/aic_dataframe.py: 14-line defensive fix for AIC
  table lookups.

The PR 1b hand-authored intermediate of the 6 shared files plus this PR's
verbatim checkout reproduces kaim/power-planner byte-for-byte (validated
via git diff).

Part of the PR #9369 split (PR 3 of 6). See docs/design-docs/pr9369-split-plan.md.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 18, 2026
Adds the synthetic-metrics testbed: deterministic, no GPU, no cluster.

Alpha-class (27 scenarios, A1-F26): synthetic fleet + fake metrics
+ fake actuator drive the planner through fault-injection scenarios
covering AIC drift, NVML clamps, K8s RBAC denials, node loss / recovery,
Prometheus outages, MDC gaps, budget shrinkage, AIC infeasibility,
and drift-threshold boundary cases.

Gamma-class (3 scenarios, G1-G3): mocker-driven trace replay with
synthetic-power overlay (replay/synthetic_power_overlay.py) and a
power-aware replay adapter (replay/power_aware_replay_adapter.py)
exercise the closed loop against real Mooncake traces.

Infrastructure:
- runner.py + scenarios.py + assertions.py + recorder.py + clock.py
  (run-loop, scenario loader, invariant checks, recording).
- synthetic_fleet.py + fake_actuator.py + fake_aic.py +
  fake_planner_metrics.py + fake_prometheus.py (test doubles for
  every external dependency of the planner run loop).
- _runtime_stub.py installs a stub dynamo._core when the compiled
  Rust binding is absent, so the testbed runs on developer laptops
  without a CUDA toolchain (carries every dynamo._core symbol used
  by dynamo.llm at module-load time, including the post-rebase
  RoutingConstraints addition from main PR #9558).
- grafana/testbed_dashboard.json + systems/ (h100_pcie / h100_sxm /
  h200_sxm SKUs) + traces/placeholder_h200_disagg_1rps.jsonl provide
  a complete observability + replay stack.

Production code:
- offline/replay_adapter.py: 9-line timing fix
  (now_s = max(tick.at_s, bridge_now_s)) prevents stale-tick loops
  on sparse traces.

86 testbed tests + 30 scenarios (27 alpha + 3 gamma) ship green at
this tip (1 skipped pending env-var; test_aic_real_data.py is
module-skipped unless AIC_SANDBOX_DIR is set).

Part of the PR #9369 split (PR 4 of 6). See docs/design-docs/pr9369-split-plan.md.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 18, 2026
Adds documentation, dev environment, example DGD configs, operator
scripts, and standalone analysis tools.

Design documentation (~5,000 lines):
- docs/design-docs/powerplanner-design.md (2,111 lines): full design
  with Phase 1-5 layered roadmap, 5.3 correction-coefficient mechanics,
  5.6 drift detection + hysteresis, 5.7 admission control, 6.5
  Power Agent fail-closed cold-start cap, section 8 failure-mode
  catalog (14 modes).
- docs/design-docs/powerplanner-testbed-design.md (2,187 lines):
  testbed design with alpha and gamma scenario classes, fake-driver
  contracts, replay-overlay timing semantics.
- docs/components/planner/dpp-dev-env.md (675 lines): dev-pod
  bring-up + smoke-test recipes against viking-prod and umb-b200
  cluster hardware.

Dev environment + example DGDs:
- deploy/planner/dev/Dockerfile.planner-dev + planner-dev-pod.yaml +
  qwen3-quickstart-dgd.yaml: bring-your-own-source planner dev pod for
  live-cluster iteration without rebuilds.
- examples/deployments/powerplanner/: README + PIPECLEAN + MULTI_DGD
  guides + disagg-power-aware.yaml + disagg-conservative-cold-start.yaml
  + verify_poweraware.bash (operator smoke-test).

Inspection + analysis tooling:
- scripts/dev/test_k8s_access.py and scripts/inspect_*.py: dev-time
  helpers for DCGM attribution, frontend port resolution, planner DCGM
  query methods, worker /metrics endpoints, live planner metrics.
- tools/compute_power_budget.py + integrate_aic_power_data.py +
  validate_aic_power_integration.py: standalone analysis utilities
  for sizing the budget, ingesting AIC offline data, and validating
  the integration end-to-end.
- test_planner_power_launch.py: root-level pipeclean launcher.

CI filter (.github/filters.yaml) accumulates the remaining planner-side
paths so changes here trigger the planner job set (cumulative with PR 1a
and PR 1b additions; the final file matches kaim/power-planner
byte-for-byte).

No production code changes; this PR is purely additive documentation,
deployment manifests, and operator tooling. All planner/power_agent
tests at PR 4 tip remain green (no test additions in this PR).

Part of the PR #9369 split (PR 5 of 6, final). See
docs/design-docs/pr9369-split-plan.md.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 19, 2026
Adds the Power Agent DaemonSet (privileged, hostPID) that watches pods
for the dynamo.nvidia.com/gpu-power-limit annotation, parses
/proc/{pid}/cgroup to recover pod UIDs (handles all K8s QoS x cgroup
driver x CRI runtime variants), and calls
nvmlDeviceSetPowerManagementLimit per physical GPU.

Includes NVML cap clamping to SKU-specific [min_w, max_w] bounds,
UUID-gated orphan-cap restoration on startup, SIGTERM-handler default-
TGP restore, and a multi-pod-per-GPU policy that falls back to
power_agent_safe_default_watts on conflict.

43 unit tests covering apply_cap, cgroup_parser, multi_pod_policy,
shutdown. Standalone component - no coupling to planner internals.

CI filter (.github/filters.yaml) extends the planner group to cover the
new component + deploy paths so changes here trip the planner job set.

Part of the PR #9369 split (PR 1a of 6). See docs/design-docs/pr9369-split-plan.md.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 19, 2026
…rometheus

Adds the planner-side mechanism for power-aware operation:
- 5 PlannerConfig fields (enable_power_awareness defaults False) with
  required-when-enabled validator for total_gpu_power_limit and
  power_agent_safe_default_watts
- _apply_power_annotations() reconciliation loop on every tick,
  gated on the feature flag, with POWER_ANNOTATION_KEY constant
- _publish_power_budget_metrics() dashboard gauges (static config,
  not DCGM, so it remains valid when attribution drops)
- CoreV1Api + patch_pod_annotation + list_pods_by_label on KubernetesAPI
- get_component_pods + list_frontend_pods + resolve_frontend_http_port
  on KubernetesConnector
- get_total_dgd_power DCGM Prometheus query + NaN-filter fix on the
  frontend-source path
- 3 planner_metrics gauges: power_budget_total_watts,
  power_projected_watts, power_budget_utilization
- Operator Helm chart bump 1.2.0 -> 1.2.1 with planner pod-RBAC patch
  (pods: get/list/watch/patch) plus deploy/planner-pod-rbac-dev.yaml
  for non-operator development clusters
- repo-shared components/src/conftest.py (planner pkg conftest)
- 623-line unit suite (test_actuation_knobs.py) + 567-line live
  integration suite (test_actuation_knobs_live.py) covering every
  actuation knob this PR adds
- Unit test additions on test_kubernetes_connector.py (~119 lines)
  and test_prometheus.py (NaN tests + TestPowerAwareDcgmQueries
  get_total_dgd_power subset, ~100 lines)
- CI filter (.github/filters.yaml) extends the planner group with
  components/src/conftest.py

Power awareness is off by default. Budget enforcement
(_apply_power_budget) and the AIC closed loop arrive in PR 2 and PR 3.

Part of the PR #9369 split (PR 1b of 6). Predecessor PR 1a (power-agent
component) must merge first or land in parallel. See
docs/design-docs/pr9369-split-plan.md.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 19, 2026
Adds documentation, dev environment, example DGD configs, operator
scripts, and standalone analysis tools.

Design documentation (~5,000 lines):
- docs/design-docs/powerplanner-design.md (2,111 lines): full design
  with Phase 1-5 layered roadmap, 5.3 correction-coefficient mechanics,
  5.6 drift detection + hysteresis, 5.7 admission control, 6.5
  Power Agent fail-closed cold-start cap, section 8 failure-mode
  catalog (14 modes).
- docs/design-docs/powerplanner-testbed-design.md (2,187 lines):
  testbed design with alpha and gamma scenario classes, fake-driver
  contracts, replay-overlay timing semantics.
- docs/components/planner/dpp-dev-env.md (675 lines): dev-pod
  bring-up + smoke-test recipes against viking-prod and umb-b200
  cluster hardware.

Dev environment + example DGDs:
- deploy/planner/dev/Dockerfile.planner-dev + planner-dev-pod.yaml +
  qwen3-quickstart-dgd.yaml: bring-your-own-source planner dev pod for
  live-cluster iteration without rebuilds.
- examples/deployments/powerplanner/: README + PIPECLEAN + MULTI_DGD
  guides + disagg-power-aware.yaml + disagg-conservative-cold-start.yaml
  + verify_poweraware.bash (operator smoke-test).

Inspection + analysis tooling:
- scripts/dev/test_k8s_access.py and scripts/inspect_*.py: dev-time
  helpers for DCGM attribution, frontend port resolution, planner DCGM
  query methods, worker /metrics endpoints, live planner metrics.
- tools/compute_power_budget.py + integrate_aic_power_data.py +
  validate_aic_power_integration.py: standalone analysis utilities
  for sizing the budget, ingesting AIC offline data, and validating
  the integration end-to-end.
- test_planner_power_launch.py: root-level pipeclean launcher.

CI filter (.github/filters.yaml) accumulates the remaining planner-side
paths so changes here trigger the planner job set (cumulative with PR 1a
and PR 1b additions; the final file matches kaim/power-planner
byte-for-byte).

No production code changes; this PR is purely additive documentation,
deployment manifests, and operator tooling. All planner/power_agent
tests at PR 4 tip remain green (no test additions in this PR).

Part of the PR #9369 split (PR 5 of 6, final). See
docs/design-docs/pr9369-split-plan.md.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 19, 2026
…rometheus

Adds the planner-side mechanism for power-aware operation:
- 5 PlannerConfig fields (enable_power_awareness defaults False) with
  required-when-enabled validator for total_gpu_power_limit and
  power_agent_safe_default_watts
- _apply_power_annotations() reconciliation loop on every tick,
  gated on the feature flag, with POWER_ANNOTATION_KEY constant
- _publish_power_budget_metrics() dashboard gauges (static config,
  not DCGM, so it remains valid when attribution drops)
- CoreV1Api + patch_pod_annotation + list_pods_by_label on KubernetesAPI
- get_component_pods + list_frontend_pods + resolve_frontend_http_port
  on KubernetesConnector
- get_total_dgd_power DCGM Prometheus query + NaN-filter fix on the
  frontend-source path
- 3 planner_metrics gauges: power_budget_total_watts,
  power_projected_watts, power_budget_utilization
- Operator Helm chart bump 1.2.0 -> 1.2.1 with planner pod-RBAC patch
  (pods: get/list/watch/patch) plus deploy/planner-pod-rbac-dev.yaml
  for non-operator development clusters
- repo-shared components/src/conftest.py (planner pkg conftest)
- 623-line unit suite (test_actuation_knobs.py) + 567-line live
  integration suite (test_actuation_knobs_live.py) covering every
  actuation knob this PR adds
- Unit test additions on test_kubernetes_connector.py (~119 lines)
  and test_prometheus.py (NaN tests + TestPowerAwareDcgmQueries
  get_total_dgd_power subset, ~100 lines)
- CI filter (.github/filters.yaml) extends the planner group with
  components/src/conftest.py

Power awareness is off by default. Budget enforcement
(_apply_power_budget) and the AIC closed loop arrive in PR 2 and PR 3.

Part of the PR #9369 split (PR 1b of 6). Predecessor PR 1a (power-agent
component) must merge first or land in parallel. See
docs/design-docs/pr9369-split-plan.md.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 19, 2026
Adds _apply_power_budget() to PlannerStateMachine, wired into both
load-scaling and throughput-scaling paths (one-line call site each).
Budget enforcement runs only when enable_power_awareness=True and
total_gpu_power_limit is set, so it is a no-op on existing deployments.

Includes a no-upscale-decode invariant in _apply_global_budget: when
proportional_clamp_pair (from main's budget helper, see PR #9296) would
allocate decode beyond original demand because prefill is the binding
constraint, decode is clamped back to min(num_d, min_endpoint). This
preserves the original behavioural contract from the pre-rebase inline
formula while delegating to main's shared clamp.

68-line TestApplyGlobalBudgetNoUpscale suite (5 tests) pins the
no-upscale invariant.

Part of the PR #9369 split (PR 2 of 6). See docs/design-docs/pr9369-split-plan.md.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 19, 2026
Adds the correction-coefficient EMA engine that observes live
TTFT/ITL/power, detects drift against the planner's expected operating
point, and re-invokes the AIC sweep to recompute (cap_p, cap_d, n_p,
n_d) plus implied frontend admission thresholds.

Components:
- AICPowerOptimizer (monitoring/aic_power_optimizer.py): correction
  coefficients (c_ttft, c_itl, c_power_p, c_power_d, c_power_agg) with
  EMA smoothing, drift detection with hysteresis, optimize() sweep
  driver, defensive power_w clamp at nameplate TDP (guards against
  AIC's non-physical interpolator extrapolations at sparse data-grid
  corners), and the aic_to_planner_cap bridge.
- core/base.py Phase-3 additions: _aic_optimizer field + TYPE_CHECKING
  import; AIC startup sweep in _async_init(); _apply_aic_config()
  (caps + replica targets + drift-ref pin + admission gauges + POST
  fanout); _fanout_busy_threshold() with B6 absolute-prefill defense;
  _min_prefill_max_num_batched_tokens() helper; AIC tick-loop block
  in run() with disagg vs agg branches and conditional re-sweep;
  traffic observation enrichment with AIC-driven fields.
- 5 new TrafficObservation fields (core/types.py): ttft_avg, itl_avg,
  total_tokens_per_s, scheduled_prefill_tokens, scheduled_decode_kv_tokens.
- 13 AIC config fields + 3 admission config fields on PlannerConfig
  with validator (enable_aic_optimizer requires enable_power_awareness
  AND aic_interpolation).
- get_avg_per_gpu_power_by_component DCGM query on PrometheusAPIClient
  and supporting documentation on DirectRouterMetricsClient (callable
  surface unchanged).
- 9 AIC gauges/counters + 5 admission metrics on PlannerPrometheusMetrics.

Test suites (~2,000 lines new tests):
- 967-line test_aic_power_optimizer.py: 34 unit-grade tests covering
  EMA mechanics, drift detection, sweep failure modes, and the
  power_w clamp.
- 712-line test_aic_power_e2e_sim.py: 15 closed-loop integration tests
  with the fake-AIC + fake-Prometheus harness.
- 558-line test_metric_paths_live.py: live-cluster validation of
  DCGM per-component queries.
- 505 additional lines on test_prometheus.py: TestMetricsIsValid,
  TestDirectRouterMetricsClient*, TestGetAvgRequestDurationRouterSource,
  TestGetAvgKvHitRateEdgeCases, plus the get_avg_per_gpu_power_by_component
  half of TestPowerAwareDcgmQueries.
- profiler/utils/aic_dataframe.py: 14-line defensive fix for AIC
  table lookups.

The PR 1b hand-authored intermediate of the 6 shared files plus this PR's
verbatim checkout reproduces kaim/power-planner byte-for-byte (validated
via git diff).

Part of the PR #9369 split (PR 3 of 6). See docs/design-docs/pr9369-split-plan.md.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 19, 2026
Adds the synthetic-metrics testbed: deterministic, no GPU, no cluster.

Alpha-class (27 scenarios, A1-F26): synthetic fleet + fake metrics
+ fake actuator drive the planner through fault-injection scenarios
covering AIC drift, NVML clamps, K8s RBAC denials, node loss / recovery,
Prometheus outages, MDC gaps, budget shrinkage, AIC infeasibility,
and drift-threshold boundary cases.

Gamma-class (3 scenarios, G1-G3): mocker-driven trace replay with
synthetic-power overlay (replay/synthetic_power_overlay.py) and a
power-aware replay adapter (replay/power_aware_replay_adapter.py)
exercise the closed loop against real Mooncake traces.

Infrastructure:
- runner.py + scenarios.py + assertions.py + recorder.py + clock.py
  (run-loop, scenario loader, invariant checks, recording).
- synthetic_fleet.py + fake_actuator.py + fake_aic.py +
  fake_planner_metrics.py + fake_prometheus.py (test doubles for
  every external dependency of the planner run loop).
- _runtime_stub.py installs a stub dynamo._core when the compiled
  Rust binding is absent, so the testbed runs on developer laptops
  without a CUDA toolchain (carries every dynamo._core symbol used
  by dynamo.llm at module-load time, including the post-rebase
  RoutingConstraints addition from main PR #9558).
- grafana/testbed_dashboard.json + systems/ (h100_pcie / h100_sxm /
  h200_sxm SKUs) + traces/placeholder_h200_disagg_1rps.jsonl provide
  a complete observability + replay stack.

Production code:
- offline/replay_adapter.py: 9-line timing fix
  (now_s = max(tick.at_s, bridge_now_s)) prevents stale-tick loops
  on sparse traces.

86 testbed tests + 30 scenarios (27 alpha + 3 gamma) ship green at
this tip (1 skipped pending env-var; test_aic_real_data.py is
module-skipped unless AIC_SANDBOX_DIR is set).

Part of the PR #9369 split (PR 4 of 6). See docs/design-docs/pr9369-split-plan.md.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 19, 2026
Adds documentation, dev environment, example DGD configs, operator
scripts, and standalone analysis tools.

Design documentation (~5,000 lines):
- docs/design-docs/powerplanner-design.md (2,111 lines): full design
  with Phase 1-5 layered roadmap, 5.3 correction-coefficient mechanics,
  5.6 drift detection + hysteresis, 5.7 admission control, 6.5
  Power Agent fail-closed cold-start cap, section 8 failure-mode
  catalog (14 modes).
- docs/design-docs/powerplanner-testbed-design.md (2,187 lines):
  testbed design with alpha and gamma scenario classes, fake-driver
  contracts, replay-overlay timing semantics.
- docs/components/planner/dpp-dev-env.md (675 lines): dev-pod
  bring-up + smoke-test recipes against viking-prod and umb-b200
  cluster hardware.

Dev environment + example DGDs:
- deploy/planner/dev/Dockerfile.planner-dev + planner-dev-pod.yaml +
  qwen3-quickstart-dgd.yaml: bring-your-own-source planner dev pod for
  live-cluster iteration without rebuilds.
- examples/deployments/powerplanner/: README + PIPECLEAN + MULTI_DGD
  guides + disagg-power-aware.yaml + disagg-conservative-cold-start.yaml
  + verify_poweraware.bash (operator smoke-test).

Inspection + analysis tooling:
- scripts/dev/test_k8s_access.py and scripts/inspect_*.py: dev-time
  helpers for DCGM attribution, frontend port resolution, planner DCGM
  query methods, worker /metrics endpoints, live planner metrics.
- tools/compute_power_budget.py + integrate_aic_power_data.py +
  validate_aic_power_integration.py: standalone analysis utilities
  for sizing the budget, ingesting AIC offline data, and validating
  the integration end-to-end.
- test_planner_power_launch.py: root-level pipeclean launcher.

CI filter (.github/filters.yaml) accumulates the remaining planner-side
paths so changes here trigger the planner job set (cumulative with PR 1a
and PR 1b additions; the final file matches kaim/power-planner
byte-for-byte).

No production code changes; this PR is purely additive documentation,
deployment manifests, and operator tooling. All planner/power_agent
tests at PR 4 tip remain green (no test additions in this PR).

Part of the PR #9369 split (PR 5 of 6, final). See
docs/design-docs/pr9369-split-plan.md.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 19, 2026
Updates the three reference sites in examples/deployments/powerplanner/
that previously instructed users to ``kubectl apply -f deploy/power_agent/...``,
flipping them to the new Helm chart at deploy/helm/charts/power-agent/
that landed in PR #9682:

  * disagg-power-aware.yaml header recipe
  * README.md Prerequisites + verify section
  * MULTI_DGD.md file index

Also updates the verify-pods label selector from the legacy
``app=dynamo-power-agent`` to the chart-emitted
``app.kubernetes.io/name=power-agent``.

Design docs (powerplanner-design.md, power-agent-dcgm-actuator.md,
pr9369-split-plan.md, power-agent-helm-chart-plan.md) still reference
the old paths in their historical / architectural narrative sections,
which is intentional -- those describe the pre-chart state and the
transition rationale, not current deployment instructions.

Part of the PR #9369 cascade following PR #9682''s Helm chart landing.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 19, 2026
The two ``[Appendix C.10]``/``[Appendix D.7]`` links in the testbed
README pointed at ``docs/design-docs/powerplanner-testbed-design.md``,
which is introduced by PR #9687 and does not yet exist on this branch.
The Docs link check (lychee) has therefore failed on PR #9686 since the
PR was opened on 2026-05-18 -- a cross-PR forward reference baked into
the original PR #9369 split.

Convert both occurrences from ``[text](relative-path)`` syntax to plain
backticked text references. The information value (appendix numbers +
target file) is preserved; lychee no longer treats them as candidate
links to resolve. Once both PRs land on ``main`` the file resolves
naturally and reviewers can grep for the path.

No code or test change. Cascade-affected branches: this commit lives on
pr4/testbed; pr5/docs-devenv will rebase onto the new tip.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 19, 2026
Adds documentation, dev environment, example DGD configs, operator
scripts, and standalone analysis tools.

Design documentation (~5,000 lines):
- docs/design-docs/powerplanner-design.md (2,111 lines): full design
  with Phase 1-5 layered roadmap, 5.3 correction-coefficient mechanics,
  5.6 drift detection + hysteresis, 5.7 admission control, 6.5
  Power Agent fail-closed cold-start cap, section 8 failure-mode
  catalog (14 modes).
- docs/design-docs/powerplanner-testbed-design.md (2,187 lines):
  testbed design with alpha and gamma scenario classes, fake-driver
  contracts, replay-overlay timing semantics.
- docs/components/planner/dpp-dev-env.md (675 lines): dev-pod
  bring-up + smoke-test recipes against viking-prod and umb-b200
  cluster hardware.

Dev environment + example DGDs:
- deploy/planner/dev/Dockerfile.planner-dev + planner-dev-pod.yaml +
  qwen3-quickstart-dgd.yaml: bring-your-own-source planner dev pod for
  live-cluster iteration without rebuilds.
- examples/deployments/powerplanner/: README + PIPECLEAN + MULTI_DGD
  guides + disagg-power-aware.yaml + disagg-conservative-cold-start.yaml
  + verify_poweraware.bash (operator smoke-test).

Inspection + analysis tooling:
- scripts/dev/test_k8s_access.py and scripts/inspect_*.py: dev-time
  helpers for DCGM attribution, frontend port resolution, planner DCGM
  query methods, worker /metrics endpoints, live planner metrics.
- tools/compute_power_budget.py + integrate_aic_power_data.py +
  validate_aic_power_integration.py: standalone analysis utilities
  for sizing the budget, ingesting AIC offline data, and validating
  the integration end-to-end.
- test_planner_power_launch.py: root-level pipeclean launcher.

CI filter (.github/filters.yaml) accumulates the remaining planner-side
paths so changes here trigger the planner job set (cumulative with PR 1a
and PR 1b additions; the final file matches kaim/power-planner
byte-for-byte).

No production code changes; this PR is purely additive documentation,
deployment manifests, and operator tooling. All planner/power_agent
tests at PR 4 tip remain green (no test additions in this PR).

Part of the PR #9369 split (PR 5 of 6, final). See
docs/design-docs/pr9369-split-plan.md.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 19, 2026
Updates the three reference sites in examples/deployments/powerplanner/
that previously instructed users to ``kubectl apply -f deploy/power_agent/...``,
flipping them to the new Helm chart at deploy/helm/charts/power-agent/
that landed in PR #9682:

  * disagg-power-aware.yaml header recipe
  * README.md Prerequisites + verify section
  * MULTI_DGD.md file index

Also updates the verify-pods label selector from the legacy
``app=dynamo-power-agent`` to the chart-emitted
``app.kubernetes.io/name=power-agent``.

Design docs (powerplanner-design.md, power-agent-dcgm-actuator.md,
pr9369-split-plan.md, power-agent-helm-chart-plan.md) still reference
the old paths in their historical / architectural narrative sections,
which is intentional -- those describe the pre-chart state and the
transition rationale, not current deployment instructions.

Part of the PR #9369 cascade following PR #9682''s Helm chart landing.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 25, 2026
Adds the Power Agent DaemonSet (privileged, hostPID) that watches pods
for the dynamo.nvidia.com/gpu-power-limit annotation, parses
/proc/{pid}/cgroup to recover pod UIDs (handles all K8s QoS x cgroup
driver x CRI runtime variants), and calls
nvmlDeviceSetPowerManagementLimit per physical GPU.

Includes NVML cap clamping to SKU-specific [min_w, max_w] bounds,
UUID-gated orphan-cap restoration on startup, SIGTERM-handler default-
TGP restore, and a multi-pod-per-GPU policy that falls back to
power_agent_safe_default_watts on conflict.

43 unit tests covering apply_cap, cgroup_parser, multi_pod_policy,
shutdown. Standalone component - no coupling to planner internals.

CI filter (.github/filters.yaml) extends the planner group to cover the
new component + deploy paths so changes here trip the planner job set.

Part of the PR #9369 split (PR 1a of 6). See docs/design-docs/pr9369-split-plan.md.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 25, 2026
…rometheus

Adds the planner-side mechanism for power-aware operation:
- 5 PlannerConfig fields (enable_power_awareness defaults False) with
  required-when-enabled validator for total_gpu_power_limit and
  power_agent_safe_default_watts
- _apply_power_annotations() reconciliation loop on every tick,
  gated on the feature flag, with POWER_ANNOTATION_KEY constant
- _publish_power_budget_metrics() dashboard gauges (static config,
  not DCGM, so it remains valid when attribution drops)
- CoreV1Api + patch_pod_annotation + list_pods_by_label on KubernetesAPI
- get_component_pods + list_frontend_pods + resolve_frontend_http_port
  on KubernetesConnector
- get_total_dgd_power DCGM Prometheus query + NaN-filter fix on the
  frontend-source path
- 3 planner_metrics gauges: power_budget_total_watts,
  power_projected_watts, power_budget_utilization
- Operator Helm chart bump 1.2.0 -> 1.2.1 with planner pod-RBAC patch
  (pods: get/list/watch/patch) plus deploy/planner-pod-rbac-dev.yaml
  for non-operator development clusters
- repo-shared components/src/conftest.py (planner pkg conftest)
- 623-line unit suite (test_actuation_knobs.py) + 567-line live
  integration suite (test_actuation_knobs_live.py) covering every
  actuation knob this PR adds
- Unit test additions on test_kubernetes_connector.py (~119 lines)
  and test_prometheus.py (NaN tests + TestPowerAwareDcgmQueries
  get_total_dgd_power subset, ~100 lines)
- CI filter (.github/filters.yaml) extends the planner group with
  components/src/conftest.py

Power awareness is off by default. Budget enforcement
(_apply_power_budget) and the AIC closed loop arrive in PR 2 and PR 3.

Part of the PR #9369 split (PR 1b of 6). Predecessor PR 1a (power-agent
component) must merge first or land in parallel. See
docs/design-docs/pr9369-split-plan.md.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request Jun 3, 2026
Adds the Power Agent DaemonSet (privileged, hostPID) that watches pods
for the dynamo.nvidia.com/gpu-power-limit annotation, parses
/proc/{pid}/cgroup to recover pod UIDs (handles all K8s QoS x cgroup
driver x CRI runtime variants), and calls
nvmlDeviceSetPowerManagementLimit per physical GPU.

Includes NVML cap clamping to SKU-specific [min_w, max_w] bounds,
UUID-gated orphan-cap restoration on startup, SIGTERM-handler default-
TGP restore, and a multi-pod-per-GPU policy that falls back to
power_agent_safe_default_watts on conflict.

43 unit tests covering apply_cap, cgroup_parser, multi_pod_policy,
shutdown. Standalone component - no coupling to planner internals.

CI filter (.github/filters.yaml) extends the planner group to cover the
new component + deploy paths so changes here trip the planner job set.

Part of the PR #9369 split (PR 1a of 6). See docs/design-docs/pr9369-split-plan.md.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request Jun 8, 2026
…rometheus

Adds the planner-side mechanism for power-aware operation:
- 5 PlannerConfig fields (enable_power_awareness defaults False) with
  required-when-enabled validator for total_gpu_power_limit and
  power_agent_safe_default_watts
- _apply_power_annotations() reconciliation loop on every tick,
  gated on the feature flag, with POWER_ANNOTATION_KEY constant
- _publish_power_budget_metrics() dashboard gauges (static config,
  not DCGM, so it remains valid when attribution drops)
- CoreV1Api + patch_pod_annotation + list_pods_by_label on KubernetesAPI
- get_component_pods + list_frontend_pods + resolve_frontend_http_port
  on KubernetesConnector
- get_total_dgd_power DCGM Prometheus query + NaN-filter fix on the
  frontend-source path
- 3 planner_metrics gauges: power_budget_total_watts,
  power_projected_watts, power_budget_utilization
- Operator Helm chart bump 1.2.0 -> 1.2.1 with planner pod-RBAC patch
  (pods: get/list/watch/patch) plus deploy/planner-pod-rbac-dev.yaml
  for non-operator development clusters
- repo-shared components/src/conftest.py (planner pkg conftest)
- 623-line unit suite (test_actuation_knobs.py) + 567-line live
  integration suite (test_actuation_knobs_live.py) covering every
  actuation knob this PR adds
- Unit test additions on test_kubernetes_connector.py (~119 lines)
  and test_prometheus.py (NaN tests + TestPowerAwareDcgmQueries
  get_total_dgd_power subset, ~100 lines)
- CI filter (.github/filters.yaml) extends the planner group with
  components/src/conftest.py

Power awareness is off by default. Budget enforcement
(_apply_power_budget) and the AIC closed loop arrive in PR 2 and PR 3.

Part of the PR #9369 split (PR 1b of 6). Predecessor PR 1a (power-agent
component) must merge first or land in parallel. See
docs/design-docs/pr9369-split-plan.md.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request Jun 8, 2026
…rometheus

Adds the planner-side mechanism for power-aware operation:
- 5 PlannerConfig fields (enable_power_awareness defaults False) with
  required-when-enabled validator for total_gpu_power_limit and
  power_agent_safe_default_watts
- _apply_power_annotations() reconciliation loop on every tick,
  gated on the feature flag, with POWER_ANNOTATION_KEY constant
- _publish_power_budget_metrics() dashboard gauges (static config,
  not DCGM, so it remains valid when attribution drops)
- CoreV1Api + patch_pod_annotation + list_pods_by_label on KubernetesAPI
- get_component_pods + list_frontend_pods + resolve_frontend_http_port
  on KubernetesConnector
- get_total_dgd_power DCGM Prometheus query + NaN-filter fix on the
  frontend-source path
- 3 planner_metrics gauges: power_budget_total_watts,
  power_projected_watts, power_budget_utilization
- Operator Helm chart bump 1.2.0 -> 1.2.1 with planner pod-RBAC patch
  (pods: get/list/watch/patch) plus deploy/planner-pod-rbac-dev.yaml
  for non-operator development clusters
- repo-shared components/src/conftest.py (planner pkg conftest)
- 623-line unit suite (test_actuation_knobs.py) + 567-line live
  integration suite (test_actuation_knobs_live.py) covering every
  actuation knob this PR adds
- Unit test additions on test_kubernetes_connector.py (~119 lines)
  and test_prometheus.py (NaN tests + TestPowerAwareDcgmQueries
  get_total_dgd_power subset, ~100 lines)
- CI filter (.github/filters.yaml) extends the planner group with
  components/src/conftest.py

Power awareness is off by default. Budget enforcement
(_apply_power_budget) and the AIC closed loop arrive in PR 2 and PR 3.

Part of the PR #9369 split (PR 1b of 6). Predecessor PR 1a (power-agent
component) must merge first or land in parallel. See
docs/design-docs/pr9369-split-plan.md.

Signed-off-by: Kai Ma <kaim@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

actions deployment::k8s Relates to dynamo deployment in kubernetes documentation Improvements or additions to documentation feat planner size/XXL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants