feat(planner): power-aware planner — Phase 1+2+3 + Phase 4 pipeclean by kaim-eng · Pull Request #9369 · ai-dynamo/dynamo

kaim-eng · 2026-05-10T13:45:51Z

Adds power-aware intelligence to the Dynamo planner across three layers, plus tooling that pipecleans the Phase 4 selection path with currently available single-TDP AIC data. Reimplementation of unmerged PR #5280 against post-refactor ToT (planner/utils/ no longer exists; configuration is Pydantic; layout is config/, connectors/, core/, monitoring/).

Full design: docs/design-docs/powerplanner-design.md
Repro / dev environment: docs/components/planner/dpp-dev-env.md

What's in this PR

Phase 1 — infrastructure

Power Agent DaemonSet: components/power_agent/{power_agent.py,tests/} Watches pod annotations, maps cgroup→pod_uid→GPU via NVML, calls nvmlDeviceSetPowerManagementLimit on a 15 s reconciliation loop. Multi-pod-per-GPU policy, fail-closed safe-default, SIGTERM restore.
Pod annotation actuation: connectors/kubernetes.py writes dynamo.nvidia.com/gpu-power-limit on worker pods.
K8s manifests: deploy/power_agent/{daemonset,rbac}.yaml.
Operator chart fix: ClusterRole now grants pods permissions (deploy/helm/charts/platform/components/operator/templates/planner.yaml, chart bumped to 1.2.1). Older clusters can apply deploy/planner-pod-rbac-dev.yaml as a temporary patch.

Phase 2 — budget enforcement

_apply_power_budget() in core/state_machine.py clamps replica scaling within a watt budget. New PlannerConfig fields: enable_power_awareness, total_gpu_power_limit (required when awareness enabled — validator enforces, no silent default), prefill/decode_engine_gpu_power_limit, power_agent_safe_default_watts.

Phase 3 — AIC optimizer

monitoring/aic_power_optimizer.py: AIConfigurator sweep picks (replica, power) configs that maximise throughput within the watt budget. Per-component EMA correction (c_ttft, c_itl, c_power_p, c_power_d for disagg; c_power_agg for agg) closes the loop on silicon-vs-AIC drift. SLA gate, budget gate, capacity-exceeded re-optimize, auto-disable on consecutive failures.
core/base.py: NativePlannerBase plumbing (admission threshold push, named-port resolution, AIC integration).

Phase 4 — pipeclean (using existing single-TDP data)

tools/integrate_aic_power_data.py: copies measured H200/B200 power data into an AIC checkout and reinstalls.
tools/validate_aic_power_integration.py: asserts estimate_perf() returns power_w in [100, 710] W (not 0.0, not 700.0 TDP fallback).
tools/compute_power_budget.py: rack-capacity → safe-budget helper.
examples/deployments/powerplanner/: PIPECLEAN.md runbook, disagg-{power-aware,conservative-cold-start}.yaml, verify_poweraware.bash (8-section health check inc. Phase 4 preview), MULTI_DGD.md operator playbook, README.md.

Bug fixes (found by live integration tests, fixed in this patchset) -------------------------------------------------------------------

Operator chart 1.2.0 ClusterRole had no pods rules → planner couldn't list workers or patch annotations. Fixed in chart template; workaround manifest provided for clusters not yet upgraded.
Stale label selectors in connectors/kubernetes.py: dynamo-graph-deployment → dynamo-graph-deployment-name, dynamo-service → dynamo-component/dynamo-component-type.
DCGM queries used pod=~ (matched the DCGM exporter pod) and the wrong namespace; corrected to exported_pod=~ + exported_namespace=<k8s-namespace> and pod regex now matches the operator's <dgd>-<replica-idx>-<service-key>-... format (monitoring/traffic_metrics.py).
Frontend-source metric methods returned NaN on quiet clusters (0/0 in increase(_sum)/increase(_count)); router path had a NaN guard, frontend didn't — added matching math.isnan filter.

Tests

Verified 2026-05-10 on a clean from-scratch repro on a live Azure AKS dev cluster (qwen3-quickstart DGD, single dev namespace):

Suite Pass Skip Fail

planner/tests/unit/ 465 0 0
planner/tests/integration/test_aic_power_e2e_sim 15 0 0
planner/tests/integration/test_aic_power_optimizer 34 0 0
planner/tests/integration/test_metric_paths_live 22-23 3-2 0
planner/tests/integration/test_actuation_knobs_live 10-11 1-0 0
power_agent/tests/ 43 0 0

Total (cold deploy / after sanity chat completion) 590-591 4-3 0

Documented skips:

test_frontend_metric_series_exists — passes after any traffic.
TestDirectRouterMetricsClientLive::* — LocalRouter doesn't expose dynamo_frontend_worker_* gauges (open-question feat: OAI compatible endpoints for TRTLLM #14 in design doc).
TestScalingRealMutation — opt-in disruptive test, gated by RUN_DISRUPTIVE_TESTS=1 (passes when enabled).

Repro recipe: docs/components/planner/dpp-dev-env.md
"From-Scratch Repro Script" section.

Compatibility

Power awareness is opt-in (enable_power_awareness=False by default). AIC optimizer is opt-in (enable_aic_optimizer=False by default). No existing planner behavior changes when both flags are off.

Phase 4 (multi-power-level AIC sweep) and Phase 5 (silicon validation) remain follow-up work; this PR delivers Phases 1–3 plus the Phase 4 pipeclean code path so it can be exercised before AIC ships the full sweep API.

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

closes GitHub issue: #xxx

Summary by CodeRabbit

Release Notes

New Features
- Added Power Agent DaemonSet for enforcing GPU power limits on Dynamo worker pods via Kubernetes annotations.
- Introduced power-aware planner with GPU power budgeting and replica scaling constraints.
- Added AIC closed-loop optimizer for dynamic power and performance reconfiguration.
- Implemented admission control coupling for frontend threshold management.
- Added tools for power budget computation and AIC power-data integration.
Documentation
- Added developer environment setup guide, deployment playbooks, and verification scripts for power-aware scenarios.
Deployment
- Updated Helm charts to version 1.2.1 with planner pod-patching RBAC permissions.
- Added example DynamoGraphDeployment manifests for power-aware configurations.

copy-pr-bot · 2026-05-10T13:45:55Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-05-10T13:55:29Z

Walkthrough

This pull request implements end-to-end power-aware GPU resource management for the Dynamo planner across three phases: (1) a privileged DaemonSet that enforces per-pod NVML power caps via cgroup-based pod mapping, (2) static planner budget enforcement, and (3) a closed-loop AIC optimizer with EMA-based power coefficient correction. It includes comprehensive tests, Kubernetes manifests, deployment examples, and tooling.

Changes

Power-Aware GPU Resource Management

Layer / File(s)	Summary
Configuration & Schema `components/src/dynamo/planner/config/defaults.py`, `components/src/dynamo/planner/config/planner_config.py`, `components/src/dynamo/planner/core/types.py`	`SLAPlannerDefaults` and `PlannerConfig` extended with power-awareness flags, GPU power budgets, AIC optimizer parameters, and admission control settings. `TrafficObservation` gains AIC signal fields (TTFT/ITL seconds, token throughput, scheduled tokens).
Power Agent Daemon `components/power_agent/README.md`, `components/power_agent/power_agent.py`	DaemonSet that maps node PIDs to K8s pods via `/proc/<pid>/cgroup` parsing (v1/v2, systemd/cgroupfs), reads per-pod `dynamo.nvidia.com/gpu-power-limit` annotations, applies NVML power caps with safe-default fallback for conflicts, restores defaults on SIGTERM, and persists managed GPU UUIDs for orphan recovery.
Power Agent Tests `components/power_agent/tests/*`	Unit tests covering cgroup parsing (6 QoS/driver combinations), annotation mapping, SKU clamping, apply-cap behavior with metrics, multi-pod policy (single/agree/conflict/parse-failure), and graceful shutdown with NVML error resilience.
Planner Budget Enforcement `components/src/dynamo/planner/core/state_machine.py`, `components/src/dynamo/planner/core/load_scaling.py`, `components/src/dynamo/planner/core/throughput_scaling.py`, `components/src/dynamo/planner/core/budget.py`	`_apply_power_budget()` conditionally caps replica counts when total projected power exceeds limit, proportionally allocates prefill/decode, and prevents decode upscaling. Budget constant defined for pod annotation key.
Planner AIC Integration `components/src/dynamo/planner/core/base.py`	Initializes optional `AICPowerOptimizer` at startup, enriches traffic observations with token scheduling and TTFT/ITL in seconds, applies per-pod power annotations, publishes power budget metrics, collects observed power from Prometheus, updates AIC correction coefficients, triggers re-optimizations on drift, and fans out admission `/busy_threshold` POSTs to frontend pods.
AIC Closed-Loop Optimizer `components/src/dynamo/planner/monitoring/aic_power_optimizer.py`	Manages EMA-smoothed TTFT/ITL/power coefficients with traffic/token gating, detects SLA drift with hysteresis and rate limiting, sweeps replica counts under power budget, derives admission thresholds, warns on throughput regression, and implements fail-closed startup + fail-open runtime with auto-disable after consecutive failures.
Kubernetes Integration `components/src/dynamo/planner/connectors/kubernetes.py`, `components/src/dynamo/planner/connectors/kubernetes_api.py`	Methods to discover component pods by label, list frontend pods, resolve named `http` container ports with fallback, patch pod annotations, and POST admission thresholds to frontend pod IPs via `httpx`.
Metrics & Monitoring `components/src/dynamo/planner/monitoring/planner_metrics.py`, `components/src/dynamo/planner/monitoring/traffic_metrics.py`	Prometheus gauges for power budget/utilization, AIC coefficients (per-side and agg), implied admission thresholds, counters for failures/regressions. DCGM power query methods for total DGD power and per-component per-GPU power.
Planner Unit Tests `components/src/dynamo/planner/tests/unit/test_*.py`	Tests for scaling targets, pod annotation patching with no-op/disabled checks, busy threshold JSON construction, Kubernetes connector frontend discovery/port resolution, DCGM power queries, and state machine budget fitting.
Planner Integration Tests `components/src/dynamo/planner/tests/integration/test_*.py`	AIC optimizer happy path (replicas, power caps, latency/throughput), budget constraints, failure handling, EMA correction gating, drift detection. E2E simulations with synthetic H200 power levels. Live-cluster tests for annotation round-trip, pod discovery, admission POST, and metric path validation.
Profiler Extensions `components/src/dynamo/profiler/utils/aic_dataframe.py`	`build_prefill_row` and `build_decode_row` now accept `power_w` parameter (measured from NVML) instead of hardcoded `0.0`.
Helm Charts & RBAC `deploy/helm/charts/platform/*`	Version bumps to 1.2.1, extended planner Role/ClusterRole with core `pods` permissions (get/list/watch/patch) for pod actuation.
Power Agent Manifests `deploy/power_agent/daemonset.yaml`, `deploy/power_agent/rbac.yaml`, `deploy/planner-pod-rbac-dev.yaml`	DaemonSet running on GPU nodes (hostPID, nvidia runtime, tolerations), mounts `/proc` read-only and `/var/lib/dynamo-power-agent` for state, passes SAFE_DEFAULT_WATTS and node name. RBAC for ServiceAccount with pod read + patch. Dev RBAC workaround for operator ≤1.2.0.
Documentation `docs/components/planner/dpp-dev-env.md`, `examples/deployments/powerplanner/*`	End-to-end dev environment setup on AKS, multi-DGD deployment playbook with power accounting, cold-start coefficient recommendations, E2E pipeclean runbook, README with quick-start and budget formula.
Deployment Examples `examples/deployments/powerplanner/disagg-*.yaml`, `examples/deployments/powerplanner/verify_poweraware.bash`	Two disaggregated profiles (standard and conservative cold-start) with inline planner config enabling power/AIC. Bash smoke test validating DaemonSet readiness, pod annotations, power metrics, AIC coefficients, and Phase 4 preview.
Tooling `tools/compute_power_budget.py`, `tools/integrate_aic_power_data.py`, `tools/validate_aic_power_integration.py`	CLI to compute `total_gpu_power_limit` from rack capacity/headroom/overhead, integrate GPU power benchmark files into AIC, validate AIC estimator returns non-zero power within expected ranges and optimizer uses real power (not TDP fallback).

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

coderabbitai

Actionable comments posted: 17

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)

components/src/dynamo/planner/tests/integration/test_aic_power_optimizer.py (1)

1-799: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix formatting and linting errors before merge.

The file has formatting issues and linting violations that must be addressed:

Formatting: Run ruff format to fix whitespace and line wrapping.
Linting (manual fixes required):
- Replace ambiguous × character with x in comments at lines 329, 708 and docstring at line 696 (RUF002, RUF003).
- Prefix unused variable n_p with underscore at line 742: _n_p, n_d = ... (RUF059).

Ruff violations detected

RUF002 at line 696: Docstring contains ambiguous × (MULTIPLICATION SIGN)
RUF003 at lines 329, 708: Comment contains ambiguous × (MULTIPLICATION SIGN)  
RUF059 at line 742: Unpacked variable `n_p` is never used

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/tests/integration/test_aic_power_optimizer.py`
around lines 1 - 799, Run automatic formatting (ruff format) to fix
whitespace/line-wrapping, then manually fix three lint issues: replace the
Unicode multiplication sign '×' with ASCII 'x' in the explanatory
comments/docstring around the budget and throughput calculations (references:
the multi-replica budget comment in TestOptimizeBudgetConstraint, the
docstring/throughput comment in TestThroughputRegressionWarning and the comment
in TestSweepReplicas) and make the unused unpacked variable in
TestSweepReplicas.test_no_budget_uses_max_replicas a named-unused by prefixing
it with an underscore (change n_p, n_d = opt._sweep_replicas(...) to _n_p, n_d =
...). These changes address RUF002/RUF003 (ambiguous ×) and RUF059 (unused
variable) while keeping function/class names like _sweep_replicas,
TestSweepReplicas, TestOptimizeBudgetConstraint, and
TestThroughputRegressionWarning locators unchanged.

components/src/dynamo/planner/tests/unit/test_prometheus.py (1)

1-1078: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Address pre-commit formatting failures.

The file has formatting issues flagged by ruff. Run ruff format test_prometheus.py to apply reformatting and ruff check --fix test_prometheus.py to fix lint violations.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/tests/unit/test_prometheus.py` around lines 1 -
1078, The test file fails ruff pre-commit checks; run an automatic reformat and
fixer (e.g. ruff format followed by ruff check --fix) on the file and commit the
changes so tests and lint pass; if applying edits manually, ensure formatting
and lint rules are respected across imports and function/class definitions
(examples to inspect: PrometheusAPIClient, DirectRouterMetricsClient,
TestPrometheusAPIClientRouterSource, and
test_frontend_metric_container_with_nan_value) and re-run ruff to confirm no
remaining violations.

components/src/dynamo/planner/tests/integration/test_aic_power_e2e_sim.py (1)

1-711: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Run ruff format to fix spacing and line-wrapping issues.

The code has formatting inconsistencies in comment alignment and line lengths. Run ruff format test_aic_power_e2e_sim.py to resolve the spacing before comments (e.g., 50.0, # → 50.0, # ) and line-wrapping issues.

Formatting changes needed

-H200_PREFILL_SERVING_W = H200_PREFILL_P90_W   # 603 W/GPU at heavy prefill load
-H200_DECODE_SERVING_W = H200_DECODE_P90_W     # 558 W/GPU at heavy decode load
+H200_PREFILL_SERVING_W = H200_PREFILL_P90_W  # 603 W/GPU at heavy prefill load
+H200_DECODE_SERVING_W = H200_DECODE_P90_W  # 558 W/GPU at heavy decode load

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/tests/integration/test_aic_power_e2e_sim.py`
around lines 1 - 711, The file has ruff-formatting issues (misaligned inline
comments and long wrapped lines) across the test module (e.g., around numeric
literals and comment spacing such as "50.0,   #"), so run an automatic formatter
and fix spacing/line-wrapping: run `ruff format` (or your repo's formatter) on
this test file and ensure inline comments use a single space after the comma and
are properly wrapped; focus on symbols/classes to inspect after formatting like
TestFeedbackLoopH200.test_ema_converges_from_cold_start, _POWER_LEVELS entries,
_make_config, _make_traffic, and TestPhase4SyntheticMultiLevel methods to verify
comment alignment and line lengths are corrected.

components/src/dynamo/planner/tests/unit/test_kubernetes_connector.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Run the code formatter before merge.

The pre-commit check failed due to formatting issues. Run ruff format to fix line wrapping and assertion formatting.

#!/bin/bash
# Check current formatting diff
cd components/src/dynamo/planner
ruff format --diff tests/unit/test_kubernetes_connector.py

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/tests/unit/test_kubernetes_connector.py` at
line 1, Run the project's formatter on the test file to fix line wrapping and
assertion formatting: execute ruff format (e.g., `ruff format
tests/unit/test_kubernetes_connector.py` from the components/src/dynamo/planner
directory) to update formatting in test_kubernetes_connector.py so the
pre-commit checks pass; re-run the tests or pre-commit to confirm no remaining
style diffs.

🟡 Minor comments (7)

components/src/dynamo/planner/config/defaults.py-81-103 (1)
81-103: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Per-engine power-limit defaults contradict the stated "no placeholder integers" philosophy.

The comment at lines 82–84 explicitly argues that defaults are None "so the type itself signals 'operator must set'", yet prefill_engine_gpu_power_limit and decode_engine_gpu_power_limit are hardcoded to 300 W. With enable_power_awareness=True, these silently cap every prefill/decode GPU to 300 W regardless of SKU (≈43% of an H200 SXM's 700 W TDP), which would dramatically and silently regress throughput while the operator believes they are within budget. The validator in planner_config.py only enforces total_gpu_power_limit and power_agent_safe_default_watts, so this slips through.

Either make these Optional[int] = None and add a validator, or derive them from total_gpu_power_limit / gpu_count so the operator only needs to set one number.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/config/defaults.py` around lines 81 - 103, The
defaults prefill_engine_gpu_power_limit and decode_engine_gpu_power_limit
contradict the "no placeholder integers" policy and silently cap GPUs when
enable_power_awareness=True; change both to Optional[int]=None (or compute them
from total_gpu_power_limit divided by gpu_count) and update the
planner_config.py validation logic to require/derive per-engine limits when
enable_power_awareness is true (i.e., enforce that either per-engine limits are
set or automatically compute them from total_gpu_power_limit and known GPU
count) while keeping total_gpu_power_limit and power_agent_safe_default_watts
validation intact.
components/src/dynamo/planner/tests/unit/test_state_machine.py-1356-1358 (1)
1356-1358: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Rename the unused unpack target to keep pre-commit green.

num_p is never read in this test, and Ruff is already flagging it. Renaming it to _num_p avoids the RUF059 failure without changing the assertion.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/tests/unit/test_state_machine.py` around lines
1356 - 1358, The test assigns the tuple return from sm._apply_global_budget to
num_p and num_d but never uses num_p; rename the unused unpack target to _num_p
(i.e., change "num_p, num_d = sm._apply_global_budget(5, 3)" to "_num_p, num_d =
sm._apply_global_budget(5, 3)") to satisfy Ruff/RUF059 while leaving the
assertion against num_d unchanged.
tools/validate_aic_power_integration.py-137-137 (1)
137-137: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Drop the stray f prefixes to unblock Ruff.

These strings have no placeholders, and pre-commit is already reporting the resulting F541 errors here.

Also applies to: 167-167, 305-305
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/validate_aic_power_integration.py` at line 137, Calls to _section are
using f-string prefixes for plain string literals (e.g., _section(f"  prefill
estimate (isl=1024, tp=8)")) which triggers Ruff F541; remove the leading f so
the calls use normal string literals (e.g., _section("  prefill estimate
(isl=1024, tp=8)")). Apply the same change to the other occurrences of
_section(f"...") in the file (the other similar call sites that currently use an
f-prefix).
components/src/dynamo/planner/tests/unit/test_actuation_knobs.py-1-1 (1)
1-1: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Run the code formatter before merge.

The pre-commit check failed because the code needs formatting. Run ruff format on this file to fix line wrapping issues.
#!/bin/bash
# Check current formatting diff
cd components/src/dynamo/planner
ruff format --diff tests/unit/test_actuation_knobs.py
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/tests/unit/test_actuation_knobs.py` at line 1,
The file tests/unit/test_actuation_knobs.py needs to be auto-formatted to
satisfy pre-commit checks; run the project formatter (ruff format) on that file
(e.g., ruff format
components/src/dynamo/planner/tests/unit/test_actuation_knobs.py) to fix line
wrapping and style issues, then stage the updated file so the pre-commit check
passes.
examples/deployments/powerplanner/disagg-conservative-cold-start.yaml-94-94 (1)
94-94: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Correct the model ID to include the "Meta-" prefix.

The model ID meta-llama/Llama-3-8B-Instruct is incorrect. The official HuggingFace model is meta-llama/Meta-Llama-3-8B-Instruct. Using the wrong ID will cause model download failures at runtime.

Also applies to: line 114
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/deployments/powerplanner/disagg-conservative-cold-start.yaml` at
line 94, The model ID strings are incorrect: replace every occurrence of
"meta-llama/Llama-3-8B-Instruct" with the official HuggingFace ID
"meta-llama/Meta-Llama-3-8B-Instruct" (e.g., the value used in the model field
in the YAML), and update the second occurrence noted in the review as well;
ensure both instances exactly match "meta-llama/Meta-Llama-3-8B-Instruct" so the
runtime can successfully download the model.
components/src/dynamo/planner/tests/integration/test_actuation_knobs_live.py-305-313 (1)
305-313: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Resolve the frontend port from the pod instead of forcing 8000.

Line 308 hardcodes the default frontend port, so this live test will false-fail on deployments that override the http port. Use the same per-pod port resolution helper as the runtime path instead of assuming 8000.

As per coding guidelines, "Don’t hardcode ports in tests (no literal 8000/8081/...; use provided dynamic port fixtures/helpers)."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/tests/integration/test_actuation_knobs_live.py`
around lines 305 - 313, The test currently hardcodes port 8000 when calling
connector.post_busy_threshold (pod, model, port=8000) which breaks deployments
that override the http port; replace the literal with the per-pod frontend-port
resolver used by the runtime path (call it to compute port_from_pod =
resolve_frontend_port(pod) or the existing helper in your test runtime
utilities) and pass that variable into connector.post_busy_threshold (ensure you
await or import the helper if async). This fixes the test by dynamically
resolving the correct frontend port for the given pod instead of assuming 8000.
components/src/dynamo/planner/tests/integration/test_actuation_knobs_live.py-571-582 (1)
571-582: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Replace time.sleep(2) with await asyncio.sleep(2) to avoid blocking the event loop.

This async test uses await calls, and time.sleep(2) blocks the entire asyncio event loop, which can interfere with async cleanup and make the test flaky. Import asyncio at the module level and use await asyncio.sleep(2) instead.
Suggested change
-            time.sleep(2)  # let the API server settle
+            await asyncio.sleep(2)  # let the API server settle
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/tests/integration/test_actuation_knobs_live.py`
around lines 571 - 582, The test is blocking the asyncio event loop by calling
time.sleep(2); replace that call with await asyncio.sleep(2) and add an import
asyncio at module level. Locate the snippet where
connector.set_component_replicas(...) is called (in the test function that uses
await) and change the synchronous sleep to the async sleep so the event loop
isn’t blocked and async cleanup can run correctly.

🧹 Nitpick comments (9)

components/power_agent/power_agent.py (1)

489-490: 💤 Low value

Move argparse import to module top per Python guidelines.

 import json
 import logging
 import os
 import re
 import signal
 import sys
 import threading
 import time
 from typing import Optional
+import argparse

 def main() -> None:
-    import argparse
-
     parser = argparse.ArgumentParser(description="Dynamo Power Agent DaemonSet")

As per coding guidelines (.ai/python-guidelines.md): "flag any import inside function/method/class bodies (keep imports at module top)".

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/power_agent/power_agent.py` around lines 489 - 490, The import of
argparse is inside the main() function; move the import to the module top to
follow guidelines. Remove the inline "import argparse" from def main() and add
"import argparse" alongside other top-level imports so that the module-level
imports include argparse and main() simply uses it.

components/power_agent/tests/test_multi_pod_policy.py (1)

66-126: ⚡ Quick win

Coverage looks good for the policy itself; consider adding a same-pod-multi-PID case.

The cases here cover _resolve_cap_for_gpu thoroughly. To complete the matrix and lock down the deduplication concern raised against power_agent.py:449-455, please add a test that asserts a single pod producing two PIDs on the same GPU resolves to the pod's annotation without firing multi_pod_agree:

def test_same_pod_multi_pid_not_multi_pod(self):
    # After the dedup fix, two PIDs from one pod -> 1 entry, no warning.
    cap = _resolve_cap_for_gpu(
        0, [("uid-1", "480")], SAFE_DEFAULT, self.m
    )
    self.assertEqual(cap, 480)
    self.assertEqual(self.m.multi_pod_agree, 0)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/power_agent/tests/test_multi_pod_policy.py` around lines 66 - 126,
Add a new unit test method test_same_pod_multi_pid_not_multi_pod to
TestMultiPodPolicy that calls _resolve_cap_for_gpu with a single pod UID and a
valid annotation (use SAFE_DEFAULT and the fixture self.m) and asserts the
returned cap equals the annotated value and that self.m.multi_pod_agree remains
0; reference _resolve_cap_for_gpu, SAFE_DEFAULT and the metric multi_pod_agree
so the test verifies same-pod/multiple-PID deduplication (after the fix around
power_agent.py:449-455) does not trigger the multi-pod agreement metric.

components/src/dynamo/planner/monitoring/planner_metrics.py (1)

189-213: ⚡ Quick win

Keep the new AIC metrics under the dynamo_planner_* namespace.

Everything else in this registry uses PREFIX, but these counters/gauges switch to dynamo_aic_*. That splits one feature area across two metric namespaces and makes prefix-based dashboards/alerts miss the new series.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/monitoring/planner_metrics.py` around lines 189
- 213, The new AIC metric names use the "dynamo_aic_*" prefix but must live
under the existing PREFIX namespace; update the metric name arguments for
aic_correction_pegged_total, aic_consecutive_failures,
aic_optimizer_exceptions_total, aic_optimizer_disabled_reason, and
aic_throughput_regression_total to use PREFIX (e.g. PREFIX +
"aic_correction_pegged_total" etc.) instead of hardcoded "dynamo_aic_*" so they
appear under the dynamo_planner_* namespace; keep the same help strings,
labelnames, and metric types unchanged.

components/src/dynamo/planner/tests/integration/test_aic_power_optimizer.py (1)

742-742: ⚡ Quick win

Unused variable in test assertion.

The variable n_p is unpacked but never used in this test. Either use it in an assertion or prefix with _ to indicate it's intentionally unused.

♻️ Suggested fix

-        n_p, n_d = opt._sweep_replicas(1, 5, 3200, 2400, None)
+        _, n_d = opt._sweep_replicas(1, 5, 3200, 2400, None)
         assert n_d == 5

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/tests/integration/test_aic_power_optimizer.py`
at line 742, The test unpacks n_p and n_d from opt._sweep_replicas(1, 5, 3200,
2400, None) but never uses n_p; update the test to either assert something about
n_p (for example expected value) or mark it as intentionally unused by renaming
it to _n_p (or simply _), e.g., change the unpacking from "n_p, n_d =
opt._sweep_replicas(...)" to "_, n_d = opt._sweep_replicas(...)" or "_n_p, n_d =
opt._sweep_replicas(...)" so the linter/test clearly reflects that n_p is not
used.

examples/deployments/powerplanner/README.md (1)

52-54: 💤 Low value

Optional: Specify language for fenced code block.

The budget formula code block could use a language specifier for consistency (e.g., text or plaintext), though it's not critical for a mathematical formula.

♻️ Optional improvement

-```
+```text
 total_gpu_power_limit = (rack_capacity_W × headroom_factor) − non_gpu_overhead
 ```

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/deployments/powerplanner/README.md` around lines 52 - 54, The fenced
code block containing the budget formula currently has no language specified;
update the opening fence from ``` to a language label (e.g., ```text or
```plaintext) so the block becomes ```text (or ```plaintext) followed by the
same formula `total_gpu_power_limit = (rack_capacity_W × headroom_factor) −
non_gpu_overhead` and a closing ```, ensuring consistency and clearer rendering
in README.md.

deploy/power_agent/daemonset.yaml (1)

93-99: ⚡ Quick win

Verify resource limits are appropriate for production.

The agent requests 50m CPU / 64Mi memory with limits of 200m CPU / 128Mi memory. Confirm these values are validated against:

Typical NVML API latency (should be low)

Kubernetes API list/watch overhead per 15s reconcile loop

Number of GPUs per node (more GPUs = more NVML calls)

Consider monitoring container_memory_working_set_bytes and rate(container_cpu_usage_seconds_total) in production to validate these limits don't cause OOM or CPU throttling on large nodes (e.g., 8× H200 SXM).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/power_agent/daemonset.yaml` around lines 93 - 99, The current
resources.requests and resources.limits in daemonset.yaml (cpu: "50m"/"200m",
memory: "64Mi"/"128Mi") are likely too low for production; validate and raise
them based on measured NVML API latency, 15s controller reconcile list/watch
overhead, and max GPUs per node by load-testing with representative nodes (e.g.,
8× H200 SXM). Measure container_memory_working_set_bytes and
rate(container_cpu_usage_seconds_total) under stress to find steady-state and
peak usage, then update resources.requests to cover steady-state usage and
resources.limits to provide safe headroom (avoid OOM/throttling), and iterate:
simulate NVML-heavy loops, reconcile cycles, and multiple GPUs, adjust values in
daemonset.yaml accordingly and add monitoring alerts if working_set or CPU rate
approaches limits.

components/src/dynamo/planner/tests/integration/test_aic_power_e2e_sim.py (1)

486-489: 💤 Low value

Class attribute dict comprehension triggers Ruff warning.

Ruff flags this as a mutable class attribute (RUF012), though it's effectively a constant. Consider moving it outside the class or converting to a module-level constant to silence the warning.

♻️ Optional refactor

+# Per-replica wattage for each level (power_w × 8 GPUs)
+_WATTS_PER_REPLICA = {
+    lvl: _POWER_LEVELS[lvl]["prefill_w"] * H200_GPUS_PER_ENGINE
+    for lvl in _POWER_LEVELS
+}
+
 class TestPhase4SyntheticMultiLevel:
-    # Per-replica wattage for each level (power_w × 8 GPUs)
-    _WATTS_PER_REPLICA = {
-        lvl: _POWER_LEVELS[lvl]["prefill_w"] * H200_GPUS_PER_ENGINE
-        for lvl in _POWER_LEVELS
-    }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/tests/integration/test_aic_power_e2e_sim.py`
around lines 486 - 489, Move the mutable dict comprehension _WATTS_PER_REPLICA
out of the class and make it a module-level constant to avoid the Ruff RUF012
warning: compute the dict using the existing symbols _POWER_LEVELS and
H200_GPUS_PER_ENGINE at module scope (e.g., next to _POWER_LEVELS) and replace
the class attribute reference with that module-level constant so the class no
longer defines a mutable attribute.

components/src/dynamo/planner/tests/unit/test_prometheus.py (1)

896-896: ⚡ Quick win

Prefix unused unpacked variables with underscore.

Multiple test methods unpack tuples but don't use all variables. Prefix unused variables with _ to indicate this is intentional.

♻️ Suggested fixes

 # Line 896
-        recent, per_worker_avg, cluster_avg = result
+        recent, _per_worker_avg, _cluster_avg = result
         assert recent["w0"]["active_prefill_tokens"] == pytest.approx(50.0)

 # Line 904
-        recent, per_worker_avg, cluster_avg = direct_client.get_recent_and_averaged_metrics("prefill")
+        _recent, per_worker_avg, _cluster_avg = direct_client.get_recent_and_averaged_metrics("prefill")
         assert per_worker_avg["w0"]["active_prefill_tokens"] == pytest.approx(50.0)

 # Line 927
-        _, per_worker_avg, cluster_avg = direct_client.get_recent_and_averaged_metrics("prefill")
+        _, _per_worker_avg, cluster_avg = direct_client.get_recent_and_averaged_metrics("prefill")
         assert cluster_avg["active_prefill_tokens"] == pytest.approx(20.0)

Also applies to: 904-904, 927-927

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/tests/unit/test_prometheus.py` at line 896, In
components/src/dynamo/planner/tests/unit/test_prometheus.py there are tuple
unpackings like "recent, per_worker_avg, cluster_avg = result" where some
unpacked variables are unused; change the unused names to be prefixed with an
underscore (e.g., "recent, _per_worker_avg, _cluster_avg = result" or replace
with "_" placeholders) so the intent is explicit. Update the other similar
unpackings in the same file (the occurrences that unpack into recent,
per_worker_avg, cluster_avg) to follow the same underscore-prefixed naming
convention.

components/src/dynamo/planner/connectors/kubernetes.py (1)

306-307: ⚡ Quick win

Avoid broad exception catching.

The broad except Exception: can mask unexpected errors (e.g., KeyError, AttributeError from malformed deployment dicts). Catch specific exceptions (PlannerError) to make the error-handling intent explicit.
🔍 Catch specific exceptions
         try:
             service = get_service_from_sub_component_type_or_name(
                 deployment, sub_component_type=sub_component_type
             )
-        except Exception:
+        except PlannerError:
             return []
Based on learnings: Python guidelines require catching specific exceptions rather than broad Exception to avoid swallowing unexpected errors.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/connectors/kubernetes.py` around lines 306 -
307, The broad except Exception: in connectors/kubernetes.py masks unexpected
errors; change it to catch the specific PlannerError (e.g., use except
PlannerError as e:) and return [] for that case, while letting any other
exceptions propagate (or re-raise them) so KeyError/AttributeError aren’t
swallowed; also ensure PlannerError is imported where the except block resides
and preserve the original return [] behavior for the PlannerError path.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@components/power_agent/power_agent.py`:
- Around line 449-455: _reconcile_gpu currently appends one (uid, annotation)
tuple per PID causing duplicate entries for multi-PID pods; change the logic to
deduplicate by pod UID before calling _resolve_cap_for_gpu (e.g., build a set or
map of seen UIDs using _extract_pod_uid_from_cgroup and uid_to_annotation, then
produce a single (uid, annotation) entry per UID into pod_annotations) so
_resolve_cap_for_gpu sees unique pods; update references to pod_annotations and
keep uid_to_annotation lookup as before; add a regression test in
test_multi_pod_policy.py that covers "single pod, two PIDs" to prevent
regressions.

In `@components/power_agent/tests/__init__.py`:
- Line 1: Add the standard repository SPDX header to the new package module
components.power_agent.tests.__init__.py; open the empty __init__.py and prepend
the project's copyright/SPDX license header block (the same header used across
other top-level Python files in the repo) so the file passes copyright-checks.

In `@components/power_agent/tests/test_apply_cap.py`:
- Around line 17-24: The test module imports power_agent eagerly which can raise
ImportError for optional NVML/K8s deps and break test collection; wrap the
import block that pulls in power_agent and symbols (POWER_ANNOTATION_KEY,
PowerAgent, _apply_cap, _clamp_to_constraints, _nvml_uuid) in a try/except
ImportError and in the except call pytest.skip("optional dependency missing:
<package>", allow_module_level=True) at module scope (ensure pytest is imported
at top-level so skip can be called).

In `@components/src/dynamo/planner/connectors/kubernetes.py`:
- Around line 384-390: Move the httpx import to the module top and remove the
try/except ImportError block in the admission-control POST handling code: add a
plain "import httpx" at the top of
components/src/dynamo/planner/connectors/kubernetes.py, delete the in-function
try/except that currently raises RuntimeError for missing httpx, and rely on the
module-level import to surface ImportError early; update any references in the
admission-control POST support code to use httpx directly (no fallback or
conditional logic needed).

In `@components/src/dynamo/planner/core/base.py`:
- Around line 753-767: The planner's advisory/dry-run flag isn't respected:
update _apply_power_annotations to bail out when self.config.advisory is True
(in addition to existing checks) so it never PATCHes pod annotations in advisory
mode; likewise locate the autoset fanout logic that posts /busy_threshold (the
code around the second occurrence noted) and add the same advisory guard to skip
POSTing frontend thresholds when self.config.advisory is True. Ensure both
guards check self.config.advisory before making any Kubernetes PATCH or HTTP
POST calls so advisory mode performs no writes.
- Around line 946-976: The current fanout to self.connector.post_busy_threshold
can leave some frontend pods with old thresholds and the planner continuing;
after computing failed (pods with Exception) you must treat any partial failure
as a hard failure: increment the admission_partial_success_total metric as done,
log the error, then raise an exception (e.g., RuntimeError with the failed pod
names and original results) or otherwise abort/return an error so the planner
does not proceed; ensure this change is applied in the same block that builds
results/pods (refer to self.connector.post_busy_threshold, the failed list,
self.prometheus_metrics.admission_partial_success_total, and logger.error) so
partial success stops the planner instead of being treated as success.

In `@components/src/dynamo/planner/monitoring/aic_power_optimizer.py`:
- Around line 534-545: The current branch returns (min_ep, min_ep) when
remaining_after_min_p <= 0 which can still exceed the budget once dynamic power
(d_watts) is considered; update this branch to check whether min_ep * (p_watts +
d_watts) <= budget before returning min_ep and if that check fails, log an error
via logger.error (including budget, min_ep, p_watts, d_watts) and fail closed by
returning a safe zero/NO_OP result (e.g., (0, 0)) or raising a clear exception
instead of returning an over-budget config; change the code around the variables
budget, min_ep, p_watts, d_watts and the logger.warning call accordingly.

In `@components/src/dynamo/planner/monitoring/planner_metrics.py`:
- Around line 244-247: The metric variable admission_partial_success_total is
misnamed vs its help text; either rename the Counter to
admission_partial_failure_total (and update all references/usages of
admission_partial_success_total) so the metric name matches that it counts
frontend POST failures, or keep the variable name but change the Counter help
string to describe "Cumulative count of frontend POST partial successes" if it
truly counts successes; update the f-string name
(f"{PREFIX}_admission_partial_failure_total") and any code that reads or exports
this metric (search for admission_partial_success_total) to maintain
consistency.

In `@components/src/dynamo/planner/monitoring/traffic_metrics.py`:
- Around line 404-414: The try/except in get_total_dgd_power (and the sibling
power helper around lines 445–457) is catching Exception broadly and swallowing
errors by logging and returning None; change it to either catch only expected
parsing/Prometheus errors (e.g., ValueError, TypeError, KeyError, or the
specific exception type thrown by self.prom.custom_query) and handle those by
logging and returning None, or if you must catch Exception, log the error with
logger.exception/ logger.error and then re-raise so unexpected failures
propagate; locate the calls to self.prom.custom_query and the surrounding
logger.debug call in get_total_dgd_power and the other helper and apply the
narrowed exception list or add a re-raise after logging.

In
`@components/src/dynamo/planner/tests/integration/test_actuation_knobs_live.py`:
- Line 75: The module-level pytestmark currently sets pytest.mark.integration
and pytest.mark.gpu_0 but lacks a timeout; update the pytestmark list (the
pytestmark variable) to also include a pytest.mark.timeout(...) marker (e.g.,
pytest.mark.timeout(120) or another appropriate upper bound >30s) so the entire
live suite has a hard timeout and network/polling calls fail fast; ensure you
import pytest if not already and add the timeout marker to the existing
pytestmark list rather than adding per-test decorators.

In `@components/src/dynamo/planner/tests/integration/test_metric_paths_live.py`:
- Line 73: The module-level pytestmarks lack a timeout; update the pytestmark
variable in test_metric_paths_live.py (the existing pytestmark =
[pytest.mark.integration, pytest.mark.gpu_0]) to include a timeout marker such
as pytest.mark.timeout(300) (or another appropriate value >=30s) so long-running
in-cluster Prometheus/K8s calls will fail instead of hanging; ensure pytest is
imported if not already.

In `@deploy/planner-pod-rbac-dev.yaml`:
- Around line 37-39: The RoleBinding subject for the ServiceAccount
"planner-dev-sa" is missing the required namespace field; update the RoleBinding
subjects block to add "namespace: <namespace>" for the ServiceAccount subject
(or wire a templating/Kustomize variable) so the subject entry for
planner-dev-sa includes its namespace and the binding correctly grants
permissions.

In `@deploy/power_agent/rbac.yaml`:
- Around line 9-14: The docs show an apply path that leaves the
ClusterRoleBinding.subjects[0].namespace set to the literal placeholder
`${POWER_AGENT_NAMESPACE}`, which breaks the binding because kubectl does not
perform variable substitution; update the README snippet to either remove the
misleading "kubectl apply -n <your-namespace> -f deploy/power_agent/rbac.yaml"
path or change it to explicitly instruct the user to substitute the placeholder
(e.g., edit the ServiceAccount and ClusterRoleBinding.subjects[0].namespace or
use envsubst) so the ServiceAccount namespace and
ClusterRoleBinding.subjects[0].namespace match.

In `@tools/compute_power_budget.py`:
- Line 148: The print statement currently uses an unnecessary f-string prefix
("print(f\"    (= 70% of TDP; adjust for your model's idle power profile)\")")
which triggers Ruff F541; change the call to a plain string literal by removing
the leading "f" so the statement is normal (e.g., update the print call in
compute_power_budget.py where that string is printed). Ensure no other f-prefix
remains on that exact print invocation.
- Around line 316-336: Validate the numeric CLI inputs before building
SizingInput and calling compute_budget: check that args.nodes and
args.gpus_per_node are positive integers (greater than 0) so total_gpus =
args.nodes * args.gpus_per_node > 0, and also validate args.rack_capacity and
gpu_tdp (and any other numeric args used in SizingInput such as
args.non_gpu_overhead and elements of args.dgd_gpus) are positive where
applicable; if any validation fails, print a clear stderr error and exit
non-zero. Locate the creation of total_gpus and the SizingInput construction in
compute_power_budget.py (symbols: total_gpus, SizingInput) and add these
pre-checks before calling compute_budget() to avoid division-by-zero or
meaningless budgets.

In `@tools/integrate_aic_power_data.py`:
- Around line 230-231: After resolving args.h200_data and args.b200_data into
h200_data and b200_data, validate that any non-None Path actually exists (and is
a directory) and fail fast if not: if h200_data is not None and not
h200_data.exists()/is_dir() (same for b200_data) log/raise a clear error and
exit with non-zero status (or call the argparse parser.error) so a mistyped root
produces a hard failure instead of silent "nothing to copy"; update the
validation near the h200_data/b200_data assignment to perform these checks and
produce a descriptive message including the offending path variable name.

In `@tools/validate_aic_power_integration.py`:
- Around line 20-21: The tight-budget test never runs because the script only
calls the optimizer with a generous hardcoded cap (80_000); add an explicit
tight-budget run that sets the budget to just under one replica's power and
assert the optimizer returns fewer replicas than requested. Concretely: compute
the single-replica draw (use the same attribute/func the script uses to estimate
replica power, e.g., replica_power_watts or equivalent), call the optimizer (the
existing optimizer/optimize/run_optimizer call) with budget =
single_replica_power - 1, and add an assertion that the returned replica count <
requested_max_replicas (mirror the style used elsewhere in the file). Ensure
this new check is executed (not behind a dead branch) alongside the existing
generous-cap test (the one using 80_000).

---

Outside diff comments:
In `@components/src/dynamo/planner/tests/integration/test_aic_power_e2e_sim.py`:
- Around line 1-711: The file has ruff-formatting issues (misaligned inline
comments and long wrapped lines) across the test module (e.g., around numeric
literals and comment spacing such as "50.0,   #"), so run an automatic formatter
and fix spacing/line-wrapping: run `ruff format` (or your repo's formatter) on
this test file and ensure inline comments use a single space after the comma and
are properly wrapped; focus on symbols/classes to inspect after formatting like
TestFeedbackLoopH200.test_ema_converges_from_cold_start, _POWER_LEVELS entries,
_make_config, _make_traffic, and TestPhase4SyntheticMultiLevel methods to verify
comment alignment and line lengths are corrected.

In `@components/src/dynamo/planner/tests/integration/test_aic_power_optimizer.py`:
- Around line 1-799: Run automatic formatting (ruff format) to fix
whitespace/line-wrapping, then manually fix three lint issues: replace the
Unicode multiplication sign '×' with ASCII 'x' in the explanatory
comments/docstring around the budget and throughput calculations (references:
the multi-replica budget comment in TestOptimizeBudgetConstraint, the
docstring/throughput comment in TestThroughputRegressionWarning and the comment
in TestSweepReplicas) and make the unused unpacked variable in
TestSweepReplicas.test_no_budget_uses_max_replicas a named-unused by prefixing
it with an underscore (change n_p, n_d = opt._sweep_replicas(...) to _n_p, n_d =
...). These changes address RUF002/RUF003 (ambiguous ×) and RUF059 (unused
variable) while keeping function/class names like _sweep_replicas,
TestSweepReplicas, TestOptimizeBudgetConstraint, and
TestThroughputRegressionWarning locators unchanged.

In `@components/src/dynamo/planner/tests/unit/test_kubernetes_connector.py`:
- Line 1: Run the project's formatter on the test file to fix line wrapping and
assertion formatting: execute ruff format (e.g., `ruff format
tests/unit/test_kubernetes_connector.py` from the components/src/dynamo/planner
directory) to update formatting in test_kubernetes_connector.py so the
pre-commit checks pass; re-run the tests or pre-commit to confirm no remaining
style diffs.

In `@components/src/dynamo/planner/tests/unit/test_prometheus.py`:
- Around line 1-1078: The test file fails ruff pre-commit checks; run an
automatic reformat and fixer (e.g. ruff format followed by ruff check --fix) on
the file and commit the changes so tests and lint pass; if applying edits
manually, ensure formatting and lint rules are respected across imports and
function/class definitions (examples to inspect: PrometheusAPIClient,
DirectRouterMetricsClient, TestPrometheusAPIClientRouterSource, and
test_frontend_metric_container_with_nan_value) and re-run ruff to confirm no
remaining violations.

---

Minor comments:
In `@components/src/dynamo/planner/config/defaults.py`:
- Around line 81-103: The defaults prefill_engine_gpu_power_limit and
decode_engine_gpu_power_limit contradict the "no placeholder integers" policy
and silently cap GPUs when enable_power_awareness=True; change both to
Optional[int]=None (or compute them from total_gpu_power_limit divided by
gpu_count) and update the planner_config.py validation logic to require/derive
per-engine limits when enable_power_awareness is true (i.e., enforce that either
per-engine limits are set or automatically compute them from
total_gpu_power_limit and known GPU count) while keeping total_gpu_power_limit
and power_agent_safe_default_watts validation intact.

In
`@components/src/dynamo/planner/tests/integration/test_actuation_knobs_live.py`:
- Around line 305-313: The test currently hardcodes port 8000 when calling
connector.post_busy_threshold (pod, model, port=8000) which breaks deployments
that override the http port; replace the literal with the per-pod frontend-port
resolver used by the runtime path (call it to compute port_from_pod =
resolve_frontend_port(pod) or the existing helper in your test runtime
utilities) and pass that variable into connector.post_busy_threshold (ensure you
await or import the helper if async). This fixes the test by dynamically
resolving the correct frontend port for the given pod instead of assuming 8000.
- Around line 571-582: The test is blocking the asyncio event loop by calling
time.sleep(2); replace that call with await asyncio.sleep(2) and add an import
asyncio at module level. Locate the snippet where
connector.set_component_replicas(...) is called (in the test function that uses
await) and change the synchronous sleep to the async sleep so the event loop
isn’t blocked and async cleanup can run correctly.

In `@components/src/dynamo/planner/tests/unit/test_actuation_knobs.py`:
- Line 1: The file tests/unit/test_actuation_knobs.py needs to be auto-formatted
to satisfy pre-commit checks; run the project formatter (ruff format) on that
file (e.g., ruff format
components/src/dynamo/planner/tests/unit/test_actuation_knobs.py) to fix line
wrapping and style issues, then stage the updated file so the pre-commit check
passes.

In `@components/src/dynamo/planner/tests/unit/test_state_machine.py`:
- Around line 1356-1358: The test assigns the tuple return from
sm._apply_global_budget to num_p and num_d but never uses num_p; rename the
unused unpack target to _num_p (i.e., change "num_p, num_d =
sm._apply_global_budget(5, 3)" to "_num_p, num_d = sm._apply_global_budget(5,
3)") to satisfy Ruff/RUF059 while leaving the assertion against num_d unchanged.

In `@examples/deployments/powerplanner/disagg-conservative-cold-start.yaml`:
- Line 94: The model ID strings are incorrect: replace every occurrence of
"meta-llama/Llama-3-8B-Instruct" with the official HuggingFace ID
"meta-llama/Meta-Llama-3-8B-Instruct" (e.g., the value used in the model field
in the YAML), and update the second occurrence noted in the review as well;
ensure both instances exactly match "meta-llama/Meta-Llama-3-8B-Instruct" so the
runtime can successfully download the model.

In `@tools/validate_aic_power_integration.py`:
- Line 137: Calls to _section are using f-string prefixes for plain string
literals (e.g., _section(f"  prefill estimate (isl=1024, tp=8)")) which triggers
Ruff F541; remove the leading f so the calls use normal string literals (e.g.,
_section("  prefill estimate (isl=1024, tp=8)")). Apply the same change to the
other occurrences of _section(f"...") in the file (the other similar call sites
that currently use an f-prefix).

---

Nitpick comments:
In `@components/power_agent/power_agent.py`:
- Around line 489-490: The import of argparse is inside the main() function;
move the import to the module top to follow guidelines. Remove the inline
"import argparse" from def main() and add "import argparse" alongside other
top-level imports so that the module-level imports include argparse and main()
simply uses it.

In `@components/power_agent/tests/test_multi_pod_policy.py`:
- Around line 66-126: Add a new unit test method
test_same_pod_multi_pid_not_multi_pod to TestMultiPodPolicy that calls
_resolve_cap_for_gpu with a single pod UID and a valid annotation (use
SAFE_DEFAULT and the fixture self.m) and asserts the returned cap equals the
annotated value and that self.m.multi_pod_agree remains 0; reference
_resolve_cap_for_gpu, SAFE_DEFAULT and the metric multi_pod_agree so the test
verifies same-pod/multiple-PID deduplication (after the fix around
power_agent.py:449-455) does not trigger the multi-pod agreement metric.

In `@components/src/dynamo/planner/connectors/kubernetes.py`:
- Around line 306-307: The broad except Exception: in connectors/kubernetes.py
masks unexpected errors; change it to catch the specific PlannerError (e.g., use
except PlannerError as e:) and return [] for that case, while letting any other
exceptions propagate (or re-raise them) so KeyError/AttributeError aren’t
swallowed; also ensure PlannerError is imported where the except block resides
and preserve the original return [] behavior for the PlannerError path.

In `@components/src/dynamo/planner/monitoring/planner_metrics.py`:
- Around line 189-213: The new AIC metric names use the "dynamo_aic_*" prefix
but must live under the existing PREFIX namespace; update the metric name
arguments for aic_correction_pegged_total, aic_consecutive_failures,
aic_optimizer_exceptions_total, aic_optimizer_disabled_reason, and
aic_throughput_regression_total to use PREFIX (e.g. PREFIX +
"aic_correction_pegged_total" etc.) instead of hardcoded "dynamo_aic_*" so they
appear under the dynamo_planner_* namespace; keep the same help strings,
labelnames, and metric types unchanged.

In `@components/src/dynamo/planner/tests/integration/test_aic_power_e2e_sim.py`:
- Around line 486-489: Move the mutable dict comprehension _WATTS_PER_REPLICA
out of the class and make it a module-level constant to avoid the Ruff RUF012
warning: compute the dict using the existing symbols _POWER_LEVELS and
H200_GPUS_PER_ENGINE at module scope (e.g., next to _POWER_LEVELS) and replace
the class attribute reference with that module-level constant so the class no
longer defines a mutable attribute.

In `@components/src/dynamo/planner/tests/integration/test_aic_power_optimizer.py`:
- Line 742: The test unpacks n_p and n_d from opt._sweep_replicas(1, 5, 3200,
2400, None) but never uses n_p; update the test to either assert something about
n_p (for example expected value) or mark it as intentionally unused by renaming
it to _n_p (or simply _), e.g., change the unpacking from "n_p, n_d =
opt._sweep_replicas(...)" to "_, n_d = opt._sweep_replicas(...)" or "_n_p, n_d =
opt._sweep_replicas(...)" so the linter/test clearly reflects that n_p is not
used.

In `@components/src/dynamo/planner/tests/unit/test_prometheus.py`:
- Line 896: In components/src/dynamo/planner/tests/unit/test_prometheus.py there
are tuple unpackings like "recent, per_worker_avg, cluster_avg = result" where
some unpacked variables are unused; change the unused names to be prefixed with
an underscore (e.g., "recent, _per_worker_avg, _cluster_avg = result" or replace
with "_" placeholders) so the intent is explicit. Update the other similar
unpackings in the same file (the occurrences that unpack into recent,
per_worker_avg, cluster_avg) to follow the same underscore-prefixed naming
convention.

In `@deploy/power_agent/daemonset.yaml`:
- Around line 93-99: The current resources.requests and resources.limits in
daemonset.yaml (cpu: "50m"/"200m", memory: "64Mi"/"128Mi") are likely too low
for production; validate and raise them based on measured NVML API latency, 15s
controller reconcile list/watch overhead, and max GPUs per node by load-testing
with representative nodes (e.g., 8× H200 SXM). Measure
container_memory_working_set_bytes and rate(container_cpu_usage_seconds_total)
under stress to find steady-state and peak usage, then update resources.requests
to cover steady-state usage and resources.limits to provide safe headroom (avoid
OOM/throttling), and iterate: simulate NVML-heavy loops, reconcile cycles, and
multiple GPUs, adjust values in daemonset.yaml accordingly and add monitoring
alerts if working_set or CPU rate approaches limits.

In `@examples/deployments/powerplanner/README.md`:
- Around line 52-54: The fenced code block containing the budget formula
currently has no language specified; update the opening fence from ``` to a
language label (e.g., ```text or ```plaintext) so the block becomes ```text (or
```plaintext) followed by the same formula `total_gpu_power_limit =
(rack_capacity_W × headroom_factor) − non_gpu_overhead` and a closing ```,
ensuring consistency and clearer rendering in README.md.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a7e07320-b76e-4140-a984-1251cbfdf781

📥 Commits

Reviewing files that changed from the base of the PR and between 9db7176 and 3c0e9a7.

📒 Files selected for processing (47)

components/power_agent/README.md
components/power_agent/power_agent.py
components/power_agent/tests/__init__.py
components/power_agent/tests/test_apply_cap.py
components/power_agent/tests/test_cgroup_parser.py
components/power_agent/tests/test_multi_pod_policy.py
components/power_agent/tests/test_shutdown.py
components/src/dynamo/planner/config/defaults.py
components/src/dynamo/planner/config/planner_config.py
components/src/dynamo/planner/connectors/kubernetes.py
components/src/dynamo/planner/connectors/kubernetes_api.py
components/src/dynamo/planner/core/base.py
components/src/dynamo/planner/core/budget.py
components/src/dynamo/planner/core/load_scaling.py
components/src/dynamo/planner/core/state_machine.py
components/src/dynamo/planner/core/throughput_scaling.py
components/src/dynamo/planner/core/types.py
components/src/dynamo/planner/monitoring/aic_power_optimizer.py
components/src/dynamo/planner/monitoring/planner_metrics.py
components/src/dynamo/planner/monitoring/traffic_metrics.py
components/src/dynamo/planner/tests/integration/test_actuation_knobs_live.py
components/src/dynamo/planner/tests/integration/test_aic_power_e2e_sim.py
components/src/dynamo/planner/tests/integration/test_aic_power_optimizer.py
components/src/dynamo/planner/tests/integration/test_metric_paths_live.py
components/src/dynamo/planner/tests/unit/test_actuation_knobs.py
components/src/dynamo/planner/tests/unit/test_kubernetes_connector.py
components/src/dynamo/planner/tests/unit/test_prometheus.py
components/src/dynamo/planner/tests/unit/test_state_machine.py
components/src/dynamo/profiler/utils/aic_dataframe.py
deploy/helm/charts/platform/Chart.yaml
deploy/helm/charts/platform/README.md
deploy/helm/charts/platform/components/operator/Chart.yaml
deploy/helm/charts/platform/components/operator/templates/planner.yaml
deploy/planner-pod-rbac-dev.yaml
deploy/power_agent/daemonset.yaml
deploy/power_agent/rbac.yaml
docs/components/planner/dpp-dev-env.md
docs/design-docs/powerplanner-design.md
examples/deployments/powerplanner/MULTI_DGD.md
examples/deployments/powerplanner/PIPECLEAN.md
examples/deployments/powerplanner/README.md
examples/deployments/powerplanner/disagg-conservative-cold-start.yaml
examples/deployments/powerplanner/disagg-power-aware.yaml
examples/deployments/powerplanner/verify_poweraware.bash
tools/compute_power_budget.py
tools/integrate_aic_power_data.py
tools/validate_aic_power_integration.py

coderabbitai · 2026-05-10T13:55:33Z

+        pod_annotations: list[tuple[str, Optional[str]]] = []
+        for proc in procs:
+            uid = _extract_pod_uid_from_cgroup(proc.pid)
+            if uid is None:
+                continue  # non-K8s process — skip
+            if uid in uid_to_annotation:
+                pod_annotations.append((uid, uid_to_annotation[uid]))


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Same-pod multi-PID workloads are incorrectly flagged as multi-pod-per-GPU.

_reconcile_gpu appends one entry to pod_annotations per PID rather than per pod UID. NVML's nvmlDeviceGetComputeRunningProcesses returns every CUDA-context process on the GPU, so a single pod with multiple GPU-bound processes (TP workers, MPS clients, MPI ranks, child workers spawned by serving stacks) yields N tuples with the same UID and same annotation. _resolve_cap_for_gpu then sees len(pod_annotations) > 1, all unique == 1 → emits the WARN "multi-pod-per-GPU is unsupported topology" and increments multi_pod_gpu_total{disposition="agree"} on every reconcile (every 15 s).

The applied cap is still correct (values agree), but the warnings/metrics are false positives and contradict the policy comment at lines 319–325. Deduplicate by UID first:

🩹 Proposed fix

- pod_annotations: list[tuple[str, Optional[str]]] = [] - for proc in procs: - uid = _extract_pod_uid_from_cgroup(proc.pid) - if uid is None: - continue # non-K8s process — skip - if uid in uid_to_annotation: - pod_annotations.append((uid, uid_to_annotation[uid])) + # Deduplicate by pod UID — multiple PIDs from the same pod (TP workers, + # MPI ranks, MPS clients) must count as one pod, not N. + seen_uids: set[str] = set() + pod_annotations: list[tuple[str, Optional[str]]] = [] + for proc in procs: + uid = _extract_pod_uid_from_cgroup(proc.pid) + if uid is None or uid in seen_uids: + continue # non-K8s process or already-counted pod + seen_uids.add(uid) + if uid in uid_to_annotation: + pod_annotations.append((uid, uid_to_annotation[uid]))

A regression test for "single pod, two PIDs" in test_multi_pod_policy.py would lock this in.

Based on retrieved learning from PR 5280: "Either enforce an explicit one-to-one mapping with clear guards and comments, or extend the implementation to correctly handle multiple pods per GPU".

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/power_agent/power_agent.py` around lines 449 - 455, _reconcile_gpu currently appends one (uid, annotation) tuple per PID causing duplicate entries for multi-PID pods; change the logic to deduplicate by pod UID before calling _resolve_cap_for_gpu (e.g., build a set or map of seen UIDs using _extract_pod_uid_from_cgroup and uid_to_annotation, then produce a single (uid, annotation) entry per UID into pod_annotations) so _resolve_cap_for_gpu sees unique pods; update references to pod_annotations and keep uid_to_annotation lookup as before; add a regression test in test_multi_pod_policy.py that covers "single pod, two PIDs" to prevent regressions.

coderabbitai · 2026-05-10T13:55:33Z

@@ -0,0 +1 @@
+


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add the standard SPDX header to this new package file.

copyright-checks is already failing on this file, so the empty __init__.py still needs the repo header before merge.

🧰 Tools

🪛 GitHub Actions: Copyright Checks / 0_copyright-checks.txt

[error] 1-1: Copyright checkers detected invalid/missing header: components/power_agent/tests/init.py

🪛 GitHub Actions: Copyright Checks / copyright-checks

[error] 1-1: Copyright check failed: invalid/missing header.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/power_agent/tests/__init__.py` at line 1, Add the standard repository SPDX header to the new package module components.power_agent.tests.__init__.py; open the empty __init__.py and prepend the project's copyright/SPDX license header block (the same header used across other top-level Python files in the repo) so the file passes copyright-checks.

coderabbitai · 2026-05-10T13:55:33Z

+import power_agent
+from power_agent import (
+    POWER_ANNOTATION_KEY,
+    PowerAgent,
+    _apply_cap,
+    _clamp_to_constraints,
+    _nvml_uuid,
+)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard the module import so optional NVML/K8s deps don't break collection.

This imports power_agent eagerly, so any missing optional dependency will fail test collection instead of skipping the module. Wrap the import in try/except ImportError and call pytest.skip(..., allow_module_level=True) at module scope. As per coding guidelines, **/test_*.py: Use try/except ImportError to skip test collection when optional packages are absent with pytest.skip(..., allow_module_level=True).

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/power_agent/tests/test_apply_cap.py` around lines 17 - 24, The test module imports power_agent eagerly which can raise ImportError for optional NVML/K8s deps and break test collection; wrap the import block that pulls in power_agent and symbols (POWER_ANNOTATION_KEY, PowerAgent, _apply_cap, _clamp_to_constraints, _nvml_uuid) in a try/except ImportError and in the except call pytest.skip("optional dependency missing: <package>", allow_module_level=True) at module scope (ensure pytest is imported at top-level so skip can be called).

coderabbitai · 2026-05-10T13:55:33Z

+        try:
+            import httpx
+        except ImportError as exc:
+            raise RuntimeError(
+                "httpx is required for admission-control POST support. "
+                "Install it with: pip install httpx"
+            ) from exc


🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Move the httpx import to module top.

This import is inside the function body, which violates the Python guidelines. The guidelines state: "flag any import inside function/method/class bodies (keep imports at module top)" and "try/except ImportError around imports must be avoided unless the dependency is truly optional and a fallback/None path exists."

Here, the function raises RuntimeError when httpx is missing, so there's no fallback—httpx is a hard requirement for this code path. Move the import to the top of the file.

📦 Move import to module top

import json import logging import os from typing import Optional + +try: + import httpx +except ImportError: + httpx = None from dynamo.planner.config.defaults import SubComponentType, TargetReplica

Then in the function:

) -> None: """POST /busy_threshold to a single frontend pod (§6.7). Raises on HTTP error or network failure — callers should gather with ``return_exceptions=True`` and inspect the results. Per-pod timeout is 5 s (frontend is in-cluster; long timeouts mask saturation events). """ - try: - import httpx - except ImportError as exc: + if httpx is None: raise RuntimeError( "httpx is required for admission-control POST support. " "Install it with: pip install httpx" - ) from exc + )

Based on learnings: Python guidelines require imports at module top unless truly optional with a fallback/None path. As per coding guidelines, this pattern avoids dual execution modes and makes dependencies explicit.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/src/dynamo/planner/connectors/kubernetes.py` around lines 384 - 390, Move the httpx import to the module top and remove the try/except ImportError block in the admission-control POST handling code: add a plain "import httpx" at the top of components/src/dynamo/planner/connectors/kubernetes.py, delete the in-function try/except that currently raises RuntimeError for missing httpx, and rely on the module-level import to surface ImportError early; update any references in the admission-control POST support code to use httpx directly (no fallback or conditional logic needed).

coderabbitai · 2026-05-10T13:55:33Z

+    async def _apply_power_annotations(self) -> None:
+        """Annotate worker pods with per-GPU power limits (Phase 1).
+
+        Reads the actual annotation from each Pod object returned by
+        get_component_pods(). Only PATCHes when annotation is missing or wrong.
+        K8s is the source of truth — no local cache.
+
+        Called after _apply_effects() on every tick so newly-created pods
+        pick up the cap within one tick, and AIC-driven cap changes (Phase 3+)
+        propagate to all running pods automatically.
+        """
+        if not self.config.enable_power_awareness:
+            return
+        if not isinstance(self.connector, KubernetesConnector):
+            return


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Advisory mode still writes to the cluster through the new power/admission paths.

_apply_scaling_targets() honors self.config.advisory, but _apply_power_annotations() and the autoset /busy_threshold fanout do not. With advisory=True and power awareness enabled, the planner will still PATCH pod annotations and POST frontend thresholds, which breaks the dry-run contract and changes live behavior.

Suggested change

async def _apply_power_annotations(self) -> None: """Annotate worker pods with per-GPU power limits (Phase 1). @@ - if not self.config.enable_power_awareness: + if self.config.advisory or not self.config.enable_power_awareness: return if not isinstance(self.connector, KubernetesConnector): return @@ - if self.config.admission_mode == "autoset" and isinstance( - self.connector, KubernetesConnector - ): + if ( + not self.config.advisory + and self.config.admission_mode == "autoset" + and isinstance(self.connector, KubernetesConnector) + ): await self._fanout_busy_threshold(new_config)

Also applies to: 899-902

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/src/dynamo/planner/core/base.py` around lines 753 - 767, The planner's advisory/dry-run flag isn't respected: update _apply_power_annotations to bail out when self.config.advisory is True (in addition to existing checks) so it never PATCHes pod annotations in advisory mode; likewise locate the autoset fanout logic that posts /busy_threshold (the code around the second occurrence noted) and add the same advisory guard to skip POSTing frontend thresholds when self.config.advisory is True. Ensure both guards check self.config.advisory before making any Kubernetes PATCH or HTTP POST calls so advisory mode performs no writes.

coderabbitai · 2026-05-10T13:55:33Z

+# Apply with an explicit namespace flag:
+#     kubectl apply -n <your-namespace> -f deploy/power_agent/rbac.yaml
+#
+# The ClusterRoleBinding subject below references that same namespace via
+# `${POWER_AGENT_NAMESPACE}` — substitute it manually or pipe through envsubst:
+#     POWER_AGENT_NAMESPACE=<your-namespace> envsubst < rbac.yaml | kubectl apply -f -


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

cat -n deploy/power_agent/rbac.yaml

Repository: ai-dynamo/dynamo

Length of output: 1758

The plain kubectl apply -n ... path leaves the binding broken because it doesn't substitute the placeholder.

When you run kubectl apply -n <your-namespace> -f deploy/power_agent/rbac.yaml, the ServiceAccount is created in the specified namespace via the -n flag, but the ClusterRoleBinding.subjects[0].namespace field (line 50) retains the literal string ${POWER_AGENT_NAMESPACE}. Since kubectl does not perform variable substitution, the binding references a non-existent namespace and the pod-read permissions never take effect. Only the envsubst path (lines 13–14) substitutes the placeholder correctly. Recommend either: (1) removing the first documentation path, or (2) updating it to clarify that manual substitution of line 50 is required before applying.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/power_agent/rbac.yaml` around lines 9 - 14, The docs show an apply path that leaves the ClusterRoleBinding.subjects[0].namespace set to the literal placeholder `${POWER_AGENT_NAMESPACE}`, which breaks the binding because kubectl does not perform variable substitution; update the README snippet to either remove the misleading "kubectl apply -n <your-namespace> -f deploy/power_agent/rbac.yaml" path or change it to explicitly instruct the user to substitute the placeholder (e.g., edit the ServiceAccount and ClusterRoleBinding.subjects[0].namespace or use envsubst) so the ServiceAccount namespace and ClusterRoleBinding.subjects[0].namespace match.

coderabbitai · 2026-05-10T13:55:33Z

+        print()
+        print("  Additional planner settings to review:")
+        print(f"  power_agent_safe_default_watts: {math.ceil(self.gpu_tdp_w * 0.70)}")
+        print(f"    (= 70% of TDP; adjust for your model's idle power profile)")


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remove the stray f prefix on Line 148.

Ruff is flagging this as F541, so the file will keep failing pre-commit until that string is a plain literal again.

🧰 Tools

🪛 Ruff (0.15.12)

[error] 148-148: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/compute_power_budget.py` at line 148, The print statement currently uses an unnecessary f-string prefix ("print(f\" (= 70% of TDP; adjust for your model's idle power profile)\")") which triggers Ruff F541; change the call to a plain string literal by removing the leading "f" so the statement is normal (e.g., update the print call in compute_power_budget.py where that string is printed). Ensure no other f-prefix remains on that exact print invocation.

coderabbitai · 2026-05-10T13:55:34Z

+    total_gpus = args.nodes * args.gpus_per_node
+    non_gpu_overhead = (
+        args.non_gpu_overhead
+        if args.non_gpu_overhead is not None
+        else args.nodes * _NON_GPU_OVERHEAD_PER_NODE_W
+    )
+
+    if not (0 < args.headroom_factor <= 1.0):
+        print(
+            f"ERROR: --headroom-factor must be in (0, 1]. Got {args.headroom_factor}.",
+            file=sys.stderr,
+        )
+        sys.exit(2)
+
+    inp = SizingInput(
+        rack_capacity_w=args.rack_capacity,
+        headroom_factor=args.headroom_factor,
+        non_gpu_overhead_w=non_gpu_overhead,
+        total_gpus=total_gpus,
+        gpu_tdp_w=gpu_tdp,
+        dgd_gpu_counts=args.dgd_gpus or [],


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate the numeric CLI inputs before computing the budget.

argparse accepts 0 and negatives for these fields. With --nodes 0 / --gpus-per-node 0 the tool can emit a meaningless budget for zero GPUs, and in multi-DGD mode it can fall into gpus / inp.total_gpus inside compute_budget().

Suggested fix

total_gpus = args.nodes * args.gpus_per_node non_gpu_overhead = ( args.non_gpu_overhead if args.non_gpu_overhead is not None else args.nodes * _NON_GPU_OVERHEAD_PER_NODE_W ) + + if args.nodes <= 0 or args.gpus_per_node <= 0: + print("ERROR: --nodes and --gpus-per-node must be > 0.", file=sys.stderr) + sys.exit(2) + if gpu_tdp <= 0: + print("ERROR: --gpu-tdp must be > 0.", file=sys.stderr) + sys.exit(2) + if non_gpu_overhead < 0: + print("ERROR: --non-gpu-overhead must be >= 0.", file=sys.stderr) + sys.exit(2) + if any(g <= 0 for g in (args.dgd_gpus or [])): + print("ERROR: every --dgd-gpus entry must be > 0.", file=sys.stderr) + sys.exit(2)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/compute_power_budget.py` around lines 316 - 336, Validate the numeric CLI inputs before building SizingInput and calling compute_budget: check that args.nodes and args.gpus_per_node are positive integers (greater than 0) so total_gpus = args.nodes * args.gpus_per_node > 0, and also validate args.rack_capacity and gpu_tdp (and any other numeric args used in SizingInput such as args.non_gpu_overhead and elements of args.dgd_gpus) are positive where applicable; if any validation fails, print a clear stderr error and exit non-zero. Locate the creation of total_gpus and the SizingInput construction in compute_power_budget.py (symbols: total_gpus, SizingInput) and add these pre-checks before calling compute_budget() to avoid division-by-zero or meaningless budgets.

coderabbitai · 2026-05-10T13:55:34Z

+    h200_data = args.h200_data.resolve() if args.h200_data else None
+    b200_data = args.b200_data.resolve() if args.b200_data else None


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject nonexistent data roots before starting integration.

A typo in --h200-data or --b200-data currently degrades into a stream of per-subdir warnings and a zero-file success exit. That makes a bad input path look like “nothing to copy” instead of a hard failure.

Suggested fix

h200_data = args.h200_data.resolve() if args.h200_data else None b200_data = args.b200_data.resolve() if args.b200_data else None + + for flag, path in (("--h200-data", h200_data), ("--b200-data", b200_data)): + if path is not None and not path.is_dir(): + print(f"ERROR: {flag} directory not found: {path}", file=sys.stderr) + sys.exit(1)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/integrate_aic_power_data.py` around lines 230 - 231, After resolving args.h200_data and args.b200_data into h200_data and b200_data, validate that any non-None Path actually exists (and is a directory) and fail fast if not: if h200_data is not None and not h200_data.exists()/is_dir() (same for b200_data) log/raise a clear error and exit with non-zero status (or call the argparse parser.error) so a mistyped root produces a hard failure instead of silent "nothing to copy"; update the validation near the h200_data/b200_data assignment to perform these checks and produce a descriptive message including the offending path variable name.

coderabbitai · 2026-05-10T13:55:34Z

+    5. Budget constraint: with a budget tighter than 1 full replica's power draw,
+       the optimizer correctly returns fewer replicas than the requested max.


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

The advertised tight-budget validation never actually runs.

The docstring says this script verifies the optimizer under a budget tighter than one full replica, but the only optimizer run uses an intentionally generous 80_000 W cap and never asserts replica reduction. This can pass even when the tight-budget path is broken.

Also applies to: 231-236, 272-290

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/validate_aic_power_integration.py` around lines 20 - 21, The tight-budget test never runs because the script only calls the optimizer with a generous hardcoded cap (80_000); add an explicit tight-budget run that sets the budget to just under one replica's power and assert the optimizer returns fewer replicas than requested. Concretely: compute the single-replica draw (use the same attribute/func the script uses to estimate replica power, e.g., replica_power_watts or equivalent), call the optimizer (the existing optimizer/optimize/run_optimizer call) with budget = single_replica_power - 1, and add an assertion that the returned replica count < requested_max_replicas (mirror the style used elsewhere in the file). Ensure this new check is executed (not behind a dead branch) alongside the existing generous-cap test (the one using 80_000).

grahamking

I'm not sure if you intended to send an 11,000+ line PR, but that's what this is. Please split this into separate far smaller pieces.

Ideally start with an Issue to set the context and attach your agent's plan (if using one).

Adds power-aware intelligence to the Dynamo planner across three layers, plus tooling that pipecleans the Phase 4 selection path with currently available single-TDP AIC data. Reimplementation of unmerged PR #5280 against post-refactor ToT (planner/utils/ no longer exists; configuration is Pydantic; layout is config/, connectors/, core/, monitoring/). Full design: docs/design-docs/powerplanner-design.md Repro / dev environment: docs/components/planner/dpp-dev-env.md What's in this PR ----------------- Phase 1 — infrastructure - Power Agent DaemonSet: components/power_agent/{power_agent.py,tests/} Watches pod annotations, maps cgroup→pod_uid→GPU via NVML, calls nvmlDeviceSetPowerManagementLimit on a 15 s reconciliation loop. Multi-pod-per-GPU policy, fail-closed safe-default, SIGTERM restore. - Pod annotation actuation: connectors/kubernetes.py writes dynamo.nvidia.com/gpu-power-limit on worker pods. - K8s manifests: deploy/power_agent/{daemonset,rbac}.yaml. - Operator chart fix: ClusterRole now grants `pods` permissions (deploy/helm/charts/platform/components/operator/templates/planner.yaml, chart bumped to 1.2.1). Older clusters can apply deploy/planner-pod-rbac-dev.yaml as a temporary patch. Phase 2 — budget enforcement - _apply_power_budget() in core/state_machine.py clamps replica scaling within a watt budget. New PlannerConfig fields: enable_power_awareness, total_gpu_power_limit (required when awareness enabled — validator enforces, no silent default), prefill/decode_engine_gpu_power_limit, power_agent_safe_default_watts. Phase 3 — AIC optimizer - monitoring/aic_power_optimizer.py: AIConfigurator sweep picks (replica, power) configs that maximise throughput within the watt budget. Per-component EMA correction (c_ttft, c_itl, c_power_p, c_power_d for disagg; c_power_agg for agg) closes the loop on silicon-vs-AIC drift. SLA gate, budget gate, capacity-exceeded re-optimize, auto-disable on consecutive failures. - core/base.py: NativePlannerBase plumbing (admission threshold push, named-port resolution, AIC integration). Phase 4 — pipeclean (using existing single-TDP data) - tools/integrate_aic_power_data.py: copies measured H200/B200 power data into an AIC checkout and reinstalls. - tools/validate_aic_power_integration.py: asserts estimate_perf() returns power_w in [100, 710] W (not 0.0, not 700.0 TDP fallback). - tools/compute_power_budget.py: rack-capacity → safe-budget helper. - examples/deployments/powerplanner/: PIPECLEAN.md runbook, disagg-{power-aware,conservative-cold-start}.yaml, verify_poweraware.bash (8-section health check inc. Phase 4 preview), MULTI_DGD.md operator playbook, README.md. Bug fixes (found by live integration tests, fixed in this patchset) ------------------------------------------------------------------- - Operator chart 1.2.0 ClusterRole had no `pods` rules → planner couldn't list workers or patch annotations. Fixed in chart template; workaround manifest provided for clusters not yet upgraded. - Stale label selectors in connectors/kubernetes.py: `dynamo-graph-deployment` → `dynamo-graph-deployment-name`, `dynamo-service` → `dynamo-component`/`dynamo-component-type`. - DCGM queries used `pod=~` (matched the DCGM exporter pod) and the wrong namespace; corrected to `exported_pod=~` + `exported_namespace=<k8s-namespace>` and pod regex now matches the operator's `<dgd>-<replica-idx>-<service-key>-...` format (monitoring/traffic_metrics.py). - Frontend-source metric methods returned NaN on quiet clusters (0/0 in `increase(_sum)/increase(_count)`); router path had a NaN guard, frontend didn't — added matching `math.isnan` filter. Tests ----- Verified 2026-05-10 on a clean from-scratch repro on a live Azure AKS dev cluster (qwen3-quickstart DGD, single dev namespace): Suite Pass Skip Fail ---------------------------------------------------- ---- ---- ---- planner/tests/unit/ 465 0 0 planner/tests/integration/test_aic_power_e2e_sim 15 0 0 planner/tests/integration/test_aic_power_optimizer 34 0 0 planner/tests/integration/test_metric_paths_live 22-23 3-2 0 planner/tests/integration/test_actuation_knobs_live 10-11 1-0 0 power_agent/tests/ 43 0 0 ---------------------------------------------------- ---- ---- ---- Total (cold deploy / after sanity chat completion) 590-591 4-3 0 Documented skips: - test_frontend_metric_series_exists — passes after any traffic. - TestDirectRouterMetricsClientLive::* — LocalRouter doesn't expose dynamo_frontend_worker_* gauges (open-question #14 in design doc). - TestScalingRealMutation — opt-in disruptive test, gated by RUN_DISRUPTIVE_TESTS=1 (passes when enabled). Repro recipe: docs/components/planner/dpp-dev-env.md "From-Scratch Repro Script" section. Compatibility ------------- Power awareness is opt-in (enable_power_awareness=False by default). AIC optimizer is opt-in (enable_aic_optimizer=False by default). No existing planner behavior changes when both flags are off. Phase 4 (multi-power-level AIC sweep) and Phase 5 (silicon validation) remain follow-up work; this PR delivers Phases 1–3 plus the Phase 4 pipeclean code path so it can be exercised before AIC ships the full sweep API. Signed-off-by: Kai Ma <kaim@nvidia.com> Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: Kai Ma <kaim@nvidia.com>

Adds the synthetic-metrics stress testbed for the Power Planner state machine + AIC optimizer, plus the small source-side hooks that let the testbed swap in fake AIC and clamp tick-time monotonicity. Testbed (alpha + gamma classes, no GPU / no live cluster required): - 27 alpha-class scenarios driven by parameterized fakes (FakeAIC, FakeActuator, FakePrometheus, FakePlannerMetrics) wired against the real PlannerStateMachine + AICPowerOptimizer - 3 gamma-class scenarios that overlay synthetic power on the real dynamo-mocker scheduler + PlannerReplayBridge for realism back-stop - Self-consistency + overlay determinism tests - YAML scenario catalog (A/B/C/D/E/F/G families) with shared _base/ and systems/ definitions; loadable gate prevents silent schema drift - Auto-skips gamma when the local Python wheel ships only the testbed stub (no Rust mocker), so the same suite is runnable on Windows laptops AND inside the dev pod - Verified: 82 passed, 5 skipped on Windows; 86 passed, 1 skipped on the dev pod against the older create_disagg bridge Source enablement: - AICPowerOptimizer: optional _aic_estimator_factory injection point for testbed substitution; production path unchanged - ReplayPlannerAdapter: clamp now_s = max(tick.at_s, bridge_now_s) so sparse traces do not wedge throughput-tick scheduling Misc: - components/src/conftest.py: top-of-src dynamo._core stub installer (needed before any dynamo.* package init runs) - scripts/inspect_dcgm_qwen3.py, scripts/inspect_live_metrics.py: Phase 4 cluster-side metric probes - test_planner_power_launch.py: Phase 4 advisory-mode launch script used to validate the planner -> annotation -> NVML enforcement chain - docs/design-docs/powerplanner-testbed-design.md: full testbed design + defect/fix history through v2.3 (dev-pod gamma validation) Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: Kai Ma <kaim@nvidia.com>

Pre-Phase-5 ("hardware validation" per powerplanner-design.md §11) housekeeping: makes the dev environment shareable across teammates by removing personal identifiers, parameterizing all dev-pod / probe references, and folding 11 review fixes into the three Phase 1-4 design documents. Dev-env hardening ----------------- * Personal namespace (`kaim-dynamo-system*`) and pinned cluster node ID (`aks-a100a-36888584-vmss000002`) removed from every dev-env asset: - Root-level `dev-pod.yaml`, `qwen3-quickstart-dgd.yaml`, and `Dockerfile.planner-dev` moved to `deploy/planner/dev/` with `${NS}` / `${DGD}` / `${DYN_NS}` envsubst placeholders and inline usage instructions. - Root-level `test_k8s_access.py` moved to `scripts/dev/` and reads `DYN_PARENT_DGD_K8S_NAMESPACE` (or `POD_NAMESPACE`) at runtime. - 5 `scripts/inspect_*.py` cluster probes parameterized via `DYN_PARENT_DGD_K8S_NAMESPACE`; failure mode is loud (SystemExit) rather than a hard-coded namespace. - `deploy/power_agent/dev-pod.yaml`: `nodeName` switched to `<GPU_NODE_NAME>` placeholder with a `kubectl get pods ...` one-liner showing how to discover the right node. * `.gitignore` hardened to enforce the existing `.tmp-*` "intentionally not committed" convention (matches `examples/deployments/powerplanner/ .tmp-gp-minimal.yaml`) and to block the four common root-level personal-scratch files from sneaking back in via `git add .`. dpp-dev-env.md updates ---------------------- * All 10 path references rewritten to point at the new `deploy/planner/dev/` and `scripts/dev/` homes. * New §5 ("Deploy the Dev Pod") subsection documenting the `${NS}` / `${DGD}` / `${DYN_NS}` placeholder workflow with both an `envsubst` (Linux/WSL) path and a Windows edit-in-place path. * Quick Deploy Checklist filename corrected from the stale-DGDR `qwen3-quickstart.yaml` to `qwen3-quickstart-dgd.yaml`, plus a cross-reference to the One-Time Setup §4 warning. * DGD-ready wait command standardized to the programmatic `kubectl wait --for=jsonpath='{.status.state}'=successful` form in both §4 and the Checklist (removes the manual-`-w` divergence). Design-doc review pass (powerplanner-design.md, powerplanner-testbed-design.md) ------------------------------------------------------------------------------- * powerplanner-design.md - Header `Status` flipped from `Draft` to `Validated (Phases 1-3 - 590/4 cold + 86/1 testbed; see §9.0 / §9.0.1). Phases 4-5 still draft.` with last-validation date. - §3.1: added an explicit note that `aic_interpolation` and `mode` are pre-existing PlannerConfig fields (owned by `monitoring/aic_interpolation.py`) and intentionally absent from the Names Registry. - §5.7 / §6.7: reworded the cross-reference to failure mode #5 so it correctly points at the *throughput regression* case rather than reading as a blanket "config revert". - §6.5 pseudocode: replaced module-level `self.device_count` with `pynvml.nvmlDeviceGetCount()` (the original was syntactically incorrect outside the daemon class). - §13 Open Question #11: corrected the duplicated `scheduled_decode_kv_tokens` typo so the agg-mode gate now reads `scheduled_prefill_tokens + scheduled_decode_kv_tokens` matching §5.3. * powerplanner-testbed-design.md - §7: added a "Numeric-suffix convention" note explaining the intentional D21/E21 ID collision (IDs unique by filename + tuple, not renumbered). - §11: corrected "Six guards" -> "Seven guards" to match the seven items actually listed. - §5.2: typo `MNB` -> `MNBT (max_num_batched_tokens)`. - §C.14: verbatim test-output `30 PASSED` -> `31 PASSED` (1 wrapper + 30 parametrized) so the listing matches the actual run. Verification ------------ * Testbed (alpha + gamma): 82 passed, 5 skipped (matches the documented Windows baseline; gamma auto-skipped without the Rust mocker). * AIC no-cluster integration (test_aic_power_optimizer.py + test_aic_power_e2e_sim.py): 49 passed. * Unit tests: 456 passed; the 9 remaining failures are pre-existing Windows-only environment limitations (cp1252 codec, `os.killpg` POSIX-only, missing `filterpy`) confirmed via `git stash` to be untouched by this commit. * All touched .py files: `python3.10 -m py_compile` clean. * All touched .yaml files: `yaml.safe_load_all` clean. * `ReadLints` over all 12 touched files: no errors. * Final `rg "kaim|aks-a100a-36888584"` outside committed `tests/fault_tolerance/...` (pre-existing, separate component) and `examples/.../.tmp-gp-minimal.yaml` (now gitignored): zero hits. Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: Kai Ma <kaim@nvidia.com>

Three minor issues surfaced by an end-to-end run of the optimizer against real H200/B200 power-enabled AIC data (Llama-3.1-8B, ISL=2048, OSL=256): 1. Non-physical AIC power_w. AIC's per-kernel 3D-cubic interpolator extrapolates `generation_attention` to ~1563 W per op on H200 vLLM 0.19.1 at batch >= 256, producing an aggregate power_w of 1275 W against a 700 W TDP and a 721 W measured peak. Cap_d was ending up at 1276 W, which NVML would silently clamp to TDP -- leaving the planner's stated cap fictional and pegging the EMA at 0.5 within a few ticks. AICPowerOptimizer now clamps the value used for cap derivation at TDP*1.1; the *raw* value is preserved on PowerAwareConfig as the EMA denominator so c_power_* coefficients visibly converge below 1.0 whenever AIC over-predicts, giving operators an independent signal alongside the new `aic_power_w_clamped_total{side}` counter. New design-doc §8 row 14 wires the cascade with the existing §6 EMA-peg signal. 5 new integration tests (TestAICPowerWClamp): non-physical clamp, raw value preserved on config, side-selective clamp, threshold boundary, agg-mode averaged path. The B200 TRT-LLM 1.3.0rc6 data stays cold on this clamp because its lattice is ~10x denser than the H200 sandbox -- confirms this is a data-density issue, not a fundamental AIC bug. 2. Wrong runtime documentation. optimize docstring (and powerplanner-design.md sec 6.4 + sec 10 row 12) claimed ~6 s per sweep. Measured 2026-05-11 against real sandboxes: cold first call is 3.5 s (H200 vLLM) / 5.9 s (B200 TRT-LLM), dominated by perf-database load from disk; warm calls are 10-15 ms once databases_cache is populated. Closed-loop CPU duty cycle at the default aic_reoptimize_interval=300s is ~0.004 percent, not the ~2 percent the docstring implied. Docs updated to distinguish cold vs warm and to clarify that the hysteresis guard exists to suppress downstream PATCH/POST fanout, not AIC CPU. 3. No regression coverage against real AIC data. New opt-in testbed class wires the AIC sandbox into pytest via the AIC_SANDBOX_DIR env var (marker real_aic). 11 tests parametrized over every (system, backend) tuple present in the sandbox: smoke + clamp contract + multi-tick EMA drift loop (well-calibrated -> 1.0, over-predict -> pegs at 0.5, under-predict -> ~1.5, hysteresis fires at exactly aic_drift_consecutive_ticks, healthy load produces zero spurious sweeps). Module-level skip when the env var is unset, keeping the default test surface unchanged. Test impact: - tests/integration/test_aic_power_optimizer.py: 34 -> 39, all pass - tests/testbed/tests/: 55 -> 66 with sandbox mounted (11 new real_aic tests); 55 + 3 skips without (1 module skip + 2 gamma skips), default test surface unchanged. Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: Kai Ma <kaim@nvidia.com>

…bits) Mechanical post-rebase cleanup surfaced by `pre-commit run` on PR 9369: - isort/black reformat across 47 power-planner files - ruff F841: drop dead assignments (kwargs/sweep_fired/c_ttft_before_stable/clamp_threshold) - ruff F821: add missing TYPE_CHECKING imports for StructuredAssertion, ExprAssertion, NoiseModel, TickHistory in testbed forward-refs - ruff E402: `# noqa: E402` on intentional late imports after sys.path / stub install (testbed/conftest.py, test_planner_power_launch.py) - chmod +x: 5 scripts with shebangs (power_agent.py, compute/integrate/validate_aic_power tools, verify_poweraware.bash) No behavior changes. Signed-off-by: Kai Ma <kaim@nvidia.com>

…proportional_clamp_pair PR 9369 was based on a pre-#9294 _apply_global_budget that had an inline ceiling-shrink path with min(num_d, ...) to prevent decode up-scaling beyond original demand (Open Question #2). The rebase took main's #9294 refactor that delegates to the new budget.proportional_clamp_pair helper. That helper distributes the remaining budget proportionally and can over-allocate decode in the asymmetric (p_gpu != d_gpu) case. Re-introduce the clamp at the call site rather than in the shared helper, so we don't change the contract for other callers (e.g. SLA-mode scale-down consolidation, also using the helper). Fixes: test_decode_never_exceeds_input_when_prefill_binding Signed-off-by: Kai Ma <kaim@nvidia.com>

…ebase Main's #9246 (chore: suffix duration config fields with units) renamed PlannerConfig.ttft -> ttft_ms and PlannerConfig.itl -> itl_ms. Construction keeps working via pydantic's validation_alias=AliasChoices("ttft_ms", "ttft"), so PlannerConfig(ttft=..., itl=...) keyword construction is still accepted, but Python attribute access must use the new names. PR 9369 was based on the pre-rename code and read the SLAs via .ttft / .itl on PlannerConfig instances: - aic_power_optimizer.py: 2 reads (single-engine feasibility check) + 2 reads (should_reoptimize SLA-violation check) - test_aic_real_data.py: 1 read (synthesizing 2x-SLA traffic) All 4 + 1 sites updated. Surfaced as AttributeError when running the 26 alpha-class testbed scenarios (test_scenarios.py::test_alpha[*]). Fixes: 26 testbed alpha-scenario failures, 1 test_self_consistency alpha test Signed-off-by: Kai Ma <kaim@nvidia.com>

Rebase onto main pulled in PR #9246's PlannerConfig field renames ( tft-> tft_ms, itl->itl_ms, hroughput_adjustment_interval-> hroughput_adjustment_interval_seconds, load_adjustment_interval->load_adjustment_interval_seconds). Two of the four ms-suffix accesses were updated in the prior fixup, but power_aware_replay_adapter.py still accessed the old hroughput_adjustment_interval attribute. Pydantic's AliasChoices only covers constructor input, not attribute access — so all four gamma-class testbed tests failed with:: AttributeError: 'PlannerConfig' object has no attribute 'throughput_adjustment_interval'. Did you mean: 'throughput_adjustment_interval_seconds'? This patch renames the attribute access at line 111 to the new canonical field name. Constructor sites still work via AliasChoices. Signed-off-by: Kai Ma <kaim@nvidia.com>

…uced suites Add pytestmark with pre_merge/unit-or-integration/gpu_0/planner markers to seven testbed and live-integration test files introduced by this PR. The pytest-marker-report pre-commit hook (newly enforced on main) requires every test to declare a Lifecycle, Test Type, and Hardware category; without these markers it fails with 'Missing sets: 122' across all PR9369 tests. Marker assignments: testbed unit suites (test_fakes, test_overlay, test_scenarios_loadable) get unit+planner; testbed scenario drivers (test_scenarios, test_self_consistency) get integration+planner since they run full planner closed-loop simulations; live K8s tests (test_metric_paths_live, test_actuation_knobs_live) already had integration+gpu_0 but lacked a Lifecycle marker ? added pre_merge (they skip cleanly when no cluster) and planner. Signed-off-by: Kai Ma <kaim@nvidia.com>

Adds the 2-line NVIDIA SPDX header to 49 files introduced by the power-aware planner work (testbed scenario YAMLs, testbed Python suites, dev/inspect scripts, and root-level pipeclean launcher) so they pass copyright-checks. Extends the planner filter group in .github/filters.yaml to cover the new power_agent component, planner dev manifests, dev scripts under scripts/dev and scripts/inspect_*, and the tools/*power*.py helpers, so changed-files no longer reports them as uncovered. Signed-off-by: Kai Ma <kaim@nvidia.com>

Main PR #9558 (`feat: introduce RoutingConstraints + worker taints`) added `from dynamo._core import RoutingConstraints` to `dynamo.llm.__init__`. Without this symbol in the testbed runtime stub, every planner unit test that transitively imports `dynamo.common.utils.endpoint_types` (-> `dynamo.llm`) fails at collection time on dev machines that don't have the maturin-built native binding. Add `RoutingConstraints` to the stubbed class list in `tests/testbed/_runtime_stub.py`. Idempotent, doesn't affect CI or the dev pod (real `dynamo._core` is loaded there). Signed-off-by: Kai Ma <kaim@nvidia.com>

Adds the Power Agent DaemonSet (privileged, hostPID) that watches pods for the dynamo.nvidia.com/gpu-power-limit annotation, parses /proc/{pid}/cgroup to recover pod UIDs (handles all K8s QoS x cgroup driver x CRI runtime variants), and calls nvmlDeviceSetPowerManagementLimit per physical GPU. Includes NVML cap clamping to SKU-specific [min_w, max_w] bounds, UUID-gated orphan-cap restoration on startup, SIGTERM-handler default- TGP restore, and a multi-pod-per-GPU policy that falls back to power_agent_safe_default_watts on conflict. 43 unit tests covering apply_cap, cgroup_parser, multi_pod_policy, shutdown. Standalone component - no coupling to planner internals. CI filter (.github/filters.yaml) extends the planner group to cover the new component + deploy paths so changes here trip the planner job set. Part of the PR #9369 split (PR 1a of 6). See docs/design-docs/pr9369-split-plan.md. Signed-off-by: Kai Ma <kaim@nvidia.com>

…rometheus Adds the planner-side mechanism for power-aware operation: - 5 PlannerConfig fields (enable_power_awareness defaults False) with required-when-enabled validator for total_gpu_power_limit and power_agent_safe_default_watts - _apply_power_annotations() reconciliation loop on every tick, gated on the feature flag, with POWER_ANNOTATION_KEY constant - _publish_power_budget_metrics() dashboard gauges (static config, not DCGM, so it remains valid when attribution drops) - CoreV1Api + patch_pod_annotation + list_pods_by_label on KubernetesAPI - get_component_pods + list_frontend_pods + resolve_frontend_http_port on KubernetesConnector - get_total_dgd_power DCGM Prometheus query + NaN-filter fix on the frontend-source path - 3 planner_metrics gauges: power_budget_total_watts, power_projected_watts, power_budget_utilization - Operator Helm chart bump 1.2.0 -> 1.2.1 with planner pod-RBAC patch (pods: get/list/watch/patch) plus deploy/planner-pod-rbac-dev.yaml for non-operator development clusters - repo-shared components/src/conftest.py (planner pkg conftest) - 623-line unit suite (test_actuation_knobs.py) + 567-line live integration suite (test_actuation_knobs_live.py) covering every actuation knob this PR adds - Unit test additions on test_kubernetes_connector.py (~119 lines) and test_prometheus.py (NaN tests + TestPowerAwareDcgmQueries get_total_dgd_power subset, ~100 lines) - CI filter (.github/filters.yaml) extends the planner group with components/src/conftest.py Power awareness is off by default. Budget enforcement (_apply_power_budget) and the AIC closed loop arrive in PR 2 and PR 3. Part of the PR #9369 split (PR 1b of 6). Predecessor PR 1a (power-agent component) must merge first or land in parallel. See docs/design-docs/pr9369-split-plan.md. Signed-off-by: Kai Ma <kaim@nvidia.com>

Adds _apply_power_budget() to PlannerStateMachine, wired into both load-scaling and throughput-scaling paths (one-line call site each). Budget enforcement runs only when enable_power_awareness=True and total_gpu_power_limit is set, so it is a no-op on existing deployments. Includes a no-upscale-decode invariant in _apply_global_budget: when proportional_clamp_pair (from main's budget helper, see PR #9296) would allocate decode beyond original demand because prefill is the binding constraint, decode is clamped back to min(num_d, min_endpoint). This preserves the original behavioural contract from the pre-rebase inline formula while delegating to main's shared clamp. 68-line TestApplyGlobalBudgetNoUpscale suite (5 tests) pins the no-upscale invariant. Part of the PR #9369 split (PR 2 of 6). See docs/design-docs/pr9369-split-plan.md. Signed-off-by: Kai Ma <kaim@nvidia.com>

Adds the correction-coefficient EMA engine that observes live TTFT/ITL/power, detects drift against the planner's expected operating point, and re-invokes the AIC sweep to recompute (cap_p, cap_d, n_p, n_d) plus implied frontend admission thresholds. Components: - AICPowerOptimizer (monitoring/aic_power_optimizer.py): correction coefficients (c_ttft, c_itl, c_power_p, c_power_d, c_power_agg) with EMA smoothing, drift detection with hysteresis, optimize() sweep driver, defensive power_w clamp at nameplate TDP (guards against AIC's non-physical interpolator extrapolations at sparse data-grid corners), and the aic_to_planner_cap bridge. - core/base.py Phase-3 additions: _aic_optimizer field + TYPE_CHECKING import; AIC startup sweep in _async_init(); _apply_aic_config() (caps + replica targets + drift-ref pin + admission gauges + POST fanout); _fanout_busy_threshold() with B6 absolute-prefill defense; _min_prefill_max_num_batched_tokens() helper; AIC tick-loop block in run() with disagg vs agg branches and conditional re-sweep; traffic observation enrichment with AIC-driven fields. - 5 new TrafficObservation fields (core/types.py): ttft_avg, itl_avg, total_tokens_per_s, scheduled_prefill_tokens, scheduled_decode_kv_tokens. - 13 AIC config fields + 3 admission config fields on PlannerConfig with validator (enable_aic_optimizer requires enable_power_awareness AND aic_interpolation). - get_avg_per_gpu_power_by_component DCGM query on PrometheusAPIClient and supporting documentation on DirectRouterMetricsClient (callable surface unchanged). - 9 AIC gauges/counters + 5 admission metrics on PlannerPrometheusMetrics. Test suites (~2,000 lines new tests): - 967-line test_aic_power_optimizer.py: 34 unit-grade tests covering EMA mechanics, drift detection, sweep failure modes, and the power_w clamp. - 712-line test_aic_power_e2e_sim.py: 15 closed-loop integration tests with the fake-AIC + fake-Prometheus harness. - 558-line test_metric_paths_live.py: live-cluster validation of DCGM per-component queries. - 505 additional lines on test_prometheus.py: TestMetricsIsValid, TestDirectRouterMetricsClient*, TestGetAvgRequestDurationRouterSource, TestGetAvgKvHitRateEdgeCases, plus the get_avg_per_gpu_power_by_component half of TestPowerAwareDcgmQueries. - profiler/utils/aic_dataframe.py: 14-line defensive fix for AIC table lookups. The PR 1b hand-authored intermediate of the 6 shared files plus this PR's verbatim checkout reproduces kaim/power-planner byte-for-byte (validated via git diff). Part of the PR #9369 split (PR 3 of 6). See docs/design-docs/pr9369-split-plan.md. Signed-off-by: Kai Ma <kaim@nvidia.com>

Adds the synthetic-metrics testbed: deterministic, no GPU, no cluster. Alpha-class (27 scenarios, A1-F26): synthetic fleet + fake metrics + fake actuator drive the planner through fault-injection scenarios covering AIC drift, NVML clamps, K8s RBAC denials, node loss / recovery, Prometheus outages, MDC gaps, budget shrinkage, AIC infeasibility, and drift-threshold boundary cases. Gamma-class (3 scenarios, G1-G3): mocker-driven trace replay with synthetic-power overlay (replay/synthetic_power_overlay.py) and a power-aware replay adapter (replay/power_aware_replay_adapter.py) exercise the closed loop against real Mooncake traces. Infrastructure: - runner.py + scenarios.py + assertions.py + recorder.py + clock.py (run-loop, scenario loader, invariant checks, recording). - synthetic_fleet.py + fake_actuator.py + fake_aic.py + fake_planner_metrics.py + fake_prometheus.py (test doubles for every external dependency of the planner run loop). - _runtime_stub.py installs a stub dynamo._core when the compiled Rust binding is absent, so the testbed runs on developer laptops without a CUDA toolchain (carries every dynamo._core symbol used by dynamo.llm at module-load time, including the post-rebase RoutingConstraints addition from main PR #9558). - grafana/testbed_dashboard.json + systems/ (h100_pcie / h100_sxm / h200_sxm SKUs) + traces/placeholder_h200_disagg_1rps.jsonl provide a complete observability + replay stack. Production code: - offline/replay_adapter.py: 9-line timing fix (now_s = max(tick.at_s, bridge_now_s)) prevents stale-tick loops on sparse traces. 86 testbed tests + 30 scenarios (27 alpha + 3 gamma) ship green at this tip (1 skipped pending env-var; test_aic_real_data.py is module-skipped unless AIC_SANDBOX_DIR is set). Part of the PR #9369 split (PR 4 of 6). See docs/design-docs/pr9369-split-plan.md. Signed-off-by: Kai Ma <kaim@nvidia.com>

Adds documentation, dev environment, example DGD configs, operator scripts, and standalone analysis tools. Design documentation (~5,000 lines): - docs/design-docs/powerplanner-design.md (2,111 lines): full design with Phase 1-5 layered roadmap, 5.3 correction-coefficient mechanics, 5.6 drift detection + hysteresis, 5.7 admission control, 6.5 Power Agent fail-closed cold-start cap, section 8 failure-mode catalog (14 modes). - docs/design-docs/powerplanner-testbed-design.md (2,187 lines): testbed design with alpha and gamma scenario classes, fake-driver contracts, replay-overlay timing semantics. - docs/components/planner/dpp-dev-env.md (675 lines): dev-pod bring-up + smoke-test recipes against viking-prod and umb-b200 cluster hardware. Dev environment + example DGDs: - deploy/planner/dev/Dockerfile.planner-dev + planner-dev-pod.yaml + qwen3-quickstart-dgd.yaml: bring-your-own-source planner dev pod for live-cluster iteration without rebuilds. - examples/deployments/powerplanner/: README + PIPECLEAN + MULTI_DGD guides + disagg-power-aware.yaml + disagg-conservative-cold-start.yaml + verify_poweraware.bash (operator smoke-test). Inspection + analysis tooling: - scripts/dev/test_k8s_access.py and scripts/inspect_*.py: dev-time helpers for DCGM attribution, frontend port resolution, planner DCGM query methods, worker /metrics endpoints, live planner metrics. - tools/compute_power_budget.py + integrate_aic_power_data.py + validate_aic_power_integration.py: standalone analysis utilities for sizing the budget, ingesting AIC offline data, and validating the integration end-to-end. - test_planner_power_launch.py: root-level pipeclean launcher. CI filter (.github/filters.yaml) accumulates the remaining planner-side paths so changes here trigger the planner job set (cumulative with PR 1a and PR 1b additions; the final file matches kaim/power-planner byte-for-byte). No production code changes; this PR is purely additive documentation, deployment manifests, and operator tooling. All planner/power_agent tests at PR 4 tip remain green (no test additions in this PR). Part of the PR #9369 split (PR 5 of 6, final). See docs/design-docs/pr9369-split-plan.md. Signed-off-by: Kai Ma <kaim@nvidia.com>