feat(planner): power-aware planner — Phase 1+2+3 + Phase 4 pipeclean#9369
feat(planner): power-aware planner — Phase 1+2+3 + Phase 4 pipeclean#9369kaim-eng wants to merge 11 commits into
Conversation
WalkthroughThis pull request implements end-to-end power-aware GPU resource management for the Dynamo planner across three phases: (1) a privileged DaemonSet that enforces per-pod NVML power caps via cgroup-based pod mapping, (2) static planner budget enforcement, and (3) a closed-loop AIC optimizer with EMA-based power coefficient correction. It includes comprehensive tests, Kubernetes manifests, deployment examples, and tooling. ChangesPower-Aware GPU Resource Management
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes |
There was a problem hiding this comment.
Actionable comments posted: 17
Note
Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
components/src/dynamo/planner/tests/integration/test_aic_power_optimizer.py (1)
1-799:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winFix formatting and linting errors before merge.
The file has formatting issues and linting violations that must be addressed:
- Formatting: Run
ruff formatto fix whitespace and line wrapping.- Linting (manual fixes required):
- Replace ambiguous
×character withxin comments at lines 329, 708 and docstring at line 696 (RUF002, RUF003).- Prefix unused variable
n_pwith underscore at line 742:_n_p, n_d = ...(RUF059).Ruff violations detected
RUF002 at line 696: Docstring contains ambiguous × (MULTIPLICATION SIGN) RUF003 at lines 329, 708: Comment contains ambiguous × (MULTIPLICATION SIGN) RUF059 at line 742: Unpacked variable `n_p` is never used🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/src/dynamo/planner/tests/integration/test_aic_power_optimizer.py` around lines 1 - 799, Run automatic formatting (ruff format) to fix whitespace/line-wrapping, then manually fix three lint issues: replace the Unicode multiplication sign '×' with ASCII 'x' in the explanatory comments/docstring around the budget and throughput calculations (references: the multi-replica budget comment in TestOptimizeBudgetConstraint, the docstring/throughput comment in TestThroughputRegressionWarning and the comment in TestSweepReplicas) and make the unused unpacked variable in TestSweepReplicas.test_no_budget_uses_max_replicas a named-unused by prefixing it with an underscore (change n_p, n_d = opt._sweep_replicas(...) to _n_p, n_d = ...). These changes address RUF002/RUF003 (ambiguous ×) and RUF059 (unused variable) while keeping function/class names like _sweep_replicas, TestSweepReplicas, TestOptimizeBudgetConstraint, and TestThroughputRegressionWarning locators unchanged.components/src/dynamo/planner/tests/unit/test_prometheus.py (1)
1-1078:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAddress pre-commit formatting failures.
The file has formatting issues flagged by ruff. Run
ruff format test_prometheus.pyto apply reformatting andruff check --fix test_prometheus.pyto fix lint violations.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/src/dynamo/planner/tests/unit/test_prometheus.py` around lines 1 - 1078, The test file fails ruff pre-commit checks; run an automatic reformat and fixer (e.g. ruff format followed by ruff check --fix) on the file and commit the changes so tests and lint pass; if applying edits manually, ensure formatting and lint rules are respected across imports and function/class definitions (examples to inspect: PrometheusAPIClient, DirectRouterMetricsClient, TestPrometheusAPIClientRouterSource, and test_frontend_metric_container_with_nan_value) and re-run ruff to confirm no remaining violations.components/src/dynamo/planner/tests/integration/test_aic_power_e2e_sim.py (1)
1-711:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winRun
ruff formatto fix spacing and line-wrapping issues.The code has formatting inconsistencies in comment alignment and line lengths. Run
ruff format test_aic_power_e2e_sim.pyto resolve the spacing before comments (e.g.,50.0, #→50.0, #) and line-wrapping issues.Formatting changes needed
-H200_PREFILL_SERVING_W = H200_PREFILL_P90_W # 603 W/GPU at heavy prefill load -H200_DECODE_SERVING_W = H200_DECODE_P90_W # 558 W/GPU at heavy decode load +H200_PREFILL_SERVING_W = H200_PREFILL_P90_W # 603 W/GPU at heavy prefill load +H200_DECODE_SERVING_W = H200_DECODE_P90_W # 558 W/GPU at heavy decode load🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/src/dynamo/planner/tests/integration/test_aic_power_e2e_sim.py` around lines 1 - 711, The file has ruff-formatting issues (misaligned inline comments and long wrapped lines) across the test module (e.g., around numeric literals and comment spacing such as "50.0, #"), so run an automatic formatter and fix spacing/line-wrapping: run `ruff format` (or your repo's formatter) on this test file and ensure inline comments use a single space after the comma and are properly wrapped; focus on symbols/classes to inspect after formatting like TestFeedbackLoopH200.test_ema_converges_from_cold_start, _POWER_LEVELS entries, _make_config, _make_traffic, and TestPhase4SyntheticMultiLevel methods to verify comment alignment and line lengths are corrected.components/src/dynamo/planner/tests/unit/test_kubernetes_connector.py (1)
1-1:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winRun the code formatter before merge.
The pre-commit check failed due to formatting issues. Run
ruff formatto fix line wrapping and assertion formatting.#!/bin/bash # Check current formatting diff cd components/src/dynamo/planner ruff format --diff tests/unit/test_kubernetes_connector.py🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/src/dynamo/planner/tests/unit/test_kubernetes_connector.py` at line 1, Run the project's formatter on the test file to fix line wrapping and assertion formatting: execute ruff format (e.g., `ruff format tests/unit/test_kubernetes_connector.py` from the components/src/dynamo/planner directory) to update formatting in test_kubernetes_connector.py so the pre-commit checks pass; re-run the tests or pre-commit to confirm no remaining style diffs.
🟡 Minor comments (7)
components/src/dynamo/planner/config/defaults.py-81-103 (1)
81-103:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winPer-engine power-limit defaults contradict the stated "no placeholder integers" philosophy.
The comment at lines 82–84 explicitly argues that defaults are
None"so the type itself signals 'operator must set'", yetprefill_engine_gpu_power_limitanddecode_engine_gpu_power_limitare hardcoded to300W. Withenable_power_awareness=True, these silently cap every prefill/decode GPU to 300 W regardless of SKU (≈43% of an H200 SXM's 700 W TDP), which would dramatically and silently regress throughput while the operator believes they are within budget. The validator inplanner_config.pyonly enforcestotal_gpu_power_limitandpower_agent_safe_default_watts, so this slips through.Either make these
Optional[int] = Noneand add a validator, or derive them fromtotal_gpu_power_limit / gpu_countso the operator only needs to set one number.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/src/dynamo/planner/config/defaults.py` around lines 81 - 103, The defaults prefill_engine_gpu_power_limit and decode_engine_gpu_power_limit contradict the "no placeholder integers" policy and silently cap GPUs when enable_power_awareness=True; change both to Optional[int]=None (or compute them from total_gpu_power_limit divided by gpu_count) and update the planner_config.py validation logic to require/derive per-engine limits when enable_power_awareness is true (i.e., enforce that either per-engine limits are set or automatically compute them from total_gpu_power_limit and known GPU count) while keeping total_gpu_power_limit and power_agent_safe_default_watts validation intact.components/src/dynamo/planner/tests/unit/test_state_machine.py-1356-1358 (1)
1356-1358:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winRename the unused unpack target to keep pre-commit green.
num_pis never read in this test, and Ruff is already flagging it. Renaming it to_num_pavoids the RUF059 failure without changing the assertion.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/src/dynamo/planner/tests/unit/test_state_machine.py` around lines 1356 - 1358, The test assigns the tuple return from sm._apply_global_budget to num_p and num_d but never uses num_p; rename the unused unpack target to _num_p (i.e., change "num_p, num_d = sm._apply_global_budget(5, 3)" to "_num_p, num_d = sm._apply_global_budget(5, 3)") to satisfy Ruff/RUF059 while leaving the assertion against num_d unchanged.tools/validate_aic_power_integration.py-137-137 (1)
137-137:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winDrop the stray
fprefixes to unblock Ruff.These strings have no placeholders, and pre-commit is already reporting the resulting F541 errors here.
Also applies to: 167-167, 305-305
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/validate_aic_power_integration.py` at line 137, Calls to _section are using f-string prefixes for plain string literals (e.g., _section(f" prefill estimate (isl=1024, tp=8)")) which triggers Ruff F541; remove the leading f so the calls use normal string literals (e.g., _section(" prefill estimate (isl=1024, tp=8)")). Apply the same change to the other occurrences of _section(f"...") in the file (the other similar call sites that currently use an f-prefix).components/src/dynamo/planner/tests/unit/test_actuation_knobs.py-1-1 (1)
1-1:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winRun the code formatter before merge.
The pre-commit check failed because the code needs formatting. Run
ruff formaton this file to fix line wrapping issues.#!/bin/bash # Check current formatting diff cd components/src/dynamo/planner ruff format --diff tests/unit/test_actuation_knobs.py🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/src/dynamo/planner/tests/unit/test_actuation_knobs.py` at line 1, The file tests/unit/test_actuation_knobs.py needs to be auto-formatted to satisfy pre-commit checks; run the project formatter (ruff format) on that file (e.g., ruff format components/src/dynamo/planner/tests/unit/test_actuation_knobs.py) to fix line wrapping and style issues, then stage the updated file so the pre-commit check passes.examples/deployments/powerplanner/disagg-conservative-cold-start.yaml-94-94 (1)
94-94:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winCorrect the model ID to include the "Meta-" prefix.
The model ID
meta-llama/Llama-3-8B-Instructis incorrect. The official HuggingFace model ismeta-llama/Meta-Llama-3-8B-Instruct. Using the wrong ID will cause model download failures at runtime.Also applies to: line 114
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/deployments/powerplanner/disagg-conservative-cold-start.yaml` at line 94, The model ID strings are incorrect: replace every occurrence of "meta-llama/Llama-3-8B-Instruct" with the official HuggingFace ID "meta-llama/Meta-Llama-3-8B-Instruct" (e.g., the value used in the model field in the YAML), and update the second occurrence noted in the review as well; ensure both instances exactly match "meta-llama/Meta-Llama-3-8B-Instruct" so the runtime can successfully download the model.components/src/dynamo/planner/tests/integration/test_actuation_knobs_live.py-305-313 (1)
305-313:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winResolve the frontend port from the pod instead of forcing
8000.Line 308 hardcodes the default frontend port, so this live test will false-fail on deployments that override the
httpport. Use the same per-pod port resolution helper as the runtime path instead of assuming8000.As per coding guidelines, "Don’t hardcode ports in tests (no literal
8000/8081/...; use provided dynamic port fixtures/helpers)."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/src/dynamo/planner/tests/integration/test_actuation_knobs_live.py` around lines 305 - 313, The test currently hardcodes port 8000 when calling connector.post_busy_threshold (pod, model, port=8000) which breaks deployments that override the http port; replace the literal with the per-pod frontend-port resolver used by the runtime path (call it to compute port_from_pod = resolve_frontend_port(pod) or the existing helper in your test runtime utilities) and pass that variable into connector.post_busy_threshold (ensure you await or import the helper if async). This fixes the test by dynamically resolving the correct frontend port for the given pod instead of assuming 8000.components/src/dynamo/planner/tests/integration/test_actuation_knobs_live.py-571-582 (1)
571-582:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winReplace
time.sleep(2)withawait asyncio.sleep(2)to avoid blocking the event loop.This async test uses
awaitcalls, andtime.sleep(2)blocks the entire asyncio event loop, which can interfere with async cleanup and make the test flaky. Importasyncioat the module level and useawait asyncio.sleep(2)instead.Suggested change
- time.sleep(2) # let the API server settle + await asyncio.sleep(2) # let the API server settle🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/src/dynamo/planner/tests/integration/test_actuation_knobs_live.py` around lines 571 - 582, The test is blocking the asyncio event loop by calling time.sleep(2); replace that call with await asyncio.sleep(2) and add an import asyncio at module level. Locate the snippet where connector.set_component_replicas(...) is called (in the test function that uses await) and change the synchronous sleep to the async sleep so the event loop isn’t blocked and async cleanup can run correctly.
🧹 Nitpick comments (9)
components/power_agent/power_agent.py (1)
489-490: 💤 Low valueMove
argparseimport to module top per Python guidelines.import json import logging import os import re import signal import sys import threading import time from typing import Optional +import argparsedef main() -> None: - import argparse - parser = argparse.ArgumentParser(description="Dynamo Power Agent DaemonSet")As per coding guidelines (.ai/python-guidelines.md): "flag any
importinside function/method/class bodies (keep imports at module top)".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/power_agent/power_agent.py` around lines 489 - 490, The import of argparse is inside the main() function; move the import to the module top to follow guidelines. Remove the inline "import argparse" from def main() and add "import argparse" alongside other top-level imports so that the module-level imports include argparse and main() simply uses it.components/power_agent/tests/test_multi_pod_policy.py (1)
66-126: ⚡ Quick winCoverage looks good for the policy itself; consider adding a same-pod-multi-PID case.
The cases here cover
_resolve_cap_for_gputhoroughly. To complete the matrix and lock down the deduplication concern raised againstpower_agent.py:449-455, please add a test that asserts a single pod producing two PIDs on the same GPU resolves to the pod's annotation without firingmulti_pod_agree:def test_same_pod_multi_pid_not_multi_pod(self): # After the dedup fix, two PIDs from one pod -> 1 entry, no warning. cap = _resolve_cap_for_gpu( 0, [("uid-1", "480")], SAFE_DEFAULT, self.m ) self.assertEqual(cap, 480) self.assertEqual(self.m.multi_pod_agree, 0)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/power_agent/tests/test_multi_pod_policy.py` around lines 66 - 126, Add a new unit test method test_same_pod_multi_pid_not_multi_pod to TestMultiPodPolicy that calls _resolve_cap_for_gpu with a single pod UID and a valid annotation (use SAFE_DEFAULT and the fixture self.m) and asserts the returned cap equals the annotated value and that self.m.multi_pod_agree remains 0; reference _resolve_cap_for_gpu, SAFE_DEFAULT and the metric multi_pod_agree so the test verifies same-pod/multiple-PID deduplication (after the fix around power_agent.py:449-455) does not trigger the multi-pod agreement metric.components/src/dynamo/planner/monitoring/planner_metrics.py (1)
189-213: ⚡ Quick winKeep the new AIC metrics under the
dynamo_planner_*namespace.Everything else in this registry uses
PREFIX, but these counters/gauges switch todynamo_aic_*. That splits one feature area across two metric namespaces and makes prefix-based dashboards/alerts miss the new series.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/src/dynamo/planner/monitoring/planner_metrics.py` around lines 189 - 213, The new AIC metric names use the "dynamo_aic_*" prefix but must live under the existing PREFIX namespace; update the metric name arguments for aic_correction_pegged_total, aic_consecutive_failures, aic_optimizer_exceptions_total, aic_optimizer_disabled_reason, and aic_throughput_regression_total to use PREFIX (e.g. PREFIX + "aic_correction_pegged_total" etc.) instead of hardcoded "dynamo_aic_*" so they appear under the dynamo_planner_* namespace; keep the same help strings, labelnames, and metric types unchanged.components/src/dynamo/planner/tests/integration/test_aic_power_optimizer.py (1)
742-742: ⚡ Quick winUnused variable in test assertion.
The variable
n_pis unpacked but never used in this test. Either use it in an assertion or prefix with_to indicate it's intentionally unused.♻️ Suggested fix
- n_p, n_d = opt._sweep_replicas(1, 5, 3200, 2400, None) + _, n_d = opt._sweep_replicas(1, 5, 3200, 2400, None) assert n_d == 5🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/src/dynamo/planner/tests/integration/test_aic_power_optimizer.py` at line 742, The test unpacks n_p and n_d from opt._sweep_replicas(1, 5, 3200, 2400, None) but never uses n_p; update the test to either assert something about n_p (for example expected value) or mark it as intentionally unused by renaming it to _n_p (or simply _), e.g., change the unpacking from "n_p, n_d = opt._sweep_replicas(...)" to "_, n_d = opt._sweep_replicas(...)" or "_n_p, n_d = opt._sweep_replicas(...)" so the linter/test clearly reflects that n_p is not used.examples/deployments/powerplanner/README.md (1)
52-54: 💤 Low valueOptional: Specify language for fenced code block.
The budget formula code block could use a language specifier for consistency (e.g.,
textorplaintext), though it's not critical for a mathematical formula.♻️ Optional improvement
-``` +```text total_gpu_power_limit = (rack_capacity_W × headroom_factor) − non_gpu_overhead ```🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/deployments/powerplanner/README.md` around lines 52 - 54, The fenced code block containing the budget formula currently has no language specified; update the opening fence from ``` to a language label (e.g., ```text or ```plaintext) so the block becomes ```text (or ```plaintext) followed by the same formula `total_gpu_power_limit = (rack_capacity_W × headroom_factor) − non_gpu_overhead` and a closing ```, ensuring consistency and clearer rendering in README.md.deploy/power_agent/daemonset.yaml (1)
93-99: ⚡ Quick winVerify resource limits are appropriate for production.
The agent requests 50m CPU / 64Mi memory with limits of 200m CPU / 128Mi memory. Confirm these values are validated against:
- Typical NVML API latency (should be low)
- Kubernetes API list/watch overhead per 15s reconcile loop
- Number of GPUs per node (more GPUs = more NVML calls)
Consider monitoring
container_memory_working_set_bytesandrate(container_cpu_usage_seconds_total)in production to validate these limits don't cause OOM or CPU throttling on large nodes (e.g., 8× H200 SXM).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/power_agent/daemonset.yaml` around lines 93 - 99, The current resources.requests and resources.limits in daemonset.yaml (cpu: "50m"/"200m", memory: "64Mi"/"128Mi") are likely too low for production; validate and raise them based on measured NVML API latency, 15s controller reconcile list/watch overhead, and max GPUs per node by load-testing with representative nodes (e.g., 8× H200 SXM). Measure container_memory_working_set_bytes and rate(container_cpu_usage_seconds_total) under stress to find steady-state and peak usage, then update resources.requests to cover steady-state usage and resources.limits to provide safe headroom (avoid OOM/throttling), and iterate: simulate NVML-heavy loops, reconcile cycles, and multiple GPUs, adjust values in daemonset.yaml accordingly and add monitoring alerts if working_set or CPU rate approaches limits.components/src/dynamo/planner/tests/integration/test_aic_power_e2e_sim.py (1)
486-489: 💤 Low valueClass attribute dict comprehension triggers Ruff warning.
Ruff flags this as a mutable class attribute (RUF012), though it's effectively a constant. Consider moving it outside the class or converting to a module-level constant to silence the warning.
♻️ Optional refactor
+# Per-replica wattage for each level (power_w × 8 GPUs) +_WATTS_PER_REPLICA = { + lvl: _POWER_LEVELS[lvl]["prefill_w"] * H200_GPUS_PER_ENGINE + for lvl in _POWER_LEVELS +} + class TestPhase4SyntheticMultiLevel: - # Per-replica wattage for each level (power_w × 8 GPUs) - _WATTS_PER_REPLICA = { - lvl: _POWER_LEVELS[lvl]["prefill_w"] * H200_GPUS_PER_ENGINE - for lvl in _POWER_LEVELS - }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/src/dynamo/planner/tests/integration/test_aic_power_e2e_sim.py` around lines 486 - 489, Move the mutable dict comprehension _WATTS_PER_REPLICA out of the class and make it a module-level constant to avoid the Ruff RUF012 warning: compute the dict using the existing symbols _POWER_LEVELS and H200_GPUS_PER_ENGINE at module scope (e.g., next to _POWER_LEVELS) and replace the class attribute reference with that module-level constant so the class no longer defines a mutable attribute.components/src/dynamo/planner/tests/unit/test_prometheus.py (1)
896-896: ⚡ Quick winPrefix unused unpacked variables with underscore.
Multiple test methods unpack tuples but don't use all variables. Prefix unused variables with
_to indicate this is intentional.♻️ Suggested fixes
# Line 896 - recent, per_worker_avg, cluster_avg = result + recent, _per_worker_avg, _cluster_avg = result assert recent["w0"]["active_prefill_tokens"] == pytest.approx(50.0) # Line 904 - recent, per_worker_avg, cluster_avg = direct_client.get_recent_and_averaged_metrics("prefill") + _recent, per_worker_avg, _cluster_avg = direct_client.get_recent_and_averaged_metrics("prefill") assert per_worker_avg["w0"]["active_prefill_tokens"] == pytest.approx(50.0) # Line 927 - _, per_worker_avg, cluster_avg = direct_client.get_recent_and_averaged_metrics("prefill") + _, _per_worker_avg, cluster_avg = direct_client.get_recent_and_averaged_metrics("prefill") assert cluster_avg["active_prefill_tokens"] == pytest.approx(20.0)Also applies to: 904-904, 927-927
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/src/dynamo/planner/tests/unit/test_prometheus.py` at line 896, In components/src/dynamo/planner/tests/unit/test_prometheus.py there are tuple unpackings like "recent, per_worker_avg, cluster_avg = result" where some unpacked variables are unused; change the unused names to be prefixed with an underscore (e.g., "recent, _per_worker_avg, _cluster_avg = result" or replace with "_" placeholders) so the intent is explicit. Update the other similar unpackings in the same file (the occurrences that unpack into recent, per_worker_avg, cluster_avg) to follow the same underscore-prefixed naming convention.components/src/dynamo/planner/connectors/kubernetes.py (1)
306-307: ⚡ Quick winAvoid broad exception catching.
The broad
except Exception:can mask unexpected errors (e.g.,KeyError,AttributeErrorfrom malformed deployment dicts). Catch specific exceptions (PlannerError) to make the error-handling intent explicit.🔍 Catch specific exceptions
try: service = get_service_from_sub_component_type_or_name( deployment, sub_component_type=sub_component_type ) - except Exception: + except PlannerError: return []Based on learnings: Python guidelines require catching specific exceptions rather than broad
Exceptionto avoid swallowing unexpected errors.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/src/dynamo/planner/connectors/kubernetes.py` around lines 306 - 307, The broad except Exception: in connectors/kubernetes.py masks unexpected errors; change it to catch the specific PlannerError (e.g., use except PlannerError as e:) and return [] for that case, while letting any other exceptions propagate (or re-raise them) so KeyError/AttributeError aren’t swallowed; also ensure PlannerError is imported where the except block resides and preserve the original return [] behavior for the PlannerError path.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@components/power_agent/power_agent.py`:
- Around line 449-455: _reconcile_gpu currently appends one (uid, annotation)
tuple per PID causing duplicate entries for multi-PID pods; change the logic to
deduplicate by pod UID before calling _resolve_cap_for_gpu (e.g., build a set or
map of seen UIDs using _extract_pod_uid_from_cgroup and uid_to_annotation, then
produce a single (uid, annotation) entry per UID into pod_annotations) so
_resolve_cap_for_gpu sees unique pods; update references to pod_annotations and
keep uid_to_annotation lookup as before; add a regression test in
test_multi_pod_policy.py that covers "single pod, two PIDs" to prevent
regressions.
In `@components/power_agent/tests/__init__.py`:
- Line 1: Add the standard repository SPDX header to the new package module
components.power_agent.tests.__init__.py; open the empty __init__.py and prepend
the project's copyright/SPDX license header block (the same header used across
other top-level Python files in the repo) so the file passes copyright-checks.
In `@components/power_agent/tests/test_apply_cap.py`:
- Around line 17-24: The test module imports power_agent eagerly which can raise
ImportError for optional NVML/K8s deps and break test collection; wrap the
import block that pulls in power_agent and symbols (POWER_ANNOTATION_KEY,
PowerAgent, _apply_cap, _clamp_to_constraints, _nvml_uuid) in a try/except
ImportError and in the except call pytest.skip("optional dependency missing:
<package>", allow_module_level=True) at module scope (ensure pytest is imported
at top-level so skip can be called).
In `@components/src/dynamo/planner/connectors/kubernetes.py`:
- Around line 384-390: Move the httpx import to the module top and remove the
try/except ImportError block in the admission-control POST handling code: add a
plain "import httpx" at the top of
components/src/dynamo/planner/connectors/kubernetes.py, delete the in-function
try/except that currently raises RuntimeError for missing httpx, and rely on the
module-level import to surface ImportError early; update any references in the
admission-control POST support code to use httpx directly (no fallback or
conditional logic needed).
In `@components/src/dynamo/planner/core/base.py`:
- Around line 753-767: The planner's advisory/dry-run flag isn't respected:
update _apply_power_annotations to bail out when self.config.advisory is True
(in addition to existing checks) so it never PATCHes pod annotations in advisory
mode; likewise locate the autoset fanout logic that posts /busy_threshold (the
code around the second occurrence noted) and add the same advisory guard to skip
POSTing frontend thresholds when self.config.advisory is True. Ensure both
guards check self.config.advisory before making any Kubernetes PATCH or HTTP
POST calls so advisory mode performs no writes.
- Around line 946-976: The current fanout to self.connector.post_busy_threshold
can leave some frontend pods with old thresholds and the planner continuing;
after computing failed (pods with Exception) you must treat any partial failure
as a hard failure: increment the admission_partial_success_total metric as done,
log the error, then raise an exception (e.g., RuntimeError with the failed pod
names and original results) or otherwise abort/return an error so the planner
does not proceed; ensure this change is applied in the same block that builds
results/pods (refer to self.connector.post_busy_threshold, the failed list,
self.prometheus_metrics.admission_partial_success_total, and logger.error) so
partial success stops the planner instead of being treated as success.
In `@components/src/dynamo/planner/monitoring/aic_power_optimizer.py`:
- Around line 534-545: The current branch returns (min_ep, min_ep) when
remaining_after_min_p <= 0 which can still exceed the budget once dynamic power
(d_watts) is considered; update this branch to check whether min_ep * (p_watts +
d_watts) <= budget before returning min_ep and if that check fails, log an error
via logger.error (including budget, min_ep, p_watts, d_watts) and fail closed by
returning a safe zero/NO_OP result (e.g., (0, 0)) or raising a clear exception
instead of returning an over-budget config; change the code around the variables
budget, min_ep, p_watts, d_watts and the logger.warning call accordingly.
In `@components/src/dynamo/planner/monitoring/planner_metrics.py`:
- Around line 244-247: The metric variable admission_partial_success_total is
misnamed vs its help text; either rename the Counter to
admission_partial_failure_total (and update all references/usages of
admission_partial_success_total) so the metric name matches that it counts
frontend POST failures, or keep the variable name but change the Counter help
string to describe "Cumulative count of frontend POST partial successes" if it
truly counts successes; update the f-string name
(f"{PREFIX}_admission_partial_failure_total") and any code that reads or exports
this metric (search for admission_partial_success_total) to maintain
consistency.
In `@components/src/dynamo/planner/monitoring/traffic_metrics.py`:
- Around line 404-414: The try/except in get_total_dgd_power (and the sibling
power helper around lines 445–457) is catching Exception broadly and swallowing
errors by logging and returning None; change it to either catch only expected
parsing/Prometheus errors (e.g., ValueError, TypeError, KeyError, or the
specific exception type thrown by self.prom.custom_query) and handle those by
logging and returning None, or if you must catch Exception, log the error with
logger.exception/ logger.error and then re-raise so unexpected failures
propagate; locate the calls to self.prom.custom_query and the surrounding
logger.debug call in get_total_dgd_power and the other helper and apply the
narrowed exception list or add a re-raise after logging.
In
`@components/src/dynamo/planner/tests/integration/test_actuation_knobs_live.py`:
- Line 75: The module-level pytestmark currently sets pytest.mark.integration
and pytest.mark.gpu_0 but lacks a timeout; update the pytestmark list (the
pytestmark variable) to also include a pytest.mark.timeout(...) marker (e.g.,
pytest.mark.timeout(120) or another appropriate upper bound >30s) so the entire
live suite has a hard timeout and network/polling calls fail fast; ensure you
import pytest if not already and add the timeout marker to the existing
pytestmark list rather than adding per-test decorators.
In `@components/src/dynamo/planner/tests/integration/test_metric_paths_live.py`:
- Line 73: The module-level pytestmarks lack a timeout; update the pytestmark
variable in test_metric_paths_live.py (the existing pytestmark =
[pytest.mark.integration, pytest.mark.gpu_0]) to include a timeout marker such
as pytest.mark.timeout(300) (or another appropriate value >=30s) so long-running
in-cluster Prometheus/K8s calls will fail instead of hanging; ensure pytest is
imported if not already.
In `@deploy/planner-pod-rbac-dev.yaml`:
- Around line 37-39: The RoleBinding subject for the ServiceAccount
"planner-dev-sa" is missing the required namespace field; update the RoleBinding
subjects block to add "namespace: <namespace>" for the ServiceAccount subject
(or wire a templating/Kustomize variable) so the subject entry for
planner-dev-sa includes its namespace and the binding correctly grants
permissions.
In `@deploy/power_agent/rbac.yaml`:
- Around line 9-14: The docs show an apply path that leaves the
ClusterRoleBinding.subjects[0].namespace set to the literal placeholder
`${POWER_AGENT_NAMESPACE}`, which breaks the binding because kubectl does not
perform variable substitution; update the README snippet to either remove the
misleading "kubectl apply -n <your-namespace> -f deploy/power_agent/rbac.yaml"
path or change it to explicitly instruct the user to substitute the placeholder
(e.g., edit the ServiceAccount and ClusterRoleBinding.subjects[0].namespace or
use envsubst) so the ServiceAccount namespace and
ClusterRoleBinding.subjects[0].namespace match.
In `@tools/compute_power_budget.py`:
- Line 148: The print statement currently uses an unnecessary f-string prefix
("print(f\" (= 70% of TDP; adjust for your model's idle power profile)\")")
which triggers Ruff F541; change the call to a plain string literal by removing
the leading "f" so the statement is normal (e.g., update the print call in
compute_power_budget.py where that string is printed). Ensure no other f-prefix
remains on that exact print invocation.
- Around line 316-336: Validate the numeric CLI inputs before building
SizingInput and calling compute_budget: check that args.nodes and
args.gpus_per_node are positive integers (greater than 0) so total_gpus =
args.nodes * args.gpus_per_node > 0, and also validate args.rack_capacity and
gpu_tdp (and any other numeric args used in SizingInput such as
args.non_gpu_overhead and elements of args.dgd_gpus) are positive where
applicable; if any validation fails, print a clear stderr error and exit
non-zero. Locate the creation of total_gpus and the SizingInput construction in
compute_power_budget.py (symbols: total_gpus, SizingInput) and add these
pre-checks before calling compute_budget() to avoid division-by-zero or
meaningless budgets.
In `@tools/integrate_aic_power_data.py`:
- Around line 230-231: After resolving args.h200_data and args.b200_data into
h200_data and b200_data, validate that any non-None Path actually exists (and is
a directory) and fail fast if not: if h200_data is not None and not
h200_data.exists()/is_dir() (same for b200_data) log/raise a clear error and
exit with non-zero status (or call the argparse parser.error) so a mistyped root
produces a hard failure instead of silent "nothing to copy"; update the
validation near the h200_data/b200_data assignment to perform these checks and
produce a descriptive message including the offending path variable name.
In `@tools/validate_aic_power_integration.py`:
- Around line 20-21: The tight-budget test never runs because the script only
calls the optimizer with a generous hardcoded cap (80_000); add an explicit
tight-budget run that sets the budget to just under one replica's power and
assert the optimizer returns fewer replicas than requested. Concretely: compute
the single-replica draw (use the same attribute/func the script uses to estimate
replica power, e.g., replica_power_watts or equivalent), call the optimizer (the
existing optimizer/optimize/run_optimizer call) with budget =
single_replica_power - 1, and add an assertion that the returned replica count <
requested_max_replicas (mirror the style used elsewhere in the file). Ensure
this new check is executed (not behind a dead branch) alongside the existing
generous-cap test (the one using 80_000).
---
Outside diff comments:
In `@components/src/dynamo/planner/tests/integration/test_aic_power_e2e_sim.py`:
- Around line 1-711: The file has ruff-formatting issues (misaligned inline
comments and long wrapped lines) across the test module (e.g., around numeric
literals and comment spacing such as "50.0, #"), so run an automatic formatter
and fix spacing/line-wrapping: run `ruff format` (or your repo's formatter) on
this test file and ensure inline comments use a single space after the comma and
are properly wrapped; focus on symbols/classes to inspect after formatting like
TestFeedbackLoopH200.test_ema_converges_from_cold_start, _POWER_LEVELS entries,
_make_config, _make_traffic, and TestPhase4SyntheticMultiLevel methods to verify
comment alignment and line lengths are corrected.
In `@components/src/dynamo/planner/tests/integration/test_aic_power_optimizer.py`:
- Around line 1-799: Run automatic formatting (ruff format) to fix
whitespace/line-wrapping, then manually fix three lint issues: replace the
Unicode multiplication sign '×' with ASCII 'x' in the explanatory
comments/docstring around the budget and throughput calculations (references:
the multi-replica budget comment in TestOptimizeBudgetConstraint, the
docstring/throughput comment in TestThroughputRegressionWarning and the comment
in TestSweepReplicas) and make the unused unpacked variable in
TestSweepReplicas.test_no_budget_uses_max_replicas a named-unused by prefixing
it with an underscore (change n_p, n_d = opt._sweep_replicas(...) to _n_p, n_d =
...). These changes address RUF002/RUF003 (ambiguous ×) and RUF059 (unused
variable) while keeping function/class names like _sweep_replicas,
TestSweepReplicas, TestOptimizeBudgetConstraint, and
TestThroughputRegressionWarning locators unchanged.
In `@components/src/dynamo/planner/tests/unit/test_kubernetes_connector.py`:
- Line 1: Run the project's formatter on the test file to fix line wrapping and
assertion formatting: execute ruff format (e.g., `ruff format
tests/unit/test_kubernetes_connector.py` from the components/src/dynamo/planner
directory) to update formatting in test_kubernetes_connector.py so the
pre-commit checks pass; re-run the tests or pre-commit to confirm no remaining
style diffs.
In `@components/src/dynamo/planner/tests/unit/test_prometheus.py`:
- Around line 1-1078: The test file fails ruff pre-commit checks; run an
automatic reformat and fixer (e.g. ruff format followed by ruff check --fix) on
the file and commit the changes so tests and lint pass; if applying edits
manually, ensure formatting and lint rules are respected across imports and
function/class definitions (examples to inspect: PrometheusAPIClient,
DirectRouterMetricsClient, TestPrometheusAPIClientRouterSource, and
test_frontend_metric_container_with_nan_value) and re-run ruff to confirm no
remaining violations.
---
Minor comments:
In `@components/src/dynamo/planner/config/defaults.py`:
- Around line 81-103: The defaults prefill_engine_gpu_power_limit and
decode_engine_gpu_power_limit contradict the "no placeholder integers" policy
and silently cap GPUs when enable_power_awareness=True; change both to
Optional[int]=None (or compute them from total_gpu_power_limit divided by
gpu_count) and update the planner_config.py validation logic to require/derive
per-engine limits when enable_power_awareness is true (i.e., enforce that either
per-engine limits are set or automatically compute them from
total_gpu_power_limit and known GPU count) while keeping total_gpu_power_limit
and power_agent_safe_default_watts validation intact.
In
`@components/src/dynamo/planner/tests/integration/test_actuation_knobs_live.py`:
- Around line 305-313: The test currently hardcodes port 8000 when calling
connector.post_busy_threshold (pod, model, port=8000) which breaks deployments
that override the http port; replace the literal with the per-pod frontend-port
resolver used by the runtime path (call it to compute port_from_pod =
resolve_frontend_port(pod) or the existing helper in your test runtime
utilities) and pass that variable into connector.post_busy_threshold (ensure you
await or import the helper if async). This fixes the test by dynamically
resolving the correct frontend port for the given pod instead of assuming 8000.
- Around line 571-582: The test is blocking the asyncio event loop by calling
time.sleep(2); replace that call with await asyncio.sleep(2) and add an import
asyncio at module level. Locate the snippet where
connector.set_component_replicas(...) is called (in the test function that uses
await) and change the synchronous sleep to the async sleep so the event loop
isn’t blocked and async cleanup can run correctly.
In `@components/src/dynamo/planner/tests/unit/test_actuation_knobs.py`:
- Line 1: The file tests/unit/test_actuation_knobs.py needs to be auto-formatted
to satisfy pre-commit checks; run the project formatter (ruff format) on that
file (e.g., ruff format
components/src/dynamo/planner/tests/unit/test_actuation_knobs.py) to fix line
wrapping and style issues, then stage the updated file so the pre-commit check
passes.
In `@components/src/dynamo/planner/tests/unit/test_state_machine.py`:
- Around line 1356-1358: The test assigns the tuple return from
sm._apply_global_budget to num_p and num_d but never uses num_p; rename the
unused unpack target to _num_p (i.e., change "num_p, num_d =
sm._apply_global_budget(5, 3)" to "_num_p, num_d = sm._apply_global_budget(5,
3)") to satisfy Ruff/RUF059 while leaving the assertion against num_d unchanged.
In `@examples/deployments/powerplanner/disagg-conservative-cold-start.yaml`:
- Line 94: The model ID strings are incorrect: replace every occurrence of
"meta-llama/Llama-3-8B-Instruct" with the official HuggingFace ID
"meta-llama/Meta-Llama-3-8B-Instruct" (e.g., the value used in the model field
in the YAML), and update the second occurrence noted in the review as well;
ensure both instances exactly match "meta-llama/Meta-Llama-3-8B-Instruct" so the
runtime can successfully download the model.
In `@tools/validate_aic_power_integration.py`:
- Line 137: Calls to _section are using f-string prefixes for plain string
literals (e.g., _section(f" prefill estimate (isl=1024, tp=8)")) which triggers
Ruff F541; remove the leading f so the calls use normal string literals (e.g.,
_section(" prefill estimate (isl=1024, tp=8)")). Apply the same change to the
other occurrences of _section(f"...") in the file (the other similar call sites
that currently use an f-prefix).
---
Nitpick comments:
In `@components/power_agent/power_agent.py`:
- Around line 489-490: The import of argparse is inside the main() function;
move the import to the module top to follow guidelines. Remove the inline
"import argparse" from def main() and add "import argparse" alongside other
top-level imports so that the module-level imports include argparse and main()
simply uses it.
In `@components/power_agent/tests/test_multi_pod_policy.py`:
- Around line 66-126: Add a new unit test method
test_same_pod_multi_pid_not_multi_pod to TestMultiPodPolicy that calls
_resolve_cap_for_gpu with a single pod UID and a valid annotation (use
SAFE_DEFAULT and the fixture self.m) and asserts the returned cap equals the
annotated value and that self.m.multi_pod_agree remains 0; reference
_resolve_cap_for_gpu, SAFE_DEFAULT and the metric multi_pod_agree so the test
verifies same-pod/multiple-PID deduplication (after the fix around
power_agent.py:449-455) does not trigger the multi-pod agreement metric.
In `@components/src/dynamo/planner/connectors/kubernetes.py`:
- Around line 306-307: The broad except Exception: in connectors/kubernetes.py
masks unexpected errors; change it to catch the specific PlannerError (e.g., use
except PlannerError as e:) and return [] for that case, while letting any other
exceptions propagate (or re-raise them) so KeyError/AttributeError aren’t
swallowed; also ensure PlannerError is imported where the except block resides
and preserve the original return [] behavior for the PlannerError path.
In `@components/src/dynamo/planner/monitoring/planner_metrics.py`:
- Around line 189-213: The new AIC metric names use the "dynamo_aic_*" prefix
but must live under the existing PREFIX namespace; update the metric name
arguments for aic_correction_pegged_total, aic_consecutive_failures,
aic_optimizer_exceptions_total, aic_optimizer_disabled_reason, and
aic_throughput_regression_total to use PREFIX (e.g. PREFIX +
"aic_correction_pegged_total" etc.) instead of hardcoded "dynamo_aic_*" so they
appear under the dynamo_planner_* namespace; keep the same help strings,
labelnames, and metric types unchanged.
In `@components/src/dynamo/planner/tests/integration/test_aic_power_e2e_sim.py`:
- Around line 486-489: Move the mutable dict comprehension _WATTS_PER_REPLICA
out of the class and make it a module-level constant to avoid the Ruff RUF012
warning: compute the dict using the existing symbols _POWER_LEVELS and
H200_GPUS_PER_ENGINE at module scope (e.g., next to _POWER_LEVELS) and replace
the class attribute reference with that module-level constant so the class no
longer defines a mutable attribute.
In `@components/src/dynamo/planner/tests/integration/test_aic_power_optimizer.py`:
- Line 742: The test unpacks n_p and n_d from opt._sweep_replicas(1, 5, 3200,
2400, None) but never uses n_p; update the test to either assert something about
n_p (for example expected value) or mark it as intentionally unused by renaming
it to _n_p (or simply _), e.g., change the unpacking from "n_p, n_d =
opt._sweep_replicas(...)" to "_, n_d = opt._sweep_replicas(...)" or "_n_p, n_d =
opt._sweep_replicas(...)" so the linter/test clearly reflects that n_p is not
used.
In `@components/src/dynamo/planner/tests/unit/test_prometheus.py`:
- Line 896: In components/src/dynamo/planner/tests/unit/test_prometheus.py there
are tuple unpackings like "recent, per_worker_avg, cluster_avg = result" where
some unpacked variables are unused; change the unused names to be prefixed with
an underscore (e.g., "recent, _per_worker_avg, _cluster_avg = result" or replace
with "_" placeholders) so the intent is explicit. Update the other similar
unpackings in the same file (the occurrences that unpack into recent,
per_worker_avg, cluster_avg) to follow the same underscore-prefixed naming
convention.
In `@deploy/power_agent/daemonset.yaml`:
- Around line 93-99: The current resources.requests and resources.limits in
daemonset.yaml (cpu: "50m"/"200m", memory: "64Mi"/"128Mi") are likely too low
for production; validate and raise them based on measured NVML API latency, 15s
controller reconcile list/watch overhead, and max GPUs per node by load-testing
with representative nodes (e.g., 8× H200 SXM). Measure
container_memory_working_set_bytes and rate(container_cpu_usage_seconds_total)
under stress to find steady-state and peak usage, then update resources.requests
to cover steady-state usage and resources.limits to provide safe headroom (avoid
OOM/throttling), and iterate: simulate NVML-heavy loops, reconcile cycles, and
multiple GPUs, adjust values in daemonset.yaml accordingly and add monitoring
alerts if working_set or CPU rate approaches limits.
In `@examples/deployments/powerplanner/README.md`:
- Around line 52-54: The fenced code block containing the budget formula
currently has no language specified; update the opening fence from ``` to a
language label (e.g., ```text or ```plaintext) so the block becomes ```text (or
```plaintext) followed by the same formula `total_gpu_power_limit =
(rack_capacity_W × headroom_factor) − non_gpu_overhead` and a closing ```,
ensuring consistency and clearer rendering in README.md.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a7e07320-b76e-4140-a984-1251cbfdf781
📒 Files selected for processing (47)
components/power_agent/README.mdcomponents/power_agent/power_agent.pycomponents/power_agent/tests/__init__.pycomponents/power_agent/tests/test_apply_cap.pycomponents/power_agent/tests/test_cgroup_parser.pycomponents/power_agent/tests/test_multi_pod_policy.pycomponents/power_agent/tests/test_shutdown.pycomponents/src/dynamo/planner/config/defaults.pycomponents/src/dynamo/planner/config/planner_config.pycomponents/src/dynamo/planner/connectors/kubernetes.pycomponents/src/dynamo/planner/connectors/kubernetes_api.pycomponents/src/dynamo/planner/core/base.pycomponents/src/dynamo/planner/core/budget.pycomponents/src/dynamo/planner/core/load_scaling.pycomponents/src/dynamo/planner/core/state_machine.pycomponents/src/dynamo/planner/core/throughput_scaling.pycomponents/src/dynamo/planner/core/types.pycomponents/src/dynamo/planner/monitoring/aic_power_optimizer.pycomponents/src/dynamo/planner/monitoring/planner_metrics.pycomponents/src/dynamo/planner/monitoring/traffic_metrics.pycomponents/src/dynamo/planner/tests/integration/test_actuation_knobs_live.pycomponents/src/dynamo/planner/tests/integration/test_aic_power_e2e_sim.pycomponents/src/dynamo/planner/tests/integration/test_aic_power_optimizer.pycomponents/src/dynamo/planner/tests/integration/test_metric_paths_live.pycomponents/src/dynamo/planner/tests/unit/test_actuation_knobs.pycomponents/src/dynamo/planner/tests/unit/test_kubernetes_connector.pycomponents/src/dynamo/planner/tests/unit/test_prometheus.pycomponents/src/dynamo/planner/tests/unit/test_state_machine.pycomponents/src/dynamo/profiler/utils/aic_dataframe.pydeploy/helm/charts/platform/Chart.yamldeploy/helm/charts/platform/README.mddeploy/helm/charts/platform/components/operator/Chart.yamldeploy/helm/charts/platform/components/operator/templates/planner.yamldeploy/planner-pod-rbac-dev.yamldeploy/power_agent/daemonset.yamldeploy/power_agent/rbac.yamldocs/components/planner/dpp-dev-env.mddocs/design-docs/powerplanner-design.mdexamples/deployments/powerplanner/MULTI_DGD.mdexamples/deployments/powerplanner/PIPECLEAN.mdexamples/deployments/powerplanner/README.mdexamples/deployments/powerplanner/disagg-conservative-cold-start.yamlexamples/deployments/powerplanner/disagg-power-aware.yamlexamples/deployments/powerplanner/verify_poweraware.bashtools/compute_power_budget.pytools/integrate_aic_power_data.pytools/validate_aic_power_integration.py
| pod_annotations: list[tuple[str, Optional[str]]] = [] | ||
| for proc in procs: | ||
| uid = _extract_pod_uid_from_cgroup(proc.pid) | ||
| if uid is None: | ||
| continue # non-K8s process — skip | ||
| if uid in uid_to_annotation: | ||
| pod_annotations.append((uid, uid_to_annotation[uid])) |
There was a problem hiding this comment.
Same-pod multi-PID workloads are incorrectly flagged as multi-pod-per-GPU.
_reconcile_gpu appends one entry to pod_annotations per PID rather than per pod UID. NVML's nvmlDeviceGetComputeRunningProcesses returns every CUDA-context process on the GPU, so a single pod with multiple GPU-bound processes (TP workers, MPS clients, MPI ranks, child workers spawned by serving stacks) yields N tuples with the same UID and same annotation. _resolve_cap_for_gpu then sees len(pod_annotations) > 1, all unique == 1 → emits the WARN "multi-pod-per-GPU is unsupported topology" and increments multi_pod_gpu_total{disposition="agree"} on every reconcile (every 15 s).
The applied cap is still correct (values agree), but the warnings/metrics are false positives and contradict the policy comment at lines 319–325. Deduplicate by UID first:
🩹 Proposed fix
- pod_annotations: list[tuple[str, Optional[str]]] = []
- for proc in procs:
- uid = _extract_pod_uid_from_cgroup(proc.pid)
- if uid is None:
- continue # non-K8s process — skip
- if uid in uid_to_annotation:
- pod_annotations.append((uid, uid_to_annotation[uid]))
+ # Deduplicate by pod UID — multiple PIDs from the same pod (TP workers,
+ # MPI ranks, MPS clients) must count as one pod, not N.
+ seen_uids: set[str] = set()
+ pod_annotations: list[tuple[str, Optional[str]]] = []
+ for proc in procs:
+ uid = _extract_pod_uid_from_cgroup(proc.pid)
+ if uid is None or uid in seen_uids:
+ continue # non-K8s process or already-counted pod
+ seen_uids.add(uid)
+ if uid in uid_to_annotation:
+ pod_annotations.append((uid, uid_to_annotation[uid]))A regression test for "single pod, two PIDs" in test_multi_pod_policy.py would lock this in.
Based on retrieved learning from PR 5280: "Either enforce an explicit one-to-one mapping with clear guards and comments, or extend the implementation to correctly handle multiple pods per GPU".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@components/power_agent/power_agent.py` around lines 449 - 455, _reconcile_gpu
currently appends one (uid, annotation) tuple per PID causing duplicate entries
for multi-PID pods; change the logic to deduplicate by pod UID before calling
_resolve_cap_for_gpu (e.g., build a set or map of seen UIDs using
_extract_pod_uid_from_cgroup and uid_to_annotation, then produce a single (uid,
annotation) entry per UID into pod_annotations) so _resolve_cap_for_gpu sees
unique pods; update references to pod_annotations and keep uid_to_annotation
lookup as before; add a regression test in test_multi_pod_policy.py that covers
"single pod, two PIDs" to prevent regressions.
| @@ -0,0 +1 @@ | |||
|
|
|||
There was a problem hiding this comment.
Add the standard SPDX header to this new package file.
copyright-checks is already failing on this file, so the empty __init__.py still needs the repo header before merge.
🧰 Tools
🪛 GitHub Actions: Copyright Checks / 0_copyright-checks.txt
[error] 1-1: Copyright checkers detected invalid/missing header: components/power_agent/tests/init.py
🪛 GitHub Actions: Copyright Checks / copyright-checks
[error] 1-1: Copyright check failed: invalid/missing header.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@components/power_agent/tests/__init__.py` at line 1, Add the standard
repository SPDX header to the new package module
components.power_agent.tests.__init__.py; open the empty __init__.py and prepend
the project's copyright/SPDX license header block (the same header used across
other top-level Python files in the repo) so the file passes copyright-checks.
| import power_agent | ||
| from power_agent import ( | ||
| POWER_ANNOTATION_KEY, | ||
| PowerAgent, | ||
| _apply_cap, | ||
| _clamp_to_constraints, | ||
| _nvml_uuid, | ||
| ) |
There was a problem hiding this comment.
Guard the module import so optional NVML/K8s deps don't break collection.
This imports power_agent eagerly, so any missing optional dependency will fail test collection instead of skipping the module. Wrap the import in try/except ImportError and call pytest.skip(..., allow_module_level=True) at module scope. As per coding guidelines, **/test_*.py: Use try/except ImportError to skip test collection when optional packages are absent with pytest.skip(..., allow_module_level=True).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@components/power_agent/tests/test_apply_cap.py` around lines 17 - 24, The
test module imports power_agent eagerly which can raise ImportError for optional
NVML/K8s deps and break test collection; wrap the import block that pulls in
power_agent and symbols (POWER_ANNOTATION_KEY, PowerAgent, _apply_cap,
_clamp_to_constraints, _nvml_uuid) in a try/except ImportError and in the except
call pytest.skip("optional dependency missing: <package>",
allow_module_level=True) at module scope (ensure pytest is imported at top-level
so skip can be called).
| try: | ||
| import httpx | ||
| except ImportError as exc: | ||
| raise RuntimeError( | ||
| "httpx is required for admission-control POST support. " | ||
| "Install it with: pip install httpx" | ||
| ) from exc |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Move the httpx import to module top.
This import is inside the function body, which violates the Python guidelines. The guidelines state: "flag any import inside function/method/class bodies (keep imports at module top)" and "try/except ImportError around imports must be avoided unless the dependency is truly optional and a fallback/None path exists."
Here, the function raises RuntimeError when httpx is missing, so there's no fallback—httpx is a hard requirement for this code path. Move the import to the top of the file.
📦 Move import to module top
import json
import logging
import os
from typing import Optional
+
+try:
+ import httpx
+except ImportError:
+ httpx = None
from dynamo.planner.config.defaults import SubComponentType, TargetReplicaThen in the function:
) -> None:
"""POST /busy_threshold to a single frontend pod (§6.7).
Raises on HTTP error or network failure — callers should gather with
``return_exceptions=True`` and inspect the results.
Per-pod timeout is 5 s (frontend is in-cluster; long timeouts mask
saturation events).
"""
- try:
- import httpx
- except ImportError as exc:
+ if httpx is None:
raise RuntimeError(
"httpx is required for admission-control POST support. "
"Install it with: pip install httpx"
- ) from exc
+ )Based on learnings: Python guidelines require imports at module top unless truly optional with a fallback/None path. As per coding guidelines, this pattern avoids dual execution modes and makes dependencies explicit.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@components/src/dynamo/planner/connectors/kubernetes.py` around lines 384 -
390, Move the httpx import to the module top and remove the try/except
ImportError block in the admission-control POST handling code: add a plain
"import httpx" at the top of
components/src/dynamo/planner/connectors/kubernetes.py, delete the in-function
try/except that currently raises RuntimeError for missing httpx, and rely on the
module-level import to surface ImportError early; update any references in the
admission-control POST support code to use httpx directly (no fallback or
conditional logic needed).
| async def _apply_power_annotations(self) -> None: | ||
| """Annotate worker pods with per-GPU power limits (Phase 1). | ||
|
|
||
| Reads the actual annotation from each Pod object returned by | ||
| get_component_pods(). Only PATCHes when annotation is missing or wrong. | ||
| K8s is the source of truth — no local cache. | ||
|
|
||
| Called after _apply_effects() on every tick so newly-created pods | ||
| pick up the cap within one tick, and AIC-driven cap changes (Phase 3+) | ||
| propagate to all running pods automatically. | ||
| """ | ||
| if not self.config.enable_power_awareness: | ||
| return | ||
| if not isinstance(self.connector, KubernetesConnector): | ||
| return |
There was a problem hiding this comment.
Advisory mode still writes to the cluster through the new power/admission paths.
_apply_scaling_targets() honors self.config.advisory, but _apply_power_annotations() and the autoset /busy_threshold fanout do not. With advisory=True and power awareness enabled, the planner will still PATCH pod annotations and POST frontend thresholds, which breaks the dry-run contract and changes live behavior.
Suggested change
async def _apply_power_annotations(self) -> None:
"""Annotate worker pods with per-GPU power limits (Phase 1).
@@
- if not self.config.enable_power_awareness:
+ if self.config.advisory or not self.config.enable_power_awareness:
return
if not isinstance(self.connector, KubernetesConnector):
return
@@
- if self.config.admission_mode == "autoset" and isinstance(
- self.connector, KubernetesConnector
- ):
+ if (
+ not self.config.advisory
+ and self.config.admission_mode == "autoset"
+ and isinstance(self.connector, KubernetesConnector)
+ ):
await self._fanout_busy_threshold(new_config)Also applies to: 899-902
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@components/src/dynamo/planner/core/base.py` around lines 753 - 767, The
planner's advisory/dry-run flag isn't respected: update _apply_power_annotations
to bail out when self.config.advisory is True (in addition to existing checks)
so it never PATCHes pod annotations in advisory mode; likewise locate the
autoset fanout logic that posts /busy_threshold (the code around the second
occurrence noted) and add the same advisory guard to skip POSTing frontend
thresholds when self.config.advisory is True. Ensure both guards check
self.config.advisory before making any Kubernetes PATCH or HTTP POST calls so
advisory mode performs no writes.
| # Apply with an explicit namespace flag: | ||
| # kubectl apply -n <your-namespace> -f deploy/power_agent/rbac.yaml | ||
| # | ||
| # The ClusterRoleBinding subject below references that same namespace via | ||
| # `${POWER_AGENT_NAMESPACE}` — substitute it manually or pipe through envsubst: | ||
| # POWER_AGENT_NAMESPACE=<your-namespace> envsubst < rbac.yaml | kubectl apply -f - |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
cat -n deploy/power_agent/rbac.yamlRepository: ai-dynamo/dynamo
Length of output: 1758
The plain kubectl apply -n ... path leaves the binding broken because it doesn't substitute the placeholder.
When you run kubectl apply -n <your-namespace> -f deploy/power_agent/rbac.yaml, the ServiceAccount is created in the specified namespace via the -n flag, but the ClusterRoleBinding.subjects[0].namespace field (line 50) retains the literal string ${POWER_AGENT_NAMESPACE}. Since kubectl does not perform variable substitution, the binding references a non-existent namespace and the pod-read permissions never take effect. Only the envsubst path (lines 13–14) substitutes the placeholder correctly. Recommend either: (1) removing the first documentation path, or (2) updating it to clarify that manual substitution of line 50 is required before applying.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@deploy/power_agent/rbac.yaml` around lines 9 - 14, The docs show an apply
path that leaves the ClusterRoleBinding.subjects[0].namespace set to the literal
placeholder `${POWER_AGENT_NAMESPACE}`, which breaks the binding because kubectl
does not perform variable substitution; update the README snippet to either
remove the misleading "kubectl apply -n <your-namespace> -f
deploy/power_agent/rbac.yaml" path or change it to explicitly instruct the user
to substitute the placeholder (e.g., edit the ServiceAccount and
ClusterRoleBinding.subjects[0].namespace or use envsubst) so the ServiceAccount
namespace and ClusterRoleBinding.subjects[0].namespace match.
| print() | ||
| print(" Additional planner settings to review:") | ||
| print(f" power_agent_safe_default_watts: {math.ceil(self.gpu_tdp_w * 0.70)}") | ||
| print(f" (= 70% of TDP; adjust for your model's idle power profile)") |
There was a problem hiding this comment.
Remove the stray f prefix on Line 148.
Ruff is flagging this as F541, so the file will keep failing pre-commit until that string is a plain literal again.
🧰 Tools
🪛 Ruff (0.15.12)
[error] 148-148: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/compute_power_budget.py` at line 148, The print statement currently
uses an unnecessary f-string prefix ("print(f\" (= 70% of TDP; adjust for
your model's idle power profile)\")") which triggers Ruff F541; change the call
to a plain string literal by removing the leading "f" so the statement is normal
(e.g., update the print call in compute_power_budget.py where that string is
printed). Ensure no other f-prefix remains on that exact print invocation.
| total_gpus = args.nodes * args.gpus_per_node | ||
| non_gpu_overhead = ( | ||
| args.non_gpu_overhead | ||
| if args.non_gpu_overhead is not None | ||
| else args.nodes * _NON_GPU_OVERHEAD_PER_NODE_W | ||
| ) | ||
|
|
||
| if not (0 < args.headroom_factor <= 1.0): | ||
| print( | ||
| f"ERROR: --headroom-factor must be in (0, 1]. Got {args.headroom_factor}.", | ||
| file=sys.stderr, | ||
| ) | ||
| sys.exit(2) | ||
|
|
||
| inp = SizingInput( | ||
| rack_capacity_w=args.rack_capacity, | ||
| headroom_factor=args.headroom_factor, | ||
| non_gpu_overhead_w=non_gpu_overhead, | ||
| total_gpus=total_gpus, | ||
| gpu_tdp_w=gpu_tdp, | ||
| dgd_gpu_counts=args.dgd_gpus or [], |
There was a problem hiding this comment.
Validate the numeric CLI inputs before computing the budget.
argparse accepts 0 and negatives for these fields. With --nodes 0 / --gpus-per-node 0 the tool can emit a meaningless budget for zero GPUs, and in multi-DGD mode it can fall into gpus / inp.total_gpus inside compute_budget().
Suggested fix
total_gpus = args.nodes * args.gpus_per_node
non_gpu_overhead = (
args.non_gpu_overhead
if args.non_gpu_overhead is not None
else args.nodes * _NON_GPU_OVERHEAD_PER_NODE_W
)
+
+ if args.nodes <= 0 or args.gpus_per_node <= 0:
+ print("ERROR: --nodes and --gpus-per-node must be > 0.", file=sys.stderr)
+ sys.exit(2)
+ if gpu_tdp <= 0:
+ print("ERROR: --gpu-tdp must be > 0.", file=sys.stderr)
+ sys.exit(2)
+ if non_gpu_overhead < 0:
+ print("ERROR: --non-gpu-overhead must be >= 0.", file=sys.stderr)
+ sys.exit(2)
+ if any(g <= 0 for g in (args.dgd_gpus or [])):
+ print("ERROR: every --dgd-gpus entry must be > 0.", file=sys.stderr)
+ sys.exit(2)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/compute_power_budget.py` around lines 316 - 336, Validate the numeric
CLI inputs before building SizingInput and calling compute_budget: check that
args.nodes and args.gpus_per_node are positive integers (greater than 0) so
total_gpus = args.nodes * args.gpus_per_node > 0, and also validate
args.rack_capacity and gpu_tdp (and any other numeric args used in SizingInput
such as args.non_gpu_overhead and elements of args.dgd_gpus) are positive where
applicable; if any validation fails, print a clear stderr error and exit
non-zero. Locate the creation of total_gpus and the SizingInput construction in
compute_power_budget.py (symbols: total_gpus, SizingInput) and add these
pre-checks before calling compute_budget() to avoid division-by-zero or
meaningless budgets.
| h200_data = args.h200_data.resolve() if args.h200_data else None | ||
| b200_data = args.b200_data.resolve() if args.b200_data else None |
There was a problem hiding this comment.
Reject nonexistent data roots before starting integration.
A typo in --h200-data or --b200-data currently degrades into a stream of per-subdir warnings and a zero-file success exit. That makes a bad input path look like “nothing to copy” instead of a hard failure.
Suggested fix
h200_data = args.h200_data.resolve() if args.h200_data else None
b200_data = args.b200_data.resolve() if args.b200_data else None
+
+ for flag, path in (("--h200-data", h200_data), ("--b200-data", b200_data)):
+ if path is not None and not path.is_dir():
+ print(f"ERROR: {flag} directory not found: {path}", file=sys.stderr)
+ sys.exit(1)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/integrate_aic_power_data.py` around lines 230 - 231, After resolving
args.h200_data and args.b200_data into h200_data and b200_data, validate that
any non-None Path actually exists (and is a directory) and fail fast if not: if
h200_data is not None and not h200_data.exists()/is_dir() (same for b200_data)
log/raise a clear error and exit with non-zero status (or call the argparse
parser.error) so a mistyped root produces a hard failure instead of silent
"nothing to copy"; update the validation near the h200_data/b200_data assignment
to perform these checks and produce a descriptive message including the
offending path variable name.
| 5. Budget constraint: with a budget tighter than 1 full replica's power draw, | ||
| the optimizer correctly returns fewer replicas than the requested max. |
There was a problem hiding this comment.
The advertised tight-budget validation never actually runs.
The docstring says this script verifies the optimizer under a budget tighter than one full replica, but the only optimizer run uses an intentionally generous 80_000 W cap and never asserts replica reduction. This can pass even when the tight-budget path is broken.
Also applies to: 231-236, 272-290
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/validate_aic_power_integration.py` around lines 20 - 21, The
tight-budget test never runs because the script only calls the optimizer with a
generous hardcoded cap (80_000); add an explicit tight-budget run that sets the
budget to just under one replica's power and assert the optimizer returns fewer
replicas than requested. Concretely: compute the single-replica draw (use the
same attribute/func the script uses to estimate replica power, e.g.,
replica_power_watts or equivalent), call the optimizer (the existing
optimizer/optimize/run_optimizer call) with budget = single_replica_power - 1,
and add an assertion that the returned replica count < requested_max_replicas
(mirror the style used elsewhere in the file). Ensure this new check is executed
(not behind a dead branch) alongside the existing generous-cap test (the one
using 80_000).
grahamking
left a comment
There was a problem hiding this comment.
I'm not sure if you intended to send an 11,000+ line PR, but that's what this is. Please split this into separate far smaller pieces.
Ideally start with an Issue to set the context and attach your agent's plan (if using one).
218306c to
1ac66af
Compare
Adds power-aware intelligence to the Dynamo planner across three layers, plus tooling that pipecleans the Phase 4 selection path with currently available single-TDP AIC data. Reimplementation of unmerged PR #5280 against post-refactor ToT (planner/utils/ no longer exists; configuration is Pydantic; layout is config/, connectors/, core/, monitoring/). Full design: docs/design-docs/powerplanner-design.md Repro / dev environment: docs/components/planner/dpp-dev-env.md What's in this PR ----------------- Phase 1 — infrastructure - Power Agent DaemonSet: components/power_agent/{power_agent.py,tests/} Watches pod annotations, maps cgroup→pod_uid→GPU via NVML, calls nvmlDeviceSetPowerManagementLimit on a 15 s reconciliation loop. Multi-pod-per-GPU policy, fail-closed safe-default, SIGTERM restore. - Pod annotation actuation: connectors/kubernetes.py writes dynamo.nvidia.com/gpu-power-limit on worker pods. - K8s manifests: deploy/power_agent/{daemonset,rbac}.yaml. - Operator chart fix: ClusterRole now grants `pods` permissions (deploy/helm/charts/platform/components/operator/templates/planner.yaml, chart bumped to 1.2.1). Older clusters can apply deploy/planner-pod-rbac-dev.yaml as a temporary patch. Phase 2 — budget enforcement - _apply_power_budget() in core/state_machine.py clamps replica scaling within a watt budget. New PlannerConfig fields: enable_power_awareness, total_gpu_power_limit (required when awareness enabled — validator enforces, no silent default), prefill/decode_engine_gpu_power_limit, power_agent_safe_default_watts. Phase 3 — AIC optimizer - monitoring/aic_power_optimizer.py: AIConfigurator sweep picks (replica, power) configs that maximise throughput within the watt budget. Per-component EMA correction (c_ttft, c_itl, c_power_p, c_power_d for disagg; c_power_agg for agg) closes the loop on silicon-vs-AIC drift. SLA gate, budget gate, capacity-exceeded re-optimize, auto-disable on consecutive failures. - core/base.py: NativePlannerBase plumbing (admission threshold push, named-port resolution, AIC integration). Phase 4 — pipeclean (using existing single-TDP data) - tools/integrate_aic_power_data.py: copies measured H200/B200 power data into an AIC checkout and reinstalls. - tools/validate_aic_power_integration.py: asserts estimate_perf() returns power_w in [100, 710] W (not 0.0, not 700.0 TDP fallback). - tools/compute_power_budget.py: rack-capacity → safe-budget helper. - examples/deployments/powerplanner/: PIPECLEAN.md runbook, disagg-{power-aware,conservative-cold-start}.yaml, verify_poweraware.bash (8-section health check inc. Phase 4 preview), MULTI_DGD.md operator playbook, README.md. Bug fixes (found by live integration tests, fixed in this patchset) ------------------------------------------------------------------- - Operator chart 1.2.0 ClusterRole had no `pods` rules → planner couldn't list workers or patch annotations. Fixed in chart template; workaround manifest provided for clusters not yet upgraded. - Stale label selectors in connectors/kubernetes.py: `dynamo-graph-deployment` → `dynamo-graph-deployment-name`, `dynamo-service` → `dynamo-component`/`dynamo-component-type`. - DCGM queries used `pod=~` (matched the DCGM exporter pod) and the wrong namespace; corrected to `exported_pod=~` + `exported_namespace=<k8s-namespace>` and pod regex now matches the operator's `<dgd>-<replica-idx>-<service-key>-...` format (monitoring/traffic_metrics.py). - Frontend-source metric methods returned NaN on quiet clusters (0/0 in `increase(_sum)/increase(_count)`); router path had a NaN guard, frontend didn't — added matching `math.isnan` filter. Tests ----- Verified 2026-05-10 on a clean from-scratch repro on a live Azure AKS dev cluster (qwen3-quickstart DGD, single dev namespace): Suite Pass Skip Fail ---------------------------------------------------- ---- ---- ---- planner/tests/unit/ 465 0 0 planner/tests/integration/test_aic_power_e2e_sim 15 0 0 planner/tests/integration/test_aic_power_optimizer 34 0 0 planner/tests/integration/test_metric_paths_live 22-23 3-2 0 planner/tests/integration/test_actuation_knobs_live 10-11 1-0 0 power_agent/tests/ 43 0 0 ---------------------------------------------------- ---- ---- ---- Total (cold deploy / after sanity chat completion) 590-591 4-3 0 Documented skips: - test_frontend_metric_series_exists — passes after any traffic. - TestDirectRouterMetricsClientLive::* — LocalRouter doesn't expose dynamo_frontend_worker_* gauges (open-question #14 in design doc). - TestScalingRealMutation — opt-in disruptive test, gated by RUN_DISRUPTIVE_TESTS=1 (passes when enabled). Repro recipe: docs/components/planner/dpp-dev-env.md "From-Scratch Repro Script" section. Compatibility ------------- Power awareness is opt-in (enable_power_awareness=False by default). AIC optimizer is opt-in (enable_aic_optimizer=False by default). No existing planner behavior changes when both flags are off. Phase 4 (multi-power-level AIC sweep) and Phase 5 (silicon validation) remain follow-up work; this PR delivers Phases 1–3 plus the Phase 4 pipeclean code path so it can be exercised before AIC ships the full sweep API. Signed-off-by: Kai Ma <kaim@nvidia.com> Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: Kai Ma <kaim@nvidia.com>
Adds the synthetic-metrics stress testbed for the Power Planner state machine + AIC optimizer, plus the small source-side hooks that let the testbed swap in fake AIC and clamp tick-time monotonicity. Testbed (alpha + gamma classes, no GPU / no live cluster required): - 27 alpha-class scenarios driven by parameterized fakes (FakeAIC, FakeActuator, FakePrometheus, FakePlannerMetrics) wired against the real PlannerStateMachine + AICPowerOptimizer - 3 gamma-class scenarios that overlay synthetic power on the real dynamo-mocker scheduler + PlannerReplayBridge for realism back-stop - Self-consistency + overlay determinism tests - YAML scenario catalog (A/B/C/D/E/F/G families) with shared _base/ and systems/ definitions; loadable gate prevents silent schema drift - Auto-skips gamma when the local Python wheel ships only the testbed stub (no Rust mocker), so the same suite is runnable on Windows laptops AND inside the dev pod - Verified: 82 passed, 5 skipped on Windows; 86 passed, 1 skipped on the dev pod against the older create_disagg bridge Source enablement: - AICPowerOptimizer: optional _aic_estimator_factory injection point for testbed substitution; production path unchanged - ReplayPlannerAdapter: clamp now_s = max(tick.at_s, bridge_now_s) so sparse traces do not wedge throughput-tick scheduling Misc: - components/src/conftest.py: top-of-src dynamo._core stub installer (needed before any dynamo.* package init runs) - scripts/inspect_dcgm_qwen3.py, scripts/inspect_live_metrics.py: Phase 4 cluster-side metric probes - test_planner_power_launch.py: Phase 4 advisory-mode launch script used to validate the planner -> annotation -> NVML enforcement chain - docs/design-docs/powerplanner-testbed-design.md: full testbed design + defect/fix history through v2.3 (dev-pod gamma validation) Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: Kai Ma <kaim@nvidia.com>
Pre-Phase-5 ("hardware validation" per powerplanner-design.md §11) housekeeping:
makes the dev environment shareable across teammates by removing personal
identifiers, parameterizing all dev-pod / probe references, and folding 11
review fixes into the three Phase 1-4 design documents.
Dev-env hardening
-----------------
* Personal namespace (`kaim-dynamo-system*`) and pinned cluster node ID
(`aks-a100a-36888584-vmss000002`) removed from every dev-env asset:
- Root-level `dev-pod.yaml`, `qwen3-quickstart-dgd.yaml`, and
`Dockerfile.planner-dev` moved to `deploy/planner/dev/` with
`${NS}` / `${DGD}` / `${DYN_NS}` envsubst placeholders and inline
usage instructions.
- Root-level `test_k8s_access.py` moved to `scripts/dev/` and reads
`DYN_PARENT_DGD_K8S_NAMESPACE` (or `POD_NAMESPACE`) at runtime.
- 5 `scripts/inspect_*.py` cluster probes parameterized via
`DYN_PARENT_DGD_K8S_NAMESPACE`; failure mode is loud (SystemExit)
rather than a hard-coded namespace.
- `deploy/power_agent/dev-pod.yaml`: `nodeName` switched to
`<GPU_NODE_NAME>` placeholder with a `kubectl get pods ...`
one-liner showing how to discover the right node.
* `.gitignore` hardened to enforce the existing `.tmp-*` "intentionally
not committed" convention (matches `examples/deployments/powerplanner/
.tmp-gp-minimal.yaml`) and to block the four common root-level
personal-scratch files from sneaking back in via `git add .`.
dpp-dev-env.md updates
----------------------
* All 10 path references rewritten to point at the new
`deploy/planner/dev/` and `scripts/dev/` homes.
* New §5 ("Deploy the Dev Pod") subsection documenting the
`${NS}` / `${DGD}` / `${DYN_NS}` placeholder workflow with both an
`envsubst` (Linux/WSL) path and a Windows edit-in-place path.
* Quick Deploy Checklist filename corrected from the stale-DGDR
`qwen3-quickstart.yaml` to `qwen3-quickstart-dgd.yaml`, plus a
cross-reference to the One-Time Setup §4 warning.
* DGD-ready wait command standardized to the programmatic
`kubectl wait --for=jsonpath='{.status.state}'=successful` form
in both §4 and the Checklist (removes the manual-`-w` divergence).
Design-doc review pass (powerplanner-design.md, powerplanner-testbed-design.md)
-------------------------------------------------------------------------------
* powerplanner-design.md
- Header `Status` flipped from `Draft` to `Validated (Phases 1-3 -
590/4 cold + 86/1 testbed; see §9.0 / §9.0.1). Phases 4-5 still
draft.` with last-validation date.
- §3.1: added an explicit note that `aic_interpolation` and `mode`
are pre-existing PlannerConfig fields (owned by
`monitoring/aic_interpolation.py`) and intentionally absent from
the Names Registry.
- §5.7 / §6.7: reworded the cross-reference to failure mode #5 so
it correctly points at the *throughput regression* case rather
than reading as a blanket "config revert".
- §6.5 pseudocode: replaced module-level `self.device_count` with
`pynvml.nvmlDeviceGetCount()` (the original was syntactically
incorrect outside the daemon class).
- §13 Open Question #11: corrected the duplicated
`scheduled_decode_kv_tokens` typo so the agg-mode gate now reads
`scheduled_prefill_tokens + scheduled_decode_kv_tokens` matching
§5.3.
* powerplanner-testbed-design.md
- §7: added a "Numeric-suffix convention" note explaining the
intentional D21/E21 ID collision (IDs unique by filename + tuple,
not renumbered).
- §11: corrected "Six guards" -> "Seven guards" to match the seven
items actually listed.
- §5.2: typo `MNB` -> `MNBT (max_num_batched_tokens)`.
- §C.14: verbatim test-output `30 PASSED` -> `31 PASSED` (1 wrapper
+ 30 parametrized) so the listing matches the actual run.
Verification
------------
* Testbed (alpha + gamma): 82 passed, 5 skipped (matches the documented
Windows baseline; gamma auto-skipped without the Rust mocker).
* AIC no-cluster integration (test_aic_power_optimizer.py +
test_aic_power_e2e_sim.py): 49 passed.
* Unit tests: 456 passed; the 9 remaining failures are pre-existing
Windows-only environment limitations (cp1252 codec, `os.killpg`
POSIX-only, missing `filterpy`) confirmed via `git stash` to be
untouched by this commit.
* All touched .py files: `python3.10 -m py_compile` clean.
* All touched .yaml files: `yaml.safe_load_all` clean.
* `ReadLints` over all 12 touched files: no errors.
* Final `rg "kaim|aks-a100a-36888584"` outside committed
`tests/fault_tolerance/...` (pre-existing, separate component) and
`examples/.../.tmp-gp-minimal.yaml` (now gitignored): zero hits.
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Kai Ma <kaim@nvidia.com>
Three minor issues surfaced by an end-to-end run of the optimizer
against real H200/B200 power-enabled AIC data (Llama-3.1-8B,
ISL=2048, OSL=256):
1. Non-physical AIC power_w. AIC's per-kernel 3D-cubic interpolator
extrapolates `generation_attention` to ~1563 W per op on H200
vLLM 0.19.1 at batch >= 256, producing an aggregate power_w of
1275 W against a 700 W TDP and a 721 W measured peak. Cap_d was
ending up at 1276 W, which NVML would silently clamp to TDP --
leaving the planner's stated cap fictional and pegging the EMA at
0.5 within a few ticks. AICPowerOptimizer now clamps the value
used for cap derivation at TDP*1.1; the *raw* value is preserved
on PowerAwareConfig as the EMA denominator so c_power_*
coefficients visibly converge below 1.0 whenever AIC over-predicts,
giving operators an independent signal alongside the new
`aic_power_w_clamped_total{side}` counter. New design-doc §8 row 14
wires the cascade with the existing §6 EMA-peg signal. 5 new
integration tests (TestAICPowerWClamp): non-physical clamp, raw
value preserved on config, side-selective clamp, threshold
boundary, agg-mode averaged path. The B200 TRT-LLM 1.3.0rc6 data
stays cold on this clamp because its lattice is ~10x denser than
the H200 sandbox -- confirms this is a data-density issue, not a
fundamental AIC bug.
2. Wrong runtime documentation. optimize docstring (and
powerplanner-design.md sec 6.4 + sec 10 row 12) claimed ~6 s per
sweep. Measured 2026-05-11 against real sandboxes: cold first
call is 3.5 s (H200 vLLM) / 5.9 s (B200 TRT-LLM), dominated by
perf-database load from disk; warm calls are 10-15 ms once
databases_cache is populated. Closed-loop CPU duty cycle at the
default aic_reoptimize_interval=300s is ~0.004 percent, not the
~2 percent the docstring implied. Docs updated to distinguish
cold vs warm and to clarify that the hysteresis guard exists to
suppress downstream PATCH/POST fanout, not AIC CPU.
3. No regression coverage against real AIC data. New opt-in testbed
class wires the AIC sandbox into pytest via the AIC_SANDBOX_DIR
env var (marker real_aic). 11 tests parametrized over every
(system, backend) tuple present in the sandbox: smoke + clamp
contract + multi-tick EMA drift loop (well-calibrated -> 1.0,
over-predict -> pegs at 0.5, under-predict -> ~1.5, hysteresis
fires at exactly aic_drift_consecutive_ticks, healthy load
produces zero spurious sweeps). Module-level skip when the env
var is unset, keeping the default test surface unchanged.
Test impact:
- tests/integration/test_aic_power_optimizer.py: 34 -> 39, all pass
- tests/testbed/tests/: 55 -> 66 with sandbox mounted (11 new
real_aic tests); 55 + 3 skips without (1 module skip + 2 gamma
skips), default test surface unchanged.
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Kai Ma <kaim@nvidia.com>
…bits) Mechanical post-rebase cleanup surfaced by `pre-commit run` on PR 9369: - isort/black reformat across 47 power-planner files - ruff F841: drop dead assignments (kwargs/sweep_fired/c_ttft_before_stable/clamp_threshold) - ruff F821: add missing TYPE_CHECKING imports for StructuredAssertion, ExprAssertion, NoiseModel, TickHistory in testbed forward-refs - ruff E402: `# noqa: E402` on intentional late imports after sys.path / stub install (testbed/conftest.py, test_planner_power_launch.py) - chmod +x: 5 scripts with shebangs (power_agent.py, compute/integrate/validate_aic_power tools, verify_poweraware.bash) No behavior changes. Signed-off-by: Kai Ma <kaim@nvidia.com>
…proportional_clamp_pair PR 9369 was based on a pre-#9294 _apply_global_budget that had an inline ceiling-shrink path with min(num_d, ...) to prevent decode up-scaling beyond original demand (Open Question #2). The rebase took main's #9294 refactor that delegates to the new budget.proportional_clamp_pair helper. That helper distributes the remaining budget proportionally and can over-allocate decode in the asymmetric (p_gpu != d_gpu) case. Re-introduce the clamp at the call site rather than in the shared helper, so we don't change the contract for other callers (e.g. SLA-mode scale-down consolidation, also using the helper). Fixes: test_decode_never_exceeds_input_when_prefill_binding Signed-off-by: Kai Ma <kaim@nvidia.com>
…ebase Main's #9246 (chore: suffix duration config fields with units) renamed PlannerConfig.ttft -> ttft_ms and PlannerConfig.itl -> itl_ms. Construction keeps working via pydantic's validation_alias=AliasChoices("ttft_ms", "ttft"), so PlannerConfig(ttft=..., itl=...) keyword construction is still accepted, but Python attribute access must use the new names. PR 9369 was based on the pre-rename code and read the SLAs via .ttft / .itl on PlannerConfig instances: - aic_power_optimizer.py: 2 reads (single-engine feasibility check) + 2 reads (should_reoptimize SLA-violation check) - test_aic_real_data.py: 1 read (synthesizing 2x-SLA traffic) All 4 + 1 sites updated. Surfaced as AttributeError when running the 26 alpha-class testbed scenarios (test_scenarios.py::test_alpha[*]). Fixes: 26 testbed alpha-scenario failures, 1 test_self_consistency alpha test Signed-off-by: Kai Ma <kaim@nvidia.com>
Rebase onto main pulled in PR #9246's PlannerConfig field renames ( tft-> tft_ms, itl->itl_ms, hroughput_adjustment_interval-> hroughput_adjustment_interval_seconds, load_adjustment_interval->load_adjustment_interval_seconds). Two of the four ms-suffix accesses were updated in the prior fixup, but power_aware_replay_adapter.py still accessed the old hroughput_adjustment_interval attribute. Pydantic's AliasChoices only covers constructor input, not attribute access — so all four gamma-class testbed tests failed with:: AttributeError: 'PlannerConfig' object has no attribute 'throughput_adjustment_interval'. Did you mean: 'throughput_adjustment_interval_seconds'? This patch renames the attribute access at line 111 to the new canonical field name. Constructor sites still work via AliasChoices. Signed-off-by: Kai Ma <kaim@nvidia.com>
…uced suites Add pytestmark with pre_merge/unit-or-integration/gpu_0/planner markers to seven testbed and live-integration test files introduced by this PR. The pytest-marker-report pre-commit hook (newly enforced on main) requires every test to declare a Lifecycle, Test Type, and Hardware category; without these markers it fails with 'Missing sets: 122' across all PR9369 tests. Marker assignments: testbed unit suites (test_fakes, test_overlay, test_scenarios_loadable) get unit+planner; testbed scenario drivers (test_scenarios, test_self_consistency) get integration+planner since they run full planner closed-loop simulations; live K8s tests (test_metric_paths_live, test_actuation_knobs_live) already had integration+gpu_0 but lacked a Lifecycle marker ? added pre_merge (they skip cleanly when no cluster) and planner. Signed-off-by: Kai Ma <kaim@nvidia.com>
Adds the 2-line NVIDIA SPDX header to 49 files introduced by the power-aware planner work (testbed scenario YAMLs, testbed Python suites, dev/inspect scripts, and root-level pipeclean launcher) so they pass copyright-checks. Extends the planner filter group in .github/filters.yaml to cover the new power_agent component, planner dev manifests, dev scripts under scripts/dev and scripts/inspect_*, and the tools/*power*.py helpers, so changed-files no longer reports them as uncovered. Signed-off-by: Kai Ma <kaim@nvidia.com>
Main PR #9558 (`feat: introduce RoutingConstraints + worker taints`) added `from dynamo._core import RoutingConstraints` to `dynamo.llm.__init__`. Without this symbol in the testbed runtime stub, every planner unit test that transitively imports `dynamo.common.utils.endpoint_types` (-> `dynamo.llm`) fails at collection time on dev machines that don't have the maturin-built native binding. Add `RoutingConstraints` to the stubbed class list in `tests/testbed/_runtime_stub.py`. Idempotent, doesn't affect CI or the dev pod (real `dynamo._core` is loaded there). Signed-off-by: Kai Ma <kaim@nvidia.com>
1ac66af to
cec5681
Compare
Adds the Power Agent DaemonSet (privileged, hostPID) that watches pods
for the dynamo.nvidia.com/gpu-power-limit annotation, parses
/proc/{pid}/cgroup to recover pod UIDs (handles all K8s QoS x cgroup
driver x CRI runtime variants), and calls
nvmlDeviceSetPowerManagementLimit per physical GPU.
Includes NVML cap clamping to SKU-specific [min_w, max_w] bounds,
UUID-gated orphan-cap restoration on startup, SIGTERM-handler default-
TGP restore, and a multi-pod-per-GPU policy that falls back to
power_agent_safe_default_watts on conflict.
43 unit tests covering apply_cap, cgroup_parser, multi_pod_policy,
shutdown. Standalone component - no coupling to planner internals.
CI filter (.github/filters.yaml) extends the planner group to cover the
new component + deploy paths so changes here trip the planner job set.
Part of the PR #9369 split (PR 1a of 6). See docs/design-docs/pr9369-split-plan.md.
Signed-off-by: Kai Ma <kaim@nvidia.com>
…rometheus Adds the planner-side mechanism for power-aware operation: - 5 PlannerConfig fields (enable_power_awareness defaults False) with required-when-enabled validator for total_gpu_power_limit and power_agent_safe_default_watts - _apply_power_annotations() reconciliation loop on every tick, gated on the feature flag, with POWER_ANNOTATION_KEY constant - _publish_power_budget_metrics() dashboard gauges (static config, not DCGM, so it remains valid when attribution drops) - CoreV1Api + patch_pod_annotation + list_pods_by_label on KubernetesAPI - get_component_pods + list_frontend_pods + resolve_frontend_http_port on KubernetesConnector - get_total_dgd_power DCGM Prometheus query + NaN-filter fix on the frontend-source path - 3 planner_metrics gauges: power_budget_total_watts, power_projected_watts, power_budget_utilization - Operator Helm chart bump 1.2.0 -> 1.2.1 with planner pod-RBAC patch (pods: get/list/watch/patch) plus deploy/planner-pod-rbac-dev.yaml for non-operator development clusters - repo-shared components/src/conftest.py (planner pkg conftest) - 623-line unit suite (test_actuation_knobs.py) + 567-line live integration suite (test_actuation_knobs_live.py) covering every actuation knob this PR adds - Unit test additions on test_kubernetes_connector.py (~119 lines) and test_prometheus.py (NaN tests + TestPowerAwareDcgmQueries get_total_dgd_power subset, ~100 lines) - CI filter (.github/filters.yaml) extends the planner group with components/src/conftest.py Power awareness is off by default. Budget enforcement (_apply_power_budget) and the AIC closed loop arrive in PR 2 and PR 3. Part of the PR #9369 split (PR 1b of 6). Predecessor PR 1a (power-agent component) must merge first or land in parallel. See docs/design-docs/pr9369-split-plan.md. Signed-off-by: Kai Ma <kaim@nvidia.com>
Adds _apply_power_budget() to PlannerStateMachine, wired into both load-scaling and throughput-scaling paths (one-line call site each). Budget enforcement runs only when enable_power_awareness=True and total_gpu_power_limit is set, so it is a no-op on existing deployments. Includes a no-upscale-decode invariant in _apply_global_budget: when proportional_clamp_pair (from main's budget helper, see PR #9296) would allocate decode beyond original demand because prefill is the binding constraint, decode is clamped back to min(num_d, min_endpoint). This preserves the original behavioural contract from the pre-rebase inline formula while delegating to main's shared clamp. 68-line TestApplyGlobalBudgetNoUpscale suite (5 tests) pins the no-upscale invariant. Part of the PR #9369 split (PR 2 of 6). See docs/design-docs/pr9369-split-plan.md. Signed-off-by: Kai Ma <kaim@nvidia.com>
Adds the correction-coefficient EMA engine that observes live TTFT/ITL/power, detects drift against the planner's expected operating point, and re-invokes the AIC sweep to recompute (cap_p, cap_d, n_p, n_d) plus implied frontend admission thresholds. Components: - AICPowerOptimizer (monitoring/aic_power_optimizer.py): correction coefficients (c_ttft, c_itl, c_power_p, c_power_d, c_power_agg) with EMA smoothing, drift detection with hysteresis, optimize() sweep driver, defensive power_w clamp at nameplate TDP (guards against AIC's non-physical interpolator extrapolations at sparse data-grid corners), and the aic_to_planner_cap bridge. - core/base.py Phase-3 additions: _aic_optimizer field + TYPE_CHECKING import; AIC startup sweep in _async_init(); _apply_aic_config() (caps + replica targets + drift-ref pin + admission gauges + POST fanout); _fanout_busy_threshold() with B6 absolute-prefill defense; _min_prefill_max_num_batched_tokens() helper; AIC tick-loop block in run() with disagg vs agg branches and conditional re-sweep; traffic observation enrichment with AIC-driven fields. - 5 new TrafficObservation fields (core/types.py): ttft_avg, itl_avg, total_tokens_per_s, scheduled_prefill_tokens, scheduled_decode_kv_tokens. - 13 AIC config fields + 3 admission config fields on PlannerConfig with validator (enable_aic_optimizer requires enable_power_awareness AND aic_interpolation). - get_avg_per_gpu_power_by_component DCGM query on PrometheusAPIClient and supporting documentation on DirectRouterMetricsClient (callable surface unchanged). - 9 AIC gauges/counters + 5 admission metrics on PlannerPrometheusMetrics. Test suites (~2,000 lines new tests): - 967-line test_aic_power_optimizer.py: 34 unit-grade tests covering EMA mechanics, drift detection, sweep failure modes, and the power_w clamp. - 712-line test_aic_power_e2e_sim.py: 15 closed-loop integration tests with the fake-AIC + fake-Prometheus harness. - 558-line test_metric_paths_live.py: live-cluster validation of DCGM per-component queries. - 505 additional lines on test_prometheus.py: TestMetricsIsValid, TestDirectRouterMetricsClient*, TestGetAvgRequestDurationRouterSource, TestGetAvgKvHitRateEdgeCases, plus the get_avg_per_gpu_power_by_component half of TestPowerAwareDcgmQueries. - profiler/utils/aic_dataframe.py: 14-line defensive fix for AIC table lookups. The PR 1b hand-authored intermediate of the 6 shared files plus this PR's verbatim checkout reproduces kaim/power-planner byte-for-byte (validated via git diff). Part of the PR #9369 split (PR 3 of 6). See docs/design-docs/pr9369-split-plan.md. Signed-off-by: Kai Ma <kaim@nvidia.com>
Adds the synthetic-metrics testbed: deterministic, no GPU, no cluster. Alpha-class (27 scenarios, A1-F26): synthetic fleet + fake metrics + fake actuator drive the planner through fault-injection scenarios covering AIC drift, NVML clamps, K8s RBAC denials, node loss / recovery, Prometheus outages, MDC gaps, budget shrinkage, AIC infeasibility, and drift-threshold boundary cases. Gamma-class (3 scenarios, G1-G3): mocker-driven trace replay with synthetic-power overlay (replay/synthetic_power_overlay.py) and a power-aware replay adapter (replay/power_aware_replay_adapter.py) exercise the closed loop against real Mooncake traces. Infrastructure: - runner.py + scenarios.py + assertions.py + recorder.py + clock.py (run-loop, scenario loader, invariant checks, recording). - synthetic_fleet.py + fake_actuator.py + fake_aic.py + fake_planner_metrics.py + fake_prometheus.py (test doubles for every external dependency of the planner run loop). - _runtime_stub.py installs a stub dynamo._core when the compiled Rust binding is absent, so the testbed runs on developer laptops without a CUDA toolchain (carries every dynamo._core symbol used by dynamo.llm at module-load time, including the post-rebase RoutingConstraints addition from main PR #9558). - grafana/testbed_dashboard.json + systems/ (h100_pcie / h100_sxm / h200_sxm SKUs) + traces/placeholder_h200_disagg_1rps.jsonl provide a complete observability + replay stack. Production code: - offline/replay_adapter.py: 9-line timing fix (now_s = max(tick.at_s, bridge_now_s)) prevents stale-tick loops on sparse traces. 86 testbed tests + 30 scenarios (27 alpha + 3 gamma) ship green at this tip (1 skipped pending env-var; test_aic_real_data.py is module-skipped unless AIC_SANDBOX_DIR is set). Part of the PR #9369 split (PR 4 of 6). See docs/design-docs/pr9369-split-plan.md. Signed-off-by: Kai Ma <kaim@nvidia.com>
Adds documentation, dev environment, example DGD configs, operator scripts, and standalone analysis tools. Design documentation (~5,000 lines): - docs/design-docs/powerplanner-design.md (2,111 lines): full design with Phase 1-5 layered roadmap, 5.3 correction-coefficient mechanics, 5.6 drift detection + hysteresis, 5.7 admission control, 6.5 Power Agent fail-closed cold-start cap, section 8 failure-mode catalog (14 modes). - docs/design-docs/powerplanner-testbed-design.md (2,187 lines): testbed design with alpha and gamma scenario classes, fake-driver contracts, replay-overlay timing semantics. - docs/components/planner/dpp-dev-env.md (675 lines): dev-pod bring-up + smoke-test recipes against viking-prod and umb-b200 cluster hardware. Dev environment + example DGDs: - deploy/planner/dev/Dockerfile.planner-dev + planner-dev-pod.yaml + qwen3-quickstart-dgd.yaml: bring-your-own-source planner dev pod for live-cluster iteration without rebuilds. - examples/deployments/powerplanner/: README + PIPECLEAN + MULTI_DGD guides + disagg-power-aware.yaml + disagg-conservative-cold-start.yaml + verify_poweraware.bash (operator smoke-test). Inspection + analysis tooling: - scripts/dev/test_k8s_access.py and scripts/inspect_*.py: dev-time helpers for DCGM attribution, frontend port resolution, planner DCGM query methods, worker /metrics endpoints, live planner metrics. - tools/compute_power_budget.py + integrate_aic_power_data.py + validate_aic_power_integration.py: standalone analysis utilities for sizing the budget, ingesting AIC offline data, and validating the integration end-to-end. - test_planner_power_launch.py: root-level pipeclean launcher. CI filter (.github/filters.yaml) accumulates the remaining planner-side paths so changes here trigger the planner job set (cumulative with PR 1a and PR 1b additions; the final file matches kaim/power-planner byte-for-byte). No production code changes; this PR is purely additive documentation, deployment manifests, and operator tooling. All planner/power_agent tests at PR 4 tip remain green (no test additions in this PR). Part of the PR #9369 split (PR 5 of 6, final). See docs/design-docs/pr9369-split-plan.md. Signed-off-by: Kai Ma <kaim@nvidia.com>
Adds the Power Agent DaemonSet (privileged, hostPID) that watches pods
for the dynamo.nvidia.com/gpu-power-limit annotation, parses
/proc/{pid}/cgroup to recover pod UIDs (handles all K8s QoS x cgroup
driver x CRI runtime variants), and calls
nvmlDeviceSetPowerManagementLimit per physical GPU.
Includes NVML cap clamping to SKU-specific [min_w, max_w] bounds,
UUID-gated orphan-cap restoration on startup, SIGTERM-handler default-
TGP restore, and a multi-pod-per-GPU policy that falls back to
power_agent_safe_default_watts on conflict.
43 unit tests covering apply_cap, cgroup_parser, multi_pod_policy,
shutdown. Standalone component - no coupling to planner internals.
CI filter (.github/filters.yaml) extends the planner group to cover the
new component + deploy paths so changes here trip the planner job set.
Part of the PR #9369 split (PR 1a of 6). See docs/design-docs/pr9369-split-plan.md.
Signed-off-by: Kai Ma <kaim@nvidia.com>
…rometheus Adds the planner-side mechanism for power-aware operation: - 5 PlannerConfig fields (enable_power_awareness defaults False) with required-when-enabled validator for total_gpu_power_limit and power_agent_safe_default_watts - _apply_power_annotations() reconciliation loop on every tick, gated on the feature flag, with POWER_ANNOTATION_KEY constant - _publish_power_budget_metrics() dashboard gauges (static config, not DCGM, so it remains valid when attribution drops) - CoreV1Api + patch_pod_annotation + list_pods_by_label on KubernetesAPI - get_component_pods + list_frontend_pods + resolve_frontend_http_port on KubernetesConnector - get_total_dgd_power DCGM Prometheus query + NaN-filter fix on the frontend-source path - 3 planner_metrics gauges: power_budget_total_watts, power_projected_watts, power_budget_utilization - Operator Helm chart bump 1.2.0 -> 1.2.1 with planner pod-RBAC patch (pods: get/list/watch/patch) plus deploy/planner-pod-rbac-dev.yaml for non-operator development clusters - repo-shared components/src/conftest.py (planner pkg conftest) - 623-line unit suite (test_actuation_knobs.py) + 567-line live integration suite (test_actuation_knobs_live.py) covering every actuation knob this PR adds - Unit test additions on test_kubernetes_connector.py (~119 lines) and test_prometheus.py (NaN tests + TestPowerAwareDcgmQueries get_total_dgd_power subset, ~100 lines) - CI filter (.github/filters.yaml) extends the planner group with components/src/conftest.py Power awareness is off by default. Budget enforcement (_apply_power_budget) and the AIC closed loop arrive in PR 2 and PR 3. Part of the PR #9369 split (PR 1b of 6). Predecessor PR 1a (power-agent component) must merge first or land in parallel. See docs/design-docs/pr9369-split-plan.md. Signed-off-by: Kai Ma <kaim@nvidia.com>
Adds documentation, dev environment, example DGD configs, operator scripts, and standalone analysis tools. Design documentation (~5,000 lines): - docs/design-docs/powerplanner-design.md (2,111 lines): full design with Phase 1-5 layered roadmap, 5.3 correction-coefficient mechanics, 5.6 drift detection + hysteresis, 5.7 admission control, 6.5 Power Agent fail-closed cold-start cap, section 8 failure-mode catalog (14 modes). - docs/design-docs/powerplanner-testbed-design.md (2,187 lines): testbed design with alpha and gamma scenario classes, fake-driver contracts, replay-overlay timing semantics. - docs/components/planner/dpp-dev-env.md (675 lines): dev-pod bring-up + smoke-test recipes against viking-prod and umb-b200 cluster hardware. Dev environment + example DGDs: - deploy/planner/dev/Dockerfile.planner-dev + planner-dev-pod.yaml + qwen3-quickstart-dgd.yaml: bring-your-own-source planner dev pod for live-cluster iteration without rebuilds. - examples/deployments/powerplanner/: README + PIPECLEAN + MULTI_DGD guides + disagg-power-aware.yaml + disagg-conservative-cold-start.yaml + verify_poweraware.bash (operator smoke-test). Inspection + analysis tooling: - scripts/dev/test_k8s_access.py and scripts/inspect_*.py: dev-time helpers for DCGM attribution, frontend port resolution, planner DCGM query methods, worker /metrics endpoints, live planner metrics. - tools/compute_power_budget.py + integrate_aic_power_data.py + validate_aic_power_integration.py: standalone analysis utilities for sizing the budget, ingesting AIC offline data, and validating the integration end-to-end. - test_planner_power_launch.py: root-level pipeclean launcher. CI filter (.github/filters.yaml) accumulates the remaining planner-side paths so changes here trigger the planner job set (cumulative with PR 1a and PR 1b additions; the final file matches kaim/power-planner byte-for-byte). No production code changes; this PR is purely additive documentation, deployment manifests, and operator tooling. All planner/power_agent tests at PR 4 tip remain green (no test additions in this PR). Part of the PR #9369 split (PR 5 of 6, final). See docs/design-docs/pr9369-split-plan.md. Signed-off-by: Kai Ma <kaim@nvidia.com>
…rometheus Adds the planner-side mechanism for power-aware operation: - 5 PlannerConfig fields (enable_power_awareness defaults False) with required-when-enabled validator for total_gpu_power_limit and power_agent_safe_default_watts - _apply_power_annotations() reconciliation loop on every tick, gated on the feature flag, with POWER_ANNOTATION_KEY constant - _publish_power_budget_metrics() dashboard gauges (static config, not DCGM, so it remains valid when attribution drops) - CoreV1Api + patch_pod_annotation + list_pods_by_label on KubernetesAPI - get_component_pods + list_frontend_pods + resolve_frontend_http_port on KubernetesConnector - get_total_dgd_power DCGM Prometheus query + NaN-filter fix on the frontend-source path - 3 planner_metrics gauges: power_budget_total_watts, power_projected_watts, power_budget_utilization - Operator Helm chart bump 1.2.0 -> 1.2.1 with planner pod-RBAC patch (pods: get/list/watch/patch) plus deploy/planner-pod-rbac-dev.yaml for non-operator development clusters - repo-shared components/src/conftest.py (planner pkg conftest) - 623-line unit suite (test_actuation_knobs.py) + 567-line live integration suite (test_actuation_knobs_live.py) covering every actuation knob this PR adds - Unit test additions on test_kubernetes_connector.py (~119 lines) and test_prometheus.py (NaN tests + TestPowerAwareDcgmQueries get_total_dgd_power subset, ~100 lines) - CI filter (.github/filters.yaml) extends the planner group with components/src/conftest.py Power awareness is off by default. Budget enforcement (_apply_power_budget) and the AIC closed loop arrive in PR 2 and PR 3. Part of the PR #9369 split (PR 1b of 6). Predecessor PR 1a (power-agent component) must merge first or land in parallel. See docs/design-docs/pr9369-split-plan.md. Signed-off-by: Kai Ma <kaim@nvidia.com>
Adds _apply_power_budget() to PlannerStateMachine, wired into both load-scaling and throughput-scaling paths (one-line call site each). Budget enforcement runs only when enable_power_awareness=True and total_gpu_power_limit is set, so it is a no-op on existing deployments. Includes a no-upscale-decode invariant in _apply_global_budget: when proportional_clamp_pair (from main's budget helper, see PR #9296) would allocate decode beyond original demand because prefill is the binding constraint, decode is clamped back to min(num_d, min_endpoint). This preserves the original behavioural contract from the pre-rebase inline formula while delegating to main's shared clamp. 68-line TestApplyGlobalBudgetNoUpscale suite (5 tests) pins the no-upscale invariant. Part of the PR #9369 split (PR 2 of 6). See docs/design-docs/pr9369-split-plan.md. Signed-off-by: Kai Ma <kaim@nvidia.com>
Adds the correction-coefficient EMA engine that observes live TTFT/ITL/power, detects drift against the planner's expected operating point, and re-invokes the AIC sweep to recompute (cap_p, cap_d, n_p, n_d) plus implied frontend admission thresholds. Components: - AICPowerOptimizer (monitoring/aic_power_optimizer.py): correction coefficients (c_ttft, c_itl, c_power_p, c_power_d, c_power_agg) with EMA smoothing, drift detection with hysteresis, optimize() sweep driver, defensive power_w clamp at nameplate TDP (guards against AIC's non-physical interpolator extrapolations at sparse data-grid corners), and the aic_to_planner_cap bridge. - core/base.py Phase-3 additions: _aic_optimizer field + TYPE_CHECKING import; AIC startup sweep in _async_init(); _apply_aic_config() (caps + replica targets + drift-ref pin + admission gauges + POST fanout); _fanout_busy_threshold() with B6 absolute-prefill defense; _min_prefill_max_num_batched_tokens() helper; AIC tick-loop block in run() with disagg vs agg branches and conditional re-sweep; traffic observation enrichment with AIC-driven fields. - 5 new TrafficObservation fields (core/types.py): ttft_avg, itl_avg, total_tokens_per_s, scheduled_prefill_tokens, scheduled_decode_kv_tokens. - 13 AIC config fields + 3 admission config fields on PlannerConfig with validator (enable_aic_optimizer requires enable_power_awareness AND aic_interpolation). - get_avg_per_gpu_power_by_component DCGM query on PrometheusAPIClient and supporting documentation on DirectRouterMetricsClient (callable surface unchanged). - 9 AIC gauges/counters + 5 admission metrics on PlannerPrometheusMetrics. Test suites (~2,000 lines new tests): - 967-line test_aic_power_optimizer.py: 34 unit-grade tests covering EMA mechanics, drift detection, sweep failure modes, and the power_w clamp. - 712-line test_aic_power_e2e_sim.py: 15 closed-loop integration tests with the fake-AIC + fake-Prometheus harness. - 558-line test_metric_paths_live.py: live-cluster validation of DCGM per-component queries. - 505 additional lines on test_prometheus.py: TestMetricsIsValid, TestDirectRouterMetricsClient*, TestGetAvgRequestDurationRouterSource, TestGetAvgKvHitRateEdgeCases, plus the get_avg_per_gpu_power_by_component half of TestPowerAwareDcgmQueries. - profiler/utils/aic_dataframe.py: 14-line defensive fix for AIC table lookups. The PR 1b hand-authored intermediate of the 6 shared files plus this PR's verbatim checkout reproduces kaim/power-planner byte-for-byte (validated via git diff). Part of the PR #9369 split (PR 3 of 6). See docs/design-docs/pr9369-split-plan.md. Signed-off-by: Kai Ma <kaim@nvidia.com>
Adds the synthetic-metrics testbed: deterministic, no GPU, no cluster. Alpha-class (27 scenarios, A1-F26): synthetic fleet + fake metrics + fake actuator drive the planner through fault-injection scenarios covering AIC drift, NVML clamps, K8s RBAC denials, node loss / recovery, Prometheus outages, MDC gaps, budget shrinkage, AIC infeasibility, and drift-threshold boundary cases. Gamma-class (3 scenarios, G1-G3): mocker-driven trace replay with synthetic-power overlay (replay/synthetic_power_overlay.py) and a power-aware replay adapter (replay/power_aware_replay_adapter.py) exercise the closed loop against real Mooncake traces. Infrastructure: - runner.py + scenarios.py + assertions.py + recorder.py + clock.py (run-loop, scenario loader, invariant checks, recording). - synthetic_fleet.py + fake_actuator.py + fake_aic.py + fake_planner_metrics.py + fake_prometheus.py (test doubles for every external dependency of the planner run loop). - _runtime_stub.py installs a stub dynamo._core when the compiled Rust binding is absent, so the testbed runs on developer laptops without a CUDA toolchain (carries every dynamo._core symbol used by dynamo.llm at module-load time, including the post-rebase RoutingConstraints addition from main PR #9558). - grafana/testbed_dashboard.json + systems/ (h100_pcie / h100_sxm / h200_sxm SKUs) + traces/placeholder_h200_disagg_1rps.jsonl provide a complete observability + replay stack. Production code: - offline/replay_adapter.py: 9-line timing fix (now_s = max(tick.at_s, bridge_now_s)) prevents stale-tick loops on sparse traces. 86 testbed tests + 30 scenarios (27 alpha + 3 gamma) ship green at this tip (1 skipped pending env-var; test_aic_real_data.py is module-skipped unless AIC_SANDBOX_DIR is set). Part of the PR #9369 split (PR 4 of 6). See docs/design-docs/pr9369-split-plan.md. Signed-off-by: Kai Ma <kaim@nvidia.com>
Adds documentation, dev environment, example DGD configs, operator scripts, and standalone analysis tools. Design documentation (~5,000 lines): - docs/design-docs/powerplanner-design.md (2,111 lines): full design with Phase 1-5 layered roadmap, 5.3 correction-coefficient mechanics, 5.6 drift detection + hysteresis, 5.7 admission control, 6.5 Power Agent fail-closed cold-start cap, section 8 failure-mode catalog (14 modes). - docs/design-docs/powerplanner-testbed-design.md (2,187 lines): testbed design with alpha and gamma scenario classes, fake-driver contracts, replay-overlay timing semantics. - docs/components/planner/dpp-dev-env.md (675 lines): dev-pod bring-up + smoke-test recipes against viking-prod and umb-b200 cluster hardware. Dev environment + example DGDs: - deploy/planner/dev/Dockerfile.planner-dev + planner-dev-pod.yaml + qwen3-quickstart-dgd.yaml: bring-your-own-source planner dev pod for live-cluster iteration without rebuilds. - examples/deployments/powerplanner/: README + PIPECLEAN + MULTI_DGD guides + disagg-power-aware.yaml + disagg-conservative-cold-start.yaml + verify_poweraware.bash (operator smoke-test). Inspection + analysis tooling: - scripts/dev/test_k8s_access.py and scripts/inspect_*.py: dev-time helpers for DCGM attribution, frontend port resolution, planner DCGM query methods, worker /metrics endpoints, live planner metrics. - tools/compute_power_budget.py + integrate_aic_power_data.py + validate_aic_power_integration.py: standalone analysis utilities for sizing the budget, ingesting AIC offline data, and validating the integration end-to-end. - test_planner_power_launch.py: root-level pipeclean launcher. CI filter (.github/filters.yaml) accumulates the remaining planner-side paths so changes here trigger the planner job set (cumulative with PR 1a and PR 1b additions; the final file matches kaim/power-planner byte-for-byte). No production code changes; this PR is purely additive documentation, deployment manifests, and operator tooling. All planner/power_agent tests at PR 4 tip remain green (no test additions in this PR). Part of the PR #9369 split (PR 5 of 6, final). See docs/design-docs/pr9369-split-plan.md. Signed-off-by: Kai Ma <kaim@nvidia.com>
Updates the three reference sites in examples/deployments/powerplanner/ that previously instructed users to ``kubectl apply -f deploy/power_agent/...``, flipping them to the new Helm chart at deploy/helm/charts/power-agent/ that landed in PR #9682: * disagg-power-aware.yaml header recipe * README.md Prerequisites + verify section * MULTI_DGD.md file index Also updates the verify-pods label selector from the legacy ``app=dynamo-power-agent`` to the chart-emitted ``app.kubernetes.io/name=power-agent``. Design docs (powerplanner-design.md, power-agent-dcgm-actuator.md, pr9369-split-plan.md, power-agent-helm-chart-plan.md) still reference the old paths in their historical / architectural narrative sections, which is intentional -- those describe the pre-chart state and the transition rationale, not current deployment instructions. Part of the PR #9369 cascade following PR #9682''s Helm chart landing. Signed-off-by: Kai Ma <kaim@nvidia.com>
The two ``[Appendix C.10]``/``[Appendix D.7]`` links in the testbed README pointed at ``docs/design-docs/powerplanner-testbed-design.md``, which is introduced by PR #9687 and does not yet exist on this branch. The Docs link check (lychee) has therefore failed on PR #9686 since the PR was opened on 2026-05-18 -- a cross-PR forward reference baked into the original PR #9369 split. Convert both occurrences from ``[text](relative-path)`` syntax to plain backticked text references. The information value (appendix numbers + target file) is preserved; lychee no longer treats them as candidate links to resolve. Once both PRs land on ``main`` the file resolves naturally and reviewers can grep for the path. No code or test change. Cascade-affected branches: this commit lives on pr4/testbed; pr5/docs-devenv will rebase onto the new tip. Signed-off-by: Kai Ma <kaim@nvidia.com>
Adds documentation, dev environment, example DGD configs, operator scripts, and standalone analysis tools. Design documentation (~5,000 lines): - docs/design-docs/powerplanner-design.md (2,111 lines): full design with Phase 1-5 layered roadmap, 5.3 correction-coefficient mechanics, 5.6 drift detection + hysteresis, 5.7 admission control, 6.5 Power Agent fail-closed cold-start cap, section 8 failure-mode catalog (14 modes). - docs/design-docs/powerplanner-testbed-design.md (2,187 lines): testbed design with alpha and gamma scenario classes, fake-driver contracts, replay-overlay timing semantics. - docs/components/planner/dpp-dev-env.md (675 lines): dev-pod bring-up + smoke-test recipes against viking-prod and umb-b200 cluster hardware. Dev environment + example DGDs: - deploy/planner/dev/Dockerfile.planner-dev + planner-dev-pod.yaml + qwen3-quickstart-dgd.yaml: bring-your-own-source planner dev pod for live-cluster iteration without rebuilds. - examples/deployments/powerplanner/: README + PIPECLEAN + MULTI_DGD guides + disagg-power-aware.yaml + disagg-conservative-cold-start.yaml + verify_poweraware.bash (operator smoke-test). Inspection + analysis tooling: - scripts/dev/test_k8s_access.py and scripts/inspect_*.py: dev-time helpers for DCGM attribution, frontend port resolution, planner DCGM query methods, worker /metrics endpoints, live planner metrics. - tools/compute_power_budget.py + integrate_aic_power_data.py + validate_aic_power_integration.py: standalone analysis utilities for sizing the budget, ingesting AIC offline data, and validating the integration end-to-end. - test_planner_power_launch.py: root-level pipeclean launcher. CI filter (.github/filters.yaml) accumulates the remaining planner-side paths so changes here trigger the planner job set (cumulative with PR 1a and PR 1b additions; the final file matches kaim/power-planner byte-for-byte). No production code changes; this PR is purely additive documentation, deployment manifests, and operator tooling. All planner/power_agent tests at PR 4 tip remain green (no test additions in this PR). Part of the PR #9369 split (PR 5 of 6, final). See docs/design-docs/pr9369-split-plan.md. Signed-off-by: Kai Ma <kaim@nvidia.com>
Updates the three reference sites in examples/deployments/powerplanner/ that previously instructed users to ``kubectl apply -f deploy/power_agent/...``, flipping them to the new Helm chart at deploy/helm/charts/power-agent/ that landed in PR #9682: * disagg-power-aware.yaml header recipe * README.md Prerequisites + verify section * MULTI_DGD.md file index Also updates the verify-pods label selector from the legacy ``app=dynamo-power-agent`` to the chart-emitted ``app.kubernetes.io/name=power-agent``. Design docs (powerplanner-design.md, power-agent-dcgm-actuator.md, pr9369-split-plan.md, power-agent-helm-chart-plan.md) still reference the old paths in their historical / architectural narrative sections, which is intentional -- those describe the pre-chart state and the transition rationale, not current deployment instructions. Part of the PR #9369 cascade following PR #9682''s Helm chart landing. Signed-off-by: Kai Ma <kaim@nvidia.com>
Adds the Power Agent DaemonSet (privileged, hostPID) that watches pods
for the dynamo.nvidia.com/gpu-power-limit annotation, parses
/proc/{pid}/cgroup to recover pod UIDs (handles all K8s QoS x cgroup
driver x CRI runtime variants), and calls
nvmlDeviceSetPowerManagementLimit per physical GPU.
Includes NVML cap clamping to SKU-specific [min_w, max_w] bounds,
UUID-gated orphan-cap restoration on startup, SIGTERM-handler default-
TGP restore, and a multi-pod-per-GPU policy that falls back to
power_agent_safe_default_watts on conflict.
43 unit tests covering apply_cap, cgroup_parser, multi_pod_policy,
shutdown. Standalone component - no coupling to planner internals.
CI filter (.github/filters.yaml) extends the planner group to cover the
new component + deploy paths so changes here trip the planner job set.
Part of the PR #9369 split (PR 1a of 6). See docs/design-docs/pr9369-split-plan.md.
Signed-off-by: Kai Ma <kaim@nvidia.com>
…rometheus Adds the planner-side mechanism for power-aware operation: - 5 PlannerConfig fields (enable_power_awareness defaults False) with required-when-enabled validator for total_gpu_power_limit and power_agent_safe_default_watts - _apply_power_annotations() reconciliation loop on every tick, gated on the feature flag, with POWER_ANNOTATION_KEY constant - _publish_power_budget_metrics() dashboard gauges (static config, not DCGM, so it remains valid when attribution drops) - CoreV1Api + patch_pod_annotation + list_pods_by_label on KubernetesAPI - get_component_pods + list_frontend_pods + resolve_frontend_http_port on KubernetesConnector - get_total_dgd_power DCGM Prometheus query + NaN-filter fix on the frontend-source path - 3 planner_metrics gauges: power_budget_total_watts, power_projected_watts, power_budget_utilization - Operator Helm chart bump 1.2.0 -> 1.2.1 with planner pod-RBAC patch (pods: get/list/watch/patch) plus deploy/planner-pod-rbac-dev.yaml for non-operator development clusters - repo-shared components/src/conftest.py (planner pkg conftest) - 623-line unit suite (test_actuation_knobs.py) + 567-line live integration suite (test_actuation_knobs_live.py) covering every actuation knob this PR adds - Unit test additions on test_kubernetes_connector.py (~119 lines) and test_prometheus.py (NaN tests + TestPowerAwareDcgmQueries get_total_dgd_power subset, ~100 lines) - CI filter (.github/filters.yaml) extends the planner group with components/src/conftest.py Power awareness is off by default. Budget enforcement (_apply_power_budget) and the AIC closed loop arrive in PR 2 and PR 3. Part of the PR #9369 split (PR 1b of 6). Predecessor PR 1a (power-agent component) must merge first or land in parallel. See docs/design-docs/pr9369-split-plan.md. Signed-off-by: Kai Ma <kaim@nvidia.com>
Adds the Power Agent DaemonSet (privileged, hostPID) that watches pods
for the dynamo.nvidia.com/gpu-power-limit annotation, parses
/proc/{pid}/cgroup to recover pod UIDs (handles all K8s QoS x cgroup
driver x CRI runtime variants), and calls
nvmlDeviceSetPowerManagementLimit per physical GPU.
Includes NVML cap clamping to SKU-specific [min_w, max_w] bounds,
UUID-gated orphan-cap restoration on startup, SIGTERM-handler default-
TGP restore, and a multi-pod-per-GPU policy that falls back to
power_agent_safe_default_watts on conflict.
43 unit tests covering apply_cap, cgroup_parser, multi_pod_policy,
shutdown. Standalone component - no coupling to planner internals.
CI filter (.github/filters.yaml) extends the planner group to cover the
new component + deploy paths so changes here trip the planner job set.
Part of the PR #9369 split (PR 1a of 6). See docs/design-docs/pr9369-split-plan.md.
Signed-off-by: Kai Ma <kaim@nvidia.com>
…rometheus Adds the planner-side mechanism for power-aware operation: - 5 PlannerConfig fields (enable_power_awareness defaults False) with required-when-enabled validator for total_gpu_power_limit and power_agent_safe_default_watts - _apply_power_annotations() reconciliation loop on every tick, gated on the feature flag, with POWER_ANNOTATION_KEY constant - _publish_power_budget_metrics() dashboard gauges (static config, not DCGM, so it remains valid when attribution drops) - CoreV1Api + patch_pod_annotation + list_pods_by_label on KubernetesAPI - get_component_pods + list_frontend_pods + resolve_frontend_http_port on KubernetesConnector - get_total_dgd_power DCGM Prometheus query + NaN-filter fix on the frontend-source path - 3 planner_metrics gauges: power_budget_total_watts, power_projected_watts, power_budget_utilization - Operator Helm chart bump 1.2.0 -> 1.2.1 with planner pod-RBAC patch (pods: get/list/watch/patch) plus deploy/planner-pod-rbac-dev.yaml for non-operator development clusters - repo-shared components/src/conftest.py (planner pkg conftest) - 623-line unit suite (test_actuation_knobs.py) + 567-line live integration suite (test_actuation_knobs_live.py) covering every actuation knob this PR adds - Unit test additions on test_kubernetes_connector.py (~119 lines) and test_prometheus.py (NaN tests + TestPowerAwareDcgmQueries get_total_dgd_power subset, ~100 lines) - CI filter (.github/filters.yaml) extends the planner group with components/src/conftest.py Power awareness is off by default. Budget enforcement (_apply_power_budget) and the AIC closed loop arrive in PR 2 and PR 3. Part of the PR #9369 split (PR 1b of 6). Predecessor PR 1a (power-agent component) must merge first or land in parallel. See docs/design-docs/pr9369-split-plan.md. Signed-off-by: Kai Ma <kaim@nvidia.com>
…rometheus Adds the planner-side mechanism for power-aware operation: - 5 PlannerConfig fields (enable_power_awareness defaults False) with required-when-enabled validator for total_gpu_power_limit and power_agent_safe_default_watts - _apply_power_annotations() reconciliation loop on every tick, gated on the feature flag, with POWER_ANNOTATION_KEY constant - _publish_power_budget_metrics() dashboard gauges (static config, not DCGM, so it remains valid when attribution drops) - CoreV1Api + patch_pod_annotation + list_pods_by_label on KubernetesAPI - get_component_pods + list_frontend_pods + resolve_frontend_http_port on KubernetesConnector - get_total_dgd_power DCGM Prometheus query + NaN-filter fix on the frontend-source path - 3 planner_metrics gauges: power_budget_total_watts, power_projected_watts, power_budget_utilization - Operator Helm chart bump 1.2.0 -> 1.2.1 with planner pod-RBAC patch (pods: get/list/watch/patch) plus deploy/planner-pod-rbac-dev.yaml for non-operator development clusters - repo-shared components/src/conftest.py (planner pkg conftest) - 623-line unit suite (test_actuation_knobs.py) + 567-line live integration suite (test_actuation_knobs_live.py) covering every actuation knob this PR adds - Unit test additions on test_kubernetes_connector.py (~119 lines) and test_prometheus.py (NaN tests + TestPowerAwareDcgmQueries get_total_dgd_power subset, ~100 lines) - CI filter (.github/filters.yaml) extends the planner group with components/src/conftest.py Power awareness is off by default. Budget enforcement (_apply_power_budget) and the AIC closed loop arrive in PR 2 and PR 3. Part of the PR #9369 split (PR 1b of 6). Predecessor PR 1a (power-agent component) must merge first or land in parallel. See docs/design-docs/pr9369-split-plan.md. Signed-off-by: Kai Ma <kaim@nvidia.com>
Adds power-aware intelligence to the Dynamo planner across three layers, plus tooling that pipecleans the Phase 4 selection path with currently available single-TDP AIC data. Reimplementation of unmerged PR #5280 against post-refactor ToT (planner/utils/ no longer exists; configuration is Pydantic; layout is config/, connectors/, core/, monitoring/).
Full design: docs/design-docs/powerplanner-design.md
Repro / dev environment: docs/components/planner/dpp-dev-env.md
What's in this PR
Phase 1 — infrastructure
podspermissions (deploy/helm/charts/platform/components/operator/templates/planner.yaml, chart bumped to 1.2.1). Older clusters can apply deploy/planner-pod-rbac-dev.yaml as a temporary patch.Phase 2 — budget enforcement
Phase 3 — AIC optimizer
Phase 4 — pipeclean (using existing single-TDP data)
Bug fixes (found by live integration tests, fixed in this patchset) -------------------------------------------------------------------
podsrules → planner couldn't list workers or patch annotations. Fixed in chart template; workaround manifest provided for clusters not yet upgraded.dynamo-graph-deployment→dynamo-graph-deployment-name,dynamo-service→dynamo-component/dynamo-component-type.pod=~(matched the DCGM exporter pod) and the wrong namespace; corrected toexported_pod=~+exported_namespace=<k8s-namespace>and pod regex now matches the operator's<dgd>-<replica-idx>-<service-key>-...format (monitoring/traffic_metrics.py).increase(_sum)/increase(_count)); router path had a NaN guard, frontend didn't — added matchingmath.isnanfilter.Tests
Verified 2026-05-10 on a clean from-scratch repro on a live Azure AKS dev cluster (qwen3-quickstart DGD, single dev namespace):
Suite Pass Skip Fail
planner/tests/unit/ 465 0 0
planner/tests/integration/test_aic_power_e2e_sim 15 0 0
planner/tests/integration/test_aic_power_optimizer 34 0 0
planner/tests/integration/test_metric_paths_live 22-23 3-2 0
planner/tests/integration/test_actuation_knobs_live 10-11 1-0 0
power_agent/tests/ 43 0 0
Total (cold deploy / after sanity chat completion) 590-591 4-3 0
Documented skips:
Repro recipe: docs/components/planner/dpp-dev-env.md
"From-Scratch Repro Script" section.
Compatibility
Power awareness is opt-in (enable_power_awareness=False by default). AIC optimizer is opt-in (enable_aic_optimizer=False by default). No existing planner behavior changes when both flags are off.
Phase 4 (multi-power-level AIC sweep) and Phase 5 (silicon validation) remain follow-up work; this PR delivers Phases 1–3 plus the Phase 4 pipeclean code path so it can be exercised before AIC ships the full sweep API.
Overview:
Details:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
Release Notes
New Features
Documentation
Deployment