Skip to content

feat(planner): power infrastructure — pod annotation, RBAC, config, Prometheus#9683

Open
kaim-eng wants to merge 18 commits into
pr1a/power-agentfrom
pr1b/planner-infra
Open

feat(planner): power infrastructure — pod annotation, RBAC, config, Prometheus#9683
kaim-eng wants to merge 18 commits into
pr1a/power-agentfrom
pr1b/planner-infra

Conversation

@kaim-eng

@kaim-eng kaim-eng commented May 18, 2026

Copy link
Copy Markdown

Part of the PR #9369 split plan. This is PR 2 of 6 (PR 1b - Planner Power Infrastructure). Predecessor: #9682 - Power Agent DaemonSet (base = pr1a/power-agent; #9682 must land first) Successor: #9684 (Draft) - Power budget enforcement ## Summary Planner-side wires from config to pod annotation. Nothing in this PR changes the scaling policy to consume power yet; Phase 1 emits and observes the power infrastructure, while the AIC closed-loop optimizer and admission coupling remain deferred. Current branch shape: 32 files changed against pr1a/power-agent. The 6 shared files with later planner PRs are hand-authored here at the Phase-1 line subset from the split plan. - 6 power-aware config fields: enable_power_awareness (default False), total_gpu_power_limit, prefill_engine_gpu_power_limit, decode_engine_gpu_power_limit, power_agent_safe_default_watts, and power_annotation_interval_seconds (default 60s sweep throttle). The per-engine power-limit fields are ge=1 validated. - _apply_power_annotations() and _publish_power_budget_metrics() in core/base.py: emit dynamo.nvidia.com/gpu-power-limit per worker pod and expose Phase-1 power-budget gauges. Advisory mode is a hard no-op for pod mutation, and steady-state annotation sweeps are throttled with a scale-up force window. - 7e74aa65ea fixes the sweep force-window logic for the orchestrator path: _scaling_up() now reads cached WorkerCounts before PSM internals and never constructs a PSM just to classify scale-up. - KubernetesConnector / KubernetesAPI: patch_pod_annotation, list_pods_by_label, get_component_pods, list_frontend_pods, and resolve_frontend_http_port. - Prometheus plumbing: DCGM get_total_dgd_power() plus the NaN-filter fix in _get_metric_increase. - Operator-chart planner pod RBAC: pods [get, list, watch, patch] for the planner ServiceAccount in both namespace and cluster-wide modes. The platform chart stays at the main baseline 1.3.0; the obsolete standalone deploy/planner-pod-rbac-dev.yaml patch was removed. - examples/multi-dgd-live-test/: multi-DGD live-repro scaffolding and test_multi_dgd_live.py; personal identifiers were scrubbed. Reviewer context: - Design context: docs/design-docs/powerplanner-design.md sections 5.5, 6.4, and 7 (lands later in the stack; readable from pr5/docs-devenv). - Dev environment: docs/components/planner/dpp-dev-env.md section 8 (also in PR 5). - Split-plan context: sections 2.2, 2.2.2, 3.2, and 5. ## Validation - Required/current GitHub checks pass at 7e74aa65ea; rust jobs and team-request are intentionally skipped by filters. - Dev-pod validation from 2026-05-18, before the final 7e74aa65ea sweep-throttle fix: - Full planner sweep, excluding test_virtual_connector.py: 483 passed, 2 skipped in 12.46s. - Same sweep with RUN_DISRUPTIVE_TESTS=1: 484 passed, 1 skipped. - Phase-1 critical files (test_actuation_knobs.py, test_kubernetes_connector.py, test_prometheus.py): 121 / 121 passed. - test_actuation_knobs_live.py: 9 passed, 2 skipped; with disruptive opt-in: 10 passed, 1 skipped. - Power Agent regression at PR 1b tip at that time: 43 / 43 passed. - 7e74aa65ea adds 3 focused unit tests in test_actuation_knobs.py for cached worker-count precedence, orchestrator/no-PSM behavior, and non-scale-up cached counts. These assertions are lint/import clean and required CI is green, but the dev-pod planner pytest count has not been refreshed after this final commit. - Pre-commit on origin/main..pr1b/planner-infra: 14 hooks pass, 4 skipped; marker compliance was reconfirmed in a Linux pod with Missing sets: 0. The Windows host has the known pre-existing POSIX-only fcntl import issue in untouched test utilities; Linux CI is unaffected. - DCO: commits are signed off. Expected skips from the May 18 dev-pod run: - TestPostBusyThresholdLive::test_post_busy_threshold_returns_2xx: frontend image lacks /busy_threshold; admission control lands later. - TestScalingRealMutation::test_set_component_replicas_mutates_dgd_spec: opt-in via RUN_DISRUPTIVE_TESTS=1; passes when enabled. ## Merge Strategy Rebase-and-merge, no squash, per split-plan section 4.3. ## Stack Order Merge #9682 first. After #9682 lands, #9683 can be rebased/retargeted onto the updated base and merged. #9790 is a sibling stacked on #9682 and can merge after #9682 as well.

@kaim-eng kaim-eng requested review from a team as code owners May 18, 2026 15:54
@copy-pr-bot

copy-pr-bot Bot commented May 18, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added feat documentation Improvements or additions to documentation deployment::k8s Relates to dynamo deployment in kubernetes planner actions labels May 18, 2026
@kaim-eng kaim-eng force-pushed the pr1a/power-agent branch from 3cde449 to 1338028 Compare May 18, 2026 16:03
@kaim-eng kaim-eng force-pushed the pr1b/planner-infra branch from 2222dbf to 1d7b243 Compare May 18, 2026 16:03
Comment thread components/src/dynamo/planner/core/base.py Outdated
@kaim-eng kaim-eng force-pushed the pr1a/power-agent branch from 1338028 to ec21081 Compare May 19, 2026 12:54
@kaim-eng kaim-eng force-pushed the pr1b/planner-infra branch 2 times, most recently from 8126b1a to f7f3226 Compare May 19, 2026 15:54
kaim-eng added a commit that referenced this pull request May 22, 2026
Cross-DGD complement to test_actuation_knobs_live.py (the single-DGD live
test that already ships with this PR). Validates invariants that
single-DGD tests structurally cannot reach:

  - A2  Three planners patch DISJOINT pod sets in one namespace
  - A3  AggPlanner-C uses only the DECODE branch (no PREFILL calls)
  - B2  Power Agent applies 8 distinct caps on one tick
  - B4  Happy-path zero-counter invariant
  - B5  Multi-pod conflict + safe-default behaviour (skipped pending
        Finding #7 — CDI-safe bystander rewrite tracked separately)
  - 8.5.6 / 8.6.2  NVML / DCGM cap-application proven by reading the
                   live driver state (not via mocks)

Layout introduced by this commit:

  components/src/dynamo/planner/tests/integration/test_multi_dgd_live.py
      ~1000 LoC pytest module. Opt-in via RUN_MULTI_DGD_LIVE=1 +
      TEST_NODE=<8-GPU node> env vars; auto-detects Path A (production
      shape: 3 real DGDs + power-aware planner) vs Path B (5 stub-worker
      pods with hardcoded annotations) and skips Path-A-only tests
      cleanly when running Path B. DCGM tests gated by
      DCGM_HOSTENGINE_AVAILABLE=1.

  examples/multi-dgd-live-test/
      Manifests + helm values overrides + teardown script + README
      walk-through that an operator applies before running the test:
        00-namespace.yaml                 — namespace + pull secret
        10-power-agent-values-nvml.yaml   — Power Agent NVML helm values
        11-power-agent-values-dcgm.yaml   — Power Agent DCGM helm values
        20-nvidia-dcgm-standalone-ds.yaml — DCGM 4.5.1 hostengine DS
        30/31/32-vllm-dgd-{A,B,C}.yaml    — three DGDs (Path A)
        40-conflict-bystander-pod.yaml    — multi-pod conflict fixture
        50-stub-workers.yaml              — Path-B stub workers
        99-teardown.sh                    — idempotent cleanup w/ cap-
                                            restore verification
        README.md                         — full reproduction playbook
                                            including a 2026-05-21 live
                                            run record + 11 findings

  docs/design-docs/multi-tenant-test.md
      Design doc (§8 — multi-DGD test plan) that establishes the
      ground-truth topology, the planner / agent contract under test,
      and the rationale for the artifact split above.

  pyproject.toml
      Registers two new test markers consumed by the live test's
      `pytestmark`:
        multi_dgd_live  — scenario selector for this specific file
        power_agent     — shared with components/power_agent/tests/

Module-level gating fails fast on misconfiguration (missing TEST_NODE,
missing kubeconfig) with actionable error messages, and skips silently
when RUN_MULTI_DGD_LIVE is unset — so this file is a no-op under the
default unit-test invocation. Verified: pytest --collect-only is a
clean no-op without the env vars, and full-import works under
RUN_MULTI_DGD_LIVE=1 (collection fails cleanly on absent k8s creds, not
ImportError).

Refs: docs/design-docs/multi-tenant-test.md §8, PR #9683 review.
Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 22, 2026
…nd DCGM 4.5

Two stacked defects surfaced on the 2026-05-21T20:11Z Path-B re-run
caused `test_dcgm_per_gpu_cap_matches_topology` and
`test_dcgm_target_config_registered` to skip silently when they should
have PASSED:

1. **`kubernetes-client>=35.0.0` mangles JSON pod-exec stdout.**
   `kubernetes.stream.stream(..., _preload_content=True)` sniffs
   JSON-shaped stdout, parses it client-side, and returns `str(dict)`.
   That means `printf '%s' '{"a":1}'` comes back as the literal 8-char
   string `{'a': 1}` (single quotes, Python-dict repr), which then
   trips `json.loads` at every read-back site — including the four
   inside `test_dcgm_*`. The intended-as-defensive
   "did not return JSON" skip branch fired instead.

2. **DCGM 4.5+ rotated the `dcgmi config --get -g <id> -j` JSON shape.**
   Pre-4.5: `body` is a list of per-GPU entries keyed by `"Power Limit (W)"`
   with flat `{"Current": "...", "Target": "..."}` children.
   4.5+:    `body` is a dict of fields keyed by `"Power Limit"`
            (no `(W)` suffix) with a `"children"` wrapper whose leaves
            are `{"value": "..."}` objects.
   Three breaking changes in one upgrade — `body` flipped list→dict,
   the field name lost its suffix, the leaves gained a `value` wrapper.

The previous implementation handled neither, so even if defect (1)
were absent, the strict per-field reader would still skip.

This commit lands a single combined fix:

- **`_exec_in_pod`**: switch from `_preload_content=True` (default) to
  `_preload_content=False` + raw stdout/stderr accumulation off the
  `WSClient` via `peek_stdout`/`read_stdout`. Bypasses the mangling
  regardless of which `kubernetes-client` version is installed, and is
  the documented "I want bytes back" pattern in the kubernetes-client
  examples.

- **New `_extract_power_limit_from_dcgmi_body(body)` helper**: walks
  both shapes uniformly and returns `(current_w, target_w)` ints or
  `(None, None)` if neither shape matches. Both DCGM read-back tests
  (`test_dcgm_per_gpu_cap_matches_topology` and
  `test_dcgm_target_config_registered`) now consume this helper
  instead of inlining duplicated nested-`dict.get` chains.

Verified locally:
  - synthetic 4.0-shape payload `[{"Power Limit (W)": {"Current": "350",
    "Target": "350"}}]`  → `(350, 350)` ✓
  - synthetic 4.5-shape payload `{"Power Limit": {"children":
    {"Current": {"value": "375"}, "Target": {"value": "375"}}}}`
      → `(375, 375)` ✓
  - missing field, wrong types, partial children: graceful `(None, None)`
    / `(int, None)` returns ✓
  - test module still collects cleanly (`pytest --collect-only`)

`examples/multi-dgd-live-test/README.md` Finding #11 updated to
"Status: fixed" with a cross-reference to this commit and the expected
SKIPPED→PASSED flip on the next live re-run.

Refs: PR #9683 review (Finding #11 in
examples/multi-dgd-live-test/README.md).

Signed-off-by: Kai Ma <kaim@nvidia.com>
@kaim-eng kaim-eng requested a review from a team as a code owner May 22, 2026 14:23
kaim-eng added a commit that referenced this pull request May 22, 2026
`TestMultiPodConflict` previously created a bystander pod with
`NVIDIA_VISIBLE_DEVICES=<gpu-uuid>` to pin it to a specific GPU. AKS
clusters running the NVIDIA Container Toolkit in CDI device mode (the
dpp-dev-env default, and the direction NVIDIA is pushing for newer
operator releases) reject that with
`unresolvable CDI devices management.nvidia.com/gpu=*` and refuse to
admit the pod — so the test class was carrying a
`@pytest.mark.skip(_CDI_CONFLICT_SKIP_REASON)` decorator that disabled
the whole §8.5.7 contract on every run.

This commit lands the CDI-mode-safe rewrite:

  - Bystander now claims `nvidia.com/gpu: 1` from the device plugin
    (both `requests` and `limits` per K8s extended-resource rules).
    CDI-mode-agnostic — the device plugin owns the actual GPU wiring
    via its own machinery.

  - `examples/multi-dgd-live-test/40-conflict-bystander-pod.yaml`
    mirrors the inline Python manifest exactly so YAML-debugging and
    test-debugging agree (the README explicitly calls this out).

  - New helper `_per_gpu_compute_pids()` snapshots
    `nvmlDeviceGetComputeRunningProcesses` across all 8 GPUs from the
    agent pod (which has hostPID + NVML access) and returns
    `{gpu_idx: {pid, ...}}`. The test polls this until it sees a new
    PID on some index after the bystander reaches Running — that
    index IS the GPU the device plugin assigned, no NVIDIA_VISIBLE_DEVICES
    pin needed.

  - Test renamed `test_conflict_triggers_safe_default_on_gpu0_only` →
    `test_conflict_triggers_safe_default_on_assigned_gpu` to reflect
    the index-agnostic contract. Blast-radius / recovery asserts key
    off the discovered `assigned_gpu`; the loop over the other GPUs
    iterates `range(8)` and skips that index instead of hard-coding
    `range(1, 8)`.

  - Class-level `@pytest.mark.skip` decorator removed. The supporting
    `_CDI_CONFLICT_SKIP_REASON` constant is dropped too — kept neither
    in code nor as a runtime branch, since neither path is needed
    once the CDI-safe flow works.

  - `bystander_uuid` class fixture removed (no longer needed — the
    UUID/index discovery is post-creation, not pre).

`examples/multi-dgd-live-test/README.md` updated:
  - Finding #7 marked **Status: fixed** with cross-reference to this
    commit and an explicit description of the new (a)/(b)/(c) flow.
  - Expected-outcome counts updated: Path B now expects 5 PASSED /
    2 SKIPPED (was 4 PASSED / 3 SKIPPED — multi-pod conflict moves
    from SKIP to PASS); DCGM pass expects 7 PASSED / 2 SKIPPED on the
    next live run.
  - Live-run record row updated to note the rewrite and expected
    PASS outcome.

Verified locally: test module compiles + collects cleanly. The
end-to-end CDI-vs-non-CDI assignment behaviour is by construction the
device plugin's responsibility, so the failure mode of this rewrite
is bounded to "device plugin returns no GPU within 90 s" — handled
with a clear pytest.fail message that includes a before/after PID
snapshot for debugging.

Refs: PR #9683 review (Finding #7 in
examples/multi-dgd-live-test/README.md).

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 22, 2026
…dd SPDX header

Two follow-up fixes to commit 552be92 caught in PR #9683 review.

1. Bystander cannot schedule. The previous CDI-safe rewrite swapped
   `NVIDIA_VISIBLE_DEVICES=<uuid>` for a `nvidia.com/gpu: 1` device-
   plugin claim, expecting the plugin to allocate a free GPU. But
   the 5 stub workers in `50-stub-workers.yaml` already claim every
   `nvidia.com/gpu` slot on the test node (1+1+2+2+2 = 8 of 8), so
   the bystander's claim sits Pending forever and the conflict never
   materialises.

   Switch to the same pattern the Power Agent itself uses
   (`deploy/helm/charts/power-agent/values.yaml`):
   `NVIDIA_VISIBLE_DEVICES: "all"` exposes every GPU through the
   toolkit without consuming a device-plugin claim. The bystander
   pins itself to `cuda:0` (deterministically the first visible
   device — same NVML index 0 as the agent, since both have
   `NVIDIA_VISIBLE_DEVICES=all`). The existing PID-based discovery
   loop in the test still verifies which physical GPU the context
   landed on rather than trusting the mapping, so the test stays
   robust under future toolkit / CDI changes.

   Touches the inline `bystander_manifest` in
   `test_multi_dgd_live.py::TestMultiPodConflict` and the YAML
   companion `examples/multi-dgd-live-test/40-conflict-bystander-pod.yaml`,
   plus the README's Finding #7 status to document both
   non-options (the original `NVIDIA_VISIBLE_DEVICES=<uuid>` CDI
   rejection AND the `nvidia.com/gpu: 1` over-subscription) and
   the surviving fix-shape.

2. Copyright check failed. `40-conflict-bystander-pod.yaml` was
   imported from `.aic_sandbox/` in commit 004df25 without the
   standard two-line SPDX header that every other YAML in
   `examples/multi-dgd-live-test/` carries (00-namespace through
   50-stub-workers — verified via head-of-file scan). Add the
   header.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 22, 2026
… lychee

Lychee on PR #9683 fails on 5 URLs that aren't really URLs (caught
in review). All five are inherited from f7f3226 (the original
"feat(planner): power infrastructure …" commit on this branch) but
block #9683 from going green either way.

The 5 cited locations and fix shape:

- `components/src/dynamo/planner/config/defaults.py:48`
- `components/src/dynamo/planner/config/planner_config.py:147`
  In-cluster Prometheus URL
  `http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090`.
  `*.svc.cluster.local` is the canonical Kubernetes-internal DNS
  suffix and is not externally resolvable. Add an ignore entry for
  the suffix so any cluster-internal default URL passes lychee.

- `components/src/dynamo/planner/config/planner_config.py:317`
  Docstring placeholder `http://host:port/`.
- `deploy/helm/charts/platform/README.md:118-119`
  Helm-rendered Format hints `"nats://hostname:port"` /
  `"http://hostname:port"` / `"https://hostname:port"`.
  Add three anchored ignore entries for these exact placeholder
  forms (kept narrow so a real `*.com` hostname can't accidentally
  slip past as `hostname:port`).

- `deploy/helm/charts/platform/Chart.yaml:23`
  `home: https://nvidia.com`. The same file already uses
  `https://www.nvidia.com` on line 19 for `url:` and that one passes
  lychee, so the bare-apex form on line 23 must be hitting a 4xx
  (Cloudflare bot challenge, redirect-loop, etc.). Canonicalise to
  the `www.` form so the two `nvidia.com` URLs in the same file
  agree, and no ignore entry is needed.

Verified all 5 URLs match the new ignore patterns via local regex
spot-check; `https://www.nvidia.com` (existing line 19 + line 23 fix)
deliberately does NOT match any ignore pattern.

Signed-off-by: Kai Ma <kaim@nvidia.com>
@kaim-eng kaim-eng force-pushed the pr1a/power-agent branch 2 times, most recently from 1e1ef6f to 29f0831 Compare May 25, 2026 13:46
kaim-eng added a commit that referenced this pull request May 25, 2026
Cross-DGD complement to test_actuation_knobs_live.py (the single-DGD live
test that already ships with this PR). Validates invariants that
single-DGD tests structurally cannot reach:

  - A2  Three planners patch DISJOINT pod sets in one namespace
  - A3  AggPlanner-C uses only the DECODE branch (no PREFILL calls)
  - B2  Power Agent applies 8 distinct caps on one tick
  - B4  Happy-path zero-counter invariant
  - B5  Multi-pod conflict + safe-default behaviour (skipped pending
        Finding #7 — CDI-safe bystander rewrite tracked separately)
  - 8.5.6 / 8.6.2  NVML / DCGM cap-application proven by reading the
                   live driver state (not via mocks)

Layout introduced by this commit:

  components/src/dynamo/planner/tests/integration/test_multi_dgd_live.py
      ~1000 LoC pytest module. Opt-in via RUN_MULTI_DGD_LIVE=1 +
      TEST_NODE=<8-GPU node> env vars; auto-detects Path A (production
      shape: 3 real DGDs + power-aware planner) vs Path B (5 stub-worker
      pods with hardcoded annotations) and skips Path-A-only tests
      cleanly when running Path B. DCGM tests gated by
      DCGM_HOSTENGINE_AVAILABLE=1.

  examples/multi-dgd-live-test/
      Manifests + helm values overrides + teardown script + README
      walk-through that an operator applies before running the test:
        00-namespace.yaml                 — namespace + pull secret
        10-power-agent-values-nvml.yaml   — Power Agent NVML helm values
        11-power-agent-values-dcgm.yaml   — Power Agent DCGM helm values
        20-nvidia-dcgm-standalone-ds.yaml — DCGM 4.5.1 hostengine DS
        30/31/32-vllm-dgd-{A,B,C}.yaml    — three DGDs (Path A)
        40-conflict-bystander-pod.yaml    — multi-pod conflict fixture
        50-stub-workers.yaml              — Path-B stub workers
        99-teardown.sh                    — idempotent cleanup w/ cap-
                                            restore verification
        README.md                         — full reproduction playbook
                                            including a 2026-05-21 live
                                            run record + 11 findings

  docs/design-docs/multi-tenant-test.md
      Design doc (§8 — multi-DGD test plan) that establishes the
      ground-truth topology, the planner / agent contract under test,
      and the rationale for the artifact split above.

  pyproject.toml
      Registers two new test markers consumed by the live test's
      `pytestmark`:
        multi_dgd_live  — scenario selector for this specific file
        power_agent     — shared with components/power_agent/tests/

Module-level gating fails fast on misconfiguration (missing TEST_NODE,
missing kubeconfig) with actionable error messages, and skips silently
when RUN_MULTI_DGD_LIVE is unset — so this file is a no-op under the
default unit-test invocation. Verified: pytest --collect-only is a
clean no-op without the env vars, and full-import works under
RUN_MULTI_DGD_LIVE=1 (collection fails cleanly on absent k8s creds, not
ImportError).

Refs: docs/design-docs/multi-tenant-test.md §8, PR #9683 review.
Signed-off-by: Kai Ma <kaim@nvidia.com>
@kaim-eng kaim-eng force-pushed the pr1b/planner-infra branch from 7696f3f to 8f47413 Compare May 25, 2026 14:17
kaim-eng added a commit that referenced this pull request May 25, 2026
…nd DCGM 4.5

Two stacked defects surfaced on the 2026-05-21T20:11Z Path-B re-run
caused `test_dcgm_per_gpu_cap_matches_topology` and
`test_dcgm_target_config_registered` to skip silently when they should
have PASSED:

1. **`kubernetes-client>=35.0.0` mangles JSON pod-exec stdout.**
   `kubernetes.stream.stream(..., _preload_content=True)` sniffs
   JSON-shaped stdout, parses it client-side, and returns `str(dict)`.
   That means `printf '%s' '{"a":1}'` comes back as the literal 8-char
   string `{'a': 1}` (single quotes, Python-dict repr), which then
   trips `json.loads` at every read-back site — including the four
   inside `test_dcgm_*`. The intended-as-defensive
   "did not return JSON" skip branch fired instead.

2. **DCGM 4.5+ rotated the `dcgmi config --get -g <id> -j` JSON shape.**
   Pre-4.5: `body` is a list of per-GPU entries keyed by `"Power Limit (W)"`
   with flat `{"Current": "...", "Target": "..."}` children.
   4.5+:    `body` is a dict of fields keyed by `"Power Limit"`
            (no `(W)` suffix) with a `"children"` wrapper whose leaves
            are `{"value": "..."}` objects.
   Three breaking changes in one upgrade — `body` flipped list→dict,
   the field name lost its suffix, the leaves gained a `value` wrapper.

The previous implementation handled neither, so even if defect (1)
were absent, the strict per-field reader would still skip.

This commit lands a single combined fix:

- **`_exec_in_pod`**: switch from `_preload_content=True` (default) to
  `_preload_content=False` + raw stdout/stderr accumulation off the
  `WSClient` via `peek_stdout`/`read_stdout`. Bypasses the mangling
  regardless of which `kubernetes-client` version is installed, and is
  the documented "I want bytes back" pattern in the kubernetes-client
  examples.

- **New `_extract_power_limit_from_dcgmi_body(body)` helper**: walks
  both shapes uniformly and returns `(current_w, target_w)` ints or
  `(None, None)` if neither shape matches. Both DCGM read-back tests
  (`test_dcgm_per_gpu_cap_matches_topology` and
  `test_dcgm_target_config_registered`) now consume this helper
  instead of inlining duplicated nested-`dict.get` chains.

Verified locally:
  - synthetic 4.0-shape payload `[{"Power Limit (W)": {"Current": "350",
    "Target": "350"}}]`  → `(350, 350)` ✓
  - synthetic 4.5-shape payload `{"Power Limit": {"children":
    {"Current": {"value": "375"}, "Target": {"value": "375"}}}}`
      → `(375, 375)` ✓
  - missing field, wrong types, partial children: graceful `(None, None)`
    / `(int, None)` returns ✓
  - test module still collects cleanly (`pytest --collect-only`)

`examples/multi-dgd-live-test/README.md` Finding #11 updated to
"Status: fixed" with a cross-reference to this commit and the expected
SKIPPED→PASSED flip on the next live re-run.

Refs: PR #9683 review (Finding #11 in
examples/multi-dgd-live-test/README.md).

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added 4 commits June 8, 2026 12:39
…dd SPDX header

Two follow-up fixes to commit 552be92 caught in PR #9683 review.

1. Bystander cannot schedule. The previous CDI-safe rewrite swapped
   `NVIDIA_VISIBLE_DEVICES=<uuid>` for a `nvidia.com/gpu: 1` device-
   plugin claim, expecting the plugin to allocate a free GPU. But
   the 5 stub workers in `50-stub-workers.yaml` already claim every
   `nvidia.com/gpu` slot on the test node (1+1+2+2+2 = 8 of 8), so
   the bystander's claim sits Pending forever and the conflict never
   materialises.

   Switch to the same pattern the Power Agent itself uses
   (`deploy/helm/charts/power-agent/values.yaml`):
   `NVIDIA_VISIBLE_DEVICES: "all"` exposes every GPU through the
   toolkit without consuming a device-plugin claim. The bystander
   pins itself to `cuda:0` (deterministically the first visible
   device — same NVML index 0 as the agent, since both have
   `NVIDIA_VISIBLE_DEVICES=all`). The existing PID-based discovery
   loop in the test still verifies which physical GPU the context
   landed on rather than trusting the mapping, so the test stays
   robust under future toolkit / CDI changes.

   Touches the inline `bystander_manifest` in
   `test_multi_dgd_live.py::TestMultiPodConflict` and the YAML
   companion `examples/multi-dgd-live-test/40-conflict-bystander-pod.yaml`,
   plus the README's Finding #7 status to document both
   non-options (the original `NVIDIA_VISIBLE_DEVICES=<uuid>` CDI
   rejection AND the `nvidia.com/gpu: 1` over-subscription) and
   the surviving fix-shape.

2. Copyright check failed. `40-conflict-bystander-pod.yaml` was
   imported from `.aic_sandbox/` in commit 004df25 without the
   standard two-line SPDX header that every other YAML in
   `examples/multi-dgd-live-test/` carries (00-namespace through
   50-stub-workers — verified via head-of-file scan). Add the
   header.

Signed-off-by: Kai Ma <kaim@nvidia.com>
… lychee

Lychee on PR #9683 fails on 5 URLs that aren't really URLs (caught
in review). All five are inherited from f7f3226 (the original
"feat(planner): power infrastructure …" commit on this branch) but
block #9683 from going green either way.

The 5 cited locations and fix shape:

- `components/src/dynamo/planner/config/defaults.py:48`
- `components/src/dynamo/planner/config/planner_config.py:147`
  In-cluster Prometheus URL
  `http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090`.
  `*.svc.cluster.local` is the canonical Kubernetes-internal DNS
  suffix and is not externally resolvable. Add an ignore entry for
  the suffix so any cluster-internal default URL passes lychee.

- `components/src/dynamo/planner/config/planner_config.py:317`
  Docstring placeholder `http://host:port/`.
- `deploy/helm/charts/platform/README.md:118-119`
  Helm-rendered Format hints `"nats://hostname:port"` /
  `"http://hostname:port"` / `"https://hostname:port"`.
  Add three anchored ignore entries for these exact placeholder
  forms (kept narrow so a real `*.com` hostname can't accidentally
  slip past as `hostname:port`).

- `deploy/helm/charts/platform/Chart.yaml:23`
  `home: https://nvidia.com`. The same file already uses
  `https://www.nvidia.com` on line 19 for `url:` and that one passes
  lychee, so the bare-apex form on line 23 must be hitting a 4xx
  (Cloudflare bot challenge, redirect-loop, etc.). Canonicalise to
  the `www.` form so the two `nvidia.com` URLs in the same file
  agree, and no ignore entry is needed.

Verified all 5 URLs match the new ignore patterns via local regex
spot-check; `https://www.nvidia.com` (existing line 19 + line 23 fix)
deliberately does NOT match any ignore pattern.

Signed-off-by: Kai Ma <kaim@nvidia.com>
…at parse time

Follow-up to b10cb8c. Reviewer reproduced lychee still failing on
that head with two underlying problems the previous ignore-based fix
did not address:

  1. Lychee parses URLs *before* applying ignore rules. The
     `host:port` / `hostname:port` placeholders are extracted as URL
     candidates, then fail RFC-3986 port parsing (port is not numeric)
     and surface as errors that bypass the .lycheeignore matcher.

  2. The CI workflow (`.github/workflows/docs-link-check.yml`) invokes
     lychee with `--root-dir $WS . --exclude-path "ATTRIBUTIONS"
     --exclude-path "lib/llm/tests/data/.*"` and nothing else, so
     `.lycheeignore` itself is one of the files lychee scans. The
     b10cb8c version of that file contained literal
     `://hostname:port` text in both regex bodies AND comments —
     lychee extracted those as URLs too.

Two-part fix:

  A. Fix the placeholders at the source. Switch every `host:port` /
     `hostname:port` doc-string to the standard angle-bracket
     convention `<host>:<port>` / `<hostname>:<port>` — the same
     form used in Cobra/Click CLI help text and Markdown
     documentation. `<` is reserved by RFC 3986, so lychee's URL
     extractor halts at the `<` and never produces a candidate.
     Touches:
       - components/src/dynamo/planner/config/planner_config.py:317
         (live_dashboard_port Field description).
       - deploy/helm/charts/platform/values.yaml:67, 70
         (operator natsAddr / etcdAddr Format hints — source of
         truth for helm-docs).
       - deploy/helm/charts/platform/README.md:118-119
         (helm-docs-rendered downstream copy of the values.yaml
         hints — kept in lockstep so a future regeneration is a no-op).

  B. Make .lycheeignore safe to scan. Replace literal `://` with the
     character class `:[/][/]` in every regex body, and strip
     URL-shaped text from comments (the `*.svc.cluster.local`
     suffix in the surviving comment is a DNS suffix, not a URL,
     and lychee does not extract bare hostnames). Since (A) removes
     the need for `host:port` / `hostname:port` ignores entirely,
     drop those three entries — `.svc.cluster.local` (for the
     in-cluster Prometheus default) is the only domain-level
     escape we still need.

Verified locally: `.lycheeignore` now contains 0 literal `://`
substrings, and the surviving `\.svc\.cluster\.local` pattern still
matches `defaults.py:48` and `planner_config.py:147`.

Signed-off-by: Kai Ma <kaim@nvidia.com>
Per PR #9683 review (dynamo-ops on core/base.py:964): advisory mode is supposed to log decisions without mutating cluster state, but the per-tick call to _apply_power_annotations() was still issuing PATCH requests against customer-owned Pod objects on every tick. Pod annotations are a cluster-visible side effect, so this violated the same contract that _apply_scaling_targets already follows (core/base.py:826).

Fix: add `if self.config.advisory: return` as the first guard in _apply_power_annotations, mirroring the sibling method's pattern. Documented in the docstring so the contract is discoverable from future readers without grepping commit history.

Test: components/src/dynamo/planner/tests/unit/test_actuation_knobs.py::TestApplyPowerAnnotations::test_advisory_mode_skips_patch — uses the existing _bare_planner / _power_config / _mock_pod fixtures and asserts both that get_component_pods is never called AND that patch_pod_annotation is never called, even when a stale pod would otherwise trigger a PATCH.
Signed-off-by: Kai Ma <kaim@nvidia.com>
@kaim-eng kaim-eng force-pushed the pr1b/planner-infra branch from 35716ac to b1ebd8b Compare June 8, 2026 16:41
@kaim-eng

kaim-eng commented Jun 8, 2026

Copy link
Copy Markdown
Author

Rebased onto pr1a/power-agent — ready for review

This branch has been rebased onto the current tip of the stack base
pr1a/power-agent (#9682). All 8 planner commits replayed cleanly
(git range-diff confirms a 1:1 mapping; the diff vs the base is
planner-only — no power-agent files leaked in).

Conflicts resolved during the rebase

  • Chart versions (deploy/helm/charts/platform/Chart.yaml,
    components/operator/Chart.yaml, README.md): main had advanced both
    charts to 1.3.0. Re-applied this PR's patch-level bump (planner pod
    RBAC templates) on top → 1.3.1 (chart + appVersion + dependency
    pin + rendered README badges/table).
  • .lycheeignore: kept both sides — main's etcd.io/martinfowler
    entries and this PR's cluster-internal DNS + host:port placeholder
    entries.
  • Chart.yaml home: took the lychee-friendly https://www.nvidia.com.
  • test_kubernetes_connector.py: both sides were additive — kept
    main's new _main_container/_component/_deployment helpers and
    this PR's hermetic monkeypatch.delenv(...) in
    test_kubernetes_connector_no_env_var.

Semantic-merge fix (ruff F821, caught by CI pre-commit)

main renamed the planner helper
get_service_from_sub_component_type_or_name
get_component_from_type_or_name (now positional sub_component_type).
The git auto-merge updated every existing call site but this PR's new
KubernetesConnector.get_component_pods call site still used the old
name. Fixed connectors/kubernetes.py:326 to call
get_component_from_type_or_name(deployment, sub_component_type),
matching the sibling call sites (e.g. lines 115/133) and the definition in
monitoring/dgd_services.py. Folded into the feature commit.

Verification

  • ruff / flake8 / black / isort + all pre-commit content hooks: green.
  • helm lint operator subchart: green; helm template renders the new
    planner.yaml RBAC (ClusterRole cluster-wide / Role+SA+binding in
    namespace-restricted mode) without error.
  • DCO: 8/8 commits signed off, conventional-commit prefixes.
  • Planner unit suite is build-gated (compiled dynamo bindings) and runs in
    CI rather than locally; py_compile is clean on all resolved files.

CI status

  • Required checks green: DCO, copyright-checks,
    pre-merge-status-check (includes pre-commit, operator, snapshot).
  • The three NVIDIA Test Lab required checks
    (backend-/deploy-/dynamo-status-check) await the maintainer-triggered
    internal pipeline, same as feat(power-agent): per-node power-cap enforcement DaemonSet #9682 / feat(power-agent): add DCGM dual actuator (opt-in; NVML remains default) #9790.
  • The lone red check is non-required lychee, failing on a single link
    https://charts.bitnami.com/bitnami in the auto-generated platform chart
    README (the etcd dependency's repo URL). Bitnami sunset their public
    Helm registry, so this fails repo-wide from CI runners; it is not
    introduced by this PR and does not block merge. Left out of scope here
    (it belongs in a standalone docs/deps maintenance change).

@sttts this is the planner half of the #9369 split, rebased and green on
everything in scope — ready for a review stamp.

Comment thread .gitignore Outdated
/ "planner"
/ "tests"
/ "testbed"
/ "_runtime_stub.py"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comes in a follow-up PR ?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping the hook here for now. The pre-merge guard makes it a harmless no-op in this PR, so it isn't blocking. If you'd prefer zero early no-op code, I can move it into #9686 — let me know and I'll relocate it.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick clarification on the mechanism — my earlier "pre-merge guard" wording was imprecise and could read as pytest-marker gating, which isn't what's happening.

components/src/conftest.py is a no-op on this PR purely because of a file-existence check, not any marker/pre-merge logic:

_STUB_PATH = pathlib.Path(__file__).parent / "dynamo" / "planner" / "tests" / "testbed" / "_runtime_stub.py"

def _install_dynamo_core_stub_if_needed() -> None:
    if not _STUB_PATH.is_file():
        return
    ...

_STUB_PATH points at dynamo/planner/tests/testbed/_runtime_stub.py, which is not part of #9683 — it ships with the testbed in #9686. So on this branch _STUB_PATH.is_file() is False and the installer returns immediately; it only does real work once the testbed lands. (And if the real dynamo._core is importable, it's a no-op regardless.)

So it's harmless here for that concrete reason. Offer still stands: if you'd prefer zero early no-op code, I can relocate the hook to #9686 — just say the word.

Comment thread deploy/helm/charts/platform/Chart.yaml Outdated
Comment thread deploy/planner-pod-rbac-dev.yaml Outdated
Comment thread components/src/dynamo/planner/config/planner_config.py
Comment thread components/src/dynamo/planner/tests/unit/test_actuation_knobs.py Outdated
# upstream (Dockerfile DCGM tag + bindings path + missing logger.py shim).
# This NVML-pass build used DCGM 3.3.9 bindings (no DCGM-mode in
# this pass, so the 3.3.9 vs 4.5+ struct gap is irrelevant here).
repository: ttl.sh/dynamo-pa-kaim-4c3c606

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be the official repo?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentional — ttl.sh/dynamo-pa-* is an ephemeral 24h-TTL image used only for this live-repro scaffold, not a release artifact. The file's IMAGE NOTE already flags the production path: switch to nvcr.io/nvidia/dynamo/power-agent:vX.Y.Z once the NGC build lands. Keeping ttl.sh here so the example stays runnable today without depending on an unpublished registry tag. (The personal -kaim infix was scrubbed in 1326d0c737.)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To state it explicitly: ttl.sh/dynamo-pa-* is a dev / live-repro-only image — it is never a production default.

  • It appears only in examples/multi-dgd-live-test/10-power-agent-values-nvml.yaml, which is throwaway scaffolding for reproducing the multi-DGD scenario (24h-TTL anonymous registry, no imagePullSecret).
  • The shipped chart deploy/helm/charts/power-agent/values.yaml defaults image.repository to nvcr.io/nvidia/ai-dynamo/power-agent with a REQUIRED, no-default image.tag: "", so no release path can ever resolve to ttl.sh.

Keeping the ephemeral tag in the example only so it stays runnable today without depending on an unpublished NGC tag; the file's IMAGE NOTE already documents the production switch.

One correction I noticed while confirming this: that IMAGE NOTE still spells the registry nvcr.io/nvidia/dynamo/power-agent, whereas the production chart (and every other Dynamo image) uses nvcr.io/nvidia/ai-dynamo/power-agent. Happy to align the comment to ai-dynamo so the example's documented production path matches the chart.

# the agent. The agent then calls `list_pod_for_all_namespaces` and
# gets 403 Forbidden. Until the chart is fixed to wire
# `--namespace={{ .Release.Namespace }}` when namespaceRestricted=true,
# use cluster-scoped RBAC. See power_agent.py:798 (--namespace flag exists).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about these findings? They should stay?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are deliberate repro notes documenting real upstream gaps I hit while running this example (namespace-restricted RBAC 403, DCGM 3.3.9-vs-4.5+ struct gap, bindings path, missing logger.py shim). They stay — this directory is live-test reproduction material, and the notes are what let someone re-run it without re-discovering the same chart bugs. Happy to file a tracking issue and trim them to a one-line pointer if you'd rather keep the example terse.

Comment thread examples/multi-dgd-live-test/20-nvidia-dcgm-standalone-ds.yaml Outdated
Comment thread examples/multi-dgd-live-test/README.md Outdated
Comment thread components/src/dynamo/planner/core/base.py Outdated
kaim-eng added 6 commits June 9, 2026 16:21
…ersion revert

- planner_config: enforce ge=1 on prefill/decode_engine_gpu_power_limit so
  a 0 W or negative per-GPU cap is rejected at config validation time.
- test_actuation_knobs: convert stale v1alpha1 spec.services mocks to the
  v1beta1 spec.components list shape (type: prefill/decode) so the
  label-selector tests exercise real DGD role resolution instead of
  short-circuiting to [] via SubComponentNotFoundError.
- platform/operator charts: revert version 1.3.1 -> 1.3.0, operator
  appVersion, the dynamo-operator dependency version, and the home URL
  back to the main baseline (out-of-scope chart metadata churn).

Addresses sttts review feedback on #9683.

Signed-off-by: Kai Ma <kaim@nvidia.com>
The Phase-1 power-annotation reconcile ran on every tick, and each
sweep resolved prefill and decode pods independently — two DGD GETs
plus two pod lists per tick per DGD (~4 reads / 5s) before any PATCH.
This multiplies apiserver load for a side effect that rarely changes.

- Throttle steady-state sweeps to power_annotation_interval_seconds
  (new config, default 60s). run() gates _apply_power_annotations via
  _should_sweep_power_annotations(); the method itself stays an
  unthrottled single reconcile so direct-call tests are unaffected.
- Force a per-tick sweep for one interval after a scale-up
  (_scaling_up): freshly-created pods get annotated promptly instead
  of waiting out the throttle, so the throttle never strands new pods.
- Resolve prefill+decode from one DGD read per sweep by threading the
  fetched deployment through get_component_pods(deployment=...),
  cutting reads-per-sweep from 4 to 3 (the two pod lists remain).

Addresses sttts review feedback on #9683 (apiserver load from the
per-tick annotation scan).

Signed-off-by: Kai Ma <kaim@nvidia.com>
The multi-DGD power-aware live-test scaffold carried environment- and
author-specific identifiers, making it read as a personal repro rather
than a reusable example. Genericize them:

- namespace kaim-dynamo-system -> dynamo-power-test
- node-selector value power-test/node: kaim -> target
- image tags drop the personal infix (ttl.sh/dynamo-pa-kaim-* ->
  ttl.sh/dynamo-pa-*, ttl.sh/dynamo-dcgm45-kaim -> ttl.sh/dynamo-dcgm45)
- state hostPath dynamo-power-agent-kaim-test -> dynamo-power-agent-test
- README build scratch dir /tmp/kaim -> /tmp/dynamo-build, concrete AKS
  node names -> placeholder/$TEST_NODE, and the dpp-dev-env cluster name
  -> generic AKS-test-cluster wording (heading anchor kept in sync)

No behavioral change to the manifests; this is identifier hygiene only.
Addresses sttts review feedback on #9683.

Signed-off-by: Kai Ma <kaim@nvidia.com>
deploy/planner-pod-rbac-dev.yaml was a transitional Role + RoleBinding
granting pods get/list/watch/patch to planner-dev-sa, for clusters
running operator chart <= 1.2.0 whose planner ClusterRole lacked any
pods permissions. The operator chart now ships those rules built in:
templates/planner.yaml grants pods [get, list, watch, patch] to the
planner ServiceAccount in both namespace-restricted (planner-role) and
cluster-wide (...-planner ClusterRole) modes, a strict superset of the
dev file. The file was self-marked DEPRECATED for chart 1.2.1+, and the
chart here is 1.3.0, so the workaround is dead code.

Addresses sttts review feedback on #9683.

Signed-off-by: Kai Ma <kaim@nvidia.com>
Signed-off-by: Kai Ma <kaim@nvidia.com>
@kaim-eng

kaim-eng commented Jun 9, 2026

Copy link
Copy Markdown
Author

Review status — all @sttts threads resolved

Thanks for the thorough pass. Below is a consolidated status of every review thread: 8 resolved via code changes (verified on pr1b/planner-infra) and 3 resolved as deliberate review decisions with rationale posted in-thread.

Thread Resolution Verification
.gitignore Removed the personal scratch block; will use a global personal excludes file (core.excludesFile) instead .gitignore is back to the main baseline — no .tmp-* / dev-pod / qwen3 / planner-dev / test_k8s_access / test_planner_launch entries
components/src/conftest.py No code change — inert in this PR because the installer is file-existence gated (if not _STUB_PATH.is_file(): return), and testbed/_runtime_stub.py ships in the testbed PR (#9686), not here conftest.py:39 early-returns; not marker-gated. Clarification posted in-thread
deploy/helm/charts/platform/Chart.yaml Kept version: 1.3.0 as the dev target; reverted the unrelated home: churn version: 1.3.0 confirmed on branch
deploy/planner-pod-rbac-dev.yaml Deleted the deprecated dev RBAC manifest (operator chart already grants the superset) File no longer exists on branch
planner_config.py Added ge=1 to both per-GPU power-limit fields so a 0/negative cap is rejected at config validation prefill_engine_gpu_power_limit and decode_engine_gpu_power_limit both carry ge=1
test_actuation_knobs.py Converted dead v1alpha1 spec.services dict mocks to the v1beta1 spec.components list shape Mocks now use "components": [{"name": ..., "type": "prefill"/"decode"}]
examples/.../10-power-agent-values-nvml.yaml (image) ttl.sh stays as the dev/live-repro-only image (not a production default); the stale production-followup comment was corrected nvcr.io/nvidia/dynamo/power-agentnvcr.io/nvidia/ai-dynamo/power-agent Tag appears only under examples/multi-dgd-live-test/; production chart defaults to nvcr.io/nvidia/ai-dynamo/power-agent, tag: "" (required)
examples/.../10-power-agent-values-nvml.yaml:56 (findings) No code change — kept the findings as deliberate live-repro notes Reply posted with rationale
examples/.../20-nvidia-dcgm-standalone-ds.yaml Removed personal namespace/identifier references No kaim / dpp-dev-env identifiers remain in the example manifests
examples/.../README.md Removed personal namespace references; also scrubbed the related integration test No kaim / dpp-dev-env in README or test_multi_dgd_live.py
core/base.py::_apply_power_annotations What changed: added _should_sweep_power_annotations() throttling on power_annotation_interval_seconds (default 60s) with a force-window on scale-up, plus a single deployment-scoped pod read threaded into both the prefill and decode sweeps. Why it satisfies the concern: steady-state apiserver load drops from ~4 reads / 5s to ~3 reads / 60s per DGD, while freshly-scaled pods are still annotated promptly. _should_sweep_power_annotations, the 60.0 (gt=0) interval default, and get_component_pods(..., deployment=deployment) for both roles all confirmed on branch

kaim-eng added 2 commits June 9, 2026 22:54
…/ai-dynamo/power-agent

The multi-DGD live-test values comment and the multi-tenant test doc
pointed the production-followup image at nvcr.io/nvidia/dynamo/power-agent,
which is missing the ai- prefix. The shipped chart default
(deploy/helm/charts/power-agent/values.yaml) is
nvcr.io/nvidia/ai-dynamo/power-agent; align both references.

Signed-off-by: Kai Ma <kaim@nvidia.com>
Signed-off-by: Kai Ma <kaim@nvidia.com>

# Conflicts:
#	components/src/dynamo/planner/core/base.py
#	components/src/dynamo/planner/monitoring/planner_metrics.py
@kaim-eng kaim-eng requested a review from a team as a code owner June 10, 2026 03:09
kaim-eng added 2 commits June 10, 2026 13:35
Signed-off-by: Kai Ma <kaim@nvidia.com>
…safe sweep throttle)

_scaling_up read self.state_machine._num_p_workers/_num_d_workers directly. On the orchestrator path the `state_machine` property lazily constructs a PlannerStateMachine (and warms predictors) when none exists, and the freshly-built PSM reports zero worker counts the orchestrator path never updates. Every positive scaling decision then looks like a scale-up, holding the power-annotation force-window permanently open and defeating the steady-state sweep throttle.

Extract _current_worker_counts() (cached self._last_worker_counts first, PSM internals only when a state machine already exists, else 0,0) and use it in both _scaling_up and _log_decision_summary so the two cannot drift -- _scaling_up's docstring already claimed to mirror _log_decision_summary.

Tests: cached counts take precedence over PSM internals; orchestrator path (no state machine) uses cached counts without constructing a PSM; cached counts at/above the decision do not force a sweep.
Signed-off-by: Kai Ma <kaim@nvidia.com>
@kaim-eng

Copy link
Copy Markdown
Author

Status update -- Phase-1 planner infra.

Resolved 9 review threads: advisory-mode gate (if self.config.advisory: return first in _apply_power_annotations), .gitignore revert, Chart.yaml back to 1.3.0, planner-pod-rbac-dev.yaml removal, ge=1 validators on the engine power limits, v1beta1 components/type test mocks, the apiserver-load sweep throttle, and the multi-DGD identifier scrub. Also landed 7e74aa65ea: _scaling_up now reads cached worker counts (orchestrator-safe) via a shared _current_worker_counts() helper, fixing a latent throttle-defeat where the lazily-built PSM reported zero counts. CI re-ran green -> CLEAN.

@sttts -- three threads are left open because they are your call rather than a clear-cut fix:

  • conftest stub hook (#discussion_r3381755500) -- happy to relocate it to feat(planner): power planner stress testbed (α + γ) #9686 if you'd prefer zero early no-op code; otherwise it's a file-existence no-op on this branch.
  • ttl.sh image in the live-repro example (#discussion_r3381962018) -- intentional dev-only ephemeral tag; the shipped chart defaults to nvcr.io/nvidia/ai-dynamo/power-agent. Keep, or switch the example too?
  • repro findings notes (#discussion_r3381965777) -- kept as live-test reproduction material; can trim to a tracking-issue pointer if you'd rather.

Tell me your preference on those three and I'll action them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

actions deployment::k8s Relates to dynamo deployment in kubernetes documentation Improvements or additions to documentation feat planner size/XXL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants