feat(planner): power planner stress testbed (α + γ) by kaim-eng · Pull Request #9686 · ai-dynamo/dynamo

kaim-eng · 2026-05-18T15:54:55Z

Part of the PR #9369 split plan.
This is PR 5 of 6 (PR 4 — Stress Testbed, α + γ). Held in Draft per plan §4.5.

Predecessor: #9685 — AIC closed-loop optimizer
Successor: #9687 (Draft) — Docs + dev environment

Scope

The synthetic stress harness — deterministic, no GPU, no cluster. Adds α-class (27 synthetic scenarios) and γ-class (3 mocker-driven scenarios), including the replay-adapter timing fix, fakes, scenario YAMLs, and testbed self-tests.

~7,700 lines · ~65 files. Large by line count but easy to review: almost entirely new self-contained test files and YAML scenario definitions. Only 9 lines of production code change.

components/src/dynamo/planner/tests/testbed/ — full subtree (synthetic_fleet, fakes, runner, scenarios, A–F scenario YAMLs + γ G1–G3, replay/, traces/, grafana dashboards)
components/src/dynamo/planner/offline/replay_adapter.py — 9-line timing fix (now_s = max(tick.at_s, bridge_now_s)) — prevents stale-tick loops on sparse traces
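
The timing fix can be illustrated with a minimal sketch. Only the expression `now_s = max(tick.at_s, bridge_now_s)` comes from this PR; the `Tick` class, the `effective_now_s` helper, and the sample values are hypothetical stand-ins, not the adapter's real types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tick:
    """Hypothetical stand-in for a scheduled planner tick."""
    at_s: float  # time the tick was scheduled for, in seconds


def effective_now_s(tick: Tick, bridge_now_s: float) -> float:
    """Clamp "now" so it never lags behind the bridge clock."""
    return max(tick.at_s, bridge_now_s)


# Assumed sparse-trace situation: the bridge clock is already at 42 s
# while some scheduled ticks still carry earlier timestamps.
ticks = [Tick(at_s=5.0), Tick(at_s=10.0), Tick(at_s=50.0)]
bridge_now_s = 42.0
times = [effective_now_s(t, bridge_now_s) for t in ticks]
print(times)  # [42.0, 42.0, 50.0]
```

In this sketch, ticks scheduled before the bridge clock are evaluated at the bridge's time rather than at their stale timestamps, while ticks in the future keep their own time.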

An alternative α/γ split is available on request (plan §2.5). This defaults to a single PR because the testbed reads as one coherent design (powerplanner-testbed-design.md describes α and γ together).

Reviewer onboarding

Design context: docs/design-docs/powerplanner-testbed-design.md (lands in PR 5; readable from this branch directly via git show pr5/docs-devenv:docs/design-docs/powerplanner-testbed-design.md)
Plan section: §2.5 (this PR)

Tests at this tip (measured 2026-05-18, post-v3.3 rebase tip `cec5681f15`)

All PR 1a + 1b + 2 + 3 tests still pass
test_scenarios.py — α (27) + γ (3) = 30 passed
tests/test_fakes.py / test_overlay.py / test_scenarios_loadable.py — all passed
tests/test_self_consistency.py — passed (1 skipped — older create_disagg bridge can't drive AIC drift; auto-enables on newer mocker, see testbed design Appendix D.7)
tests/test_aic_real_data.py — passed (module-skipped without AIC_SANDBOX_DIR)
Testbed subtotal: 86 passed, 1 skipped
Full planner sweep: 696 passed, 6 skipped, 0 failures; power_agent 43 passed → 739 total

Merge strategy

Rebase-and-merge (no squash).

copy-pr-bot · 2026-05-18T15:54:58Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Adds the synthetic-metrics testbed: deterministic, no GPU, no cluster.

Alpha-class (27 scenarios, A1-F26): synthetic fleet + fake metrics + fake actuator drive the planner through fault-injection scenarios covering AIC drift, NVML clamps, K8s RBAC denials, node loss / recovery, Prometheus outages, MDC gaps, budget shrinkage, AIC infeasibility, and drift-threshold boundary cases.

Gamma-class (3 scenarios, G1-G3): mocker-driven trace replay with synthetic-power overlay (replay/synthetic_power_overlay.py) and a power-aware replay adapter (replay/power_aware_replay_adapter.py) exercise the closed loop against real Mooncake traces.

Infrastructure:
- runner.py + scenarios.py + assertions.py + recorder.py + clock.py (run-loop, scenario loader, invariant checks, recording).
- synthetic_fleet.py + fake_actuator.py + fake_aic.py + fake_planner_metrics.py + fake_prometheus.py (test doubles for every external dependency of the planner run loop).
- _runtime_stub.py installs a stub dynamo._core when the compiled Rust binding is absent, so the testbed runs on developer laptops without a CUDA toolchain (carries every dynamo._core symbol used by dynamo.llm at module-load time, including the post-rebase RoutingConstraints addition from main PR #9558).
- grafana/testbed_dashboard.json + systems/ (h100_pcie / h100_sxm / h200_sxm SKUs) + traces/placeholder_h200_disagg_1rps.jsonl provide a complete observability + replay stack.

Production code:
- offline/replay_adapter.py: 9-line timing fix (now_s = max(tick.at_s, bridge_now_s)) prevents stale-tick loops on sparse traces.

86 testbed tests + 30 scenarios (27 alpha + 3 gamma) ship green at this tip (1 skipped pending env-var; test_aic_real_data.py is module-skipped unless AIC_SANDBOX_DIR is set).

Part of the PR #9369 split (PR 4 of 6). See docs/design-docs/pr9369-split-plan.md.

Signed-off-by: Kai Ma <kaim@nvidia.com>
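
The stub-module approach described for _runtime_stub.py can be sketched roughly as follows. This is an assumption-laden illustration, not the file's actual contents: `install_stub` and the module name `_hypothetical_core` are invented for the example, `RoutingConstraints` is used only as a placeholder symbol name, and the sketch does not address parent-package handling for dotted names such as `dynamo._core`.

```python
import importlib
import sys
import types


def install_stub(module_name: str, symbols: list[str]) -> types.ModuleType:
    """Return the real module if it imports, else register a stub for it."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        pass  # compiled module is absent, fall through to the stub
    stub = types.ModuleType(module_name)
    for name in symbols:
        # An empty class is enough for `from module import Name` to succeed.
        setattr(stub, name, type(name, (), {}))
    sys.modules[module_name] = stub
    return stub


# Hypothetical module name that is not installed anywhere.
core = install_stub("_hypothetical_core", ["RoutingConstraints"])
from _hypothetical_core import RoutingConstraints

print(RoutingConstraints.__name__)  # RoutingConstraints
```

Because the stub is registered in `sys.modules` before any dependent import runs, later imports resolve against it; when the real module is available, the helper returns it untouched.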

The two ``[Appendix C.10]``/``[Appendix D.7]`` links in the testbed README pointed at ``docs/design-docs/powerplanner-testbed-design.md``, which is introduced by PR #9687 and does not yet exist on this branch. The Docs link check (lychee) has therefore failed on PR #9686 since the PR was opened on 2026-05-18 -- a cross-PR forward reference baked into the original PR #9369 split.

Convert both occurrences from ``[text](relative-path)`` syntax to plain backticked text references. The information value (appendix numbers + target file) is preserved; lychee no longer treats them as candidate links to resolve. Once both PRs land on ``main`` the file resolves naturally and reviewers can grep for the path.

No code or test change.

Cascade-affected branches: this commit lives on pr4/testbed; pr5/docs-devenv will rebase onto the new tip.

Signed-off-by: Kai Ma <kaim@nvidia.com>
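
The link-to-backtick conversion described in this commit could be scripted along these lines. The commit does not say how the edit was made or what the exact output format is, so the `delink` helper, the regular expression, the output shape, and the sample relative path are all assumptions for illustration.

```python
import re

# Matches inline Markdown links of the form [text](target).
LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def delink(markdown: str, target_suffix: str) -> str:
    """Rewrite links whose target ends with target_suffix as backticked text."""

    def repl(match: re.Match) -> str:
        text, target = match.group(1), match.group(2)
        if not target.endswith(target_suffix):
            return match.group(0)  # leave unrelated links untouched
        return f"`{text}` (`{target}`)"

    return LINK_RE.sub(repl, markdown)


# Hypothetical README line; the relative path is made up for the example.
line = "See [Appendix D.7](../powerplanner-testbed-design.md) and [docs](https://example.com)."
result = delink(line, "powerplanner-testbed-design.md")
print(result)
# See `Appendix D.7` (`../powerplanner-testbed-design.md`) and [docs](https://example.com).
```

Only links pointing at the not-yet-existing file are rewritten, so a link checker still validates every other link in the document.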

pull-request-size Bot added the size/XXL label May 18, 2026

github-actions Bot added the feat, documentation, and planner labels May 18, 2026

This was referenced May 18, 2026

feat(planner): AIC closed-loop optimizer #9685

Draft

docs(planner): power planner design docs + dev environment #9687

Draft

kaim-eng force-pushed the pr3/aic-optimizer branch from 7f28917 to c0e744b Compare May 18, 2026 16:04

kaim-eng force-pushed the pr4/testbed branch from 6995868 to 3fa0915 Compare May 18, 2026 16:05

kaim-eng force-pushed the pr3/aic-optimizer branch from c0e744b to 2761ac2 Compare May 19, 2026 12:58

kaim-eng force-pushed the pr4/testbed branch from 3fa0915 to 1d1b5b9 Compare May 19, 2026 12:58

kaim-eng force-pushed the pr3/aic-optimizer branch from 2761ac2 to 3edb72e Compare May 19, 2026 15:55

kaim-eng force-pushed the pr4/testbed branch from 1d1b5b9 to da36145 Compare May 19, 2026 15:56

kaim-eng mentioned this pull request Jun 10, 2026

feat(planner): power infrastructure — pod annotation, RBAC, config, Prometheus #9683

Open

kaim-eng wants to merge 2 commits into pr3/aic-optimizer from pr4/testbed
