feat(planner): power planner stress testbed (α + γ) #9686
Draft
kaim-eng wants to merge 2 commits into
This was referenced May 18, 2026
Adds the synthetic-metrics testbed: deterministic, no GPU, no cluster.

Alpha-class (27 scenarios, A1-F26): synthetic fleet + fake metrics + fake actuator drive the planner through fault-injection scenarios covering AIC drift, NVML clamps, K8s RBAC denials, node loss / recovery, Prometheus outages, MDC gaps, budget shrinkage, AIC infeasibility, and drift-threshold boundary cases.

Gamma-class (3 scenarios, G1-G3): mocker-driven trace replay with synthetic-power overlay (replay/synthetic_power_overlay.py) and a power-aware replay adapter (replay/power_aware_replay_adapter.py) exercise the closed loop against real Mooncake traces.

Infrastructure:
- runner.py + scenarios.py + assertions.py + recorder.py + clock.py (run-loop, scenario loader, invariant checks, recording).
- synthetic_fleet.py + fake_actuator.py + fake_aic.py + fake_planner_metrics.py + fake_prometheus.py (test doubles for every external dependency of the planner run loop).
- _runtime_stub.py installs a stub dynamo._core when the compiled Rust binding is absent, so the testbed runs on developer laptops without a CUDA toolchain (carries every dynamo._core symbol used by dynamo.llm at module-load time, including the post-rebase RoutingConstraints addition from main PR #9558).
- grafana/testbed_dashboard.json + systems/ (h100_pcie / h100_sxm / h200_sxm SKUs) + traces/placeholder_h200_disagg_1rps.jsonl provide a complete observability + replay stack.

Production code:
- offline/replay_adapter.py: 9-line timing fix (now_s = max(tick.at_s, bridge_now_s)) prevents stale-tick loops on sparse traces.

86 testbed tests + 30 scenarios (27 alpha + 3 gamma) ship green at this tip (1 skipped pending env-var; test_aic_real_data.py is module-skipped unless AIC_SANDBOX_DIR is set).

Part of the PR #9369 split (PR 4 of 6). See docs/design-docs/pr9369-split-plan.md.

Signed-off-by: Kai Ma <kaim@nvidia.com>
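The stub-installation idea described for _runtime_stub.py can be sketched roughly as follows. This is a minimal illustration, not the contents of that file: the module name `hypothetical_pkg._core`, the helper `install_stub_if_missing`, and the placeholder `RoutingConstraints` class are assumptions made up for the example.

```python
import importlib
import sys
import types


class RoutingConstraints:
    """Placeholder standing in for a symbol the real binding would export."""


def install_stub_if_missing(module_name: str, symbols: dict) -> bool:
    """Install a stub under module_name unless it is already importable.

    Returns True when a stub was installed, False otherwise.
    """
    try:
        importlib.import_module(module_name)
        return False  # real (or previously stubbed) module is available
    except ImportError:
        pass

    stub = types.ModuleType(module_name)
    for name, value in symbols.items():
        setattr(stub, name, value)
    sys.modules[module_name] = stub

    # Make the stub reachable as an attribute of its parent package.
    parent_name, _, child_name = module_name.rpartition(".")
    if parent_name:
        parent = sys.modules.get(parent_name)
        if parent is None:
            # Hypothetical demo only: the parent package is also absent here.
            parent = types.ModuleType(parent_name)
            parent.__path__ = []
            sys.modules[parent_name] = parent
        setattr(parent, child_name, stub)
    return True


installed = install_stub_if_missing(
    "hypothetical_pkg._core", {"RoutingConstraints": RoutingConstraints}
)
core = importlib.import_module("hypothetical_pkg._core")
```

Registering the stub in `sys.modules` before any dependent module is imported is what lets later `import` statements resolve without the compiled extension.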
The two ``[Appendix C.10]``/``[Appendix D.7]`` links in the testbed README pointed at ``docs/design-docs/powerplanner-testbed-design.md``, which is introduced by PR #9687 and does not yet exist on this branch. The Docs link check (lychee) has therefore failed on PR #9686 since the PR was opened on 2026-05-18 -- a cross-PR forward reference baked into the original PR #9369 split.

Convert both occurrences from ``[text](relative-path)`` syntax to plain backticked text references. The information value (appendix numbers + target file) is preserved; lychee no longer treats them as candidate links to resolve. Once both PRs land on ``main`` the file resolves naturally and reviewers can grep for the path.

No code or test change.

Cascade-affected branches: this commit lives on pr4/testbed; pr5/docs-devenv will rebase onto the new tip.

Signed-off-by: Kai Ma <kaim@nvidia.com>
Part of the PR #9369 split plan.
This is PR 4 of 6 (Stress Testbed, α + γ). Held in Draft per plan §4.5.
Predecessor: #9685 — AIC closed-loop optimizer
Successor: #9687 (Draft) — Docs + dev environment
Scope
The synthetic stress harness — deterministic, no GPU, no cluster. Adds α-class (27 synthetic scenarios) and γ-class (3 mocker-driven scenarios), including the replay-adapter timing fix, fakes, scenario YAMLs, and testbed self-tests.
~7,700 lines · ~65 files. Large by line count but easy to review: almost entirely new self-contained test files and YAML scenario definitions. Only 9 lines of production code change.
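The effect of the replay-adapter timing fix, `now_s = max(tick.at_s, bridge_now_s)`, can be illustrated with a small sketch. Only that expression comes from this PR; the `Tick` dataclass, the helper names, and the sample clock values are hypothetical, and the pre-fix behaviour is not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Tick:
    at_s: float  # scheduled planner tick time, in seconds


def effective_now_s(tick: Tick, bridge_now_s: float) -> float:
    """Clamp the replay time so it never falls behind the tick's schedule."""
    return max(tick.at_s, bridge_now_s)


def replay_times(ticks: Sequence[Tick], bridge_clock: Sequence[float]) -> List[float]:
    """Pair each tick with the bridge clock reading observed at that tick."""
    return [effective_now_s(t, b) for t, b in zip(ticks, bridge_clock)]


ticks = [Tick(at_s=float(s)) for s in (0, 5, 10, 15, 20)]
# Assumed sparse trace: the bridge clock stalls at 3.0 s while no requests arrive.
bridge_clock = [0.0, 3.0, 3.0, 3.0, 22.0]
times = replay_times(ticks, bridge_clock)
```

With these assumed inputs the effective time keeps advancing with the tick schedule while the bridge clock is stalled, and follows the bridge clock once it moves ahead.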
- components/src/dynamo/planner/tests/testbed/: full subtree (synthetic_fleet, fakes, runner, scenarios, A–F scenario YAMLs + γ G1–G3, replay/, traces/, grafana dashboards)
- components/src/dynamo/planner/offline/replay_adapter.py: 9-line timing fix (now_s = max(tick.at_s, bridge_now_s)) that prevents stale-tick loops on sparse traces
Reviewer onboarding
- docs/design-docs/powerplanner-testbed-design.md (lands in PR 5; readable from this branch directly via git show pr5/docs-devenv:docs/design-docs/powerplanner-testbed-design.md)
Tests at this tip (measured 2026-05-18, post-v3.3 rebase tip cec5681f15)
- test_scenarios.py: α (27) + γ (3) = 30 passed
- tests/test_fakes.py / test_overlay.py / test_scenarios_loadable.py: all passed
- tests/test_self_consistency.py: passed (1 skipped: older create_disagg bridge can't drive AIC drift; auto-enables on newer mocker, see testbed design Appendix D.7)
- tests/test_aic_real_data.py: passed (module-skipped without AIC_SANDBOX_DIR)
Merge strategy
Rebase-and-merge (no squash).
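For reviewers unfamiliar with the module-level skip noted for tests/test_aic_real_data.py, the gating can be sketched as below. This is an illustration under assumptions, not the code in that file: the helper `module_skip_reason` is hypothetical, and the extra directory check is an assumption beyond what the description states (it only says the module is skipped when AIC_SANDBOX_DIR is unset).

```python
import os
from typing import Mapping, Optional

ENV_VAR = "AIC_SANDBOX_DIR"


def module_skip_reason(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return a skip reason, or None when the real-data tests can run."""
    path = env.get(ENV_VAR, "")
    if not path:
        return f"{ENV_VAR} is not set"
    # Assumption: the variable should point at an existing directory.
    if not os.path.isdir(path):
        return f"{ENV_VAR}={path!r} is not a directory"
    return None


# At the top of a pytest module this would typically be used as:
#   reason = module_skip_reason()
#   if reason:
#       pytest.skip(reason, allow_module_level=True)
```

Keeping the decision in a plain function makes the gate itself testable without setting the environment variable.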