feat(planner): power planner stress testbed (α + γ) #9686

Draft
kaim-eng wants to merge 2 commits into pr3/aic-optimizer from pr4/testbed

kaim-eng commented May 18, 2026

Part of the PR #9369 split plan.
This is the fifth of six PRs in the split (PR 4: Stress Testbed, α + γ). Held in Draft per plan §4.5.

Predecessor: #9685 — AIC closed-loop optimizer
Successor: #9687 (Draft) — Docs + dev environment

Scope

The synthetic stress harness — deterministic, no GPU, no cluster. Adds α-class (27 synthetic scenarios) and γ-class (3 mocker-driven scenarios), including the replay-adapter timing fix, fakes, scenario YAMLs, and testbed self-tests.

~7,700 lines · ~65 files. Large by line count but easy to review: almost entirely new self-contained test files and YAML scenario definitions. Only 9 lines of production code change.

  • components/src/dynamo/planner/tests/testbed/ — full subtree (synthetic_fleet, fakes, runner, scenarios, A–F scenario YAMLs + γ G1–G3, replay/, traces/, grafana dashboards)
  • components/src/dynamo/planner/offline/replay_adapter.py — 9-line timing fix (now_s = max(tick.at_s, bridge_now_s)) — prevents stale-tick loops on sparse traces
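
As a minimal sketch of the clamp named above (not the actual 9-line diff): assuming each replay tick carries a scheduled timestamp `at_s` and the bridge reports its own clock as `bridge_now_s`, taking the maximum keeps `now_s` from falling behind the bridge clock. The `Tick` dataclass and the `effective_now_s` helper are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Tick:
    # Scheduled time of the tick in seconds (hypothetical field layout).
    at_s: float


def effective_now_s(tick: Tick, bridge_now_s: float) -> float:
    """Return the later of the tick's scheduled time and the bridge clock."""
    return max(tick.at_s, bridge_now_s)


# Bridge clock already past the tick: use the bridge clock.
late = effective_now_s(Tick(at_s=10.0), bridge_now_s=25.0)
# Tick scheduled ahead of the bridge clock: use the tick time.
early = effective_now_s(Tick(at_s=30.0), bridge_now_s=25.0)
```

Under these assumptions, a tick whose scheduled time has already been overtaken by the bridge clock is processed at the bridge's current time instead of being treated as still pending.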

Alternative α/γ split available on request (plan §2.5). Defaulting to a single PR because the testbed reads as one coherent design (powerplanner-testbed-design.md describes α and γ together).

Reviewer onboarding

  • Design context: docs/design-docs/powerplanner-testbed-design.md (lands in PR 5; readable from this branch directly via git show pr5/docs-devenv:docs/design-docs/powerplanner-testbed-design.md)
  • Plan section: §2.5 (this PR)

Tests at this tip (measured 2026-05-18, post-v3.3 rebase tip cec5681f15)

  • All PR 1a + 1b + 2 + 3 tests still pass
  • test_scenarios.py — α (27) + γ (3) = 30 passed
  • tests/test_fakes.py / test_overlay.py / test_scenarios_loadable.py — all passed
  • tests/test_self_consistency.py — passed (1 skipped — older create_disagg bridge can't drive AIC drift; auto-enables on newer mocker, see testbed design Appendix D.7)
  • tests/test_aic_real_data.py — passed (module-skipped without AIC_SANDBOX_DIR)
  • Testbed subtotal: 86 passed, 1 skipped
  • Full planner sweep: 696 passed, 6 skipped, 0 failures; power_agent 43 passed; 739 total

Merge strategy

Rebase-and-merge (no squash).

copy-pr-bot Bot commented May 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Adds the synthetic-metrics testbed: deterministic, no GPU, no cluster.

Alpha-class (27 scenarios, A1-F26): synthetic fleet + fake metrics
+ fake actuator drive the planner through fault-injection scenarios
covering AIC drift, NVML clamps, K8s RBAC denials, node loss / recovery,
Prometheus outages, MDC gaps, budget shrinkage, AIC infeasibility,
and drift-threshold boundary cases.

Gamma-class (3 scenarios, G1-G3): mocker-driven trace replay with
synthetic-power overlay (replay/synthetic_power_overlay.py) and a
power-aware replay adapter (replay/power_aware_replay_adapter.py)
exercise the closed loop against real Mooncake traces.

Infrastructure:
- runner.py + scenarios.py + assertions.py + recorder.py + clock.py
  (run-loop, scenario loader, invariant checks, recording).
- synthetic_fleet.py + fake_actuator.py + fake_aic.py +
  fake_planner_metrics.py + fake_prometheus.py (test doubles for
  every external dependency of the planner run loop).
- _runtime_stub.py installs a stub dynamo._core when the compiled
  Rust binding is absent, so the testbed runs on developer laptops
  without a CUDA toolchain (carries every dynamo._core symbol used
  by dynamo.llm at module-load time, including the post-rebase
  RoutingConstraints addition from main PR #9558).
- grafana/testbed_dashboard.json + systems/ (h100_pcie / h100_sxm /
  h200_sxm SKUs) + traces/placeholder_h200_disagg_1rps.jsonl provide
  a complete observability + replay stack.
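
The stub-installation idea can be sketched roughly as follows. This is an illustration only, not the contents of `_runtime_stub.py`: `install_stub` and the empty placeholder classes are hypothetical, and the example uses a made-up top-level module name rather than `dynamo._core`.

```python
import importlib
import sys
import types


def install_stub(module_name: str, symbols: list[str]) -> bool:
    """Register a placeholder module only when the real one cannot be imported.

    Returns True if a stub was installed, False if the module already imports.
    """
    try:
        importlib.import_module(module_name)
        return False
    except ImportError:
        pass

    stub = types.ModuleType(module_name)
    for name in symbols:
        # Each symbol becomes an empty placeholder class so that
        # `from <module> import <name>` succeeds at module-load time.
        setattr(stub, name, type(name, (), {}))
    sys.modules[module_name] = stub
    return True


# Hypothetical module name standing in for a compiled binding.
installed = install_stub("hypothetical_native_core", ["RoutingConstraints"])
from hypothetical_native_core import RoutingConstraints
```

A second call with the same name returns `False`, because the stub is then already present in `sys.modules` and the import succeeds.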

Production code:
- offline/replay_adapter.py: 9-line timing fix
  (now_s = max(tick.at_s, bridge_now_s)) prevents stale-tick loops
  on sparse traces.

86 testbed tests + 30 scenarios (27 alpha + 3 gamma) ship green at
this tip (1 skipped pending env-var; test_aic_real_data.py is
module-skipped unless AIC_SANDBOX_DIR is set).

Part of the PR #9369 split (PR 4 of 6). See docs/design-docs/pr9369-split-plan.md.

Signed-off-by: Kai Ma <kaim@nvidia.com>

The two ``[Appendix C.10]``/``[Appendix D.7]`` links in the testbed
README pointed at ``docs/design-docs/powerplanner-testbed-design.md``,
which is introduced by PR #9687 and does not yet exist on this branch.
The Docs link check (lychee) has therefore failed on PR #9686 since the
PR was opened on 2026-05-18 -- a cross-PR forward reference baked into
the original PR #9369 split.

Convert both occurrences from ``[text](relative-path)`` syntax to plain
backticked text references. The information value (appendix numbers +
target file) is preserved; lychee no longer treats them as candidate
links to resolve. Once both PRs land on ``main`` the file resolves
naturally and reviewers can grep for the path.
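
For illustration, a conversion of this kind could be scripted along the following lines. This is a sketch under assumptions, not the edit actually made to the README: the `delink` helper, the regular expression, and the output wording (text followed by "in" and the path) are all hypothetical.

```python
import re

# Matches inline Markdown links of the form [text](path).
LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def delink(markdown: str, target: str) -> str:
    """Turn links whose path contains `target` into backticked plain text."""

    def repl(match: re.Match) -> str:
        text, path = match.group(1), match.group(2)
        if target not in path:
            return match.group(0)  # leave unrelated links untouched
        return f"`{text}` in `{target}`"

    return LINK.sub(repl, markdown)


sample = "See [Appendix D.7](../docs/x.md#d7) and [other](README.md)."
converted = delink(sample, "docs/x.md")
```

Links that point elsewhere are left as they are, so a link checker still validates them.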

No code or test change. Cascade-affected branches: this commit lives on
pr4/testbed; pr5/docs-devenv will rebase onto the new tip.

Signed-off-by: Kai Ma <kaim@nvidia.com>
Labels: documentation, feat, planner, size/XXL
