Skip to content

Planner Plugin Framework#1

Open
kangclzjc wants to merge 8 commits into
mainfrom
plugin_frame_kang
Open

Planner Plugin Framework#1
kangclzjc wants to merge 8 commits into
mainfrom
plugin_frame_kang

Conversation

@kangclzjc

Copy link
Copy Markdown
Owner

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

tthakkal and others added 4 commits May 11, 2026 06:21
…") (ai-dynamo#9034)

Signed-off-by: Thanaji Rao Thakkalapelli <thanaji.rao.thakkalapelli@intel.com>
Signed-off-by: Kang Zhang <kangz@nvidia.com>
Branch-only items that don't belong in the DEP-XXXX plugin framework PR:
- ConnectorBusyError + the kubernetes.py raise (separable DX fix)
- GlobalPlannerMetrics class (family-4 metrics, server+client wiring
  never reached production)
- EXECUTE family-5 metrics in PlannerPrometheusMetrics + the
  ConnectorBusyError handling in _apply_scaling_targets
- metrics parameter threading in ScaleRequestHandler + GlobalPlannerConnector
- Branch-only tests for the above three (test_scale_handler_metrics,
  test_apply_scaling_targets, test_global_planner_metrics)

Also align gateway socket default with EndpointsConfig.uds_socket_path
(/var/run/dynamo/planner/registry.sock instead of /tmp) to avoid the
ConfigMap schema split.
Signed-off-by: Kang Zhang <kangz@nvidia.com>

@github-advanced-security github-advanced-security AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

kangclzjc added 4 commits May 12, 2026 20:13
…s comments

- Delete _wire_predicted_load_if_supported (never called by main loop
  in production; entire predicted_load chain on orchestrator path
  was 95% wired but never connected the last step). Its only callers
  were the test file, which is removed as well.
- Update OrchestratorEngineAdapter comment to reflect the actual
  consumer of predicted_* fields (diagnostics recorder + Prometheus
  gauges, not the removed wire).
- Tighten TickDiagnostics plugin-era field comments.
- Drop "(family N)" suffixes from metric section names; replace with
  descriptive labels (plugin invocation, RECONCILE/CONSTRAIN clamp,
  tick scheduling).
- Remove stale generated planner_report HTML and audit-events doc.

Signed-off-by: Kang Zhang <kangz@nvidia.com>
Remove 7 test files whose coverage is duplicated by other suites:

- test_type_aware_properties.py + test_chain_augment_properties.py
  (hypothesis property tests; basic / worked-example / constrain /
  short-circuit / clamp-tracking tests already cover the algorithms)
- test_g3_real_parity.py (orchestrator G3 parity; the same scenarios
  flow through test_engine_adapter.py's test_g3_parity_via_adapter)
- test_external_plugin_wiring.py (W1+W2 startup wiring; overlaps with
  test_external_bootstrap.py + test_external_plugin_e2e.py)
- test_replay_mooncake_trace.py (real-trace replay; synthetic
  test_replay_dual_path.py already exercises dual-path replay)
- test_external_plugin_subprocess_e2e.py + subprocess runner
  (cross-process gRPC; in-process gRPC equivalent in
  test_external_plugin_e2e.py covers the same contract)

Inline the _observe_fpm_into_regressions helper into the one
remaining caller (test_builtin_plugins_e2e.py) and clean up stale
docstring references to the deleted files.

Signed-off-by: Kang Zhang <kangz@nvidia.com>
These design documents were used to plan the plugin framework
implementation and are no longer needed in-tree.

Signed-off-by: Kang Zhang <kangz@nvidia.com>
Remove dangling references to DEP-XXXX_* design docs and AGENTS.md
from in-tree READMEs and CLAUDE.md. Affected files:

- plugins/CLAUDE.md (intro paragraph + Pointers table)
- plugins/proto/v1/README.md (References)
- plugins/merge/README.md (worked-example table footer + property
  test references for already-removed test files)
- plugins/registry/README.md (Pointers table)
- plugins/transport/README.md (References)
- plugins/orchestrator/README.md (Pointers table)
- tests/plugins/g3_fixtures/README.md (intro)

Signed-off-by: Kang Zhang <kangz@nvidia.com>
@kangclzjc kangclzjc changed the title Plugin frame kang Planner Plugin Framework May 12, 2026
kangclzjc added a commit that referenced this pull request May 15, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.
kangclzjc added a commit that referenced this pull request May 15, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.
kangclzjc added a commit that referenced this pull request May 15, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.
kangclzjc added a commit that referenced this pull request May 16, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.
kangclzjc added a commit that referenced this pull request May 16, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.
kangclzjc added a commit that referenced this pull request May 18, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.
kangclzjc added a commit that referenced this pull request May 18, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.
kangclzjc added a commit that referenced this pull request May 18, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.
kangclzjc added a commit that referenced this pull request May 18, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.
kangclzjc added a commit that referenced this pull request May 18, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.
kangclzjc added a commit that referenced this pull request May 18, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.
kangclzjc added a commit that referenced this pull request May 18, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.
kangclzjc added a commit that referenced this pull request May 18, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.
kangclzjc added a commit that referenced this pull request May 18, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.
kangclzjc added a commit that referenced this pull request May 25, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.
kangclzjc added a commit that referenced this pull request May 25, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.
kangclzjc added a commit that referenced this pull request May 25, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.
kangclzjc added a commit that referenced this pull request May 26, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.
kangclzjc added a commit that referenced this pull request May 26, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.
kangclzjc added a commit that referenced this pull request May 26, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.
kangclzjc added a commit that referenced this pull request May 26, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.
kangclzjc added a commit that referenced this pull request May 26, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.
kangclzjc added a commit that referenced this pull request May 26, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.
kangclzjc added a commit that referenced this pull request May 26, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.
kangclzjc added a commit that referenced this pull request May 26, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.
kangclzjc added a commit that referenced this pull request May 26, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.
kangclzjc added a commit that referenced this pull request May 26, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.
kangclzjc added a commit that referenced this pull request May 26, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.
kangclzjc added a commit that referenced this pull request May 26, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.
kangclzjc added a commit that referenced this pull request May 27, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.
kangclzjc added a commit that referenced this pull request May 29, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.

Signed-off-by: Kang Zhang <kangz@nvidia.com>
kangclzjc added a commit that referenced this pull request May 29, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.

Signed-off-by: Kang Zhang <kangz@nvidia.com>
kangclzjc added a commit that referenced this pull request May 29, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.

Signed-off-by: Kang Zhang <kangz@nvidia.com>
kangclzjc added a commit that referenced this pull request May 31, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.

Signed-off-by: Kang Zhang <kangz@nvidia.com>
kangclzjc added a commit that referenced this pull request Jun 2, 2026
``OrchestratorEngineAdapter.__init__`` hard-coded ``WallClock()``, which
left no seam for replay paths to propagate trace time into the plugin
layer.  Plugin scheduler ``_is_due``, CircuitBreaker cooldown, and
HOLD_LAST cache age all read ``self._clock.monotonic()`` — under a
fast-forward replay (e.g. 1hr trace in <10s real time), wall-clock
barely moves and any plugin with ``execution_interval_seconds`` larger
than the real-time duration never re-fires after its first call.

This is invisible in PR #1's current ship surface (PR #1 has no
builtin plugins, K8s smoke runs in real time, and PSM-only replay
goes through ``_PSMEngineAdapter`` instead) but would block PR #10
(``use_orchestrator=True`` default) — once orchestrator becomes the
default path, mooncake replay must work.

Fix:

- ``OrchestratorEngineAdapter.__init__`` accepts an optional
  ``clock: Clock`` kwarg, defaulting to ``WallClock`` so production
  behaviour is unchanged.
- ``engine_adapter.tick()`` bumps the clock to ``tick_input.now_s``
  at the start of every tick when a ``VirtualClock`` is in play
  (``advance(delta)`` only if ``delta > 0`` — backwards trace time
  is a silent no-op rather than a crash).
- ``ReplayPlannerEngine`` constructs a ``VirtualClock`` and passes it
  to the adapter on the orchestrator path so plugin scheduler sees
  trace time.

Regression tests in ``test_engine_adapter.py``:

- ``test_tick_advances_injected_virtual_clock_to_trace_time``:
  drive two ticks at trace time 180s and 360s, assert clock follows.
- ``test_tick_does_not_advance_clock_backwards``: pre-advance the
  clock past tick_input.now_s, assert no exception and clock stays
  put.
- ``test_default_clock_is_wallclock``: lock production default so a
  future refactor that flips it doesn't silently break K8s.

Full planner suite: 830 passed, 1 skipped, 0 failed.

Signed-off-by: Kang Zhang <kangz@nvidia.com>
kangclzjc added a commit that referenced this pull request Jun 2, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.

Signed-off-by: Kang Zhang <kangz@nvidia.com>
kangclzjc added a commit that referenced this pull request Jun 2, 2026
``OrchestratorEngineAdapter.__init__`` hard-coded ``WallClock()``, which
left no seam for replay paths to propagate trace time into the plugin
layer.  Plugin scheduler ``_is_due``, CircuitBreaker cooldown, and
HOLD_LAST cache age all read ``self._clock.monotonic()`` — under a
fast-forward replay (e.g. 1hr trace in <10s real time), wall-clock
barely moves and any plugin with ``execution_interval_seconds`` larger
than the real-time duration never re-fires after its first call.

This is invisible in PR #1's current ship surface (PR #1 has no
builtin plugins, K8s smoke runs in real time, and PSM-only replay
goes through ``_PSMEngineAdapter`` instead) but would block PR #10
(``use_orchestrator=True`` default) — once orchestrator becomes the
default path, mooncake replay must work.

Fix:

- ``OrchestratorEngineAdapter.__init__`` accepts an optional
  ``clock: Clock`` kwarg, defaulting to ``WallClock`` so production
  behaviour is unchanged.
- ``engine_adapter.tick()`` bumps the clock to ``tick_input.now_s``
  at the start of every tick when a ``VirtualClock`` is in play
  (``advance(delta)`` only if ``delta > 0`` — backwards trace time
  is a silent no-op rather than a crash).
- ``ReplayPlannerEngine`` constructs a ``VirtualClock`` and passes it
  to the adapter on the orchestrator path so plugin scheduler sees
  trace time.

Regression tests in ``test_engine_adapter.py``:

- ``test_tick_advances_injected_virtual_clock_to_trace_time``:
  drive two ticks at trace time 180s and 360s, assert clock follows.
- ``test_tick_does_not_advance_clock_backwards``: pre-advance the
  clock past tick_input.now_s, assert no exception and clock stays
  put.
- ``test_default_clock_is_wallclock``: lock production default so a
  future refactor that flips it doesn't silently break K8s.

Full planner suite: 830 passed, 1 skipped, 0 failed.

Signed-off-by: Kang Zhang <kangz@nvidia.com>
kangclzjc added a commit that referenced this pull request Jun 3, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.

Signed-off-by: Kang Zhang <kangz@nvidia.com>
kangclzjc added a commit that referenced this pull request Jun 3, 2026
``OrchestratorEngineAdapter.__init__`` hard-coded ``WallClock()``, which
left no seam for replay paths to propagate trace time into the plugin
layer.  Plugin scheduler ``_is_due``, CircuitBreaker cooldown, and
HOLD_LAST cache age all read ``self._clock.monotonic()`` — under a
fast-forward replay (e.g. 1hr trace in <10s real time), wall-clock
barely moves and any plugin with ``execution_interval_seconds`` larger
than the real-time duration never re-fires after its first call.

This is invisible in PR #1's current ship surface (PR #1 has no
builtin plugins, K8s smoke runs in real time, and PSM-only replay
goes through ``_PSMEngineAdapter`` instead) but would block PR #10
(``use_orchestrator=True`` default) — once orchestrator becomes the
default path, mooncake replay must work.

Fix:

- ``OrchestratorEngineAdapter.__init__`` accepts an optional
  ``clock: Clock`` kwarg, defaulting to ``WallClock`` so production
  behaviour is unchanged.
- ``engine_adapter.tick()`` bumps the clock to ``tick_input.now_s``
  at the start of every tick when a ``VirtualClock`` is in play
  (``advance(delta)`` only if ``delta > 0`` — backwards trace time
  is a silent no-op rather than a crash).
- ``ReplayPlannerEngine`` constructs a ``VirtualClock`` and passes it
  to the adapter on the orchestrator path so plugin scheduler sees
  trace time.

Regression tests in ``test_engine_adapter.py``:

- ``test_tick_advances_injected_virtual_clock_to_trace_time``:
  drive two ticks at trace time 180s and 360s, assert clock follows.
- ``test_tick_does_not_advance_clock_backwards``: pre-advance the
  clock past tick_input.now_s, assert no exception and clock stays
  put.
- ``test_default_clock_is_wallclock``: lock production default so a
  future refactor that flips it doesn't silently break K8s.

Full planner suite: 830 passed, 1 skipped, 0 failed.

Signed-off-by: Kang Zhang <kangz@nvidia.com>
kangclzjc added a commit that referenced this pull request Jun 4, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.

Signed-off-by: Kang Zhang <kangz@nvidia.com>
kangclzjc added a commit that referenced this pull request Jun 4, 2026
``OrchestratorEngineAdapter.__init__`` hard-coded ``WallClock()``, which
left no seam for replay paths to propagate trace time into the plugin
layer.  Plugin scheduler ``_is_due``, CircuitBreaker cooldown, and
HOLD_LAST cache age all read ``self._clock.monotonic()`` — under a
fast-forward replay (e.g. 1hr trace in <10s real time), wall-clock
barely moves and any plugin with ``execution_interval_seconds`` larger
than the real-time duration never re-fires after its first call.

This is invisible in PR #1's current ship surface (PR #1 has no
builtin plugins, K8s smoke runs in real time, and PSM-only replay
goes through ``_PSMEngineAdapter`` instead) but would block PR #10
(``use_orchestrator=True`` default) — once orchestrator becomes the
default path, mooncake replay must work.

Fix:

- ``OrchestratorEngineAdapter.__init__`` accepts an optional
  ``clock: Clock`` kwarg, defaulting to ``WallClock`` so production
  behaviour is unchanged.
- ``engine_adapter.tick()`` bumps the clock to ``tick_input.now_s``
  at the start of every tick when a ``VirtualClock`` is in play
  (``advance(delta)`` only if ``delta > 0`` — backwards trace time
  is a silent no-op rather than a crash).
- ``ReplayPlannerEngine`` constructs a ``VirtualClock`` and passes it
  to the adapter on the orchestrator path so plugin scheduler sees
  trace time.

Regression tests in ``test_engine_adapter.py``:

- ``test_tick_advances_injected_virtual_clock_to_trace_time``:
  drive two ticks at trace time 180s and 360s, assert clock follows.
- ``test_tick_does_not_advance_clock_backwards``: pre-advance the
  clock past tick_input.now_s, assert no exception and clock stays
  put.
- ``test_default_clock_is_wallclock``: lock production default so a
  future refactor that flips it doesn't silently break K8s.

Full planner suite: 830 passed, 1 skipped, 0 failed.

Signed-off-by: Kang Zhang <kangz@nvidia.com>
kangclzjc added a commit that referenced this pull request Jun 4, 2026
…oder reuse, doc fixes

Low-risk cleanup batch from the independent review (no decision-path change):

- ai-dynamo#4 chain_augment: add ``predicted_kv_hit_rate`` to ``_PREDICTION_FIELDS``
  so it participates in first-writer-wins partial merge like the other three
  predicted_* fields (was silently dropped in any 2+ plugin PREDICT chain,
  contradicting the proto/Pydantic contract). +2 chain_augment tests.
- #10 engine_adapter: add ``scale_down_capped_by_throughput`` to
  ``_aggregate_disagg_load_reason`` priority (PSM disagg emits it; placed
  between scale_up and scale_down to mirror PSM's _PRIORITY).
- ai-dynamo#11 dead code: drop ``contributing_plugin_ids`` (built, never read) in
  pipeline._run_fanout_stage; drop ``_set_enabled`` + ``_plugin_ids``
  (no caller in PR #1; would KeyError if reached).
- ai-dynamo#18 _encode_fpm: use the canonical
  ``dynamo.common.forward_pass_metrics.encode`` (shared module-level encoder)
  instead of allocating a fresh ``msgspec.msgpack.Encoder`` per tick and
  re-implementing the encoding. Byte-identical wire format; keeps FPM
  serialization in lock-step with the rest of dynamo.
- ai-dynamo#17 transport ABC docstring: timeout is enforced by the transport
  (``call()`` wraps ``asyncio.wait_for``), not the orchestrator — the
  pipeline uses a bare gather to avoid double-counting the deadline.
- ai-dynamo#20 scheduler docstring: note the heartbeat-eviction monitor is not wired
  in this PR (last_heartbeat_at is recorded but unread; monitor is follow-up).
- ai-dynamo#21 transport contract test: 7 inputs (not 8) → 14 cases (multi_pool fixture
  was removed with component_name; comments were stale).
- ai-dynamo#22 metrics test: remove the dead no-op ``pass`` loop in _sample_value.

828 planner tests pass (was 825; +3 chain-augment / merge tests).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
kangclzjc added a commit that referenced this pull request Jun 7, 2026
…o#10124)

Signed-off-by: Kang Zhang <kangz@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown

This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants