Planner Plugin Framework#1
Open
kangclzjc wants to merge 8 commits into
Open
Conversation
…") (ai-dynamo#9034) Signed-off-by: Thanaji Rao Thakkalapelli <thanaji.rao.thakkalapelli@intel.com>
Signed-off-by: Kang Zhang <kangz@nvidia.com>
Branch-only items that don't belong in the DEP-XXXX plugin framework PR: - ConnectorBusyError + the kubernetes.py raise (separable DX fix) - GlobalPlannerMetrics class (family-4 metrics, server+client wiring never reached production) - EXECUTE family-5 metrics in PlannerPrometheusMetrics + the ConnectorBusyError handling in _apply_scaling_targets - metrics parameter threading in ScaleRequestHandler + GlobalPlannerConnector - Branch-only tests for the above three (test_scale_handler_metrics, test_apply_scaling_targets, test_global_planner_metrics) Also align gateway socket default with EndpointsConfig.uds_socket_path (/var/run/dynamo/planner/registry.sock instead of /tmp) to avoid the ConfigMap schema split.
Signed-off-by: Kang Zhang <kangz@nvidia.com>
There was a problem hiding this comment.
CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.
…s comments - Delete _wire_predicted_load_if_supported (never called by main loop in production; entire predicted_load chain on orchestrator path was 95% wired but never connected the last step). Its only callers were the test file, which is removed as well. - Update OrchestratorEngineAdapter comment to reflect the actual consumer of predicted_* fields (diagnostics recorder + Prometheus gauges, not the removed wire). - Tighten TickDiagnostics plugin-era field comments. - Drop "(family N)" suffixes from metric section names; replace with descriptive labels (plugin invocation, RECONCILE/CONSTRAIN clamp, tick scheduling). - Remove stale generated planner_report HTML and audit-events doc. Signed-off-by: Kang Zhang <kangz@nvidia.com>
Remove 7 test files whose coverage is duplicated by other suites: - test_type_aware_properties.py + test_chain_augment_properties.py (hypothesis property tests; basic / worked-example / constrain / short-circuit / clamp-tracking tests already cover the algorithms) - test_g3_real_parity.py (orchestrator G3 parity; the same scenarios flow through test_engine_adapter.py's test_g3_parity_via_adapter) - test_external_plugin_wiring.py (W1+W2 startup wiring; overlaps with test_external_bootstrap.py + test_external_plugin_e2e.py) - test_replay_mooncake_trace.py (real-trace replay; synthetic test_replay_dual_path.py already exercises dual-path replay) - test_external_plugin_subprocess_e2e.py + subprocess runner (cross-process gRPC; in-process gRPC equivalent in test_external_plugin_e2e.py covers the same contract) Inline the _observe_fpm_into_regressions helper into the one remaining caller (test_builtin_plugins_e2e.py) and clean up stale docstring references to the deleted files. Signed-off-by: Kang Zhang <kangz@nvidia.com>
These design documents were used to plan the plugin framework implementation and are no longer needed in-tree. Signed-off-by: Kang Zhang <kangz@nvidia.com>
Remove dangling references to DEP-XXXX_* design docs and AGENTS.md from in-tree READMEs and CLAUDE.md. Affected files: - plugins/CLAUDE.md (intro paragraph + Pointers table) - plugins/proto/v1/README.md (References) - plugins/merge/README.md (worked-example table footer + property test references for already-removed test files) - plugins/registry/README.md (Pointers table) - plugins/transport/README.md (References) - plugins/orchestrator/README.md (Pointers table) - tests/plugins/g3_fixtures/README.md (intro) Signed-off-by: Kang Zhang <kangz@nvidia.com>
kangclzjc
added a commit
that referenced
this pull request
May 15, 2026
Squashed from 8 development commits. See PR description for full context. Infrastructure-only — builtin plugins + dual-path parity tests land in the follow-up PR.
kangclzjc
added a commit
that referenced
this pull request
May 15, 2026
Squashed from 8 development commits. See PR description for full context. Infrastructure-only — builtin plugins + dual-path parity tests land in the follow-up PR.
kangclzjc
added a commit
that referenced
this pull request
May 15, 2026
Squashed from 8 development commits. See PR description for full context. Infrastructure-only — builtin plugins + dual-path parity tests land in the follow-up PR.
kangclzjc
added a commit
that referenced
this pull request
May 16, 2026
Squashed from 8 development commits. See PR description for full context. Infrastructure-only — builtin plugins + dual-path parity tests land in the follow-up PR.
kangclzjc
added a commit
that referenced
this pull request
May 16, 2026
Squashed from 8 development commits. See PR description for full context. Infrastructure-only — builtin plugins + dual-path parity tests land in the follow-up PR.
kangclzjc
added a commit
that referenced
this pull request
May 18, 2026
Squashed from 8 development commits. See PR description for full context. Infrastructure-only — builtin plugins + dual-path parity tests land in the follow-up PR.
kangclzjc
added a commit
that referenced
this pull request
May 18, 2026
Squashed from 8 development commits. See PR description for full context. Infrastructure-only — builtin plugins + dual-path parity tests land in the follow-up PR.
kangclzjc
added a commit
that referenced
this pull request
May 18, 2026
Squashed from 8 development commits. See PR description for full context. Infrastructure-only — builtin plugins + dual-path parity tests land in the follow-up PR.
kangclzjc
added a commit
that referenced
this pull request
May 18, 2026
Squashed from 8 development commits. See PR description for full context. Infrastructure-only — builtin plugins + dual-path parity tests land in the follow-up PR.
kangclzjc
added a commit
that referenced
this pull request
May 18, 2026
Squashed from 8 development commits. See PR description for full context. Infrastructure-only — builtin plugins + dual-path parity tests land in the follow-up PR.
kangclzjc
added a commit
that referenced
this pull request
May 18, 2026
Squashed from 8 development commits. See PR description for full context. Infrastructure-only — builtin plugins + dual-path parity tests land in the follow-up PR.
kangclzjc
added a commit
that referenced
this pull request
May 18, 2026
Squashed from 8 development commits. See PR description for full context. Infrastructure-only — builtin plugins + dual-path parity tests land in the follow-up PR.
kangclzjc
added a commit
that referenced
this pull request
May 18, 2026
Squashed from 8 development commits. See PR description for full context. Infrastructure-only — builtin plugins + dual-path parity tests land in the follow-up PR.
kangclzjc
added a commit
that referenced
this pull request
May 18, 2026
Squashed from 8 development commits. See PR description for full context. Infrastructure-only — builtin plugins + dual-path parity tests land in the follow-up PR.
kangclzjc
added a commit
that referenced
this pull request
May 25, 2026
Squashed from 8 development commits. See PR description for full context. Infrastructure-only — builtin plugins + dual-path parity tests land in the follow-up PR.
kangclzjc
added a commit
that referenced
this pull request
May 25, 2026
Squashed from 8 development commits. See PR description for full context. Infrastructure-only — builtin plugins + dual-path parity tests land in the follow-up PR.
kangclzjc
added a commit
that referenced
this pull request
May 25, 2026
Squashed from 8 development commits. See PR description for full context. Infrastructure-only — builtin plugins + dual-path parity tests land in the follow-up PR.
kangclzjc
added a commit
that referenced
this pull request
May 26, 2026
Squashed from 8 development commits. See PR description for full context. Infrastructure-only — builtin plugins + dual-path parity tests land in the follow-up PR.
kangclzjc
added a commit
that referenced
this pull request
May 26, 2026
Squashed from 8 development commits. See PR description for full context. Infrastructure-only — builtin plugins + dual-path parity tests land in the follow-up PR.
kangclzjc
added a commit
that referenced
this pull request
May 26, 2026
Squashed from 8 development commits. See PR description for full context. Infrastructure-only — builtin plugins + dual-path parity tests land in the follow-up PR.
kangclzjc
added a commit
that referenced
this pull request
May 26, 2026
Squashed from 8 development commits. See PR description for full context. Infrastructure-only — builtin plugins + dual-path parity tests land in the follow-up PR.
kangclzjc
added a commit
that referenced
this pull request
May 26, 2026
Squashed from 8 development commits. See PR description for full context. Infrastructure-only — builtin plugins + dual-path parity tests land in the follow-up PR.
kangclzjc
added a commit
that referenced
this pull request
May 26, 2026
Squashed from 8 development commits. See PR description for full context. Infrastructure-only — builtin plugins + dual-path parity tests land in the follow-up PR.
kangclzjc
added a commit
that referenced
this pull request
May 26, 2026
Squashed from 8 development commits. See PR description for full context. Infrastructure-only — builtin plugins + dual-path parity tests land in the follow-up PR.
kangclzjc
added a commit
that referenced
this pull request
May 26, 2026
Squashed from 8 development commits. See PR description for full context. Infrastructure-only — builtin plugins + dual-path parity tests land in the follow-up PR.
kangclzjc
added a commit
that referenced
this pull request
May 26, 2026
Squashed from 8 development commits. See PR description for full context. Infrastructure-only — builtin plugins + dual-path parity tests land in the follow-up PR.
kangclzjc
added a commit
that referenced
this pull request
May 26, 2026
Squashed from 8 development commits. See PR description for full context. Infrastructure-only — builtin plugins + dual-path parity tests land in the follow-up PR.
kangclzjc
added a commit
that referenced
this pull request
May 26, 2026
Squashed from 8 development commits. See PR description for full context. Infrastructure-only — builtin plugins + dual-path parity tests land in the follow-up PR.
kangclzjc
added a commit
that referenced
this pull request
May 26, 2026
Squashed from 8 development commits. See PR description for full context. Infrastructure-only — builtin plugins + dual-path parity tests land in the follow-up PR.
kangclzjc
added a commit
that referenced
this pull request
May 27, 2026
Squashed from 8 development commits. See PR description for full context. Infrastructure-only — builtin plugins + dual-path parity tests land in the follow-up PR.
kangclzjc
added a commit
that referenced
this pull request
May 29, 2026
Squashed from 8 development commits. See PR description for full context. Infrastructure-only — builtin plugins + dual-path parity tests land in the follow-up PR. Signed-off-by: Kang Zhang <kangz@nvidia.com>
kangclzjc
added a commit
that referenced
this pull request
May 29, 2026
Squashed from 8 development commits. See PR description for full context. Infrastructure-only — builtin plugins + dual-path parity tests land in the follow-up PR. Signed-off-by: Kang Zhang <kangz@nvidia.com>
kangclzjc
added a commit
that referenced
this pull request
May 29, 2026
Squashed from 8 development commits. See PR description for full context. Infrastructure-only — builtin plugins + dual-path parity tests land in the follow-up PR. Signed-off-by: Kang Zhang <kangz@nvidia.com>
kangclzjc
added a commit
that referenced
this pull request
May 31, 2026
Squashed from 8 development commits. See PR description for full context. Infrastructure-only — builtin plugins + dual-path parity tests land in the follow-up PR. Signed-off-by: Kang Zhang <kangz@nvidia.com>
kangclzjc
added a commit
that referenced
this pull request
Jun 2, 2026
``OrchestratorEngineAdapter.__init__`` hard-coded ``WallClock()``, which left no seam for replay paths to propagate trace time into the plugin layer. Plugin scheduler ``_is_due``, CircuitBreaker cooldown, and HOLD_LAST cache age all read ``self._clock.monotonic()`` — under a fast-forward replay (e.g. 1hr trace in <10s real time), wall-clock barely moves and any plugin with ``execution_interval_seconds`` larger than the real-time duration never re-fires after its first call. This is invisible in PR #1's current ship surface (PR #1 has no builtin plugins, K8s smoke runs in real time, and PSM-only replay goes through ``_PSMEngineAdapter`` instead) but would block PR #10 (``use_orchestrator=True`` default) — once orchestrator becomes the default path, mooncake replay must work. Fix: - ``OrchestratorEngineAdapter.__init__`` accepts an optional ``clock: Clock`` kwarg, defaulting to ``WallClock`` so production behaviour is unchanged. - ``engine_adapter.tick()`` bumps the clock to ``tick_input.now_s`` at the start of every tick when a ``VirtualClock`` is in play (``advance(delta)`` only if ``delta > 0`` — backwards trace time is a silent no-op rather than a crash). - ``ReplayPlannerEngine`` constructs a ``VirtualClock`` and passes it to the adapter on the orchestrator path so plugin scheduler sees trace time. Regression tests in ``test_engine_adapter.py``: - ``test_tick_advances_injected_virtual_clock_to_trace_time``: drive two ticks at trace time 180s and 360s, assert clock follows. - ``test_tick_does_not_advance_clock_backwards``: pre-advance the clock past tick_input.now_s, assert no exception and clock stays put. - ``test_default_clock_is_wallclock``: lock production default so a future refactor that flips it doesn't silently break K8s. Full planner suite: 830 passed, 1 skipped, 0 failed. Signed-off-by: Kang Zhang <kangz@nvidia.com>
kangclzjc
added a commit
that referenced
this pull request
Jun 2, 2026
Squashed from 8 development commits. See PR description for full context. Infrastructure-only — builtin plugins + dual-path parity tests land in the follow-up PR. Signed-off-by: Kang Zhang <kangz@nvidia.com>
kangclzjc
added a commit
that referenced
this pull request
Jun 2, 2026
``OrchestratorEngineAdapter.__init__`` hard-coded ``WallClock()``, which left no seam for replay paths to propagate trace time into the plugin layer. Plugin scheduler ``_is_due``, CircuitBreaker cooldown, and HOLD_LAST cache age all read ``self._clock.monotonic()`` — under a fast-forward replay (e.g. 1hr trace in <10s real time), wall-clock barely moves and any plugin with ``execution_interval_seconds`` larger than the real-time duration never re-fires after its first call. This is invisible in PR #1's current ship surface (PR #1 has no builtin plugins, K8s smoke runs in real time, and PSM-only replay goes through ``_PSMEngineAdapter`` instead) but would block PR #10 (``use_orchestrator=True`` default) — once orchestrator becomes the default path, mooncake replay must work. Fix: - ``OrchestratorEngineAdapter.__init__`` accepts an optional ``clock: Clock`` kwarg, defaulting to ``WallClock`` so production behaviour is unchanged. - ``engine_adapter.tick()`` bumps the clock to ``tick_input.now_s`` at the start of every tick when a ``VirtualClock`` is in play (``advance(delta)`` only if ``delta > 0`` — backwards trace time is a silent no-op rather than a crash). - ``ReplayPlannerEngine`` constructs a ``VirtualClock`` and passes it to the adapter on the orchestrator path so plugin scheduler sees trace time. Regression tests in ``test_engine_adapter.py``: - ``test_tick_advances_injected_virtual_clock_to_trace_time``: drive two ticks at trace time 180s and 360s, assert clock follows. - ``test_tick_does_not_advance_clock_backwards``: pre-advance the clock past tick_input.now_s, assert no exception and clock stays put. - ``test_default_clock_is_wallclock``: lock production default so a future refactor that flips it doesn't silently break K8s. Full planner suite: 830 passed, 1 skipped, 0 failed. Signed-off-by: Kang Zhang <kangz@nvidia.com>
kangclzjc
added a commit
that referenced
this pull request
Jun 3, 2026
Squashed from 8 development commits. See PR description for full context. Infrastructure-only — builtin plugins + dual-path parity tests land in the follow-up PR. Signed-off-by: Kang Zhang <kangz@nvidia.com>
kangclzjc
added a commit
that referenced
this pull request
Jun 3, 2026
``OrchestratorEngineAdapter.__init__`` hard-coded ``WallClock()``, which left no seam for replay paths to propagate trace time into the plugin layer. Plugin scheduler ``_is_due``, CircuitBreaker cooldown, and HOLD_LAST cache age all read ``self._clock.monotonic()`` — under a fast-forward replay (e.g. 1hr trace in <10s real time), wall-clock barely moves and any plugin with ``execution_interval_seconds`` larger than the real-time duration never re-fires after its first call. This is invisible in PR #1's current ship surface (PR #1 has no builtin plugins, K8s smoke runs in real time, and PSM-only replay goes through ``_PSMEngineAdapter`` instead) but would block PR #10 (``use_orchestrator=True`` default) — once orchestrator becomes the default path, mooncake replay must work. Fix: - ``OrchestratorEngineAdapter.__init__`` accepts an optional ``clock: Clock`` kwarg, defaulting to ``WallClock`` so production behaviour is unchanged. - ``engine_adapter.tick()`` bumps the clock to ``tick_input.now_s`` at the start of every tick when a ``VirtualClock`` is in play (``advance(delta)`` only if ``delta > 0`` — backwards trace time is a silent no-op rather than a crash). - ``ReplayPlannerEngine`` constructs a ``VirtualClock`` and passes it to the adapter on the orchestrator path so plugin scheduler sees trace time. Regression tests in ``test_engine_adapter.py``: - ``test_tick_advances_injected_virtual_clock_to_trace_time``: drive two ticks at trace time 180s and 360s, assert clock follows. - ``test_tick_does_not_advance_clock_backwards``: pre-advance the clock past tick_input.now_s, assert no exception and clock stays put. - ``test_default_clock_is_wallclock``: lock production default so a future refactor that flips it doesn't silently break K8s. Full planner suite: 830 passed, 1 skipped, 0 failed. Signed-off-by: Kang Zhang <kangz@nvidia.com>
kangclzjc
added a commit
that referenced
this pull request
Jun 4, 2026
Squashed from 8 development commits. See PR description for full context. Infrastructure-only — builtin plugins + dual-path parity tests land in the follow-up PR. Signed-off-by: Kang Zhang <kangz@nvidia.com>
kangclzjc
added a commit
that referenced
this pull request
Jun 4, 2026
``OrchestratorEngineAdapter.__init__`` hard-coded ``WallClock()``, which left no seam for replay paths to propagate trace time into the plugin layer. Plugin scheduler ``_is_due``, CircuitBreaker cooldown, and HOLD_LAST cache age all read ``self._clock.monotonic()`` — under a fast-forward replay (e.g. 1hr trace in <10s real time), wall-clock barely moves and any plugin with ``execution_interval_seconds`` larger than the real-time duration never re-fires after its first call. This is invisible in PR #1's current ship surface (PR #1 has no builtin plugins, K8s smoke runs in real time, and PSM-only replay goes through ``_PSMEngineAdapter`` instead) but would block PR #10 (``use_orchestrator=True`` default) — once orchestrator becomes the default path, mooncake replay must work. Fix: - ``OrchestratorEngineAdapter.__init__`` accepts an optional ``clock: Clock`` kwarg, defaulting to ``WallClock`` so production behaviour is unchanged. - ``engine_adapter.tick()`` bumps the clock to ``tick_input.now_s`` at the start of every tick when a ``VirtualClock`` is in play (``advance(delta)`` only if ``delta > 0`` — backwards trace time is a silent no-op rather than a crash). - ``ReplayPlannerEngine`` constructs a ``VirtualClock`` and passes it to the adapter on the orchestrator path so plugin scheduler sees trace time. Regression tests in ``test_engine_adapter.py``: - ``test_tick_advances_injected_virtual_clock_to_trace_time``: drive two ticks at trace time 180s and 360s, assert clock follows. - ``test_tick_does_not_advance_clock_backwards``: pre-advance the clock past tick_input.now_s, assert no exception and clock stays put. - ``test_default_clock_is_wallclock``: lock production default so a future refactor that flips it doesn't silently break K8s. Full planner suite: 830 passed, 1 skipped, 0 failed. Signed-off-by: Kang Zhang <kangz@nvidia.com>
kangclzjc
added a commit
that referenced
this pull request
Jun 4, 2026
…oder reuse, doc fixes Low-risk cleanup batch from the independent review (no decision-path change): - ai-dynamo#4 chain_augment: add ``predicted_kv_hit_rate`` to ``_PREDICTION_FIELDS`` so it participates in first-writer-wins partial merge like the other three predicted_* fields (was silently dropped in any 2+ plugin PREDICT chain, contradicting the proto/Pydantic contract). +2 chain_augment tests. - #10 engine_adapter: add ``scale_down_capped_by_throughput`` to ``_aggregate_disagg_load_reason`` priority (PSM disagg emits it; placed between scale_up and scale_down to mirror PSM's _PRIORITY). - ai-dynamo#11 dead code: drop ``contributing_plugin_ids`` (built, never read) in pipeline._run_fanout_stage; drop ``_set_enabled`` + ``_plugin_ids`` (no caller in PR #1; would KeyError if reached). - ai-dynamo#18 _encode_fpm: use the canonical ``dynamo.common.forward_pass_metrics.encode`` (shared module-level encoder) instead of allocating a fresh ``msgspec.msgpack.Encoder`` per tick and re-implementing the encoding. Byte-identical wire format; keeps FPM serialization in lock-step with the rest of dynamo. - ai-dynamo#17 transport ABC docstring: timeout is enforced by the transport (``call()`` wraps ``asyncio.wait_for``), not the orchestrator — the pipeline uses a bare gather to avoid double-counting the deadline. - ai-dynamo#20 scheduler docstring: note the heartbeat-eviction monitor is not wired in this PR (last_heartbeat_at is recorded but unread; monitor is follow-up). - ai-dynamo#21 transport contract test: 7 inputs (not 8) → 14 cases (multi_pool fixture was removed with component_name; comments were stale). - ai-dynamo#22 metrics test: remove the dead no-op ``pass`` loop in _sample_value. 828 planner tests pass (was 825; +3 chain-augment / merge tests). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
kangclzjc
added a commit
that referenced
this pull request
Jun 7, 2026
|
This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Overview:
Details:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)