feat(planner): plugin framework infrastructure (PR #1 of 2) by kangclzjc · Pull Request #2 · kangclzjc/dynamo

kangclzjc · 2026-05-15T06:29:07Z

What this PR does

Introduces the plugin framework infrastructure for the Dynamo Planner: a parallel tick-engine path that runs decisions through a PREDICT → PROPOSE → RECONCILE → CONSTRAIN pipeline driven by in-process or out-of-process plugins. The legacy PlannerStateMachine (PSM) path remains the default; the new path is opt-in per planner via scheduling.use_orchestrator=true.

This PR is infrastructure only. It deliberately does not include builtin plugin implementations — those land in the follow-up PR with their parity tests.

Scope

In this PR

Proto + Pydantic types + bidirectional bridge for the 4 stage RPCs (PredictPlugin / ProposePlugin / ReconcilePlugin / ConstrainPlugin) plus PluginRegistry and PluginLifecycle.
Transport layer: PluginTransport ABC + two client implementations — InProcessTransport (inproc://) and GrpcTransport (grpc://, plaintext only, gated behind allow_insecure_grpc=True with a startup WARNING). UDS as a client transport was considered but dropped — see Design decisions below. mTLS support lands in PR 1.5.
PluginRegistry + PluginScheduler + CircuitBreaker with 3 auth sources (static_secret / multi / allow_unauthenticated). k8s_sa and spiffe_jwt were originally drafted but moved to PR 1.5 alongside mTLS.
Merge algorithms: type_aware_merge (PROPOSE / RECONCILE / CONSTRAIN — winner-takes-all on SET, max-of-floors on AT_LEAST, min-of-ceilings on AT_MOST) and chain_augment (PREDICT — sequential refinement in priority-ascending order with optional final=true chain terminator).
LocalPlannerOrchestrator + 4-stage pipeline driver (parallel fan-out on PROPOSE/RECONCILE/CONSTRAIN, sequential chain on PREDICT) with REJECT short-circuit, HOLD_LAST cache invalidation, per-stage / whole-tick deadlines.
EngineProtocol dual-path seam in NativePlannerBase so the same tick loop drives either PSM or orchestrator without branching. PSM stays the default.
W1 (static-config external plugins): at startup, the planner reads scheduling.external_plugins[] from PlannerConfig and registers each entry via registry.register(...). Per-entry failures (auth reject, bad scheme, protocol mismatch) are logged but do not crash the planner.
W2 (gRPC self-register gateway): when scheduling.gateway.enabled=true, the planner exposes PluginRegistry over gRPC so external plugin processes can call Register themselves. Implementation delegates to the same in-process registry the static-config path uses, so behaviour is identical between the two registration models.
Configurable external-plugin auth: plugin_registration.auth.trusted_sources (in PlannerConfig) drives which auth schemes the registry accepts; an empty list falls back to AllowUnauthenticatedAuth with a DEV-ONLY startup WARN. Verified end-to-end in K8s smoke for both static_secret and unauthenticated modes (see Verification).
Observability surface: dynamo_planner_* metric family extensions for plugin lifecycle (evaluations / latency / circuit / held-over / cache-age), tick scheduling (duration / timeout / lag / skipped), and execute funnel.
Structured audit events: per-tick audit_events: list[str] on PipelineOutcome (populated by chain-augment / type-aware merge) for downstream observability tooling.
Reference external plugin example (examples/external_plugin/reference_runner.py): a standalone Python gRPC server implementing all 4 stage servicers. Doubles as the canonical starting point for users writing their own plugin AND as the cross-process fixture used in K8s smoke validation. See examples/external_plugin/README.md for local / K8s / fork-for-real-plugin walkthroughs.
Replay adapter migration to EngineProtocol so offline replay drives either path symmetrically.
Tests: ~360 plugin-framework unit tests (proto / transport / registry / scheduler / merge / clock / orchestrator) + integration tests (external plugin e2e over real gRPC sockets); 725 passed, 1 skipped under -m "pre_merge and planner and gpu_0" with the standard env exclusions; full suite 806 passed, 1 skipped, 0 failed after upgrading the dev-env dynamo.prometheus_names to the in-tree version.

Explicitly NOT in this PR

These come in the follow-up PR (plugin_frame_kang_pr2):

plugins/builtins/ — 5 reference builtin plugins: BuiltinLoadPredictor, BuiltinLoadPropose, BuiltinThroughputPropose, BuiltinReconcile, BuiltinBudgetConstrain.
Dual-path byte-equal parity suite (G3 fixture-driven tests comparing PSM and orchestrator outputs scenario by scenario).
The replay-mooncake byte-equal tests that exercise both paths through real trace ingestion.

The orchestrator path does still run on this PR — it just produces empty proposals when no builtins or external plugins are registered. Tests verify it boots cleanly, ticks, and returns scale_to=None with a valid next_tick in that case. External plugins (via W1 or W2) are sufficient on their own to drive real scaling decisions — verified in the integration suite and on a live K8s cluster.

Design decisions worth highlighting

PSM is the default, opt-in to orchestrator — scheduling.use_orchestrator defaults to false. Existing deployments see no behaviour change; switching the flag is the operator's choice. The follow-up PR will not flip this default either — that's a separate, calendar-gated cutover after canary observation.
Drop UDS as a client transport, keep inproc + grpc — the original design supported three client transport schemes (inproc:// / unix:// / grpc://). UDS as a client transport is removed. Rationale:
- Cross-process plugin = single wire model (grpc://), simplifies the deployment + security story.
- Same-Pod sidecar plugins use loopback grpc://127.0.0.1:N instead. Cost is small; benefit is one less transport to test / document / threat-model.
- The W2 registration gateway server can still bind a Unix socket via scheduling.gateway.listen="unix:/path" for in-Pod registration — that is a server bind option, not a client transport scheme.
No-op _PSMEngineAdapter — the abstraction layer the dual path requires adds one async wrapper around psm.on_tick(). The wrapper body is literally return self._psm.on_tick(scheduled_tick, tick_input) — no await points, no behaviour change. PSM-mode tick is byte-equivalent to pre-plugin-framework.
Append-only enum extensions — Prometheus Enum metrics (load_scaling_decision, throughput_scaling_decision) gain new state names for plugin-era reasons (override_by_user_plugin, held_over, reconcile_clamped_to_*, etc). They are append-only so existing scrapers parsing the old label set continue to work; new label values report 0 on the PSM path.
PSM-parity cadence in OrchestratorEngineAdapter — _MERGE_TOLERANCE_S = 0.5 matches PSM's wall-clock-drift padding (not a 1e-9 float-equality epsilon); throughput_adjustment_interval_seconds is read by its canonical Pydantic field name (the short form is only a validation_alias for YAML input, not exposed as an attribute in Pydantic v2). Both fixed late in review; regression tests live in tests/plugins/orchestrator/test_engine_adapter.py.

Verification

Test suite

$ pytest dynamo/planner/tests/ -m "pre_merge and planner and gpu_0" \
    --ignore=...test_diagnostics_recorder.py \
    --ignore=...test_advisory_mode.py
725 passed, 1 skipped, 37 deselected in 6.20s

Without the ignore list (after upgrading the dev-env dynamo.prometheus_names to match the in-tree v1.2.0 binding): 806 passed, 1 skipped, 0 failed.

The one skip is test_virtual_connector (pre-existing source comment).

Live K8s validation

Both paths exercised on a real K8s cluster, on the v16 planner image (= round 8 fix + reference-runner example included):

PSM path (use_orchestrator=false): planner boots clean (no AttributeError), every-10s tick emits HOLD with load_reason=no_fpm_data (correct in absence of FPM observations), no Traceback / ERROR, byte-equivalent to pre-plugin-framework PSM behaviour.
Orchestrator path with 4 external plugins (use_orchestrator=true): all four ext-* Pods running reference_runner.py register through W1 (external plugin bootstrap: accepted=4), planner invokes each over real cross-Pod gRPC every 10s, circuit breakers remain CLOSED, and the priority=2 RECONCILE plugin's OverrideResult correctly drives the final ScalingProposal (SCALE_UP | recommended: prefill=2 decode=3).
Configurable auth — both static_secret (planner log: auth_source=static_secret subject=ext-plugins) and AllowUnauthenticatedAuth (auth_source=allow_unauthenticated subject=anonymous, plus the DEV-ONLY startup WARN) scenarios were applied to the same 5-plugin DGD and verified by inspecting register-accept log lines.
End-to-end gRPC wire — an independent in-cluster spike (Python gRPC client → ext-user-propose.dynamo.svc.cluster.local:9099 → ProposePluginServicer.Propose) returned the args-configured OverrideResult(prefill=10 SET, decode=12 SET), confirming the integration is not a mock at any layer.

How to review

Start with plugins/types.py and proto/v1/plugin.proto for the message schema.
Then plugins/merge/type_aware.py and chain_augment.py — pure functions, well-documented invariants.
plugins/orchestrator/pipeline.py is the per-tick driver; run_pipeline is the single async function reviewers should trace top-to-bottom.
plugins/registry/server.py for the 4 RPC handlers + auth / protocol / dedup gates.
core/engine_protocol.py + core/base.py for the dual-path seam and how PSM stays untouched.
plugins/orchestrator/engine_adapter.py for the OrchestratorEngineAdapter lifecycle (PSM-parity cadence, regression-backed in tests/plugins/orchestrator/test_engine_adapter.py).
examples/external_plugin/reference_runner.py for the canonical external-plugin shape; examples/external_plugin/README.md for the fork-for-real-plugin walkthrough.

The two largest README-equivalents under plugins/transport/README.md and plugins/registry/README.md encode the decision rationale and threat model.

Follow-up PRs

PR	Topic
PR #2	5 builtin plugins + G3 dual-path parity + replay-mooncake byte-equal
PR 1.5	mTLS (cert-manager / Secret mount) for `GrpcTransport`; revisit `k8s_sa` + `spiffe_jwt` auth modules
Future	`_async_init` startup observability (background metric while waiting for workers — surfaced during K8s validation)
Future	Flip `use_orchestrator` default to `true` (calendar-gated, after canary observation)
Future	Delete legacy PSM (gated on the flag flip)

github-advanced-security

CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

Adds the base pipeline cadence field used by the orchestrator path in the scale_interval cadence model. Surface-only in this commit: the field is stored on the config but no engine_adapter / scheduler code consults it yet. Default 5.0s matches the existing ``load_adjustment_interval_seconds`` default, so configs without this field continue to behave identically. Future commits in this PR plug this into: - ``OrchestratorEngineAdapter._compute_next_scheduled_tick`` — pipeline ticks at ``scale_interval_seconds`` instead of the PSM-mirror load/throughput merge. - ``PluginRegistryServer.register_internal`` / ``register`` — phase- align ``RegisteredPlugin.registered_at`` to scale_interval boundaries so plugins with the same execution interval fire in the same tick irrespective of registration timestamp skew. Three new tests in ``test_scheduling_config.py`` lock the default, the validator (must be > 0), and accepting explicit operator overrides. Full planner suite: 833 passed, 1 skipped, 0 failed. Signed-off-by: Kang Zhang <kangz@nvidia.com>

Plugins registered milliseconds apart but with the same ``execution_interval_seconds`` must fire on the same pipeline tick. Without phase alignment, plugin A registered at T=0 and plugin B at T=0.003 (a typical async-bootstrap order skew) have throttles that drift permanently 3ms apart — and a future ``requires_produced_fields`` dependency from B onto A would silently deadlock at every fire (B's throttle is 3ms behind A's, so B sees A's predictions cleared from the previous tick rather than the current one). Fix: at registration time, snap ``RegisteredPlugin.registered_at`` to the nearest scale_interval boundary using ``floor(now / scale) * scale``. Both plugins anchor to T=0, both fire at T=180 / 360 / ... together. ``PluginRegistryServer`` gains a new optional kwarg ``scale_interval_seconds`` (default 0.0 = no alignment, preserves legacy / PSM-path behaviour and existing test fixtures). The orchestrator path's ``OrchestratorEngineAdapter`` constructs the server with ``scale_interval_seconds=config.scheduling .scale_interval_seconds``. Both the proto-driven ``register()`` and the in-process ``register_internal()`` paths route through the new ``_aligned_anchor()`` helper so alignment is uniform across all register call sites. New tests in ``test_phase_alignment.py`` (4): - ``test_aligned_anchor_snaps_to_floor_boundary`` — direct unit test of the arithmetic (0.0/2.5/4.999 → 0.0; 5.0/7.4 → 5.0; 180.6 → 180.0). - ``test_disabled_when_scale_interval_zero`` — legacy bypass path (PSM tests + back-compat) returns raw clock value. - ``test_two_plugins_same_interval_phase_aligned_after_register_skew`` — the end-to-end story: 3ms apart at registration, both snap to T=0, same first-fire moment. - ``test_alignment_preserved_when_registration_crosses_boundary`` — correctness at boundary: 4.9 → 0 vs 5.1 → 5 is the *correct* semantic (they really are 5s apart in tick terms). Full planner suite: 837 passed, 1 skipped, 0 failed. Signed-off-by: Kang Zhang <kangz@nvidia.com>

Adds the second gate to ``PluginScheduler.compute_active_set``: plugin only fires when (a) its execution_interval throttle is due AND (b) every dot-path in ``requires_produced_fields`` resolves non-None in the current ``PipelineContext``. This implements the declarative dependency mechanism designed for the scale_interval cadence model. A throughput-style plugin declaring ``requires_produced_fields=["predictions"]`` will skip its turn whenever the upstream predict stage didn't produce predictions on the same tick — preventing stale-input fires that would silently violate the plugin's own contract. Implementation: - ``compute_active_set`` signature gains optional ``ctx`` kwarg. Passed from the two pipeline call sites (the fan-out runner and the predict-stage chain) so requires checks see the live ctx state at each stage transition. Backward compat: ``ctx=None`` (legacy callers, PSM path, test fixtures) treats plugins with non-empty ``requires_produced_fields`` as "unsatisfied" → conservative skip. - ``_requires_missing_field()``: walks ``requires_produced_fields`` against ctx and returns the first failed dot-path (or None on satisfied). First-fail short-circuit so the metric label captures the leading dependency. - ``_ctx_get()``: small dot-path walker that returns None on any missing intermediate attribute (rather than raising). Used by the requires check. - New metric ``dynamo_planner_tick_requires_unsatisfied_total{plugin_id, missing_field}`` — increments per (plugin, first-missing-field) on every requires-gated skip. Critical for diagnosing dependency cascades when an upstream plugin opens its circuit-breaker, all dependent plugins downstream record their gated skip under the missing-field label. Twelve new tests in ``test_requires_produced_fields.py`` cover: - ``_ctx_get`` dot-path walker (top-level / nested / missing intermediate / missing leaf / None ctx). - ``compute_active_set`` gating: no requires fires unconditionally; satisfied requires fire; unsatisfied skip with no inherit; nested paths; multiple requires all-or-nothing; conservative skip when ctx=None and plugin declared requires. - Metric emission: missing_field label is the first failed path; metric counter increments per skip. Full planner suite: 849 passed, 1 skipped, 0 failed. Signed-off-by: Kang Zhang <kangz@nvidia.com>

Replaces the PSM-mirror dual-cadence model in ``OrchestratorEngineAdapter`` with a single base interval. Pipeline ticks fire every ``SchedulingConfig.scale_interval_seconds`` from the last tick moment. Per-plugin cadence decisions live entirely in ``PluginScheduler._is_due`` (per ``RegisteredPlugin .execution_interval_seconds``) — there is no more ``_next_load_s`` / ``_next_throughput_s`` reconciliation in the adapter. Behaviour change is observable but PSM-equivalent at the *scaling-decision* level (locked by the test_decision_level_parity test rewritten in the next commit, replacing the old ``test_g3_parity_via_adapter`` byte-identical guard). Tick counts and ``ScheduledTick`` field shapes legitimately differ from PSM under the new model — see /tmp/scale_interval_design.md §11 for the full parity trade-off. Engine-adapter delta: - Add ``self._scale_interval`` + ``self._last_tick_s`` fields, sourced from ``config.scheduling.scale_interval_seconds`` (default 5.0). - ``initial_tick`` now records ``_last_tick_s = start_s`` and returns a tick at ``start_s + scale_interval``. Legacy ``_next_load_s`` / ``_next_throughput_s`` still set for the compatibility shim window. - ``tick`` no longer gates ``_observe_fpm`` on the (now constantly True) ``run_load_scaling`` flag, and stops trying to advance two separate cadences from ``tick_input.now_s``. ``_last_tick_s`` is the single source of truth for cadence advancement. - ``_compute_next_scheduled_tick`` rewritten: * ``at_s = self._last_tick_s + self._scale_interval`` * ``need_traffic_metrics`` is True iff some currently-registered plugin lists ``observations.traffic`` in its ``needs`` AND is due at the next tick — recovers PSM's lazy-pull cost profile (6 Prometheus queries / 180s in mixed mode, vs 216 if we always pulled). * ``traffic_metrics_duration_s`` = max declared ``observation_window_seconds`` across due traffic consumers (falls back to ``scale_interval`` when all due consumers declared 0.0 = "freshness equal to base cadence"). * ``run_load_scaling`` / ``run_throughput_scaling`` flags kept on ScheduledTick for compatibility with PSM-path code and the diagnostics projection methods; under scale_interval both are always True (every tick is an opportunity for any plugin to fire, subject to its own throttle). - ``_MERGE_TOLERANCE_S`` constant removed. Plugin throttles are per-plugin independent; there is no second cadence stream to merge with the first. Test surface delta in this commit: - ``test_merge_tolerance_matches_psm_500ms_window`` replaced with ``test_pipeline_fires_at_scale_interval_cadence``. The merge- tolerance concept doesn't apply to the new model — the test that locked it as a *contract* would assert behaviour the adapter no longer exhibits. The new test locks the contract that DOES apply: ``next_tick.at_s == last_tick + scale_interval``, exactly. Full planner suite: 849 passed, 1 skipped, 0 failed. Signed-off-by: Kang Zhang <kangz@nvidia.com>

Locks two contracts the scale_interval rewrite introduced but didn't have direct coverage for: - ``test_scale_interval_advances_from_actual_tick_now`` — pipeline cadence advances from ``tick_input.now_s`` (the actual fire moment) rather than from a pre-computed schedule. Same drift policy PSM uses; locking it as a contract makes scale_interval cadence accumulation match PSM-equivalent behaviour under wall-clock jitter. - ``test_lazy_traffic_pull_skips_prometheus_when_no_plugin_needs_traffic`` — when no registered plugin consumes ``observations.traffic``, ``ScheduledTick.need_traffic_metrics`` stays False. This is the core Prometheus-load reduction promise of the scale_interval + per-plugin ``needs``-declaration model (recovers PSM's "skip the query on load-only ticks" behaviour without per-tick cadence type knowledge inside the adapter). Also drops a stale code comment that referenced a ``test_g3_parity_via_adapter`` test which never landed in PR ai-dynamo#10124 — PSM-vs-orchestrator parity is asserted at decision level by the existing ``test_tick_async_wraps_psm_on_tick_identically`` rather than a dedicated orchestrator-side test. Full planner suite: 851 passed, 1 skipped, 0 failed. Signed-off-by: Kang Zhang <kangz@nvidia.com>

…rmat + ruff) Mechanical formatting changes applied by the pre-commit hooks the ``pre-commit`` CI check enforces. No behaviour change. Also: fix copyright header on ``plugin.proto`` to include "All rights reserved." per the regex in ``.github/workflows/copyright-check.ps1`` (``copyright-checks`` CI was failing on this single mismatch). Full planner suite: 851 passed, 1 skipped, 0 failed. Signed-off-by: Kang Zhang <kangz@nvidia.com>

- Lazy-import grpc/protobuf transports so PSM-only deployments don't pull generated proto stubs at import time - Catch broken predict-plugin exceptions in pipeline so one bad plugin no longer fails the whole tick (records circuit breaker failure + emits error metric instead) - Wire FPM observations through msgpack into PipelineContext - Strip TYPE_CHECKING wrappers from imports CodeQL was flagging as unused (`TrafficObservation`, `PluginFrameworkMetrics`, `WorkerCapabilities`) — annotations now reference the symbols at module top instead of through `Optional["..."]` strings - Drop redundant `_result = await task; del _result` workaround in cancelled-sleeper test - Various small fixups identified by review: empty-`except KeyError` rationale, scheduler `_requires_missing_field` static helper, first-fire anchor on `registered_at` Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…rides CI runners that pip-install ai_dynamo without running grpc_tools.protoc were importing the placeholder shim in plugin_pb2.py and hitting "RuntimeError: stub missing" on every test that touched the proto messages. Three coupled fixes: - ship the generated `plugin_pb2.py`, `plugin_pb2_grpc.py`, `plugin_pb2.pyi` in git (negated against the repo-wide `*_pb2.py` gitignore). Test/build images no longer need grpcio-tools just to import the module. - `requirements.planner.txt`: add `grpcio>=1.63.0` and `protobuf>=5.29.5,<7.0.0` — the planner.Dockerfile doesn't install `requirements.common.txt`, so the orchestrator gRPC transport had no runtime to bind against. - `pyproject.toml`: add `grpc` / `grpc.*` / `google.protobuf` / `google.protobuf.*` to the mypy `ignore_missing_imports` block. These libs ship without `py.typed` markers; mypy was emitting `[import-untyped]` on every transport file. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three concrete fixes; one consistency sweep. CRITICAL: - proto/v1/README.md + proto/v1/__init__.py docstring + "Adding a new stage" workflow all said the generated stubs are gitignored and regenerated at install time. After the build infra fix they're shipped in git — flip the description to match. - examples/external_plugin/README.md + reference_runner.py docstring referenced ``tests/manual/ext-4stage.yaml`` (and a "K8s smoke fixtures" section that doesn't exist) as the K8s deployment shape. Replace with a description of the intended Pod-per-stage layout and note the actual ready-to-apply fixture is deferred to a follow-up. The shipped integration test ``test_external_plugin_e2e.py`` runs the same binary as a subprocess; reference that instead. - tests/manual/README.md + disagg_8b_planner_orchestrator.yaml + engine_adapter.py:532 all pointed at ``tests/integration/test_dual_path_parity.py`` as the dual-path parity lock. That file was never shipped; the actual decision-level + cadence + tick-merge parity is locked by ``tests/plugins/orchestrator/test_engine_adapter.py``. Same fix for the stale ``test_replica_calculation.py`` reference — point at ``test_load_based_scaling.py`` + ``test_state_machine.py`` which contain the actual replica-math tests. NIT sweep: - Drop "DEP-XXXX" placeholder from proto + transport README headers - Drop the forward-looking ``executor_max_workers <= 8`` reference in transport README — no such field exists in any shipped PR Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three findings from independent review on PR ai-dynamo#10124. 1. engine_adapter._compute_next_scheduled_tick mixed wall-clock and monotonic time when deciding whether to do a lazy Prometheus pull. ``self._last_tick_s + scale_interval`` is wall-epoch (matches ``tick_input.now_s`` and ``ScheduledTick.at_s``), but ``PluginScheduler._is_due`` compares against ``RegisteredPlugin.last_call_at`` which is set by the pipeline via ``self._clock.monotonic()``. In production ``WallClock`` deployments the projection is ~1.7e9 vs ~1e3 — every traffic-consuming plugin reads as always due, and the lazy-pull optimization silently degenerates to "pull every tick". Replay was unaffected because ``VirtualClock`` is synced to ``tick_input.now_s`` at the top of ``tick()``. Fix: parallel field ``_last_tick_monotonic`` updated alongside ``_last_tick_s`` at every assignment site (init / prime / tick). ``_is_due`` is now called with the monotonic projection. Wall epoch is preserved unchanged for ``ScheduledTick.at_s``. Regression test pins the precise scenario: a wall-vs-monotonic divergent clock with a plugin whose ``last_call_at`` is in monotonic domain — pre-fix the test asserts FAILS with "need_traffic_metrics should be False, was True". 2. authenticated_heartbeat / authenticated_unregister leaked plugin existence to gateway callers. Returning ``(False, None)`` for unknown-plugin vs ``(False, "permission_denied")`` for wrong-subject mapped to distinct gRPC status codes in the gateway (200 OK ``HeartbeatResponse(ok=false)`` vs PERMISSION_DENIED) — any valid-token holder could enumerate registered plugin_ids by probing. Same class of oracle as the previously-fixed StaticSecretAuth token-prefix leak, just shifted one layer up. Fix: collapse "unknown plugin" and "wrong subject" into a single ``(False, "permission_denied")`` response in both handlers. The gateway already maps that uniformly to PERMISSION_DENIED. 3. ``register_internal`` plugins have ``auth_subject == ""`` by design — they are not supposed to be reachable via the gateway. The existing equality check ``plugin.auth_subject != identity.subject`` relied on no AuthValidator ever returning ``subject=""``. This is currently true (``StaticSecretAuth.__init__`` rejects empty; ``AllowUnauthenticatedAuth`` returns ``"anonymous"``) but data- shape is the only thing keeping it true. Fix: explicit ``not plugin.auth_subject`` guard in the same collapsed branch — any future auth backend that returns an empty subject still cannot operate on in-process plugins through the gateway. Test updates: - test_authenticated_heartbeat_unknown_plugin_returns_permission_denied - test_authenticated_unregister_unknown_plugin_returns_permission_denied (renamed from ``_returns_false_no_reject`` — old expectation was the bug) - test_authenticated_heartbeat_in_process_plugin_returns_permission_denied - test_authenticated_unregister_in_process_plugin_not_reachable_via_gateway - test_lazy_traffic_due_check_uses_monotonic_not_wall_epoch 346 plugin tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…nfig paths Independent review caught that ``RegisterRequest.requires_produced_fields`` and ``observation_window_seconds`` (the scale_interval cadence contract fields added in commits 13-18) were only reachable via the gRPC gateway self-register path. The two ConfigMap-driven intake paths — static external plugin list and in-process plugin spec — both used Pydantic ``extra="forbid"`` and silently dropped these fields. Net result: - ``throughput_propose`` declaring ``requires_produced_fields=["predictions"]`` could not be configured via ConfigMap, only via gRPC self-register. - ``observation_window_seconds=180`` (e.g. wanting a longer Prometheus aggregation window for stable averaging) was similarly unreachable. Five wiring fixes (top-down following the call chain): 1. ``LocalPlannerOrchestrator.register_internal``: added the two new kwargs, passed through to ``PluginRegistryServer.register_internal``. Until now the orchestrator-level facade silently dropped them. 2. ``ExternalPluginEntry`` (planner_config.py): added Pydantic ``Field`` declarations for both new fields, with descriptions pointing at the cadence contract. 3. ``register_external_from_config`` (orchestrator.py): threaded both fields into the ``RegisterRequest`` construction. Without this, #2 alone wouldn't reach the registry. 4. ``InProcessPluginSpec`` (registry/config.py): added ``needs`` (also previously missing!) + the two new cadence fields. 5. ``load_in_process_plugins`` (in_process_loader.py): passed ``needs`` + ``requires_produced_fields`` + ``observation_window_seconds`` into ``orchestrator.register_internal``. Tests: - ``test_loader_passes_scale_interval_fields_to_registered_plugin``: in-process loader passthrough end-to-end (spec → plugin). - ``test_bootstrap_passes_scale_interval_fields_through``: static external bootstrap passthrough end-to-end. 348 plugin tests pass (+2 new). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…th prefix ``needs`` is defined in plugin.proto:55 as ``dot-paths into PipelineContext`` — a plugin declaring ``needs=["observations.traffic. num_req"]`` is signaling "I only consume num_req, you may trim the rest from ctx" (the proto comment says ``the orchestrator MAY trim the context to those fields to save wire / serialisation cost``). But ``engine_adapter._compute_next_scheduled_tick`` used ``"observations.traffic" in p.needs`` — a list-membership exact-string match. Sub-path declarations slipped past this guard, so: needs=["observations.traffic.num_req"] → "observations.traffic" in needs == False → traffic_consumers_due excludes this plugin → ScheduledTick.need_traffic_metrics = False → orchestrator skips Prometheus query → ctx.observations.traffic = None at tick time → plugin gets None despite declaring a dot-path INTO that subtree Same dot-path field elsewhere in this PR (``requires_produced_fields``) already uses proper getattr-chain walking via ``_ctx_get``; this is a consistency fix. Fix: match the parent path ``observations.traffic`` exactly OR any sub-path ``observations.traffic.<field>``. Both require the observation to be present. The trailing ``.`` in the prefix-match is load-bearing — it stops false-positive matches on a sibling field like ``observations.traffic_legacy``. Test ``test_lazy_traffic_pull_matches_dot_path_sub_paths_of_observations_traffic`` locks all four cases: 1. exact ``"observations.traffic"`` → pull triggered (existing path) 2. sub-path ``"observations.traffic.num_req"`` → pull triggered (was broken) 3. sibling ``"observations.traffic_legacy"`` → pull NOT triggered (prefix guard) 4. unrelated ``"observations.fpm" / "predictions"`` → pull NOT triggered Verified by reverting the fix: case 2 fails with the expected AssertionError; restoring the fix makes all 4 cases pass. 349 plugin tests pass (+1 new). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PSM internal types ``TrafficObservation.kv_hit_rate`` and ``TickDiagnostics.predicted_kv_hit_rate`` have always carried KV cache hit rate as part of the throughput-scaling input / output. But the wire proto and Pydantic mirrors that external plugins see did NOT include these fields: TrafficObservation (PSM) ── kv_hit_rate ✅ TickDiagnostics (PSM) ── predicted_kv_hit_rate ✅ proto TrafficMetrics (wire) ── kv_hit_rate ❌ proto PredictionData (wire) ── predicted_kv_hit_rate ❌ pyd TrafficMetrics ── kv_hit_rate ❌ pyd PredictionData ── predicted_kv_hit_rate ❌ So an external ``throughput_propose`` plugin could not reproduce PSM throughput behaviour — neither read ``ctx.observations.traffic. kv_hit_rate`` (field absent on the proto-derived Pydantic model) nor emit ``predicted_kv_hit_rate`` (same). A genuine PSM-parity gap on the plugin observation / prediction API. Schema changes (proto3, additive — safe per the schema-evolution policy in proto/v1/README.md): - ``TrafficMetrics``: add ``optional float kv_hit_rate = 5`` - ``PredictionData``: add ``optional float predicted_kv_hit_rate = 5`` Both are ``optional`` to preserve field presence — distinguishes "no datapoint" (unset) from "0.0 = all-cold cache" (set to 0.0), and for the predicted field, "no opinion" from "I assert 0.0" under chain_augment first-writer-wins partial-merge. Wire updates: - Pydantic mirror in ``plugins/types.py``: ``kv_hit_rate: Optional[float]`` and ``predicted_kv_hit_rate: Optional[float]`` matching the existing predicted_* fields' Optional pattern. - Regenerated ``plugin_pb2.py``, ``plugin_pb2_grpc.py``, ``plugin_pb2.pyi`` via the documented protoc + SPDX-prepend recipe. Adapter wiring: - ``OrchestratorEngineAdapter._tick_input_to_context`` propagates ``TickInput.traffic.kv_hit_rate`` into ``ctx.observations.traffic .kv_hit_rate`` so external plugins actually see it. - The predict-diagnostics block in ``tick()`` copies ``prediction.predicted_kv_hit_rate`` onto ``TickDiagnostics`` — symmetric with the existing predicted_num_req / _isl / _osl forwarding. Test: ``test_kv_hit_rate_round_trip_traffic_and_prediction`` locks the field-presence semantic in both directions on both messages; existing ``test_prediction_data_optional_unset_vs_zero`` extended to assert the new field is also unset by default. 350 plugin tests pass (+1 new). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…urface Reviewer P1 (PR ai-dynamo#10124): ``_project_scale_to`` collapses the merged proposal's per-target ``component_name`` to a single replicas-per-type scalar. The collapse itself is fine for a single-planner single-pool runtime — the problem is that proto/Pydantic still **advertise** ``component_name`` as if multi-pool addressing worked, while the runtime silently drops it. The single-planner runtime in this PR has exactly one ``WorkerInfo`` per ``sub_component_type``, populated from config at startup; there is no path by which a plugin-emitted ``component_name`` could ever address an alternate pool. Multi-pool execution is hierarchical-planner territory (separate planners per pool, with a router on top) and lands in a follow-up PR. Rather than paper over the mismatch with a "warn + drop" guardrail, remove the API surface that has no runtime consumer: - proto ``ComponentTarget``: ``reserved 2;`` (was ``optional string component_name``) — schema-evolution policy keeps the tag reserved so the hierarchical-planner PR can re-add at the same tag (or pick a new one) without breaking older recorded traces. Two proto comments referencing the field are likewise updated. - ``plugin_pb2.py`` / ``plugin_pb2.pyi`` / ``plugin_pb2_grpc.py``: regenerated via the standard protoc + SPDX re-prepend recipe from ``proto/v1/README.md``. - Pydantic ``ComponentTarget``: ``component_name`` field removed; docstring notes the deferred multi-pool semantic. - ``ComponentKey``: ``component_name`` field removed (kept as a single- field dataclass rather than collapsing to ``str`` so the hierarchical-planner PR can re-add the per-pool axis without touching every call site). - ``type_aware_merge`` / ``pipeline.py``: bucket key construction simplified to ``ComponentKey(sub_component_type=...)``; clamp counters drop the ``component_name`` Prometheus label. - ``planner_metrics.py``: ``reconcile_clamped_total`` and ``constrain_capped_total`` label sets drop ``component_name`` (was always ``""`` in single-planner runtime anyway — the label was pure metric-cardinality bloat). - ``core/types.py``: ``TickDiagnostics.plugin_overrides`` / ``reconcile_reasons`` doc-comment updated. Tests: - ``test_type_aware_basic.py``: drops ``test_component_name_creates_ separate_buckets`` (the multi-pool independence case it was validating is no longer in the data model). ``POOL_A`` / ``POOL_B`` fixtures + ``_ct`` factory ``component_name`` parameter removed. - ``test_type_aware_clamp_tracking.py``: ``"worker_a"`` / ``"worker_b"`` literals removed from ``PREFILL`` / ``DECODE`` fixtures and the inline ``ComponentTarget`` construction. - ``test_type_aware_worked_examples.py``: drops the ``hierarchical_pools`` row (8 cases now, was 9); tripwire updated; ``CT()`` / ``key()`` helpers no longer take ``component_name``. - ``test_type_aware_constrain.py`` / ``test_type_aware_short_circuit.py``: ``_ct(...component_name=None)`` default removed (default was the no-name case anyway, no behavior change). - ``test_pipeline_metrics.py``: ``component_name="worker"`` dropped everywhere — was a phantom label, never functionally needed. - ``test_round_trip.py``: drops ``test_component_target_with_pool_name`` + the ``HasField("component_name")`` assertion (proto field no longer exists to query). - ``test_transport_contract.py``: drops the ``multi_pool`` parametrize fixture (used ``ComponentTarget.component_name`` on wire). - ``test_plugin_framework_metrics.py``: drops ``component_name=...`` from the ``Counter.labels(...)`` calls so the label set matches the declared metric. Forward compat (hierarchical-planner PR re-adds): - Pick proto tag 2 (reserved here) or a fresh tag; proto3 ``optional`` semantic keeps it non-breaking for older clients. - Re-add the ``component_name`` axis to ``ComponentKey`` / ``ComponentTarget`` / ``TargetReplica`` projection / Prometheus labels in lock-step. 819 -> 813 planner tests pass (1 pre-existing skip; 6 deleted tests matched the deleted forward-compat surface). No K8s smoke needed — pure wire-format + internal-bucket layer; execution path is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… FPM docs Two P2 review threads, both proto-touching, bundled into one commit. **P2-1: WorkerState scaling-in-progress flags missing** ``WorkerCounts`` (core/types.py:62-63) already carries ``prefill_scaling_in_progress`` / ``decode_scaling_in_progress`` and PSM's state_machine.py:336-345 uses them to suppress further scale-up while a previous request is mid-flight and to explain "held" decisions in audit logs. ``WorkerState`` (proto + Pydantic mirror) only exposed ready/expected counts, so an external load-scaling plugin replicating PSM behaviour could not reach the same decisions through the public ``PipelineContext.observations.workers`` channel. Add two ``optional bool`` fields at proto tags 5 / 6: ``prefill_scaling_in_progress`` and ``decode_scaling_in_progress``. ``optional`` (proto3 presence) lets a plugin distinguish "connector did not report this tick" from an explicit ``false`` value — same contract as the existing optional ints. Pydantic mirror + Pydantic- side round-trip test (unset / mixed True+False) added. Threaded through ``OrchestratorEngineAdapter._tick_input_to_context``; PSM-side WorkerCounts already populates these from ``base.py:803-804``, so this is wire-level exposure of an already- computed signal. **P2-6: FPM API docstrings stale** The adapter wires FPM observations into ``PipelineContext.observations.fpm`` (engine_adapter.py:797 + ``_encode_fpm`` helper) using msgspec/msgpack, but three docstrings still claim FPM is "reserved for a follow-up PR" / "currently unpopulated": - proto ``FpmData`` block comment - Pydantic ``FpmData`` docstring - engine_adapter module docstring (responsibility 2) Refresh all three to describe the actual contract: per-engine map keyed by ``"<worker_id>/<dp_rank>"``, msgpack-encoded ``ForwardPassMetrics``, ``ObservationData.fpm`` absent (not empty-but-set) when no engines reported. Plugin authors reading the proto/Pydantic source now see the real shape they can consume. 814 planner tests pass (was 813; +1 for the new round-trip test). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…protocol_versions on orchestrator path Two registry-side P2 review threads. **P2-2: observation_window_seconds validation documented but not enforced** ``plugin.proto:107`` promises: Constraint: must be 0 OR a positive multiple of ``SchedulingConfig.scale_interval_seconds`` so windows align to tick boundaries. Validator rejects otherwise. The server stored ``req.observation_window_seconds`` verbatim without checking — a plugin author misconfiguring the field saw their plugin accepted and silently drove misaligned Prometheus ``[Ns]`` queries (window spanning tick boundaries) that report moving-target values. Add ``_check_observation_window()`` helper (module-level so it's unit- testable in isolation) and a step 2.5 in ``register()`` between ``protocol_version`` and the duplicate check. Accepts: - ``window == 0.0`` (per-tick freshness, the safe default) - ``window == k * scale_interval`` for any positive integer k (1e-6 tolerance absorbs float round-trip noise from YAML → Pydantic → proto float32 → wire) Rejects anything else with ``observation_window_misaligned``, mirroring the existing ``protocol_version_unsupported`` reject style. When ``scale_interval_seconds == 0.0`` (PSM-path constructions, no alignment to enforce), the check is a no-op — accept any value. 5 new tests cover (zero / multiple / non-multiple / negative / unverifiable-when-no-scale-interval). **P2-3: orchestrator path ignores configured protocol_version_min/max** ``planner_config.PluginRegistrationConfig`` exposes ``protocol_version_min`` / ``protocol_version_max`` and ``build_registry_from_config`` passes them as ``protocol_versions=(min, max)`` to ``PluginRegistryServer``. The orchestrator path (``OrchestratorEngineAdapter.__init__`` constructing the server directly) was passing only ``scale_interval_seconds`` and inheriting the server's default ``("1.0", "1.0")`` — so config knob was dead on the live orchestrator path. Pass ``protocol_versions=(config.plugin_registration.protocol_version_min, config.plugin_registration.protocol_version_max)`` at construction time. Test introspects the constructed server's ``_protocol_min`` / ``_protocol_max`` to confirm a non-default range (``"1.0"`` → ``"1.5"``) propagates through. 820 planner tests pass (was 814; +6 for the new tests). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…otstrap fan-out Reviewer P2-4: ``OrchestratorEngineAdapter.bootstrap_plugins`` dispatched ``_orchestrator.bootstrap_plugins(historical_traffic=...)`` **first** and only then ``_wire_external_plugins_from_config()``. Static external plugins declared in ``scheduling.external_plugins`` (a supported W1 path; see ``examples/external_plugin/``) registered into an already-bootstrapped registry — they never received the warm pass or the Bootstrap RPC, so external implementations of e.g. load-predictor that need historical traffic to fit a regression saw an empty seed. Swap the order: ``wire externals → bootstrap → gateway``. Now ``_orchestrator.bootstrap_plugins`` iterates the registry with the static externals already present and the Bootstrap fan-out covers them too. Gateway still opens last so dynamically registered plugins arriving via the network can't race the bootstrap fan-out — those plugins intentionally don't get ``historical_traffic`` (Bootstrap has already moved past them; that's the documented contract for dynamic registration). Regression test pins the new ordering via monkey-patched call records on the three adapter methods. 821 planner tests pass (was 820; +1 for the new ordering test). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…Diagnostics Reviewer P2-5: ``PipelineOutcome`` carries three execute-time fields (``execute_action``, ``short_circuit_reason``, ``audit_events``) that the engine adapter dropped — ``PlannerEffects.diagnostics`` only projected prediction / load / throughput fields. Operators could see the action via Prometheus (``tick_skip_reasons_total{reason=...}`` etc.) but in-process consumers — the replay adapter, the diagnostics recorder writing HTML reports — could not distinguish ``apply`` from ``skip_short_circuit`` / ``skip_no_targets`` / ``skip_tick_timeout`` without scraping metrics. Add three fields to ``TickDiagnostics``: - ``execute_action: Optional[str]`` — mirrors ``PipelineOutcome.execute_action``; ``None`` on the PSM path. - ``short_circuit_reason: str`` — mirrors the pipeline counterpart; empty string when not short-circuited. - ``audit_events: list[str]`` — mirrors ``PipelineOutcome.audit_events`` (chain-augment warnings, CONSTRAIN SET drops, etc.). Empty list on the PSM path. Engine adapter copies all three from the pipeline outcome after the prediction projection. PSM path leaves them at defaults (which is the documented "not available on this path" semantic per the comment block already on ``TickDiagnostics``). Regression test mocks ``orchestrator.tick`` to return a known ``skip_short_circuit`` outcome and asserts all three fields surface on the resulting ``PlannerEffects.diagnostics``. 822 planner tests pass (was 821; +1 for the diagnostics propagation test). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…existent add_observation → crash) P1 from independent review. ``OrchestratorEngineAdapter._observe_fpm`` and ``ReplayPlannerAdapter._feed_extra_fpm_to_regression`` called ``reg.add_observation(fpm)`` (singular) on the regression models. Those slots hold ``PlannerEnginePerfModel`` (state_machine.py:75/82/88 build them; the orchestrator installs the same type via install_regressions), which has **no** singular ``add_observation`` — it exposes only ``add_observations(dict[(worker_id, dp_rank) -> FPM])``. The singular call raises ``AttributeError`` and crashes the tick. Confirmed three ways: runtime ``hasattr(PlannerEnginePerfModel, 'add_observation') is False``; the existing test ``test_rust_perf_adapter.py:275`` already asserts that method does not exist; and PSM's own ``_observe_fpm`` (state_machine.py:357/362/365) correctly uses the plural ``add_observations``. The singular ``add_observation`` belongs to a *different*, production-unused class family (``_BaseRegressionModel`` → Agg/Prefill/DecodeRegressionModel, instantiated only in test_load_based_scaling.py) — same name minus an 's', which is why it looked correct statically. Reachability: - Default replay path (use_orchestrator=False, SLA mode): crashes on any tick carrying >1 FPM snapshot per worker. - Orchestrator runtime (production K8s, SLA mode): crashes on the first load tick once a regression is installed and FPM flows. Both dodged by easy-mode (the ``not is_easy`` guard) and by the inner ``if obs.decode:`` guard when no FPM subscriber delivered data — which is exactly why neither the 822-test suite nor the K8s smoke (run without a live FPM stream into an SLA regression) caught it. Fix: - engine_adapter._observe_fpm: hand ``obs.prefill`` / ``obs.decode`` (already the right dict) straight to ``add_observations`` — line-for-line PSM mirror. - replay_adapter._feed_extra_fpm_to_regression (3 sites): wrap each non-excluded snapshot as ``add_observations({(worker_id, dp_rank): fpm})``, preserving the per-snapshot feed semantics. Tests (close the combined unit+smoke blind spot — no test previously exercised SLA + non-empty FpmObservations + installed PlannerEnginePerfModel): - test_engine_adapter: agg + disagg ``_observe_fpm`` against a real PSM-built regression, asserting no AttributeError. - test_replay_adapter_fpm (new): ``_feed_extra_fpm_to_regression`` against a real PSM regression with 2 snapshots/worker so the non-excluded one feeds. Both verified to fail on the pre-fix code and pass after. 825 planner tests pass (was 822; +3). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…review ai-dynamo#9) ``type_aware_merge`` documents that keys present only in ``baseline`` still appear in the output so downstream stages see a complete proposal. The bucket-merge path honors this; the ``final=True`` path returned the winning plugin's targets verbatim and never consulted ``baseline``, so a final plugin emitting only ``SET prefill=N`` dropped ``decode`` from the proposal. In the real pipeline this is masked (``final`` is forced False at CONSTRAIN, and ``_proposal_to_baseline`` rebuilds each next stage's baseline from the full worker-count fallback), so no production scaling decision changed. But a RECONCILE plugin reading ``ctx.proposal`` saw an incomplete proposal, and the divergence was a latent foot-gun if the stage wiring ever changed. Fix: after assembling the final winner's targets (and dropping SET under set_allowed=False), fold in any baseline key the winner did not mention, matching the bucket path. Updated the CONSTRAIN-final test (test_final_in_constrain_drops_set_but_final_still_applied): a dropped SET-prefill now reappears via baseline passthrough at the current value, which is exactly what the bucket path already does for a dropped SET. New test: test_final_path_passes_baseline_only_keys_through. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…oder reuse, doc fixes Low-risk cleanup batch from the independent review (no decision-path change): - ai-dynamo#4 chain_augment: add ``predicted_kv_hit_rate`` to ``_PREDICTION_FIELDS`` so it participates in first-writer-wins partial merge like the other three predicted_* fields (was silently dropped in any 2+ plugin PREDICT chain, contradicting the proto/Pydantic contract). +2 chain_augment tests. - #10 engine_adapter: add ``scale_down_capped_by_throughput`` to ``_aggregate_disagg_load_reason`` priority (PSM disagg emits it; placed between scale_up and scale_down to mirror PSM's _PRIORITY). - ai-dynamo#11 dead code: drop ``contributing_plugin_ids`` (built, never read) in pipeline._run_fanout_stage; drop ``_set_enabled`` + ``_plugin_ids`` (no caller in PR #1; would KeyError if reached). - ai-dynamo#18 _encode_fpm: use the canonical ``dynamo.common.forward_pass_metrics.encode`` (shared module-level encoder) instead of allocating a fresh ``msgspec.msgpack.Encoder`` per tick and re-implementing the encoding. Byte-identical wire format; keeps FPM serialization in lock-step with the rest of dynamo. - ai-dynamo#17 transport ABC docstring: timeout is enforced by the transport (``call()`` wraps ``asyncio.wait_for``), not the orchestrator — the pipeline uses a bare gather to avoid double-counting the deadline. - ai-dynamo#20 scheduler docstring: note the heartbeat-eviction monitor is not wired in this PR (last_heartbeat_at is recorded but unread; monitor is follow-up). - ai-dynamo#21 transport contract test: 7 inputs (not 8) → 14 cases (multi_pool fixture was removed with component_name; comments were stale). - ai-dynamo#22 metrics test: remove the dead no-op ``pass`` loop in _sample_value. 828 planner tests pass (was 825; +3 chain-augment / merge tests). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…auge reset (review ai-dynamo#12/ai-dynamo#13/ai-dynamo#19) Hot-path quality fixes — none change a scaling decision. ai-dynamo#12 pipeline durations use Clock.monotonic(), not Clock.now() Clock's contract reserves now() for wall-clock timestamps and monotonic() for duration measurement. Six duration sites (predict latency, fan-out call latency, whole-tick duration) used now(); under WallClock a backward NTP step mid-tick distorted the latency/duration histograms. Switched all six to monotonic(). VirtualClock.monotonic() is synced to trace time in replay, so replay/test behavior is unchanged. ai-dynamo#13 ProposeResult derives result_kind + enforces the oneof ProposeResult carries the same accept/override/reject oneof as the stage responses but, unlike them, had no model_post_init — so building it the natural way (override=...) left result_kind='' and the proto round-trip came back 'override', breaking round-trip equality; a two-payload oneof violation also went unchecked. Extracted the derive+validate logic into a shared _derive_result_kind() helper used by both _StageOneofResponse and ProposeResult. +round-trip test (derive + oneof-violation reject). ai-dynamo#19 override_active gauge reset covers errored plugins _emit_override_active reset the gauge only for plugins in plugin_results; a plugin whose call raised is absent from that list, so a 1 it set on a prior tick lingered. Now reset every ATTEMPTED plugin id (triggered + inherited) before setting the contributors. 828 planner tests pass (+1 round-trip test). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

/ai-dynamo#8/ai-dynamo#15/ai-dynamo#16) Test-only additions for the seams the review flagged as untested. ai-dynamo#7 _project_scale_to with a real apply outcome (4 cases): both components changed → full ScalingDecision; both equal current → None (PSM-equivalent no-change); single-component proposal → other count stays None; non-apply execute_action → None. Previously every adapter.tick test hit only the None/empty path, so a regression in the projection / no-change detection would have shipped silently. ai-dynamo#8 _tick_input_to_context + FPM encoding: build a TickInput with traffic (incl kv_hit_rate), worker counts (incl scaling-in-progress flags), and a real ForwardPassMetrics; assert the PipelineContext.observations mapping and that the FPM bytes decode back (key format "<worker_id>/<dp_rank>", canonical encoder). This is the ingress glue where the add_observations P1 + the projection live. ai-dynamo#15 registry mutation during an in-flight tick: suspend a PROPOSE plugin mid-gather (asyncio.Event), register a new plugin while suspended, release, and assert the late plugin did NOT join the in-flight stage (pre-tick snapshot) and the tick completed cleanly — then a fresh tick picks it up. Exercises the no-locks invariant that scheduler.py/server.py document but no test covered. ai-dynamo#16 test_tick_diagnostics_extended scope note: clarify in the module docstring that plugin_overrides / reconcile_reasons / held_over_plugins have no production populator in this PR; these tests lock the dataclass contract (defaults / no shared-mutable aliasing), not live behavior. 835 planner tests pass (+6). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ay (review ai-dynamo#6) The registration gateway receives external plugins' shared-secret ``RegisterRequest.auth_token``. ``start_gateway_server`` bound ``add_insecure_port`` unconditionally when no TLS creds were supplied, and the production caller never supplies creds and had no config to — so pointing ``gateway.listen`` at a TCP ``host:port`` silently stood up a plaintext gRPC server that received every plugin's token in cleartext, with only an INFO log. This is asymmetric with the OUTBOUND transport, which fails closed unless ``transport.allow_insecure_grpc=True``. Make the inbound side symmetric: - Add ``GatewayConfig.allow_insecure`` (default False). - ``start_gateway_server`` gains ``allow_insecure`` and, in the no-credentials branch, refuses to bind a non-``unix:`` (TCP) listen unless ``allow_insecure`` is set — raising a clear RuntimeError before any bind. ``unix:`` (Pod-local, trust-boundary) listens are always allowed. When a plaintext TCP bind IS opted into, it logs a WARNING (not INFO) naming the token-exposure risk. - ``_maybe_start_gateway`` passes ``gw_cfg.allow_insecure`` through. Tests: TCP + allow_insecure=False → RuntimeError "refusing to bind plaintext"; TCP + allow_insecure=True → binds (stubbed server). 837 planner tests pass (+2). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… gate The new fail-closed guard (commit be38e21) raises "refusing to bind plaintext" for a TCP listen with allow_insecure=False, which fired before the port==0 bind-failure path that test_start_gateway_server_raises_when_ port_zero exercises (it uses listen="0.0.0.0:1"). Pass allow_insecure=True in that test so it reaches the intended bind-failure path; the plaintext-gate behavior is covered by its own dedicated tests. 837 planner tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…eview ai-dynamo#5) TrafficMetrics.{duration_s,num_req,isl,osl,kv_hit_rate} and PredictionData.{predicted_num_req,predicted_isl,predicted_osl, predicted_kv_hit_rate} were proto ``float`` (IEEE-754 32-bit), but their source of truth (core/types.py + the Pydantic mirror) is Python float64. Every out-of-process plugin received these values truncated to float32 (and re-truncated on the way back), so rate-style values drifted ~1e-4 and the gRPC transport disagreed bit-for-bit with the in-process transport — which is presented as interchangeable. The round-trip tests passed only because every fixture float was hand-picked float32-exact. Change the nine fields to ``double`` / ``optional double``. Safe now: the v1 contract is pre-ship (no external plugin has shipped against it), and changing the wire type before release avoids a breaking change later. Regenerated *_pb2 stubs (+ SPDX re-prepend). The Pydantic mirror already uses float64, so no mirror change. New test asserts EXACT (not approx) round-trip for non-float32-exact values, so a regression to ``float`` is caught. 838 planner tests pass (+1). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… path (review #3) On the orchestrator replay path (use_orchestrator=True) the regression models were never installed: replay/main.py guarded the AIC benchmark-FPM bootstrap behind ``adapter._sm is not None and not adapter._sm._is_easy``, and ``_sm`` is None under use_orchestrator — so the whole block (incl. the ``load_benchmark_fpms`` calls) was skipped. ``get_regression`` then returned None for the whole replay, the throughput regression stayed empty, and orchestrator-replay scaling decisions diverged from PSM — contradicting the adapter docstrings that claimed bootstrap_from_fpms→install_regressions was wired. Fix: - engine_adapter: extract ``install_regressions_from_fpms`` (synchronous, builds the regressions from benchmark FPMs via the throwaway-PSM factory and installs them on the shared store — no plugin bootstrap). ``bootstrap_from_fpms`` now = install_regressions_from_fpms + bootstrap_plugins. - replay_adapter: add path-agnostic ``install_benchmark_fpms`` — PSM path → ``PlannerStateMachine.load_benchmark_fpms``; orchestrator path → ``install_regressions_from_fpms`` (plugins were already bootstrapped at adapter construction, so this does NOT double-bootstrap). Corrected the ``_get_regression`` docstring to the real wiring. - replay/main.py: guard is now ``not adapter._is_easy_mode()`` (path- agnostic) and the two ``adapter._sm.load_benchmark_fpms(...)`` calls become ``adapter.install_benchmark_fpms(...)``. Test: install_benchmark_fpms on the orchestrator path makes get_regression("agg") non-None (was None pre-fix). 840 planner tests pass (+1). NOTE: replay/main.py's end-to-end path can't be exercised in this env (its Rust _core symbol predates the installed extension); the edit is py_compile-validated and the adapter method it calls is unit-tested. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… test CI pins black 23.1.0 via pre-commit; the committed call was left multi-line by a newer local black. Collapse to one line so the pinned formatter and CI agree. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…from_fpms CI mypy flagged replay_adapter.py:249 with [attr-defined] (note: not covered by the type:ignore). The call sits in install_benchmark_fpms where self._engine carries only its declared EngineProtocol type, so the absent method is attr-defined — not union-attr (the latter applies at the __init__ call site where mypy narrows the assigned union). Switch the ignore code to [attr-defined]. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…I runtime parity The review-ai-dynamo#5 regen (a2b8c1c) was run with a local grpcio-tools 1.80 / protobuf 6.33, baking ValidateProtobufRuntimeVersion(6,31,1) and GRPC_GENERATED_VERSION '1.80.0' into the committed stubs. CI's pinned runtime is protobuf 5.29.6 (requirements: protobuf>=5.29.5,<6.0dev), so the 6.31.1 gencode guard hard-raised VersionError at import, erroring out 5 planner test modules at collection. Regenerate from the unchanged plugin.proto with grpcio-tools 1.67.1 — the same toolchain the stubs originally shipped with (b39dde8) and which passed CI — restoring gencode 5.27.2 + grpc guard 1.67.1. The serialized descriptor blob is byte-identical (schema unchanged: doubles, kv_hit_rate, WorkerState flags, component_name strip all preserved); only the version guards revert to CI-compatible values. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions Bot added feat documentation Improvements or additions to documentation planner deployment::k8s container labels May 15, 2026

kangclzjc force-pushed the plugin_frame_kang_pr1 branch from 9452c2a to 04adc1b Compare May 15, 2026 09:16

github-advanced-security AI found potential problems May 15, 2026

View reviewed changes

kangclzjc force-pushed the plugin_frame_kang_pr1 branch 23 times, most recently from 98cafd0 to 917003f Compare May 22, 2026 01:25

kangclzjc and others added 30 commits June 4, 2026 16:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(planner): plugin framework infrastructure (PR #1 of 2)#2

feat(planner): plugin framework infrastructure (PR #1 of 2)#2
kangclzjc wants to merge 42 commits into
mainfrom
plugin_frame_kang_pr1

kangclzjc commented May 15, 2026 •

edited

Loading

Uh oh!

github-advanced-security AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kangclzjc commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this PR does

Scope

In this PR

Explicitly NOT in this PR

Design decisions worth highlighting

Verification

Test suite

Live K8s validation

How to review

Follow-up PRs

Uh oh!

github-advanced-security AI left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

kangclzjc commented May 15, 2026 •

edited

Loading