Skip to content

feat(planner): plugin framework infrastructure (PR #1 of 2)#10124

Merged
kangclzjc merged 42 commits into
ai-dynamo:mainfrom
kangclzjc:plugin_frame_kang_pr1
Jun 4, 2026
Merged

feat(planner): plugin framework infrastructure (PR #1 of 2)#10124
kangclzjc merged 42 commits into
ai-dynamo:mainfrom
kangclzjc:plugin_frame_kang_pr1

Conversation

@kangclzjc

@kangclzjc kangclzjc commented May 29, 2026

Copy link
Copy Markdown
Contributor

What this PR does

Introduces the plugin framework infrastructure for the Dynamo Planner: a parallel tick-engine path that runs decisions through a PREDICT → PROPOSE → RECONCILE → CONSTRAIN pipeline driven by in-process or out-of-process plugins. The legacy PlannerStateMachine (PSM) path remains the default; the new path is opt-in per planner via scheduling.use_orchestrator=true.

This PR is infrastructure only. It deliberately does not include builtin plugin implementations — those land in the follow-up PR with their parity tests.

Scope

In this PR

  • Proto + Pydantic types + bidirectional bridge for the 4 stage RPCs (PredictPlugin / ProposePlugin / ReconcilePlugin / ConstrainPlugin) plus PluginRegistry and PluginLifecycle.
  • Transport layer: PluginTransport ABC + two client implementations — InProcessTransport (inproc://) and GrpcTransport (grpc://, plaintext only, gated behind allow_insecure_grpc=True with a startup WARNING). UDS as a client transport was considered but dropped — see Design decisions below. mTLS support lands in PR 1.5.
  • PluginRegistry + PluginScheduler + CircuitBreaker with 3 auth sources (static_secret / multi / allow_unauthenticated). k8s_sa and spiffe_jwt were originally drafted but moved to PR 1.5 alongside mTLS.
  • Merge algorithms: type_aware_merge (PROPOSE / RECONCILE / CONSTRAIN — winner-takes-all on SET, max-of-floors on AT_LEAST, min-of-ceilings on AT_MOST) and chain_augment (PREDICT — sequential refinement in priority-ascending order with optional final=true chain terminator).
  • LocalPlannerOrchestrator + 4-stage pipeline driver (parallel fan-out on PROPOSE/RECONCILE/CONSTRAIN, sequential chain on PREDICT) with REJECT short-circuit, HOLD_LAST cache invalidation, per-stage / whole-tick deadlines.
  • EngineProtocol dual-path seam in NativePlannerBase so the same tick loop drives either PSM or orchestrator without branching. PSM stays the default.
  • W1 (static-config external plugins): at startup, the planner reads scheduling.external_plugins[] from PlannerConfig and registers each entry via registry.register(...). Per-entry failures (auth reject, bad scheme, protocol mismatch) are logged but do not crash the planner.
  • W2 (gRPC self-register gateway): when scheduling.gateway.enabled=true, the planner exposes PluginRegistry over gRPC so external plugin processes can call Register themselves. Implementation delegates to the same in-process registry the static-config path uses, so behaviour is identical between the two registration models.
  • Configurable external-plugin auth: plugin_registration.auth.trusted_sources (in PlannerConfig) drives which auth schemes the registry accepts; an empty list falls back to AllowUnauthenticatedAuth with a DEV-ONLY startup WARN. Verified end-to-end in K8s smoke for both static_secret and unauthenticated modes (see Verification).
  • Observability surface: dynamo_planner_* metric family extensions for plugin lifecycle (evaluations / latency / circuit / held-over / cache-age), tick scheduling (duration / timeout / lag / skipped), and execute funnel.
  • Structured audit events: per-tick audit_events: list[str] on PipelineOutcome (populated by chain-augment / type-aware merge) for downstream observability tooling.
  • Reference external plugin example (examples/external_plugin/reference_runner.py): a standalone Python gRPC server implementing all 4 stage servicers. Doubles as the canonical starting point for users writing their own plugin AND as the cross-process fixture used in K8s smoke validation. See examples/external_plugin/README.md for local / K8s / fork-for-real-plugin walkthroughs.
  • Replay adapter migration to EngineProtocol so offline replay drives either path symmetrically.
  • scale_interval cadence model for the orchestrator path (six-commit refactor at the head of the PR). Pipeline ticks now fire once per SchedulingConfig.scale_interval_seconds (default 5.0) regardless of individual plugin cadences; per-plugin execution_interval_seconds throttling decides which plugins actually fire each tick. Introduces two new RegisterRequest fields — requires_produced_fields (declarative dependency gate: plugin skips its turn when an upstream dot-path is None in the current PipelineContext) and observation_window_seconds (plugin-declared Prometheus aggregation window for lazy traffic pull). Phase-aligns registered_at to the nearest scale_interval boundary so plugins registered milliseconds apart still fire in the same tick — required for requires_produced_fields to work. Replaces the PSM-mirror dual-cadence + 0.5s merge tolerance in OrchestratorEngineAdapter with the single base interval. Net code delta: −50 lines in engine_adapter; +28 new tests across test_phase_alignment, test_requires_produced_fields, and the updated test_engine_adapter.
  • Tests: ~360 plugin-framework unit tests (proto / transport / registry / scheduler / merge / clock / orchestrator) + integration tests (external plugin e2e over real gRPC sockets); 725 passed, 1 skipped under -m "pre_merge and planner and gpu_0" with the standard env exclusions; full suite 851 passed, 1 skipped, 0 failed after upgrading the dev-env dynamo.prometheus_names to the in-tree version.

Explicitly NOT in this PR

These come in the follow-up PR (plugin_frame_kang_pr2):

  • plugins/builtins/ — 5 reference builtin plugins: BuiltinLoadPredictor, BuiltinLoadPropose, BuiltinThroughputPropose, BuiltinReconcile, BuiltinBudgetConstrain.
  • Dual-path byte-equal parity suite (G3 fixture-driven tests comparing PSM and orchestrator outputs scenario by scenario).
  • The replay-mooncake byte-equal tests that exercise both paths through real trace ingestion.

The orchestrator path does still run on this PR — it just produces empty proposals when no builtins or external plugins are registered. Tests verify it boots cleanly, ticks, and returns scale_to=None with a valid next_tick in that case. External plugins (via W1 or W2) are sufficient on their own to drive real scaling decisions — verified in the integration suite and on a live K8s cluster.

Design decisions worth highlighting

  1. PSM is the default, opt-in to orchestratorscheduling.use_orchestrator defaults to false. Existing deployments see no behaviour change; switching the flag is the operator's choice. The follow-up PR will not flip this default either — that's a separate, calendar-gated cutover after canary observation.

  2. Drop UDS as a client transport, keep inproc + grpc — the original design supported three client transport schemes (inproc:// / unix:// / grpc://). UDS as a client transport is removed. Rationale:

    • Cross-process plugin = single wire model (grpc://), simplifies the deployment + security story.
    • Same-Pod sidecar plugins use loopback grpc://127.0.0.1:N instead. Cost is small; benefit is one less transport to test / document / threat-model.
    • The W2 registration gateway server can still bind a Unix socket via scheduling.gateway.listen="unix:/path" for in-Pod registration — that is a server bind option, not a client transport scheme.
  3. No-op _PSMEngineAdapter — the abstraction layer the dual path requires adds one async wrapper around psm.on_tick(). The wrapper body is literally return self._psm.on_tick(scheduled_tick, tick_input) — no await points, no behaviour change. PSM-mode tick is byte-equivalent to pre-plugin-framework.

  4. Append-only enum extensions — Prometheus Enum metrics (load_scaling_decision, throughput_scaling_decision) gain new state names for plugin-era reasons (override_by_user_plugin, held_over, reconcile_clamped_to_*, etc). They are append-only so existing scrapers parsing the old label set continue to work; new label values report 0 on the PSM path.

  5. scale_interval cadence model in OrchestratorEngineAdapter — addresses the design comment that the PSM-mirror dual-cadence model leaked the legacy load/throughput tick-type concept into the plugin contract. Pipeline now fires every scale_interval_seconds (single base cadence; default 5.0); the previous _MERGE_TOLERANCE_S = 0.5 merge tolerance and the _next_load_s / _next_throughput_s book-keeping are removed because there is no second cadence stream to merge with the first. Per-plugin execution_interval_seconds throttling (already in PluginScheduler._is_due) now carries all cadence-type decisions — BuiltinThroughputPropose(interval=180.0) simply skips ticks where the throttle says not due, no flag plumbing required. Plugins declare their data needs via RegisterRequest.needs + observation_window_seconds, and the adapter pulls Prometheus lazily — matching PSM's 6-query-per-180s mixed-mode cost profile without leaking cadence-type info into the observation contract. Two earlier round-fix items folded in: first-fire anchored on registered_at (commit 65d1968e6) and OrchestratorEngineAdapter clock-injection for replay (commit c045670d7).

  6. Decision-level parity with PSM (not bit-identical) — under scale_interval, the orchestrator path legitimately produces a different number of pipeline ticks per hour than PSM (one per scale_interval instead of one per load tick plus extras for throughput ticks), and ScheduledTick.diagnostics field shapes diverge. What's preserved is the actual scaling decision sequence at every wall-clock moment: scale_to outputs match PSM for the same input trace. See test_lazy_traffic_pull_skips_prometheus_when_no_plugin_needs_traffic and test_pipeline_fires_at_scale_interval_cadence for the new contract; the existing test_tick_async_wraps_psm_on_tick_identically continues to lock the PSM-path adapter shim.

Verification

Test suite

$ pytest dynamo/planner/tests/ -m "pre_merge and planner and gpu_0" \
    --ignore=...test_diagnostics_recorder.py \
    --ignore=...test_advisory_mode.py
725 passed, 1 skipped, 37 deselected in 6.20s

Without the ignore list (after upgrading the dev-env dynamo.prometheus_names to match the in-tree v1.2.0 binding): 806 passed, 1 skipped, 0 failed.

The one skip is test_virtual_connector (pre-existing source comment).

Live K8s validation

Both paths exercised on a real K8s cluster, on the v16 planner image (= round 8 fix + reference-runner example included):

  • PSM path (use_orchestrator=false): planner boots clean (no AttributeError), every-10s tick emits HOLD with load_reason=no_fpm_data (correct in absence of FPM observations), no Traceback / ERROR, byte-equivalent to pre-plugin-framework PSM behaviour.
  • Orchestrator path with 4 external plugins (use_orchestrator=true): all four ext-* Pods running reference_runner.py register through W1 (external plugin bootstrap: accepted=4), planner invokes each over real cross-Pod gRPC every 10s, circuit breakers remain CLOSED, and the priority=2 RECONCILE plugin's OverrideResult correctly drives the final ScalingProposal (SCALE_UP | recommended: prefill=2 decode=3).
  • Configurable auth — both static_secret (planner log: auth_source=static_secret subject=ext-plugins) and AllowUnauthenticatedAuth (auth_source=allow_unauthenticated subject=anonymous, plus the DEV-ONLY startup WARN) scenarios were applied to the same 5-plugin DGD and verified by inspecting register-accept log lines.
  • End-to-end gRPC wire — an independent in-cluster spike (Python gRPC client → ext-user-propose.dynamo.svc.cluster.local:9099ProposePluginServicer.Propose) returned the args-configured OverrideResult(prefill=10 SET, decode=12 SET), confirming the integration is not a mock at any layer.

How to review

  • Start with plugins/types.py and proto/v1/plugin.proto for the message schema.
  • Then plugins/merge/type_aware.py and chain_augment.py — pure functions, well-documented invariants.
  • plugins/orchestrator/pipeline.py is the per-tick driver; run_pipeline is the single async function reviewers should trace top-to-bottom.
  • plugins/registry/server.py for the 4 RPC handlers + auth / protocol / dedup gates.
  • core/engine_protocol.py + core/base.py for the dual-path seam and how PSM stays untouched.
  • plugins/orchestrator/engine_adapter.py for the OrchestratorEngineAdapter lifecycle under the new scale_interval cadence model (single base cadence, lazy traffic pull based on plugin needs union, regression-backed in tests/plugins/orchestrator/test_engine_adapter.py).
  • plugins/scheduler.py for the new declarative-dependency gate (_requires_satisfied + _ctx_get dot-path walker) and the existing _is_due throttle; tests under tests/plugins/scheduler/test_phase_alignment.py and test_requires_produced_fields.py.
  • examples/external_plugin/reference_runner.py for the canonical external-plugin shape; examples/external_plugin/README.md for the fork-for-real-plugin walkthrough.

The two largest README-equivalents under plugins/transport/README.md and plugins/registry/README.md encode the decision rationale and threat model.

Follow-up PRs

PR Topic
PR #2 5 builtin plugins + G3 dual-path parity + replay-mooncake byte-equal
PR 1.5 mTLS (cert-manager / Secret mount) for GrpcTransport; revisit k8s_sa + spiffe_jwt auth modules
Future _async_init startup observability (background metric while waiting for workers — surfaced during K8s validation)
Future Flip use_orchestrator default to true (calendar-gated, after canary observation)
Future Delete legacy PSM (gated on the flag flip)

Open in Devin Review

Summary by CodeRabbit

  • New Features

    • Introduced a plugin system for the planner, enabling external scaling decision plugins via gRPC with registration, circuit breaker protection, and scheduling controls.
    • Added new planner tick engine based on a plugin orchestrator pipeline (predict → propose → reconcile → constrain) for improved scaling decisions.
    • Added feature flag scheduling.use_orchestrator to enable orchestrator-based tick execution.
    • Extended planner diagnostics with plugin-specific audit fields.
  • Enhancements

    • Added comprehensive Prometheus metrics for plugin framework observability.
    • Expanded decision state labels for better load/throughput scaling reason tracking.
  • Documentation

    • Added plugin protocol (Proto v1) specification and documentation.
    • Added orchestrator rollout guidance and observability metrics reference.

Review Change Stack

@kangclzjc kangclzjc requested review from a team as code owners May 29, 2026 12:50
@github-actions github-actions Bot added the feat label May 29, 2026
@github-actions

github-actions Bot commented May 29, 2026

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot added the external-contribution Pull request is from an external contributor label May 29, 2026
@github-actions github-actions Bot added documentation Improvements or additions to documentation planner labels May 29, 2026

@github-advanced-security github-advanced-security AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

@coderabbitai

coderabbitai Bot commented May 29, 2026

Copy link
Copy Markdown
Contributor

Walkthrough

Adds a plugin-based orchestrator engine alongside PSM, full plugin protocol and Pydantic mirrors, registry/auth/scheduler/circuit breaker, transports (in-process, gRPC), proto bridge, merge algorithms, metrics, configs, example external plugin, and extensive unit/integration tests. Planner selects engine via scheduling.use_orchestrator and routes ticks through EngineProtocol.

Changes

Orchestrator engine and plugin framework

Layer / File(s) Summary
Orchestrator path: contracts, runtime, transports, registry, metrics, config, core integration, examples, tests
components/src/dynamo/planner/..., docs/components/planner/*, components/src/dynamo/planner/tests/..., .gitignore, pyproject.toml
Defines plugin proto/Pydantic contracts, merge algorithms, registry/auth/scheduler/circuit breaker, transports and proto bridge, orchestrator and engine adapter; integrates engine selection in planner core and replay; extends metrics/diagnostics and configuration; adds example external plugin server and comprehensive tests/docs.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 16

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/src/dynamo/planner/tests/plugins/registry/auth/test_allow_unauthenticated.py (1)

34-42: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Assert the empty-token subject too.

This test covers the empty-token branch, but it only checks source. A regression that returns a different subject for "" would still pass.

💡 Suggested assertion
 async def test_accepts_any_token_including_empty():
     auth = AllowUnauthenticatedAuth()
     identity = await auth.validate("anything")
     assert identity.source == "allow_unauthenticated"
     assert identity.subject == "anonymous"

     identity_empty = await auth.validate("")
     assert identity_empty.source == "allow_unauthenticated"
+    assert identity_empty.subject == "anonymous"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@components/src/dynamo/planner/tests/plugins/registry/auth/test_allow_unauthenticated.py`
around lines 34 - 42, The test async function
test_accepts_any_token_including_empty() checks the
AllowUnauthenticatedAuth.validate("") branch but only asserts
identity_empty.source; add an assertion that identity_empty.subject equals
"anonymous" to ensure the empty-token path returns the expected subject
(referencing AllowUnauthenticatedAuth.validate and the identity_empty variable).
🟡 Minor comments (23)
components/src/dynamo/planner/core/base.py-997-1004 (1)

997-1004: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Reset the cached worker counts on ticks that skip inventory collection.

_last_worker_counts is only updated when this tick requested worker states, so any later tick with need_worker_states=False logs the previous tick’s replica counts as “current”. That makes the new summary line report stale deltas on the orchestrator path.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/core/base.py` around lines 997 - 1004, The
cached worker count handling currently only updates self._last_worker_counts
when tick_input.worker_counts is present, causing stale counts to persist for
later ticks that skip inventory; modify the conditional around
tick_input.worker_counts (the block that sets self._last_worker_counts) to
explicitly clear/reset self._last_worker_counts (e.g. set to None) when
tick_input.worker_counts is None so that _log_decision_summary and downstream
logic do not use stale replica counts; this change should be placed alongside
the existing code that calls engine.tick(...) and references
tick_input.worker_counts and _last_worker_counts.
components/src/dynamo/planner/config/planner_config.py-159-166 (1)

159-166: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Chain the invalid hold_policy error explicitly.

_coerce_hold_policy catches KeyError and then raises a ValueError without an explicit from ..., leaving the exception context implicit. Capture the KeyError as err and chain it.

Suggested change
-        if isinstance(v, str):
-            try:
-                return HoldPolicy[v.upper()]
-            except KeyError:
-                raise ValueError(
-                    f"hold_policy must be one of {[p.name for p in HoldPolicy]}, "
-                    f"got {v!r}"
-                )
+        if isinstance(v, str):
+            try:
+                return HoldPolicy[v.upper()]
+            except KeyError as err:
+                raise ValueError(
+                    f"hold_policy must be one of {[p.name for p in HoldPolicy]}, "
+                    f"got {v!r}"
+                ) from err
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/config/planner_config.py` around lines 159 -
166, The _coerce_hold_policy function currently catches KeyError when mapping a
string to HoldPolicy and raises ValueError without exception chaining; modify
the except block to capture the KeyError as err and re-raise the ValueError
using "raise ... from err" so the original KeyError is preserved in the
exception chain when mapping strings to HoldPolicy.
components/src/dynamo/planner/offline/replay_adapter.py-167-167 (1)

167-167: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Consider closing the event loop after replay.

The event loop created on line 167 is never explicitly closed. While Python's GC will eventually clean it up, explicit cleanup prevents resource warnings and is best practice.

Proposed fix

Add cleanup at the end of run() (after line 282):

         return ReplayPlannerReport(
             trace_report=trace_report,
             scaling_events=scaling_events,
             diagnostics_log=diagnostics_log,
             total_ticks=total_ticks,
             html_report_path=html_report_path,
         )
+        # Note: Consider adding a finally block or __del__ to close self._loop
+        # if self._loop is not None:
+        #     self._loop.close()

Or add a close() method and document that callers should invoke it, or use __del__:

def __del__(self) -> None:
    if self._loop is not None and not self._loop.is_closed():
        self._loop.close()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/offline/replay_adapter.py` at line 167, The
asyncio event loop stored on self._loop (created in the ReplayAdapter
constructor) is never closed; update the ReplayAdapter to explicitly close it
after replay by either 1) calling self._loop.close() at the end of the run()
method (ensure you check for existence and not already closed), or 2) add a
public close() method that closes self._loop and document callers must call it,
or 3) implement __del__ to close the loop defensively; reference the _loop
attribute and the run() method (or add close()/__del__) when applying the
change.
components/src/dynamo/planner/plugins/registry/README.md-11-39 (1)

11-39: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add fence languages to the ASCII blocks.

These unlabeled code fences trip markdownlint MD040. text is enough here.

Suggested fix
-```
+```text
             +-------------------+
             |  register (RPC)   |<--- gateway.py (gRPC server)
             |  heartbeat (RPC)  |
             |  unregister (RPC) |
             |  list_plugins     |
             +---------+---------+
                       | method calls (single-threaded asyncio)
                       v
             +-------------------+
             | PluginRegistry    |  <-- register_internal (in-process)
             |    Server         |
             +---------+---------+
                       |
        +-- on_unregister events ---+
        |                            |
        v                            v
 +---------------+          +------------------+
 | CircuitBreaker|<-- open->| PluginScheduler  |---> cache_age lookup
 |               |          |    - active set  |     (ListPlugins)
 |               |          |    - HOLD_LAST   |
 +---------------+          +------------------+
        ^
        | can_call()
        |
 +---------------+
 | Orchestrator  |   <-- composes all of the above
 +---------------+

@@
- +text
dev environment / single-tenant lab / pre-shared key OK?
└── yes → static_secret (PR #1 default)

└── share-a-secret-with-dynamo-planner K8s Secret; map the secret
value → subject label in AuthConfig.static_secrets

multi-cluster / mesh / zero-trust?
└── yes → k8s_sa (FOLLOW-UP PR — not in PR #1) — TokenReview against kube API
or → spiffe_jwt (FOLLOW-UP PR — not in PR #1) — SPIRE JWT-SVID

quick dev loop without real secrets?
└── allow_unauthenticated (emits WARNING on construction;
NEVER use in production)

Also applies to: 59-73

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/plugins/registry/README.md` around lines 11 -
39, The markdown has unlabeled code fences causing MD040; update the two
ASCII-art/code blocks in README.md by adding a fence language label "text"
(i.e., change ``` to ```text) so both the topology diagram block (the
PluginRegistry/CircuitBreaker/Orchestrator ASCII art) and the configuration flow
block (dev environment / multi-cluster / quick dev loop) are fenced as text;
locate these blocks by their distinctive content (the PluginRegistry diagram and
the "dev environment / single-tenant lab / pre-shared key OK?" flow) and apply
the label to each opening ``` fence.
components/src/dynamo/planner/plugins/registry/auth/__init__.py-26-32 (1)

26-32: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Sort __all__ to clear RUF022.

The exported names are out of order, so Ruff will keep reporting this file.

Suggested fix
 __all__ = [
-    "AuthValidator",
+    "AllowUnauthenticatedAuth",
     "AuthIdentity",
+    "AuthValidator",
+    "MultiSourceAuth",
     "StaticSecretAuth",
-    "MultiSourceAuth",
-    "AllowUnauthenticatedAuth",
 ]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/plugins/registry/auth/__init__.py` around lines
26 - 32, The __all__ list is not sorted which triggers Ruff RUF022; reorder the
exported names in the __all__ tuple/list into lexicographic order so the names
(e.g., AllowUnauthenticatedAuth, AuthIdentity, AuthValidator, MultiSourceAuth,
StaticSecretAuth) are alphabetically sorted and ensure the list exactly matches
the actual exported symbols.
components/src/dynamo/planner/plugins/registry/__init__.py-30-35 (1)

30-35: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Sort __all__ to satisfy Ruff.

RUF022 will keep flagging this export list until it matches isort-style ordering.

Suggested fix
 __all__ = [
+    "AuthError",
     "RegisteredPlugin",
-    "derive_transport_type",
     "RegistryError",
-    "AuthError",
+    "derive_transport_type",
 ]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/plugins/registry/__init__.py` around lines 30 -
35, Ruff flags the __all__ export list for not following isort-style ordering;
reorder the entries in the __all__ list so they are case-insensitively sorted
(e.g., "AuthError", "derive_transport_type", "RegisteredPlugin",
"RegistryError") to satisfy RUF022 and keep the module exports consistent;
update the __all__ variable in the registry module accordingly.
components/src/dynamo/planner/plugins/registry/auth/base.py-73-78 (1)

73-78: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Sort __all__ so Ruff stops flagging the module.

This list is out of isort-style order (RUF022).

Suggested fix
 __all__ = [
+    "AllowUnauthenticatedAuth",
+    "AuthError",
+    "AuthIdentity",
     "AuthValidator",
-    "AuthIdentity",
-    "AllowUnauthenticatedAuth",
-    "AuthError",
 ]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/plugins/registry/auth/base.py` around lines 73
- 78, The module-level __all__ list is not in isort/ruff alphabetical order;
reorder the entries in the __all__ list so they are sorted alphabetically (e.g.,
"AllowUnauthenticatedAuth", "AuthError", "AuthIdentity", "AuthValidator") to
satisfy RUF022 and stop the Ruff warning for the symbols defined in this module.
components/src/dynamo/planner/plugins/transport/README.md-21-26 (1)

21-26: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a language to this fenced block.

This trips MD040 today, so the docs will keep failing or staying noisy until the fence is tagged (for example text).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/plugins/transport/README.md` around lines 21 -
26, The fenced code block in README.md is missing a language tag which triggers
MD040; update the triple-backtick fence that wraps the plugin/orchestrator
decision diagram to include a language identifier (e.g., change ``` to ```text)
so the linter is satisfied; locate the block containing "plugin and orchestrator
in same process?" and the two arrows and add the language tag to the opening
fence.
components/src/dynamo/planner/plugins/types.py-368-411 (1)

368-411: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Sort __all__ to satisfy Ruff RUF022.

__all__ ordering is currently flagged by static analysis and may fail lint gates.

Proposed change
-__all__ = [
-    # Enums
-    "HoldPolicy",
-    "CircuitState",
-    "OverrideType",
-    # PluginRegistry
-    "RegisterRequest",
-    "RegisterResponse",
-    "HeartbeatRequest",
-    "HeartbeatResponse",
-    "UnregisterRequest",
-    "UnregisterResponse",
-    "ListPluginsRequest",
-    "PluginInfo",
-    "ListPluginsResponse",
-    # Pipeline context + observation
-    "TrafficMetrics",
-    "FpmData",
-    "WorkerState",
-    "ObservationData",
-    "PredictionData",
-    "ComponentTarget",
-    "ScalingProposal",
-    "PipelineContext",
-    # Stage payloads
-    "AcceptResult",
-    "RejectResult",
-    "OverrideResult",
-    "ProposeResult",
-    # Stage request/response
-    "PredictStageRequest",
-    "PredictStageResponse",
-    "ProposeStageRequest",
-    "ProposeStageResponse",
-    "ReconcileStageRequest",
-    "ReconcileStageResponse",
-    "ConstrainStageRequest",
-    "ConstrainStageResponse",
-    # PluginLifecycle
-    "BootstrapRequest",
-    "BootstrapResponse",
-    "ResetRequest",
-    "ResetResponse",
-]
+__all__ = [
+    "AcceptResult",
+    "BootstrapRequest",
+    "BootstrapResponse",
+    "CircuitState",
+    "ComponentTarget",
+    "ConstrainStageRequest",
+    "ConstrainStageResponse",
+    "FpmData",
+    "HeartbeatRequest",
+    "HeartbeatResponse",
+    "HoldPolicy",
+    "ListPluginsRequest",
+    "ListPluginsResponse",
+    "ObservationData",
+    "OverrideResult",
+    "OverrideType",
+    "PipelineContext",
+    "PluginInfo",
+    "PredictStageRequest",
+    "PredictStageResponse",
+    "PredictionData",
+    "ProposeResult",
+    "ProposeStageRequest",
+    "ProposeStageResponse",
+    "ReconcileStageRequest",
+    "ReconcileStageResponse",
+    "RegisterRequest",
+    "RegisterResponse",
+    "RejectResult",
+    "ResetRequest",
+    "ResetResponse",
+    "ScalingProposal",
+    "TrafficMetrics",
+    "UnregisterRequest",
+    "UnregisterResponse",
+    "WorkerState",
+]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/plugins/types.py` around lines 368 - 411, The
__all__ list in types.py is not alphabetically sorted and triggers Ruff RUF022;
update the __all__ variable by sorting all exported symbol strings (e.g.,
"AcceptResult", "AlertName", etc.—specifically the current entries like
"HoldPolicy", "CircuitState", "OverrideType", "RegisterRequest",
"RegisterResponse", "HeartbeatRequest", "HeartbeatResponse",
"UnregisterRequest", "UnregisterResponse", "ListPluginsRequest", "PluginInfo",
"ListPluginsResponse", "TrafficMetrics", "FpmData", "WorkerState",
"ObservationData", "PredictionData", "ComponentTarget", "ScalingProposal",
"PipelineContext", "AcceptResult", "RejectResult", "OverrideResult",
"ProposeResult", "PredictStageRequest", "PredictStageResponse",
"ProposeStageRequest", "ProposeStageResponse", "ReconcileStageRequest",
"ReconcileStageResponse", "ConstrainStageRequest", "ConstrainStageResponse",
"BootstrapRequest", "BootstrapResponse", "ResetRequest", "ResetResponse") into a
single lexicographically ordered list so Ruff RUF022 is satisfied; keep the same
string values and formatting style when replacing the __all__ declaration.
components/src/dynamo/planner/tests/core/test_engine_protocol.py-134-139 (1)

134-139: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Move _make_fpm imports to module scope.

Function-local imports violate the repo Python guideline and add avoidable per-call overhead.

Proposed change
 import pytest
 
+from dynamo.common.forward_pass_metrics import (
+    ForwardPassMetrics,
+    QueuedRequestMetrics,
+    ScheduledRequestMetrics,
+)
 from dynamo.planner.config.planner_config import PlannerConfig
 from dynamo.planner.core.engine_protocol import EngineProtocol, _PSMEngineAdapter
 from dynamo.planner.core.state_machine import PlannerStateMachine
@@
 def _make_fpm():
-    from dynamo.common.forward_pass_metrics import (
-        ForwardPassMetrics,
-        QueuedRequestMetrics,
-        ScheduledRequestMetrics,
-    )
-
     return ForwardPassMetrics(
As per coding guidelines: "Imports must be at module top; flag any import inside functions/methods/classes."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/tests/core/test_engine_protocol.py` around
lines 134 - 139, The helper _make_fpm currently performs function-local imports
of ForwardPassMetrics, QueuedRequestMetrics, and ScheduledRequestMetrics; move
those three imports to the top-level of the module (module scope) and remove the
import statements from inside _make_fpm so the function simply uses the
already-imported classes; ensure the import line references the same module path
(dynamo.common.forward_pass_metrics) and run tests to confirm no import-order
issues.
components/src/dynamo/planner/tests/monitoring/test_plugin_framework_metrics.py-288-294 (1)

288-294: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Don’t skip on arbitrary ValueError from metric construction.

This currently hides any constructor regression that happens to raise ValueError, not just the intended duplicate-timeseries collision. Please narrow the skip to the known collision case and let other ValueErrors fail the test.

🛠️ Suggested guard
     try:
         m = PluginFrameworkMetrics()
-    except ValueError:
-        pytest.skip(
-            "PluginFrameworkMetrics already registered on REGISTRY by "
-            "another test in this session; test-only collision, not a bug."
-        )
-        return  # unreachable: pytest.skip() raises Skipped
+    except ValueError as exc:
+        if "Duplicated timeseries" not in str(exc):
+            raise
+        pytest.skip(
+            "PluginFrameworkMetrics already registered on REGISTRY by "
+            "another test in this session; test-only collision, not a bug."
+        )
+        return  # unreachable: pytest.skip() raises Skipped
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@components/src/dynamo/planner/tests/monitoring/test_plugin_framework_metrics.py`
around lines 288 - 294, The current test broadly catches any ValueError from
PluginFrameworkMetrics(), which can mask real constructor regressions; change
the except to capture the exception (except ValueError as e) and only call
pytest.skip when the exception message matches the known
duplicate-timeseries/registration collision (e.g., contains "already registered"
or the exact phrase your constructor raises for REGISTRY collisions), otherwise
re-raise the exception so real errors fail the test; locate this change around
the PluginFrameworkMetrics() instantiation/exception handling.
components/src/dynamo/planner/tests/manual/perf_test_configs/disagg_8b_planner_orchestrator.yaml-4-7 (1)

4-7: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Remove the internal ticket reference from this YAML comment.

DEP-XXXX violates the repo rule against shipping internal Linear IDs in source/config files. Replace it with the public GitHub issue/PR reference or a plain descriptive comment.

🧹 Suggested edit
-# orchestrator tick engine (DEP-XXXX PR 7 cutover). The only diff vs
+# orchestrator tick engine (PR 7 cutover). The only diff vs

As per coding guidelines, **/*.{py,rs,go,ts,tsx,js,mjs,cjs,sh,yaml,yml,toml}: “Flag internal Linear ticket references (e.g. DIS-XXXX, DYN-XXXX, OPS-XXXX, DEP-XXXX, or any -NNNN ID) in the added lines.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@components/src/dynamo/planner/tests/manual/perf_test_configs/disagg_8b_planner_orchestrator.yaml`
around lines 4 - 7, Remove the internal ticket ID "DEP-XXXX" from the YAML
comment that describes the orchestrator tick engine (the comment that mentions
the planner --config JSON adding "scheduling": {"use_orchestrator": true}) and
replace it with either a public GitHub issue/PR reference or a plain descriptive
comment (e.g., "orchestrator tick engine cutover" or "plugin-based orchestrator
tick engine cutover"). Ensure the new comment conveys the same intent but
contains no internal Linear/PROJECT-NNNN identifiers.
components/src/dynamo/planner/tests/plugins/merge/test_type_aware_clamp_tracking.py-245-245 (1)

245-245: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Move RejectResult into the module import block.

test_short_circuit_leaves_clamped_empty() imports RejectResult inside the function; place it in the existing module-level dynamo.planner.plugins.types import block instead.

♻️ Proposed fix
 from dynamo.planner.plugins.types import (
     AcceptResult,
     ComponentTarget,
     OverrideResult,
     OverrideType,
+    RejectResult,
 )
@@
 def test_short_circuit_leaves_clamped_empty():
-    from dynamo.planner.plugins.types import RejectResult
-
     outcome = type_aware_merge(
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@components/src/dynamo/planner/tests/plugins/merge/test_type_aware_clamp_tracking.py`
at line 245, Move the local import of RejectResult from inside
test_short_circuit_leaves_clamped_empty() to the existing module-level import
block that already imports from dynamo.planner.plugins.types; specifically add
RejectResult to the top-level import statement (alongside other symbols from
dynamo.planner.plugins.types) and remove the in-function import so the test uses
the module-level symbol.
components/src/dynamo/planner/tests/plugins/orchestrator/test_concurrency.py-94-96 (1)

94-96: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Replace the ambiguous multiplication sign in this comment.

Ruff flags × here as RUF003, so this will fail lint until it is replaced with plain x.

Suggested fix
-    # 5 plugins × 50ms serial would be 250ms; concurrent should be closer
+    # 5 plugins x 50ms serial would be 250ms; concurrent should be closer
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/tests/plugins/orchestrator/test_concurrency.py`
around lines 94 - 96, The comment uses the Unicode multiplication sign '×' which
triggers RUF003; update the comment in test_concurrency.py (near the assert
referencing DELAY and elapsed in the test_concurrency test) to use a plain ASCII
'x' (e.g., "5 plugins x 50ms") instead of '×' so the linter stops flagging it;
no logic changes required.
components/src/dynamo/planner/tests/plugins/proto/test_round_trip.py-334-340 (1)

334-340: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Tighten the oneof-violation assertion to the specific validation error
Update test_propose_stage_response_oneof_violation to replace pytest.raises(Exception, match="oneof") with pytest.raises(pydantic.ValidationError, match="oneof violation") (import pydantic.ValidationError/ValidationError as needed).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/tests/plugins/proto/test_round_trip.py` around
lines 334 - 340, Change the test_propose_stage_response_oneof_violation to
assert the specific Pydantic validation error: replace pytest.raises(Exception,
match="oneof") with pytest.raises(pydantic.ValidationError, match="oneof
violation") and add/import ValidationError from pydantic (or reference
pydantic.ValidationError) so the test expects pydantic.ValidationError when
constructing pyd.ProposeStageResponse with multiple oneof payloads.
components/src/dynamo/planner/tests/plugins/registry/test_list_plugins.py-68-79 (1)

68-79: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Drop the unused f prefix on the gRPC endpoint.
Line 75 uses endpoint=f"grpc://127.0.0.1:9000" with no placeholders, which triggers F541; change it to a plain string.

♻️ Proposed fix
-        endpoint=f"grpc://127.0.0.1:9000",
+        endpoint="grpc://127.0.0.1:9000",
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/tests/plugins/registry/test_list_plugins.py`
around lines 68 - 79, The endpoint string in the helper function _register
(which builds a RegisterRequest) uses an unnecessary f-string literal:
endpoint=f"grpc://127.0.0.1:9000", causing F541; change it to a plain string
literal endpoint="grpc://127.0.0.1:9000" inside the _register function where the
RegisterRequest is created to remove the unused f-prefix.
components/src/dynamo/planner/tests/plugins/transport/test_config.py-52-61 (1)

52-61: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use ValidationError explicitly in these validation tests.

Line 56 violates the repo rule against in-function imports, and Line 131's pytest.raises(Exception) is broad enough to mask unrelated failures.

✅ Proposed cleanup
 import os
 
 import pytest
+from pydantic import ValidationError
@@
 def test_transport_config_rejects_non_positive_request_timeout():
     """Per-RPC timeout must be strictly positive. (Previously enforced
     on the duplicate SchedulingConfig.request_timeout_seconds field;
     consolidated here when that duplicate was removed.)"""
-    from pydantic import ValidationError
-
     with pytest.raises(ValidationError):
         TransportConfig(request_timeout_seconds=0)
     with pytest.raises(ValidationError):
         TransportConfig(request_timeout_seconds=-1)
@@
 def test_make_clock_unknown_type():
     """Pydantic Literal validates type field at construction; ValueError early."""
-    with pytest.raises(Exception):  # Pydantic ValidationError
+    with pytest.raises(ValidationError):
         ClockConfig(type="invalid")  # type: ignore[arg-type]
As per coding guidelines, "Imports must be at module top; flag any import inside functions/methods/classes."

Also applies to: 129-132

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/tests/plugins/transport/test_config.py` around
lines 52 - 61, The tests import ValidationError inside
test_transport_config_rejects_non_positive_request_timeout and also use a broad
pytest.raises(Exception) elsewhere; move the pydantic.ValidationError import to
the module top and replace any in-function "from pydantic import
ValidationError" with the top-level import, and change the broad
pytest.raises(Exception) (e.g., the usage around lines referencing the other
test) to pytest.raises(ValidationError) so both tests explicitly assert
pydantic.ValidationError rather than performing in-function imports or catching
all Exceptions; update references to TransportConfig and the test names
accordingly.
docs/components/planner/orchestrator-rollout.md-149-150 (1)

149-150: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use the actual circuit-breaker metric name here.

This runbook points operators at dynamo_planner_plugin_circuit_open_total, but the observability reference added in this PR defines dynamo_planner_plugin_circuit_state instead. As written, the alert/troubleshooting guidance sends people to a non-existent series.

🛠️ Suggested wording
-- Prometheus alert `dynamo_planner_plugin_circuit_open_total > 0`
+- Prometheus alert such as `max by (plugin_id) (dynamo_planner_plugin_circuit_state == 1) > 0`
@@
-### Symptom: `dynamo_planner_plugin_circuit_open_total` is non-zero
+### Symptom: `dynamo_planner_plugin_circuit_state == 1` for one or more plugins

Also applies to: 320-323

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/components/planner/orchestrator-rollout.md` around lines 149 - 150, The
runbook references the wrong Prometheus metric name; replace every mention of
dynamo_planner_plugin_circuit_open_total with the actual metric
dynamo_planner_plugin_circuit_state (including alert expressions, example
queries, and troubleshooting guidance) and update any example alerts or
instructions that assumed the old name (also fix the other occurrences around
the section referenced as lines 320-323) so operators are pointed at the
existing series and any sample alert expressions use
dynamo_planner_plugin_circuit_state.
docs/components/planner/observability.md-55-69 (1)

55-69: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add fence languages to these code blocks.

Markdownlint is flagging both fences here. text is enough if you just want literal enum lists.

📝 Proposed fix
-```
+```text
 unset, disabled, no_fpm_data, scaling_in_progress, worker_count_mismatch,
 insufficient_data, no_change, scale_up, scale_down,
 scale_down_capped_by_throughput, override_by_user_plugin,
 reconcile_clamped_to_floor, reconcile_clamped_to_ceiling,
 held_over, rejected_by_plugin

@@
- +text
unset, disabled, no_traffic_data, predict_failed, model_not_ready,
set_lower_bound, scale, override_by_user_plugin, held_over,
circuit_open, rejected_by_plugin

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/components/planner/observability.md` around lines 55 - 69, Update the
two fenced code blocks in the observability doc to include a language tag of
"text" so Markdownlint stops flagging them: change the opening fences for the
first enum list and the second block labeled THROUGHPUT_DECISION_STATES to use
```text instead of ```; this affects the block containing the comma-separated
enum values (including entries like no_fpm_data, scaling_in_progress, etc.) and
the block for THROUGHPUT_DECISION_STATES (entries like no_traffic_data,
predict_failed, model_not_ready, etc.).
components/src/dynamo/planner/tests/plugins/scheduler/test_active_set.py-67-67 (1)

67-67: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Drop the unused f prefix here.

This literal trips Ruff F541 and will fail lint even though the value is constant.

🧹 Proposed fix
-        endpoint=f"grpc://127.0.0.1:9000",
+        endpoint="grpc://127.0.0.1:9000",
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/tests/plugins/scheduler/test_active_set.py` at
line 67, Remove the unnecessary f-string prefix on the endpoint literal used in
the test (endpoint=f"grpc://127.0.0.1:9000"); change it to a plain string
"grpc://127.0.0.1:9000" in the test_active_set test where the scheduler/plugin
is constructed (look for the endpoint= argument in test_active_set.py) to
satisfy the linter (Ruff F541).
docs/components/planner/orchestrator-rollout.md-58-64 (1)

58-64: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Label this fenced block with a language.

The ASCII diagram currently trips markdownlint's fenced-code-language rule.

📝 Proposed fix
-```
+```text
                  ┌─ use_orchestrator = false (default)
                  │      → _PSMEngineAdapter(PlannerStateMachine)
  NativePlannerBase.run()
                  │      → OrchestratorEngineAdapter(
                  └─ use_orchestrator = true           5 builtin plugins )
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @docs/components/planner/orchestrator-rollout.md around lines 58 - 64, The
fenced ASCII diagram block is missing a language label causing markdownlint
failures; update the fenced code block in
docs/components/planner/orchestrator-rollout.md that contains the ASCII diagram
(the block showing NativePlannerBase.run() and use_orchestrator branches) to
specify a language label such as text (e.g., change totext) so the
diagram remains rendered as plain text and satisfies the fenced-code-language
rule.


</details>

</blockquote></details>
<details>
<summary>components/src/dynamo/planner/tests/plugins/scheduler/test_cache_invalidation.py-70-70 (1)</summary><blockquote>

`70-70`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_

**Remove the no-op `f` prefix.**

This is another Ruff F541 hit, so the test file won't be clean until it's a plain string literal.
 
<details>
<summary>🧹 Proposed fix</summary>

```diff
-        endpoint=f"grpc://127.0.0.1:9000",
+        endpoint="grpc://127.0.0.1:9000",
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@components/src/dynamo/planner/tests/plugins/scheduler/test_cache_invalidation.py`
at line 70, The endpoint argument in the test is using an unnecessary f-string
(endpoint=f"grpc://127.0.0.1:9000") which triggers Ruff F541; change it to a
plain string literal (endpoint="grpc://127.0.0.1:9000") so the test file is
clean. Locate the call that sets the endpoint keyword in
test_cache_invalidation.py and replace the f-prefixed string with the non-f
version.
```

</details>

</blockquote></details>
<details>
<summary>docs/components/planner/planner-guide.md-161-162 (1)</summary><blockquote>

`161-162`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_

**Field name mismatch with documented schema.**

The YAML example uses `ttft` and `itl`, but lines 83-84 document these fields as `ttft_ms` and `itl_ms`. This inconsistency will confuse users and may cause configuration errors.




<details>
<summary>📝 Proposed fix</summary>

```diff
       enable_load_scaling: true
-      ttft: 200.0
-      itl: 10.0
+      ttft_ms: 200.0
+      itl_ms: 10.0
       pre_deployment_sweeping_mode: rapid
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/components/planner/planner-guide.md` around lines 161 - 162, The YAML
example uses the short field names "ttft" and "itl" but the documented schema
uses "ttft_ms" and "itl_ms"; update the example to use "ttft_ms" and "itl_ms"
(or, if you prefer keeping the short names, update the schema docs to reference
"ttft" and "itl" everywhere) so the field names match; locate the example block
containing "ttft: 200.0" and "itl: 10.0" and replace them with "ttft_ms: 200.0"
and "itl_ms: 10.0" (or adjust the schema references ttft_ms/itl_ms accordingly)
to ensure consistency.
```

</details>

</blockquote></details>

</blockquote></details>

<details>
<summary>🧹 Nitpick comments (10)</summary><blockquote>

<details>
<summary>components/src/dynamo/planner/core/base.py (1)</summary><blockquote>

`250-253`: _⚡ Quick win_

**Move the `OrchestratorEngineAdapter` imports to module scope.**

These new method-local imports violate the repo’s Python guideline and make import-time failures path-dependent. If they are intentionally hiding a circular dependency, please document that inline; otherwise they should live at the top of the module. As per coding guidelines, "Imports must be at module top; flag any import inside functions/methods/classes."
  


Also applies to: 297-299

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/core/base.py` around lines 250 - 253, The
method-local import of OrchestratorEngineAdapter (triggered by the check
self.config.scheduling.use_orchestrator in base.py) must be moved to module
scope: add the import for OrchestratorEngineAdapter at the top of the module
alongside other imports and remove the in-method import lines; if this was done
to break a circular dependency, instead add an inline comment above the method
explaining the circular import and keep a short explicit note (and consider
using a local import only with that documented justification). Ensure you update
both occurrences (around the blocks referenced by the review) so all imports are
consistent and comply with the "imports at module top" guideline.
```

</details>

</blockquote></details>
<details>
<summary>components/src/dynamo/planner/examples/external_plugin/reference_runner.py (1)</summary><blockquote>

`256-260`: _💤 Low value_

**Replace EN DASH with hyphen in help text.**

Line 259 contains an EN DASH character (`–`) which should be a standard hyphen-minus (`-`). This can cause issues with copy-paste or certain terminal encodings.


<details>
<summary>Proposed fix</summary>

```diff
         "--priority",
         type=int,
         default=5,
         help="plugin priority used during self-registration. PREDICT "
-        "wants priority=1 (lowest = chain terminator); RECONCILE / "
+        "wants priority=1 (lowest = chain terminator); RECONCILE / "
         "CONSTRAIN typically use 1 too. PROPOSE merge picks smallest "
-        "first, so for PROPOSE pick 4–10.",
+        "first, so for PROPOSE pick 4-10.",
     )
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/examples/external_plugin/reference_runner.py`
around lines 256 - 260, The help string passed to the argument (the multi-line
literal containing "plugin priority used during self-registration... so for
PROPOSE pick 4–10.") includes an EN DASH (–); replace that EN DASH with a
standard hyphen-minus (-) in the help text literal (the string near "PROPOSE
pick 4–10") so the message uses "4-10" and avoids encoding/copy-paste issues.
```

</details>

</blockquote></details>
<details>
<summary>components/src/dynamo/planner/plugins/orchestrator/pipeline.py (1)</summary><blockquote>

`400-400`: _⚡ Quick win_

**Add `strict=True` to `zip()` for defensive checking.**

While `asyncio.gather` guarantees the same number of results as input awaitables, adding `strict=True` makes the invariant explicit and would surface any future bugs where the lists diverge.


<details>
<summary>♻️ Proposed fix</summary>

```diff
-    for idx, (plugin, raw) in enumerate(zip(plugins, raw_results)):
+    for idx, (plugin, raw) in enumerate(zip(plugins, raw_results, strict=True)):
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/plugins/orchestrator/pipeline.py` at line 400,
The loop using enumerate(zip(plugins, raw_results)) in pipeline.py should
defensively require equal-length iterables; update the zip call to zip(plugins,
raw_results, strict=True) so any mismatch between plugins and raw_results is
raised immediately rather than silently truncating; adjust the loop at the site
where plugins and raw_results are iterated (the for idx, (plugin, raw) in
enumerate(...) line) to include strict=True.
```

</details>

</blockquote></details>
<details>
<summary>components/src/dynamo/planner/plugins/transport/base.py (1)</summary><blockquote>

`51-52`: _⚡ Quick win_

**Align the timeout docstring with the actual contract.**

These docstrings say the orchestrator wraps `call()` in `asyncio.wait_for`, but the transport layer already owns the per-RPC timeout and the README explicitly says the pipeline must not add another wrapper. Leaving the base contract wrong makes double-timeout behavior look intentional.
 


Also applies to: 72-79

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/plugins/transport/base.py` around lines 51 -
52, Update the misleading docstrings to state that the transport layer owns the
per-RPC timeout (the timeout_seconds attribute) and that callers/orchestrator
must NOT wrap transport.call() in another asyncio.wait_for; replace references
claiming the orchestrator wraps call() with a clear contract that transport
implementations enforce per-RPC timeouts and pipelines should rely on that
single timeout. Also apply the same clarification to the other docstring block
around the call()/timeout-related docs (the second block at lines 72-79) so both
places consistently describe that timeout_seconds is enforced by the transport
and double-wrapping must be avoided.
```

</details>

</blockquote></details>
<details>
<summary>components/src/dynamo/planner/tests/integration/test_external_plugin_e2e.py (1)</summary><blockquote>

`71-76`: _⚡ Quick win_

**Use an integration-type marker here instead of `unit`.**

This module opens real gRPC servers and exercises network transport end-to-end, so marking it as `unit` will misclassify it in marker-driven test selection. Swap the type marker to `integration` (or `e2e`, if that is the intended bucket).




<details>
<summary>♻️ Suggested marker change</summary>

```diff
 pytestmark = [
     pytest.mark.gpu_0,
     pytest.mark.pre_merge,
-    pytest.mark.unit,
+    pytest.mark.integration,
     pytest.mark.planner,
 ]
```
</details>

As per coding guidelines, `**/tests/**/*.py`: “Each test must include at least one scheduling marker (`pre_merge`/`post_merge`/etc.), one GPU marker (`gpu_0`/etc.), and one type marker (`unit`/`integration`/`e2e`).”

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/tests/integration/test_external_plugin_e2e.py`
around lines 71 - 76, The pytest markers list assigned to pytestmark currently
includes the wrong test type; update the pytestmark definition in this module
(the pytestmark variable) to replace the 'unit' marker with the appropriate
integration-type marker (e.g., 'integration' or 'e2e') so the tests that
exercise real gRPC servers are classified correctly while keeping the existing
'pytest.mark.gpu_0' and 'pytest.mark.pre_merge' markers.
```

</details>

</blockquote></details>
<details>
<summary>components/src/dynamo/planner/tests/plugins/orchestrator/test_pipeline.py (1)</summary><blockquote>

`400-403`: _⚡ Quick win_

**Move these imports to module scope.**

Line 402 and Line 579 both introduce imports inside test functions, which this repo’s Python guidelines explicitly disallow.



<details>
<summary>Suggested fix</summary>

```diff
 import ast
+import asyncio
 import pathlib
 
 import pytest
 
+from dynamo.planner.plugins.orchestrator import pipeline as _pipeline_module
 from dynamo.planner.plugins.merge.types import ComponentKey
 from dynamo.planner.plugins.types import (
     AcceptResult,
@@
 `@pytest.mark.asyncio`
 async def test_whole_tick_timeout_returns_skip_tick_timeout(ctx_factory):
-    import asyncio
-
     ctx = ctx_factory(tick_max_duration_seconds=0.05)
@@
 def test_pipeline_py_has_no_stage_level_wait_for():
@@
-    from dynamo.planner.plugins.orchestrator import pipeline as _pipeline_module
-
     source_path = pathlib.Path(_pipeline_module.__file__)
```
</details>
As per coding guidelines, "Imports must be at module top; flag any import inside functions/methods/classes."


Also applies to: 579-580

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/tests/plugins/orchestrator/test_pipeline.py`
around lines 400 - 403, The tests currently perform local imports (e.g.,
importing asyncio inside test_whole_tick_timeout_returns_skip_tick_timeout and
the import at the other test around lines 579-580); move these imports to the
module top so they are module-scope imports (add "import asyncio" and the other
module(s) to the file header), update any duplicate/unused imports accordingly,
and run tests to ensure no circular import issues—look for the functions
test_whole_tick_timeout_returns_skip_tick_timeout and the test around line ~579
to locate the offending in-function imports.
```

</details>

</blockquote></details>
<details>
<summary>components/src/dynamo/planner/tests/plugins/registry/test_config.py (1)</summary><blockquote>

`107-112`: _⚡ Quick win_

**Hoist these imports to module scope.**

`RegisterRequest` and `TransportConfig` are regular dependencies in this file, so importing them inside the test just adds noise and violates the repo’s Python import rule.

<details>
<summary>♻️ Proposed fix</summary>

```diff
 from pydantic import ValidationError
 
 from dynamo.planner.plugins.clock import VirtualClock
 from dynamo.planner.plugins.registry.auth import (
@@
 from dynamo.planner.plugins.registry.config import (
     AuthConfig,
     InProcessPluginSpec,
     PluginRegistrationConfig,
     build_auth_validator,
     build_registry_from_config,
 )
 from dynamo.planner.plugins.registry.server import PluginRegistryServer
+from dynamo.planner.plugins.transport.config import TransportConfig
+from dynamo.planner.plugins.types import RegisterRequest
@@
 `@pytest.mark.asyncio`
 async def test_build_registry_propagates_protocol_versions():
-    from dynamo.planner.plugins.types import RegisterRequest
-
-    from dynamo.planner.plugins.transport.config import TransportConfig
-
     config = PluginRegistrationConfig(
```
</details>

  

As per coding guidelines, "Imports must be at module top; flag any import inside functions/methods/classes."

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/tests/plugins/registry/test_config.py` around
lines 107 - 112, The test function
test_build_registry_propagates_protocol_versions currently imports
RegisterRequest and TransportConfig inside the function; move those imports to
the module scope (top of the file) so RegisterRequest (from
dynamo.planner.plugins.types) and TransportConfig (from
dynamo.planner.plugins.transport.config) are imported alongside other top-level
imports and removed from inside the test body.
```

</details>

</blockquote></details>
<details>
<summary>components/src/dynamo/planner/tests/plugins/registry/test_server.py (1)</summary><blockquote>

`411-436`: _⚡ Quick win_

**Keep `CircuitState` in the module import block.**

Both tests locally import the same symbol from a module that’s already imported at the top of the file. Hoisting `CircuitState` once keeps this module aligned with the repo’s import rule.

<details>
<summary>♻️ Proposed fix</summary>

```diff
 from dynamo.planner.plugins.types import (
+    CircuitState,
     HoldPolicy,
     ListPluginsRequest,
     RegisterRequest,
 )
@@
 async def test_register_resets_circuit_breaker_for_plugin_id():
@@
-    from dynamo.planner.plugins.types import CircuitState
-
     assert cb.state("p1") == CircuitState.CLOSED
@@
 async def test_unregister_resets_circuit_breaker():
@@
-    from dynamo.planner.plugins.types import CircuitState
-
     assert cb.state("p1") == CircuitState.CLOSED
```
</details>

  

As per coding guidelines, "Imports must be at module top; flag any import inside functions/methods/classes."

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/tests/plugins/registry/test_server.py` around
lines 411 - 436, Move the inline imports of CircuitState out of the test
functions and into the module-level import block: remove the in-function lines
"from dynamo.planner.plugins.types import CircuitState" in
test_register_resets_circuit_breaker_for_plugin_id and
test_unregister_resets_circuit_breaker, and add a single top-level import for
CircuitState with the other imports so the tests use CircuitState directly when
asserting cb.state("p1") after server.register(_req()) and
server.unregister("p1").
```

</details>

</blockquote></details>
<details>
<summary>components/src/dynamo/planner/tests/plugins/registry/test_gateway.py (1)</summary><blockquote>

`106-120`: _⚡ Quick win_

**Hoist `proto_to_pydantic` out of the helper.**

This import is neither optional nor cycle-guarded, so keeping it inside `_register()` just obscures a normal module dependency.

<details>
<summary>♻️ Proposed fix</summary>

```diff
 import grpc
 import pytest
 
 from dynamo.planner.plugins.clock import VirtualClock
+from dynamo.planner.plugins._proto_bridge import proto_to_pydantic
 from dynamo.planner.plugins.proto.v1 import plugin_pb2 as pb
@@
 async def _register(server, plugin_id="p1", auth_token="A"):
@@
-    from dynamo.planner.plugins._proto_bridge import proto_to_pydantic
-
     resp = await server.register(proto_to_pydantic(req))
```
</details>

  

As per coding guidelines, "Imports must be at module top; flag any import inside functions/methods/classes."

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/planner/tests/plugins/registry/test_gateway.py` around
lines 106 - 120, The test helper _register contains a local import of
proto_to_pydantic; move the import statement for proto_to_pydantic from
dynamo.planner.plugins._proto_bridge to the module top of test_gateway.py and
remove the in-function import so the _register coroutine directly calls
proto_to_pydantic; ensure no other function-local imports remain and run tests
to verify no import cycles are introduced.
```

</details>

</blockquote></details>
<details>
<summary>components/src/dynamo/planner/tests/plugins/registry/test_external_bootstrap.py (1)</summary><blockquote>

`129-145`: _⚡ Quick win_

**Move `ValidationError` to the module imports.**

These two tests pull in the same non-optional import locally. Hoisting it once at the top keeps the file consistent with the repo’s Python import rule and removes duplication.

<details>
<summary>♻️ Proposed fix</summary>

```diff
 import pytest
+from pydantic import ValidationError
 
 from dynamo.planner.config.planner_config import ExternalPluginEntry
@@
 def test_entry_rejects_unknown_plugin_type():
@@
-    from pydantic import ValidationError
-
     with pytest.raises(ValidationError):
@@
 def test_entry_rejects_empty_endpoint():
-    from pydantic import ValidationError
-
     with pytest.raises(ValidationError):
```
</details>

  

As per coding guidelines, "Imports must be at module top; flag any import inside functions/methods/classes."

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@components/src/dynamo/planner/tests/plugins/registry/test_external_bootstrap.py`
around lines 129 - 145, The tests test_entry_rejects_unknown_plugin_type and
test_entry_rejects_empty_endpoint import ValidationError inside the test bodies;
move the from pydantic import ValidationError statement to the module-level
imports at the top of the file so both tests (and any others) reuse the single
import, keeping imports consistent with the repo rule; update the file-level
import block and remove the local imports inside those test functions that
reference ValidationError (used with ExternalPluginEntry).
```

</details>

</blockquote></details>

</blockquote></details>

<!-- This is an auto-generated comment by CodeRabbit for review status -->

Comment thread components/src/dynamo/planner/config/planner_config.py
Comment thread components/src/dynamo/planner/offline/replay_adapter.py Outdated
Comment thread components/src/dynamo/planner/plugins/_proto_bridge.py Outdated
Comment thread components/src/dynamo/planner/plugins/_proto_bridge.py Outdated
Comment thread components/src/dynamo/planner/plugins/merge/type_aware.py Outdated
Comment thread components/src/dynamo/planner/plugins/registry/gateway.py
Comment thread components/src/dynamo/planner/plugins/registry/server.py Outdated
Comment thread components/src/dynamo/planner/plugins/transport/_grpc_base.py
Comment thread components/src/dynamo/planner/plugins/transport/_grpc_base.py Outdated
Comment thread components/src/dynamo/planner/plugins/transport/config.py
@kangclzjc kangclzjc force-pushed the plugin_frame_kang_pr1 branch from 2068db5 to 9bfe27c Compare May 29, 2026 13:02
kangclzjc added a commit to kangclzjc/dynamo that referenced this pull request May 30, 2026
Addresses CodeRabbit comments on PR ai-dynamo#10124:

- ``plugins/_proto_bridge.py``: ``base64``, ``IntEnum``, ``typing``
  were imported inside ``_normalize()`` and ``_decode_bytes_by_pyd_schema()``.
- ``plugins/merge/type_aware.py:216-217``: ``PluginResult`` and
  ``OverrideResult`` were re-imported locally inside
  ``_find_plugin_id_for_target()`` even though both are already at the
  module top — the "avoid cycle" comment was stale.
- ``offline/replay_adapter.py``: ``OrchestratorEngineAdapter`` was
  imported conditionally inside ``__init__`` only on the orchestrator
  path. The import is safe at module top (no cycle); the lazy import
  saved a small amount of startup memory on the PSM-only replay path
  but violated the coding guideline that imports live at module top.

No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Kang Zhang <kangz@nvidia.com>
kangclzjc added a commit to kangclzjc/dynamo that referenced this pull request May 30, 2026
…alPluginEntry

Addresses CodeRabbit comment on PR ai-dynamo#10124:

``SchedulingConfig`` already forbade extra keys, but ``GatewayConfig``
and ``ExternalPluginEntry`` accepted them silently. A typo like
``lsiten`` (instead of ``listen``) or ``auth_tokn`` (instead of
``auth_token``) would be silently ignored and the field would
validate using its default — masking config-time mistakes.

Add ``model_config = ConfigDict(extra="forbid")`` to both classes so
typos surface as ``ValidationError`` at config load.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Kang Zhang <kangz@nvidia.com>
kangclzjc added a commit to kangclzjc/dynamo that referenced this pull request May 30, 2026
…andlers

Addresses CodeRabbit comments on PR ai-dynamo#10124:

``LocalPlannerOrchestrator.register_external_from_config`` and
``LocalPlannerOrchestrator.bootstrap_plugins`` both wrap their
inner ``await`` in a defensive ``except Exception`` that logs and
keeps the loop running. On Python <3.12 ``asyncio.CancelledError``
is a subclass of ``Exception`` (changed to ``BaseException`` only
in 3.12), so the broad handler swallows cancellation and prevents
the task from unwinding.

Add an explicit ``except asyncio.CancelledError: raise`` ahead of
each broad handler so cancellation always propagates regardless of
Python version. Also adds the missing ``import asyncio`` at module
top.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Kang Zhang <kangz@nvidia.com>
kangclzjc added a commit to kangclzjc/dynamo that referenced this pull request May 30, 2026
Addresses 5 CodeRabbit comments on PR ai-dynamo#10124:

- ``auth/static_secret.py``: empty-subject ``ValueError`` no longer
  includes ``secret[:4]`` — that's a needless secret-prefix leak
  in startup logs / config-validation surfaces. The error now
  references the entry by index instead.
- ``circuit_breaker.py``: ``_fan_out_open`` wraps each ``on_open``
  callback in ``try/except`` so one bad observer cannot stop the
  remaining callbacks from firing. record_failure runs on the
  failure path; one buggy scheduler listener turning a per-plugin
  blip into a registry-wide failure was the risk.
- ``gateway.py``: dropped the dead ``raise  # unreachable`` lines
  after ``await context.abort(...)``. ``context.abort()`` already
  raises ``AbortError``; the bare ``raise`` with no active
  exception would itself raise ``RuntimeError: No active
  exception to re-raise`` if control ever fell through.
- ``gateway.py``: ``start_gateway_server`` now checks the return
  of ``add_insecure_port`` / ``add_secure_port`` — ``0`` indicates
  a bind failure (port in use, bad address, etc.). Fail fast with
  a clear ``RuntimeError`` BEFORE ``await grpc_server.start()`` so
  operators get a real error instead of a silently-running
  gateway that accepts no connections.
- ``server.py``: protocol version range check now uses
  ``packaging.version.Version`` instead of raw string compare. A
  plain ``str`` compare puts ``"1.10" < "1.2"`` (lexicographic on
  the second char) which would mis-reject valid plugins once any
  component reached 10. Malformed version strings now reject with
  a distinct ``protocol_version_malformed`` reason.

New tests:

- ``test_static_secret::test_empty_subject_error_does_not_leak_secret_bytes``
- ``test_circuit_breaker::test_on_open_callback_failure_does_not_skip_remaining_callbacks``
- ``test_gateway::test_start_gateway_server_raises_when_port_zero``
- ``test_server::test_protocol_version_semantic_compare_not_lexicographic``
- ``test_server::test_protocol_version_malformed_rejected_clearly``

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Kang Zhang <kangz@nvidia.com>
kangclzjc added a commit to kangclzjc/dynamo that referenced this pull request May 30, 2026
…, plumb gRPC channel knobs

Addresses 3 CodeRabbit comments on PR ai-dynamo#10124:

- ``_grpc_base.call()``: ``pydantic_to_proto(request)`` previously ran
  *before* the protected ``try``. An unmapped Pydantic request type
  would leak a raw ``KeyError`` to callers instead of the documented
  ``PluginCallError`` subclass. Move the conversion inside ``try``
  and map ``KeyError → PluginSerializationError``.

- ``_grpc_base.close()``: the ``except Exception: pass`` swallowed
  channel-close failures without trace. ``close()`` must stay
  idempotent (it's the planner shutdown path; one buggy plugin
  can't stall the rest), so we keep swallowing the raise — but now
  ``log.warning(...)`` surfaces the failure for postmortem.

- ``TransportConfig.keepalive_time_ms`` and
  ``TransportConfig.max_message_size_bytes`` were exposed on the
  config surface but the factory + GrpcTransport ignored them
  entirely; operator-supplied values were silently dropped.
  Plumb both through:
    ``make_transport_for_endpoint``
      → ``GrpcTransport.__init__`` (new kwargs)
        → ``_GrpcTransportBase.__init__`` (new kwargs)
          → ``_build_channel`` → ``grpc_channel_options(...)``
  ``grpc_channel_options`` becomes a parameterised builder so the
  ``grpc.keepalive_time_ms`` / ``grpc.max_*_message_length`` tuple
  entries honour the user-supplied values.

New tests:

- ``test_config::test_factory_propagates_grpc_channel_knobs`` — round-
  trip the config values through the factory and assert they land
  on the transport instance.
- ``test_config::test_grpc_channel_options_honours_kwargs`` — direct
  call to the options builder, asserting the right gRPC channel
  option strings end up with the supplied integers.

Full planner suite: 827 passed, 1 skipped, 0 failed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Kang Zhang <kangz@nvidia.com>
kangclzjc added a commit to kangclzjc/dynamo that referenced this pull request May 30, 2026
…empty

Addresses CodeRabbit comment on PR ai-dynamo#10124:

``repeated string needs = 8`` in ``RegisterRequest`` had a comment
promising distinct semantics for "empty" vs "unset":

  Empty = "no PipelineContext fields needed";
  unset (length 0 with default) = "send full context"

proto3 does not track presence for ``repeated`` fields — an omitted
field and one explicitly set to an empty list serialise/deserialise
identically, and the generated Python returns an empty container in
both cases. The contract as written is unimplementable for cross-
language clients.

Rewrite the comment to match the wire reality: empty and unset both
mean "send full context" (the safe default). When the field is non-
empty the orchestrator MAY trim the context. No runtime behaviour
change — the orchestrator already treats both the same.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Kang Zhang <kangz@nvidia.com>
kangclzjc added a commit to kangclzjc/dynamo that referenced this pull request May 31, 2026
Addresses CodeRabbit comments on PR ai-dynamo#10124:

- ``plugins/_proto_bridge.py``: ``base64``, ``IntEnum``, ``typing``
  were imported inside ``_normalize()`` and ``_decode_bytes_by_pyd_schema()``.
- ``plugins/merge/type_aware.py:216-217``: ``PluginResult`` and
  ``OverrideResult`` were re-imported locally inside
  ``_find_plugin_id_for_target()`` even though both are already at the
  module top — the "avoid cycle" comment was stale.
- ``offline/replay_adapter.py``: ``OrchestratorEngineAdapter`` was
  imported conditionally inside ``__init__`` only on the orchestrator
  path. The import is safe at module top (no cycle); the lazy import
  saved a small amount of startup memory on the PSM-only replay path
  but violated the coding guideline that imports live at module top.

No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Kang Zhang <kangz@nvidia.com>
kangclzjc added a commit to kangclzjc/dynamo that referenced this pull request May 31, 2026
…alPluginEntry

Addresses CodeRabbit comment on PR ai-dynamo#10124:

``SchedulingConfig`` already forbade extra keys, but ``GatewayConfig``
and ``ExternalPluginEntry`` accepted them silently. A typo like
``lsiten`` (instead of ``listen``) or ``auth_tokn`` (instead of
``auth_token``) would be silently ignored and the field would
validate using its default — masking config-time mistakes.

Add ``model_config = ConfigDict(extra="forbid")`` to both classes so
typos surface as ``ValidationError`` at config load.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Kang Zhang <kangz@nvidia.com>
kangclzjc added a commit to kangclzjc/dynamo that referenced this pull request May 31, 2026
…andlers

Addresses CodeRabbit comments on PR ai-dynamo#10124:

``LocalPlannerOrchestrator.register_external_from_config`` and
``LocalPlannerOrchestrator.bootstrap_plugins`` both wrap their
inner ``await`` in a defensive ``except Exception`` that logs and
keeps the loop running. On Python <3.12 ``asyncio.CancelledError``
is a subclass of ``Exception`` (changed to ``BaseException`` only
in 3.12), so the broad handler swallows cancellation and prevents
the task from unwinding.

Add an explicit ``except asyncio.CancelledError: raise`` ahead of
each broad handler so cancellation always propagates regardless of
Python version. Also adds the missing ``import asyncio`` at module
top.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Kang Zhang <kangz@nvidia.com>
kangclzjc and others added 29 commits June 4, 2026 16:00
Plugins registered milliseconds apart but with the same
``execution_interval_seconds`` must fire on the same pipeline tick.
Without phase alignment, plugin A registered at T=0 and plugin B at
T=0.003 (a typical async-bootstrap order skew) have throttles that
drift permanently 3ms apart — and a future ``requires_produced_fields``
dependency from B onto A would silently deadlock at every fire (B's
throttle is 3ms behind A's, so B sees A's predictions cleared from
the previous tick rather than the current one).

Fix: at registration time, snap ``RegisteredPlugin.registered_at`` to
the nearest scale_interval boundary using ``floor(now / scale) *
scale``.  Both plugins anchor to T=0, both fire at T=180 / 360 / ...
together.

``PluginRegistryServer`` gains a new optional kwarg
``scale_interval_seconds`` (default 0.0 = no alignment, preserves
legacy / PSM-path behaviour and existing test fixtures).  The
orchestrator path's ``OrchestratorEngineAdapter`` constructs the
server with ``scale_interval_seconds=config.scheduling
.scale_interval_seconds``.

Both the proto-driven ``register()`` and the in-process
``register_internal()`` paths route through the new
``_aligned_anchor()`` helper so alignment is uniform across all
register call sites.

New tests in ``test_phase_alignment.py`` (4):

- ``test_aligned_anchor_snaps_to_floor_boundary`` — direct unit test
  of the arithmetic (0.0/2.5/4.999 → 0.0; 5.0/7.4 → 5.0;
  180.6 → 180.0).
- ``test_disabled_when_scale_interval_zero`` — legacy bypass path
  (PSM tests + back-compat) returns raw clock value.
- ``test_two_plugins_same_interval_phase_aligned_after_register_skew``
  — the end-to-end story: 3ms apart at registration, both snap to
  T=0, same first-fire moment.
- ``test_alignment_preserved_when_registration_crosses_boundary``
  — correctness at boundary: 4.9 → 0 vs 5.1 → 5 is the *correct*
  semantic (they really are 5s apart in tick terms).

Full planner suite: 837 passed, 1 skipped, 0 failed.

Signed-off-by: Kang Zhang <kangz@nvidia.com>
Adds the second gate to ``PluginScheduler.compute_active_set``:
plugin only fires when (a) its execution_interval throttle is due AND
(b) every dot-path in ``requires_produced_fields`` resolves non-None
in the current ``PipelineContext``.

This implements the declarative dependency mechanism designed for the
scale_interval cadence model.  A throughput-style plugin declaring
``requires_produced_fields=["predictions"]`` will skip its turn
whenever the upstream predict stage didn't produce predictions on the
same tick — preventing stale-input fires that would silently violate
the plugin's own contract.

Implementation:

- ``compute_active_set`` signature gains optional ``ctx`` kwarg.  Passed
  from the two pipeline call sites (the fan-out runner and the
  predict-stage chain) so requires checks see the live ctx state at
  each stage transition.  Backward compat: ``ctx=None`` (legacy
  callers, PSM path, test fixtures) treats plugins with non-empty
  ``requires_produced_fields`` as "unsatisfied" → conservative skip.

- ``_requires_missing_field()``: walks ``requires_produced_fields``
  against ctx and returns the first failed dot-path (or None on
  satisfied).  First-fail short-circuit so the metric label captures
  the leading dependency.

- ``_ctx_get()``: small dot-path walker that returns None on any
  missing intermediate attribute (rather than raising).  Used by the
  requires check.

- New metric ``dynamo_planner_tick_requires_unsatisfied_total{plugin_id,
  missing_field}`` — increments per (plugin, first-missing-field) on
  every requires-gated skip.  Critical for diagnosing dependency
  cascades when an upstream plugin opens its circuit-breaker, all
  dependent plugins downstream record their gated skip under the
  missing-field label.

Twelve new tests in ``test_requires_produced_fields.py`` cover:

- ``_ctx_get`` dot-path walker (top-level / nested / missing
  intermediate / missing leaf / None ctx).
- ``compute_active_set`` gating: no requires fires unconditionally;
  satisfied requires fire; unsatisfied skip with no inherit; nested
  paths; multiple requires all-or-nothing; conservative skip when
  ctx=None and plugin declared requires.
- Metric emission: missing_field label is the first failed path;
  metric counter increments per skip.

Full planner suite: 849 passed, 1 skipped, 0 failed.

Signed-off-by: Kang Zhang <kangz@nvidia.com>
Replaces the PSM-mirror dual-cadence model in
``OrchestratorEngineAdapter`` with a single base interval.  Pipeline
ticks fire every ``SchedulingConfig.scale_interval_seconds`` from the
last tick moment.  Per-plugin cadence decisions live entirely in
``PluginScheduler._is_due`` (per ``RegisteredPlugin
.execution_interval_seconds``) — there is no more
``_next_load_s`` / ``_next_throughput_s`` reconciliation in the
adapter.

Behaviour change is observable but PSM-equivalent at the
*scaling-decision* level (locked by the test_decision_level_parity
test rewritten in the next commit, replacing the old
``test_g3_parity_via_adapter`` byte-identical guard).  Tick counts and
``ScheduledTick`` field shapes legitimately differ from PSM under the
new model — see /tmp/scale_interval_design.md §11 for the full
parity trade-off.

Engine-adapter delta:

- Add ``self._scale_interval`` + ``self._last_tick_s`` fields, sourced
  from ``config.scheduling.scale_interval_seconds`` (default 5.0).
- ``initial_tick`` now records ``_last_tick_s = start_s`` and returns
  a tick at ``start_s + scale_interval``.  Legacy
  ``_next_load_s`` / ``_next_throughput_s`` still set for the
  compatibility shim window.
- ``tick`` no longer gates ``_observe_fpm`` on the (now constantly
  True) ``run_load_scaling`` flag, and stops trying to advance two
  separate cadences from ``tick_input.now_s``.  ``_last_tick_s`` is
  the single source of truth for cadence advancement.
- ``_compute_next_scheduled_tick`` rewritten:
  * ``at_s = self._last_tick_s + self._scale_interval``
  * ``need_traffic_metrics`` is True iff some currently-registered
    plugin lists ``observations.traffic`` in its ``needs`` AND is
    due at the next tick — recovers PSM's lazy-pull cost profile
    (6 Prometheus queries / 180s in mixed mode, vs 216 if we always
    pulled).
  * ``traffic_metrics_duration_s`` = max declared
    ``observation_window_seconds`` across due traffic consumers
    (falls back to ``scale_interval`` when all due consumers
    declared 0.0 = "freshness equal to base cadence").
  * ``run_load_scaling`` / ``run_throughput_scaling`` flags kept on
    ScheduledTick for compatibility with PSM-path code and the
    diagnostics projection methods; under scale_interval both are
    always True (every tick is an opportunity for any plugin to
    fire, subject to its own throttle).
- ``_MERGE_TOLERANCE_S`` constant removed.  Plugin throttles are
  per-plugin independent; there is no second cadence stream to
  merge with the first.

Test surface delta in this commit:

- ``test_merge_tolerance_matches_psm_500ms_window`` replaced with
  ``test_pipeline_fires_at_scale_interval_cadence``.  The merge-
  tolerance concept doesn't apply to the new model — the test that
  locked it as a *contract* would assert behaviour the adapter no
  longer exhibits.  The new test locks the contract that DOES apply:
  ``next_tick.at_s == last_tick + scale_interval``, exactly.

Full planner suite: 849 passed, 1 skipped, 0 failed.

Signed-off-by: Kang Zhang <kangz@nvidia.com>
Locks two contracts the scale_interval rewrite introduced but didn't
have direct coverage for:

- ``test_scale_interval_advances_from_actual_tick_now`` — pipeline
  cadence advances from ``tick_input.now_s`` (the actual fire moment)
  rather than from a pre-computed schedule.  Same drift policy PSM
  uses; locking it as a contract makes scale_interval cadence
  accumulation match PSM-equivalent behaviour under wall-clock
  jitter.
- ``test_lazy_traffic_pull_skips_prometheus_when_no_plugin_needs_traffic``
  — when no registered plugin consumes ``observations.traffic``,
  ``ScheduledTick.need_traffic_metrics`` stays False.  This is the
  core Prometheus-load reduction promise of the scale_interval +
  per-plugin ``needs``-declaration model (recovers PSM's "skip the
  query on load-only ticks" behaviour without per-tick cadence type
  knowledge inside the adapter).

Also drops a stale code comment that referenced a
``test_g3_parity_via_adapter`` test which never landed in PR ai-dynamo#10124
— PSM-vs-orchestrator parity is asserted at decision level by the
existing ``test_tick_async_wraps_psm_on_tick_identically`` rather
than a dedicated orchestrator-side test.

Full planner suite: 851 passed, 1 skipped, 0 failed.

Signed-off-by: Kang Zhang <kangz@nvidia.com>
…rmat + ruff)

Mechanical formatting changes applied by the pre-commit hooks the
``pre-commit`` CI check enforces. No behaviour change.

Also: fix copyright header on ``plugin.proto`` to include "All rights
reserved." per the regex in ``.github/workflows/copyright-check.ps1``
(``copyright-checks`` CI was failing on this single mismatch).

Full planner suite: 851 passed, 1 skipped, 0 failed.

Signed-off-by: Kang Zhang <kangz@nvidia.com>
- Lazy-import grpc/protobuf transports so PSM-only deployments don't pull
  generated proto stubs at import time
- Catch broken predict-plugin exceptions in pipeline so one bad plugin
  no longer fails the whole tick (records circuit breaker failure +
  emits error metric instead)
- Wire FPM observations through msgpack into PipelineContext
- Strip TYPE_CHECKING wrappers from imports CodeQL was flagging as
  unused (`TrafficObservation`, `PluginFrameworkMetrics`,
  `WorkerCapabilities`) — annotations now reference the symbols at
  module top instead of through `Optional["..."]` strings
- Drop redundant `_result = await task; del _result` workaround in
  cancelled-sleeper test
- Various small fixups identified by review: empty-`except KeyError`
  rationale, scheduler `_requires_missing_field` static helper,
  first-fire anchor on `registered_at`

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rides

CI runners that pip-install ai_dynamo without running grpc_tools.protoc
were importing the placeholder shim in plugin_pb2.py and hitting
"RuntimeError: stub missing" on every test that touched the proto
messages. Three coupled fixes:

- ship the generated `plugin_pb2.py`, `plugin_pb2_grpc.py`, `plugin_pb2.pyi`
  in git (negated against the repo-wide `*_pb2.py` gitignore).
  Test/build images no longer need grpcio-tools just to import the module.
- `requirements.planner.txt`: add `grpcio>=1.63.0` and
  `protobuf>=5.29.5,<7.0.0` — the planner.Dockerfile doesn't install
  `requirements.common.txt`, so the orchestrator gRPC transport had
  no runtime to bind against.
- `pyproject.toml`: add `grpc` / `grpc.*` / `google.protobuf` /
  `google.protobuf.*` to the mypy `ignore_missing_imports` block.
  These libs ship without `py.typed` markers; mypy was emitting
  `[import-untyped]` on every transport file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three concrete fixes; one consistency sweep.

CRITICAL:
- proto/v1/README.md + proto/v1/__init__.py docstring + "Adding a new
  stage" workflow all said the generated stubs are gitignored and
  regenerated at install time. After the build infra fix they're
  shipped in git — flip the description to match.
- examples/external_plugin/README.md + reference_runner.py docstring
  referenced ``tests/manual/ext-4stage.yaml`` (and a "K8s smoke
  fixtures" section that doesn't exist) as the K8s deployment shape.
  Replace with a description of the intended Pod-per-stage layout and
  note the actual ready-to-apply fixture is deferred to a follow-up.
  The shipped integration test ``test_external_plugin_e2e.py`` runs
  the same binary as a subprocess; reference that instead.
- tests/manual/README.md + disagg_8b_planner_orchestrator.yaml +
  engine_adapter.py:532 all pointed at
  ``tests/integration/test_dual_path_parity.py`` as the dual-path
  parity lock. That file was never shipped; the actual decision-level
  + cadence + tick-merge parity is locked by
  ``tests/plugins/orchestrator/test_engine_adapter.py``. Same fix for
  the stale ``test_replica_calculation.py`` reference — point at
  ``test_load_based_scaling.py`` + ``test_state_machine.py`` which
  contain the actual replica-math tests.

NIT sweep:
- Drop "DEP-XXXX" placeholder from proto + transport README headers
- Drop the forward-looking ``executor_max_workers <= 8`` reference in
  transport README — no such field exists in any shipped PR

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three findings from independent review on PR ai-dynamo#10124.

1. engine_adapter._compute_next_scheduled_tick mixed wall-clock and
   monotonic time when deciding whether to do a lazy Prometheus pull.
   ``self._last_tick_s + scale_interval`` is wall-epoch (matches
   ``tick_input.now_s`` and ``ScheduledTick.at_s``), but
   ``PluginScheduler._is_due`` compares against
   ``RegisteredPlugin.last_call_at`` which is set by the pipeline via
   ``self._clock.monotonic()``.  In production ``WallClock`` deployments
   the projection is ~1.7e9 vs ~1e3 — every traffic-consuming plugin
   reads as always due, and the lazy-pull optimization silently
   degenerates to "pull every tick".  Replay was unaffected because
   ``VirtualClock`` is synced to ``tick_input.now_s`` at the top of
   ``tick()``.

   Fix: parallel field ``_last_tick_monotonic`` updated alongside
   ``_last_tick_s`` at every assignment site (init / prime / tick).
   ``_is_due`` is now called with the monotonic projection.  Wall
   epoch is preserved unchanged for ``ScheduledTick.at_s``.

   Regression test pins the precise scenario: a wall-vs-monotonic
   divergent clock with a plugin whose ``last_call_at`` is in
   monotonic domain — pre-fix the test asserts FAILS with
   "need_traffic_metrics should be False, was True".

2. authenticated_heartbeat / authenticated_unregister leaked plugin
   existence to gateway callers.  Returning ``(False, None)`` for
   unknown-plugin vs ``(False, "permission_denied")`` for
   wrong-subject mapped to distinct gRPC status codes in the gateway
   (200 OK ``HeartbeatResponse(ok=false)`` vs PERMISSION_DENIED) — any
   valid-token holder could enumerate registered plugin_ids by
   probing.  Same class of oracle as the previously-fixed
   StaticSecretAuth token-prefix leak, just shifted one layer up.

   Fix: collapse "unknown plugin" and "wrong subject" into a single
   ``(False, "permission_denied")`` response in both handlers.  The
   gateway already maps that uniformly to PERMISSION_DENIED.

3. ``register_internal`` plugins have ``auth_subject == ""`` by
   design — they are not supposed to be reachable via the gateway.
   The existing equality check ``plugin.auth_subject != identity.subject``
   relied on no AuthValidator ever returning ``subject=""``.  This is
   currently true (``StaticSecretAuth.__init__`` rejects empty;
   ``AllowUnauthenticatedAuth`` returns ``"anonymous"``) but data-
   shape is the only thing keeping it true.

   Fix: explicit ``not plugin.auth_subject`` guard in the same
   collapsed branch — any future auth backend that returns an empty
   subject still cannot operate on in-process plugins through the
   gateway.

Test updates:
- test_authenticated_heartbeat_unknown_plugin_returns_permission_denied
- test_authenticated_unregister_unknown_plugin_returns_permission_denied
  (renamed from ``_returns_false_no_reject`` — old expectation was the
  bug)
- test_authenticated_heartbeat_in_process_plugin_returns_permission_denied
- test_authenticated_unregister_in_process_plugin_not_reachable_via_gateway
- test_lazy_traffic_due_check_uses_monotonic_not_wall_epoch

346 plugin tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nfig paths

Independent review caught that ``RegisterRequest.requires_produced_fields``
and ``observation_window_seconds`` (the scale_interval cadence contract
fields added in commits 13-18) were only reachable via the gRPC gateway
self-register path.  The two ConfigMap-driven intake paths — static
external plugin list and in-process plugin spec — both used Pydantic
``extra="forbid"`` and silently dropped these fields.  Net result:

- ``throughput_propose`` declaring ``requires_produced_fields=["predictions"]``
  could not be configured via ConfigMap, only via gRPC self-register.
- ``observation_window_seconds=180`` (e.g. wanting a longer Prometheus
  aggregation window for stable averaging) was similarly unreachable.

Five wiring fixes (top-down following the call chain):

1. ``LocalPlannerOrchestrator.register_internal``: added the two new
   kwargs, passed through to ``PluginRegistryServer.register_internal``.
   Until now the orchestrator-level facade silently dropped them.

2. ``ExternalPluginEntry`` (planner_config.py): added Pydantic
   ``Field`` declarations for both new fields, with descriptions
   pointing at the cadence contract.

3. ``register_external_from_config`` (orchestrator.py): threaded both
   fields into the ``RegisterRequest`` construction.  Without this,
   #2 alone wouldn't reach the registry.

4. ``InProcessPluginSpec`` (registry/config.py): added ``needs``
   (also previously missing!) + the two new cadence fields.

5. ``load_in_process_plugins`` (in_process_loader.py): passed
   ``needs`` + ``requires_produced_fields`` + ``observation_window_seconds``
   into ``orchestrator.register_internal``.

Tests:
- ``test_loader_passes_scale_interval_fields_to_registered_plugin``:
  in-process loader passthrough end-to-end (spec → plugin).
- ``test_bootstrap_passes_scale_interval_fields_through``:
  static external bootstrap passthrough end-to-end.

348 plugin tests pass (+2 new).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…th prefix

``needs`` is defined in plugin.proto:55 as ``dot-paths into
PipelineContext`` — a plugin declaring ``needs=["observations.traffic.
num_req"]`` is signaling "I only consume num_req, you may trim the rest
from ctx" (the proto comment says ``the orchestrator MAY trim the
context to those fields to save wire / serialisation cost``).

But ``engine_adapter._compute_next_scheduled_tick`` used
``"observations.traffic" in p.needs`` — a list-membership exact-string
match.  Sub-path declarations slipped past this guard, so:

  needs=["observations.traffic.num_req"]
    → "observations.traffic" in needs == False
    → traffic_consumers_due excludes this plugin
    → ScheduledTick.need_traffic_metrics = False
    → orchestrator skips Prometheus query
    → ctx.observations.traffic = None at tick time
    → plugin gets None despite declaring a dot-path INTO that subtree

Same dot-path field elsewhere in this PR (``requires_produced_fields``)
already uses proper getattr-chain walking via ``_ctx_get``; this is a
consistency fix.

Fix: match the parent path ``observations.traffic`` exactly OR any
sub-path ``observations.traffic.<field>``.  Both require the
observation to be present.  The trailing ``.`` in the prefix-match is
load-bearing — it stops false-positive matches on a sibling field like
``observations.traffic_legacy``.

Test ``test_lazy_traffic_pull_matches_dot_path_sub_paths_of_observations_traffic``
locks all four cases:
  1. exact ``"observations.traffic"`` → pull triggered (existing path)
  2. sub-path ``"observations.traffic.num_req"`` → pull triggered (was broken)
  3. sibling ``"observations.traffic_legacy"`` → pull NOT triggered (prefix guard)
  4. unrelated ``"observations.fpm" / "predictions"`` → pull NOT triggered

Verified by reverting the fix: case 2 fails with the expected
AssertionError; restoring the fix makes all 4 cases pass.

349 plugin tests pass (+1 new).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PSM internal types ``TrafficObservation.kv_hit_rate`` and
``TickDiagnostics.predicted_kv_hit_rate`` have always carried KV cache
hit rate as part of the throughput-scaling input / output.  But the
wire proto and Pydantic mirrors that external plugins see did NOT
include these fields:

  TrafficObservation (PSM)        ── kv_hit_rate ✅
  TickDiagnostics    (PSM)        ── predicted_kv_hit_rate ✅
  proto TrafficMetrics  (wire)    ── kv_hit_rate ❌
  proto PredictionData  (wire)    ── predicted_kv_hit_rate ❌
  pyd  TrafficMetrics             ── kv_hit_rate ❌
  pyd  PredictionData             ── predicted_kv_hit_rate ❌

So an external ``throughput_propose`` plugin could not reproduce PSM
throughput behaviour — neither read ``ctx.observations.traffic.
kv_hit_rate`` (field absent on the proto-derived Pydantic model) nor
emit ``predicted_kv_hit_rate`` (same).  A genuine PSM-parity gap on
the plugin observation / prediction API.

Schema changes (proto3, additive — safe per the schema-evolution
policy in proto/v1/README.md):
- ``TrafficMetrics``: add ``optional float kv_hit_rate = 5``
- ``PredictionData``: add ``optional float predicted_kv_hit_rate = 5``

Both are ``optional`` to preserve field presence — distinguishes
"no datapoint" (unset) from "0.0 = all-cold cache" (set to 0.0), and
for the predicted field, "no opinion" from "I assert 0.0" under
chain_augment first-writer-wins partial-merge.

Wire updates:
- Pydantic mirror in ``plugins/types.py``: ``kv_hit_rate: Optional[float]``
  and ``predicted_kv_hit_rate: Optional[float]`` matching the existing
  predicted_* fields' Optional pattern.
- Regenerated ``plugin_pb2.py``, ``plugin_pb2_grpc.py``, ``plugin_pb2.pyi``
  via the documented protoc + SPDX-prepend recipe.

Adapter wiring:
- ``OrchestratorEngineAdapter._tick_input_to_context`` propagates
  ``TickInput.traffic.kv_hit_rate`` into ``ctx.observations.traffic
  .kv_hit_rate`` so external plugins actually see it.
- The predict-diagnostics block in ``tick()`` copies
  ``prediction.predicted_kv_hit_rate`` onto ``TickDiagnostics`` —
  symmetric with the existing predicted_num_req / _isl / _osl
  forwarding.

Test: ``test_kv_hit_rate_round_trip_traffic_and_prediction`` locks the
field-presence semantic in both directions on both messages; existing
``test_prediction_data_optional_unset_vs_zero`` extended to assert the
new field is also unset by default.

350 plugin tests pass (+1 new).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…urface

Reviewer P1 (PR ai-dynamo#10124): ``_project_scale_to`` collapses the merged
proposal's per-target ``component_name`` to a single replicas-per-type
scalar.  The collapse itself is fine for a single-planner single-pool
runtime — the problem is that proto/Pydantic still **advertise**
``component_name`` as if multi-pool addressing worked, while the
runtime silently drops it.

The single-planner runtime in this PR has exactly one ``WorkerInfo``
per ``sub_component_type``, populated from config at startup; there is
no path by which a plugin-emitted ``component_name`` could ever address
an alternate pool.  Multi-pool execution is hierarchical-planner
territory (separate planners per pool, with a router on top) and
lands in a follow-up PR.

Rather than paper over the mismatch with a "warn + drop" guardrail,
remove the API surface that has no runtime consumer:

- proto ``ComponentTarget``: ``reserved 2;`` (was
  ``optional string component_name``) — schema-evolution policy keeps
  the tag reserved so the hierarchical-planner PR can re-add at the
  same tag (or pick a new one) without breaking older recorded traces.
  Two proto comments referencing the field are likewise updated.
- ``plugin_pb2.py`` / ``plugin_pb2.pyi`` / ``plugin_pb2_grpc.py``:
  regenerated via the standard protoc + SPDX re-prepend recipe from
  ``proto/v1/README.md``.
- Pydantic ``ComponentTarget``: ``component_name`` field removed;
  docstring notes the deferred multi-pool semantic.
- ``ComponentKey``: ``component_name`` field removed (kept as a single-
  field dataclass rather than collapsing to ``str`` so the
  hierarchical-planner PR can re-add the per-pool axis without touching
  every call site).
- ``type_aware_merge`` / ``pipeline.py``: bucket key construction
  simplified to ``ComponentKey(sub_component_type=...)``; clamp
  counters drop the ``component_name`` Prometheus label.
- ``planner_metrics.py``: ``reconcile_clamped_total`` and
  ``constrain_capped_total`` label sets drop ``component_name`` (was
  always ``""`` in single-planner runtime anyway — the label was pure
  metric-cardinality bloat).
- ``core/types.py``: ``TickDiagnostics.plugin_overrides`` /
  ``reconcile_reasons`` doc-comment updated.

Tests:
- ``test_type_aware_basic.py``: drops ``test_component_name_creates_
  separate_buckets`` (the multi-pool independence case it was
  validating is no longer in the data model).  ``POOL_A`` / ``POOL_B``
  fixtures + ``_ct`` factory ``component_name`` parameter removed.
- ``test_type_aware_clamp_tracking.py``: ``"worker_a"`` / ``"worker_b"``
  literals removed from ``PREFILL`` / ``DECODE`` fixtures and the
  inline ``ComponentTarget`` construction.
- ``test_type_aware_worked_examples.py``: drops the
  ``hierarchical_pools`` row (8 cases now, was 9); tripwire updated;
  ``CT()`` / ``key()`` helpers no longer take ``component_name``.
- ``test_type_aware_constrain.py`` / ``test_type_aware_short_circuit.py``:
  ``_ct(...component_name=None)`` default removed (default was the
  no-name case anyway, no behavior change).
- ``test_pipeline_metrics.py``: ``component_name="worker"`` dropped
  everywhere — was a phantom label, never functionally needed.
- ``test_round_trip.py``: drops ``test_component_target_with_pool_name``
  + the ``HasField("component_name")`` assertion (proto field no
  longer exists to query).
- ``test_transport_contract.py``: drops the ``multi_pool`` parametrize
  fixture (used ``ComponentTarget.component_name`` on wire).
- ``test_plugin_framework_metrics.py``: drops ``component_name=...``
  from the ``Counter.labels(...)`` calls so the label set matches the
  declared metric.

Forward compat (hierarchical-planner PR re-adds):
- Pick proto tag 2 (reserved here) or a fresh tag; proto3 ``optional``
  semantic keeps it non-breaking for older clients.
- Re-add the ``component_name`` axis to ``ComponentKey`` /
  ``ComponentTarget`` / ``TargetReplica`` projection / Prometheus
  labels in lock-step.

819 -> 813 planner tests pass (1 pre-existing skip; 6 deleted tests
matched the deleted forward-compat surface).  No K8s smoke needed —
pure wire-format + internal-bucket layer; execution path is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… FPM docs

Two P2 review threads, both proto-touching, bundled into one commit.

**P2-1: WorkerState scaling-in-progress flags missing**

``WorkerCounts`` (core/types.py:62-63) already carries
``prefill_scaling_in_progress`` / ``decode_scaling_in_progress`` and
PSM's state_machine.py:336-345 uses them to suppress further scale-up
while a previous request is mid-flight and to explain "held" decisions
in audit logs.  ``WorkerState`` (proto + Pydantic mirror) only exposed
ready/expected counts, so an external load-scaling plugin replicating
PSM behaviour could not reach the same decisions through the public
``PipelineContext.observations.workers`` channel.

Add two ``optional bool`` fields at proto tags 5 / 6:
``prefill_scaling_in_progress`` and ``decode_scaling_in_progress``.
``optional`` (proto3 presence) lets a plugin distinguish "connector
did not report this tick" from an explicit ``false`` value — same
contract as the existing optional ints.  Pydantic mirror + Pydantic-
side round-trip test (unset / mixed True+False) added.

Threaded through ``OrchestratorEngineAdapter._tick_input_to_context``;
PSM-side WorkerCounts already populates these from
``base.py:803-804``, so this is wire-level exposure of an already-
computed signal.

**P2-6: FPM API docstrings stale**

The adapter wires FPM observations into
``PipelineContext.observations.fpm`` (engine_adapter.py:797 +
``_encode_fpm`` helper) using msgspec/msgpack, but three docstrings
still claim FPM is "reserved for a follow-up PR" / "currently
unpopulated":

- proto ``FpmData`` block comment
- Pydantic ``FpmData`` docstring
- engine_adapter module docstring (responsibility 2)

Refresh all three to describe the actual contract: per-engine map keyed
by ``"<worker_id>/<dp_rank>"``, msgpack-encoded ``ForwardPassMetrics``,
``ObservationData.fpm`` absent (not empty-but-set) when no engines
reported.  Plugin authors reading the proto/Pydantic source now see
the real shape they can consume.

814 planner tests pass (was 813; +1 for the new round-trip test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…protocol_versions on orchestrator path

Two registry-side P2 review threads.

**P2-2: observation_window_seconds validation documented but not enforced**

``plugin.proto:107`` promises:

  Constraint: must be 0 OR a positive multiple of
  ``SchedulingConfig.scale_interval_seconds`` so windows align to tick
  boundaries. Validator rejects otherwise.

The server stored ``req.observation_window_seconds`` verbatim without
checking — a plugin author misconfiguring the field saw their plugin
accepted and silently drove misaligned Prometheus ``[Ns]`` queries
(window spanning tick boundaries) that report moving-target values.

Add ``_check_observation_window()`` helper (module-level so it's unit-
testable in isolation) and a step 2.5 in ``register()`` between
``protocol_version`` and the duplicate check.  Accepts:
  - ``window == 0.0`` (per-tick freshness, the safe default)
  - ``window == k * scale_interval`` for any positive integer k
    (1e-6 tolerance absorbs float round-trip noise from YAML → Pydantic →
    proto float32 → wire)

Rejects anything else with ``observation_window_misaligned``,
mirroring the existing ``protocol_version_unsupported`` reject style.
When ``scale_interval_seconds == 0.0`` (PSM-path constructions, no
alignment to enforce), the check is a no-op — accept any value.
5 new tests cover (zero / multiple / non-multiple / negative /
unverifiable-when-no-scale-interval).

**P2-3: orchestrator path ignores configured protocol_version_min/max**

``planner_config.PluginRegistrationConfig`` exposes
``protocol_version_min`` / ``protocol_version_max`` and
``build_registry_from_config`` passes them as ``protocol_versions=(min,
max)`` to ``PluginRegistryServer``.  The orchestrator path
(``OrchestratorEngineAdapter.__init__`` constructing the server
directly) was passing only ``scale_interval_seconds`` and inheriting
the server's default ``("1.0", "1.0")`` — so config knob was dead
on the live orchestrator path.

Pass ``protocol_versions=(config.plugin_registration.protocol_version_min,
config.plugin_registration.protocol_version_max)`` at construction
time.  Test introspects the constructed server's ``_protocol_min`` /
``_protocol_max`` to confirm a non-default range (``"1.0"`` →
``"1.5"``) propagates through.

820 planner tests pass (was 814; +6 for the new tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…otstrap fan-out

Reviewer P2-4: ``OrchestratorEngineAdapter.bootstrap_plugins`` dispatched
``_orchestrator.bootstrap_plugins(historical_traffic=...)`` **first**
and only then ``_wire_external_plugins_from_config()``.  Static
external plugins declared in ``scheduling.external_plugins`` (a
supported W1 path; see ``examples/external_plugin/``) registered into
an already-bootstrapped registry — they never received the warm pass
or the Bootstrap RPC, so external implementations of e.g.
load-predictor that need historical traffic to fit a regression saw
an empty seed.

Swap the order: ``wire externals → bootstrap → gateway``.  Now
``_orchestrator.bootstrap_plugins`` iterates the registry with the
static externals already present and the Bootstrap fan-out covers
them too.

Gateway still opens last so dynamically registered plugins arriving
via the network can't race the bootstrap fan-out — those plugins
intentionally don't get ``historical_traffic`` (Bootstrap has already
moved past them; that's the documented contract for dynamic
registration).

Regression test pins the new ordering via monkey-patched call
records on the three adapter methods.

821 planner tests pass (was 820; +1 for the new ordering test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…Diagnostics

Reviewer P2-5: ``PipelineOutcome`` carries three execute-time fields
(``execute_action``, ``short_circuit_reason``, ``audit_events``) that
the engine adapter dropped — ``PlannerEffects.diagnostics`` only
projected prediction / load / throughput fields.  Operators could see
the action via Prometheus (``tick_skip_reasons_total{reason=...}`` etc.)
but in-process consumers — the replay adapter, the diagnostics
recorder writing HTML reports — could not distinguish ``apply`` from
``skip_short_circuit`` / ``skip_no_targets`` / ``skip_tick_timeout``
without scraping metrics.

Add three fields to ``TickDiagnostics``:

- ``execute_action: Optional[str]`` — mirrors
  ``PipelineOutcome.execute_action``; ``None`` on the PSM path.
- ``short_circuit_reason: str`` — mirrors the pipeline counterpart;
  empty string when not short-circuited.
- ``audit_events: list[str]`` — mirrors
  ``PipelineOutcome.audit_events`` (chain-augment warnings, CONSTRAIN
  SET drops, etc.).  Empty list on the PSM path.

Engine adapter copies all three from the pipeline outcome after the
prediction projection.  PSM path leaves them at defaults (which is
the documented "not available on this path" semantic per the comment
block already on ``TickDiagnostics``).

Regression test mocks ``orchestrator.tick`` to return a known
``skip_short_circuit`` outcome and asserts all three fields surface
on the resulting ``PlannerEffects.diagnostics``.

822 planner tests pass (was 821; +1 for the diagnostics propagation test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…existent add_observation → crash)

P1 from independent review. ``OrchestratorEngineAdapter._observe_fpm``
and ``ReplayPlannerAdapter._feed_extra_fpm_to_regression`` called
``reg.add_observation(fpm)`` (singular) on the regression models. Those
slots hold ``PlannerEnginePerfModel`` (state_machine.py:75/82/88 build
them; the orchestrator installs the same type via install_regressions),
which has **no** singular ``add_observation`` — it exposes only
``add_observations(dict[(worker_id, dp_rank) -> FPM])``. The singular
call raises ``AttributeError`` and crashes the tick.

Confirmed three ways: runtime ``hasattr(PlannerEnginePerfModel,
'add_observation') is False``; the existing test
``test_rust_perf_adapter.py:275`` already asserts that method does not
exist; and PSM's own ``_observe_fpm`` (state_machine.py:357/362/365)
correctly uses the plural ``add_observations``. The singular
``add_observation`` belongs to a *different*, production-unused class
family (``_BaseRegressionModel`` → Agg/Prefill/DecodeRegressionModel,
instantiated only in test_load_based_scaling.py) — same name minus an
's', which is why it looked correct statically.

Reachability:
- Default replay path (use_orchestrator=False, SLA mode): crashes on any
  tick carrying >1 FPM snapshot per worker.
- Orchestrator runtime (production K8s, SLA mode): crashes on the first
  load tick once a regression is installed and FPM flows.
Both dodged by easy-mode (the ``not is_easy`` guard) and by the inner
``if obs.decode:`` guard when no FPM subscriber delivered data — which is
exactly why neither the 822-test suite nor the K8s smoke (run without a
live FPM stream into an SLA regression) caught it.

Fix:
- engine_adapter._observe_fpm: hand ``obs.prefill`` / ``obs.decode`` (already
  the right dict) straight to ``add_observations`` — line-for-line PSM mirror.
- replay_adapter._feed_extra_fpm_to_regression (3 sites): wrap each
  non-excluded snapshot as ``add_observations({(worker_id, dp_rank): fpm})``,
  preserving the per-snapshot feed semantics.

Tests (close the combined unit+smoke blind spot — no test previously
exercised SLA + non-empty FpmObservations + installed PlannerEnginePerfModel):
- test_engine_adapter: agg + disagg ``_observe_fpm`` against a real
  PSM-built regression, asserting no AttributeError.
- test_replay_adapter_fpm (new): ``_feed_extra_fpm_to_regression`` against a
  real PSM regression with 2 snapshots/worker so the non-excluded one feeds.
Both verified to fail on the pre-fix code and pass after.

825 planner tests pass (was 822; +3).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…review ai-dynamo#9)

``type_aware_merge`` documents that keys present only in ``baseline`` still
appear in the output so downstream stages see a complete proposal. The
bucket-merge path honors this; the ``final=True`` path returned the winning
plugin's targets verbatim and never consulted ``baseline``, so a final
plugin emitting only ``SET prefill=N`` dropped ``decode`` from the proposal.

In the real pipeline this is masked (``final`` is forced False at CONSTRAIN,
and ``_proposal_to_baseline`` rebuilds each next stage's baseline from the
full worker-count fallback), so no production scaling decision changed. But
a RECONCILE plugin reading ``ctx.proposal`` saw an incomplete proposal, and
the divergence was a latent foot-gun if the stage wiring ever changed.

Fix: after assembling the final winner's targets (and dropping SET under
set_allowed=False), fold in any baseline key the winner did not mention,
matching the bucket path. Updated the CONSTRAIN-final test
(test_final_in_constrain_drops_set_but_final_still_applied): a dropped
SET-prefill now reappears via baseline passthrough at the current value,
which is exactly what the bucket path already does for a dropped SET.

New test: test_final_path_passes_baseline_only_keys_through.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…oder reuse, doc fixes

Low-risk cleanup batch from the independent review (no decision-path change):

- ai-dynamo#4 chain_augment: add ``predicted_kv_hit_rate`` to ``_PREDICTION_FIELDS``
  so it participates in first-writer-wins partial merge like the other three
  predicted_* fields (was silently dropped in any 2+ plugin PREDICT chain,
  contradicting the proto/Pydantic contract). +2 chain_augment tests.
- #10 engine_adapter: add ``scale_down_capped_by_throughput`` to
  ``_aggregate_disagg_load_reason`` priority (PSM disagg emits it; placed
  between scale_up and scale_down to mirror PSM's _PRIORITY).
- ai-dynamo#11 dead code: drop ``contributing_plugin_ids`` (built, never read) in
  pipeline._run_fanout_stage; drop ``_set_enabled`` + ``_plugin_ids``
  (no caller in PR #1; would KeyError if reached).
- ai-dynamo#18 _encode_fpm: use the canonical
  ``dynamo.common.forward_pass_metrics.encode`` (shared module-level encoder)
  instead of allocating a fresh ``msgspec.msgpack.Encoder`` per tick and
  re-implementing the encoding. Byte-identical wire format; keeps FPM
  serialization in lock-step with the rest of dynamo.
- ai-dynamo#17 transport ABC docstring: timeout is enforced by the transport
  (``call()`` wraps ``asyncio.wait_for``), not the orchestrator — the
  pipeline uses a bare gather to avoid double-counting the deadline.
- ai-dynamo#20 scheduler docstring: note the heartbeat-eviction monitor is not wired
  in this PR (last_heartbeat_at is recorded but unread; monitor is follow-up).
- ai-dynamo#21 transport contract test: 7 inputs (not 8) → 14 cases (multi_pool fixture
  was removed with component_name; comments were stale).
- ai-dynamo#22 metrics test: remove the dead no-op ``pass`` loop in _sample_value.

828 planner tests pass (was 825; +3 chain-augment / merge tests).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…auge reset (review ai-dynamo#12/ai-dynamo#13/ai-dynamo#19)

Hot-path quality fixes — none change a scaling decision.

ai-dynamo#12 pipeline durations use Clock.monotonic(), not Clock.now()
  Clock's contract reserves now() for wall-clock timestamps and monotonic()
  for duration measurement. Six duration sites (predict latency, fan-out
  call latency, whole-tick duration) used now(); under WallClock a backward
  NTP step mid-tick distorted the latency/duration histograms. Switched all
  six to monotonic(). VirtualClock.monotonic() is synced to trace time in
  replay, so replay/test behavior is unchanged.

ai-dynamo#13 ProposeResult derives result_kind + enforces the oneof
  ProposeResult carries the same accept/override/reject oneof as the stage
  responses but, unlike them, had no model_post_init — so building it the
  natural way (override=...) left result_kind='' and the proto round-trip
  came back 'override', breaking round-trip equality; a two-payload oneof
  violation also went unchecked. Extracted the derive+validate logic into a
  shared _derive_result_kind() helper used by both _StageOneofResponse and
  ProposeResult. +round-trip test (derive + oneof-violation reject).

ai-dynamo#19 override_active gauge reset covers errored plugins
  _emit_override_active reset the gauge only for plugins in plugin_results;
  a plugin whose call raised is absent from that list, so a 1 it set on a
  prior tick lingered. Now reset every ATTEMPTED plugin id (triggered +
  inherited) before setting the contributors.

828 planner tests pass (+1 round-trip test).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
/ai-dynamo#8/ai-dynamo#15/ai-dynamo#16)

Test-only additions for the seams the review flagged as untested.

ai-dynamo#7 _project_scale_to with a real apply outcome (4 cases):
  both components changed → full ScalingDecision; both equal current →
  None (PSM-equivalent no-change); single-component proposal → other count
  stays None; non-apply execute_action → None. Previously every adapter.tick
  test hit only the None/empty path, so a regression in the projection /
  no-change detection would have shipped silently.

ai-dynamo#8 _tick_input_to_context + FPM encoding:
  build a TickInput with traffic (incl kv_hit_rate), worker counts (incl
  scaling-in-progress flags), and a real ForwardPassMetrics; assert the
  PipelineContext.observations mapping and that the FPM bytes decode back
  (key format "<worker_id>/<dp_rank>", canonical encoder). This is the
  ingress glue where the add_observations P1 + the projection live.

ai-dynamo#15 registry mutation during an in-flight tick:
  suspend a PROPOSE plugin mid-gather (asyncio.Event), register a new
  plugin while suspended, release, and assert the late plugin did NOT join
  the in-flight stage (pre-tick snapshot) and the tick completed cleanly —
  then a fresh tick picks it up. Exercises the no-locks invariant that
  scheduler.py/server.py document but no test covered.

ai-dynamo#16 test_tick_diagnostics_extended scope note:
  clarify in the module docstring that plugin_overrides / reconcile_reasons
  / held_over_plugins have no production populator in this PR; these tests
  lock the dataclass contract (defaults / no shared-mutable aliasing), not
  live behavior.

835 planner tests pass (+6).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ay (review ai-dynamo#6)

The registration gateway receives external plugins' shared-secret
``RegisterRequest.auth_token``. ``start_gateway_server`` bound
``add_insecure_port`` unconditionally when no TLS creds were supplied,
and the production caller never supplies creds and had no config to —
so pointing ``gateway.listen`` at a TCP ``host:port`` silently stood up
a plaintext gRPC server that received every plugin's token in cleartext,
with only an INFO log. This is asymmetric with the OUTBOUND transport,
which fails closed unless ``transport.allow_insecure_grpc=True``.

Make the inbound side symmetric:
- Add ``GatewayConfig.allow_insecure`` (default False).
- ``start_gateway_server`` gains ``allow_insecure`` and, in the
  no-credentials branch, refuses to bind a non-``unix:`` (TCP) listen
  unless ``allow_insecure`` is set — raising a clear RuntimeError before
  any bind. ``unix:`` (Pod-local, trust-boundary) listens are always
  allowed. When a plaintext TCP bind IS opted into, it logs a WARNING
  (not INFO) naming the token-exposure risk.
- ``_maybe_start_gateway`` passes ``gw_cfg.allow_insecure`` through.

Tests: TCP + allow_insecure=False → RuntimeError "refusing to bind
plaintext"; TCP + allow_insecure=True → binds (stubbed server).

837 planner tests pass (+2).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… gate

The new fail-closed guard (commit be38e21) raises "refusing to bind
plaintext" for a TCP listen with allow_insecure=False, which fired before
the port==0 bind-failure path that test_start_gateway_server_raises_when_
port_zero exercises (it uses listen="0.0.0.0:1"). Pass allow_insecure=True
in that test so it reaches the intended bind-failure path; the
plaintext-gate behavior is covered by its own dedicated tests.

837 planner tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…eview ai-dynamo#5)

TrafficMetrics.{duration_s,num_req,isl,osl,kv_hit_rate} and
PredictionData.{predicted_num_req,predicted_isl,predicted_osl,
predicted_kv_hit_rate} were proto ``float`` (IEEE-754 32-bit), but their
source of truth (core/types.py + the Pydantic mirror) is Python float64.
Every out-of-process plugin received these values truncated to float32
(and re-truncated on the way back), so rate-style values drifted ~1e-4
and the gRPC transport disagreed bit-for-bit with the in-process
transport — which is presented as interchangeable. The round-trip tests
passed only because every fixture float was hand-picked float32-exact.

Change the nine fields to ``double`` / ``optional double``. Safe now:
the v1 contract is pre-ship (no external plugin has shipped against it),
and changing the wire type before release avoids a breaking change later.
Regenerated *_pb2 stubs (+ SPDX re-prepend). The Pydantic mirror already
uses float64, so no mirror change. New test asserts EXACT (not approx)
round-trip for non-float32-exact values, so a regression to ``float`` is
caught.

838 planner tests pass (+1).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… path (review #3)

On the orchestrator replay path (use_orchestrator=True) the regression
models were never installed: replay/main.py guarded the AIC benchmark-FPM
bootstrap behind ``adapter._sm is not None and not adapter._sm._is_easy``,
and ``_sm`` is None under use_orchestrator — so the whole block (incl. the
``load_benchmark_fpms`` calls) was skipped. ``get_regression`` then returned
None for the whole replay, the throughput regression stayed empty, and
orchestrator-replay scaling decisions diverged from PSM — contradicting the
adapter docstrings that claimed bootstrap_from_fpms→install_regressions was
wired.

Fix:
- engine_adapter: extract ``install_regressions_from_fpms`` (synchronous,
  builds the regressions from benchmark FPMs via the throwaway-PSM factory
  and installs them on the shared store — no plugin bootstrap).
  ``bootstrap_from_fpms`` now = install_regressions_from_fpms +
  bootstrap_plugins.
- replay_adapter: add path-agnostic ``install_benchmark_fpms`` — PSM path →
  ``PlannerStateMachine.load_benchmark_fpms``; orchestrator path →
  ``install_regressions_from_fpms`` (plugins were already bootstrapped at
  adapter construction, so this does NOT double-bootstrap). Corrected the
  ``_get_regression`` docstring to the real wiring.
- replay/main.py: guard is now ``not adapter._is_easy_mode()`` (path-
  agnostic) and the two ``adapter._sm.load_benchmark_fpms(...)`` calls become
  ``adapter.install_benchmark_fpms(...)``.

Test: install_benchmark_fpms on the orchestrator path makes
get_regression("agg") non-None (was None pre-fix).

840 planner tests pass (+1). NOTE: replay/main.py's end-to-end path can't be
exercised in this env (its Rust _core symbol predates the installed
extension); the edit is py_compile-validated and the adapter method it calls
is unit-tested.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… test

CI pins black 23.1.0 via pre-commit; the committed call was left
multi-line by a newer local black. Collapse to one line so the pinned
formatter and CI agree.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…from_fpms

CI mypy flagged replay_adapter.py:249 with [attr-defined] (note: not
covered by the type:ignore). The call sits in install_benchmark_fpms
where self._engine carries only its declared EngineProtocol type, so
the absent method is attr-defined — not union-attr (the latter applies
at the __init__ call site where mypy narrows the assigned union).
Switch the ignore code to [attr-defined].

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…I runtime parity

The review-ai-dynamo#5 regen (a2b8c1c) was run with a local grpcio-tools 1.80
/ protobuf 6.33, baking ValidateProtobufRuntimeVersion(6,31,1) and
GRPC_GENERATED_VERSION '1.80.0' into the committed stubs. CI's pinned
runtime is protobuf 5.29.6 (requirements: protobuf>=5.29.5,<6.0dev),
so the 6.31.1 gencode guard hard-raised VersionError at import, erroring
out 5 planner test modules at collection.

Regenerate from the unchanged plugin.proto with grpcio-tools 1.67.1 —
the same toolchain the stubs originally shipped with (b39dde8) and
which passed CI — restoring gencode 5.27.2 + grpc guard 1.67.1. The
serialized descriptor blob is byte-identical (schema unchanged: doubles,
kv_hit_rate, WorkerState flags, component_name strip all preserved);
only the version guards revert to CI-compatible values.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

actions container documentation Improvements or additions to documentation external-contribution Pull request is from an external contributor feat planner size/XXL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants