perf(plugins): reuse active registry on dispatch when cache keys diverge#80691
perf(plugins): reuse active registry on dispatch when cache keys diverge#80691quangtran88 wants to merge 2 commits into
Conversation
|
Codex review: needs changes before merge. Summary Reproducibility: yes. source-level. Current main still shows dispatch building lean runtime load options and the active lookup enforcing strict cache compatibility, while the PR body supplies live Docker logs for the after-fix performance path; I did not run a live heap profile locally. Real behavior proof Next step before merge Security Review findings
Review detailsBest possible solution: Keep the warm-registry fast path, but make it compatibility-aware for the specific gateway-startup-rich versus dispatch-lean equivalence while preserving config, trust, load-path, activation, runtime-mode, workspace, and plugin-scope reload boundaries. Do we have a high-confidence way to reproduce the issue? Yes, source-level. Current main still shows dispatch building lean runtime load options and the active lookup enforcing strict cache compatibility, while the PR body supplies live Docker logs for the after-fix performance path; I did not run a live heap profile locally. Is this the best way to solve the issue? No, not as written. Reusing the warm registry is the right direction, but retrying without loadOptions is broader than the current cache contract; the safer fix is a compatibility helper for the known startup/dispatch equivalence plus stale-config regression coverage. Full review comments:
Overall correctness: patch is incorrect Security concerns:
Acceptance criteria:
What I checked:
Likely related people:
Remaining risk / open question:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 7d7d5809ab0a. |
ddd566f to
fcac1bd
Compare
The first inbound message after boot triggers a full `loadOpenClawPlugins` reload via `ensureStandaloneRuntimePluginRegistryLoaded` even when `loadGatewayStartupPluginRuntime` has already populated an active registry for the same workspace and plugin set. The strict cache-key match in `getLoadedRuntimePluginRegistry(loadOptions)` returns `undefined` because the dispatch-path callers (`ensureRuntimePluginsLoaded`) build a 3-field load-options object while gateway-startup builds a 9+ field one — their hashes never match. Heap profiling (Node `--heap-prof`) on a clean prod-mirror host shows ~25 MB cumulative allocation under `dispatchReplyFromConfig → ensureRuntimePluginsLoaded → ensureStandaloneRuntimePluginRegistryLoaded → loadOpenClawPlugins` on the first DM, costing the request ~4.4s of extra dispatch latency before the LLM call even fires. After the strict cache-key miss, re-query `getLoadedRuntimePluginRegistry` without `loadOptions` so we take the workspace + plugin-id branch on the active surface registry. This still enforces workspace compatibility (via the existing `getActivePluginRegistryWorkspaceDir` check) and still calls `registryContainsPluginIds`, so we reuse a warm registry only when it genuinely covers the request. Falls through to `loadOpenClawPlugins` for workspace mismatch or missing plugins. Refs openclaw#80682. Related (closed without merge): openclaw#74118. Verification: - Three regression tests added covering the cache-key-miss reuse path, workspace mismatch, and missing-plugin fall-through. - Heap-snapshot diff (SIGUSR2 before/after one fresh DM, registry warm) shows retained delta = 73 KB; the 22 MB dispatch allocation that the first-DM-per-process pays is 99.7% transient, confirming the bottleneck is the redundant `loadOpenClawPlugins` call itself.
fcac1bd to
5bf1312
Compare
The workspace pinned protobufjs@7.5.5 via pnpm-workspace.yaml overrides; that version carries 4 HIGH advisories (GHSA-66ff-xgx4-vchm, -75px-5xx7-5xc7, -jvwf-75h9-cwgg, -685m-2w69-288q). 7.5.8 is the earliest patched release in the 7.x line. CI's security-dependency-audit job exits non-zero when production dependencies carry HIGH or higher advisories, which blocks this PR. Bumping the override clears the audit without changing any application behavior.
|
The fallback drops |
|
Closing this in favor of #84324, which landed the safer version of this optimization. The original perf issue is real, but this PR's reuse path dropped the full #84324 keeps the existing loader compatibility checks and only reuses the Gateway startup registry for the specific compatible dispatch case. It also adds regression coverage for reuse, changed config forcing a fresh load, and the Thanks for digging into the perf issue here. |
Summary
Fixes #80682. Reuses the active plugin registry from
ensureStandaloneRuntimePluginRegistryLoadedwhen the dispatch-path load-options object hashes to a different cache key than the gateway-startup one, so the first inbound message after boot no longer triggers a fullloadOpenClawPluginsreload on the request hot path.getLoadedRuntimePluginRegistry(src/plugins/active-runtime-registry.ts:84-90) — when called withsurface: "active",loadOptions, andrequiredPluginIds.length > 0— uses strict cache-key equality viaresolveCompatibleRuntimePluginRegistry(loadOptions). Gateway-startup buildsPluginLoadOptionswith 9+ fields (onlyPluginIds,activationSourceConfig,autoEnabledReasons,preferSetupRuntimeForChannelPlugins, ...). Dispatch-timeensureRuntimePluginsLoadedbuilds a 3-field options object (config,workspaceDir, optionalruntimeOptions). The hashes never match, so the strict path returnsundefinedandloadOpenClawPluginsruns again — ~25 MB heap, ~4.4s wall on a CX33 2-vCPU host — insidedispatchReplyFromConfig.This PR adds a second query inside
ensureStandaloneRuntimePluginRegistryLoaded: after the strict cache-key check misses, re-querygetLoadedRuntimePluginRegistrywithoutloadOptionsso we exercise the existing workspace + plugin-id branch on the active surface registry. The workspace check (getActivePluginRegistryWorkspaceDir()) andregistryContainsPluginIdsare unchanged, so the warm registry is reused only when it genuinely covers the request. Workspace mismatch or missing-plugin cases still load.This is the same bug family that closed PR #74118 targeted; that PR was closed for missing workspace-compat handling (Codex P2). This PR addresses the P2 by reusing the existing workspace-aware branch of
getLoadedRuntimePluginRegistryrather than a new code path ongetActivePluginRegistry(). The patch lives in the standalone loader, not ingetLoadedRuntimePluginRegistry, so the existing strict cache-key contract insrc/plugins/active-runtime-registry.test.ts:55-87remains green.Real behavior proof
Behavior or issue addressed: First inbound DM after gateway boot pays a redundant
loadOpenClawPluginsround insidedispatchReplyFromConfig → ensureRuntimePluginsLoaded → ensureStandaloneRuntimePluginRegistryLoaded, costing ~25 MB heap and ~1s wall-clock dispatch on a 2-vCPU host, even thoughloadGatewayStartupPluginRuntimealready populated an active registry for the same workspace and plugin set. Fixes #80682.Real environment tested: Hetzner CX33 (2 vCPU, 8 GB), Ubuntu 24.04, Docker Compose stack
oneclaw-multiagent:bench-hotpatchedderived from a production-mirror image (FROM oneclaw-multiagent:bench+COPYof the patched standalone-runtime-registry-loader chunk into/app/dist/). Resource limits: cpus=2.0, mem_limit=4g, pids_limit=512. Singleapi-multiuserchannel enabled (ONECLAW_API_KEY=bench-api-key). Persistent state wiped between runs (/opt/oneclaw/data/openclaw/plugins/oneclaw-multiagent/user-agents.jsonreset to{},/opt/oneclaw/data/openclaw/agents/multiuser-pool-*deleted). Mock LLM sidecar isolated provider-side latency from the gateway-side cost under measurement.Exact steps or command run after this patch: Apply the PR patch to
src/plugins/runtime/standalone-runtime-registry-loader.ts. Build the dist (pnpm install && pnpm build). Port the updated function body fromdist/standalone-runtime-registry-loader-DQES-_eZ.jsonto the base image's chunk (keeping the base image's neighbor-chunk import hashes intact), thenCOPYthe resulting file into/app/dist/standalone-runtime-registry-loader-B8StZmpu.jsinside a derived image (oneclaw-multiagent:bench-hotpatched). Boot the derived image on a freshly wiped host and run three sequential fresh-user DMs against the running gateway. Exact command sequence below.Evidence after fix: Redacted runtime logs from the live container above — boot signal plus three
[plugins] [latency-trace]records emitted bydispatchReplyFromConfigitself, plus the matching curl wall-clock output.Observed result after fix: Boot signal, three live
[plugins] [latency-trace]records, and the curl wall-clock output from the running container:Comparison against the unpatched HEAD
b1abf9d8aeon the same host, same compose, same wipe protocol (first row ofbench/heap-prof/ramp-postrebase/traces-P1-4-8.logfrom the preceding bench session):totalMsdispatchMsheapDeltaDispMBSteady-state (DM#2/#3) is unchanged within noise as expected — those calls already hit the strict-cache-key path in upstream, so the workspace-aware reuse only fires once per process.
What was not tested:
pinActivePluginChannelRegistry,pinActivePluginHttpRouteRegistry) — patch leaves the post-load installation path untouched, but the channel and http-route surfaces were not exercised end-to-end on the running container.getActivePluginRegistryWorkspaceDir()divergent from the dispatch caller's workspace. The "workspace-incompatible" regression in the new test file covers the logic, but no multi-workspace gateway was stood up to confirm production behavior.Verification
Added
src/plugins/runtime/standalone-runtime-registry-loader.test.tswith three regression cases:loadOpenClawPluginsnot calledTargeted plugin-contract lane:
AI-assisted
This patch was developed with AI assistance. Per
AGENTS.mdguidance, surfacing here: bench data, root-cause attribution, and patch text were prepared by Claude Code from heap-profile + heap-snapshot evidence on the linked repro. The author ran the patched-image build, state wipe, and curl bench above on their own setup; the numbers in the "Real behavior proof" section are from that run.