
perf(plugins): reuse active registry on dispatch when cache keys diverge #80691

Closed
quangtran88 wants to merge 2 commits into openclaw:main from quangtran88:perf-warm-active-registry-fast-path

@quangtran88 quangtran88 commented May 11, 2026

Summary

Fixes #80682. Reuses the active plugin registry from ensureStandaloneRuntimePluginRegistryLoaded when the dispatch-path load-options object hashes to a different cache key than the gateway-startup one, so the first inbound message after boot no longer triggers a full loadOpenClawPlugins reload on the request hot path.

getLoadedRuntimePluginRegistry (src/plugins/active-runtime-registry.ts:84-90) — when called with surface: "active", loadOptions, and requiredPluginIds.length > 0 — uses strict cache-key equality via resolveCompatibleRuntimePluginRegistry(loadOptions). Gateway-startup builds PluginLoadOptions with 9+ fields (onlyPluginIds, activationSourceConfig, autoEnabledReasons, preferSetupRuntimeForChannelPlugins, ...). Dispatch-time ensureRuntimePluginsLoaded builds a 3-field options object (config, workspaceDir, optional runtimeOptions). The hashes never match, so the strict path returns undefined and loadOpenClawPlugins runs again — ~25 MB heap, ~4.4s wall on a CX33 2-vCPU host — inside dispatchReplyFromConfig.
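The mismatch described above can be illustrated with a minimal sketch. It assumes a hypothetical `cacheKeySketch` function that serializes the options object with sorted keys; the real loader hashes a normalized context, and the field values here are invented. Only the field names come from the description.

```typescript
// Hypothetical, simplified stand-in for the real PluginLoadOptions type.
type LoadOptionsSketch = Record<string, unknown>;

// Hypothetical cache key: a JSON string with sorted top-level keys.
// It stands in for the real hash and only illustrates the effect.
function cacheKeySketch(options: LoadOptionsSketch): string {
  const entries = Object.keys(options)
    .sort()
    .map((key) => [key, options[key]]);
  return JSON.stringify(entries);
}

const sharedConfig = { plugins: ["browser", "memory-core"] };

// Gateway startup: rich options (field names from the PR text, values invented).
const startupOptionsSketch: LoadOptionsSketch = {
  config: sharedConfig,
  workspaceDir: "/workspace",
  onlyPluginIds: ["browser", "memory-core"],
  activationSourceConfig: sharedConfig,
  autoEnabledReasons: {},
  preferSetupRuntimeForChannelPlugins: true,
};

// Dispatch: lean options for the same workspace and config.
const dispatchOptionsSketch: LoadOptionsSketch = {
  config: sharedConfig,
  workspaceDir: "/workspace",
};

const startupKeySketch = cacheKeySketch(startupOptionsSketch);
const dispatchKeySketch = cacheKeySketch(dispatchOptionsSketch);

// Same workspace and config, but the extra startup fields change the key.
console.log(startupKeySketch === dispatchKeySketch); // false
```

Any key derived from the whole options object behaves this way: the two call sites agree on every shared field, yet strict equality still misses.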

This PR adds a second query inside ensureStandaloneRuntimePluginRegistryLoaded: after the strict cache-key check misses, re-query getLoadedRuntimePluginRegistry without loadOptions so we exercise the existing workspace + plugin-id branch on the active surface registry. The workspace check (getActivePluginRegistryWorkspaceDir()) and registryContainsPluginIds are unchanged, so the warm registry is reused only when it genuinely covers the request. Workspace mismatch or missing-plugin cases still load.
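The lookup order this paragraph describes can be sketched as follows. All names and shapes are hypothetical stand-ins, not the real openclaw API; the sketch models only the strict key check, the workspace check, and the plugin-id coverage check.

```typescript
// Hypothetical, simplified shapes; none of these are the real openclaw types.
interface ActiveRegistrySketch {
  cacheKey: string;
  workspaceDir: string;
  pluginIds: ReadonlySet<string>;
}

interface DispatchRequestSketch {
  cacheKey: string;
  workspaceDir: string;
  requiredPluginIds: readonly string[];
}

// Counts how often the expensive path runs (stand-in for loadOpenClawPlugins).
let fullLoadCount = 0;

function fullLoadSketch(request: DispatchRequestSketch): ActiveRegistrySketch {
  fullLoadCount += 1;
  return {
    cacheKey: request.cacheKey,
    workspaceDir: request.workspaceDir,
    pluginIds: new Set(request.requiredPluginIds),
  };
}

function ensureRegistrySketch(
  active: ActiveRegistrySketch | undefined,
  request: DispatchRequestSketch,
): ActiveRegistrySketch {
  if (active === undefined) return fullLoadSketch(request);
  const registry: ActiveRegistrySketch = active;
  // 1. Strict path: cache keys must be equal.
  if (registry.cacheKey === request.cacheKey) return registry;
  // 2. Fallback: same workspace and every required plugin present.
  const sameWorkspace = registry.workspaceDir === request.workspaceDir;
  const coversPlugins = request.requiredPluginIds.every((id) => registry.pluginIds.has(id));
  if (sameWorkspace && coversPlugins) return registry;
  // 3. Otherwise pay for a full load.
  return fullLoadSketch(request);
}

const warmRegistry: ActiveRegistrySketch = {
  cacheKey: "startup-key",
  workspaceDir: "/ws",
  pluginIds: new Set(["browser", "memory-core"]),
};

const reusedRegistry = ensureRegistrySketch(warmRegistry, {
  cacheKey: "dispatch-key",
  workspaceDir: "/ws",
  requiredPluginIds: ["browser"],
});
const otherWorkspaceRegistry = ensureRegistrySketch(warmRegistry, {
  cacheKey: "dispatch-key",
  workspaceDir: "/other",
  requiredPluginIds: ["browser"],
});
const missingPluginRegistry = ensureRegistrySketch(warmRegistry, {
  cacheKey: "dispatch-key",
  workspaceDir: "/ws",
  requiredPluginIds: ["talk-voice"],
});

console.log(reusedRegistry === warmRegistry, fullLoadCount); // true 2
```

The three calls mirror the cases named above: a key miss with matching workspace and plugin ids reuses the warm registry, while a workspace mismatch or a missing plugin each trigger one full load.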

This is the same bug family that the now-closed PR #74118 targeted; that PR was closed for missing workspace-compat handling (Codex P2). This PR addresses the P2 by reusing the existing workspace-aware branch of getLoadedRuntimePluginRegistry rather than adding a new code path on getActivePluginRegistry(). The patch lives in the standalone loader, not in getLoadedRuntimePluginRegistry, so the tests for the existing strict cache-key contract in src/plugins/active-runtime-registry.test.ts:55-87 keep passing.

Real behavior proof

Behavior or issue addressed: First inbound DM after gateway boot pays a redundant loadOpenClawPlugins round inside dispatchReplyFromConfig → ensureRuntimePluginsLoaded → ensureStandaloneRuntimePluginRegistryLoaded, costing ~25 MB heap and ~1s wall-clock dispatch on a 2-vCPU host, even though loadGatewayStartupPluginRuntime already populated an active registry for the same workspace and plugin set. Fixes #80682.

Real environment tested: Hetzner CX33 (2 vCPU, 8 GB), Ubuntu 24.04, Docker Compose stack oneclaw-multiagent:bench-hotpatched derived from a production-mirror image (FROM oneclaw-multiagent:bench + COPY of the patched standalone-runtime-registry-loader chunk into /app/dist/). Resource limits: cpus=2.0, mem_limit=4g, pids_limit=512. Single api-multiuser channel enabled (ONECLAW_API_KEY=bench-api-key). Persistent state wiped between runs (/opt/oneclaw/data/openclaw/plugins/oneclaw-multiagent/user-agents.json reset to {}, /opt/oneclaw/data/openclaw/agents/multiuser-pool-* deleted). Mock LLM sidecar isolated provider-side latency from the gateway-side cost under measurement.

Exact steps or command run after this patch: Apply the PR patch to src/plugins/runtime/standalone-runtime-registry-loader.ts. Build the dist (pnpm install && pnpm build). Port the updated function body from dist/standalone-runtime-registry-loader-DQES-_eZ.js onto the base image's chunk (keeping the base image's neighbor-chunk import hashes intact), then COPY the resulting file into /app/dist/standalone-runtime-registry-loader-B8StZmpu.js inside a derived image (oneclaw-multiagent:bench-hotpatched). Boot the derived image on a freshly wiped host and run three sequential fresh-user DMs against the running gateway. Exact command sequence below.

echo "{}" > /opt/oneclaw/data/openclaw/plugins/oneclaw-multiagent/user-agents.json
rm -rf /opt/oneclaw/data/openclaw/agents/multiuser-pool-*
docker compose -f compose-bench-prod.yml -f compose-bench-prod.override-patched.yml up -d
until curl -sf http://127.0.0.1:18789/health >/dev/null; do sleep 2; done
for u in proof-after-1 proof-after-2 proof-after-3; do
  curl -sN -o /dev/null -X POST http://127.0.0.1:18789/openclaw/v1/chat/completions \
    -H "authorization: Bearer bench-api-key" \
    -H "content-type: application/json" \
    -d "{\"stream\":true,\"user\":\"$u\",\"messages\":[{\"role\":\"user\",\"content\":\"x\"}]}" \
    -w "$u http_code=%{http_code} total=%{time_total}s\n"
done
docker logs oneclaw-multiagent 2>&1 | grep -E "latency-trace.*proof-after"

Evidence after fix: Redacted runtime logs from the live container above — boot signal plus three [plugins] [latency-trace] records emitted by dispatchReplyFromConfig itself, plus the matching curl wall-clock output.

Observed result after fix: Boot signal, three live [plugins] [latency-trace] records, and the curl wall-clock output from the running container:

2026-05-11T16:04:55.042+00:00 [gateway] http server listening (8 plugins: browser, device-pair, file-transfer, memory-core, oneclaw-multiagent, oneclaw-tool-compact, phone-control, talk-voice; 7.6s)

[trace 1 of 3 — proof-after-1, first DM since boot, registry cold]
turnId=52a39493 sessionKey=agent:multiuser-proof-after-1:main
totalMs=10394 entryMs=1892 dispatchMs=8499 firstDeliverMs=10377
heapDispStartMB=202.3 heapEndMB=365.1 heapDeltaDispMB=162.9 gcCount=25 gcMs=318

[trace 2 of 3 — proof-after-2, registry warm]
turnId=4195b7f3 sessionKey=agent:multiuser-proof-after-2:main
totalMs=7244 entryMs=1928 dispatchMs=5313 firstDeliverMs=7230
heapDispStartMB=208.1 heapEndMB=240.0 heapDeltaDispMB=31.9 gcCount=60 gcMs=384

[trace 3 of 3 — proof-after-3, registry warm]
turnId=ee706a49 sessionKey=agent:multiuser-proof-after-3:main
totalMs=6628 entryMs=1563 dispatchMs=5062 firstDeliverMs=6613
heapDispStartMB=231.0 heapEndMB=241.0 heapDeltaDispMB=10.0 gcCount=19 gcMs=87

curl wall-clock:
proof-after-1 http_code=200 total=10.431555s
proof-after-2 http_code=200 total=7.252838s
proof-after-3 http_code=200 total=6.636774s

Comparison against the unpatched HEAD b1abf9d8ae on the same host, same compose, same wipe protocol (first row of bench/heap-prof/ramp-postrebase/traces-P1-4-8.log from the preceding bench session):

| Metric (first fresh-user DM) | Unpatched | Patched | Delta |
| --- | --- | --- | --- |
| totalMs | 11310 | 10394 | −916 ms (−8.1%) |
| dispatchMs | 9499 | 8499 | −1000 ms (−10.5%) |
| heapDeltaDispMB | 266.1 | 162.9 | −103.2 MB (−38.8%) |
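
The deltas in the comparison can be recomputed from the unpatched and patched columns. The snippet below is only an arithmetic check on the reported numbers; the `BenchRow` shape and `formatDelta` helper are illustrative names, not part of the bench tooling.

```typescript
// Illustrative shape for one row of the comparison.
interface BenchRow {
  metric: string;
  unpatched: number;
  patched: number;
}

// Values copied from the first fresh-user DM comparison.
const benchRows: BenchRow[] = [
  { metric: "totalMs", unpatched: 11310, patched: 10394 },
  { metric: "dispatchMs", unpatched: 9499, patched: 8499 },
  { metric: "heapDeltaDispMB", unpatched: 266.1, patched: 162.9 },
];

// Absolute delta and percentage change relative to the unpatched value.
function formatDelta(row: BenchRow): string {
  const delta = row.patched - row.unpatched;
  const percent = (delta / row.unpatched) * 100;
  return `${row.metric}: ${delta.toFixed(1)} (${percent.toFixed(1)}%)`;
}

const benchLines = benchRows.map(formatDelta);
for (const line of benchLines) {
  console.log(line);
}
// totalMs: -916.0 (-8.1%)
// dispatchMs: -1000.0 (-10.5%)
// heapDeltaDispMB: -103.2 (-38.8%)
```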

Steady-state (DM#2/#3) is unchanged within noise as expected — those calls already hit the strict-cache-key path in upstream, so the workspace-aware reuse only fires once per process.

What was not tested:

  • Forked-runtime / subagent surfaces (pinActivePluginChannelRegistry, pinActivePluginHttpRouteRegistry) — patch leaves the post-load installation path untouched, but the channel and http-route surfaces were not exercised end-to-end on the running container.
  • A multi-workspace gateway with getActivePluginRegistryWorkspaceDir() divergent from the dispatch caller's workspace. The "workspace-incompatible" regression in the new test file covers the logic, but no multi-workspace gateway was stood up to confirm production behavior.
  • Real provider network calls. The patched code path runs before any provider call, so this should not affect the result, but no real provider was exercised.
  • Repro on Linux ARM64 or macOS — bench host was Linux x86_64 only.

Verification

Added src/plugins/runtime/standalone-runtime-registry-loader.test.ts with three regression cases:

  • Cache-key-miss but workspace + plugin-ids match → reuses active registry, loadOpenClawPlugins not called
  • Workspace mismatch → loads a fresh registry
  • Missing required plugin → loads a fresh registry
$ pnpm vitest run src/plugins/runtime/standalone-runtime-registry-loader.test.ts
RUN  v4.1.5 /tmp/openclaw-fork
Test Files  1 passed (1)
     Tests  3 passed (3)
  Duration  1.89s

Targeted plugin-contract lane:

$ pnpm test:contracts:plugins
Test Files  62 passed (62)
     Tests  833 passed (833)
  Duration  43.06s

AI-assisted

This patch was developed with AI assistance. Per AGENTS.md guidance, surfacing here: bench data, root-cause attribution, and patch text were prepared by Claude Code from heap-profile + heap-snapshot evidence on the linked repro. The author ran the patched-image build, state wipe, and curl bench above on their own setup; the numbers in the "Real behavior proof" section are from that run.

@clawsweeper

clawsweeper Bot commented May 11, 2026

Contributor

Codex review: needs changes before merge.

Summary
The PR adds a fallback to reuse an active runtime plugin registry after a standalone loader cache-key miss, adds focused regression tests, and bumps the root protobufjs override.

Reproducibility: yes, source-level. Current main still shows dispatch building lean runtime load options and the active lookup enforcing strict cache compatibility, while the PR body supplies live Docker logs for the after-fix performance path; I did not run a live heap profile locally.

Real behavior proof
Sufficient (logs): The PR body includes redacted patched-container runtime logs and curl wall-clock output showing after-fix dispatch and heap behavior in a real Docker host setup.

Next step before merge
A focused repair can keep the performance fast path while restoring compatibility guards and adding stale-config regression coverage; dependency override handling should be separated or left to maintainer approval.

Security
Needs attention: The diff does not add new code execution sources, but the fallback can reuse stale plugin runtime state and the PR also changes a root dependency override.

Review findings

  • [P2] Preserve load-options compatibility on fallback — src/plugins/runtime/standalone-runtime-registry-loader.ts:82-87
Review details

Best possible solution:

Keep the warm-registry fast path, but make it compatibility-aware for the specific gateway-startup-rich versus dispatch-lean equivalence while preserving config, trust, load-path, activation, runtime-mode, workspace, and plugin-scope reload boundaries.

Do we have a high-confidence way to reproduce the issue?

Yes, source-level. Current main still shows dispatch building lean runtime load options and the active lookup enforcing strict cache compatibility, while the PR body supplies live Docker logs for the after-fix performance path; I did not run a live heap profile locally.

Is this the best way to solve the issue?

No, not as written. Reusing the warm registry is the right direction, but retrying without loadOptions is broader than the current cache contract; the safer fix is a compatibility helper for the known startup/dispatch equivalence plus stale-config regression coverage.
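
A minimal sketch of what such a compatibility helper could look like, assuming the inputs both call sites can express are reduced to a small comparable record. Every name here (`SharedInputsSketch`, `configDigest`, `canReuseStartupRegistry`) is hypothetical and chosen for illustration; this is not the helper from any merged change.

```typescript
// Hypothetical: inputs that both gateway startup and dispatch can express.
interface SharedInputsSketch {
  configDigest: string; // stand-in for normalized plugin/trust config
  workspaceDir: string;
  runtimeMode: string;
}

// Hypothetical record kept alongside the registry loaded at gateway startup.
interface StartupRecordSketch {
  shared: SharedInputsSketch;
  pluginIds: ReadonlySet<string>;
}

// Reuse only when every shared input matches and the plugin set is covered.
// Startup-only fields are ignored because dispatch cannot express them.
function canReuseStartupRegistry(
  startup: StartupRecordSketch,
  dispatch: SharedInputsSketch,
  requiredPluginIds: readonly string[],
): boolean {
  const sameInputs =
    startup.shared.configDigest === dispatch.configDigest &&
    startup.shared.workspaceDir === dispatch.workspaceDir &&
    startup.shared.runtimeMode === dispatch.runtimeMode;
  return sameInputs && requiredPluginIds.every((id) => startup.pluginIds.has(id));
}

const startupRecord: StartupRecordSketch = {
  shared: { configDigest: "cfg-v1", workspaceDir: "/ws", runtimeMode: "gateway" },
  pluginIds: new Set(["browser", "memory-core"]),
};

const sameConfigReuse = canReuseStartupRegistry(
  startupRecord,
  { configDigest: "cfg-v1", workspaceDir: "/ws", runtimeMode: "gateway" },
  ["browser"],
);
const changedConfigReuse = canReuseStartupRegistry(
  startupRecord,
  { configDigest: "cfg-v2", workspaceDir: "/ws", runtimeMode: "gateway" },
  ["browser"],
);

console.log(sameConfigReuse, changedConfigReuse); // true false
```

The design choice is to compare an explicit list of shared inputs instead of dropping the options entirely, so a changed config still forces a fresh load.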

Full review comments:

  • [P2] Preserve load-options compatibility on fallback — src/plugins/runtime/standalone-runtime-registry-loader.ts:82-87
    After the strict cache-key check misses, this lookup drops loadOptions, so an active registry with the same workspace and plugin IDs can satisfy requests whose config, trust, load path, activation, or runtime-mode inputs changed. That crosses the existing cache compatibility boundary and can dispatch with stale plugin runtime state instead of loading for the new options.
    Confidence: 0.91
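
The stale-reuse scenario in this finding can be shown with a small sketch. The types and the `looseFallbackMatches` function are hypothetical; the function models only a check on workspace and plugin ids, which is what a lookup without loadOptions is described as enforcing.

```typescript
// Hypothetical shapes for illustration only.
interface StaleDemoRegistry {
  configVersion: string;
  workspaceDir: string;
  pluginIds: ReadonlySet<string>;
}

interface StaleDemoRequest {
  configVersion: string;
  workspaceDir: string;
  requiredPluginIds: readonly string[];
}

// Models a fallback that checks workspace and plugin ids but nothing else.
function looseFallbackMatches(registry: StaleDemoRegistry, request: StaleDemoRequest): boolean {
  return (
    registry.workspaceDir === request.workspaceDir &&
    request.requiredPluginIds.every((id) => registry.pluginIds.has(id))
  );
}

const registryLoadedAtBoot: StaleDemoRegistry = {
  configVersion: "v1",
  workspaceDir: "/ws",
  pluginIds: new Set(["browser"]),
};

// Same workspace and plugins, but the plugin config has changed since boot.
const requestAfterConfigChange: StaleDemoRequest = {
  configVersion: "v2",
  workspaceDir: "/ws",
  requiredPluginIds: ["browser"],
};

const staleReuse = looseFallbackMatches(registryLoadedAtBoot, requestAfterConfigChange);
console.log(staleReuse); // true: the v1 registry would serve a v2 request
```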

Overall correctness: patch is incorrect
Overall confidence: 0.91

Security concerns:

  • [medium] Stale registry reuse can ignore trust changes — src/plugins/runtime/standalone-runtime-registry-loader.ts:82
    Because the second lookup omits loadOptions, a strict cache-key miss caused by changed plugin config, activation source, trust, load paths, runtime mode, or module-loading settings can still return the previous active registry when workspace and plugin IDs match.
    Confidence: 0.86
  • [low] Root dependency override needs approval — pnpm-workspace.yaml:62
    The PR changes the workspace-level protobufjs override and lockfile resolution in a performance branch. This may be a valid security-audit bump, but dependency override changes are supply-chain-sensitive and should be explicitly approved or split from the runtime fix.
    Confidence: 0.78

Acceptance criteria:

  • pnpm test src/plugins/runtime/standalone-runtime-registry-loader.test.ts src/plugins/active-runtime-registry.test.ts src/agents/runtime-plugins.test.ts
  • pnpm test:contracts:plugins
  • pnpm build

What I checked:

  • PR fallback drops loadOptions: PR head d7cbd13 adds a second getLoadedRuntimePluginRegistry call after the strict miss but passes only env, workspaceDir, requiredPluginIds, and surface, so loadOptions compatibility inputs are no longer checked on the fallback path. (src/plugins/runtime/standalone-runtime-registry-loader.ts:82, d7cbd13cd9b6)
  • Current main strict active lookup: Current main routes active-surface lookups with loadOptions and non-empty required plugin IDs through resolveCompatibleRuntimePluginRegistry and returns undefined when compatibility or plugin coverage fails. (src/plugins/active-runtime-registry.ts:84, 7d7d5809ab0a)
  • Cache key covers behavior and trust inputs: The loader cache context hashes workspace, normalized plugin/trust config, activation metadata, installed records, env, plugin scope, setup-runtime flags, runtime subagent mode, SDK resolution, gateway methods, loadModules, and activate. (src/plugins/loader.ts:1075, 7d7d5809ab0a)
  • Existing regression coverage protects config mismatches: Current tests assert that changing plugin config while passing loadOptions must not reuse the active registry. (src/plugins/active-runtime-registry.test.ts:55, 7d7d5809ab0a)
  • Dispatch uses lean load options: Current dispatch-time ensureRuntimePluginsLoaded delegates to the standalone loader with config, workspaceDir, optional startup plugin IDs, and optional gateway-bindable runtimeOptions. (src/agents/runtime-plugins.ts:46, 7d7d5809ab0a)
  • Gateway startup uses richer load options: Gateway startup loads plugins with config, activationSourceConfig, autoEnabledReasons, workspaceDir, onlyPluginIds, gateway methods, gateway-bindable runtimeOptions, preferSetupRuntimeForChannelPlugins, and preferBuiltPluginArtifacts. (src/gateway/server-plugins.ts:606, 7d7d5809ab0a)

Likely related people:

  • DmitryPogodaev: Public commit history shows 8283c5d introduced startup runtime registry reuse and the active-registry helper surface directly involved here. (role: feature introducer; confidence: high; commits: 8283c5d6cc3f; files: src/plugins/runtime/standalone-runtime-registry-loader.ts, src/plugins/active-runtime-registry.ts, src/agents/runtime-plugins.ts)
  • lilesjtu: Recent public history shows 66ffb29 changed the standalone loader for partial tool registries, adjacent to scoped active-registry reuse. (role: recent area contributor; confidence: medium; commits: 66ffb29679c7; files: src/plugins/runtime/standalone-runtime-registry-loader.ts)
  • steipete: Public and local history show recent work on active plugin runtime lookup and current plugin runtime files, plus prior review context around the same cache-key compatibility family. (role: adjacent loader owner; confidence: medium; commits: 848348f423b5, 6a5290e49e2b; files: src/plugins/active-runtime-registry.ts, src/plugins/runtime/standalone-runtime-registry-loader.ts, src/agents/runtime-plugins.ts)

Remaining risk / open question:

  • The fallback can reuse an active registry across changed plugin config, trust, load-path, activation, or runtime-mode inputs because it drops loadOptions after the strict miss.
  • The PR includes a root dependency override bump in a performance PR; that supply-chain surface needs maintainer approval or separation before merge.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 7d7d5809ab0a.

@quangtran88 quangtran88 force-pushed the perf-warm-active-registry-fast-path branch from ddd566f to fcac1bd Compare May 11, 2026 16:09
@openclaw-barnacle openclaw-barnacle Bot added proof: supplied External PR includes structured after-fix real behavior proof. and removed triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 11, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 11, 2026
The first inbound message after boot triggers a full `loadOpenClawPlugins`
reload via `ensureStandaloneRuntimePluginRegistryLoaded` even when
`loadGatewayStartupPluginRuntime` has already populated an active registry
for the same workspace and plugin set. The strict cache-key match in
`getLoadedRuntimePluginRegistry(loadOptions)` returns `undefined` because
the dispatch-path callers (`ensureRuntimePluginsLoaded`) build a 3-field
load-options object while gateway-startup builds a 9+ field one — their
hashes never match.

Heap profiling (Node `--heap-prof`) on a clean prod-mirror host shows
~25 MB cumulative allocation under `dispatchReplyFromConfig →
ensureRuntimePluginsLoaded → ensureStandaloneRuntimePluginRegistryLoaded
→ loadOpenClawPlugins` on the first DM, costing the request ~4.4s of
extra dispatch latency before the LLM call even fires.

After the strict cache-key miss, re-query
`getLoadedRuntimePluginRegistry` without `loadOptions` so we take the
workspace + plugin-id branch on the active surface registry. This still
enforces workspace compatibility (via the existing
`getActivePluginRegistryWorkspaceDir` check) and still calls
`registryContainsPluginIds`, so we reuse a warm registry only when it
genuinely covers the request. Falls through to `loadOpenClawPlugins` for
workspace mismatch or missing plugins.

Refs openclaw#80682. Related (closed without merge): openclaw#74118.

Verification:
- Three regression tests added covering the cache-key-miss reuse path,
  workspace mismatch, and missing-plugin fall-through.
- Heap-snapshot diff (SIGUSR2 before/after one fresh DM, registry warm)
  shows retained delta = 73 KB; the 22 MB dispatch allocation that the
  first-DM-per-process pays is 99.7% transient, confirming the bottleneck
  is the redundant `loadOpenClawPlugins` call itself.
@quangtran88 quangtran88 force-pushed the perf-warm-active-registry-fast-path branch from fcac1bd to 5bf1312 Compare May 12, 2026 16:03
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 12, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 12, 2026
The workspace pinned protobufjs@7.5.5 via pnpm-workspace.yaml overrides;
that version carries 4 HIGH advisories (GHSA-66ff-xgx4-vchm, -75px-5xx7-5xc7,
-jvwf-75h9-cwgg, -685m-2w69-288q). 7.5.8 is the earliest patched release in
the 7.x line.

CI's security-dependency-audit job exits non-zero when production dependencies
carry HIGH or higher advisories, which blocks this PR. Bumping the override
clears the audit without changing any application behavior.
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 12, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 12, 2026
@ai-hpc

ai-hpc commented May 19, 2026

Member

The fallback drops loadOptions after the strict cache miss, so reuse is based only on workspace + required plugin ids. That can bypass config/trust/load path/activation/runtime-mode compatibility and reuse stale plugin runtime state. I think this needs an explicit compatibility check for the safe warm-registry case before approval.

@shakkernerd

Member

Closing this in favor of #84324, which landed the safer version of this optimization.

The original perf issue is real, but this PR's reuse path dropped the full loadOptions compatibility boundary after a strict cache miss. That made reuse depend too narrowly on workspace/plugin identity and could bypass config, trust, load path, activation, and runtime-mode compatibility checks.

#84324 keeps the existing loader compatibility checks and only reuses the Gateway startup registry for the specific compatible dispatch case. It also adds regression coverage for reuse, changed config forcing a fresh load, and the ensureRuntimePluginsLoaded() caller path.

Thanks for digging into the perf issue here.


Labels

proof: sufficient ClawSweeper judged the real behavior proof convincing. proof: supplied External PR includes structured after-fix real behavior proof. size: S

Development

Successfully merging this pull request may close these issues.

perf(plugins): standalone runtime registry reloads on every fresh dispatch despite warm gateway-startup registry (~4.4s, 25 MB per process)

3 participants