perf(plugins): standalone runtime registry reloads on every fresh dispatch despite warm gateway-startup registry (~4.4s, 25 MB per process) #80682

@quangtran88

Summary

ensureStandaloneRuntimePluginRegistryLoaded (called from dispatchReplyFromConfig via ensureRuntimePluginsLoaded) reloads the full standalone runtime plugin registry on the first inbound dispatch per process, even when loadGatewayStartupPluginRuntime has already populated an active registry at boot. This adds ~4.4s of latency and ~25 MB of heap allocation to the first user's dispatch.

Instrumented heap profiling on a clean prod-mirror host (Hetzner CX33, 2 vCPU, mock-LLM, rebased onto current origin/main HEAD b1abf9d8ae) shows the lazy load fires inside the dispatch path, after gateway boot completes:

| Cohort | totalMs | dispatchMs | heapDeltaDispMB | gcCount |
|---|---|---|---|---|
| Warm boot, first fresh-user DM | 10185 | 8478 | 260 | 24 |
| Second fresh-user DM (registry warm) | 6770 | 5000 | 7-15 | 10 |

The extra 4-5s and ~250 MB of transient heap on DM#1 come from loadOpenClawPlugins running inside the request, not from anything user-specific.

Reproduction

  1. Clean state: wipe ~/.openclaw/{shared/users.json,plugins/<id>/user-agents.json,sandbox/containers,openclaw.json*}
  2. Boot gateway with N plugins enabled
  3. Issue an inbound chat completion to a fresh user
  4. Observe: dispatch latency ≈ 10s; heap snapshot shows loadOpenClawPlugins allocations under dispatchReplyFromConfig → ensureRuntimePluginsLoaded → ensureStandaloneRuntimePluginRegistryLoaded
  5. Send a second message (any user); dispatch latency drops 40-50%

Root cause

getLoadedRuntimePluginRegistry (src/plugins/active-runtime-registry.ts:71-105) — when called with surface: "active", loadOptions, and requiredPluginIds.length > 0, uses strict cache-key equality via resolveCompatibleRuntimePluginRegistry(loadOptions):

if (surface === "active" && params.loadOptions && requiredPluginIds?.length !== 0) {
  const compatible = resolveCompatibleRuntimePluginRegistry(params.loadOptions);
  if (!compatible || !registryContainsPluginIds(compatible, requiredPluginIds)) {
    return undefined;
  }
  return compatible;
}

Gateway-startup (loadGatewayStartupPluginRuntime) builds PluginLoadOptions with 9+ fields (onlyPluginIds, activationSourceConfig, autoEnabledReasons, preferSetupRuntimeForChannelPlugins, ...). Dispatch-time ensureRuntimePluginsLoaded builds 3-field options (config, workspaceDir, optional runtimeOptions). Cache keys never match → strict path returns undefined → full loadOpenClawPlugins reload on the request hot path.
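
To illustrate why the two call sites can never share a cache entry, here is a minimal sketch. It assumes the cache key is derived from every field that is set on the load options; `SketchLoadOptions` and `sketchCacheKey` are hypothetical stand-ins, not the real `PluginLoadOptions` or the loader's actual key function.

```typescript
// Hypothetical, trimmed-down stand-in for the real PluginLoadOptions.
type SketchLoadOptions = {
  config: { plugins: string[] };
  workspaceDir: string;
  onlyPluginIds?: string[];
  preferSetupRuntimeForChannelPlugins?: boolean;
};

// Assumed key derivation: serialize every field that is set, in a stable order.
// The real loader may compute its key differently.
function sketchCacheKey(options: SketchLoadOptions): string {
  const entries = Object.entries(options)
    .filter(([, value]) => value !== undefined)
    .sort(([a], [b]) => a.localeCompare(b));
  return JSON.stringify(entries);
}

const sharedConfig = { plugins: ["chat", "search"] };

// Gateway startup sets many fields (only two extras shown here).
const startupOptions: SketchLoadOptions = {
  config: sharedConfig,
  workspaceDir: "/srv/workspace",
  onlyPluginIds: ["chat", "search"],
  preferSetupRuntimeForChannelPlugins: true,
};

// Dispatch time sets only the minimal fields.
const dispatchOptions: SketchLoadOptions = {
  config: sharedConfig,
  workspaceDir: "/srv/workspace",
};

// Same config and workspace, yet the keys differ because the field sets differ.
const keysMatch = sketchCacheKey(startupOptions) === sketchCacheKey(dispatchOptions);
```

Under this assumption `keysMatch` is false even though both option sets describe the same workspace and plugins, which is the situation that sends the strict path to `return undefined`.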

This is the same bug family addressed by closed PR #74118 (proposed a fast-path on getActivePluginRegistry(); closed per Codex feedback on missing workspace-compat handling).

Proposed fix direction

In getLoadedRuntimePluginRegistry, when the strict cache-key match misses, fall through to the existing workspace+plugin-id compatibility check on the active registry (already present in lines 92-104 for the non-loadOptions case) instead of returning undefined:

if (surface === "active" && params.loadOptions && requiredPluginIds?.length !== 0) {
  const compatible = resolveCompatibleRuntimePluginRegistry(params.loadOptions);
  if (compatible && registryContainsPluginIds(compatible, requiredPluginIds)) {
    return compatible;
  }
  // Fall through to workspace+plugin-id check on active registry
  // instead of returning undefined unconditionally
}
// Existing fall-through (workspace check + registryContainsPluginIds) now
// also serves the strict-cache-miss case

Properties:

  • Addresses Codex's P2 from PR #74118 ("perf(agents/runtime): short-circuit ensureRuntimePluginsLoaded when active registry exists"): workspace compat is preserved by reusing the existing branch's workspace check
  • Preserves laziness per src/plugins/CLAUDE.md — still loads on demand when active registry is genuinely missing or workspace-incompatible
  • Affects both ensureStandaloneRuntimePluginRegistryLoaded and ensureRuntimePluginsLoaded (the wrapper), so dispatch-path and standalone-path callers both benefit
  • Public contract unchanged
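
The intended control flow can be sketched in isolation as follows. This is a simplified model, not the actual patch: `SketchRegistry`, `SketchLookup`, `containsAll`, and `lookupWithFallThrough` are hypothetical names, and the workspace check is assumed to be a plain directory comparison, which may differ from the real check in active-runtime-registry.ts.

```typescript
// Hypothetical, simplified model of an active registry and a lookup request.
type SketchRegistry = {
  cacheKey: string;
  workspaceDir: string;
  pluginIds: Set<string>;
};

type SketchLookup = {
  cacheKey: string;
  workspaceDir: string;
  requiredPluginIds: string[];
};

function containsAll(registry: SketchRegistry, ids: string[]): boolean {
  return ids.every((id) => registry.pluginIds.has(id));
}

function lookupWithFallThrough(
  active: SketchRegistry | undefined,
  lookup: SketchLookup,
): SketchRegistry | undefined {
  if (!active) {
    return undefined; // nothing loaded yet: caller still loads lazily
  }
  // Strict path: exact cache-key match plus required plugin ids.
  if (active.cacheKey === lookup.cacheKey && containsAll(active, lookup.requiredPluginIds)) {
    return active;
  }
  // Fall-through: a strict miss is no longer final. Reuse the active registry
  // only if the workspace matches (assumed check) and the plugins are present.
  if (active.workspaceDir !== lookup.workspaceDir) {
    return undefined;
  }
  return containsAll(active, lookup.requiredPluginIds) ? active : undefined;
}

const warmRegistry: SketchRegistry = {
  cacheKey: "startup-key",
  workspaceDir: "/srv/workspace",
  pluginIds: new Set(["chat", "search"]),
};

// Dispatch-time lookup: different cache key, same workspace, plugins present.
const reused = lookupWithFallThrough(warmRegistry, {
  cacheKey: "dispatch-key",
  workspaceDir: "/srv/workspace",
  requiredPluginIds: ["chat"],
});

// Different workspace: the warm registry must not be reused.
const otherWorkspace = lookupWithFallThrough(warmRegistry, {
  cacheKey: "dispatch-key",
  workspaceDir: "/srv/other",
  requiredPluginIds: ["chat"],
});
```

In this model the dispatch-time lookup reuses the warm registry despite the key mismatch, while a different workspace or a missing plugin id still yields `undefined`, so the caller falls back to an on-demand load.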

Evidence

  • Heap profiles (Node --heap-prof, 64K interval) — pre-rebase + post-rebase. Top call site: ensureStandaloneRuntimePluginRegistryLoaded → loadOpenClawPlugins at ~25 MB cumulative inside dispatch.
  • Heap-snapshot diff (SIGUSR2 before/after one fresh DM, registry already warm): retained delta = 73 KB. The 22 MB dispatch alloc is 99.7% transient — per-user cost is allocation churn + GC, not steady-state retention.
  • 39-trace P=1/4/8 ramp on origin/main HEAD:
    • P=1 p50: 6770 ms (warm), 10185 ms (first DM)
    • P=4 p50: 30889 ms
    • P=8 p50: 63274 ms
    • TTFT(P) slope: 7.7s × P; ELU=1.000 sustained; GC scales with concurrency (P=1 ~10/turn, P=8 165-190/turn)
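
One way to read the ramp figures above is to divide each p50 by its concurrency level. The sketch below does that arithmetic on the three reported p50 values; how the 7.7s slope in the issue was fitted is not stated, so this is only an approximate cross-check, and the variable names are illustrative.

```typescript
// Reported p50 latencies from the ramp: [concurrency P, p50 in ms].
const p50ByConcurrency: Array<[number, number]> = [
  [1, 6770],
  [4, 30889],
  [8, 63274],
];

// Seconds of p50 latency per unit of concurrency at each ramp step.
const secondsPerConcurrentRequest = p50ByConcurrency.map(([p, ms]) => ms / p / 1000);

// Ratio of P=8 to P=1 latency; close to 8 would mean fully serialized work.
const scalingFromP1ToP8 = p50ByConcurrency[2][1] / p50ByConcurrency[0][1];
```

The per-request values stay between roughly 6.8s and 7.9s across the ramp, and the P=8 latency is a little over nine times the P=1 latency, which fits requests being serialized on a saturated event loop (ELU=1.000) rather than running in parallel.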

Artifacts available on request: 3 .heapprofile files (1.8–3.4 MB), 2 heap snapshots (181 MB each), full latency-trace logs.

Environment

  • OpenClaw main at b1abf9d8ae (chore(release): refresh base config schema)
  • Node 24.14.0, Linux 6.x (Debian bookworm), Docker, 2 vCPU CX33
  • Multiagent extension under test (oneclaw-multiagent) routes dispatch through stock dispatchReplyFromConfig — bottleneck is in upstream loader, not extension code

I have a draft PR with the patch + workspace-compat regression test + changelog entry. Will link once posted.
