[Bug]: getCurrentPluginMetadataSnapshot workspaceDir mismatch defeats snapshot reuse — 192-stat sweep per dispatch (regression of #76182 / #73353 fix) #77519

@wiipud

TL;DR

The fix shipped for #76182 and #73353 added a getCurrentPluginMetadataSnapshot reuse path, but the setter (setCurrentPluginMetadataSnapshot at server.impl.ts) does not pass workspaceDir, while every reader (e.g. models-config.ts, model-catalog.ts, tools.ts) does pass a concrete workspaceDir = resolveAgentWorkspaceDir(cfg, agentId). The strict equality check at current-plugin-metadata-snapshot.ts:80-87 therefore evaluates to mismatch on every call, so the cached snapshot is always rejected and loadPluginMetadataSnapshot (full rebuild + per-plugin stat sweep) is invoked instead. The manifest-contract-eligibility.ts reuse path becomes dead code.

For multi-agent / multi-channel deployments using webchat / Picker UI (high sessions.list polling rate), this saturates the gateway main thread and reproduces all symptoms reported in #76182, #73353, and #61701.

Bug type

Regression (against the fix shipped for #76182 / #73353)

Beta release blocker

No (workaround: rollback to v2026.4.27)

Environment

  • OpenClaw version: 2026.5.3-1 (cbc2ba0 was 4.27 baseline, regression confirmed against 5.3-1)
  • Previous known-good: 2026.4.27
  • Host: x86_64 Linux 6.8.0-111-generic (Ubuntu 24.04)
  • Node: v25.9.0 (Linuxbrew)
  • Install: global npm install
  • Workload: 30 agents, 10 channel accounts (6 Discord bots + 3 Feishu + 1 Telegram), webchat / Picker UI active during normal operation, 17 enabled plugins out of 96 bundled

Steps to reproduce

  1. Install OpenClaw 2026.5.3-1
  2. Configure ≥10 agents (or rely on stock 96 bundled plugins)
  3. Open webchat (control-ui) — triggers periodic sessions.list / chat.history / node.list RPCs
  4. Optionally trigger any agent dispatch
  5. Observe gateway main thread

Expected behavior

After the fix shipped for #76182:

  • getCurrentPluginMetadataSnapshot returns the cached snapshot when called from runtime paths
  • manifest-contract-eligibility.ts reuse short-circuits the manifest registry rebuild
  • No per-plugin manifest sweep on hot RPC paths

Actual behavior

strace 5s on gateway main thread (PID running 5.3-1, dispatch in progress):
  70.35%  statx        48,914 calls
   9.95%  access        6,895 calls
   8.24%  openat        4,564 calls
   5.90%  close         4,565 calls
   4.75%  read          3,437 calls
   Total: 67,201 syscalls in 5 seconds → ~13,400 syscalls/sec

Strace pattern shows alphabetical traversal of dist/extensions/<plugin>/ for every bundled plugin (96 of them):

[pid X] access("/path/dist/extensions/signal/package.json", F_OK) = 0
[pid X] openat("/path/dist/extensions/signal/openclaw.plugin.json", O_RDONLY)
[pid X] openat("/path/dist/extensions/signal/package.json", O_RDONLY)
[pid X] access("/path/dist/extensions/skill-workshop/package.json", F_OK) = 0
... (all 96 plugins, two files each: openclaw.plugin.json + package.json)

Symptom impact:

  • sessions.list RPC: 156 seconds (vs 99-155 ms on 4.27)
  • chat.history RPC: 96 seconds
  • eventLoopUtilization: 1.0 (saturated)
  • eventLoopDelayMaxMs: 30000+ ms (regular)
  • All 6 Discord bots: storms of same-millisecond close code 1000 disconnects (heartbeats missed due to event loop starvation)
  • Telegram getMe fetch-timeout with timer delayed 50000+ ms
  • Embedded run prep stages: workspace-sandbox 47s, bootstrap-context 162s (vs ms on 4.27, both stages mostly waiting for queued fs operations)

After rollback to 2026.4.27 with identical workload:

strace 5s on 4.27 idle:
  86.84%  epoll_pwait    583 calls
   7.09%  futex           50 calls
   3.66%  read            24 calls
   0.30%  access           3 calls
   0      statx            0 calls
   Total: 671 syscalls in 5 seconds (100× fewer)
sessions.list RPC: 99-155 ms (1500× faster)

Root cause analysis

Setter (without workspaceDir)

src/gateway/server.impl.ts:633 (and :1135):

setCurrentPluginMetadataSnapshot(pluginLookUpTable, {
  config: gatewayPluginConfigAtStart,
});  // ← workspaceDir not passed

Result: snapshot.workspaceDir === undefined (or "").

Readers (with concrete workspaceDir)

src/agents/model-catalog.ts:124:

const snapshot = getCurrentPluginMetadataSnapshot({
  config,
  ...(workspaceDir !== undefined ? { workspaceDir } : {}),
});
const resolvedSnapshot = snapshot ?? loadPluginMetadataSnapshot({...});

src/agents/models-config.ts:179 (similar pattern).

The reader passes workspaceDir = resolveAgentWorkspaceDir(cfg, agentId) which always resolves to a concrete path (e.g. ~/.openclaw/agents/<agent-id>).

The mismatch check

src/plugins/current-plugin-metadata-snapshot.ts:80-87:

if (snapshot.workspaceDir !== undefined && (snapshot.workspaceDir ?? "") !== (params.workspaceDir ?? "")) {
  return undefined;
}

As quoted, the first guard (snapshot.workspaceDir !== undefined) should let the cached snapshot through whenever the setter omits workspaceDir. It appears, however, that setCurrentPluginMetadataSnapshot stores a default of "" (empty string) rather than undefined, so the strict comparison branch runs, and fails, on every call. We have not confirmed this in the source, so please verify against the codebase you're shipping. The practical effect, verified via strace, is a 100% cache miss rate in our deployment.
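
To make the two possibilities concrete, here is a minimal sketch of the guard in isolation. SnapshotLike, isReusable, and the example path are hypothetical stand-ins, not OpenClaw code; only the condition itself is copied from the excerpt above.

```typescript
// Hypothetical, simplified model of the quoted guard; not the shipped code.
interface SnapshotLike {
  workspaceDir?: string;
}

function isReusable(snapshot: SnapshotLike, requested?: string): boolean {
  if (
    snapshot.workspaceDir !== undefined &&
    (snapshot.workspaceDir ?? "") !== (requested ?? "")
  ) {
    return false; // mismatch: the caller falls back to a full rebuild
  }
  return true;
}

// Illustrative path only; the real value comes from resolveAgentWorkspaceDir.
const agentDir = "/home/user/.openclaw/agents/agent-1";

// Setter omitted workspaceDir and it stayed undefined: guard is skipped.
const storedAsUndefined = isReusable({}, agentDir);
// Setter omitted workspaceDir but "" was stored: every concrete reader misses.
const storedAsEmpty = isReusable({ workspaceDir: "" }, agentDir);
// Setter stored the same directory the reader asks for: reuse works.
const storedAsSame = isReusable({ workspaceDir: agentDir }, agentDir);

console.log({ storedAsUndefined, storedAsEmpty, storedAsSame });
// → { storedAsUndefined: true, storedAsEmpty: false, storedAsSame: true }
```

If the stored value really is "", only a reader that passes no workspaceDir (or "") could ever hit the cache, which would match the 100% miss rate we observe.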

Hot path

src/plugins/manifest-registry-installed.ts buildInstalledManifestRegistryIndexKey():

plugins: index.plugins.map((record) => {
  const packageJsonPath = resolvePackageJsonPath(record);
  return {
    ...
    manifestFile: safeFileSignature(record.manifestPath),     // sync fs.statSync
    packageJsonFile: safeFileSignature(packageJsonPath),       // sync fs.statSync
    enabled: record.enabled,                                    // read but not used to filter
    ...
  };
}),

96 plugins × 2 files = 192 synchronous statSync calls per cache miss. With webchat polling sessions.list every ~1s (and the dispatch, model resolve, and tool resolve paths also missing), this reaches ~13K syscalls/sec on the main thread and saturates the event loop.

Note: enabled is read but not used to filter the loop. With 96 bundled records and only 17 enabled, the sweep wastes 158 statSync per cycle on disabled plugins.

Suggested fixes

(In priority order — option A is sufficient, others are defensive)

Option A: Setter passes workspaceDir consistently with readers

Make setCurrentPluginMetadataSnapshot accept and store workspaceDir, populate it from the boot-time config, and assert non-empty before storing. Either of:

// At server.impl.ts:633
setCurrentPluginMetadataSnapshot(pluginLookUpTable, {
  config: gatewayPluginConfigAtStart,
  workspaceDir: resolveDefaultWorkspaceDir(gatewayPluginConfigAtStart),
});

Or change the reader contract: snapshot is workspaceDir-agnostic and readers should not pass it.

Option B: Snapshot Map keyed by workspaceDir

If different workspaceDir values legitimately produce different metadata, store a Map<workspaceDir, Snapshot> rather than a single slot. Keep insertion bounded.
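
A minimal sketch of what a bounded, workspaceDir-keyed store could look like. BoundedSnapshotCache and its least-recently-used eviction policy are assumptions for illustration; nothing here is an existing OpenClaw API.

```typescript
// Hypothetical sketch of Option B: a small LRU cache keyed by workspaceDir.
class BoundedSnapshotCache<S> {
  private readonly entries = new Map<string, S>();
  private readonly maxEntries: number;

  constructor(maxEntries: number) {
    this.maxEntries = maxEntries;
  }

  get(workspaceDir: string | undefined): S | undefined {
    const key = workspaceDir ?? "";
    const hit = this.entries.get(key);
    if (hit === undefined) return undefined;
    // Re-insert to mark as most recently used (Map keeps insertion order).
    this.entries.delete(key);
    this.entries.set(key, hit);
    return hit;
  }

  set(workspaceDir: string | undefined, snapshot: S): void {
    const key = workspaceDir ?? "";
    this.entries.delete(key);
    this.entries.set(key, snapshot);
    if (this.entries.size > this.maxEntries) {
      // Evict the least recently used entry: the first key in insertion order.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}

const cache = new BoundedSnapshotCache<{ pluginCount: number }>(2);
cache.set("/ws/a", { pluginCount: 17 });
cache.set("/ws/b", { pluginCount: 17 });
cache.get("/ws/a"); // touch "a", so "b" becomes least recently used
cache.set("/ws/c", { pluginCount: 17 });
const evicted = cache.get("/ws/b"); // undefined: "b" was evicted
console.log(evicted, cache.size);
```

With 30 agents, a bound at or slightly above the agent count would keep every active workspace resident while still capping memory.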

Option C: Filter index.plugins.map to enabled plugins only

In buildInstalledManifestRegistryIndexKey, skip records where record.enabled === false. Cuts 96 → 17 plugins for our deployment (5.6× reduction). This is a strict improvement regardless of the snapshot bug.
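
The effect can be sketched as follows. PluginRecord, Signer, and buildIndexKeyEntries are hypothetical simplifications of the excerpt in "Hot path", and the counting signer stands in for safeFileSignature so the example runs without touching the filesystem.

```typescript
// Hypothetical record shape; field names loosely mirror the excerpt above.
interface PluginRecord {
  id: string;
  manifestPath: string;
  packageJsonPath: string;
  enabled: boolean;
}

type Signer = (path: string) => string;

function buildIndexKeyEntries(records: PluginRecord[], sign: Signer) {
  return records
    .filter((record) => record.enabled) // skip disabled plugins before any stat
    .map((record) => ({
      id: record.id,
      manifestFile: sign(record.manifestPath),
      packageJsonFile: sign(record.packageJsonPath),
    }));
}

// Same proportions as our deployment: 96 bundled records, 17 enabled.
const records: PluginRecord[] = Array.from({ length: 96 }, (_, i) => ({
  id: `plugin-${i}`,
  manifestPath: `/ext/plugin-${i}/openclaw.plugin.json`,
  packageJsonPath: `/ext/plugin-${i}/package.json`,
  enabled: i < 17,
}));

let statCalls = 0;
const countingSigner: Signer = (path) => {
  statCalls += 1; // each call stands in for one fs.statSync
  return `sig:${path}`;
};

const entries = buildIndexKeyEntries(records, countingSigner);
console.log(entries.length, statCalls); // → 17 34
```

One caveat with this sketch: the key then only reflects enabled plugins, so toggling a plugin must still change the key. Here it does, because the set of ids in the result changes.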

Option D: mtime-based memoization on loadPluginMetadataSnapshot

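Add a memo with a 5-30s TTL around loadPluginMetadataSnapshot, keyed on the mtime of the bundled extensions directory. Note that a directory's mtime changes only when entries are added, removed, or renamed, not when an existing manifest is edited in place, so the key may also need the per-manifest mtimes (checked at most once per TTL window) if in-place edits must invalidate the cache. Such changes are rare on stable installs.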
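
A rough sketch of such a memo, with the fingerprint source and the clock injected so it can be exercised without a filesystem. memoizeWithTtl and its parameters are illustrative names, not existing OpenClaw functions; in practice the fingerprint callback might return something like fs.statSync(extensionsDir).mtimeMs as a string.

```typescript
// Hypothetical helper for Option D; names are illustrative only.
function memoizeWithTtl<T>(
  load: () => T,
  fingerprint: () => string,
  ttlMs: number,
  now: () => number = Date.now,
): () => T {
  let cached: { value: T; fingerprint: string; checkedAt: number } | undefined;
  return () => {
    const t = now();
    if (cached !== undefined && t - cached.checkedAt < ttlMs) {
      return cached.value; // within TTL: no filesystem access at all
    }
    const fp = fingerprint(); // e.g. a single stat of the extensions directory
    if (cached !== undefined && cached.fingerprint === fp) {
      cached.checkedAt = t; // unchanged: extend the TTL without reloading
      return cached.value;
    }
    const value = load(); // the expensive rebuild
    cached = { value, fingerprint: fp, checkedAt: t };
    return value;
  };
}

// Usage with a fake clock and a fake directory mtime.
let clock = 0;
let dirMtime = "m1";
let loads = 0;
const getSnapshot = memoizeWithTtl(() => ++loads, () => dirMtime, 5_000, () => clock);

getSnapshot(); // first call: loads
clock = 1_000;
getSnapshot(); // within TTL: served from memory
clock = 6_000;
getSnapshot(); // TTL expired, fingerprint unchanged: no reload
dirMtime = "m2";
clock = 12_000;
getSnapshot(); // TTL expired and fingerprint changed: reload
console.log(loads); // → 2
```

Under this scheme the hot path costs at most one stat per TTL window instead of 192 per call.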

Why this affects multi-agent / webchat users disproportionately

  • Single-agent CLI users hit only the dispatch path (and only for non-trivial chat turns), so they may go minutes between cache misses
  • Multi-agent + webchat users have:
    • Periodic sessions.list from control-ui / Picker UI (every ~1s when webchat is open)
    • chat.history on every channel switch
    • node.list, commands.list, models.list, device.pair.list on dashboard load
    • All of these enter the loadGatewayModelCatalog → loadManifestModelCatalog chain that triggers the cache miss
  • Each miss costs 192 synchronous statSync calls, roughly 80 ms blocked on the main thread. The @vincentkoc-style fingerprint patches that already shipped should lower the per-call cost, but in our environment a miss routinely blocks for 50-200 ms even with a warm SSD cache

Workaround

Roll back to 2026.4.27 (which doesn't include the loadManifestContractSnapshot call site added by e6825fceaa, so the dispatch path doesn't enter this code).

Related issues

Investigation method

  • 4-hour live debug session
  • gdb -p $PID stack traces during dispatch
  • sudo strace -p $PID -c 5-second syscall histograms (idle / dispatch / shim-removed states)
  • sudo perf record -F 99 -p $PID -g 25-second flame graph
  • git log v2026.4.27..v2026.5.3-1 source diff analysis
  • Cross-validation by independent AI reviewer (using the codebase at /tmp/openclaw-research/openclaw)
  • Rollback to 4.27 confirmed clean idle baseline (671 syscall/5s, 0 statx)

Happy to provide additional traces, flame graphs, or attempt a fix PR if useful.
