TL;DR
The fix shipped for #76182 and #73353 added a getCurrentPluginMetadataSnapshot reuse path, but the setter (setCurrentPluginMetadataSnapshot at server.impl.ts) does not pass workspaceDir, while every reader (e.g. models-config.ts, model-catalog.ts, tools.ts) does pass a concrete workspaceDir = resolveAgentWorkspaceDir(cfg, agentId). The strict equality check at current-plugin-metadata-snapshot.ts:80-87 therefore evaluates to mismatch on every call, so the cached snapshot is always rejected and loadPluginMetadataSnapshot (full rebuild + per-plugin stat sweep) is invoked instead. The manifest-contract-eligibility.ts reuse path becomes dead code.
For multi-agent / multi-channel deployments using webchat / Picker UI (high sessions.list polling rate), this saturates the gateway main thread and reproduces all symptoms reported in #76182, #73353, and #61701.
Bug type
Regression (against the fix shipped for #76182 / #73353)
Beta release blocker
No (workaround: rollback to v2026.4.27)
Environment
- OpenClaw version: 2026.5.3-1 (cbc2ba0 was 4.27 baseline, regression confirmed against 5.3-1)
- Previous known-good: 2026.4.27
- Host: x86_64 Linux 6.8.0-111-generic (Ubuntu 24.04)
- Node: v25.9.0 (Linuxbrew)
- Install: global npm install
- Workload: 30 agents, 9 channel accounts (6 Discord bots + 3 Feishu + 1 Telegram), webchat / Picker UI active during normal operation, 17 enabled plugins out of 96 bundled
Steps to reproduce
- Install OpenClaw 2026.5.3-1
- Configure ≥10 agents (or rely on stock 96 bundled plugins)
- Open webchat (control-ui), which triggers periodic sessions.list / chat.history / node.list RPCs
- Optionally trigger any agent dispatch
- Observe gateway main thread
Expected behavior
After the fix shipped for #76182:
- getCurrentPluginMetadataSnapshot returns the cached snapshot when called from runtime paths
- manifest-contract-eligibility.ts reuse short-circuits the manifest registry rebuild
- No per-plugin manifest sweep on hot RPC paths
Actual behavior
strace 5s on gateway main thread (PID running 5.3-1, dispatch in progress):
70.35% statx 48,914 calls
9.95% access 6,895 calls
8.24% openat 4,564 calls
5.90% close 4,565 calls
4.75% read 3,437 calls
Total: 67,201 syscalls in 5 seconds → ~13,400 syscalls/sec
Strace pattern shows alphabetical traversal of dist/extensions/<plugin>/ for every bundled plugin (96 of them):
[pid X] access("/path/dist/extensions/signal/package.json", F_OK) = 0
[pid X] openat("/path/dist/extensions/signal/openclaw.plugin.json", O_RDONLY)
[pid X] openat("/path/dist/extensions/signal/package.json", O_RDONLY)
[pid X] access("/path/dist/extensions/skill-workshop/package.json", F_OK) = 0
... (all 96 plugins, two files each: openclaw.plugin.json + package.json)
Symptom impact:
- sessions.list RPC: 156 seconds (vs 99-155 ms on 4.27)
- chat.history RPC: 96 seconds
- eventLoopUtilization: 1.0 (saturated)
- eventLoopDelayMaxMs: 30000+ ms (regular)
- Discord 6 bots: same-millisecond close 1000 storms (heartbeat misses due to event loop starvation)
- Telegram getMe fetch-timeout with timer delayed 50000+ ms
- Embedded run prep stages: workspace-sandbox 47s, bootstrap-context 162s (vs ms on 4.27, both stages mostly waiting for queued fs operations)
After rollback to 2026.4.27 with identical workload:
strace 5s on 4.27 idle:
86.84% epoll_pwait 583 calls
7.09% futex 50 calls
3.66% read 24 calls
0.30% access 3 calls
0 statx 0 calls
Total: 671 syscalls in 5 seconds (100× fewer)
sessions.list RPC: 99-155 ms (1500× faster)
Root cause analysis
Setter (without workspaceDir)
src/gateway/server.impl.ts:633 (and :1135):
setCurrentPluginMetadataSnapshot(pluginLookUpTable, {
  config: gatewayPluginConfigAtStart,
}); // ← workspaceDir not passed
Result: snapshot.workspaceDir === undefined (or "").
Readers (with concrete workspaceDir)
src/agents/model-catalog.ts:124:
const snapshot = getCurrentPluginMetadataSnapshot({
  config,
  ...(workspaceDir !== undefined ? { workspaceDir } : {}),
});
const resolvedSnapshot = snapshot ?? loadPluginMetadataSnapshot({...});
src/agents/models-config.ts:179 (similar pattern).
The reader passes workspaceDir = resolveAgentWorkspaceDir(cfg, agentId) which always resolves to a concrete path (e.g. ~/.openclaw/agents/<agent-id>).
The mismatch check
src/plugins/current-plugin-metadata-snapshot.ts:80-87:
if (snapshot.workspaceDir !== undefined && (snapshot.workspaceDir ?? "") !== (params.workspaceDir ?? "")) {
  return undefined;
}
At first glance, the first guard (snapshot.workspaceDir !== undefined) should let the cached snapshot pass when the setter omits workspaceDir. In our deployment, however, setCurrentPluginMetadataSnapshot appears to store the default as "" (empty string) rather than undefined. Every call therefore reaches the strict comparison, and "" never equals the concrete path the reader passes. Please verify against the codebase you're shipping; the practical effect (verified via strace) is a 100% cache miss rate in our deployment.
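To make the two cases concrete, here is a minimal sketch that reuses the condition quoted above on simplified stand-in types. The names isWorkspaceCompatible, SnapshotLike and ReaderParams are hypothetical and exist only for this illustration; the example path is made up.

```typescript
// Hypothetical, simplified stand-ins; not the shipped types.
type SnapshotLike = { workspaceDir?: string };
type ReaderParams = { workspaceDir?: string };

// Same condition as the quoted guard, inverted to answer "is the cache usable?".
function isWorkspaceCompatible(snapshot: SnapshotLike, params: ReaderParams): boolean {
  if (
    snapshot.workspaceDir !== undefined &&
    (snapshot.workspaceDir ?? "") !== (params.workspaceDir ?? "")
  ) {
    return false;
  }
  return true;
}

// Readers always pass a concrete path (illustrative value).
const reader: ReaderParams = { workspaceDir: "/home/user/.openclaw/agents/agent-1" };

// Case 1: the stored value stayed undefined, so the guard is skipped.
const storedUndefined = isWorkspaceCompatible({}, reader);

// Case 2: the stored value was normalized to "", so the comparison always fails.
const storedEmpty = isWorkspaceCompatible({ workspaceDir: "" }, reader);

console.log({ storedUndefined, storedEmpty });
```

If the stored value is "", only a reader that passes no workspaceDir (or an empty one) would ever get a cache hit, which matches the always-miss behaviour we observe.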
Hot path
src/plugins/manifest-registry-installed.ts buildInstalledManifestRegistryIndexKey():
plugins: index.plugins.map((record) => {
  const packageJsonPath = resolvePackageJsonPath(record);
  return {
    ...
    manifestFile: safeFileSignature(record.manifestPath), // sync fs.statSync
    packageJsonFile: safeFileSignature(packageJsonPath), // sync fs.statSync
    enabled: record.enabled, // read but not used to filter
    ...
  };
}),
96 plugins × 2 files = 192 synchronous statSync calls per cache miss. With webchat polling sessions.list every ~1s (and the dispatch / model resolve / tool resolve paths also missing the cache), this reaches ~13K syscalls/sec on the main thread and saturates the event loop.
Note: enabled is read but not used to filter the loop. With 96 bundled records and only 17 enabled, the sweep wastes 158 statSync per cycle on disabled plugins.
Suggested fixes
(In priority order — option A is sufficient, others are defensive)
Option A: Setter passes workspaceDir consistently with readers
Make setCurrentPluginMetadataSnapshot accept and store workspaceDir, populate it from the boot-time config, and assert non-empty before storing. Either of:
// At server.impl.ts:633
setCurrentPluginMetadataSnapshot(pluginLookUpTable, {
  config: gatewayPluginConfigAtStart,
  workspaceDir: resolveDefaultWorkspaceDir(gatewayPluginConfigAtStart),
});
Or change the reader contract: snapshot is workspaceDir-agnostic and readers should not pass it.
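The "assert non-empty before storing" part of Option A could look roughly like the sketch below. It is not the real setter: setSnapshotStrict, getStoredSnapshot and StoredSnapshot are hypothetical names, and the real function takes different arguments.

```typescript
// Hypothetical single-slot store used only to illustrate the assertion.
type StoredSnapshot<T> = { table: T; workspaceDir: string };

let currentSnapshot: StoredSnapshot<unknown> | undefined;

function setSnapshotStrict<T>(table: T, options: { workspaceDir: string }): void {
  const workspaceDir = options.workspaceDir.trim();
  // Fail loudly at boot instead of silently storing "" and missing on every read.
  if (workspaceDir.length === 0) {
    throw new Error("setSnapshotStrict: workspaceDir must be a non-empty path");
  }
  currentSnapshot = { table, workspaceDir };
}

function getStoredSnapshot(): StoredSnapshot<unknown> | undefined {
  return currentSnapshot;
}
```

Throwing at boot turns a silent 100% miss rate into an immediate, visible startup error.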
Option B: Snapshot Map keyed by workspaceDir
If different workspaceDir values legitimately produce different metadata, store a Map<workspaceDir, Snapshot> rather than a single slot. Keep insertion bounded.
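One possible shape for the bounded map is sketched below, using least-recently-used eviction on top of Map insertion order. BoundedSnapshotCache and its limit are assumptions for illustration; nothing like it exists in the codebase as far as we know.

```typescript
// Hypothetical bounded cache keyed by workspaceDir (LRU eviction).
class BoundedSnapshotCache<S> {
  private readonly entries = new Map<string, S>();
  private readonly maxEntries: number;

  constructor(maxEntries: number) {
    this.maxEntries = maxEntries;
  }

  get(workspaceDir: string): S | undefined {
    const hit = this.entries.get(workspaceDir);
    if (hit !== undefined) {
      // Re-insert so this key becomes the most recently used one.
      this.entries.delete(workspaceDir);
      this.entries.set(workspaceDir, hit);
    }
    return hit;
  }

  set(workspaceDir: string, snapshot: S): void {
    this.entries.delete(workspaceDir);
    this.entries.set(workspaceDir, snapshot);
    if (this.entries.size > this.maxEntries) {
      // Map iterates in insertion order, so the first key is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) {
        this.entries.delete(oldest);
      }
    }
  }

  get size(): number {
    return this.entries.size;
  }
}
```

With 30 agents, a limit slightly above the agent count would keep every workspace resident while still capping memory if workspaces churn.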
Option C: Filter index.plugins.map to enabled plugins only
In buildInstalledManifestRegistryIndexKey, skip records where record.enabled === false. Cuts 96 → 17 plugins for our deployment (5.6× reduction). This is a strict improvement regardless of the snapshot bug.
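A rough sketch of the filter and its effect on the number of signature calls follows. The record shape and buildIndexKeyEntries are simplified, hypothetical stand-ins for the real index code, and the signature function is a counting stub rather than a real stat call.

```typescript
// Hypothetical record shape; the real index record has more fields.
type PluginRecord = {
  id: string;
  enabled: boolean;
  manifestPath: string;
  packageJsonPath: string;
};

type SignatureFn = (path: string) => string;

function buildIndexKeyEntries(
  records: PluginRecord[],
  signature: SignatureFn,
  onlyEnabled: boolean,
) {
  // Option C: drop disabled records before computing any file signature.
  const selected = onlyEnabled ? records.filter((r) => r.enabled) : records;
  return selected.map((r) => ({
    id: r.id,
    manifestFile: signature(r.manifestPath),
    packageJsonFile: signature(r.packageJsonPath),
  }));
}

// Synthetic data matching the numbers in this report: 96 records, 17 enabled.
const records: PluginRecord[] = Array.from({ length: 96 }, (_, i) => ({
  id: `plugin-${i}`,
  enabled: i < 17,
  manifestPath: `/ext/plugin-${i}/openclaw.plugin.json`,
  packageJsonPath: `/ext/plugin-${i}/package.json`,
}));

function countSignatureCalls(onlyEnabled: boolean): number {
  let calls = 0;
  const countingSignature: SignatureFn = (path) => {
    calls += 1;
    return path; // stand-in for a stat-based signature
  };
  buildIndexKeyEntries(records, countingSignature, onlyEnabled);
  return calls;
}

const sweepAll = countSignatureCalls(false); // 96 × 2 = 192
const sweepEnabled = countSignatureCalls(true); // 17 × 2 = 34
console.log({ sweepAll, sweepEnabled, saved: sweepAll - sweepEnabled });
```

The difference of 158 calls is the per-cycle waste on disabled plugins noted in the hot path section.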
Option D: mtime-based memoization on loadPluginMetadataSnapshot
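Add a 5-30s TTL memo around loadPluginMetadataSnapshot keyed on the bundled extensions directory mtime. The cache invalidates when that mtime changes, which is rare on stable installs. Note that a directory's mtime changes only when entries are added, removed, or renamed, so an in-place edit to a single manifest would go unnoticed unless per-file mtimes are also checked.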
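A minimal sketch of such a memo is shown below. memoizeWithTtl is a hypothetical helper: the loader, the mtime reader and the clock are injected, so the sketch makes no assumption about how loadPluginMetadataSnapshot is called. In real use readMtimeMs would be something like a single fs.statSync on the extensions directory.

```typescript
// Hypothetical TTL + mtime memo. Within the TTL no filesystem access happens;
// after the TTL one cheap mtime read decides whether a full reload is needed.
function memoizeWithTtl<T>(
  load: () => T,
  readMtimeMs: () => number,
  ttlMs: number,
  now: () => number = Date.now,
): () => T {
  let cached: { value: T; mtimeMs: number; expiresAt: number } | undefined;

  return () => {
    const t = now();
    if (cached !== undefined && t < cached.expiresAt) {
      return cached.value; // fresh: skip even the mtime check
    }
    const mtimeMs = readMtimeMs();
    if (cached !== undefined && cached.mtimeMs === mtimeMs) {
      cached.expiresAt = t + ttlMs; // unchanged on disk: extend the TTL
      return cached.value;
    }
    const value = load(); // first call or mtime changed: full rebuild
    cached = { value, mtimeMs, expiresAt: t + ttlMs };
    return value;
  };
}
```

With a 5s TTL and ~1s sessions.list polling, this would replace roughly four out of five 192-call sweeps with no filesystem access at all, and most of the rest with one stat.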
Why this affects multi-agent / webchat users disproportionately
- Single-agent CLI users hit only the dispatch path (and only for non-trivial chat turns), so they may go minutes between cache misses
- Multi-agent + webchat users have:
  - Periodic sessions.list from control-ui / Picker UI (every ~1s when webchat is open)
  - chat.history on every channel switch
  - node.list, commands.list, models.list, device.pair.list on dashboard load
- All of these enter the loadGatewayModelCatalog → loadManifestModelCatalog chain that triggers the cache miss
- Each miss is 192 sync statSync = ~80ms blocked on the main thread (per @vincentkoc-style fingerprint patches that already shipped, the per-call cost should be lower; in our env it routinely blocks 50-200ms with the SSD warm cache)
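As a back-of-the-envelope check on the list above, the fraction of main-thread time spent blocked can be estimated as miss rate times block duration. This is a deliberately crude model that ignores queueing and every other source of load; blockedFraction is a hypothetical helper and the miss rates in the example are illustrative, not measured.

```typescript
// Crude estimate: fraction of each second the main thread spends inside
// synchronous stat sweeps, capped at 1.0 (fully saturated).
function blockedFraction(missesPerSecond: number, blockMsPerMiss: number): number {
  return Math.min(1, (missesPerSecond * blockMsPerMiss) / 1000);
}

// One miss per second at ~80 ms each: about 8% of the event loop.
const singlePoller = blockedFraction(1, 80);

// Thirteen misses per second at ~80 ms each: the loop is saturated.
const manyCallers = blockedFraction(13, 80);

console.log({ singlePoller, manyCallers });
```

Under this model a single poller is tolerable, but a dozen or so concurrent miss sources at ~80 ms each are already enough to saturate the loop.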
Workaround
Roll back to 2026.4.27 (which doesn't include the loadManifestContractSnapshot call site added by e6825fceaa, so the dispatch path doesn't enter this code).
Related issues
Investigation method
- 4-hour live debug session
- gdb -p $PID: stack traces during dispatch
- sudo strace -p $PID -c: 5-second syscall histograms (idle / dispatch / shim-removed states)
- sudo perf record -F 99 -p $PID -g: 25-second flame graph
- git log v2026.4.27..v2026.5.3-1: source diff analysis
- Cross-validation by independent AI reviewer (using the codebase at /tmp/openclaw-research/openclaw)
- Rollback to 4.27 confirmed clean idle baseline (671 syscalls/5s, 0 statx)
Happy to provide additional traces, flame graphs, or attempt a fix PR if useful.