Bug type
Regression (worked before, now fails)
Beta release blocker
No
Summary
sessions.list (and other per-row control-plane RPCs) becomes O(rows) slow — tens of seconds — only when other agents/crons are running concurrently. On an idle gateway it is fast. The cause is not cache size: the per-row plugin-metadata lookup reads a process-global "active plugin-registry workspace dir" that concurrent agent-turns/crons mutate while sessions.list is await-yielding between row batches. The metadata-snapshot memo key includes workspaceDir, so it changes on essentially every row, the memo never hits, and each row triggers a fresh full loadPluginMetadataSnapshot scan (~100 ms).
This is the residual concurrency facet of the now-closed #76562. That issue was closed as completed after the maintainer could not reproduce it on an idle/quiet 2026.5.28 gateway, but the failure mode persists under real multi-agent load. Filing separately so it is tracked on an open issue with a precise root cause and a fix.
Steps to reproduce
- Run a gateway with several concurrent actors (e.g. 1 main agent + a couple of crons actively taking turns).
- While they are running, issue sessions.list (dashboard load, MCP client, or openclaw CLI).
- Observe the call takes ~10 s even though the session store index is tiny (7–13 entries).
- For contrast, stop all agents/crons (idle gateway) and repeat: the same sessions.list returns in milliseconds.
To capture evidence, set OPENCLAW_DIAGNOSTICS=1 and OPENCLAW_DIAGNOSTICS_TIMELINE_PATH=/tmp/x.jsonl in the gateway's environment, then inspect the gateway.sessions.list span tree plus the plugins.metadata.scan spans.
Expected behavior
A single sessions.list resolves plugin metadata once and reuses it for all rows, independent of concurrent agent/cron activity. Wall time should track the store size (milliseconds for a small store), not the number of concurrent actors.
Actual behavior
One sessions.list call on a busy gateway (diagnostics timeline):
OpenClaw version
2026.5.22 (also reproduced against a current main source checkout while developing the fix below).
Operating system
Linux x64
Install method
Source checkout / development workflow.
Model
Anthropic-family + OpenAI-compat providers via a proxy; not model-specific — the hot path is plugin model-id normalization, which runs regardless of the routed model.
Provider / routing chain
Multi-provider config (Anthropic messages API + OpenAI-compat). The slowdown is independent of the routing chain; it is driven by concurrent actors mutating the global active workspace dir, not by any provider call.
Additional provider/model setup details
Multi-agent gateway: 1 main agent + a secondary agent + several crons.
Logs, screenshots, and evidence
**Mechanism.** Call chain per row (lightweight list rows still hit this via model-ref/runtime resolution):
buildGatewaySessionRow
  -> ... -> normalizeProviderModelIdWithRuntime
  -> normalizeProviderModelIdWithManifest
  -> resolveManifestModelIdNormalizationPolicy
  -> resolveMetadataSnapshotForPolicies (src/plugins/manifest-model-id-normalization.ts)
       const workspaceDir = params.workspaceDir ?? getActivePluginRegistryWorkspaceDirFromState();
       const current = getCurrentPluginMetadataSnapshot({ config, env, workspaceDir });
       if (current) return current;
       return loadPluginMetadataSnapshot({ config: config ?? {}, env, workspaceDir }); // full scan
`listSessionsFromStoreAsync` deliberately yields every `SESSIONS_LIST_YIELD_BATCH_SIZE` rows:
if ((i + 1) % SESSIONS_LIST_YIELD_BATCH_SIZE === 0 && i + 1 < entries.length) {
  await new Promise((resolve) => setImmediate(resolve)); // gives concurrent agent-turns the loop
}
During each yield, a concurrent agent-turn/cron calls `setActivePluginRegistry(...)`, setting the **global** active-registry workspace to *its* workspace. So row N reads workspace A, row N+1 reads B, etc. `computePluginMetadataSnapshotMemoKey` includes `workspaceDir`, so the key differs per row, `getCurrentPluginMetadataSnapshot(...)` misses, and `loadPluginMetadataSnapshot(...)` runs a fresh ~100 ms scan. With derived/bundled registries (`registrySource: "derived"`) the result is not stored in the process memo, so the next row cannot reuse it either.
In short: **a single `sessions.list` reads a process-global workspace per row, and that global is mutated underneath it by concurrent work across its own `await` points** → O(rows) full plugin-metadata scans. This is why an idle benchmark looked fixed by #76655 but real multi-agent gateways still pin CPU.
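The interleaving described above can be reproduced in isolation. The following is a minimal sketch, not gateway code: `countScans`, `actorTick`, `resolveSnapshot`, and the `/ws/...` paths are hypothetical stand-ins, the memo is simplified to a single slot keyed by workspace dir, and the batch size is assumed to be 1 so that every row yields. The "scan" is only a counter standing in for the ~100 ms `loadPluginMetadataSnapshot` call.

```typescript
// Hypothetical stand-ins for gateway internals; none of these are real OpenClaw APIs.
type Snapshot = { workspaceDir: string };

async function countScans(rowCount: number, busy: boolean): Promise<number> {
  let activeWorkspaceDir = "/ws/main"; // models the process-global active workspace dir
  let memo: Snapshot | undefined; // simplified single-slot memo keyed by workspaceDir
  let scans = 0;
  let running = busy;

  // Models a concurrent agent-turn/cron: on each loop turn it repoints the global.
  const actorTick = (): void => {
    if (!running) return;
    activeWorkspaceDir = activeWorkspaceDir === "/ws/main" ? "/ws/cron" : "/ws/main";
    setImmediate(actorTick);
  };
  if (busy) setImmediate(actorTick);

  const resolveSnapshot = (): Snapshot => {
    const workspaceDir = activeWorkspaceDir; // per-row read of the mutable global
    if (memo && memo.workspaceDir === workspaceDir) return memo; // memo hit
    scans += 1; // stands in for a full plugin-metadata scan
    memo = { workspaceDir };
    return memo;
  };

  for (let i = 0; i < rowCount; i++) {
    resolveSnapshot();
    if (i + 1 < rowCount) {
      // Assumed batch size of 1: yield to the event loop after every row.
      await new Promise<void>((resolve) => setImmediate(resolve));
    }
  }
  running = false; // stop the simulated actor
  return scans;
}

async function demo(): Promise<{ idle: number; busy: number }> {
  const idle = await countScans(6, false);
  const busy = await countScans(6, true);
  return { idle, busy };
}

demo().then((result) => console.log(result));
```

With six rows, the idle run should report a single scan and the busy run one scan per row, which mirrors the idle-versus-loaded difference in the reproduction steps.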
Impact and severity
High under load: control-plane RPCs (sessions.list, and any per-row resolver sharing this path) take tens of seconds and saturate the single gateway event loop, degrading UI/WebSocket responsiveness and channel turn latency for everyone on the gateway. Scales with both row count and concurrent-actor count.
Additional information
Suggested fix (implemented and validated locally). Pin the active plugin-registry workspace dir for the duration of the row-building batch so every row in one sessions.list reads a stable value, immune to concurrent global mutation, while other concurrent async contexts still observe the live global. AsyncLocalStorage scopes the pin to the batch's async context only:
// runtime-workspace-state.ts
import { AsyncLocalStorage } from "node:async_hooks";

// Holds the workspace dir pinned for the current async context, if any.
const pinnedWorkspaceDirStorage = new AsyncLocalStorage<{ workspaceDir: string | undefined }>();

export function getActivePluginRegistryWorkspaceDirFromState(): string | undefined {
  const pinned = pinnedWorkspaceDirStorage.getStore();
  if (pinned) return pinned.workspaceDir; // inside a pinned scope: ignore the live global
  return (globalThis as ...)[PLUGIN_REGISTRY_STATE]?.workspaceDir ?? undefined;
}

export function withPinnedActivePluginRegistryWorkspaceDir<T>(fn: () => T): T {
  if (pinnedWorkspaceDirStorage.getStore()) return fn(); // nested: reuse outer pin
  // Snapshot the live global once; every read in fn's async context sees this value.
  const workspaceDir = (globalThis as ...)[PLUGIN_REGISTRY_STATE]?.workspaceDir ?? undefined;
  return pinnedWorkspaceDirStorage.run({ workspaceDir }, fn);
}
…then wrap the listSessionsFromStoreAsync row loop (including its inter-batch yields) in await withPinnedActivePluginRegistryWorkspaceDir(async () => { ... }). This collapses the O(rows) scans to one regardless of concurrent agent/cron activity.
A concurrency regression test locks it in: fire the pinned scope, mutate the active workspace via setActivePluginRegistry mid-scope across a setImmediate yield, and assert reads inside the scope stay stable while reads after exit observe the live (mutated) global.
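The shape of that test can be sketched in a self-contained form. This is an illustration under simplifying assumptions, not the actual test: a module-level variable replaces the `globalThis` registry state, and `setActiveWorkspaceDir`, `getActiveWorkspaceDir`, `withPinnedWorkspaceDir`, and `pinScenario` are hypothetical shortened names standing in for the real functions.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Hypothetical simplification: a module-level variable instead of globalThis state.
let liveWorkspaceDir: string | undefined = "/ws/main";
const pinned = new AsyncLocalStorage<{ workspaceDir: string | undefined }>();

// Stand-in for setActivePluginRegistry(...): mutates the live global.
function setActiveWorkspaceDir(dir: string | undefined): void {
  liveWorkspaceDir = dir;
}

function getActiveWorkspaceDir(): string | undefined {
  const store = pinned.getStore();
  return store ? store.workspaceDir : liveWorkspaceDir;
}

function withPinnedWorkspaceDir<T>(fn: () => T): T {
  if (pinned.getStore()) return fn(); // nested: reuse outer pin
  return pinned.run({ workspaceDir: liveWorkspaceDir }, fn);
}

type Reads = Record<"beforeYield" | "afterYield" | "seenByActor" | "afterExit", string | undefined>;

async function pinScenario(): Promise<Reads> {
  setActiveWorkspaceDir("/ws/main"); // reset so repeated runs start from the same state

  // Simulated concurrent actor, scheduled outside the pinned scope so it runs unpinned.
  const actorRead = new Promise<string | undefined>((resolve) => {
    setImmediate(() => {
      setActiveWorkspaceDir("/ws/cron");
      resolve(getActiveWorkspaceDir());
    });
  });

  const inside = await withPinnedWorkspaceDir(async () => {
    const beforeYield = getActiveWorkspaceDir();
    await new Promise<void>((resolve) => setImmediate(resolve)); // actor mutates during this yield
    const afterYield = getActiveWorkspaceDir();
    return { beforeYield, afterYield };
  });

  const afterExit = getActiveWorkspaceDir(); // outside the scope: live global again
  return { ...inside, seenByActor: await actorRead, afterExit };
}

pinScenario().then((reads) => console.log(reads));
```

Both reads inside the scope should return the pinned `/ws/main`, while the simulated actor and the read after exit should see the mutated `/ws/cron`. That is the property the fix relies on: the pin is visible only to the async context that created it.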
I'm happy to open a PR with this change, and can share the raw diagnostics timeline JSONL if useful.