[Bug]: sessions.list O(rows) plugin-metadata scans under concurrency: per-row read of a globally-mutated active workspaceDir (residual of #76562) #90814

@k-l-lambda

Bug type

Regression (worked before, now fails)

Beta release blocker

No

Summary

sessions.list (and other per-row control-plane RPCs) becomes O(rows) slow — tens of seconds — only when other agents/crons are running concurrently. On an idle gateway it is fast. The cause is not cache size: the per-row plugin-metadata lookup reads a process-global "active plugin-registry workspace dir" that concurrent agent-turns/crons mutate while sessions.list is await-yielding between row batches. The metadata-snapshot memo key includes workspaceDir, so it changes on essentially every row, the memo never hits, and each row triggers a fresh full loadPluginMetadataSnapshot scan (~100 ms).

This is the residual concurrency facet of the now-closed #76562. That issue was closed as completed after the maintainer could not reproduce it on an idle/quiet 2026.5.28 gateway, but the failure mode persists under real multi-agent load. Filing separately so it is tracked on an open issue with a precise root cause and a fix.

Steps to reproduce

  1. Run a gateway with several concurrent actors (e.g. 1 main agent + a couple of crons actively taking turns).
  2. While they are running, issue sessions.list (dashboard load, MCP client, or openclaw CLI).
  3. Observe the call takes ~10 s even though the session store index is tiny (7–13 entries).
  4. For contrast, stop all agents/crons (idle gateway) and repeat — the same sessions.list returns in milliseconds.

To capture evidence: OPENCLAW_DIAGNOSTICS=1 OPENCLAW_DIAGNOSTICS_TIMELINE_PATH=/tmp/x.jsonl and inspect the gateway.sessions.list span tree plus plugins.metadata.scan spans.
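As a rough aid for reading the captured timeline, the sketch below tallies spans by name so the number of plugins.metadata.scan spans can be compared with the row count. The timeline schema is not documented in this report: the code assumes each JSONL line is an object with a string `name` field, and `countSpansByName` is a hypothetical helper, not part of OpenClaw.

```typescript
// Assumption: every timeline line is a JSON object carrying a string `name`
// field (e.g. "plugins.metadata.scan"). Adjust the field if the real schema differs.
function countSpansByName(jsonl: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of jsonl.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed) continue; // skip blank lines
    let record: unknown;
    try {
      record = JSON.parse(trimmed);
    } catch {
      continue; // tolerate a truncated or partial trailing line
    }
    const name = (record as { name?: unknown } | null)?.name;
    if (typeof name === "string") {
      counts.set(name, (counts.get(name) ?? 0) + 1);
    }
  }
  return counts;
}
```

Feeding it the contents of /tmp/x.jsonl (for example via `readFileSync("/tmp/x.jsonl", "utf8")` from `node:fs`) and comparing the `plugins.metadata.scan` count with the number of session rows should show whether the scans scale with rows.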

Expected behavior

A single sessions.list resolves plugin metadata once and reuses it for all rows, independent of concurrent agent/cron activity. Wall time should track the store size (milliseconds for a small store), not the number of concurrent actors.

Actual behavior

One sessions.list call on a busy gateway (diagnostics timeline):

OpenClaw version

2026.5.22 (also reproduced against a current main source checkout while developing the fix below).

Operating system

Linux x64

Install method

Source checkout / development workflow.

Model

Anthropic-family + OpenAI-compat providers via a proxy; not model-specific — the hot path is plugin model-id normalization, which runs regardless of the routed model.

Provider / routing chain

Multi-provider config (Anthropic messages API + OpenAI-compat). The slowdown is independent of the routing chain; it is driven by concurrent actors mutating the global active workspace dir, not by any provider call.

Additional provider/model setup details

Multi-agent gateway: 1 main agent + a secondary agent + several crons.

Logs, screenshots, and evidence

**Mechanism.** Call chain per row (lightweight list rows still hit this via model-ref/runtime resolution):


buildGatewaySessionRow
  -> ... -> normalizeProviderModelIdWithRuntime
  -> normalizeProviderModelIdWithManifest
  -> resolveManifestModelIdNormalizationPolicy
  -> resolveMetadataSnapshotForPolicies        (src/plugins/manifest-model-id-normalization.ts)
       const workspaceDir = params.workspaceDir ?? getActivePluginRegistryWorkspaceDirFromState();
       const current = getCurrentPluginMetadataSnapshot({ config, env, workspaceDir });
       if (current) return current;
       return loadPluginMetadataSnapshot({ config: config ?? {}, env, workspaceDir });  // full scan


`listSessionsFromStoreAsync` deliberately yields every `SESSIONS_LIST_YIELD_BATCH_SIZE` rows:


if ((i + 1) % SESSIONS_LIST_YIELD_BATCH_SIZE === 0 && i + 1 < entries.length) {
  await new Promise((resolve) => setImmediate(resolve));   // gives concurrent agent-turns the loop
}


During each yield, a concurrent agent-turn/cron calls `setActivePluginRegistry(...)`, setting the **global** active-registry workspace to *its* workspace. So row N reads workspace A, row N+1 reads B, etc. `computePluginMetadataSnapshotMemoKey` includes `workspaceDir`, so the key differs per row, `getCurrentPluginMetadataSnapshot(...)` misses, and `loadPluginMetadataSnapshot(...)` runs a fresh ~100 ms scan. With derived/bundled registries (`registrySource: "derived"`) the result is not stored in the process memo, so the next row cannot reuse it either.

In short: **a single `sessions.list` reads a process-global workspace per row, and that global is mutated underneath it by concurrent work across its own `await` points** → O(rows) full plugin-metadata scans. This is why an idle benchmark looked fixed by #76655 but real multi-agent gateways still pin CPU.
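The interleaving can be reproduced in a small self-contained model. This is a simplified sketch, not the OpenClaw code: `activeWorkspaceDir`, `currentSnapshot`, `resolveSnapshot`, `listRows` and `runActor` are hypothetical stand-ins, the memo is reduced to a single "current snapshot" slot, and the yield happens after every row rather than every `SESSIONS_LIST_YIELD_BATCH_SIZE` rows.

```typescript
// Hypothetical model of the failure mode; none of these names are OpenClaw APIs.
type Snapshot = { workspaceDir: string | undefined };

let activeWorkspaceDir: string | undefined = "/ws/main"; // process-global, mutable
let currentSnapshot: Snapshot = { workspaceDir: "/ws/main" }; // single-slot memo
let scanCount = 0;

function resolveSnapshot(): Snapshot {
  const workspaceDir = activeWorkspaceDir; // read per row from the global
  if (currentSnapshot.workspaceDir === workspaceDir) return currentSnapshot; // memo hit
  scanCount += 1; // stands in for a full metadata scan
  currentSnapshot = { workspaceDir };
  return currentSnapshot;
}

const yieldToLoop = () => new Promise<void>((resolve) => setImmediate(resolve));

// Row loop that yields between batches, like the list path described above.
async function listRows(rows: number, batchSize: number): Promise<number> {
  const before = scanCount;
  for (let i = 0; i < rows; i++) {
    resolveSnapshot();
    if ((i + 1) % batchSize === 0 && i + 1 < rows) await yieldToLoop();
  }
  return scanCount - before;
}

// Concurrent actor: switches the global workspace every time it gets the loop.
async function runActor(shouldStop: () => boolean): Promise<void> {
  let turn = 0;
  while (!shouldStop()) {
    activeWorkspaceDir = `/ws/cron-${turn++}`;
    await yieldToLoop();
  }
}

async function demo(): Promise<{ idleScans: number; busyScans: number }> {
  const idleScans = await listRows(8, 1); // no actor running: every row hits the memo
  let done = false;
  const actor = runActor(() => done);
  const busyScans = await listRows(8, 1); // actor mutates the global during each yield
  done = true;
  await actor;
  return { idleScans, busyScans };
}
```

In this model the idle run performs no scans, while the busy run scans for essentially every row, because each yield lets the actor change the value that the next row reads.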

Impact and severity

High under load: control-plane RPCs (sessions.list, and any per-row resolver sharing this path) take tens of seconds and saturate the single gateway event loop, degrading UI/WebSocket responsiveness and channel turn latency for everyone on the gateway. Scales with both row count and concurrent-actor count.

Additional information

Suggested fix (implemented and validated locally). Pin the active plugin-registry workspace dir for the duration of the row-building batch so every row in one sessions.list reads a stable value, immune to concurrent global mutation, while other concurrent async contexts still observe the live global. AsyncLocalStorage scopes the pin to the batch's async context only:

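// runtime-workspace-state.ts
import { AsyncLocalStorage } from "node:async_hooks";

// Holds the workspace dir pinned for the current async context, if any.
const pinnedWorkspaceDirStorage = new AsyncLocalStorage<{ workspaceDir: string | undefined }>();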

export function getActivePluginRegistryWorkspaceDirFromState(): string | undefined {
  const pinned = pinnedWorkspaceDirStorage.getStore();
  if (pinned) return pinned.workspaceDir;
  return (globalThis as ...)[PLUGIN_REGISTRY_STATE]?.workspaceDir ?? undefined;
}

export function withPinnedActivePluginRegistryWorkspaceDir<T>(fn: () => T): T {
  if (pinnedWorkspaceDirStorage.getStore()) return fn();         // nested: reuse outer pin
  const workspaceDir = (globalThis as ...)[PLUGIN_REGISTRY_STATE]?.workspaceDir ?? undefined;
  return pinnedWorkspaceDirStorage.run({ workspaceDir }, fn);
}

…then wrap the listSessionsFromStoreAsync row loop (including its inter-batch yields) in await withPinnedActivePluginRegistryWorkspaceDir(async () => { ... }). This collapses the O(rows) scans to one regardless of concurrent agent/cron activity.
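To illustrate the effect of the wrap, here is a minimal self-contained sketch under stated assumptions. The `state` object, the single-slot `cached` memo, `resolveSnapshot`, `listRows` and `compare` are hypothetical simplifications, and the loop yields after every row; only the pinning pattern mirrors the snippet above.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Hypothetical stand-ins for the gateway's global state and snapshot memo.
const state: { workspaceDir: string | undefined } = { workspaceDir: "/ws/main" };
const pinned = new AsyncLocalStorage<{ workspaceDir: string | undefined }>();
let cached: { workspaceDir: string | undefined } | undefined;
let scans = 0;

function getActiveWorkspaceDir(): string | undefined {
  const pin = pinned.getStore();
  return pin ? pin.workspaceDir : state.workspaceDir; // the pin wins over the live global
}

function withPinnedWorkspaceDir<T>(fn: () => T): T {
  if (pinned.getStore()) return fn(); // nested: reuse the outer pin
  return pinned.run({ workspaceDir: state.workspaceDir }, fn);
}

function resolveSnapshot(): void {
  const workspaceDir = getActiveWorkspaceDir();
  if (cached && cached.workspaceDir === workspaceDir) return; // memo hit
  scans += 1; // stands in for a full metadata scan
  cached = { workspaceDir };
}

const yieldToLoop = () => new Promise<void>((resolve) => setImmediate(resolve));

async function listRows(rows: number, pin: boolean): Promise<number> {
  cached = undefined;
  scans = 0;
  const loop = async (): Promise<number> => {
    for (let i = 0; i < rows; i++) {
      resolveSnapshot();
      if (i + 1 < rows) await yieldToLoop(); // the pin must survive these yields
    }
    return scans;
  };
  return pin ? withPinnedWorkspaceDir(loop) : loop();
}

async function compare(rows = 6): Promise<{ unpinned: number; pinnedScans: number }> {
  let done = false;
  // Concurrent actor that keeps switching the global workspace dir.
  const actor = (async () => {
    let turn = 0;
    while (!done) {
      state.workspaceDir = `/ws/cron-${turn++}`;
      await yieldToLoop();
    }
  })();
  const unpinned = await listRows(rows, false);
  const pinnedScans = await listRows(rows, true);
  done = true;
  await actor;
  return { unpinned, pinnedScans };
}
```

In this model the pinned run performs a single scan while the unpinned run scans roughly once per row, because the AsyncLocalStorage store follows the loop across its `await` points and the actor's writes only touch the live global.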

A concurrency regression test locks it in: enter the pinned scope, have a concurrent task mutate the active workspace via setActivePluginRegistry while the scope is suspended on a setImmediate yield, and assert that reads inside the scope stay stable while reads after the scope exits observe the live (mutated) global.
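The shape of such a test might look like the sketch below. It is not the test from the local branch: `globalState`, `readWorkspaceDir` and `setActiveWorkspaceDir` are hypothetical stand-ins (the last one for setActivePluginRegistry), and the assertions use Node's built-in `node:assert` rather than any particular test runner.

```typescript
import { strictEqual } from "node:assert";
import { AsyncLocalStorage } from "node:async_hooks";

// Hypothetical stand-ins for the global registry state and its accessors.
const globalState: { workspaceDir: string | undefined } = { workspaceDir: "/ws/a" };
const pinStorage = new AsyncLocalStorage<{ workspaceDir: string | undefined }>();

function readWorkspaceDir(): string | undefined {
  const pin = pinStorage.getStore();
  return pin ? pin.workspaceDir : globalState.workspaceDir;
}

// Stand-in for setActivePluginRegistry: only the workspace dir matters here.
function setActiveWorkspaceDir(dir: string): void {
  globalState.workspaceDir = dir;
}

async function pinSurvivesConcurrentMutation(): Promise<void> {
  globalState.workspaceDir = "/ws/a";
  // Queue the "concurrent actor" before entering the scope; it runs during the yield.
  setImmediate(() => setActiveWorkspaceDir("/ws/b"));

  await pinStorage.run({ workspaceDir: globalState.workspaceDir }, async () => {
    strictEqual(readWorkspaceDir(), "/ws/a");
    await new Promise<void>((resolve) => setImmediate(resolve)); // yield to the actor
    strictEqual(globalState.workspaceDir, "/ws/b"); // the global really changed
    strictEqual(readWorkspaceDir(), "/ws/a"); // the pinned read is unaffected
  });

  // Outside the scope, reads fall through to the live global again.
  strictEqual(readWorkspaceDir(), "/ws/b");
}
```

The mutation is queued before the scope is entered so that it runs in a separate async context, which is closer to a real agent turn than mutating from inside the pinned callback.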

I'm happy to open a PR with this change, and can share the raw diagnostics timeline JSONL if useful.

Metadata

    Labels

    P1: High-priority user-facing bug, regression, or broken workflow.
    bug: Something isn't working.
    clawsweeper:needs-live-repro: ClawSweeper needs live local, crabbox, or manual validation to confirm this issue.
    impact:crash-loop: Crash, hang, restart loop, or process-level availability failure.
    issue-rating: 🐚 platinum hermit: Good issue quality with a plausible reproduction path needing some confirmation.
    regression: Behavior that previously worked and now fails.
