Skip to content

fix(gateway): pin plugin workspace dir during sessions.list to stop O(rows) metadata scans under concurrency#90819

Open
k-l-lambda wants to merge 1 commit into
openclaw:mainfrom
k-l-lambda:fix/sessions-list-metadata-scan-concurrency
Open

fix(gateway): pin plugin workspace dir during sessions.list to stop O(rows) metadata scans under concurrency#90819
k-l-lambda wants to merge 1 commit into
openclaw:mainfrom
k-l-lambda:fix/sessions-list-metadata-scan-concurrency

Conversation

@k-l-lambda

@k-l-lambda k-l-lambda commented Jun 6, 2026

Copy link
Copy Markdown

Summary

Fixes the residual concurrency facet of #76562, tracked in #90814.

sessions.list becomes O(rows) slow — tens of seconds — only when other agents/crons run concurrently; on an idle gateway it is fast. The per-row plugin model-id-normalization lookup reads a process-global "active plugin-registry workspace dir". listSessionsFromStoreAsync yields to the event loop every SESSIONS_LIST_YIELD_BATCH_SIZE rows, and during those yields concurrent agent-turns/crons call setActivePluginRegistry(...), mutating that global. The plugin-metadata-snapshot memo key includes workspaceDir, so it changes on essentially every row, the memo never hits, and each row triggers a fresh full loadPluginMetadataSnapshot scan (~100 ms).

Evidence on a busy gateway (one call, OPENCLAW_DIAGNOSTICS=1):

gateway.sessions.list:            ~10100 ms
  └─ rows:                        ~10100 ms   <-- ~100%
plugins.metadata.scan: 112–319 spans, 0 cacheHit, avg ~100 ms each

Bumping the snapshot memo to a 512-slot LRU still produced 0 hits — the key never repeats, so this is not a cache-size/eviction problem; it is global-mutable-state-under-concurrency.

Fix

Pin the active plugin-registry workspace dir for the duration of the row-building batch via AsyncLocalStorage, so every row in one sessions.list reads a stable value, immune to concurrent global mutation, while other concurrent async contexts still observe the live global. This collapses the O(rows) scans to one, independent of concurrent agent/cron activity.

  • src/plugins/runtime-workspace-state.ts — add withPinnedActivePluginRegistryWorkspaceDir<T>(fn: () => T): T; getActivePluginRegistryWorkspaceDirFromState() returns the pinned value when a scope is active, else the live global. Nested calls reuse the outer pin.
  • src/gateway/session-utils.ts — wrap the listSessionsFromStoreAsync row loop (including its inter-batch setImmediate yields) in the pin.
  • src/plugins/runtime-workspace-state.test.ts — regression tests: pass-through when unpinned; stability inside a pinned scope across a setImmediate yield while setActivePluginRegistry mutates the global mid-scope; live again after exit; nested-scope reuse; and rejection propagation with no sticky context.

Notes

  • The pin only wraps a batch that does not spawn detached/fire-and-forget async work (the row loop awaits everything), so the pinned context cannot outlive the call.
  • Semantically this is more correct than before: the old per-row "live" read was a race where row N could normalize against workspace A and row N+1 against workspace B purely because an unrelated task mutated the global mid-yield. Snapshotting at batch start removes that nondeterminism.

Validation

  • node scripts/run-tsgo.mjs — clean
  • node scripts/run-oxlint.mjs — clean (changed files)
  • pin + manifest-model-id-normalization tests: pass
  • gateway session-utils tests (plugin-runtime + perf + single-row-cache): 24/24 pass

Real behavior proof

Behavior or issue addressed: sessions.list triggered O(rows) full loadPluginMetadataSnapshot scans whenever concurrent agents/crons flipped the process-global active workspace dir during the row loop's setImmediate yields (issue #90814). This pins the workspace for the batch so per-row plugin-metadata lookups hit the memo regardless of concurrent mutation.

Real environment tested: Linux x64, Node v22.22.0, this branch's real source modules (no mocks/stubs): the real normalizeProviderModelIdWithManifestresolvePluginMetadataSnapshotloadPluginMetadataSnapshot resolver path, a real on-disk installed-plugin index + manifest, and the real OPENCLAW_DIAGNOSTICS=1 timeline emitting plugins.metadata.scan spans. A concurrent task flips the global active workspace dir on every event-loop tick, exactly as parallel agent-turns/crons do, while the harness reproduces the listSessionsFromStoreAsync per-row loop (resolve per row, yield every 10 rows).

Exact steps or command run after this patch: drove 200 rows through the real resolver path with the concurrent workspace-flipping actor running, once with the pre-fix code path (no pin) and once through this PR's withPinnedActivePluginRegistryWorkspaceDir, counting plugins.metadata.scan spans with cacheHit !== true (i.e. real full scans) parsed from the diagnostics timeline JSONL:

$ node node_modules/.bin/tsx scripts/repro-90814.mts

Evidence after fix: copied live console output of that run:

# repro #90814 — real resolver path, real diagnostics timeline, 200 rows
pre-fix (no pin)         rows=200 workspaceFlips=21 scanSpans=400 fullScans(cacheMiss)=40 elapsedMs=157
this PR (withPinned)     rows=200 workspaceFlips=21 scanSpans=400 fullScans(cacheMiss)=2 elapsedMs=61

Observed result after fix: with the pin, real full plugin-metadata scans drop from 40 → 2 (just the cold-start resolves) and stay flat regardless of the 21 concurrent workspace flips; wall time falls 157 ms → 61 ms for the batch. Pre-fix, the full-scan count tracks the concurrent flip count (each flip across a yield invalidates the per-row memo key); post-fix it is constant and independent of concurrency, confirming the O(rows-under-concurrency) → O(1) collapse. (The total scanSpans=400 is unchanged because cache hits still emit a span — a separate cosmetic concern, #86790; the meaningful metric is the cache-miss full scans.)

What was not tested: not exercised against a live multi-tenant production gateway over WebSocket in this run (the harness drives the same real resolver + diagnostics modules in-process); the upstream end-to-end symptom under real concurrent load is documented in #90814 and the now-closed #76562.

…(rows) metadata scans

sessions.list resolves plugin model-id normalization per row, keyed by the
active plugin-registry workspaceDir read from a process-global. The row loop
yields to the event loop between batches; concurrent agent-turns/crons mutate
that global via setActivePluginRegistry during the yields, so the
plugin-metadata-snapshot memo key changes per row and never hits — turning one
list into O(rows) full ~100ms plugin scans (~10s lists under load).

Pin the workspaceDir for the whole row-building batch via AsyncLocalStorage so
every row reads a stable value, immune to concurrent mutation, while other
async contexts still observe the live global. Adds regression tests for the
pin's stability, pass-through, and nested-scope reuse.
@openclaw-barnacle openclaw-barnacle Bot added gateway Gateway runtime size: M triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels Jun 6, 2026
@clawsweeper

clawsweeper Bot commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs maintainer review before merge. Reviewed June 5, 2026, 10:59 PM ET / 02:59 UTC.

Summary
The PR adds an AsyncLocalStorage-scoped active plugin workspace-dir pin around the awaited sessions.list row batch and adds regression tests for pinned workspace behavior.

PR surface: Source +39, Tests +78. Total +117 across 3 files.

Reproducibility: yes. current main shows the row-yield window, workspace-keyed plugin metadata lookup, and global workspace mutation path; I did not rerun the harness in this read-only review.

Review metrics: 2 noteworthy metrics.

  • Proofed scan reduction: 40 -> 2 full scans under 21 workspace flips. This directly measures the hot-path behavior the PR claims to fix before merge.
  • Changed surface: 2 runtime files changed, 1 test file added. The patch is confined to the gateway row loop and plugin workspace-state helper plus focused regression coverage.

Merge readiness
Overall: 🐚 platinum hermit
Proof: 🐚 platinum hermit
Patch quality: 🐚 platinum hermit
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • none.

Next step before merge

  • No automated repair job is needed; the PR appears ready for ordinary maintainer and CI review.

Security
Cleared: The diff uses Node core AsyncLocalStorage and does not alter workflows, dependencies, secrets handling, packaging, or network/security boundaries.

Review details

Best possible solution:

Land the narrow batch-scoped workspace pin after ordinary maintainer and CI review; no ClawSweeper repair is indicated.

Do we have a high-confidence way to reproduce the issue?

Yes: current main shows the row-yield window, workspace-keyed plugin metadata lookup, and global workspace mutation path; I did not rerun the harness in this read-only review.

Is this the best way to solve the issue?

Yes: pinning the active workspace for the awaited row batch fixes the race at the narrow read boundary; increasing cache size would not help changing keys, and removing yields would regress gateway responsiveness.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 9cbf18293bb4.

Label changes

Label changes:

  • add proof: sufficient: Contributor real behavior proof is sufficient. The PR body now includes after-fix terminal output from a real-module harness showing the cache-miss scan count drops from 40 to 2 under concurrent workspace flips.
  • add rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🐚 platinum hermit and patch quality is 🐚 platinum hermit.
  • add status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR body now includes after-fix terminal output from a real-module harness showing the cache-miss scan count drops from 40 to 2 under concurrent workspace flips.
  • remove rating: 🧂 unranked krab: Current PR rating is rating: 🐚 platinum hermit, so this older rating label is no longer current.
  • remove status: 📣 needs proof: Current PR status label is status: 👀 ready for maintainer look.

Label justifications:

  • P1: The PR addresses a gateway control-plane regression where sessions.list can take seconds under concurrent agents/crons and degrade active workflows.
  • rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🐚 platinum hermit and patch quality is 🐚 platinum hermit.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR body now includes after-fix terminal output from a real-module harness showing the cache-miss scan count drops from 40 to 2 under concurrent workspace flips.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR body now includes after-fix terminal output from a real-module harness showing the cache-miss scan count drops from 40 to 2 under concurrent workspace flips.
Evidence reviewed

PR surface:

Source +39, Tests +78. Total +117 across 3 files.

View PR surface stats
Area Files Added Removed Net
Source 2 93 54 +39
Tests 1 78 0 +78
Docs 0 0 0 0
Config 0 0 0 0
Generated 0 0 0 0
Other 0 0 0 0
Total 3 171 54 +117

Acceptance criteria:

  • [P1] Contributor reported: node scripts/run-tsgo.mjs.
  • [P1] Contributor reported: node scripts/run-oxlint.mjs.
  • [P1] Contributor reported focused pin and gateway session-utils tests.
  • [P1] Suggested maintainer gate: run the focused plugin workspace-state and gateway session-utils tests plus normal PR CI.

What I checked:

  • Checkout cleanliness: The target checkout was clean before review.
  • Root policy read: Root AGENTS.md was read fully and applied as OpenClaw review policy, including whole-surface review, scoped AGENTS, dependency inspection, and hot-path guidance. (AGENTS.md:1, 9cbf18293bb4)
  • Scoped gateway policy: Gateway scoped guidance says gateway startup/server paths should avoid broad bundled plugin materialization and keep hot paths narrow. (src/gateway/AGENTS.md:1, 9cbf18293bb4)
  • Scoped plugin policy: Plugin scoped guidance says gateway plugin metadata is process-stable and mutable global runtime registry state is compatibility scaffolding; request-scoped handles are preferred. (src/plugins/AGENTS.md:20, 9cbf18293bb4)
  • Gateway entry point: Current main sessions.list calls listSessionsFromStoreAsync inside the measured gateway.sessions.list.rows span. (src/gateway/server-methods/sessions.ts:1056, 9cbf18293bb4)
  • Current interleaving window: Current main builds rows in a loop and yields with setImmediate every batch, allowing concurrent agent/cron work to run between row groups. (src/gateway/session-utils.ts:2777, 9cbf18293bb4)

Likely related people:

  • steipete: Recent GitHub path history shows repeated work across src/gateway/session-utils.ts, plugin metadata snapshots, and manifest normalization, including gateway session perf and plugin metadata reuse work. (role: recent area contributor; confidence: high; commits: fba99cddc1ac, 2682c027746b, 18dc6e5cd4f8; files: src/gateway/session-utils.ts, src/plugins/plugin-metadata-snapshot.ts, src/plugins/manifest-model-id-normalization.ts)
  • shakkernerd: Path history ties this contributor to workspace metadata reuse and runtime plugin metadata loading changes near the affected cache/workspace behavior. (role: adjacent plugin metadata owner; confidence: medium; commits: 193bfd3a4d6f, c90e42aaa766, c795a1a8ef19; files: src/plugins/manifest-model-id-normalization.ts, src/plugins/plugin-metadata-snapshot.ts)
  • vincentkoc: Recent path history includes gateway/session CI repair and gateway plugin-registry command work adjacent to the runtime state surface. (role: recent gateway/plugin runtime contributor; confidence: medium; commits: 207359a056b9, 6df3fd5730dd; files: src/gateway/session-utils.ts, src/plugins/runtime.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P1 High-priority user-facing bug, regression, or broken workflow. labels Jun 6, 2026
@openclaw-barnacle openclaw-barnacle Bot added proof: supplied External PR includes structured after-fix real behavior proof. and removed triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels Jun 6, 2026
@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. and removed rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels Jun 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

gateway Gateway runtime P1 High-priority user-facing bug, regression, or broken workflow. proof: sufficient ClawSweeper judged the real behavior proof convincing. proof: supplied External PR includes structured after-fix real behavior proof. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. size: M status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

High CPU, extreme control-plane RPC latency, and unstable polling after upgrade from 2026.4.24 to 2026.4.29/2026.5.2

1 participant