fix(gateway): pin plugin workspace dir during sessions.list to stop O(rows) metadata scans under concurrency #90819
…(rows) metadata scans
sessions.list resolves plugin model-id normalization per row, keyed by the active plugin-registry workspaceDir read from a process-global. The row loop yields to the event loop between batches; concurrent agent-turns/crons mutate that global via setActivePluginRegistry during the yields, so the plugin-metadata-snapshot memo key changes per row and never hits, turning one list into O(rows) full ~100ms plugin scans (~10s lists under load).
Pin the workspaceDir for the whole row-building batch via AsyncLocalStorage so every row reads a stable value, immune to concurrent mutation, while other async contexts still observe the live global. Adds regression tests for the pin's stability, pass-through, and nested-scope reuse.
Codex review: needs maintainer review before merge.
Reviewed June 5, 2026, 10:59 PM ET / 02:59 UTC.
Summary
PR surface: Source +39, Tests +78. Total +117 across 3 files.
Reproducibility: yes. Current main shows the row-yield window, the workspace-keyed plugin metadata lookup, and the global workspace mutation path; I did not rerun the harness in this read-only review.
Review metrics: 2 noteworthy metrics.
Review details
Best possible solution: Land the narrow batch-scoped workspace pin after ordinary maintainer and CI review; no ClawSweeper repair is indicated.
Do we have a high-confidence way to reproduce the issue? Yes: current main shows the row-yield window, the workspace-keyed plugin metadata lookup, and the global workspace mutation path; I did not rerun the harness in this read-only review.
Is this the best way to solve the issue? Yes: pinning the active workspace for the awaited row batch fixes the race at the narrow read boundary; increasing cache size would not help with changing keys, and removing yields would regress gateway responsiveness.
AGENTS.md: found and applied where relevant.
Codex review notes: model gpt-5.5, reasoning high; reviewed against 9cbf18293bb4.
Summary
Fixes the residual concurrency facet of #76562, tracked in #90814.
sessions.list becomes O(rows) slow (tens of seconds) only when other agents/crons run concurrently; on an idle gateway it is fast. The per-row plugin model-id-normalization lookup reads a process-global "active plugin-registry workspace dir". listSessionsFromStoreAsync yields to the event loop every SESSIONS_LIST_YIELD_BATCH_SIZE rows, and during those yields concurrent agent-turns/crons call setActivePluginRegistry(...), mutating that global. The plugin-metadata-snapshot memo key includes workspaceDir, so it changes on essentially every row, the memo never hits, and each row triggers a fresh full loadPluginMetadataSnapshot scan (~100 ms).
Evidence on a busy gateway (one call, OPENCLAW_DIAGNOSTICS=1):
Bumping the snapshot memo to a 512-slot LRU still produced 0 hits: the key never repeats, so this is not a cache-size/eviction problem; it is global-mutable-state-under-concurrency.
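The failure mode described above can be reproduced in miniature. The following is a minimal sketch, not the gateway's code: every name in it (activeWorkspaceDir, resolveSnapshot, countScansForList, flipWorkspaceUntil, compareIdleAndBusy) is a hypothetical stand-in, and the memo is simplified to a single slot. It only illustrates how a row loop that re-reads a mutable global after each yield loses its memo when a concurrent task keeps changing that global.

```typescript
// Illustrative only: every name here is a hypothetical stand-in, not the gateway's code.
let activeWorkspaceDir = "/ws/0"; // plays the process-global workspace dir
let fullScans = 0;
let memo: { key: string; snapshot: string } | undefined;

// Memoized lookup keyed by workspaceDir; a miss stands in for a ~100 ms full scan.
function resolveSnapshot(workspaceDir: string): string {
  if (memo && memo.key === workspaceDir) return memo.snapshot;
  fullScans += 1;
  memo = { key: workspaceDir, snapshot: `snapshot for ${workspaceDir}` };
  return memo.snapshot;
}

const yieldToEventLoop = (): Promise<void> =>
  new Promise((resolve) => setImmediate(resolve));

// Row loop that re-reads the global for every row and yields between batches.
async function countScansForList(rows: number, batchSize: number): Promise<number> {
  fullScans = 0;
  memo = undefined;
  for (let row = 0; row < rows; row++) {
    resolveSnapshot(activeWorkspaceDir);
    if ((row + 1) % batchSize === 0) await yieldToEventLoop();
  }
  return fullScans;
}

// Concurrent actor: sets the global to a never-repeating value on each tick.
async function flipWorkspaceUntil(shouldStop: () => boolean): Promise<void> {
  for (let n = 1; !shouldStop(); n++) {
    activeWorkspaceDir = `/ws/${n}`;
    await yieldToEventLoop();
  }
}

// Runs the same list once on an idle process and once next to the flipping actor.
async function compareIdleAndBusy(): Promise<{ idle: number; busy: number }> {
  const idle = await countScansForList(50, 10);
  let done = false;
  const flipper = flipWorkspaceUntil(() => done);
  const busy = await countScansForList(50, 10);
  done = true;
  await flipper;
  return { idle, busy };
}
```

Calling compareIdleAndBusy() should report a single scan for the idle run and more than one for the busy run, since each yield that straddles a flip presents the memo with a key it has not seen. Because the keys never repeat, a larger cache would not change the busy count in this sketch either.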
Fix
Pin the active plugin-registry workspace dir for the duration of the row-building batch via AsyncLocalStorage, so every row in one sessions.list reads a stable value, immune to concurrent global mutation, while other concurrent async contexts still observe the live global. This collapses the O(rows) scans to one, independent of concurrent agent/cron activity.
- src/plugins/runtime-workspace-state.ts: add withPinnedActivePluginRegistryWorkspaceDir<T>(fn: () => T): T; getActivePluginRegistryWorkspaceDirFromState() returns the pinned value when a scope is active, else the live global. Nested calls reuse the outer pin.
- src/gateway/session-utils.ts: wrap the listSessionsFromStoreAsync row loop (including its inter-batch setImmediate yields) in the pin.
- src/plugins/runtime-workspace-state.test.ts: regression tests covering pass-through when unpinned; stability inside a pinned scope across a setImmediate yield while setActivePluginRegistry mutates the global mid-scope; live again after exit; nested-scope reuse; and rejection propagation with no sticky context.
Notes
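For readers unfamiliar with the technique, the pin can be sketched roughly as follows. This is an assumption-laden illustration rather than the contents of src/plugins/runtime-workspace-state.ts: liveWorkspaceDir, setLiveWorkspaceDir, getWorkspaceDir, and withPinnedWorkspaceDir are hypothetical names that only mirror the shape of the functions listed above. The only real API used is Node's AsyncLocalStorage (getStore and run).

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Hypothetical stand-in for the process-global active workspace dir.
let liveWorkspaceDir: string | undefined;

// The store wraps the value in an object so that a pinned `undefined`
// is still distinguishable from "no pin active".
const pinnedScope = new AsyncLocalStorage<{ workspaceDir: string | undefined }>();

function setLiveWorkspaceDir(dir: string | undefined): void {
  liveWorkspaceDir = dir;
}

// Readers see the pinned value inside a scope, else the live global.
function getWorkspaceDir(): string | undefined {
  const pinned = pinnedScope.getStore();
  return pinned ? pinned.workspaceDir : liveWorkspaceDir;
}

// Captures the global once and runs fn (sync or async) with that value pinned.
// A nested call reuses the outer pin instead of re-reading the global.
function withPinnedWorkspaceDir<T>(fn: () => T): T {
  if (pinnedScope.getStore() !== undefined) return fn();
  return pinnedScope.run({ workspaceDir: liveWorkspaceDir }, fn);
}
```

Because AsyncLocalStorage context follows the async call chain, code awaited inside the scope (including after a setImmediate yield) keeps reading the pinned value, while unrelated async contexts calling getWorkspaceDir() still see whatever the global currently holds.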
Validation
- node scripts/run-tsgo.mjs: clean
- node scripts/run-oxlint.mjs: clean (changed files)
Real behavior proof
Behavior or issue addressed: sessions.list triggered O(rows) full loadPluginMetadataSnapshot scans whenever concurrent agents/crons flipped the process-global active workspace dir during the row loop's setImmediate yields (issue #90814). This pins the workspace for the batch so per-row plugin-metadata lookups hit the memo regardless of concurrent mutation.
Real environment tested: Linux x64, Node v22.22.0, this branch's real source modules (no mocks/stubs): the real normalizeProviderModelIdWithManifest → resolvePluginMetadataSnapshot → loadPluginMetadataSnapshot resolver path, a real on-disk installed-plugin index + manifest, and the real OPENCLAW_DIAGNOSTICS=1 timeline emitting plugins.metadata.scan spans. A concurrent task flips the global active workspace dir on every event-loop tick, exactly as parallel agent-turns/crons do, while the harness reproduces the listSessionsFromStoreAsync per-row loop (resolve per row, yield every 10 rows).
Exact steps or command run after this patch: drove 200 rows through the real resolver path with the concurrent workspace-flipping actor running, once with the pre-fix code path (no pin) and once through this PR's withPinnedActivePluginRegistryWorkspaceDir, counting plugins.metadata.scan spans with cacheHit !== true (i.e. real full scans) parsed from the diagnostics timeline JSONL:
Evidence after fix: copied live console output of that run:
Observed result after fix: with the pin, real full plugin-metadata scans drop from 40 → 2 (just the cold-start resolves) and stay flat regardless of the 21 concurrent workspace flips; wall time falls 157 ms → 61 ms for the batch. Pre-fix, the full-scan count tracks the concurrent flip count (each flip across a yield invalidates the per-row memo key); post-fix it is constant and independent of concurrency, confirming the O(rows-under-concurrency) → O(1) collapse. (The total scanSpans=400 is unchanged because cache hits still emit a span, which is a separate cosmetic concern, #86790; the meaningful metric is the cache-miss full scans.)
What was not tested: not exercised against a live multi-tenant production gateway over WebSocket in this run (the harness drives the same real resolver + diagnostics modules in-process); the upstream end-to-end symptom under real concurrent load is documented in #90814 and the now-closed #76562.
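The counting step described above (full scans are plugins.metadata.scan spans with cacheHit !== true) could be done with a small JSONL filter along these lines. This is a sketch under stated assumptions: the real timeline schema is not shown in this PR, so the name and cacheHit field names, the TimelineSpan type, and the countScanSpans helper are all hypothetical.

```typescript
// Assumed record shape; the actual diagnostics timeline schema may differ.
interface TimelineSpan {
  name?: string;
  cacheHit?: boolean;
}

// Counts spans with the given name, and how many of those were cache misses.
function countScanSpans(
  jsonl: string,
  spanName = "plugins.metadata.scan",
): { scanSpans: number; fullScans: number } {
  let scanSpans = 0;
  let fullScans = 0;
  for (const line of jsonl.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    let span: TimelineSpan | null;
    try {
      span = JSON.parse(trimmed) as TimelineSpan | null;
    } catch {
      continue; // skip lines that are not valid JSON
    }
    if (!span || span.name !== spanName) continue;
    scanSpans += 1;
    if (span.cacheHit !== true) fullScans += 1; // anything but an explicit hit counts as a full scan
  }
  return { scanSpans, fullScans };
}
```

Treating a missing cacheHit field as a miss matches the cacheHit !== true wording, which is the conservative choice: a span that does not report a hit is counted as a real scan. It also keeps the total span count separate from the full-scan count, which is the distinction drawn above between scanSpans and cache-miss scans.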