Fix/sessions list resolver cache by rolandrscheel · Pull Request #77187 · openclaw/openclaw

rolandrscheel · 2026-05-04T08:04:14Z

Summary

Problem: sessions.list (Control UI polling) burned huge amounts of CPU on stores with many sessions. Production CPU profile on a 1217-session store showed listSessionsFromStoreAsync → buildGatewaySessionRow at 88.9 s total, dominated by deterministic-but-uncached resolvers run once per row: resolveSessionDisplayModelIdentityRef/isCliProvider (44.9 s), listThinkingLevelOptions/resolveThinkingProfile (24.2 s), resolveModelCostConfig via resolveEstimatedSessionCostUsd (12.6 s), plus 5.8 s GC pressure from churned Maps.
Why it matters: Control UI polls sessions.list continuously. With sessions sharing only a handful of (provider, model) tuples, each poll wasted O(rows) work on results that depend on O(unique tuples). On real stores this manifested as sustained high CPU, GC stalls, and degraded responsiveness across the gateway (heartbeats, RPCs, channel I/O all share the same loop).
What changed:
- Per-call SessionListRowResolverCache (4 keyed Maps) is built once per listSessionsFromStore[Async] and threaded into buildGatewaySessionRow via the existing params bag.
- Cached wrappers for listThinkingLevelOptions, resolveGatewaySessionThinkingDefault, resolveSessionDisplayModelIdentityRef, and resolveModelCostConfig. Each falls back to the uncached direct call when no cache is provided, so external callers of buildGatewaySessionRow are unchanged.
- Skip the resolveEstimatedSessionCostUsd call that feeds needsTranscriptEstimatedCostUsd when skipTranscriptUsageFallback === true. The very next guard discarded that result unconditionally, so it was pure dead CPU on the lightweight async polling path.
- New session-utils.perf.test.ts exercises 1000 synthetic sessions across 5 model tuples as a coarse regression smoke.
What did NOT change (scope boundary): Public APIs, row shape (GatewaySessionRow), filter/sort behavior, transcript I/O, plugin host hooks, and resolver semantics are all unchanged. Caches are scoped to a single list call (no cross-call state, no invalidation surface). The non-lightweight path keeps doing exactly the same work, just memoized within one call.
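A minimal sketch of the per-call cache pattern described above, under stated assumptions: every name here (resolveCostUncached, RowResolverCache, createRowResolverCache, resolveCostCached, listRows) and the CostConfig shape are hypothetical stand-ins, since the real resolver signatures are not shown in this description. It illustrates only the two properties the PR claims: the cache lives for one list call, and the wrapper falls back to the direct call when no cache is passed.

```typescript
// Hypothetical names throughout; not the actual openclaw implementation.
type ModelRef = { provider: string; model: string };
type CostConfig = { inputPerMTok: number; outputPerMTok: number };

// Stand-in for an expensive, deterministic resolver. The counter exists
// only so the effect of memoization can be observed.
let costResolverCalls = 0;
function resolveCostUncached(ref: ModelRef): CostConfig | undefined {
  costResolverCalls += 1;
  return ref.provider === "openai" ? { inputPerMTok: 1, outputPerMTok: 2 } : undefined;
}

// One cache object per list call; nothing is shared across calls.
type RowResolverCache = { costByModelRef: Map<string, CostConfig | undefined> };
function createRowResolverCache(): RowResolverCache {
  return { costByModelRef: new Map() };
}

// Cached wrapper: behaves like the direct call when no cache is provided.
function resolveCostCached(ref: ModelRef, cache?: RowResolverCache): CostConfig | undefined {
  if (!cache) return resolveCostUncached(ref);
  const key = JSON.stringify([ref.provider, ref.model]);
  // has() is checked explicitly so that a cached `undefined` result
  // is also served from the cache instead of being recomputed.
  if (cache.costByModelRef.has(key)) return cache.costByModelRef.get(key);
  const value = resolveCostUncached(ref);
  cache.costByModelRef.set(key, value);
  return value;
}

function listRows(refs: ModelRef[]): Array<CostConfig | undefined> {
  const cache = createRowResolverCache(); // built once per list call
  return refs.map((ref) => resolveCostCached(ref, cache));
}
```

With 1000 rows spread over 5 tuples, the stand-in resolver in this sketch runs 5 times per listRows call rather than 1000.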

Linked Issue/PR

Closes #
Related #7fe4ba013f (fix(gateway): add lightweight row path for sessions.list to reduce event-loop blocking — this PR completes the same hot-path effort by collapsing the remaining per-row resolvers)
Related [Bug]: sessions.list is extremely slow (4+ seconds) causing event loop saturation #77373
Related Control UI sessions.list refresh can stall Gateway with large session stores #77056
Related Regression since 2026.4.29: Slack-heavy session lists can saturate gateway via sessions.list #77062
Related sessions.list latency around 10s and fixed 10s pi-trajectory-flush timeout under moderate session load #75839
Related High CPU, extreme control-plane RPC latency, and unstable polling after upgrade from 2026.4.24 to 2026.4.29/2026.5.2 #76562
This PR fixes a bug or regression

Root Cause (if applicable)

Root cause: buildGatewaySessionRow calls four pure resolvers whose results depend only on (provider, model[, agentId, cfg]). Sessions in a single list typically share a small set of those tuples, but every row recomputed them from scratch — including provider plugin lookups (resolveRuntimeCliBackends, resolvePluginSetupCliBackendRuntime, thinking policy hooks) and the configured/JSON model-cost index Map build. Additionally needsTranscriptEstimatedCostUsd was computed even when skipTranscriptUsageFallback === true forced the downstream transcriptUsage to null, so its result was always discarded on the polling path.
Missing detection / guardrail: No micro-bench or perf regression test for listSessionsFromStore[Async] at scale. Unit tests only validated correctness on small fixtures, where the per-row cost is invisible.
Contributing context: A previous fix introduced the lightweightListRow path to skip the heaviest resolver (resolveSessionDisplayModelIdentityRef) for polling. That cut roughly half the cost but left the other three resolvers running per row; this PR closes that gap and additionally removes the dead needsTranscriptEstimatedCostUsd work on the lightweight path.
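The dead-work removal can be sketched as follows. Only the flag name skipTranscriptUsageFallback comes from the PR text; the Row and Options shapes, estimateCostUsd, and needsTranscriptCostFallback are hypothetical, and the real condition in buildGatewaySessionRow may combine more inputs than shown here.

```typescript
// Hypothetical shapes; not the actual openclaw types.
type Row = { inputTokens?: number; outputTokens?: number; costUsd?: number };
type Options = { skipTranscriptUsageFallback?: boolean };

// Stand-in for the expensive cost estimate. The counter exists only so
// the skipped work can be observed.
let estimateCalls = 0;
function estimateCostUsd(row: Row): number | undefined {
  estimateCalls += 1;
  if (row.inputTokens === undefined && row.outputTokens === undefined) return undefined;
  return ((row.inputTokens ?? 0) + (row.outputTokens ?? 0)) * 1e-6;
}

function needsTranscriptCostFallback(row: Row, opts: Options): boolean {
  // Guard first: when the fallback is skipped, the answer is never used
  // downstream, so the estimate is not computed at all.
  if (opts.skipTranscriptUsageFallback === true) return false;
  return row.costUsd === undefined && estimateCostUsd(row) === undefined;
}
```

The design point is ordering: checking the cheap flag before the expensive call turns a discarded computation into no computation, without changing any value that a caller can observe.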

Regression Test Plan (if applicable)

Coverage level that should have caught this:
- Unit test
- Seam / integration test
- End-to-end test
- Existing coverage already sufficient
Target test or file: src/gateway/session-utils.perf.test.ts (added).
Scenario the test should lock in: listSessionsFromStoreAsync over 1000 sessions spread across 5 distinct (provider, model) tuples completes well under 2.5 s wall time. This catches O(rows)-scaled regressions in the resolver hot path without depending on plugin-host fixtures.
Why this is the smallest reliable guardrail: The bug is purely in per-row scaling of pure resolvers. A unit-level wall-time bound on a synthetic store is the smallest harness that meaningfully exercises the cache without spinning up channels, plugins, or transcripts. Existing 231 session-utils.test.ts cases keep covering correctness.
Existing test that already covers this (if any): caps transcript title and last-message hydration for bulk list responses validates correctness at higher row counts but does not assert wall-time scaling, so it would not have caught this regression.
If no new test is added, why not: N/A — a new test is added.
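For illustration, a synthetic fixture of the kind this scenario describes could be generated as below. This is a sketch under assumptions: SyntheticSession, makeSyntheticSessions, and measureWallMs are hypothetical helpers, and the real session store entry shape and the contents of session-utils.perf.test.ts are not shown in this description.

```typescript
// Hypothetical fixture types; not the actual openclaw store entry shape.
type ModelTuple = { provider: string; model: string };
type SyntheticSession = { key: string; provider: string; model: string; updatedAt: number };

// Spread `count` sessions round-robin across the given tuples, so each
// tuple is shared by roughly count / tuples.length sessions.
function makeSyntheticSessions(count: number, tuples: ModelTuple[]): SyntheticSession[] {
  return Array.from({ length: count }, (_, i) => {
    const tuple = tuples[i % tuples.length];
    return { key: `session-${i}`, provider: tuple.provider, model: tuple.model, updatedAt: i };
  });
}

// Run a function once and report its result together with wall time.
function measureWallMs<T>(fn: () => T): { result: T; elapsedMs: number } {
  const start = Date.now();
  const result = fn();
  return { result, elapsedMs: Date.now() - start };
}
```

A test built on these helpers would pass the 1000 generated sessions to the list function inside measureWallMs and compare elapsedMs against the chosen bound. Wall-time bounds are sensitive to CI load, so a resolver call-count assertion is a more stable complement where the resolvers can be instrumented.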

User-visible / Behavior Changes

None. GatewaySessionRow shape, defaults, and ordering are unchanged. Only CPU/wall-time of sessions.list improves.

Diagram (if applicable)

Before (per sessions.list call, N=1217 sessions, ~5 unique provider/model tuples):
[poll] -> for each row:
            isCliProvider(...)               (plugin scan)
            resolveSessionDisplayModelIdentityRef(...)
            listThinkingLevelOptions(...)    (plugin policy hooks)
            resolveModelCostConfig(...)      (Map build)
        -> O(N) heavy resolver calls -> 88s CPU + GC churn

After:
[poll] -> build SessionListRowResolverCache once
       -> for each row:
            cache.get((provider, model[, agentId])) ?? compute & set
       -> O(unique tuples) heavy resolver calls -> small constant cost
       -> needsTranscriptEstimatedCostUsd skipped entirely when
          skipTranscriptUsageFallback (Control UI path)

Security Impact (required)

New permissions/capabilities? No
Secrets/tokens handling changed? No
New/changed network calls? No
Command/tool execution surface changed? No
Data access scope changed? No
If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

OS: Linux 6.14.0-33-generic (x64), Debian-based container
Runtime/container: Node v22.22.2, OpenClaw 2026.5.2 (8b2a6e5), branch built from release/2026.5.3
Model/provider: mixed (google-vertex/gemini-3-flash-preview, openai/gpt-5, anthropic/claude-opus-4-7, openrouter/z-ai/glm-5, google/gemini-2.5-pro)
Integration/channel (if any): Control UI WebSocket polling sessions.list
Relevant config (redacted): default agent config; 1217 real sessions on disk under /data/.openclaw/agents/*/sessions/

Steps

Capture a --cpu-prof of the running gateway during sustained Control UI polling against a session store with many entries (≥1000).
Render call tree with cpuprofile-tree.js at depth=4 minSelfMs=5000.
Inspect listSessionsFromStoreAsync → buildGatewaySessionRow total time and the resolvers underneath.
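The inspection step can be approximated with a short script. The sketch below is not the cpuprofile-tree.js tool mentioned above, whose source is not shown here; summarizeProfile and its types are hypothetical. It assumes the .cpuprofile JSON layout written by node --cpu-prof (nodes, samples, and timeDeltas in microseconds) and attributes each delta to the sample at the same index, which is an approximation.

```typescript
// Subset of the .cpuprofile JSON shape; fields not used here are omitted.
type ProfileNode = {
  id: number;
  callFrame: { functionName: string; url: string; lineNumber: number };
  children?: number[];
};
type CpuProfile = { nodes: ProfileNode[]; samples: number[]; timeDeltas: number[] };
type NodeTiming = { functionName: string; selfMs: number; totalMs: number };

// Return one entry per call-tree node whose total time (self plus
// descendants) is at least minTotalMs, sorted by total time descending.
function summarizeProfile(profile: CpuProfile, minTotalMs: number): NodeTiming[] {
  // Self time per node id, accumulated from the sample stream.
  const selfUs = new Map<number, number>();
  profile.samples.forEach((nodeId, i) => {
    selfUs.set(nodeId, (selfUs.get(nodeId) ?? 0) + (profile.timeDeltas[i] ?? 0));
  });

  const byId = new Map(profile.nodes.map((n): [number, ProfileNode] => [n.id, n]));
  const totalUs = new Map<number, number>();
  // Total time is computed recursively over children and memoized per node.
  const total = (id: number): number => {
    const cached = totalUs.get(id);
    if (cached !== undefined) return cached;
    let sum = selfUs.get(id) ?? 0;
    for (const child of byId.get(id)?.children ?? []) sum += total(child);
    totalUs.set(id, sum);
    return sum;
  };

  return profile.nodes
    .map((n) => ({
      functionName: n.callFrame.functionName || "(anonymous)",
      selfMs: (selfUs.get(n.id) ?? 0) / 1000,
      totalMs: total(n.id) / 1000,
    }))
    .filter((t) => t.totalMs >= minTotalMs)
    .sort((a, b) => b.totalMs - a.totalMs);
}
```

The input would typically be obtained by parsing the profile file as JSON. Because entries are per call-tree node, a function reached through several call paths appears several times, which matches the repeated per-function lines in profile listings of this kind.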

Expected

buildGatewaySessionRow self+children cost dominated by O(unique (provider, model)) work, not O(rows).
resolveEstimatedSessionCostUsd no longer appears under buildGatewaySessionRow on the lightweight Control UI path.

Actual (before fix)

buildGatewaySessionRow 88.9 s total / 63 ms self
- resolveSessionDisplayModelIdentityRef 44.9 s (gone after the prior lightweightListRow fix on release/2026.5.3, still memoized here for non-lightweight callers)
- listThinkingLevelOptions 24.2 s
- resolveEstimatedSessionCostUsd 12.6 s
- resolveSessionModelRef 6.3 s
(garbage collector) 5.8 s self

Evidence

Failing test/log before + passing after
Trace/log snippets
Screenshot/recording
Perf numbers (if relevant)

Before (production CPU profile, 1217 sessions, depth=4 minSelfMs=5000):
 88.96s total | 34.2ms self | listSessionsFromStoreAsync
   88.87s total | 63.3ms self | buildGatewaySessionRow
     44.87s total | resolveSessionDisplayModelIdentityRef -> isCliProvider
     24.22s total | listThinkingLevelOptions -> resolveThinkingProfile
     12.56s total | resolveEstimatedSessionCostUsd -> resolveModelCostConfig
      6.27s total | resolveSessionModelRef
   5.75s total |  (garbage collector)

Local verification:
- npx tsc -p tsconfig.json --noEmit  -> clean
- npx vitest run src/gateway/session-utils.test.ts (3 project profiles: gateway-core,
  gateway-server, gateway-client) -> 231 passed / 231
- New perf smoke (src/gateway/session-utils.perf.test.ts) lists 1000 synthetic
  sessions across 5 (provider, model) tuples and asserts wall time < 2500 ms.

Human Verification (required)

Verified scenarios:
- Branched fix/sessions-list-resolver-cache from origin/release/2026.5.3 and ran npx tsc -p tsconfig.json --noEmit to confirm types are clean across the patched module and its imports.
- Ran npx vitest run src/gateway/session-utils.test.ts locally — all 231 pre-existing session-utils tests pass under gateway-core, gateway-server, and gateway-client projects.
- Re-read the patched code paths to confirm: (1) no cache state escapes a single list call, (2) cache keys cover every input that affects the wrapped resolver, (3) when resolverCache is omitted, behavior is byte-identical to the uncached call, (4) needsTranscriptEstimatedCostUsd is only short-circuited when its output is provably unused.
Edge cases checked: Lightweight vs non-lightweight row paths; non-CLI vs CLI provider models; agentId absent (uses "" in the cache key); rows with no token usage at all (cost branch returns undefined early, before any cached lookup); resolverCache undefined (passes through to existing direct calls).
What you did not verify: End-to-end CPU profile on a real running gateway after deploying the patched build (the local 1217-session repro source is ready, but rebuilding/installing the patched package on the live host is not part of this PR). The repository's full CI matrix (full vitest projects + boundary/integration suites) was not exercised locally because the build server pipeline covers that.

Review Conversations

I replied to or resolved every bot review conversation I addressed in this PR.
I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

Backward compatible? Yes
Config/env changes? No
Migration needed? No
If yes, exact upgrade steps: N/A. The new resolverCache parameter on buildGatewaySessionRow and resolveTranscriptUsageFallback is optional; all existing callers continue to work without it. SessionListRowResolverCache and createSessionListRowResolverCache are exported as forward-compatible additions for external embedders that want the same memoization.

Risks and Mitigations

Risk: A cache key omits an input that secretly affects the resolver result, returning a stale value for some session.
- Mitigation: Each cached wrapper keys on the full input set the underlying pure resolver actually consumes (provider, model, plus agentId for the identity-ref path). cfg and modelCatalog are constant for the duration of one listSessionsFromStore[Async] call, so caches scoped to that call cannot mix configs. Existing 231 unit tests covering the resolver outputs continue to pass.
Risk: Memory growth from cache entries on a single very diverse list.
- Mitigation: Cache is created per call and dropped at function exit. Bounded by O(unique (provider, model[, agentId]) tuples) which is at most a small constant in practice.
Risk: The skipped needsTranscriptEstimatedCostUsd change subtly diverges from prior behavior.
- Mitigation: When skipTranscriptUsageFallback === true, transcriptUsage is forced to null regardless of the needs* flags on the prior code path, so the value was already discarded. The change is dead-code elimination, not a behavioral change.
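To make the first risk concrete, here is one way a cache key covering (provider, model[, agentId]) could be built. This is a sketch under assumptions: resolverCacheKey is a hypothetical helper and the PR's actual key format is not shown. The only detail taken from this description is that an absent agentId is keyed as "".

```typescript
// Hypothetical key builder; not the actual openclaw key format.
function resolverCacheKey(provider: string, model: string, agentId?: string): string {
  // JSON-encoding the tuple keeps field boundaries unambiguous. Plain
  // concatenation with "/" would map ("openrouter", "z-ai/glm-5") and
  // ("openrouter/z-ai", "glm-5") to the same key.
  return JSON.stringify([provider, model, agentId ?? ""]);
}
```

Any input that can change a resolver's result within one list call belongs in the tuple. Inputs that are constant for the call, such as cfg and modelCatalog here, can be left out because the cache never outlives the call.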

Real behavior proof

Behavior or issue addressed: sessions.list burned O(N-rows) CPU per call on stores with many sessions because four deterministic resolvers (resolveSessionDisplayModelIdentityRef, resolveGatewaySessionThinkingDefault, resolveModelCostConfig, listThinkingLevelOptions) recomputed their results once per row instead of once per unique (provider, model) tuple. On the pre-fix 1217-session production profile this was 88.9 s CPU + 5.8 s GC under listSessionsFromStoreAsync / buildGatewaySessionRow, causing sustained load and degraded gateway responsiveness.

Real environment tested: OS Linux 6.14.0-33-generic (x64), Node v22.22.2, OpenClaw gateway on loopback 127.0.0.1:18789, Control UI polling sessions.list, 1431 real session files under /data/.openclaw/agents/*/sessions/. The installed patched runtime at /usr/lib/node_modules/openclaw/dist/session-utils-Rcv9ufNE.js contained marker OPENCLAW_PATCH sessions-list-resolver-cache plus cached helpers __cached_resolveSessionDisplayModelIdentityRef and __cached_resolveGatewaySessionThinkingDefault.

Exact steps or command run after fix:

Applied the sessions-list resolver-cache hotpatch to the installed OpenClaw runtime.
Confirmed the patched runtime marker and cached helpers with grep against /usr/lib/node_modules/openclaw/dist/session-utils-Rcv9ufNE.js.
Counted the real session store with find /data/.openclaw/agents -path '*/sessions/*' -type f | wc -l -> 1431.
Captured a live CPU profile from the running patched gateway while Control UI polling was active: /tmp/openclaw_running.cpuprofile (120.1 s profile, written 2026-05-05 12:45 UTC).
Analyzed the profile with /tmp/analyze_cpuprofile.py /tmp/openclaw_running.cpuprofile plus targeted function search for sessions.list, listSessionsFromStoreAsync, buildGatewaySessionRow, and the formerly hot resolver functions.

After-fix evidence: Live copied console/profile output from the patched gateway on the real 1431-file session store:

$ grep -n "OPENCLAW_PATCH sessions-list-resolver-cache\|__cached_resolveSessionDisplayModelIdentityRef\|__cached_resolveGatewaySessionThinkingDefault"   /usr/lib/node_modules/openclaw/dist/session-utils-Rcv9ufNE.js
35:// OPENCLAW_PATCH sessions-list-resolver-cache
68:function __cached_resolveSessionDisplayModelIdentityRef(params) {
74:function __cached_resolveGatewaySessionThinkingDefault(params) {

$ find /data/.openclaw/agents -path '*/sessions/*' -type f | wc -l
1431

Profile /tmp/openclaw_running.cpuprofile: wall 120.1s, nodes=172863, samples=111422

sessions.list:
  106.9ms total | sessions.list openclaw/dist/server-methods-DTGNFOnM.js:7912
   74.6ms total | sessions.list openclaw/dist/server-methods-DTGNFOnM.js:7912
   56.7ms total | sessions.list openclaw/dist/server-methods-DTGNFOnM.js:7912

listSessionsFromStoreAsync:
  192.2ms total | listSessionsFromStoreAsync openclaw/dist/session-utils-Rcv9ufNE.js:1203
   97.2ms total | listSessionsFromStoreAsync openclaw/dist/session-utils-Rcv9ufNE.js:1203
   74.6ms total | listSessionsFromStoreAsync openclaw/dist/session-utils-Rcv9ufNE.js:1203
   44.0ms total | listSessionsFromStoreAsync openclaw/dist/session-utils-Rcv9ufNE.js:1203

buildGatewaySessionRow:
  496.0ms total | buildGatewaySessionRow openclaw/dist/session-utils-Rcv9ufNE.js:839
  323.9ms total | buildGatewaySessionRow openclaw/dist/session-utils-Rcv9ufNE.js:839
  177.3ms total | buildGatewaySessionRow openclaw/dist/session-utils-Rcv9ufNE.js:839
  123.9ms total | buildGatewaySessionRow openclaw/dist/session-utils-Rcv9ufNE.js:839

Former hot resolvers after fix:
  21.4ms total | __cached_resolveSessionDisplayModelIdentityRef -> resolveSessionDisplayModelIdentityRef
   3.2ms total | resolveModelCostConfig
   3.2ms total | resolveEstimatedSessionCostUsd
   2.1ms total | resolveGatewaySessionThinkingDefault

Before-fix reference from the production CPU profile on the same issue class:

 88.96s total | 34.2ms self | listSessionsFromStoreAsync
   88.87s total | 63.3ms self | buildGatewaySessionRow
     44.87s total | resolveSessionDisplayModelIdentityRef -> isCliProvider
     24.22s total | listThinkingLevelOptions -> resolveThinkingProfile
     12.56s total | resolveEstimatedSessionCostUsd -> resolveModelCostConfig
      6.27s total | resolveSessionModelRef
   5.75s total | (garbage collector)

Supplemental regression smoke output:

RUN  v4.1.5 /tmp/work/openclaw

✓ gateway-core  > listSessionsFromStore resolver cache > collapses non-lightweight per-row resolver work to O(unique provider/model tuples) 31015ms
✓ gateway-server > listSessionsFromStore resolver cache > collapses non-lightweight per-row resolver work to O(unique provider/model tuples) 17204ms
✓ gateway-client > listSessionsFromStore resolver cache > collapses non-lightweight per-row resolver work to O(unique provider/model tuples) 17397ms

 Test Files  3 passed (3)
     Tests  3 passed (3)
  Start at  08:04:24
  Duration  93.72s (transform 17.59s, setup 3.51s, import 23.95s, tests 65.63s, environment 0ms)

Observed result after fix: The same real host that previously showed listSessionsFromStoreAsync / buildGatewaySessionRow dominating at 88.9 s CPU now shows patched live sessions.list samples in the ~57-107 ms range and listSessionsFromStoreAsync samples in the ~44-192 ms range on a 1431-file real session store. The synthetic perf smoke also asserts O(unique-tuples) resolver call counts, so both live behavior and regression guardrail support the fix.

What was not tested: Full post-fix production soak across multiple operators. The live patched profile above covers the previously missing large real-store sessions.list behavior proof on this host.

clawsweeper · 2026-05-04T08:07:19Z

Codex review: needs maintainer review before merge.

Summary
The branch adds per-list memoization for gateway session row display/cost resolver work, a call-count regression smoke, a changelog entry, and a proof-policy line-ending comment.

Reproducibility: yes. Current main source shows cost and display-identity resolution still executed in the per-row builder, and the PR discussion supplies live before/after sessions.list profile evidence on large stores.

Real behavior proof
Sufficient (live_output): the PR body includes copied live console/profile output from a patched installed gateway on a real 1431-file session store showing improved sessions.list and listSessionsFromStoreAsync timings after the fix.

Next step before merge
No automated repair lane is needed because I found no narrow defect for automation to fix; the remaining action is maintainer review plus exact-head CI completion.

Security
Cleared: the diff changes in-process gateway memoization, focused tests, changelog text, and a line-ending comment without adding dependencies, permissions, secret handling, downloads, or code-execution surface.

Review details

Best possible solution:

Have gateway maintainers review and land the scoped per-list cache after exact-head checks finish green, preserving per-call cache lifetime and the GatewaySessionRow contract.

Do we have a high-confidence way to reproduce the issue?

Yes. Current main source shows cost and display-identity resolution still executed in the per-row builder, and the PR discussion supplies live before/after sessions.list profile evidence on large stores.

Is this the best way to solve the issue?

Yes. The per-list rowContext cache is the narrow maintainable fix because config and model catalog inputs are fixed for one list call, avoiding cross-call invalidation or new product policy.

What I checked:

Current main repeats cost lookup per row: resolveEstimatedSessionCostUsd still calls resolveModelCostConfig directly when token usage exists and no explicit cost is stored, so list rows with repeated provider/model tuples redo the same cost resolver work. (src/gateway/session-utils.ts:326, 5a91c7c2a749)
Current main lacks the proposed display/cost caches: SessionListRowContext currently carries subagent, child-session, selected-model, and thinking metadata caches, but no displayModelIdentityByKey or modelCostConfigByModelRef maps. (src/gateway/session-utils.ts:372, 5a91c7c2a749)
Current main repeats display identity resolution: The non-lightweight row path still calls resolveSessionDisplayModelIdentityRef directly while building each GatewaySessionRow. (src/gateway/session-utils.ts:1725, 5a91c7c2a749)
PR diff adds scoped memoization: The patch adds resolveModelCostConfigCached, displayModelIdentityByKey, modelCostConfigByModelRef, and cached display-identity use in buildGatewaySessionRow while preserving direct-call fallback when rowContext is absent. (src/gateway/session-utils.ts:296, dc65fa416dbc)
Regression smoke targets resolver call scaling: The added perf smoke creates repeated rows across five model tuples and asserts thinking/cost resolver calls stay bounded instead of row-linear. (src/gateway/session-utils.perf.test.ts:24, dc65fa416dbc)
Real behavior proof supplied: The PR body includes copied live console/profile output from a patched installed gateway on a real 1431-file session store, with sessions.list samples around 56.7-106.9 ms and listSessionsFromStoreAsync samples around 44.0-192.2 ms after the fix. (dc65fa416dbc)

Likely related people:

steipete: Recent commits on the same session-list hot path added thinking enrichment caching, bounded sessions-list responses, and other session-list performance work. (role: recent maintainer; confidence: high; commits: 18bd7b60e4fe, a224810a7f96, 3aaf30ffa600; files: src/gateway/session-utils.ts, src/gateway/server-methods/sessions.ts)
obviyus: Recent merged work changed session-list model resolution and plugin model resolution in the same gateway session row path. (role: recent adjacent owner; confidence: high; commits: 6be5422fd640, eab494ca6a9e; files: src/gateway/session-utils.ts, src/gateway/session-utils.test.ts)
vincentkoc: Recent work bounded transcript usage, indexed child links, and reused subagent registry state in session listing, which are adjacent to the rowContext cache this PR extends. (role: session-list performance maintainer; confidence: high; commits: a1dc8c066347, 37f8c3806ac9, ecf6cbf75d3d; files: src/gateway/session-utils.ts, src/gateway/server-methods/sessions.ts)
Peetiegonzalez: Authored the recent lightweight sessions.list row-path commit that this PR builds on for the polling hot path. (role: introduced adjacent lightweight row path; confidence: medium; commits: 7fe4ba013ff0; files: src/gateway/session-utils.ts, src/gateway/session-utils.test.ts)
pashpashpash: Recent proof-gate commits own the script touched by this PR's line-ending comment. (role: proof-policy owner; confidence: medium; commits: 70f34bf1779c, 33c42c8d3b65; files: scripts/github/real-behavior-proof-policy.mjs)

Remaining risk / open question:

Exact-head CI/check matrix was still pending for dc65fa4 during this review.
The live after-fix proof covers one large real session store; broader post-merge soak across operator stores remains useful but is not a merge-blocking code finding.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 5a91c7c2a749.

rolandrscheel · 2026-05-08T09:44:13Z

Hi maintainers — this PR looks ready for human review from my side.

Current status:

the branch has been cleaned up and squashed to a single commit
real-behavior proof is supplied and marked proof: sufficient
the bot review did not leave an active repair lane from what I can see
no public API / row shape / ordering behavior is intended to change
the remaining visible blocker appears to be normal maintainer review

Could one of the relevant maintainers please take a look when possible?

rolandrscheel requested a review from a team as a code owner May 4, 2026 08:04

rolandrscheel force-pushed the fix/sessions-list-resolver-cache branch from fd70d16 to e4c78ab May 4, 2026 08:22

rolandrscheel force-pushed the fix/sessions-list-resolver-cache branch from e4c78ab to 26f1f29 May 4, 2026 08:42

rolandrscheel force-pushed the fix/sessions-list-resolver-cache branch from 7b8071e to f27fb52 May 6, 2026 11:22

openclaw-barnacle Bot removed the "proof: sufficient" label (ClawSweeper judged the real behavior proof convincing) May 6, 2026

clawsweeper Bot added the "proof: sufficient" label (ClawSweeper judged the real behavior proof convincing) May 6, 2026

openclaw-barnacle Bot added the "size: M" label and removed the "size: S" and "proof: sufficient" labels May 7, 2026