Fix/sessions list resolver cache #77187
Codex review: needs maintainer review before merge.
Summary
Reproducibility: yes. Current main source shows cost and display-identity resolution still executed in the per-row builder, and the PR discussion supplies live before/after sessions.list profile evidence on large stores.
Real behavior proof
Next step before merge
Security
Review details
Best possible solution: Have gateway maintainers review and land the scoped per-list cache after exact-head checks finish green, preserving per-call cache lifetime and the GatewaySessionRow contract.
Do we have a high-confidence way to reproduce the issue? Yes. Current main source shows cost and display-identity resolution still executed in the per-row builder, and the PR discussion supplies live before/after sessions.list profile evidence on large stores.
Is this the best way to solve the issue? Yes. The per-list rowContext cache is the narrow maintainable fix because config and model catalog inputs are fixed for one list call, avoiding cross-call invalidation or new product policy.
What I checked:
Likely related people:
Remaining risk / open question:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 5a91c7c2a749.
fd70d16 to e4c78ab
e4c78ab to 26f1f29
7b8071e to f27fb52
2932bde to d69172a
Hi maintainers, this PR looks ready for human review from my side. Current status:
Could one of the relevant maintainers please take a look when possible?
Summary
- sessions.list (Control UI polling) burned huge amounts of CPU on stores with many sessions. A production CPU profile on a 1217-session store showed listSessionsFromStoreAsync → buildGatewaySessionRow at 88.9 s total, dominated by deterministic-but-uncached resolvers run once per row: resolveSessionDisplayModelIdentityRef / isCliProvider (44.9 s), listThinkingLevelOptions / resolveThinkingProfile (24.2 s), resolveModelCostConfig via resolveEstimatedSessionCostUsd (12.6 s), plus 5.8 s of GC pressure from churned Maps.
- The Control UI polls sessions.list continuously. With sessions sharing only a handful of (provider, model) tuples, each poll wasted O(rows) work on results that depend on O(unique tuples). On real stores this manifested as sustained high CPU, GC stalls, and degraded responsiveness across the gateway (heartbeats, RPCs, and channel I/O all share the same loop).
- A SessionListRowResolverCache (4 keyed Maps) is built once per listSessionsFromStore[Async] call and threaded into buildGatewaySessionRow via the existing params bag.
- The cache memoizes listThinkingLevelOptions, resolveGatewaySessionThinkingDefault, resolveSessionDisplayModelIdentityRef, and resolveModelCostConfig. Each falls back to the uncached direct call when no cache is provided, so external callers of buildGatewaySessionRow are unchanged.
- Removed the resolveEstimatedSessionCostUsd call driving needsTranscriptEstimatedCostUsd when skipTranscriptUsageFallback === true. The result was unconditionally discarded by the very next guard, so it was pure dead CPU on the lightweight async polling path.
- session-utils.perf.test.ts exercises 1000 synthetic sessions across 5 model tuples as a coarse regression smoke.
- The row shape (GatewaySessionRow), filter/sort behavior, transcript I/O, plugin host hooks, and resolver semantics are all unchanged. Caches are scoped to a single list call (no cross-call state, no invalidation surface). The non-lightweight path keeps doing exactly the same work, just memoized within one call.
Change Type (select all)
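To make the per-list cache from the Summary concrete, here is a minimal TypeScript sketch of the pattern. It is not the PR's code: `ModelRef`, `CostConfig`, `resolveCostConfigUncached`, `RowResolverCache`, `tupleKey`, and `listRows` are hypothetical stand-ins, only one of the four Maps is shown, and the key format is an assumption.

```typescript
// Hypothetical stand-ins: the real resolver signatures are not shown in this PR text.
type ModelRef = { provider: string; model: string };
type CostConfig = { inputPerMTok: number; outputPerMTok: number };

// Stand-in for an expensive but deterministic resolver; the counter makes calls observable.
let uncachedCalls = 0;
function resolveCostConfigUncached(ref: ModelRef): CostConfig {
  uncachedCalls++;
  return { inputPerMTok: ref.model.length, outputPerMTok: ref.model.length * 2 };
}

// One cache object per list call. It is dropped when the call returns,
// so there is no cross-call state and nothing to invalidate.
interface RowResolverCache {
  costConfig: Map<string, CostConfig>;
}

function createRowResolverCache(): RowResolverCache {
  return { costConfig: new Map() };
}

// Assumed key format: serializing the tuple avoids collisions such as
// ("a/b", "c") versus ("a", "b/c") that plain string concatenation would allow.
function tupleKey(ref: ModelRef): string {
  return JSON.stringify([ref.provider, ref.model]);
}

function resolveCostConfig(ref: ModelRef, cache?: RowResolverCache): CostConfig {
  // Callers that pass no cache get the direct, uncached call.
  if (!cache) return resolveCostConfigUncached(ref);
  const key = tupleKey(ref);
  let hit = cache.costConfig.get(key);
  if (hit === undefined) {
    hit = resolveCostConfigUncached(ref);
    cache.costConfig.set(key, hit);
  }
  return hit;
}

function listRows(sessions: ModelRef[]): CostConfig[] {
  const cache = createRowResolverCache(); // built once per list call
  return sessions.map((s) => resolveCostConfig(s, cache));
}

// 1000 rows spread over 5 tuples trigger 5 resolver calls instead of 1000.
const demoSessions = Array.from({ length: 1000 }, (_, i) => ({ provider: "p", model: `m${i % 5}` }));
const demoRows = listRows(demoSessions);
```

Because the cache parameter is optional, existing call sites keep their behavior, which mirrors the fallback described in the Summary.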
Scope (select all touched areas)
Linked Issue/PR
fix(gateway): add lightweight row path for sessions.list to reduce event-loop blocking (this PR completes the same hot-path effort by collapsing the remaining per-row resolvers)
Root Cause (if applicable)
- buildGatewaySessionRow calls four pure resolvers whose results depend only on (provider, model[, agentId, cfg]). Sessions in a single list typically share a small set of those tuples, but every row recomputed them from scratch, including provider plugin lookups (resolveRuntimeCliBackends, resolvePluginSetupCliBackendRuntime, thinking policy hooks) and the configured/JSON model-cost index Map build. Additionally, needsTranscriptEstimatedCostUsd was computed even when skipTranscriptUsageFallback === true forced the downstream transcriptUsage to null, so its result was always discarded on the polling path.
- No test exercised listSessionsFromStore[Async] at scale. Unit tests only validated correctness on small fixtures, where the per-row cost is invisible.
- The prior fix added a lightweightListRow path to skip the heaviest resolver (resolveSessionDisplayModelIdentityRef) for polling. That cut roughly half the cost but left the other three resolvers running per row; this PR closes that gap and additionally removes the dead needsTranscriptEstimatedCostUsd work on the lightweight path.
Regression Test Plan (if applicable)
- src/gateway/session-utils.perf.test.ts (added).
- listSessionsFromStoreAsync over 1000 sessions spread across 5 distinct (provider, model) tuples completes well under 2.5 s wall time. This catches O(rows)-scaled regressions in the resolver hot path without depending on plugin-host fixtures.
- session-utils.test.ts cases keep covering correctness.
- "caps transcript title and last-message hydration for bulk list responses" validates correctness at higher row counts but does not assert wall-time scaling, so it would not have caught this regression.
User-visible / Behavior Changes
None.
GatewaySessionRow shape, defaults, and ordering are unchanged. Only CPU/wall-time of sessions.list improves.
Diagram (if applicable)
Security Impact (required)
- No
- No
- No
- No
- No
- Yes, explain risk + mitigation: N/A
Repro + Verification
Environment
- 8b2a6e5), branch built from release/2026.5.3
- google-vertex/gemini-3-flash-preview, openai/gpt-5, anthropic/claude-opus-4-7, openrouter/z-ai/glm-5, google/gemini-2.5-pro)
- sessions.list
- /data/.openclaw/agents/*/sessions/
Steps
1. --cpu-prof of the running gateway during sustained Control UI polling against a session store with many entries (≥1000).
2. cpuprofile-tree.js at depth=4 minSelfMs=5000.
3. listSessionsFromStoreAsync → buildGatewaySessionRow total time and the resolvers underneath.
Expected
- buildGatewaySessionRow self+children cost dominated by O(unique (provider, model)) work, not O(rows).
- resolveEstimatedSessionCostUsd no longer appears under buildGatewaySessionRow on the lightweight Control UI path.
Actual (before fix)
- buildGatewaySessionRow: 88.9 s total / 63 ms self
- resolveSessionDisplayModelIdentityRef: 44.9 s (gone after the prior lightweightListRow fix on release/2026.5.3, still memoized here for non-lightweight callers)
- listThinkingLevelOptions: 24.2 s
- resolveEstimatedSessionCostUsd: 12.6 s
- resolveSessionModelRef: 6.3 s
- (garbage collector): 5.8 s self
Evidence
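For readers who want to reproduce per-function totals like the ones listed under "Actual (before fix)", the following is a rough TypeScript sketch of how total (self plus children) time can be aggregated from a V8 .cpuprofile file. It is not the cpuprofile-tree.js script named in the Steps, whose contents are not shown here. It assumes the standard profile fields nodes, samples, and timeDeltas (microseconds), and it attributes each delta to the sample at the same index, which is an approximation. `totalMsByFunction` and `exampleProfile` are hypothetical names.

```typescript
// Subset of the V8 .cpuprofile shape used by this sketch.
interface ProfileNode {
  id: number;
  callFrame: { functionName: string };
  children?: number[];
}
interface CpuProfile {
  nodes: ProfileNode[];
  samples: number[]; // node id observed at each sample
  timeDeltas: number[]; // microseconds between adjacent samples
}

// Returns total (self + children) milliseconds per function name.
function totalMsByFunction(profile: CpuProfile): Map<string, number> {
  // Self time per node id, in microseconds.
  const selfUs = new Map<number, number>();
  profile.samples.forEach((nodeId, i) => {
    selfUs.set(nodeId, (selfUs.get(nodeId) ?? 0) + (profile.timeDeltas[i] ?? 0));
  });

  const byId = new Map(profile.nodes.map((n) => [n.id, n] as const));
  const totals = new Map<string, number>();

  // Depth-first walk; onStack prevents double-counting recursive frames.
  const visit = (id: number, onStack: Set<string>): number => {
    const node = byId.get(id);
    if (!node) return 0;
    const name = node.callFrame.functionName || "(anonymous)";
    const isOutermost = !onStack.has(name);
    if (isOutermost) onStack.add(name);
    let totalUs = selfUs.get(id) ?? 0;
    for (const child of node.children ?? []) totalUs += visit(child, onStack);
    if (isOutermost) {
      onStack.delete(name);
      totals.set(name, (totals.get(name) ?? 0) + totalUs / 1000);
    }
    return totalUs;
  };

  // Roots are the nodes that never appear as another node's child.
  const childIds = new Set(profile.nodes.flatMap((n) => n.children ?? []));
  for (const n of profile.nodes) {
    if (!childIds.has(n.id)) visit(n.id, new Set());
  }
  return totals;
}

// Tiny synthetic profile: (root) -> buildRow -> resolve.
const exampleProfile: CpuProfile = {
  nodes: [
    { id: 1, callFrame: { functionName: "(root)" }, children: [2] },
    { id: 2, callFrame: { functionName: "buildRow" }, children: [3] },
    { id: 3, callFrame: { functionName: "resolve" } },
  ],
  samples: [2, 3, 3],
  timeDeltas: [1000, 2000, 3000],
};
const exampleTotals = totalMsByFunction(exampleProfile);
// exampleTotals: "(root)" 6 ms, "buildRow" 6 ms, "resolve" 5 ms
```

On a real profile the input would be the parsed JSON of the .cpuprofile file; sorting the resulting map by value gives a quick ranking of the hottest call paths.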
Human Verification (required)
- fix/sessions-list-resolver-cache from origin/release/2026.5.3 and ran npx tsc -p tsconfig.json --noEmit to confirm types are clean across the patched module and its imports.
- npx vitest run src/gateway/session-utils.test.ts locally: all 231 pre-existing session-utils tests pass under the gateway-core, gateway-server, and gateway-client projects.
- resolverCache is omitted, behavior is byte-identical to the uncached call, (4) needsTranscriptEstimatedCostUsd is only short-circuited when its output is provably unused.
- agentId absent (uses "" in the cache key); rows with no token usage at all (cost branch returns undefined early, before any cached lookup); resolverCache undefined (passes through to existing direct calls).
Review Conversations
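Two of the edge cases checked above, the absent agentId and the short-circuited needsTranscriptEstimatedCostUsd, can be illustrated with a small TypeScript sketch. Both functions are hypothetical: the PR text only states that an absent agentId is represented as "" in the cache key and that the flag is skipped when its output cannot be used. The key format and the meaning assigned to the flag here are assumptions.

```typescript
// Hypothetical key builder: an absent agentId maps to "" so that rows without
// an agent share one cache entry per (provider, model) tuple.
function identityKey(provider: string, model: string, agentId?: string): string {
  return JSON.stringify([provider, model, agentId ?? ""]);
}

// Hypothetical guard. When skipTranscriptUsageFallback is true the transcript
// usage is forced to null downstream, so the flag's value is never consumed
// and the cost estimate does not need to run at all.
function needsTranscriptEstimatedCost(
  skipTranscriptUsageFallback: boolean,
  estimateCostUsd: () => number | undefined,
): boolean {
  if (skipTranscriptUsageFallback) return false; // result would be discarded
  // Assumed meaning: the transcript is needed only if no estimate is available.
  return estimateCostUsd() === undefined;
}

// The estimator is passed as a thunk, so the lightweight path never invokes it.
let estimateCalls = 0;
const lightweight = needsTranscriptEstimatedCost(true, () => {
  estimateCalls++;
  return undefined;
});
// lightweight is false and estimateCalls is still 0
```

Passing the estimator lazily is one way to make the "provably unused" condition explicit at the call site. The actual change may be structured differently.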
Compatibility / Migration
- Yes
- No
- No
- The resolverCache parameter on buildGatewaySessionRow and resolveTranscriptUsageFallback is optional; all existing callers continue to work without it. SessionListRowResolverCache and createSessionListRowResolverCache are exported as forward-compatible additions for external embedders that want the same memoization.
Risks and Mitigations
- provider, model, plus agentId for the identity-ref path).
- cfg and modelCatalog are constant for the duration of one listSessionsFromStore[Async] call, so caches scoped to that call cannot mix configs. The existing 231 unit tests covering the resolver outputs continue to pass.
- (provider, model[, agentId]) tuples) which is at most a small constant in practice.
- needsTranscriptEstimatedCostUsd change subtly diverges from prior behavior.
- When skipTranscriptUsageFallback === true, transcriptUsage is forced to null regardless of the needs* flags on the prior code path, so the value was already discarded. The change is dead-code elimination, not a behavioral change.
Real behavior proof
Behavior or issue addressed: sessions.list burned O(N-rows) CPU per call on stores with many sessions because four deterministic resolvers (resolveSessionDisplayModelIdentityRef, resolveGatewaySessionThinkingDefault, resolveModelCostConfig, listThinkingLevelOptions) recomputed their results once per row instead of once per unique (provider, model) tuple. On the pre-fix 1217-session production profile this was 88.9 s CPU + 5.8 s GC under listSessionsFromStoreAsync / buildGatewaySessionRow, causing sustained load and degraded gateway responsiveness.
Real environment tested: OS Linux 6.14.0-33-generic (x64), Node v22.22.2, OpenClaw gateway on loopback 127.0.0.1:18789, Control UI polling sessions.list, 1431 real session files under /data/.openclaw/agents/*/sessions/. The installed patched runtime at /usr/lib/node_modules/openclaw/dist/session-utils-Rcv9ufNE.js contained marker OPENCLAW_PATCH sessions-list-resolver-cache plus cached helpers __cached_resolveSessionDisplayModelIdentityRef and __cached_resolveGatewaySessionThinkingDefault.
Exact steps or command run after fix:
1. grep against /usr/lib/node_modules/openclaw/dist/session-utils-Rcv9ufNE.js.
2. find /data/.openclaw/agents -path '*/sessions/*' -type f | wc -l -> 1431.
3. /tmp/openclaw_running.cpuprofile (120.1 s profile, written 2026-05-05 12:45 UTC).
4. /tmp/analyze_cpuprofile.py /tmp/openclaw_running.cpuprofile plus targeted function search for sessions.list, listSessionsFromStoreAsync, buildGatewaySessionRow, and the formerly hot resolver functions.
After-fix evidence: Live copied console/profile output from the patched gateway on the real 1431-file session store:
Before-fix reference from the production CPU profile on the same issue class:
Supplemental regression smoke output:
Observed result after fix: The same real host that previously showed listSessionsFromStoreAsync / buildGatewaySessionRow dominating at 88.9 s CPU now shows patched live sessions.list samples in the ~57-107 ms range and listSessionsFromStoreAsync samples in the ~44-192 ms range on a 1431-file real session store. The synthetic perf smoke also asserts O(unique-tuples) resolver call counts, so both live behavior and the regression guardrail support the fix.
What was not tested: Full post-fix production soak across multiple operators. The live patched profile above covers the previously missing large real-store sessions.list behavior proof on this host.
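The call-count guardrail mentioned above can be sketched as a small harness. This is an illustration of the shape such a smoke check might take, not the contents of session-utils.perf.test.ts. `Session`, `makeSessions`, `spy`, `listLabels`, and `runSmoke` are hypothetical names, the resolver is a trivial stand-in, and only the 1000-row, 5-tuple, 2.5 s figures come from the PR text.

```typescript
// Hypothetical session shape; the tuples below are synthetic.
type Session = { id: number; provider: string; model: string };

// Spreads `count` synthetic sessions round-robin across the given tuples.
function makeSessions(count: number, tuples: Array<[string, string]>): Session[] {
  return Array.from({ length: count }, (_, id) => {
    const [provider, model] = tuples[id % tuples.length];
    return { id, provider, model };
  });
}

// Wraps a function and records how many times it was invoked.
function spy<A extends unknown[], R>(fn: (...args: A) => R) {
  const stats = { calls: 0 };
  const wrapped = (...args: A): R => {
    stats.calls++;
    return fn(...args);
  };
  return { wrapped, stats };
}

// Stand-in list builder with a cache scoped to a single call.
function listLabels(
  sessions: Session[],
  resolve: (provider: string, model: string) => string,
): string[] {
  const cache = new Map<string, string>();
  return sessions.map((s) => {
    const key = JSON.stringify([s.provider, s.model]);
    let label = cache.get(key);
    if (label === undefined) {
      label = resolve(s.provider, s.model);
      cache.set(key, label);
    }
    return label;
  });
}

// Runs one list call and reports row count, resolver calls, and a coarse time check.
function runSmoke(rows: number, tuples: Array<[string, string]>, budgetMs: number) {
  const { wrapped, stats } = spy((provider: string, model: string) => `${provider}/${model}`);
  const started = Date.now();
  const labels = listLabels(makeSessions(rows, tuples), wrapped);
  const elapsedMs = Date.now() - started;
  return { rowCount: labels.length, resolverCalls: stats.calls, withinBudget: elapsedMs < budgetMs };
}

const smokeTuples: Array<[string, string]> = [
  ["p0", "m0"], ["p1", "m1"], ["p2", "m2"], ["p3", "m3"], ["p4", "m4"],
];
const smoke = runSmoke(1000, smokeTuples, 2500);
// smoke.rowCount is 1000 and smoke.resolverCalls is 5 (one call per unique tuple)
```

Asserting on the call count rather than only on wall time keeps the check stable on slow CI hosts, while the time budget still catches gross regressions.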