Fix: preserve modelOverride in agent handler (#5369)#19328
Conversation
8b3c85e to
13c131e
Compare
922b7ce to
28361ca
Compare
277c397 to
8f6fa74
Compare
|
This pull request has been automatically marked as stale due to inactivity. |
|
Fixes #52786 |
a620fb1 to
a00c179
Compare
|
This pull request has been automatically marked as stale due to inactivity. |
a00c179 to
0a2b458
Compare
495f92b to
35f66ab
Compare
35f66ab to
aa86538
Compare
|
This pull request has been automatically marked as stale due to inactivity. |
aa86538 to
1d79288
Compare
7d0acc5 to
335965e
Compare
Real behavior proof — refresh for current PR head
|
335965e to
103aa39
Compare
|
ClawSweeper PR egg ✨ Hatched: 🌱 uncommon Cosmic Shellbean Hatch commandComment Hatchability rules:
Rarity: 🌱 uncommon. What is this egg doing here?
|
103aa39 to
3365317
Compare
Label audit: compatibility + message-deliveryAudited the two risk labels against the actual diff. Both are appropriately applied as "review carefully" flags, but no real breaks. Two specifics that aren't currently explicit in the PR description and may want a maintainer-visible callout: 1. The patch gates 2. Dropping Before this PR, a concurrent Routing fields the PR intentionally leaves alone.
Conflict-resolution note from the rebase onto current
|
|
@clawsweeper hatch |
CI:
|
|
🦞👀 I queued a comment sync for this PR. If the egg is hatchable, ClawSweeper will generate the image once and update the existing review comment. |
) When building the session-entry patch in the agent handler, the cached `entry` may be stale relative to the store — e.g. when sessions.patch writes modelOverride between spawn and run (~3% failure rate for sub-agent model overrides). mergeSessionEntry spreads {...freshStore, ...patch}, so any field read from stale `entry` and placed in the patch clobbers the fresh store value. Fix by removing blind stale-writebacks from the patch. mergeSessionEntry preserves anything not in the patch, so fresh modelOverride, providerOverride, sendPolicy, skillsSnapshot, thinkingLevel, cliSessionBindings, etc. survive the write. Fixes openclaw#5369
Ensure no future change silently re-adds a stale-cache field read to the agent session-entry patch. The original openclaw#5369 test only asserted modelOverride and providerOverride; this one asserts the whole class (sendPolicy, skillsSnapshot, thinkingLevel, fastMode, verboseLevel, traceLevel, reasoningLevel, systemSent, spawnedWorkspaceDir, spawnDepth, cliSessionIds, cliSessionBindings, claudeCliSessionId) survives a stale-cache / fresh-store race. Verified to fail when any of those fields is re-added to nextEntryPatch as `entry?.X`.
Follow-up to openclaw#5369. The previous fix dropped the entry?.sessionStartedAt writeback to avoid clobbering concurrent sessions.patch writes. That also removed the implicit self-heal for legacy stores where sessionStartedAt was never persisted: such entries would keep sessionStartedAt undefined on every subsequent run instead of recovering it from the transcript JSONL header. Restore the self-heal by computing a recovered candidate from resolveSessionLifecycleTimestamps before taking the store lock, then writing it into the patch only when the freshly-loaded store still lacks the field. Concurrent writers that set sessionStartedAt cannot be clobbered. Adds two regression tests: one proves the legacy entry is healed when fresh store also lacks the field, the other proves a fresh store value wins over the recovered candidate.
The hoisted mocks.resolveSessionLifecycleTimestamps default implementation
returns { sessionStartedAt, lastInteractionAt }, so per-test mockReturnValue
calls also need to match that shape. tsgo:test caught two missing fields.
|
Verification for head Behavior addressed: stale cached agent-session entries no longer overwrite fresher store values during agent session persistence; fresh model/provider overrides, send policy, routing metadata, lifecycle fields, and session rotations win.
|
Summary
Fixes the intermittent race condition where sub-agent model overrides were silently ignored (~3% failure rate).
sessions.patchwrites a freshmodelOverridebetween when the agent handler captures its localentryreference and when it buildsnextEntryPatch, the captured stale value would clobber the fresh store value throughmergeSessionEntry({...freshStore, ...patch}).entry?.Xwritebacks fromnextEntryPatch(incl.modelOverride,providerOverride,cliSessionIds,cliSessionBindings,claudeCliSessionId, gatedpluginOwnerId).mergeSessionEntryalready preserves anything not in the patch, so fresh store values survive the write.sessionFile(added by fix(runtime): preserve reviewed routing and transcript behavior #79076) is preserved through a conditional spread that only writessessionFile: undefinedon session rotation.Root Cause
The agent handler was:
loadSessionEntry()and holding it in a local variable.nextEntryPatchwithmodelOverride: entry?.modelOverrideand several siblings — values read from that captured local entry, not the fresh store.updateSessionStore()whose mutator doesmergeSessionEntry(store[primaryKey], nextEntryPatch)— which spreads{...freshStore, ...patch}, so any stale value in the patch clobbers the fresh store value written concurrently (e.g., bysessions.patchsettingmodelOverrideon a sub-agent between spawn and run).Changes
src/gateway/server-methods/agent.ts: drop staleentry?.Xwritebacks fromnextEntryPatch. Switch the patch type toPartial<SessionEntry>. GatepluginOwnerIdbehindentry === undefined. Convert thesessionFilerotation-clearing from fix(runtime): preserve reviewed routing and transcript behavior #79076 into a conditional spread that fires only whenentry.sessionId !== sessionId.src/gateway/server-methods/agent.test.ts: add the original sessions.patch model override not applied to spawned sub-agents #5369 regression and a broader regression guard for the whole stale-writeback class (sendPolicy, skillsSnapshot, thinkingLevel, fastMode, verboseLevel, traceLevel, reasoningLevel, systemSent, spawnedWorkspaceDir, spawnDepth, cliSessionIds, cliSessionBindings, claudeCliSessionId).Real behavior proof
Behavior or issue addressed: When
sessions.patchwrites a freshmodelOverridebetween when the agent handler captures its localentryreference and when it buildsnextEntryPatch, the captured stale value clobbers the fresh store value throughmergeSessionEntry({...freshStore, ...patch}). Per #5369, this manifested as ~3% of sub-agent runs silently ignoring an explicit model override and falling back to the parent's model.Real environment tested:
src/config/sessions/store.ts,src/config/sessions/types.ts).ai-assistant: Ubuntu 24.04 LTS, IP 10.119.193.216, Node v22.22.2, same modules from this branch via a host mount at/home/ubuntu/openclaw-pr5369.Both run the production
loadSessionStore/updateSessionStore/mergeSessionEntrycode paths against a real on-disk JSON store in a per-run tmp directory. No mocking. No shims.Exact steps or command run after this patch:
The driver lives in this branch (local-only, intentionally not committed). For each of 1000 iterations it:
updateSessionStore(no override yet).loadSessionStore({ skipCache: true })— the same shape the agent handler holds in its localentryvariable.updateSessionStoreto setmodelOverride: haiku-4.5-iter-N(mimicking whatsessions.patchdoes during sub-agent setup).nextEntryPatchtwo ways and applies each through the samemergeSessionEntry(store[primaryKey], nextEntryPatch)call shape thatsrc/gateway/server-methods/agent.tsuses:modelOverride: capturedEntry.modelOverride(writes back the captured staleundefined).modelOverridedropped from the patch —mergeSessionEntrypreserves the fresh store value.Evidence after fix:
Host (Linux 6.14, Node v22.22.0):
Multipass guest VM
ai-assistant(Ubuntu 24.04 LTS, Node v22.22.2):Observed result after fix: in both environments every one of 1000 iterations preserved the sub-agent
modelOverridewritten by the concurrent store update. The same 1000 iterations under the pre-fix patch shape lost the override every single time — confirming the writeback was the clobbering vector and that dropping it fromnextEntryPatchis what makes the override durable. The persisted on-disk values (actual=haiku-4.5-iter-0) match whatsessions.patchwrote, end-to-end, through the production merge code.The 100% pre-fix loss rate (vs the ~3% reported in #5369) reflects that the driver deterministically forces the stale-then-fresh ordering on every iteration. Production traffic only hits that ordering when
sessions.patchlands in the narrow window between the agent handler'sloadSessionEntryand its ownupdateSessionStorecall — about 3% of the time per the linked issue. Whenever that ordering does occur in production, the driver shows the pre-fix shape always loses the override and the post-fix shape never does.What was not tested: end-to-end Anthropic API calls burning credit on each of the 1000 iterations. The sub-agent dispatch loop is upstream of the
modelOverridepersistence the fix addresses; a real model loop would only confirm what the persistedmodelOverridevalue is, which the driver already reads back from disk on every iteration. Driving 1000 real Anthropic calls would not provide additional information about the persistence behavior the fix changes.Fixes #5369
Supersedes #8398 (rebased onto latest main to resolve conflicts).