Skip to content

Fix: preserve modelOverride in agent handler (#5369)#19328

Merged
steipete merged 6 commits into
openclaw:mainfrom
CodeReclaimers:fix/5369-subagent-model-override-race
May 22, 2026
Merged

Fix: preserve modelOverride in agent handler (#5369)#19328
steipete merged 6 commits into
openclaw:mainfrom
CodeReclaimers:fix/5369-subagent-model-override-race

Conversation

@CodeReclaimers

@CodeReclaimers CodeReclaimers commented Feb 17, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes the intermittent race condition where sub-agent model overrides were silently ignored (~3% failure rate).

  • When sessions.patch writes a fresh modelOverride between when the agent handler captures its local entry reference and when it builds nextEntryPatch, the captured stale value would clobber the fresh store value through mergeSessionEntry({...freshStore, ...patch}).
  • The fix removes the stale entry?.X writebacks from nextEntryPatch (incl. modelOverride, providerOverride, cliSessionIds, cliSessionBindings, claudeCliSessionId, gated pluginOwnerId). mergeSessionEntry already preserves anything not in the patch, so fresh store values survive the write.
  • Rotation-clearing for sessionFile (added by fix(runtime): preserve reviewed routing and transcript behavior #79076) is preserved through a conditional spread that only writes sessionFile: undefined on session rotation.

Root Cause

The agent handler was:

  1. Reading the session entry early via loadSessionEntry() and holding it in a local variable.
  2. Building nextEntryPatch with modelOverride: entry?.modelOverride and several siblings — values read from that captured local entry, not the fresh store.
  3. Calling updateSessionStore() whose mutator does mergeSessionEntry(store[primaryKey], nextEntryPatch) — which spreads {...freshStore, ...patch}, so any stale value in the patch clobbers the fresh store value written concurrently (e.g., by sessions.patch setting modelOverride on a sub-agent between spawn and run).

Changes

  • src/gateway/server-methods/agent.ts: drop stale entry?.X writebacks from nextEntryPatch. Switch the patch type to Partial<SessionEntry>. Gate pluginOwnerId behind entry === undefined. Convert the sessionFile rotation-clearing from fix(runtime): preserve reviewed routing and transcript behavior #79076 into a conditional spread that fires only when entry.sessionId !== sessionId.
  • src/gateway/server-methods/agent.test.ts: add the original sessions.patch model override not applied to spawned sub-agents #5369 regression and a broader regression guard for the whole stale-writeback class (sendPolicy, skillsSnapshot, thinkingLevel, fastMode, verboseLevel, traceLevel, reasoningLevel, systemSent, spawnedWorkspaceDir, spawnDepth, cliSessionIds, cliSessionBindings, claudeCliSessionId).

Real behavior proof

Behavior or issue addressed: When sessions.patch writes a fresh modelOverride between when the agent handler captures its local entry reference and when it builds nextEntryPatch, the captured stale value clobbers the fresh store value through mergeSessionEntry({...freshStore, ...patch}). Per #5369, this manifested as ~3% of sub-agent runs silently ignoring an explicit model override and falling back to the parent's model.

Real environment tested:

  • Host: Linux 6.14.0-34-generic on x86_64, Node v22.22.0, openclaw production session-store modules from this branch (src/config/sessions/store.ts, src/config/sessions/types.ts).
  • Multipass guest VM ai-assistant: Ubuntu 24.04 LTS, IP 10.119.193.216, Node v22.22.2, same modules from this branch via a host mount at /home/ubuntu/openclaw-pr5369.

Both run the production loadSessionStore / updateSessionStore / mergeSessionEntry code paths against a real on-disk JSON store in a per-run tmp directory. No mocking. No shims.

Exact steps or command run after this patch:

node --import tsx scripts/repro-5369.mts 1000

The driver lives in this branch (local-only, intentionally not committed). For each of 1000 iterations it:

  1. Initializes a session entry on disk via updateSessionStore (no override yet).
  2. Captures the entry via loadSessionStore({ skipCache: true }) — the same shape the agent handler holds in its local entry variable.
  3. Fires a concurrent updateSessionStore to set modelOverride: haiku-4.5-iter-N (mimicking what sessions.patch does during sub-agent setup).
  4. Builds nextEntryPatch two ways and applies each through the same mergeSessionEntry(store[primaryKey], nextEntryPatch) call shape that src/gateway/server-methods/agent.ts uses:
    • PRE-FIX shape: modelOverride: capturedEntry.modelOverride (writes back the captured stale undefined).
    • POST-FIX shape: modelOverride dropped from the patch — mergeSessionEntry preserves the fresh store value.
  5. Reads the persisted entry back from disk and reports whether the override survived.

Evidence after fix:

Host (Linux 6.14, Node v22.22.0):

[repro #5369] node v22.22.0 on linux x64
[repro #5369] storePath=/tmp/openclaw-pr5369-0fZPH1/sessions.json
[repro #5369] iterations=1000

[repro #5369] running PRE-FIX patch shape (entry?.modelOverride writeback)...
[repro #5369] PRE-FIX :  1000/1000 sub-agents lost their override  (338 ms)
[repro #5369]           sample: expected=haiku-4.5-iter-0 actual=undefined

[repro #5369] running POST-FIX patch shape (modelOverride dropped from patch)...
[repro #5369] POST-FIX:  0/1000 sub-agents lost their override  (254 ms)
[repro #5369]           sample: expected=haiku-4.5-iter-0 actual=haiku-4.5-iter-0

[repro #5369] verdict: fix VERIFIED

Multipass guest VM ai-assistant (Ubuntu 24.04 LTS, Node v22.22.2):

[repro #5369] node v22.22.2 on linux x64
[repro #5369] storePath=/tmp/openclaw-pr5369-4EJXTF/sessions.json
[repro #5369] iterations=1000

[repro #5369] running PRE-FIX patch shape (entry?.modelOverride writeback)...
[repro #5369] PRE-FIX :  1000/1000 sub-agents lost their override  (2602 ms)
[repro #5369]           sample: expected=haiku-4.5-iter-0 actual=undefined

[repro #5369] running POST-FIX patch shape (modelOverride dropped from patch)...
[repro #5369] POST-FIX:  0/1000 sub-agents lost their override  (1337 ms)
[repro #5369]           sample: expected=haiku-4.5-iter-0 actual=haiku-4.5-iter-0

[repro #5369] verdict: fix VERIFIED

Observed result after fix: in both environments every one of 1000 iterations preserved the sub-agent modelOverride written by the concurrent store update. The same 1000 iterations under the pre-fix patch shape lost the override every single time — confirming the writeback was the clobbering vector and that dropping it from nextEntryPatch is what makes the override durable. The persisted on-disk values (actual=haiku-4.5-iter-0) match what sessions.patch wrote, end-to-end, through the production merge code.

The 100% pre-fix loss rate (vs the ~3% reported in #5369) reflects that the driver deterministically forces the stale-then-fresh ordering on every iteration. Production traffic only hits that ordering when sessions.patch lands in the narrow window between the agent handler's loadSessionEntry and its own updateSessionStore call — about 3% of the time per the linked issue. Whenever that ordering does occur in production, the driver shows the pre-fix shape always loses the override and the post-fix shape never does.

What was not tested: end-to-end Anthropic API calls burning credit on each of the 1000 iterations. The sub-agent dispatch loop is upstream of the modelOverride persistence the fix addresses; a real model loop would only confirm what the persisted modelOverride value is, which the driver already reads back from disk on every iteration. Driving 1000 real Anthropic calls would not provide additional information about the persistence behavior the fix changes.

Fixes #5369

Supersedes #8398 (rebased onto latest main to resolve conflicts).

@openclaw-barnacle openclaw-barnacle Bot added gateway Gateway runtime size: M labels Feb 17, 2026
@CodeReclaimers CodeReclaimers force-pushed the fix/5369-subagent-model-override-race branch 3 times, most recently from 8b3c85e to 13c131e Compare February 17, 2026 18:00
@CodeReclaimers CodeReclaimers force-pushed the fix/5369-subagent-model-override-race branch from 922b7ce to 28361ca Compare March 4, 2026 18:55
@CodeReclaimers CodeReclaimers force-pushed the fix/5369-subagent-model-override-race branch 2 times, most recently from 277c397 to 8f6fa74 Compare March 17, 2026 20:43
@openclaw-barnacle

Copy link
Copy Markdown

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

@CodeReclaimers

Copy link
Copy Markdown
Contributor Author

Fixes #52786

@openclaw-barnacle openclaw-barnacle Bot removed the stale Marked as stale due to inactivity label Mar 24, 2026
@CodeReclaimers CodeReclaimers force-pushed the fix/5369-subagent-model-override-race branch 2 times, most recently from a620fb1 to a00c179 Compare March 28, 2026 02:59
@openclaw-barnacle

Copy link
Copy Markdown

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

@openclaw-barnacle openclaw-barnacle Bot added the stale Marked as stale due to inactivity label Apr 2, 2026
@CodeReclaimers CodeReclaimers force-pushed the fix/5369-subagent-model-override-race branch from a00c179 to 0a2b458 Compare April 2, 2026 11:24
@openclaw-barnacle openclaw-barnacle Bot removed the stale Marked as stale due to inactivity label Apr 3, 2026
@CodeReclaimers CodeReclaimers force-pushed the fix/5369-subagent-model-override-race branch 5 times, most recently from 495f92b to 35f66ab Compare April 12, 2026 23:34
@CodeReclaimers CodeReclaimers force-pushed the fix/5369-subagent-model-override-race branch from 35f66ab to aa86538 Compare April 17, 2026 20:28
@openclaw-barnacle

Copy link
Copy Markdown

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

@openclaw-barnacle openclaw-barnacle Bot added the stale Marked as stale due to inactivity label Apr 23, 2026
@CodeReclaimers CodeReclaimers force-pushed the fix/5369-subagent-model-override-race branch from aa86538 to 1d79288 Compare April 23, 2026 10:34
@CodeReclaimers CodeReclaimers force-pushed the fix/5369-subagent-model-override-race branch from 7d0acc5 to 335965e Compare May 18, 2026 12:22
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 18, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 18, 2026
@CodeReclaimers

Copy link
Copy Markdown
Contributor Author

Real behavior proof — refresh for current PR head 335965e51e

The PR has grown two new commits since the original proof: a self-heal of
sessionStartedAt for legacy session entries (7be938167e) and a test-type
fix (335965e51e). The original modelOverride race proof remains valid as-is
(no production code path for that scenario changed); this comment extends the
same scripts/repro-5369.mts driver to also cover the new self-heal logic,
end-to-end, against the production session-store modules.

Behavior or issue addressed (new in this revision): After the original
#5369 fix dropped entry?.sessionStartedAt from nextEntryPatch, legacy
session entries that lack sessionStartedAt in the store stayed unhealed
forever — mergeSessionEntryWithPolicy falls back to
existing.sessionStartedAt (still undefined) when the patch omits the
field. The follow-up commit pre-computes a JSONL-derived recovery candidate
outside the store lock and writes it into the patch only when the
freshly-loaded store still lacks the field, so a concurrent writer that sets
sessionStartedAt between the cache load and the store lock cannot be
clobbered.

Real environment tested:

  • Host: Linux 6.14.0-34-generic on x86_64, Node v22.22.0, openclaw production
    session-store modules from this branch
    (src/config/sessions/store.ts, src/config/sessions/types.ts,
    src/config/sessions/lifecycle.ts, src/config/sessions/paths.ts).

Runs the production loadSessionStore / updateSessionStore /
mergeSessionEntry / resolveSessionLifecycleTimestamps /
resolveSessionFilePath code paths against a real on-disk JSON store and a
real JSONL transcript header written into the path the production code
resolves. No mocking; no shims.

The previously-reported Multipass ai-assistant VM run for the modelOverride
race in the original comment above is unchanged and still covers
cross-environment proof for that batch.

Exact command run after this patch:

node --import tsx scripts/repro-5369.mts 1000

The driver lives in this branch (local-only, intentionally not committed).
Per iteration, for each of three new self-heal batches it:

  1. Writes a real JSONL transcript header at the path
    resolveSessionFilePath(sessionId, undefined, {rootDir}) produces, with
    {"type":"session","id":<sessionId>,"timestamp":<ISO>} as the first line.
  2. Initializes a session entry on disk via updateSessionStore and removes
    sessionStartedAt post-merge so the store entry truly lacks the field
    (mirrors a pre-field-introduction schema).
  3. Captures the entry via loadSessionStore({ skipCache: true }) — the same
    shape the agent handler holds in its local entry variable.
  4. For the no-clobber batch only: a concurrent updateSessionStore sets
    sessionStartedAt: <fresh> between the capture and the agent handler's
    own store lock.
  5. Computes the recovery candidate via the production
    resolveSessionLifecycleTimestamps({entry: captured, storePath, agentId})
    call site that src/gateway/server-methods/agent.ts:1252-1262 uses.
  6. Enters the store lock and applies the patch two ways:
  7. Reads the persisted entry back from disk and reports what survived.

Evidence after fix (host run, 1000 iterations):

[repro #5369] node v22.22.0 on linux x64
[repro #5369] storePath=/tmp/openclaw-pr5369-qMidRR/sessions.json
[repro #5369] iterations=1000

[repro #5369] running PRE-FIX patch shape (entry?.modelOverride writeback)...
[repro #5369] PRE-FIX :  1000/1000 sub-agents lost their override  (376 ms)
[repro #5369]           sample: expected=haiku-4.5-iter-0 actual=undefined

[repro #5369] running POST-FIX patch shape (modelOverride dropped from patch)...
[repro #5369] POST-FIX:  0/1000 sub-agents lost their override  (293 ms)
[repro #5369]           sample: expected=haiku-4.5-iter-0 actual=haiku-4.5-iter-0

[repro #5369] running PRE-SELF-HEAL on legacy entry (no recovery wired in)...
[repro #5369] PRE-SELF-HEAL  :  1000/1000 legacy entries left unhealed  (283 ms)
[repro #5369]                  sample: expected=1735689600000 actual=undefined (recovered candidate was 1735689600000)

[repro #5369] running POST-SELF-HEAL on legacy entry (recovery written when fresh store lacks)...
[repro #5369] POST-SELF-HEAL :  0/1000 legacy entries left unhealed  (291 ms)
[repro #5369]                  sample: expected=1735689600000 actual=1735689600000 (recovered candidate=1735689600000)

[repro #5369] running POST-SELF-HEAL no-clobber (concurrent writer sets fresh value)...
[repro #5369] NO-CLOBBER     :  0/1000 fresh values clobbered by recovery  (316 ms)
[repro #5369]                  sample: expected=1781654400000 actual=1781654400000 (recovered candidate=1735689600000)

[repro #5369] verdict: fix VERIFIED

Observed result after fix:

  • modelOverride race (PRE-FIX vs POST-FIX): unchanged from the original
    proof — every one of 1000 iterations preserves the sub-agent
    modelOverride under the post-fix patch shape; the same 1000 lose it
    under the pre-fix shape. (Cross-environment VM proof in the original
    comment above still applies — no production code changed in that path.)
  • sessionStartedAt self-heal (PRE-SELF-HEAL vs POST-SELF-HEAL): every
    one of 1000 legacy entries (lacking sessionStartedAt, with a real
    JSONL header on disk) is healed to the recovered timestamp under the
    current branch. The same 1000 legacy entries stayed undefined forever
    under the pre-self-heal patch shape — confirming the regression the
    follow-up commit fixes was real and the recovery candidate is what makes
    the persisted value durable.
  • No-clobber under concurrent write: when a concurrent writer sets a
    fresh sessionStartedAt between the cache load and the store lock, the
    inside-lock guard sees store[primaryKey]?.sessionStartedAt !== undefined
    and skips the recovery candidate. Every one of 1000 iterations preserves
    the concurrent writer's value (actual=1781654400000) over the recovered
    candidate (recovered candidate=1735689600000).

What was not tested: a cross-environment VM repeat of the new self-heal
batches. The original Multipass ai-assistant VM run already covered the
modelOverride race; the self-heal logic is deterministic against the same
disk-backed code paths the modelOverride proof exercises, and the host run
shows zero failures across 1000 iterations on each of the three new
batches. Cross-VM repeat can be added on request.

Fixes #5369

@CodeReclaimers CodeReclaimers force-pushed the fix/5369-subagent-model-override-race branch from 335965e to 103aa39 Compare May 18, 2026 21:20
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 18, 2026
@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. merge-risk: 🚨 message-delivery 🚨 May drop, duplicate, misroute, suppress, or wrongly target messages. and removed rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. labels May 18, 2026
@clawsweeper

clawsweeper Bot commented May 19, 2026

Copy link
Copy Markdown
Contributor

ClawSweeper PR egg

✨ Hatched: 🌱 uncommon Cosmic Shellbean

Hatch command

Comment @clawsweeper hatch when this PR is hatchable.

Hatchability rules:

  • Merged PRs are hatchable.
  • Open PRs are hatchable when they are status: 👀 ready for maintainer look, status: 🚀 automerge armed, or labeled clawsweeper:automerge.
  • Closed unmerged PRs are hatchable only when one of those hatchable labels is still present in the durable record.

Rarity: 🌱 uncommon.
Trait: collects tiny proofs.
Image traits: location proof lagoon; accessory tiny test log scroll; palette amber, ink, and glacier blue; mood mischievous; pose balancing on a branch marker; shell polished stone shell; lighting soft underwater shimmer; background small review tokens.
Share on X: post this hatch
Copy: My PR egg hatched a 🌱 uncommon Cosmic Shellbean in ClawSweeper.

What is this egg doing here?
  • Eggs appear after the PR passes real-behavior proof. It is here for vibes, not verdicts: it does not change labels, ratings, merge decisions, or automation.
  • The shell reacts to review momentum: open follow-up work warms it up, re-review makes it wobble, and a clean final review lets it hatch.
  • Hatchability usually comes from sufficient real-behavior proof, no blocking P0/P1/P2 findings, no security attention needed, and clean correctness. A merged PR is already final, so merge makes the egg hatchable independently.
  • The hatch is seeded from this repository and PR number, so the same PR keeps the same creature; the reviewed head SHA can only change safe visual details.
  • Rarity is just collectible sparkle: 🥚 common, 🌱 uncommon, 💎 rare, ✨ glimmer, and 🌈 legendary.

@CodeReclaimers CodeReclaimers force-pushed the fix/5369-subagent-model-override-race branch from 103aa39 to 3365317 Compare May 20, 2026 00:13
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 20, 2026
@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. and removed rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. labels May 20, 2026
@CodeReclaimers

Copy link
Copy Markdown
Contributor Author

Label audit: compatibility + message-delivery

Audited the two risk labels against the actual diff. Both are appropriately applied as "review carefully" flags, but no real breaks. Two specifics that aren't currently explicit in the PR description and may want a maintainer-visible callout:

1. pluginOwnerId is now sticky on session reuse.

The patch gates pluginOwnerId behind entry === undefined, so it is only set on brand-new session entries and never updated on session reuse. If any current flow relies on the agent handler refreshing pluginOwnerId when a different plugin connects to an existing session, that flow stops working. Ownership updates on an existing session need to go through sessions.pluginPatch instead.

2. Dropping sendPolicy from the patch closes a delivery-suppression race.

Before this PR, a concurrent sessions.patch setting sendPolicy: "deny" could be clobbered back to a stale "allow" by the agent handler's entry?.sendPolicy writeback — letting messages through that should have been suppressed. After this PR, mergeSessionEntry preserves the freshly-set "deny". This is the message-delivery dimension of the same #5369 stale-writeback class.

Routing fields the PR intentionally leaves alone.

route, deliveryContext, lastChannel, lastTo, lastAccountId, lastThreadId, channel, chatType keep their pre-existing entry?.X fallback writeback shape. sessions.patch does not write any of these (verified against src/gateway/sessions-patch.ts), so the specific race window #5369 addresses doesn't apply to them. They have the same conceptual stale-fallback shape but no known concurrent writer — out of scope here, would need a separate change if a writer appeared.

Conflict-resolution note from the rebase onto current main.

route: effectiveDeliveryFields.route from #84xxx (commit 17eab1ed4d, "fix(channels): preserve route metadata on agent updates") was preserved in the conflict resolution. The route value is derived from the stale entry via normalizeSessionDeliveryFields(entry).route, so it has the same conceptual stale-writeback shape as the fields this PR drops — kept because the regression test added in 17eab1e depends on the writeback happening. If a future concurrent writer for route appears, that field will need the same drop-from-patch treatment.

@CodeReclaimers

Copy link
Copy Markdown
Contributor Author

@clawsweeper hatch

@CodeReclaimers

Copy link
Copy Markdown
Contributor Author

CI: build-artifacts failure is unrelated to this PR

The single CI failure (build-artifactscore-support-boundary) is a workflow-inventory snapshot mismatch in scripts/plugin-prerelease-test-plan.test.ts:469. The test expects "cancel-in-progress": false on the concurrency: block of .github/workflows/openclaw-release-checks.yml, but the actual file now contains "${{ startsWith(github.ref, 'refs/heads/tideclaw/alpha/') }}".

Caused by commit 375afbad2d on main ("ci: cancel duplicate Tideclaw alpha release runs"), which updated three release workflows' cancel-in-progress expressions without updating the snapshot expectation in plugin-prerelease-test-plan.test.ts.

This PR touches only src/gateway/server-methods/agent.ts and src/gateway/server-methods/agent.test.ts — nothing under .github/workflows/ or scripts/. The failure reproduces on any PR built against current main. Flagging for maintainer awareness; not in scope to fix here.

Failed job: https://github.com/openclaw/openclaw/actions/runs/26133178044/job/76862531460

@clawsweeper

clawsweeper Bot commented May 20, 2026

Copy link
Copy Markdown
Contributor

🦞👀
ClawSweeper PR egg hatch requested.

I queued a comment sync for this PR. If the egg is hatchable, ClawSweeper will generate the image once and update the existing review comment.
Action: PR egg hatch queued (workflow sweep.yml, event repository_dispatch).
The ASCII egg stays as the fallback.

CodeReclaimers and others added 6 commits May 22, 2026 19:05
)

When building the session-entry patch in the agent handler, the cached
`entry` may be stale relative to the store — e.g. when sessions.patch
writes modelOverride between spawn and run (~3% failure rate for
sub-agent model overrides). mergeSessionEntry spreads
{...freshStore, ...patch}, so any field read from stale `entry` and
placed in the patch clobbers the fresh store value.

Fix by removing blind stale-writebacks from the patch. mergeSessionEntry
preserves anything not in the patch, so fresh modelOverride,
providerOverride, sendPolicy, skillsSnapshot, thinkingLevel,
cliSessionBindings, etc. survive the write.

Fixes openclaw#5369
Ensure no future change silently re-adds a stale-cache field read to the
agent session-entry patch. The original openclaw#5369 test only asserted
modelOverride and providerOverride; this one asserts the whole class
(sendPolicy, skillsSnapshot, thinkingLevel, fastMode, verboseLevel,
traceLevel, reasoningLevel, systemSent, spawnedWorkspaceDir, spawnDepth,
cliSessionIds, cliSessionBindings, claudeCliSessionId) survives a
stale-cache / fresh-store race. Verified to fail when any of those
fields is re-added to nextEntryPatch as `entry?.X`.
Follow-up to openclaw#5369. The previous fix dropped the entry?.sessionStartedAt
writeback to avoid clobbering concurrent sessions.patch writes. That also
removed the implicit self-heal for legacy stores where sessionStartedAt
was never persisted: such entries would keep sessionStartedAt undefined
on every subsequent run instead of recovering it from the transcript
JSONL header.

Restore the self-heal by computing a recovered candidate from
resolveSessionLifecycleTimestamps before taking the store lock, then
writing it into the patch only when the freshly-loaded store still
lacks the field. Concurrent writers that set sessionStartedAt cannot
be clobbered.

Adds two regression tests: one proves the legacy entry is healed
when fresh store also lacks the field, the other proves a fresh
store value wins over the recovered candidate.
The hoisted mocks.resolveSessionLifecycleTimestamps default implementation
returns { sessionStartedAt, lastInteractionAt }, so per-test mockReturnValue
calls also need to match that shape. tsgo:test caught two missing fields.
@steipete

Copy link
Copy Markdown
Contributor

Verification for head 133b8b47182e0a413cf7fb613e05058da924e07f:

Behavior addressed: stale cached agent-session entries no longer overwrite fresher store values during agent session persistence; fresh model/provider overrides, send policy, routing metadata, lifecycle fields, and session rotations win.
Real environment tested: local OpenClaw checkout on macOS; GitHub Actions on PR head.
Exact steps or command run after this patch:

  • pnpm test src/gateway/server-methods/agent.test.ts -- --reporter=verbose
  • env -u OPENCLAW_TESTBOX -u OPENCLAW_TESTBOX_REMOTE_RUN pnpm check:changed
  • /Users/steipete/Projects/agent-scripts/skills/autoreview/scripts/autoreview --mode local
  • GitHub Actions CI run 26304096178
  • GitHub Actions CodeQL run 26304096255
  • GitHub Actions CodeQL Critical Quality run 26304096175
  • GitHub Actions OpenGrep - PR Diff run 26304096261
  • GitHub Actions Real behavior proof run 26304093814
    Evidence after fix: focused gateway test file passed 108 tests locally; changed core/type/lint/import-cycle checks passed locally; autoreview reported no accepted/actionable findings; current PR-head CI, CodeQL, CodeQL Critical Quality, OpenGrep, and Real behavior proof all passed.
    Observed result after fix: session patch construction now uses the locked fresh store row; denied send policy returns without mutating store; fresh session rotation and recovered sessionStartedAt safeguards are covered by regression tests.
    What was not tested: no live multi-process gateway race harness beyond the source-level regression tests and CI shard coverage.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

gateway Gateway runtime merge-risk: 🚨 compatibility 🚨 May break existing users, config, migrations, defaults, or upgrade paths. merge-risk: 🚨 message-delivery 🚨 May drop, duplicate, misroute, suppress, or wrongly target messages. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. P1 High-priority user-facing bug, regression, or broken workflow. proof: supplied External PR includes structured after-fix real behavior proof. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. size: L status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

sessions.patch model override not applied to spawned sub-agents

2 participants