fix(session-file-repair): drop null-role message entries instead of p…#77288
Conversation
|
Codex review: needs maintainer review before merge. Summary Reproducibility: yes. Source inspection on current main shows a valid session header plus a Next step before merge Security Review detailsBest possible solution: Land this narrow repair after standard CI and maintainer approval, while keeping #77228 open for the separate provider cooldown and prefill-placeholder failure modes. Do we have a high-confidence way to reproduce the issue? Yes. Source inspection on current main shows a valid session header plus a Is this the best way to solve the issue? Yes. Dropping structurally invalid persisted message envelopes before existing rewrites is the narrowest maintainable fix, and the unchanged backup flow preserves original bytes for postmortem. What I checked:
Likely related people:
Remaining risk / open question:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 89db1e5440f5. |
1363562 to
cd06dbe
Compare
0df7035 to
e2b58a5
Compare
…reserving them A type:"message" entry whose role is missing, null, or blank cannot be replayed to any provider — every provider router branches on message.role — and preserving it through the on-disk repair pass just relocates the corruption from the original file into the post-repair file. Treat such entries the same way as unparseable JSONL: drop during repair and count toward droppedLines so the cleaned transcript no longer carries them. Refs openclaw#77228 — addresses the auto-repair amplifier only. The cascading auth-profile cooldown amplifier and the trailing prefill placeholder root cause are separate failure modes and remain open.
e2b58a5 to
2560a79
Compare
|
Landed as squash commit |
…ainer-hardening * origin/main: (843 commits) docs(changelog): relocate openclaw#77046 and openclaw#77280 entries from 2026.5.3 to Unreleased (openclaw#77728) docs: reorder unreleased changelog fix: expose ollama thinking profile before activation (openclaw#77617) (thanks @yfge) fix: expose ollama thinking profile before activation test(gateway): preserve dispatch timers in waiter test(gateway): keep startup context timer live docs: document cache-friendly activity helper ci: install ffmpeg for Mantis media previews fix: avoid impossible device token rotation advice (openclaw#77688) (thanks @Conan-Scott) docs(changelog): note doctor device pairing advice fix fix(doctor): avoid impossible device token rotation advice ci: use Crabbox media previews for Mantis docs: filter maintainer-owned triage noise test: cover GitHub activity helper fix(session-file-repair): drop null-role message entries instead of preserving them (openclaw#77288) fix: prune orphan session artifacts perf: reduce GitHub activity cache misses fix: cache session list model resolution (openclaw#77650) (thanks @ragesaq) ci: embed Mantis desktop previews fix(replay-history): drop trailing stream-error placeholder before provider send (openclaw#77287) ... # Conflicts: # CHANGELOG.md
…reserving them (openclaw#77288) type:"message" entries with a null, missing, or blank role cannot be replayed to any provider — every router branches on message.role. The auto-repair pass was passing them through unchanged, relocating the corruption from the original file into the post-repair file (openclaw#77228 reported 935+ null-role entries surviving the pass). Add isStructurallyInvalidMessageEntry ahead of the existing rewrite predicates. Invalid message envelopes are counted as droppedLines and skipped; non-message envelope types (summary, custom, …) are unaffected. The .bak-* backup preserves the original bytes for postmortem before any entries are dropped. Tests: - pnpm test src/agents/session-file-repair.test.ts - pnpm exec oxfmt --check --threads=1 CHANGELOG.md src/agents/session-file-repair.ts src/agents/session-file-repair.test.ts - pnpm check:changed Refs openclaw#77228
Summary
Problem: After the on-disk session-file repair pass runs against a transcript that already accumulated null-role JSONL corruption, the post-repair file (and its
.bak-*backup) still contains the corrupt entries — the report in Session corruption: prefill error cascades into provider cooldown + repair makes it worse #77228 saw "935+ entries with null roles (complete structural JSONL corruption)" surviving the repair, with the next provider call failing on a now-different shape error (400 messages: at least one message is requiredonce enough valid entries had been displaced). The reporter's debug line[session-init] session file repair: rewrote 1 assistant message(s), dropped 1 blank user message(s)confirms the repair pass did run on the file but left the null-role entries in place.Root Cause:
src/agents/session-file-repair.ts:repairSessionFileIfNeeded(:160-295) iterates persisted JSONL lines, applies two narrow rewrites (isAssistantEntryWithEmptyContentrewrite at:36-58/60-68,repairUserEntryWithBlankTextContentat:75-136), and otherwise pushes the parsed entry toentries[]as-is (:219). The two rewrites only fire on a non-nullassistant/userrole; atype:"message"envelope whosemessage.roleisnull,undefined, or a blank string does not match either rewrite predicate and falls through to the as-is push. The repair then atomically rewrites the file to disk, persisting the null-role entries unchanged. There is no provider router downstream that can do anything useful with arole: nullmessage — every classifier insrc/agents/session-transcript-repair.ts,src/agents/cli-runner/session-history.ts, and the provider-runtime replay sanitizers branches onmessage.role("user"/"assistant"/"toolResult"/"compactionSummary"); a null-role entry is unconditionally dead weight on the wire.Fix: Add a structural-validity predicate
isStructurallyInvalidMessageEntryand consult it inside the parse loop before the rewrite predicates run. When atype:"message"envelope has nomessageobject, has a non-objectmessage, or itsmessage.roleis missing,null, or a whitespace-only string, treat the line the same way the existingJSON.parsefailure branch treats an unparseable line: incrementdroppedLinesand skip the push toentries[]. The repair output is therefore guaranteed not to contain a null-roletype:"message"entry, regardless of what the input file accumulated. Entries of other envelope types (type:"session",type:"summary",type:"custom", anything not"message") are not subject to this check and pass through unchanged — only the message envelope contract is enforced.What changed:
src/agents/session-file-repair.ts— add theisStructurallyInvalidMessageEntryhelper and wire it into the parse loop ahead ofisAssistantEntryWithEmptyContentso the drop decision is made before either rewrite predicate runs. Doc comment records why preservation through repair is wrong and which downstream code branches onmessage.role.src/agents/session-file-repair.test.ts— add three new tests: (1) dropsrole: null/ missing-role / blank-roletype:"message"entries and counts them towarddroppedLines, asserting the post-repair file no longer contains"role":null; (2) dropstype:"message"envelopes whosemessagefield is missing or non-object; (3) preserves non-"message"envelope types (type:"summary",type:"custom") untouched so the structural check does not over-reach.CHANGELOG.md— single Fixes line under Unreleased referencing the issue with non-closingRefssyntax.What did NOT change (scope boundary):
isAssistantEntryWithEmptyContent,repairUserEntryWithBlankTextContent) or toBLANK_USER_FALLBACK_TEXT/STREAM_ERROR_FALLBACK_TEXT. Existing test cases continue to assert the same rewrite behavior — the new structural check runs before those predicates and on different inputs (null-role), so it cannot change the verdict for any input the existing tests exercise.:246-258). The repair still writes a.bak-*backup of the original and atomically renames the cleaned file in place..jsonland the older.reset.<iso>archive, suggesting the corruption pre-dates the repair pass) is out of scope for this PR. Defending the repair output from preserving the corruption is the narrowly-scoped, surgically defensible fix; locating the producer can land in a separate follow-up.Reproduction
type:"message"entries shaped like{"type":"message","id":"...","timestamp":"...","message":{"role":null,"content":"..."}}. The report's reproducer happens organically after a series of failures cascading off a prefill 400; for a deterministic reproducer in unit-test space, write the entries directly withJSON.stringify.{"type":"session","id":"...","version":7,"timestamp":"...","cwd":"..."}) and a normalrole:"user"message somewhere in the body so the repair pass cannot bail on theentries.length === 0/invalid session headerearly-returns.repairSessionFileIfNeeded({ sessionFile }).JSON.stringify.result.droppedLinesreflects only unparseable JSON lines; the null-role count is hidden, and the cleaned transcript is structurally identical (modulo the rewrite passes) to the original — including the corruption.result.droppedLinesand is absent from the post-repair file. The structural check fires before the rewrite predicates, so the existing rewrite paths are unaffected for any non-null-role input.Risk / Mitigation
type:"message"but whose role lives somewhere else? No. The check only fires whenentry.type === "message". Other envelope types ("session","summary","custom","model-snapshot", etc.) skip the check entirely. The "preserves non-messageenvelope types" test locks this boundary, including atype:"custom"model-snapshotentry shape that is observed in production replay history.Mitigation: explicit positive test for
type:"summary"andtype:"custom"; explicit negative tests forrole: null, missing role, and blank-string role;result.droppedLinesis incremented in lock-step so any drop is observable through the existing repair report and debug log.message.roleand would silently drop the entry on the wire anyway. The only "value" of preserving null-role entries is keeping the original byte payload around for debugging — and the existing.bak-*backup written by the repair pass already preserves that byte payload verbatim before any drop decisions are made.Mitigation: the existing backup-then-rewrite atomic-rename flow at
:246-258is unchanged; original bytes remain available off-line under the.bak-*file for postmortem.isAssistantEntryWithEmptyContent(which bails immediately becausemessage.role !== "assistant") norrepairUserEntryWithBlankTextContent(which is gated onmessage.role === "user"at the call site:206). For any non-null-role entry, the new check is a no-op and the existing predicates run unchanged.Mitigation: existing rewrite tests at
:101-281continue to pass without modification.entries.length === 0after the drop: If the repair drops all message entries (only the session header survives), the existingentries.length === 0check at:225-227already returnsrepaired: falsewithreason: "empty session file"— this is correct behavior and matches the existing contract: an empty transcript is reported, the original file is left untouched. The fix does not introduce a new "successfully wrote an empty transcript" failure mode.Mitigation: covered by existing assertions on the early-return branches.
Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
Refs #77228 — addresses the auto-repair amplifier only (the repair pass no longer preserves null-role JSONL corruption from the input). The cascading auth-profile cooldown amplifier (a single
format400 currently poisons the whole auth profile across sessions) and the trailing prefill-placeholder root cause (the stream-error sentinel ends up as the trailing message and triggers a prefill-strict 400) are independent failure modes in the same issue and are out of scope here so each can ship as its own narrowly-scoped PR.