Skip to content

fix(session-file-repair): drop null-role message entries instead of p…#77288

Merged
openperf merged 3 commits intoopenclaw:mainfrom
openperf:fix/77228-repair-drop-null-role-entries
May 5, 2026
Merged

fix(session-file-repair): drop null-role message entries instead of p…#77288
openperf merged 3 commits intoopenclaw:mainfrom
openperf:fix/77228-repair-drop-null-role-entries

Conversation

@openperf
Copy link
Copy Markdown
Member

@openperf openperf commented May 4, 2026

Summary

  • Problem: After the on-disk session-file repair pass runs against a transcript that already accumulated null-role JSONL corruption, the post-repair file (and its .bak-* backup) still contains the corrupt entries — the report in Session corruption: prefill error cascades into provider cooldown + repair makes it worse #77228 saw "935+ entries with null roles (complete structural JSONL corruption)" surviving the repair, with the next provider call failing on a now-different shape error (400 messages: at least one message is required once enough valid entries had been displaced). The reporter's debug line [session-init] session file repair: rewrote 1 assistant message(s), dropped 1 blank user message(s) confirms the repair pass did run on the file but left the null-role entries in place.

  • Root Cause: src/agents/session-file-repair.ts:repairSessionFileIfNeeded (:160-295) iterates persisted JSONL lines, applies two narrow rewrites (isAssistantEntryWithEmptyContent rewrite at :36-58/60-68, repairUserEntryWithBlankTextContent at :75-136), and otherwise pushes the parsed entry to entries[] as-is (:219). The two rewrites only fire on a non-null assistant/user role; a type:"message" envelope whose message.role is null, undefined, or a blank string does not match either rewrite predicate and falls through to the as-is push. The repair then atomically rewrites the file to disk, persisting the null-role entries unchanged. There is no provider router downstream that can do anything useful with a role: null message — every classifier in src/agents/session-transcript-repair.ts, src/agents/cli-runner/session-history.ts, and the provider-runtime replay sanitizers branches on message.role ("user"/"assistant"/"toolResult"/"compactionSummary"); a null-role entry is unconditionally dead weight on the wire.

  • Fix: Add a structural-validity predicate isStructurallyInvalidMessageEntry and consult it inside the parse loop before the rewrite predicates run. When a type:"message" envelope has no message object, has a non-object message, or its message.role is missing, null, or a whitespace-only string, treat the line the same way the existing JSON.parse failure branch treats an unparseable line: increment droppedLines and skip the push to entries[]. The repair output is therefore guaranteed not to contain a null-role type:"message" entry, regardless of what the input file accumulated. Entries of other envelope types (type:"session", type:"summary", type:"custom", anything not "message") are not subject to this check and pass through unchanged — only the message envelope contract is enforced.

  • What changed:

    • src/agents/session-file-repair.ts — add the isStructurallyInvalidMessageEntry helper and wire it into the parse loop ahead of isAssistantEntryWithEmptyContent so the drop decision is made before either rewrite predicate runs. Doc comment records why preservation through repair is wrong and which downstream code branches on message.role.
    • src/agents/session-file-repair.test.ts — add three new tests: (1) drops role: null / missing-role / blank-role type:"message" entries and counts them toward droppedLines, asserting the post-repair file no longer contains "role":null; (2) drops type:"message" envelopes whose message field is missing or non-object; (3) preserves non-"message" envelope types (type:"summary", type:"custom") untouched so the structural check does not over-reach.
    • CHANGELOG.md — single Fixes line under Unreleased referencing the issue with non-closing Refs syntax.
  • What did NOT change (scope boundary):

    • No changes to the existing rewrite predicates (isAssistantEntryWithEmptyContent, repairUserEntryWithBlankTextContent) or to BLANK_USER_FALLBACK_TEXT / STREAM_ERROR_FALLBACK_TEXT. Existing test cases continue to assert the same rewrite behavior — the new structural check runs before those predicates and on different inputs (null-role), so it cannot change the verdict for any input the existing tests exercise.
    • No changes to file-write / atomic-rename / backup logic (:246-258). The repair still writes a .bak-* backup of the original and atomically renames the cleaned file in place.
    • No source-of-null-role investigation or fix. The producer of the null-role entries (suspected to be a write-side path or an external mutation; the reporter observed null roles in both the active .jsonl and the older .reset.<iso> archive, suggesting the corruption pre-dates the repair pass) is out of scope for this PR. Defending the repair output from preserving the corruption is the narrowly-scoped, surgically defensible fix; locating the producer can land in a separate follow-up.
    • No changes to the cascading auth-profile cooldown that this corrupted session can trigger upstream of the repair pass, and no changes to the trailing prefill-placeholder shape that originally tripped the provider 400. Each is a separate failure mode in the same issue and gets its own narrowly-scoped PR.

Reproduction

  1. Construct (or copy from a real-world reproducer) a session JSONL whose tail contains one or more type:"message" entries shaped like {"type":"message","id":"...","timestamp":"...","message":{"role":null,"content":"..."}}. The report's reproducer happens organically after a series of failures cascading off a prefill 400; for a deterministic reproducer in unit-test space, write the entries directly with JSON.stringify.
  2. Add a session header line as the first entry ({"type":"session","id":"...","version":7,"timestamp":"...","cwd":"..."}) and a normal role:"user" message somewhere in the body so the repair pass cannot bail on the entries.length === 0 / invalid session header early-returns.
  3. Call repairSessionFileIfNeeded({ sessionFile }).
  4. Without this fix: the post-repair file still contains every null-role entry, just re-serialized through JSON.stringify. result.droppedLines reflects only unparseable JSON lines; the null-role count is hidden, and the cleaned transcript is structurally identical (modulo the rewrite passes) to the original — including the corruption.
  5. With this fix: each null-role entry counts toward result.droppedLines and is absent from the post-repair file. The structural check fires before the rewrite predicates, so the existing rewrite paths are unaffected for any non-null-role input.

Risk / Mitigation

  • Risk 1 — over-broad role validation: Could the new check reject a legitimate envelope type whose top-level shape is also type:"message" but whose role lives somewhere else? No. The check only fires when entry.type === "message". Other envelope types ("session", "summary", "custom", "model-snapshot", etc.) skip the check entirely. The "preserves non-message envelope types" test locks this boundary, including a type:"custom" model-snapshot entry shape that is observed in production replay history.
    Mitigation: explicit positive test for type:"summary" and type:"custom"; explicit negative tests for role: null, missing role, and blank-string role; result.droppedLines is incremented in lock-step so any drop is observable through the existing repair report and debug log.
  • Risk 2 — losing recoverable data: Could a null-role entry have carried information worth keeping? No: every provider router branches on message.role and would silently drop the entry on the wire anyway. The only "value" of preserving null-role entries is keeping the original byte payload around for debugging — and the existing .bak-* backup written by the repair pass already preserves that byte payload verbatim before any drop decisions are made.
    Mitigation: the existing backup-then-rewrite atomic-rename flow at :246-258 is unchanged; original bytes remain available off-line under the .bak-* file for postmortem.
  • Risk 3 — interaction with the existing rewrite predicates: Could the new check race with the empty-assistant-error rewrite or the blank-user rewrite and produce a different verdict? No. The new check fires first in the parse loop. A null-role entry never reaches isAssistantEntryWithEmptyContent (which bails immediately because message.role !== "assistant") nor repairUserEntryWithBlankTextContent (which is gated on message.role === "user" at the call site :206). For any non-null-role entry, the new check is a no-op and the existing predicates run unchanged.
    Mitigation: existing rewrite tests at :101-281 continue to pass without modification.
  • Risk 4 — entries.length === 0 after the drop: If the repair drops all message entries (only the session header survives), the existing entries.length === 0 check at :225-227 already returns repaired: false with reason: "empty session file" — this is correct behavior and matches the existing contract: an empty transcript is reported, the original file is left untouched. The fix does not introduce a new "successfully wrote an empty transcript" failure mode.
    Mitigation: covered by existing assertions on the early-return branches.

Change Type (select all)

  • Bug fix

Scope (select all touched areas)

  • Agents (session-file-repair on-disk pass)
  • Tests (session-file-repair.test.ts)
  • Changelog (Unreleased Fixes entry)

Linked Issue/PR

Refs #77228 — addresses the auto-repair amplifier only (the repair pass no longer preserves null-role JSONL corruption from the input). The cascading auth-profile cooldown amplifier (a single format 400 currently poisons the whole auth profile across sessions) and the trailing prefill-placeholder root cause (the stream-error sentinel ends up as the trailing message and triggers a prefill-strict 400) are independent failure modes in the same issue and are out of scope here so each can ship as its own narrowly-scoped PR.

@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: S maintainer Maintainer-authored PR labels May 4, 2026
@clawsweeper
Copy link
Copy Markdown
Contributor

clawsweeper Bot commented May 4, 2026

Codex review: needs maintainer review before merge.

Summary
The PR drops malformed persisted type: "message" entries with missing, null, or blank roles during session-file repair, adds regression tests, and adds an Unreleased changelog entry.

Reproducibility: yes. Source inspection on current main shows a valid session header plus a type: "message" entry with message.role: null parses and falls through to entries.push(entry) unchanged; I did not execute the test because this was a read-only review.

Next step before merge
No repair lane is needed; the previous changelog blocker is fixed and the remaining action is normal maintainer/CI handling for a protected PR.

Security
Cleared: The diff only changes local session JSONL repair filtering, colocated tests, and changelog text, with no dependency, CI, credential, release, or code-execution surface changes.

Review details

Best possible solution:

Land this narrow repair after standard CI and maintainer approval, while keeping #77228 open for the separate provider cooldown and prefill-placeholder failure modes.

Do we have a high-confidence way to reproduce the issue?

Yes. Source inspection on current main shows a valid session header plus a type: "message" entry with message.role: null parses and falls through to entries.push(entry) unchanged; I did not execute the test because this was a read-only review.

Is this the best way to solve the issue?

Yes. Dropping structurally invalid persisted message envelopes before existing rewrites is the narrowest maintainable fix, and the unchanged backup flow preserves original bytes for postmortem.

What I checked:

  • Current main preserves null-role messages: On current main, repairSessionFileIfNeeded parses JSONL, applies only the empty-assistant and blank-user rewrites, then pushes all other parsed entries unchanged; a type: "message" entry with message.role: null reaches entries.push(entry). (src/agents/session-file-repair.ts:195, 89db1e5440f5)
  • PR filters invalid message envelopes before rewrites: The PR adds isStructurallyInvalidMessageEntry and calls it before the existing rewrite predicates, incrementing droppedLines and skipping invalid type: "message" records. (src/agents/session-file-repair.ts:33, c944912db763)
  • Regression coverage added: The PR adds tests for null, missing, and blank roles; missing or non-object message; and preserving non-message envelope types unchanged. (src/agents/session-file-repair.test.ts:580, c944912db763)
  • Changelog blocker fixed: The latest PR diff includes an Unreleased Fixes entry for the session-file-repair null-role cleanup, so the earlier ClawSweeper changelog finding no longer applies. (CHANGELOG.md:310, c944912db763)
  • Docs support repair boundary: Transcript hygiene docs describe session-file repair as a pre-load disk rewrite path for malformed durable records and note that original files are backed up when repair occurs. Public docs: docs/reference/transcript-hygiene.md. (docs/reference/transcript-hygiene.md:15, 89db1e5440f5)
  • Protected PR handling applies: Provided GitHub context reports author association MEMBER and labels agents, maintainer, and size: S, so this cleanup review should not close the PR. (c944912db763)

Likely related people:

  • openperf: Authored merged Bedrock/session repair work that added empty assistant content repair and transcript hygiene docs in the same module area, and current PR builds on that repair path. (role: prior adjacent repair contributor; confidence: high; commits: 930d81aa41d2; files: src/agents/session-file-repair.ts, src/agents/session-file-repair.test.ts, docs/reference/transcript-hygiene.md)
  • amknight: Merged session-file repair work for blank user rewrites and trailing assistant repair behavior that this PR composes with in the same parse loop. (role: introduced related behavior; confidence: high; commits: 524528944f05; files: src/agents/session-file-repair.ts, src/agents/session-file-repair.test.ts)
  • obviyus: Recently merged the preservation change for delivered assistant replies in session repair and added adjacent coverage/docs. (role: recent maintainer of adjacent behavior; confidence: medium; commits: e8756d99aef0; files: src/agents/session-file-repair.ts, src/agents/session-file-repair.test.ts, docs/reference/transcript-hygiene.md)
  • steipete: Recent commit history on the same file includes session repair logging/maintenance and broader maintainer ownership of this area. (role: recent maintainer; confidence: medium; commits: 929df0f5567f; files: src/agents/session-file-repair.ts, src/agents/session-file-repair.test.ts)

Remaining risk / open question:

  • Tests and changed-lane gates still need normal CI or maintainer-run validation because this review was read-only.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 89db1e5440f5.

@openperf openperf force-pushed the fix/77228-repair-drop-null-role-entries branch from 1363562 to cd06dbe Compare May 4, 2026 14:23
@openperf openperf force-pushed the fix/77228-repair-drop-null-role-entries branch from 0df7035 to e2b58a5 Compare May 5, 2026 06:39
openperf added 3 commits May 5, 2026 14:41
…reserving them

A type:"message" entry whose role is missing, null, or blank cannot be
replayed to any provider — every provider router branches on
message.role — and preserving it through the on-disk repair pass just
relocates the corruption from the original file into the post-repair
file. Treat such entries the same way as unparseable JSONL: drop during
repair and count toward droppedLines so the cleaned transcript no longer
carries them.

Refs openclaw#77228 — addresses the auto-repair amplifier only.
The cascading auth-profile cooldown amplifier and the trailing prefill
placeholder root cause are separate failure modes and remain open.
@openperf openperf force-pushed the fix/77228-repair-drop-null-role-entries branch from e2b58a5 to 2560a79 Compare May 5, 2026 06:45
@openperf openperf merged commit 043cb32 into openclaw:main May 5, 2026
102 checks passed
@openperf
Copy link
Copy Markdown
Member Author

openperf commented May 5, 2026

Landed as squash commit 043cb32aab7dfd3747552b7bb707949cdc613de7 on main.

vincentkoc added a commit to VintageAyu/openclaw that referenced this pull request May 5, 2026
…ainer-hardening

* origin/main: (843 commits)
  docs(changelog): relocate openclaw#77046 and openclaw#77280 entries from 2026.5.3 to Unreleased (openclaw#77728)
  docs: reorder unreleased changelog
  fix: expose ollama thinking profile before activation (openclaw#77617) (thanks @yfge)
  fix: expose ollama thinking profile before activation
  test(gateway): preserve dispatch timers in waiter
  test(gateway): keep startup context timer live
  docs: document cache-friendly activity helper
  ci: install ffmpeg for Mantis media previews
  fix: avoid impossible device token rotation advice (openclaw#77688) (thanks @Conan-Scott)
  docs(changelog): note doctor device pairing advice fix
  fix(doctor): avoid impossible device token rotation advice
  ci: use Crabbox media previews for Mantis
  docs: filter maintainer-owned triage noise
  test: cover GitHub activity helper
  fix(session-file-repair): drop null-role message entries instead of preserving them (openclaw#77288)
  fix: prune orphan session artifacts
  perf: reduce GitHub activity cache misses
  fix: cache session list model resolution (openclaw#77650) (thanks @ragesaq)
  ci: embed Mantis desktop previews
  fix(replay-history): drop trailing stream-error placeholder before provider send (openclaw#77287)
  ...

# Conflicts:
#	CHANGELOG.md
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 9, 2026
…reserving them (openclaw#77288)

type:"message" entries with a null, missing, or blank role cannot be
replayed to any provider — every router branches on message.role. The
auto-repair pass was passing them through unchanged, relocating the
corruption from the original file into the post-repair file (openclaw#77228
reported 935+ null-role entries surviving the pass).

Add isStructurallyInvalidMessageEntry ahead of the existing rewrite
predicates. Invalid message envelopes are counted as droppedLines and
skipped; non-message envelope types (summary, custom, …) are unaffected.
The .bak-* backup preserves the original bytes for postmortem before any
entries are dropped.

Tests:
- pnpm test src/agents/session-file-repair.test.ts
- pnpm exec oxfmt --check --threads=1 CHANGELOG.md src/agents/session-file-repair.ts src/agents/session-file-repair.test.ts
- pnpm check:changed

Refs openclaw#77228
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling maintainer Maintainer-authored PR size: S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant