Session corruption: prefill error cascades into provider cooldown + repair makes it worse #77228

@altierac

Bug Summary

A single 400 "This model does not support assistant message prefill" error from the LLM provider cascades into a full provider cooldown, leaving the agent completely unresponsive. The session file auto-repair mechanism then corrupts the transcript further, so the only way out is to start a new session manually.

Environment

  • OpenClaw 4.29, Linux 6.17.0-1011-azure (x64)
  • Provider: github-copilot / claude-opus-4.6
  • Channel: WhatsApp

Steps to Reproduce

  1. Agent is mid-conversation with active tool calls
  2. A tool call fails, producing an [assistant turn failed before producing content] placeholder
  3. A blank/empty user message follows (possibly from WhatsApp inbound during the failed turn)
  4. The result is an invalid message sequence: the blank user message appears to be dropped, so the conversation ends with an assistant message (the failed placeholder) rather than a user message

What Happens

Phase 1: Prefill error (10:59:27 CEST)

[agent/embedded] embedded run agent end: isError=true error=LLM request failed: provider rejected the request schema or tool payload.
rawError=400 This model does not support assistant message prefill. The conversation must end with a user message.

Phase 2: Provider cooldown cascade

The format error puts the entire provider into cooldown:

[model-fallback/decision] decision=candidate_failed reason=format
[model-fallback/decision] decision=skip_candidate reason=format detail=Provider github-copilot is in cooldown (all profiles unavailable)
Embedded agent failed before reply: All models failed (1): github-copilot/claude-opus-4.6: Provider github-copilot is in cooldown

Every subsequent user message for the next ~42 minutes hits the same cooldown wall. The agent cannot respond at all.

Phase 3: Session repair makes it worse (11:41:15)

[session-init] session file repair: rewrote 1 assistant message(s), dropped 1 blank user message(s)

After repair, the error changes to:

rawError=400 messages: at least one message is required

The transcript is now fully corrupted: both the .reset and .bak files contain 935+ entries with null roles (complete structural JSONL corruption).
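
For anyone who wants to check their own session files for the same damage, something like the following could tally the null-role entries. This is only a sketch: it assumes each JSONL line is a JSON object with a top-level "role" key, which may not match the real OpenClaw transcript schema, and count_null_roles is a hypothetical helper name, not an OpenClaw function.

```python
import json
from typing import Dict, Iterable


def count_null_roles(lines: Iterable[str]) -> Dict[str, int]:
    """Tally transcript lines by health.

    Assumption: every record is a JSON object with a top-level "role" key.
    A record whose role is null or missing counts as "null_role".
    """
    stats = {"total": 0, "null_role": 0, "unparseable": 0}
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # ignore blank lines between records
        stats["total"] += 1
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            stats["unparseable"] += 1
            continue
        if not isinstance(entry, dict) or entry.get("role") is None:
            stats["null_role"] += 1
    return stats
```

An open file object can be passed directly, since iterating a file yields its lines.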

Expected Behavior

  1. A prefill/format error on one request should NOT put the entire provider into long-term cooldown
  2. The cooldown should either be very short or only apply to that specific session, not block all sessions
  3. Session file repair should not produce a worse state than what it started with
  4. If a transcript is irrecoverable, the system should auto-create a fresh session rather than repeatedly failing
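
To make points 1 and 2 concrete, here is a rough sketch of a cooldown tracker that scopes format failures to the session that caused them. Everything in it is hypothetical: the class, the method names, and both durations are illustrative and not taken from OpenClaw. Only the "format" reason string comes from the model-fallback log lines above; "rate_limit" in the usage note is an invented stand-in for a provider-health failure.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Illustrative durations only; the report asks for "seconds, not minutes".
FORMAT_COOLDOWN_S = 5.0
PROVIDER_COOLDOWN_S = 600.0


@dataclass
class CooldownTracker:
    # Key is (provider, session_id); session_id None means provider-wide.
    _until: Dict[Tuple[str, Optional[str]], float] = field(default_factory=dict)

    def record_failure(self, provider: str, session_id: str, reason: str, now: float) -> None:
        if reason == "format":
            # A malformed request says nothing about provider health,
            # so only the offending session is paused, and only briefly.
            self._until[(provider, session_id)] = now + FORMAT_COOLDOWN_S
        else:
            # Other failures are treated as provider-wide in this sketch.
            self._until[(provider, None)] = now + PROVIDER_COOLDOWN_S

    def is_available(self, provider: str, session_id: str, now: float) -> bool:
        provider_wide = self._until.get((provider, None), 0.0)
        per_session = self._until.get((provider, session_id), 0.0)
        return now >= max(provider_wide, per_session)
```

With this shape, a format failure recorded for one session leaves every other session on the same provider available, while a failure with any other reason (for example "rate_limit") still blocks all of them.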

Actual Behavior

  • Single format error → 42+ minutes of complete agent unresponsiveness
  • Auto-repair corrupts the transcript further
  • User had to manually start a new session

Root Cause Analysis

The core issue appears to be that [assistant turn failed before producing content] placeholder messages create invalid message sequences. When combined with blank/dropped user messages, the conversation violates the provider's constraint that it must end with a user message. The cooldown mechanism then amplifies a single-request format error into a prolonged outage.
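
A pre-send check along these lines might catch the invalid sequence before it reaches the provider. It is a sketch under assumptions: messages are modelled as dicts with string "role" and "content" fields (real transcripts may use structured content blocks), and sanitize_for_provider and ends_with_user are hypothetical names. The placeholder string is the one quoted in this report.

```python
from typing import Dict, List

PLACEHOLDER = "[assistant turn failed before producing content]"


def sanitize_for_provider(messages: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop failed-turn placeholders and blank user messages before sending."""
    cleaned = []
    for msg in messages:
        content = (msg.get("content") or "").strip()
        if msg.get("role") == "assistant" and content == PLACEHOLDER:
            continue  # the placeholder carries no model output
        if msg.get("role") == "user" and not content:
            continue  # blank inbound message
        cleaned.append(msg)
    return cleaned


def ends_with_user(messages: List[Dict[str, str]]) -> bool:
    """True only for a non-empty list whose last message has role "user"."""
    return bool(messages) and messages[-1].get("role") == "user"
```

Dropping only the blank user message and keeping the placeholder leaves an assistant message last, which is the shape the provider rejects. Dropping both lets the previous real user message become the tail. One caveat: if the failed turn sits between a tool call and its result, removing it could orphan the tool result, so a real implementation would need to handle that pairing as well.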

Suggested Fixes

  1. Short cooldown for format errors: Format errors are session-specific, not provider-wide issues. Cooldown should be seconds, not minutes, and scoped to the session.
  2. Safer transcript repair: Validate the repaired transcript before committing. If repair produces an invalid state, fall back to creating a fresh session.
  3. Handle [assistant turn failed] placeholders: These should be cleaned from the transcript before sending to the provider, or replaced with a valid assistant message.
  4. Auto-recovery: If a session is stuck in repeated format errors, offer to reset it automatically rather than failing silently for 40+ minutes.
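
Fix 2 could look roughly like the sketch below: validate the repaired transcript first, write it to a temporary file, and only then swap it into place, so a bad repair never overwrites the original. The validity rule (non-empty, every entry has a string role) is an assumption derived from the two failure modes in this report, and is_structurally_valid and commit_repair are hypothetical names rather than OpenClaw functions.

```python
import json
import os
import tempfile
from typing import Dict, List

Message = Dict[str, object]


def is_structurally_valid(messages: List[Message]) -> bool:
    """Reject the two failure modes seen here: empty transcript, null roles."""
    if not messages:
        return False  # would trigger "at least one message is required"
    return all(isinstance(m.get("role"), str) and m.get("role") for m in messages)


def commit_repair(path: str, repaired: List[Message]) -> bool:
    """Write the repaired transcript only if it validates.

    Returns False without touching the original file when the repair is
    invalid, which the caller can treat as "start a fresh session".
    """
    if not is_structurally_valid(repaired):
        return False
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            for msg in repaired:
                tmp.write(json.dumps(msg) + "\n")
        os.replace(tmp_path, path)  # atomic swap on the same filesystem
    except BaseException:
        os.unlink(tmp_path)  # clean up the partial temp file
        raise
    return True
```

The temporary file is created in the same directory as the transcript so that os.replace stays on one filesystem. Fix 4 would then hang off the False return value: instead of retrying the same broken transcript, the caller opens a new session.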

Log References

  • Session ID: 229feaa0-2692-401c-a828-66939bf80acc
  • Failed run IDs: 66a98a07, 3ef84b65 (and several in between)
  • Corrupted files: .jsonl.reset.2026-05-04T09-42-09.905Z, .jsonl.bak-59249-*
