## Bug Summary

A single `400 This model does not support assistant message prefill` error from the LLM provider cascades into a full provider cooldown, leaving the agent completely unresponsive. The session-file auto-repair mechanism then corrupts the transcript further, forcing the user to start a new session manually.
## Environment

- OpenClaw 4.29, Linux 6.17.0-1011-azure (x64)
- Provider: github-copilot / claude-opus-4.6
- Channel: WhatsApp
## Steps to Reproduce

- Agent is mid-conversation with active tool calls
- A tool call fails, producing an `[assistant turn failed before producing content]` placeholder
- A blank/empty user message follows (possibly from WhatsApp inbound during the failed turn)
- This creates an invalid message sequence: the conversation now ends with an assistant message (the failed placeholder) rather than a user message
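The invalid sequence described above could be caught before the request ever reaches the provider. Below is a minimal sketch of that idea; `Message`, `PLACEHOLDER`, and `sanitizeForProvider` are illustrative names, not OpenClaw's actual internals.

```typescript
type Role = "user" | "assistant";

interface Message {
  role: Role;
  content: string;
}

const PLACEHOLDER = "[assistant turn failed before producing content]";

function sanitizeForProvider(messages: Message[]): Message[] {
  // Drop failed-turn placeholders and blank user messages outright.
  const cleaned = messages.filter(
    (m) => m.content !== PLACEHOLDER && m.content.trim() !== ""
  );
  // The provider requires the conversation to end with a user message,
  // so trim any trailing assistant turns.
  while (cleaned.length > 0 && cleaned[cleaned.length - 1].role !== "user") {
    cleaned.pop();
  }
  return cleaned;
}
```

Applied to the repro sequence above, this would remove both the placeholder and the blank user message, leaving a valid user-terminated conversation.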
## What Happens

### Phase 1: Prefill error (10:59:27 CEST)

    [agent/embedded] embedded run agent end: isError=true error=LLM request failed: provider rejected the request schema or tool payload.
    rawError=400 This model does not support assistant message prefill. The conversation must end with a user message.
### Phase 2: Provider cooldown cascade

The format error puts the entire provider into cooldown:

    [model-fallback/decision] decision=candidate_failed reason=format
    [model-fallback/decision] decision=skip_candidate reason=format detail=Provider github-copilot is in cooldown (all profiles unavailable)
    Embedded agent failed before reply: All models failed (1): github-copilot/claude-opus-4.6: Provider github-copilot is in cooldown
Every subsequent user message for the next ~42 minutes hits the same cooldown wall. The agent cannot respond at all.
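The cascade happens because every error class appears to trigger the same provider-wide cooldown. A minimal sketch of an error-class-aware policy is below; the error classes, durations, and `cooldownFor` are assumptions for illustration, not OpenClaw's real fallback logic.

```typescript
type ErrorClass = "format" | "rate_limit" | "auth" | "server";

interface Cooldown {
  scope: "session" | "provider";
  seconds: number;
}

function cooldownFor(err: ErrorClass): Cooldown {
  switch (err) {
    case "format":
      // A malformed transcript is specific to one session; a brief,
      // session-scoped backoff avoids blocking every other conversation.
      return { scope: "session", seconds: 5 };
    case "rate_limit":
      return { scope: "provider", seconds: 60 };
    case "auth":
      return { scope: "provider", seconds: 300 };
    case "server":
      return { scope: "provider", seconds: 30 };
  }
}
```

Under a policy like this, the `reason=format` failure seen in the logs would never place github-copilot itself into cooldown.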
### Phase 3: Session repair makes it worse (11:41:15)

    [session-init] session file repair: rewrote 1 assistant message(s), dropped 1 blank user message(s)

After repair, the error changes to:

    rawError=400 messages: at least one message is required

The transcript is now fully corrupted: both the `.reset` and `.bak` files contain 935+ entries with null roles (complete structural JSONL corruption).
## Expected Behavior

- A prefill/format error on one request should NOT put the entire provider into a long-term cooldown
- Any cooldown should either be very short or scoped to the affected session, not block all sessions
- Session file repair should never leave the transcript in a worse state than it started in
- If a transcript is irrecoverable, the system should auto-create a fresh session rather than fail repeatedly
## Actual Behavior

- A single format error → 42+ minutes of complete agent unresponsiveness
- Auto-repair corrupts the transcript further
- The user had to start a new session manually
## Root Cause Analysis

The core issue appears to be that `[assistant turn failed before producing content]` placeholder messages create invalid message sequences. Combined with blank/dropped user messages, the conversation violates the provider's constraint that it must end with a user message. The cooldown mechanism then amplifies a single-request format error into a prolonged outage.
## Suggested Fixes

- **Short cooldown for format errors:** Format errors are session-specific, not provider-wide. The cooldown should last seconds, not minutes, and be scoped to the session.
- **Safer transcript repair:** Validate the repaired transcript before committing it. If repair produces an invalid state, fall back to creating a fresh session.
- **Handle `[assistant turn failed]` placeholders:** Clean these from the transcript before sending it to the provider, or replace them with a valid assistant message.
- **Auto-recovery:** If a session is stuck in repeated format errors, offer to reset it automatically rather than failing silently for 40+ minutes.
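The "safer transcript repair" fix above could be sketched as validate-before-commit. `Entry`, `isValidTranscript`, and `commitRepair` are hypothetical names for illustration only.

```typescript
interface Entry {
  role: string | null;
  content: string;
}

function isValidTranscript(entries: Entry[]): boolean {
  // Reject empty transcripts, null roles, and conversations that do not
  // end with a user message (the provider constraint from this report).
  return (
    entries.length > 0 &&
    entries.every((e) => e.role === "user" || e.role === "assistant") &&
    entries[entries.length - 1].role === "user"
  );
}

function commitRepair(repaired: Entry[]): Entry[] {
  if (isValidTranscript(repaired)) {
    return repaired;
  }
  // Repair made things worse (e.g. the null-role entries observed in this
  // bug): start a fresh session instead of persisting a corrupted file.
  return [];
}
```

With this gate in place, the `935+ entries with null roles` state from Phase 3 would never be written back to disk.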
## Log References

- Session ID: `229feaa0-2692-401c-a828-66939bf80acc`
- Failed run IDs: `66a98a07`, `3ef84b65` (and several in between)
- Corrupted files: `.jsonl.reset.2026-05-04T09-42-09.905Z`, `.jsonl.bak-59249-*`