Session corruption: prefill error cascades into provider cooldown + repair makes it worse

## Bug Summary

A single `400 This model does not support assistant message prefill` error from the LLM provider cascades into a full provider cooldown, making the agent completely unresponsive. The session file auto-repair mechanism then corrupts the transcript further, requiring a manual new session.

## Environment

- OpenClaw 4.29, Linux 6.17.0-1011-azure (x64)
- Provider: github-copilot / claude-opus-4.6
- Channel: WhatsApp

## Steps to Reproduce

1. Agent is mid-conversation with active tool calls
2. A tool call fails, producing an `[assistant turn failed before producing content]` placeholder
3. A blank/empty user message follows (possibly from WhatsApp inbound during the failed turn)
4. This creates an invalid message sequence where the conversation ends with an assistant message (the failed placeholder) rather than a user message

## What Happens

### Phase 1: Prefill error (10:59:27 CEST)
```
[agent/embedded] embedded run agent end: isError=true error=LLM request failed: provider rejected the request schema or tool payload.
rawError=400 This model does not support assistant message prefill. The conversation must end with a user message.
```

### Phase 2: Provider cooldown cascade
The format error puts the **entire provider** into cooldown:
```
[model-fallback/decision] decision=candidate_failed reason=format
[model-fallback/decision] decision=skip_candidate reason=format detail=Provider github-copilot is in cooldown (all profiles unavailable)
Embedded agent failed before reply: All models failed (1): github-copilot/claude-opus-4.6: Provider github-copilot is in cooldown
```

Every subsequent user message for the next ~42 minutes hits the same cooldown wall. The agent cannot respond at all.

### Phase 3: Session repair makes it worse (11:41:15)
```
[session-init] session file repair: rewrote 1 assistant message(s), dropped 1 blank user message(s)
```

After repair, the error changes to:
```
rawError=400 messages: at least one message is required
```

The transcript is now fully corrupted — both .reset and .bak files contain 935+ entries with null roles (complete structural JSONL corruption).

## Expected Behavior

1. A prefill/format error on one request should NOT put the entire provider into long-term cooldown
2. The cooldown should either be very short or only apply to that specific session, not block all sessions
3. Session file repair should not produce a worse state than what it started with
4. If a transcript is irrecoverable, the system should auto-create a fresh session rather than repeatedly failing

## Actual Behavior

- Single format error → 42+ minutes of complete agent unresponsiveness
- Auto-repair corrupts the transcript further
- User had to manually start a new session

## Root Cause Analysis

The core issue appears to be that `[assistant turn failed before producing content]` placeholder messages create invalid message sequences. When combined with blank/dropped user messages, the conversation violates the provider's constraint that it must end with a user message. The cooldown mechanism then amplifies a single-request format error into a prolonged outage.

## Suggested Fixes

1. **Short cooldown for format errors**: Format errors are session-specific, not provider-wide issues. Cooldown should be seconds, not minutes, and scoped to the session.
2. **Safer transcript repair**: Validate the repaired transcript before committing. If repair produces an invalid state, fall back to creating a fresh session.
3. **Handle `[assistant turn failed]` placeholders**: These should be cleaned from the transcript before sending to the provider, or replaced with a valid assistant message.
4. **Auto-recovery**: If a session is stuck in repeated format errors, offer to reset it automatically rather than failing silently for 40+ minutes.

## Log References

- Session ID: `229feaa0-2692-401c-a828-66939bf80acc`
- Failed run IDs: `66a98a07`, `3ef84b65` (and several in between)
- Corrupted files: `.jsonl.reset.2026-05-04T09-42-09.905Z`, `.jsonl.bak-59249-*`

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Session corruption: prefill error cascades into provider cooldown + repair makes it worse #77228

Bug Summary

Environment

Steps to Reproduce

What Happens

Phase 1: Prefill error (10:59:27 CEST)

Phase 2: Provider cooldown cascade

Phase 3: Session repair makes it worse (11:41:15)

Expected Behavior

Actual Behavior

Root Cause Analysis

Suggested Fixes

Log References

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Session corruption: prefill error cascades into provider cooldown + repair makes it worse #77228

Description

Bug Summary

Environment

Steps to Reproduce

What Happens

Phase 1: Prefill error (10:59:27 CEST)

Phase 2: Provider cooldown cascade

Phase 3: Session repair makes it worse (11:41:15)

Expected Behavior

Actual Behavior

Root Cause Analysis

Suggested Fixes

Log References

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions