Session corruption: leading-assistant transcript causes infinite "messages: at least one message is required" loop #75235

@EVEHetzner

Description


Summary

A session can enter a permanently broken state where every subsequent turn fails with a 400 from the Anthropic API:

400 invalid_request_error: messages: at least one message is required

The runtime detects this as failoverReason: "format" / providerRuntimeFailureKind: "schema" and surfaces the generic GENERIC_EXTERNAL_RUN_FAILURE_TEXT ("⚠️ Something went wrong while processing your request. Please try again, or use /new to start a fresh session.") to the user — but does not auto-reset the session, so every retry hits the same 400.

Reproduction (observed)

In my deployment (openclaw 2026.4.27), a turn was aborted between "user message persisted" and "assistant reply written" — the host that the gateway runs on was wedged by an orphan SSH child holding the shell. The session transcript ended up with assistant-role entries containing "text":"[assistant turn failed before producing content]" and no preceding user-role message.

From that point onward, every send to that session re-submitted the broken transcript. The Anthropic API rejected each request because the messages array was effectively empty (no leading user turn). The session stayed broken for ~3 hours across multiple turns until I manually reset it via /new (*.jsonl.reset.<timestamp> was written at the same moment things started working again).

Evidence from logs

Same errorFingerprint: sha256:5d882a6629dc on every failure, different runId each time, identical 400 body. Snippet from openclaw logs:

warn agent/embedded {"event":"embedded_run_agent_end","isError":true,
  "error":"LLM request rejected: messages: at least one message is required",
  "failoverReason":"format","providerRuntimeFailureKind":"schema",
  "providerErrorType":"invalid_request_error","httpCode":"400"}

warn model-fallback/decision {"decision":"candidate_failed","reason":"format",
  "fallbackStepFinalOutcome":"chain_exhausted","fallbackConfigured":false}

error diagnostic lane task error: lane=session:agent:main:telegram:direct:<id>
  error="FailoverError: LLM request rejected: messages: at least one message is required"

Transcript snippet (agents/main/sessions/<sid>.jsonl.reset.<ts>):

{"type":"session","version":3,"id":"<sid>","timestamp":"...Z"}
{"type":"model_change",...}
{"type":"thinking_level_change",...}
{"type":"custom","customType":"model-snapshot",...}
{"type":"message","message":{"role":"assistant",
  "content":[{"type":"text","text":"[assistant turn failed before producing content]"}],
  "stopReason":"error","errorMessage":"400 ... messages: at least one message is required"}}
{"type":"thinking_level_change",...}
{"type":"message","message":{"role":"assistant",...same shape...}}

No user-role entry exists in the transcript — only the orphan assistant turns.

Expected behaviour (issue 1: auto-detect & reset)

The runtime already has auto-reset paths for two corruption modes:

  • Gemini function-call ordering (isSessionCorruption branch in agent-runner.runtime)
  • Role-ordering conflict (isRoleOrderingError branch)

Please add a third reset path for leading-assistant / no-user-message transcripts. Detection is reliable:

  • HTTP 400
  • providerErrorType === "invalid_request_error"
  • providerErrorMessagePreview starts with "messages: at least one message is required"
  • (and/or) on-disk transcript has zero user-role entries

When detected, the runtime should:

  1. Snapshot the bad transcript to *.jsonl.reset.<ts> (same as existing reset paths)
  2. Drop the session from the active session store
  3. Reply to the user with the same friendly message used for the role-ordering reset: "⚠️ Session history was corrupted. I've reset the conversation - please try again!"

Without this, the user only sees the generic "Something went wrong" text and has no idea they need to run /new; on a Telegram channel they may not even know /new is an option. In my case the loop persisted across ~4 turns over several hours.
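A minimal sketch of the detection predicate (names like RunFailure and isLeadingAssistantCorruption are illustrative, not actual openclaw internals; the field names mirror the log keys shown above):

```typescript
// Hypothetical shape mirroring the failure fields in the logs above;
// none of these names are actual openclaw APIs.
interface RunFailure {
  httpCode: string;
  providerErrorType: string;
  providerErrorMessagePreview: string;
}

// True when the failure matches the leading-assistant corruption signature:
// HTTP 400, invalid_request_error, and the "at least one message" preview.
function isLeadingAssistantCorruption(f: RunFailure): boolean {
  return (
    f.httpCode === "400" &&
    f.providerErrorType === "invalid_request_error" &&
    f.providerErrorMessagePreview.startsWith(
      "messages: at least one message is required",
    )
  );
}
```

When the predicate fires, the runtime would follow the same snapshot / drop / notify sequence the existing isRoleOrderingError branch already uses.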

Expected behaviour (issue 2: post-reset confirmation ping)

Related UX gap from the same incident. After typing /new, the user gets only a system-rendered "✅ New session started." and no further signal. After watching repeated error messages, that acknowledgement reads ambiguously — did the agent actually boot? Is it waiting? Is it broken in a different way?

In my case I waited ~19 minutes after /new before sending another message because there was no proof the agent was alive on the new session. The moment I did send a message, the agent replied instantly — so the round-trip was always working post-reset, but I had no way to know that without gambling another message.

Suggested fix: after /new (or after any auto-reset path), have the runtime emit a brief agent-side ping like "👋 Fresh session — ready when you are." so the user sees a real round-trip and knows the new session works. Same fix benefits the auto-detect path in issue 1.
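A sketch of what that ping could look like, assuming a hypothetical OutboundChannel hook (not a real openclaw interface) invoked at the end of every reset path:

```typescript
// Hypothetical channel abstraction; `send` stands in for whatever the
// telegram lane actually uses to deliver agent-side messages.
interface OutboundChannel {
  send(text: string): void;
}

// Called after /new and after any auto-reset, so the user always sees a
// real agent-side round-trip on the fresh session.
function announceFreshSession(channel: OutboundChannel): void {
  channel.send("👋 Fresh session — ready when you are.");
}
```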

Root cause guard (optional, broader fix)

The persistence layer probably should not write an orphan assistant entry in the first place. If the assistant turn errored before producing content, either:

  • Don't persist the assistant entry at all (rollback the turn), or
  • Persist a marker that triggers the corruption-recovery path on the next send.

Currently the on-disk shape (stopReason:"error", no preceding user turn) is a state the schema permits but the API rejects forever. Treating it as terminal-corrupt on read would fix this class of issue without needing the API call to fail first.
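As a rough illustration of "treat it as terminal-corrupt on read", here is a sketch against the JSONL shape shown in the transcript snippet; TranscriptEntry and the helper name are assumptions, not openclaw code:

```typescript
// Minimal mirror of the on-disk JSONL entries shown above; only the
// fields needed for the check are modeled.
interface TranscriptEntry {
  type: string;
  message?: { role: string };
}

// A non-empty transcript whose message entries are all assistant-role can
// never be accepted by the API, so flag it as corrupt at load time instead
// of waiting for the 400.
function hasOnlyOrphanAssistantTurns(entries: TranscriptEntry[]): boolean {
  const messages = entries.filter(
    (e) => e.type === "message" && e.message !== undefined,
  );
  return (
    messages.length > 0 &&
    messages.every((m) => m.message!.role === "assistant")
  );
}
```

The loader could run this check when hydrating a session and route a hit straight into the corruption-recovery path from issue 1.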

Environment

  • openclaw 2026.4.27 (cbc2ba0)
  • Provider: anthropic, model: claude-opus-4-7
  • Channel: telegram (lane session:agent:main:telegram:direct:<id>)
