Session corruption: leading-assistant transcript causes infinite "messages: at least one message is required" loop #75235

@EVEHetzner

Description


Summary

A session can enter a permanently broken state where every subsequent turn fails with a 400 from the Anthropic API:

400 invalid_request_error: messages: at least one message is required

The runtime detects this as failoverReason: "format" / providerRuntimeFailureKind: "schema" and surfaces the generic GENERIC_EXTERNAL_RUN_FAILURE_TEXT ("⚠️ Something went wrong while processing your request. Please try again, or use /new to start a fresh session.") to the user — but does not auto-reset the session, so every retry hits the same 400.

Reproduction (observed)

In my deployment (openclaw 2026.4.27), a turn was aborted between "user message persisted" and "assistant reply written" — the host that the gateway runs on was wedged by an orphan SSH child holding the shell. The session transcript ended up with assistant-role entries containing "text":"[assistant turn failed before producing content]" and no preceding user-role message.

From that point onward, every send to that session re-submitted the broken transcript. The Anthropic API rejected each request because the messages array was effectively empty (no leading user turn). The session stayed broken for ~3 hours across multiple turns until I manually reset it via /new (*.jsonl.reset.<timestamp> was written at the same moment things started working again).

Evidence from logs

Same errorFingerprint: sha256:5d882a6629dc on every failure, different runId each time, identical 400 body. Snippet from openclaw logs:

warn agent/embedded {"event":"embedded_run_agent_end","isError":true,
  "error":"LLM request rejected: messages: at least one message is required",
  "failoverReason":"format","providerRuntimeFailureKind":"schema",
  "providerErrorType":"invalid_request_error","httpCode":"400"}

warn model-fallback/decision {"decision":"candidate_failed","reason":"format",
  "fallbackStepFinalOutcome":"chain_exhausted","fallbackConfigured":false}

error diagnostic lane task error: lane=session:agent:main:telegram:direct:<id>
  error="FailoverError: LLM request rejected: messages: at least one message is required"

Transcript snippet (agents/main/sessions/<sid>.jsonl.reset.<ts>):

{"type":"session","version":3,"id":"<sid>","timestamp":"...Z"}
{"type":"model_change",...}
{"type":"thinking_level_change",...}
{"type":"custom","customType":"model-snapshot",...}
{"type":"message","message":{"role":"assistant",
  "content":[{"type":"text","text":"[assistant turn failed before producing content]"}],
  "stopReason":"error","errorMessage":"400 ... messages: at least one message is required"}}
{"type":"thinking_level_change",...}
{"type":"message","message":{"role":"assistant",...same shape...}}

No user-role entry exists in the transcript — only the orphan assistant turns.

Expected behaviour (issue 1: auto-detect & reset)

The runtime already has auto-reset paths for two corruption modes:

  • Gemini function-call ordering (isSessionCorruption branch in agent-runner.runtime)
  • Role-ordering conflict (isRoleOrderingError branch)

Please add a third reset path for leading-assistant / no-user-message transcripts. Detection is reliable:

  • HTTP 400
  • providerErrorType === "invalid_request_error"
  • providerErrorMessagePreview starts with "messages: at least one message is required"
  • (and/or) on-disk transcript has zero user-role entries

When detected, the runtime should:

  1. Snapshot the bad transcript to *.jsonl.reset.<ts> (same as existing reset paths)
  2. Drop the session from the active session store
  3. Reply to the user with the same friendly message used for the role-ordering reset: "⚠️ Session history was corrupted. I've reset the conversation - please try again!"

Without this, the user only sees the generic "Something went wrong" text and has no idea they need to run /new; on a Telegram channel they may not even know /new is an option. In my case the loop persisted across ~4 turns over several hours.
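A minimal sketch of the detection predicate (names like RunFailure and isLeadingAssistantCorruption are illustrative, not actual openclaw internals; the field names mirror the log keys shown above):

```typescript
// Hypothetical shape mirroring the failure fields in the logs above;
// none of these names are actual openclaw APIs.
interface RunFailure {
  httpCode: string;
  providerErrorType: string;
  providerErrorMessagePreview: string;
}

// True when the failure matches the leading-assistant corruption signature:
// HTTP 400, invalid_request_error, and the "at least one message" preview.
function isLeadingAssistantCorruption(f: RunFailure): boolean {
  return (
    f.httpCode === "400" &&
    f.providerErrorType === "invalid_request_error" &&
    f.providerErrorMessagePreview.startsWith(
      "messages: at least one message is required",
    )
  );
}
```

When the predicate fires, the runtime would follow the same snapshot / drop / notify sequence the existing isRoleOrderingError branch already uses.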

Expected behaviour (issue 2: post-reset confirmation ping)

Related UX gap from the same incident. After typing /new, the user gets only a system-rendered "✅ New session started." and no further signal. After watching repeated error messages, that acknowledgement reads ambiguously — did the agent actually boot? Is it waiting? Is it broken in a different way?

In my case I waited ~19 minutes after /new before sending another message because there was no proof the agent was alive on the new session. The moment I did send a message, the agent replied instantly — so the round-trip was always working post-reset, but I had no way to know that without gambling another message.

Suggested fix: after /new (or after any auto-reset path), have the runtime emit a brief agent-side ping like "👋 Fresh session — ready when you are." so the user sees a real round-trip and knows the new session works. Same fix benefits the auto-detect path in issue 1.
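A sketch of what that ping could look like, assuming a hypothetical OutboundChannel hook (not a real openclaw interface) invoked at the end of every reset path:

```typescript
// Hypothetical channel abstraction; `send` stands in for whatever the
// telegram lane actually uses to deliver agent-side messages.
interface OutboundChannel {
  send(text: string): void;
}

// Called after /new and after any auto-reset, so the user always sees a
// real agent-side round-trip on the fresh session.
function announceFreshSession(channel: OutboundChannel): void {
  channel.send("👋 Fresh session — ready when you are.");
}
```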

Root cause guard (optional, broader fix)

The persistence layer probably should not write an orphan assistant entry in the first place. If the assistant turn errored before producing content, either:

  • Don't persist the assistant entry at all (rollback the turn), or
  • Persist a marker that triggers the corruption-recovery path on the next send.

Currently the on-disk shape (stopReason:"error", no preceding user turn) is a state the schema permits but the API rejects forever. Treating it as terminal-corrupt on read would fix this class of issue without needing the API call to fail first.
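As a rough illustration of "treat it as terminal-corrupt on read", here is a sketch against the JSONL shape shown in the transcript snippet; TranscriptEntry and the helper name are assumptions, not openclaw code:

```typescript
// Minimal mirror of the on-disk JSONL entries shown above; only the
// fields needed for the check are modeled.
interface TranscriptEntry {
  type: string;
  message?: { role: string };
}

// A non-empty transcript whose message entries are all assistant-role can
// never be accepted by the API, so flag it as corrupt at load time instead
// of waiting for the 400.
function hasOnlyOrphanAssistantTurns(entries: TranscriptEntry[]): boolean {
  const messages = entries.filter(
    (e) => e.type === "message" && e.message !== undefined,
  );
  return (
    messages.length > 0 &&
    messages.every((m) => m.message!.role === "assistant")
  );
}
```

The loader could run this check when hydrating a session and route a hit straight into the corruption-recovery path from issue 1.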

Environment

  • openclaw 2026.4.27 (cbc2ba0)
  • Provider: anthropic, model: claude-opus-4-7
  • Channel: telegram (lane session:agent:main:telegram:direct:<id>)
