Stale Claude session ID causes silent failure loop on container restart #1280

@ztech-gthb

Problem

Sometimes (possibly after a container restart), sending a message in an existing (project-scoped)
conversation silently fails. The response briefly flickers in the Web UI and then
disappears, leaving the user looking at the last persisted message from before the
restart. Subsequent messages in the same conversation all fail identically.

Root Cause

Archon stores the Anthropic session ID (assistant_session_id) in the database so
it can resume conversations. Claude API sessions have a limited lifetime: they expire
after some period of inactivity or are invalidated by service-side events (scaling,
maintenance, etc.). When Archon tries to resume a session that no longer exists on the
server, ... When the container comes back up, the stored session ID is stale: the
Anthropic server no longer knows about it.

On the next message, the SDK returns:

{
  "is_error": true,
  "subtype": "error_during_execution",
  "num_turns": 0,
  "errors": ["No conversation found with session ID: <stale-id>"]
}

The existing error handler then persists the new (also empty) session ID returned
in the error result, so the next message hits the same failure with a different ID —
an infinite failure loop. The conversation appears permanently broken.
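
A minimal sketch of how the error result quoted above could be recognised in code. The interface models only the fields shown in the sample payload, and isStaleSessionError is a hypothetical helper name, not something taken from Archon or the SDK. It also matches on the message text, which is narrower than the fix proposed below (that one reacts to every error_during_execution result).

```typescript
// Models only the fields visible in the error payload quoted in this issue.
interface SdkResult {
  is_error: boolean;
  subtype?: string;
  num_turns?: number;
  errors?: string[];
}

// Hypothetical helper: true when the result reports that the resumed
// session is unknown to the server.
function isStaleSessionError(result: SdkResult): boolean {
  if (!result.is_error || result.subtype !== 'error_during_execution') {
    return false;
  }
  return (result.errors ?? []).some((message) =>
    message.startsWith('No conversation found with session ID'),
  );
}

const sample: SdkResult = {
  is_error: true,
  subtype: 'error_during_execution',
  num_turns: 0,
  errors: ['No conversation found with session ID: <stale-id>'],
};

console.log(isStaleSessionError(sample)); // true
```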

Important: Conversation history is NOT lost

Archon's conversation history lives in its own remote_agent_messages table, fully
independent of the Claude session. The Claude session is only a performance
optimisation (prompt-cache resume). When it's missing, the orchestrator rebuilds the
full context from the stored messages — exactly as it does when starting a brand-new
conversation. Clearing the stale session ID has no effect on the user's conversation
history.
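
To illustrate why clearing the ID is safe, here is a rough sketch of the two paths: resume when a session ID is stored, otherwise rebuild the prompt from stored messages. The StoredMessage and TurnRequest shapes and the buildTurnRequest name are assumptions made for illustration; the issue does not show the real remote_agent_messages columns or how the orchestrator formats rebuilt context.

```typescript
// Hypothetical row shape; the real table columns are not shown in this issue.
interface StoredMessage {
  role: 'user' | 'assistant';
  content: string;
}

// Hypothetical request shape passed on to the assistant client.
interface TurnRequest {
  resumeSessionId?: string; // present only when a stored session ID exists
  prompt: string;
}

function buildTurnRequest(
  sessionId: string | null,
  history: StoredMessage[],
  newMessage: string,
): TurnRequest {
  if (sessionId !== null) {
    // Resume path: the server-side session already holds the context.
    return { resumeSessionId: sessionId, prompt: newMessage };
  }
  // Fresh path: replay the stored history ahead of the new message.
  const lines = history.map((m) => `${m.role}: ${m.content}`);
  lines.push(`user: ${newMessage}`);
  return { prompt: lines.join('\n') };
}

const history: StoredMessage[] = [
  { role: 'user', content: 'hi' },
  { role: 'assistant', content: 'hello' },
];
console.log(buildTurnRequest(null, history, 'next').prompt);
```

With a null session ID the history is still sent in full, so only the cached server-side state is lost.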

Fix

In the error_during_execution error path inside the orchestrator, set
assistant_session_id = NULL instead of persisting the new (failed) session ID.
The next message then starts a completely fresh session with full context rebuilt from
the DB.

// Before: persists the failed session ID → loops forever
if (newSessionId) {
  await tryPersistSessionId(session.id, newSessionId);
}

// After: clears the stale ID → next message gets a fresh session
if (msg.errorSubtype === 'error_during_execution') {
  await tryPersistSessionId(session.id, null); // NULL = start fresh next time
} else if (newSessionId) {
  await tryPersistSessionId(session.id, newSessionId);
}

Requires updateSession and tryPersistSessionId to accept string | null.
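
The widened signatures might look roughly like the sketch below. It uses an in-memory Map as a stand-in for the sessions table and is synchronous for brevity; the real functions are awaited in the snippet above, so they presumably return promises and run a database update. The SessionRow shape, the boolean return value, and the error-swallowing behaviour of tryPersistSessionId are assumptions.

```typescript
// In-memory stand-in for the sessions table (assumption: the real code
// issues a database update instead).
interface SessionRow {
  assistant_session_id: string | null;
}
const sessionRows = new Map<string, SessionRow>();

// Accepts null so callers can clear a stale ID.
function updateSession(id: string, assistantSessionId: string | null): void {
  const row = sessionRows.get(id);
  if (row === undefined) {
    throw new Error(`Session not found: ${id}`);
  }
  row.assistant_session_id = assistantSessionId;
}

// Assumed "try" semantics: never throws, reports success as a boolean.
function tryPersistSessionId(id: string, assistantSessionId: string | null): boolean {
  try {
    updateSession(id, assistantSessionId);
    return true;
  } catch {
    return false;
  }
}

sessionRows.set('s1', { assistant_session_id: 'stale-id' });
tryPersistSessionId('s1', null);
console.log(sessionRows.get('s1')); // { assistant_session_id: null }
```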

Affected files

  • packages/core/src/db/sessions.ts — updateSession signature
  • packages/core/src/orchestrator/orchestrator-agent.ts — two identical error handlers

Reproduction

  1. Start Archon, open a project-scoped conversation, exchange a few messages.
  2. docker compose down && docker compose up -d
  3. Open the same conversation and send any message.
  4. The UI briefly shows activity, then reverts to the pre-restart state.
  5. All subsequent messages in that conversation fail identically.

New conversations (not yet in the DB with a session ID) are unaffected.
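
The reproduction can also be mimicked without Docker in a small simulation, which may be useful as the basis for a regression test. Everything here is hypothetical: FakeServer only models a server that forgets its sessions on restart and still returns a new, unregistered session ID alongside the error, as described under Root Cause.

```typescript
// Fake server: remembers only the session IDs issued since its last restart.
class FakeServer {
  private known = new Set<string>();
  private counter = 0;

  restart(): void {
    this.known.clear();
  }

  // Fails when asked to resume an unknown session. The ID returned with a
  // failure is never registered, so resuming it later fails as well.
  send(resumeId: string | null): { ok: boolean; sessionId: string } {
    this.counter += 1;
    const sessionId = `sess-${this.counter}`;
    if (resumeId !== null && !this.known.has(resumeId)) {
      return { ok: false, sessionId };
    }
    this.known.add(sessionId);
    return { ok: true, sessionId };
  }
}

type Policy = 'persist-failed-id' | 'clear-on-error';

// Runs `turns` messages after a restart and records whether each succeeded.
function runTurns(policy: Policy, turns: number): boolean[] {
  const server = new FakeServer();
  let stored: string | null = server.send(null).sessionId; // pre-restart chat
  server.restart();
  const outcomes: boolean[] = [];
  for (let i = 0; i < turns; i++) {
    const result = server.send(stored);
    outcomes.push(result.ok);
    const keepId = result.ok || policy === 'persist-failed-id';
    stored = keepId ? result.sessionId : null;
  }
  return outcomes;
}

console.log(runTurns('persist-failed-id', 3)); // [ false, false, false ]
console.log(runTurns('clear-on-error', 3)); // [ false, true, true ]
```

Under this simplified model the first message after the restart still fails with the fix applied; the conversation recovers from the second message onward.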
