Stale Claude session ID causes silent failure loop on container restart #1280



  ## Problem

  Sometimes (possibly after a container restart), sending a message in an existing (project-scoped)
  conversation silently fails. The response briefly flickers in the Web UI and then
  disappears, leaving the user looking at the last persisted message from before the
  restart. Subsequent messages in the same conversation all fail identically.

  ## Root Cause

  Archon stores the Anthropic session ID (`assistant_session_id`) in the database so
  it can resume conversations. Claude API sessions have a limited lifetime: they expire
  after some period of inactivity or are invalidated by service-side events (scaling,
  maintenance, etc.). When Archon tries to resume a session that no longer exists on
  the server, ... When the container comes back up, the stored session ID is stale:
  the Anthropic server no longer knows about it.

  On the next message, the SDK returns:

  ```json
  {
    "is_error": true,
    "subtype": "error_during_execution",
    "num_turns": 0,
    "errors": ["No conversation found with session ID: <stale-id>"]
  }
  ```

  The existing error handler then persists the new (also empty) session ID returned
  in the error result, so the next message hits the same failure with a different ID —
  an infinite failure loop. The conversation appears permanently broken.
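
  One way to recognise this specific failure is to check the fields shown in the error result above. The sketch below is illustrative only: the `SdkResult` type and the `isStaleSessionError` helper are hypothetical names, not part of Archon or the SDK, and matching on the message text assumes the error string keeps the wording quoted above. It is a narrower check than matching on the subtype alone.

  ```typescript
  // Hypothetical shape, limited to the fields visible in the error result above.
  interface SdkResult {
    is_error: boolean;
    subtype?: string;
    num_turns?: number;
    errors?: string[];
  }

  // Assumption: the message keeps the wording seen in the reported error.
  const STALE_SESSION_PREFIX = "No conversation found with session ID";

  // Returns true when the result looks like a failed resume of a stale session.
  function isStaleSessionError(result: SdkResult): boolean {
    if (!result.is_error || result.subtype !== "error_during_execution") {
      return false;
    }
    return (result.errors ?? []).some(
      (e) => e.indexOf(STALE_SESSION_PREFIX) === 0,
    );
  }

  const staleResult: SdkResult = {
    is_error: true,
    subtype: "error_during_execution",
    num_turns: 0,
    errors: ["No conversation found with session ID: <stale-id>"],
  };

  console.log(isStaleSessionError(staleResult)); // true
  ```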

  ### Important: Conversation history is NOT lost

  Archon's conversation history lives in its own `remote_agent_messages` table, fully
  independent of the Claude session. The Claude session is only a performance
  optimisation (prompt-cache resume). When it's missing, the orchestrator rebuilds the
  full context from the stored messages, exactly as it does when starting a brand-new
  conversation. Clearing the stale session ID has no effect on the user's conversation
  history.
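
  The fallback described above can be sketched roughly as follows. This is a simplified illustration, not Archon's orchestrator code: `StoredMessage`, `AgentRequest`, `buildRequest`, and the plain-text prompt format are all hypothetical. The only thing taken from the description is that stored messages are enough to rebuild context when no session ID is available.

  ```typescript
  // Hypothetical row shape for a stored conversation message.
  interface StoredMessage {
    role: "user" | "assistant";
    content: string;
  }

  // Hypothetical request shape: either resume a session or send full context.
  interface AgentRequest {
    resumeSessionId: string | null;
    prompt: string;
  }

  // With a session ID, only the new message is sent and the session is resumed.
  // Without one, the earlier turns are replayed into the prompt instead.
  function buildRequest(
    sessionId: string | null,
    storedMessages: StoredMessage[],
    newMessage: string,
  ): AgentRequest {
    if (sessionId !== null) {
      return { resumeSessionId: sessionId, prompt: newMessage };
    }
    const transcript = storedMessages.map((m) => `${m.role}: ${m.content}`);
    transcript.push(`user: ${newMessage}`);
    return { resumeSessionId: null, prompt: transcript.join("\n") };
  }

  const priorTurns: StoredMessage[] = [
    { role: "user", content: "Rename the config loader" },
    { role: "assistant", content: "Done, see loadConfig()" },
  ];

  // A cleared session ID still yields a request that carries the earlier turns.
  const freshRequest = buildRequest(null, priorTurns, "Now add tests");
  console.log(freshRequest.prompt);
  ```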

  ## Fix

  In the `error_during_execution` error path inside the orchestrator, set
  `assistant_session_id = NULL` instead of persisting the new (failed) session ID.
  The next message then starts a completely fresh session with full context rebuilt from
  the DB.

  ```typescript
  // Before: persists the failed session ID → loops forever
  if (newSessionId) {
    await tryPersistSessionId(session.id, newSessionId);
  }

  // After: clears the stale ID → next message gets a fresh session
  if (msg.errorSubtype === 'error_during_execution') {
    await tryPersistSessionId(session.id, null);   // NULL = start fresh next time
  } else if (newSessionId) {
    await tryPersistSessionId(session.id, newSessionId);
  }
  ```

  This requires `updateSession` and `tryPersistSessionId` to accept `string | null`.
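
  The widened signatures could look roughly like this. The sketch uses a synchronous in-memory record in place of the real database so that it runs standalone (the real calls are awaited). The `SessionRow` type, the `rows` store, and both function bodies are assumptions for illustration, including the best-effort behaviour of `tryPersistSessionId`. Only the names `updateSession`, `tryPersistSessionId`, and `assistant_session_id` come from this issue.

  ```typescript
  // Hypothetical row: only the column discussed in this issue.
  interface SessionRow {
    assistant_session_id: string | null;
  }

  // In-memory stand-in for the sessions table (assumption; the real code uses the DB).
  const rows: { [id: string]: SessionRow } = {
    "session-1": { assistant_session_id: "stale-id" },
  };

  // Widened to accept null so callers can clear the stored ID.
  function updateSession(id: string, assistantSessionId: string | null): void {
    const row = rows[id];
    if (!row) {
      throw new Error("Unknown session: " + id);
    }
    row.assistant_session_id = assistantSessionId;
  }

  // Assumed best-effort wrapper: reports failure instead of throwing.
  function tryPersistSessionId(
    id: string,
    assistantSessionId: string | null,
  ): boolean {
    try {
      updateSession(id, assistantSessionId);
      return true;
    } catch (err) {
      console.warn("Could not persist session ID", err);
      return false;
    }
  }

  tryPersistSessionId("session-1", null);
  console.log(rows["session-1"].assistant_session_id); // null
  ```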

  ### Affected files

  - `packages/core/src/db/sessions.ts`: `updateSession` signature
  - `packages/core/src/orchestrator/orchestrator-agent.ts`: two identical error handlers

  ## Reproduction

  1. Start Archon, open a project-scoped conversation, exchange a few messages.
  2. Run `docker compose down && docker compose up -d`.
  3. Open the same conversation and send any message.
  4. The UI briefly shows activity, then reverts to the pre-restart state.
  5. All subsequent messages in that conversation fail identically.

  New conversations (not yet in the DB with a session ID) are unaffected.
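
  The same sequence can be modelled without Docker, which may be useful as a starting point for a regression test. Everything in this sketch is hypothetical: `FakeAgentService` stands in for the remote service, `forgetAll` stands in for whatever invalidates the session, and `handleMessage` reduces the error handler to the single decision of which session ID to store next.

  ```typescript
  // Hypothetical fake of the remote service: it only knows currently live session IDs.
  class FakeAgentService {
    private live: string[] = [];
    private counter = 0;

    // Resuming an unknown ID fails; otherwise a session is reused or created.
    send(sessionId: string | null): { ok: boolean; sessionId: string } {
      this.counter += 1;
      const issued = "s" + this.counter;
      if (sessionId !== null && this.live.indexOf(sessionId) === -1) {
        // Mirrors the report: the error result still carries a new, unusable ID.
        return { ok: false, sessionId: issued };
      }
      if (sessionId !== null) {
        return { ok: true, sessionId: sessionId };
      }
      this.live.push(issued);
      return { ok: true, sessionId: issued };
    }

    // Simulates the server forgetting all sessions.
    forgetAll(): void {
      this.live = [];
    }
  }

  // Sends one message and returns the session ID to store afterwards.
  function handleMessage(
    service: FakeAgentService,
    storedId: string | null,
    clearOnError: boolean,
  ): { ok: boolean; nextStoredId: string | null } {
    const result = service.send(storedId);
    if (!result.ok) {
      return { ok: false, nextStoredId: clearOnError ? null : result.sessionId };
    }
    return { ok: true, nextStoredId: result.sessionId };
  }

  // One message before the sessions are lost, then three messages after.
  function run(clearOnError: boolean): boolean[] {
    const service = new FakeAgentService();
    let stored = handleMessage(service, null, clearOnError).nextStoredId;
    service.forgetAll(); // the stored ID is now stale
    const outcomes: boolean[] = [];
    for (let i = 0; i < 3; i++) {
      const r = handleMessage(service, stored, clearOnError);
      outcomes.push(r.ok);
      stored = r.nextStoredId;
    }
    return outcomes;
  }

  console.log(run(false).join(",")); // false,false,false
  console.log(run(true).join(","));  // false,true,true
  ```

  Under these assumptions, persisting the ID from the error result fails on every message, while clearing it fails once and then recovers on the next message.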