
Session enters zombie state after embedded agent init failure (deactivated_workspace) — no auto-recovery #54964

@EronFan

Problem Summary

When an embedded agent fails to initialize (e.g., `deactivated_workspace` error), the corresponding session enters a zombie state:

  • Subsequent messages are dispatched normally by the gateway
  • Gateway logs: `dispatch complete (queuedFinal=false, replies=0)`
  • The agent produces zero replies — users see complete silence
  • Session is never cleaned up; only manual deletion from `sessions.json` restores service
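As a rough way to spot the symptom, the log fragment quoted above can be counted in the gateway log. This is a minimal sketch: it matches only the substring shown in this issue, because the rest of the log line format is not known here, and the function name is hypothetical. The count alone cannot distinguish a zombie session from one that legitimately chose not to reply, so treat it as a hint rather than a diagnosis.

```python
from typing import Iterable

# Substring quoted in this issue; everything else about the log line
# (timestamp, session identifier, prefix) is unknown and not assumed.
ZERO_REPLY_MARKER = "dispatch complete (queuedFinal=false, replies=0)"


def count_zero_reply_dispatches(lines: Iterable[str]) -> int:
    """Count log lines that report a completed dispatch with zero replies."""
    return sum(1 for line in lines if ZERO_REPLY_MARKER in line)
```

It can be fed any iterable of lines, for example an open log file handle.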

Steps to Reproduce

  1. A group-chat session triggers embedded agent initialization
  2. The initialization fails with a `deactivated_workspace` error (e.g., due to workspace config issues)
  3. The session is left in a broken state: marked "initialized" but not actually running
  4. All subsequent messages dispatched to this session are accepted silently and never answered; the gateway logs `replies=0` every time
  5. Only manually deleting the session key from `sessions.json` and triggering a new session creation restores functionality

Current Workaround

Manually delete the session key from:
`/root/.openclaw/agents//sessions/sessions.json`

This forces a fresh session to be created on the next inbound message.
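The manual edit can be scripted so it is less error-prone. The sketch below assumes, without confirmation from the codebase, that `sessions.json` holds a single JSON object whose top-level keys are session keys; `remove_session_key` is a hypothetical helper name. It writes a `.bak` copy first and replaces the file atomically.

```python
import json
import os
import shutil
import tempfile
from pathlib import Path


def remove_session_key(sessions_file: str, session_key: str) -> bool:
    """Delete one session entry from sessions.json; return True if it existed.

    Assumption: the file is one JSON object keyed by session key.
    """
    path = Path(sessions_file)
    with path.open("r", encoding="utf-8") as fh:
        sessions = json.load(fh)

    if session_key not in sessions:
        return False

    # Keep a copy next to the original so the edit can be undone.
    shutil.copy2(path, path.with_name(path.name + ".bak"))

    del sessions[session_key]

    # Write to a temp file in the same directory, then rename it over the
    # original, so a crash mid-write cannot leave a truncated file behind.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(sessions, fh, indent=2)
    os.replace(tmp_name, path)
    return True
```

Call it with the path to the affected agent's `sessions.json` and the session key to drop. A `False` return means the key was not present and nothing was changed.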


Root Cause Analysis

  • The `deactivated_workspace` error does not trigger session cleanup
  • The session state machine has no error-handling path for a failed agent init, so the session stays marked "initialized"
  • The gateway dispatch layer considers the message "delivered" (since it reached the agent), but the agent never actually processes it
  • No `abortedLastRun` flag is set, and no automatic recovery mechanism fires
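To make the missing error path concrete, here is a toy model of a session whose initializer can fail. It is not OpenClaw's code: `SessionState`, `Session`, and `failing_init` are hypothetical names, and `aborted_last_run` only stands in for the `abortedLastRun` flag mentioned above. It shows init failure moving the session into an explicit failed state instead of leaving it looking initialized.

```python
from enum import Enum
from typing import Callable, Optional


class SessionState(Enum):
    NEW = "new"
    INITIALIZED = "initialized"
    FAILED = "failed"


class Session:
    def __init__(self, key: str) -> None:
        self.key = key
        self.state = SessionState.NEW
        self.aborted_last_run = False          # models abortedLastRun
        self.last_error: Optional[str] = None

    def initialize(self, init_agent: Callable[[], None]) -> None:
        """Run the agent initializer and record the outcome explicitly."""
        try:
            init_agent()
        except Exception as exc:
            # The path described as missing: record the failure so later
            # dispatches can see that this session never started.
            self.state = SessionState.FAILED
            self.aborted_last_run = True
            self.last_error = str(exc)
        else:
            self.state = SessionState.INITIALIZED
            self.aborted_last_run = False
            self.last_error = None


def failing_init() -> None:
    # Stand-in for the real failure; only the error name comes from the issue.
    raise RuntimeError("deactivated_workspace")
```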

Suggested Fixes

The session lifecycle should handle embedded agent init failures gracefully:

  1. Auto-mark on failure: When embedded agent init fails, set `abortedLastRun=true` on the session so the next dispatch can detect it and create a fresh session
  2. Session health check on dispatch: Before dispatching to an existing session, check if the previous run was aborted/failed and auto-recover rather than silently reusing a dead session
  3. Graceful degradation: If the session cannot be initialized after N attempts, surface an explicit error to the user instead of a silent `replies=0`
  4. Auto-cleanup of zombie sessions: Run a background cleanup task that detects sessions with repeated `abortedLastRun=true` and removes them proactively
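The four suggestions can be sketched together as follows. This is an illustration under stated assumptions, not a patch: `SessionStore`, `dispatch`, `sweep`, and `MAX_INIT_ATTEMPTS` are hypothetical names, the retry limit of 3 is arbitrary, and recovery is modeled as re-running init on the same session record rather than through OpenClaw's actual session creation path.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

MAX_INIT_ATTEMPTS = 3  # the "N" in fix 3; the value is an arbitrary choice


@dataclass
class Session:
    key: str
    aborted_last_run: bool = False  # models the abortedLastRun flag
    failed_inits: int = 0


class SessionStore:
    def __init__(self, init_agent: Callable[[str], None]) -> None:
        self._init_agent = init_agent  # expected to raise when init fails
        self.sessions: Dict[str, Session] = {}

    def _try_init(self, session: Session) -> None:
        try:
            self._init_agent(session.key)
        except Exception:
            # Fix 1: mark the session instead of leaving it looking healthy.
            session.aborted_last_run = True
            session.failed_inits += 1
        else:
            session.aborted_last_run = False
            session.failed_inits = 0

    def dispatch(self, key: str, message: str) -> str:
        session = self.sessions.get(key)
        if session is None:
            session = self.sessions[key] = Session(key)
            self._try_init(session)
        elif session.aborted_last_run and session.failed_inits < MAX_INIT_ATTEMPTS:
            # Fix 2: health check before reuse; retry init, do not reuse blindly.
            self._try_init(session)
        if session.aborted_last_run:
            # Fix 3: explicit error instead of silence.
            return (f"error: agent unavailable after "
                    f"{session.failed_inits} failed init attempt(s)")
        return f"reply to: {message}"  # placeholder for real agent handling

    def sweep(self) -> List[str]:
        """Fix 4: drop sessions that exhausted their init attempts."""
        dead = [k for k, s in self.sessions.items()
                if s.aborted_last_run and s.failed_inits >= MAX_INIT_ATTEMPTS]
        for k in dead:
            del self.sessions[k]
        return dead
```

In this sketch a background task would call `sweep()` periodically, and the next inbound message for a removed key starts a fresh session with a reset attempt counter.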

Environment

  • OpenClaw version: latest (main branch)
  • Channel: Feishu group chat
  • Session type: `group` (embedded agent)

Labels: bug, session, recovery, embedded-agent
