2026.5.12: session recovery can split-brain when CLI runs under Codex HOME shadow state #82318

@chac4l

Summary

We hit a production incident that started with the same Telegram/Codex stall symptoms tracked in #82274, but recovery exposed a separate state-path bug:

When OpenClaw CLI session recovery runs from an agent/Codex environment whose HOME points at an isolated Codex home, session commands can read and write a shadow OpenClaw state tree instead of the gateway's real per-agent session store. The operator believes the sessions were reset, but the running gateway keeps reading the canonical store, so agents remain stuck in typing... / status=running.

This creates a nasty recovery split-brain during exactly the failure mode where operators are trying to unstick agents.

Environment

Observed evidence

During the incident the gateway logs showed repeated runtime stalls:

codex app-server turn idle timed out waiting for terminal event
codex app-server client retired after timed-out turn
timeout-compaction
active-memory timeout after 15000ms
before_prompt_build handler from active-memory failed: timed out after 15000ms

A recovery attempt then reset sessions from a CLI/tool context where HOME resolved under the Codex isolated home. That touched a shadow path like:

$OPENCLAW_HOME/agents/<coordinator>/agent/codex-home/home/.openclaw/agents/<target-agent>/sessions/sessions.json

But the running gateway was using the canonical per-agent store:

$OPENCLAW_HOME/agents/<target-agent>/sessions/sessions.json

Result: the shadow store looked reset, while the canonical store still held stale status=running sessions. The stuck agents cleared only after we reran the recovery with the canonical environment and paths.
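
To make the divergence concrete, here is a minimal sketch of how the two paths above fall out of the same derivation. It assumes the CLI builds its state tree from `~/.openclaw` (which is what the "raw ~" wording later in this issue suggests); `sessions_path_from_home` and the agent names `coordinator` and `target` are hypothetical placeholders, not OpenClaw code.

```python
from pathlib import PurePosixPath


def sessions_path_from_home(home: str, agent: str) -> PurePosixPath:
    """Derive a per-agent session store from HOME alone (assumed CLI behaviour)."""
    return PurePosixPath(home) / ".openclaw" / "agents" / agent / "sessions" / "sessions.json"


# HOME as seen by the gateway (root install, as in our mitigation env).
canonical = sessions_path_from_home("/root", "target")

# HOME as seen inside the coordinator's isolated Codex home.
shadow = sessions_path_from_home(
    "/root/.openclaw/agents/coordinator/agent/codex-home/home", "target"
)

print(canonical)
print(shadow)
```

Same command, same agent name, two different files: the only input that changed is HOME.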

Why this is dangerous

  • Recovery tools can report success against the wrong state tree.
  • A stalled agent lane can survive an apparent reset.
  • Operators can create multiple divergent session histories for the same agent.
  • The failure mode is easy to trigger from ACP/Codex launcher contexts because HOME is intentionally isolated.

Local mitigation applied

We mitigated by:

  1. Backing up the shadow session directories.
  2. Replacing the shadow codex-home/home/.openclaw/agents/<agent>/sessions directories with symlinks to the canonical $OPENCLAW_HOME/agents/<agent>/sessions directories.
  3. Running future offline recovery with explicit canonical env:
     HOME=/root \
     OPENCLAW_CONFIG_DIR=/root/.openclaw \
     OPENCLAW_CONFIG_PATH=/root/.openclaw/openclaw.json \
     openclaw ...

We also moved compaction/fallback away from the same Codex runtime and tightened active-memory timeouts as local blast-radius reduction, but that does not fix the underlying state-path split.
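
Steps 1 and 2 of the mitigation can be sketched roughly as follows. This is an illustration of what we did by hand, not an OpenClaw tool: `link_shadow_to_canonical` and the `.bak-<timestamp>` backup naming are our own assumptions, and it should only be run while nothing is writing to either directory.

```python
import time
from pathlib import Path
from typing import Optional


def link_shadow_to_canonical(shadow_sessions: Path, canonical_sessions: Path) -> Optional[Path]:
    """Back up a shadow sessions directory, then replace it with a symlink.

    Returns the backup path, or None when there was nothing to back up
    (the shadow was already a symlink or did not exist yet).
    """
    if shadow_sessions.is_symlink():
        return None  # already points somewhere else; leave it alone
    if not canonical_sessions.is_dir():
        raise FileNotFoundError(f"canonical store missing: {canonical_sessions}")

    backup = None
    if shadow_sessions.exists():
        # Step 1: keep the divergent history instead of deleting it.
        backup = shadow_sessions.with_name(f"{shadow_sessions.name}.bak-{int(time.time())}")
        shadow_sessions.rename(backup)
    else:
        shadow_sessions.parent.mkdir(parents=True, exist_ok=True)

    # Step 2: any later write through the shadow path lands in the canonical store.
    shadow_sessions.symlink_to(canonical_sessions, target_is_directory=True)
    return backup
```

The symlink only papers over the problem for the directories we know about; a new agent or a new isolated home recreates the shadow tree.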

Expected behavior

OpenClaw should not silently read/write a new state tree just because the process HOME is inside an isolated Codex/ACP home.

At minimum, session/state commands should do one of:

  • derive the gateway home/config from OPENCLAW_CONFIG_DIR / OPENCLAW_CONFIG_PATH / the running gateway, not raw ~;
  • warn/refuse when ~/.openclaw resolves under agents/*/agent/codex-home/home/.openclaw;
  • expose an explicit --home / --config-dir / --store for recovery tools;
  • make doctor detect shadow codex-home/home/.openclaw/agents/*/sessions directories and offer a safe repair or warning.
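
A rough sketch of how the first two options could combine: prefer the explicit environment variables over HOME, and refuse when the result still sits inside an isolated Codex home. The precedence order, `resolve_state_dir`, `is_shadow_state_dir` and `ShadowStateError` are all assumptions made for illustration; only the variable names and the `codex-home/home/.openclaw` path shape come from this report.

```python
from pathlib import Path
from typing import Mapping

SHADOW_MARKER = ("codex-home", "home", ".openclaw")


class ShadowStateError(RuntimeError):
    """Raised when the resolved state dir lives inside an isolated Codex home."""


def is_shadow_state_dir(state_dir: Path) -> bool:
    """True when the path contains the codex-home/home/.openclaw sequence."""
    parts = state_dir.parts
    n = len(SHADOW_MARKER)
    return any(parts[i:i + n] == SHADOW_MARKER for i in range(len(parts) - n + 1))


def resolve_state_dir(env: Mapping[str, str]) -> Path:
    """Pick the state dir: explicit config dir, then config path, then HOME."""
    if env.get("OPENCLAW_CONFIG_DIR"):
        state_dir = Path(env["OPENCLAW_CONFIG_DIR"])
    elif env.get("OPENCLAW_CONFIG_PATH"):
        state_dir = Path(env["OPENCLAW_CONFIG_PATH"]).parent
    else:
        home = env.get("HOME") or str(Path.home())
        state_dir = Path(home) / ".openclaw"

    if is_shadow_state_dir(state_dir):
        raise ShadowStateError(
            f"refusing to use shadow state tree {state_dir}; "
            "set OPENCLAW_CONFIG_DIR to the gateway's state directory"
        )
    return state_dir
```

Called with `os.environ`, this fails loudly from inside the coordinator's Codex home instead of quietly creating a second tree, and it behaves as before for a normal root shell.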

Suggested acceptance criteria

  • Running openclaw sessions from a Codex/ACP isolated HOME does not create or mutate a shadow session store by default.
  • openclaw doctor detects divergent canonical vs shadow session stores.
  • A recovery/reset command prints the exact session store path it will mutate before writing.
  • Docs clarify the canonical gateway state path for systemd/root installs and ACP/Codex isolated homes.
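
For the doctor criterion, a check along these lines would have caught our incident. It is a sketch under assumptions: `find_divergent_shadow_stores` is a hypothetical helper, the glob pattern is taken from the shadow path observed above, and a byte-for-byte comparison of sessions.json stands in for a real comparison, since the session file schema is not something we want to guess at here.

```python
from pathlib import Path
from typing import List, Tuple

SHADOW_GLOB = "agents/*/agent/codex-home/home/.openclaw/agents/*/sessions"


def find_divergent_shadow_stores(openclaw_home: Path) -> List[Tuple[Path, Path]]:
    """Return (shadow, canonical) session directories whose sessions.json differ."""
    findings = []
    for shadow in sorted(openclaw_home.glob(SHADOW_GLOB)):
        if shadow.is_symlink():
            continue  # already linked to another store; not a separate copy
        agent = shadow.parent.name
        canonical = openclaw_home / "agents" / agent / "sessions"
        shadow_file = shadow / "sessions.json"
        canonical_file = canonical / "sessions.json"
        if not shadow_file.is_file():
            continue  # empty shadow directory, nothing to diverge
        if (not canonical_file.is_file()
                or shadow_file.read_bytes() != canonical_file.read_bytes()):
            findings.append((shadow, canonical))
    return findings


if __name__ == "__main__":
    # /root/.openclaw is the state directory from our install; adjust as needed.
    for shadow, canonical in find_divergent_shadow_stores(Path("/root/.openclaw")):
        print(f"shadow store diverges from canonical: {shadow} != {canonical}")
```

Printing both paths also covers the third criterion in spirit: the operator sees exactly which store a reset would touch.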
