2026.5.12: session recovery can split-brain when CLI runs under Codex HOME shadow state #82318

## Summary

We hit a production incident that started with the same Telegram/Codex stall symptoms tracked in #82274, but recovery exposed a separate state-path bug:

When OpenClaw CLI session recovery runs from an agent/Codex environment whose `HOME` points at an isolated Codex home, session commands can read and write a shadow OpenClaw state tree instead of the gateway's real per-agent session store. The operator believes the sessions were reset, but the running gateway keeps reading the canonical store, so agents remain stuck in `typing...` / `status=running`.

This creates a nasty recovery split-brain during exactly the failure mode where operators are trying to unstick agents.

## Environment

- OpenClaw: `2026.5.12`
- Gateway: local systemd root runtime, listening on `127.0.0.1:18789`
- Codex CLI/app-server: `0.130.0`
- Channel involved: Telegram
- Model path involved: `openai/gpt-5.5`
- Related upstream incident: #82274

## Observed evidence

During the incident the gateway logs showed repeated runtime stalls:

```text
codex app-server turn idle timed out waiting for terminal event
codex app-server client retired after timed-out turn
timeout-compaction
active-memory timeout after 15000ms
before_prompt_build handler from active-memory failed: timed out after 15000ms
```

A recovery attempt then reset sessions from a CLI/tool context where `HOME` resolved under the Codex isolated home. That touched a shadow path like:

```text
$OPENCLAW_HOME/agents/<coordinator>/agent/codex-home/home/.openclaw/agents/<target-agent>/sessions/sessions.json
```

But the running gateway was using the canonical per-agent store:

```text
$OPENCLAW_HOME/agents/<target-agent>/sessions/sessions.json
```

Result: the shadow store looked reset, while the canonical store still held stale `status=running` sessions. The stuck agents cleared only after we reran the recovery with the canonical environment and paths set explicitly.
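
To make the split concrete, here is a minimal sketch of how one path expression can land in two different trees. It assumes the session commands derive their state directory from `$HOME/.openclaw` (our reading of the behavior, not confirmed against the source) and that `$OPENCLAW_HOME` is `/root/.openclaw` on this install. `resolve_state_dir`, `session_store`, and the agent names are hypothetical.

```python
from pathlib import PurePosixPath


def resolve_state_dir(env):
    # ASSUMPTION: the state dir defaults to "$HOME/.openclaw" when nothing
    # else overrides it. This mirrors what we observed, not OpenClaw's code.
    return PurePosixPath(env["HOME"]) / ".openclaw"


def session_store(env, agent):
    # Per-agent session store layout as seen in the incident paths.
    return resolve_state_dir(env) / "agents" / agent / "sessions" / "sessions.json"


# The gateway runs as root under systemd.
gateway_env = {"HOME": "/root"}
# The recovery CLI ran inside the coordinator's isolated Codex home
# ("coordinator" and "target-agent" are placeholder names).
codex_env = {"HOME": "/root/.openclaw/agents/coordinator/agent/codex-home/home"}

canonical = session_store(gateway_env, "target-agent")
shadow = session_store(codex_env, "target-agent")

print("gateway reads:", canonical)
print("CLI resets:   ", shadow)
print("same store?   ", canonical == shadow)
```

Both processes evaluate the same logical path, so nothing in the CLI output hints that the reset went to a different file.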

## Why this is dangerous

- Recovery tools can report success against the wrong state tree.
- A stalled agent lane can survive an apparent reset.
- Operators can create multiple divergent session histories for the same agent.
- The failure mode is easy to trigger from ACP/Codex launcher contexts because `HOME` is intentionally isolated.

## Local mitigation applied

We mitigated by:

1. Backing up the shadow session directories.
2. Replacing the shadow `codex-home/home/.openclaw/agents/<agent>/sessions` directories with symlinks to the canonical `$OPENCLAW_HOME/agents/<agent>/sessions` directories.
3. Running future offline recovery with explicit canonical env:

```bash
HOME=/root \
OPENCLAW_CONFIG_DIR=/root/.openclaw \
OPENCLAW_CONFIG_PATH=/root/.openclaw/openclaw.json \
openclaw ...
```

We also moved compaction/fallback away from the same Codex runtime and tightened active-memory timeouts as local blast-radius reduction, but that does not fix the underlying state-path split.
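
Steps 1 and 2 of the mitigation can be scripted roughly as follows. This is a sketch of what we did by hand, not an OpenClaw tool: `link_shadow_to_canonical` and the backup layout are hypothetical, and it should presumably be run only while nothing is writing to either store.

```python
import shutil
from pathlib import Path


def link_shadow_to_canonical(shadow, canonical, backup_root):
    """Back up a shadow sessions dir, then replace it with a symlink.

    Returns the backup path, or None if the shadow dir was already a symlink.
    """
    shadow, canonical, backup_root = Path(shadow), Path(canonical), Path(backup_root)

    if shadow.is_symlink():
        return None  # already repaired, nothing to do

    backup = None
    if shadow.exists():
        # Hypothetical backup layout: <backup_root>/<agent>/sessions
        backup = backup_root / shadow.parent.name / "sessions"
        if backup.exists():
            raise FileExistsError(f"refusing to overwrite backup at {backup}")
        backup.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(shadow), str(backup))

    shadow.parent.mkdir(parents=True, exist_ok=True)
    # From now on, reads and writes through the shadow path hit the canonical store.
    shadow.symlink_to(canonical, target_is_directory=True)
    return backup
```

The symlink keeps the isolated `HOME` intact for Codex while removing the second copy of the session state.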

## Expected behavior

OpenClaw should not silently read/write a new state tree just because the process `HOME` is inside an isolated Codex/ACP home.

At minimum, session/state commands should do at least one of the following:

- derive the gateway home/config from `OPENCLAW_CONFIG_DIR` / `OPENCLAW_CONFIG_PATH` / the running gateway, not raw `~`;
- warn/refuse when `~/.openclaw` resolves under `agents/*/agent/codex-home/home/.openclaw`;
- expose an explicit `--home` / `--config-dir` / `--store` for recovery tools;
- make `doctor` detect shadow `codex-home/home/.openclaw/agents/*/sessions` directories and offer a safe repair or warning.
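
The first two options could be combined along these lines. This is only a sketch of the proposed behavior: `resolve_config_dir`, `is_shadow_state_dir`, and `ShadowStateError` are hypothetical names, the precedence order is our suggestion, and a real implementation would also want to resolve symlinks before matching.

```python
from pathlib import PurePosixPath

# Path components that identify a state dir inside an isolated Codex home.
SHADOW_MARKER = ("codex-home", "home", ".openclaw")


class ShadowStateError(RuntimeError):
    """Raised when the resolved state dir sits inside a Codex home."""


def is_shadow_state_dir(path):
    parts = PurePosixPath(path).parts
    n = len(SHADOW_MARKER)
    # True if the marker components appear consecutively anywhere in the path.
    return any(parts[i:i + n] == SHADOW_MARKER for i in range(len(parts) - n + 1))


def resolve_config_dir(env):
    # Suggested precedence: explicit config dir, then the config file's
    # directory, and only then the HOME-based default.
    if env.get("OPENCLAW_CONFIG_DIR"):
        candidate = PurePosixPath(env["OPENCLAW_CONFIG_DIR"])
    elif env.get("OPENCLAW_CONFIG_PATH"):
        candidate = PurePosixPath(env["OPENCLAW_CONFIG_PATH"]).parent
    else:
        candidate = PurePosixPath(env["HOME"]) / ".openclaw"

    if is_shadow_state_dir(candidate):
        raise ShadowStateError(
            f"{candidate} is inside an isolated Codex home; "
            "set OPENCLAW_CONFIG_DIR to the gateway's state dir"
        )
    return candidate
```

With this guard, a reset launched from the coordinator's Codex home would fail loudly instead of reporting success against the shadow tree.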

## Suggested acceptance criteria

- Running `openclaw sessions` from a Codex/ACP isolated `HOME` does not create or mutate a shadow session store by default.
- `openclaw doctor` detects divergent canonical vs shadow session stores.
- A recovery/reset command prints the exact session store path it will mutate before writing.
- Docs clarify the canonical gateway state path for systemd/root installs and ACP/Codex isolated homes.
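
As an illustration of the `doctor` criterion, a divergence check might look like the sketch below. It assumes the shadow layout we observed and compares `sessions.json` byte for byte, since we do not want to guess at the file's schema. `find_divergent_stores` is a hypothetical helper, not an existing command.

```python
import hashlib
from pathlib import Path

# Shadow layout observed in the incident, relative to the OpenClaw home.
SHADOW_GLOB = "agents/*/agent/codex-home/home/.openclaw/agents/*/sessions/sessions.json"


def _digest(path):
    # None for a missing file, so "missing vs present" also counts as divergent.
    return hashlib.sha256(path.read_bytes()).hexdigest() if path.is_file() else None


def find_divergent_stores(openclaw_home):
    """Return (agent, shadow_path, canonical_path) for every divergent store."""
    home = Path(openclaw_home)
    divergent = []
    for shadow in sorted(home.glob(SHADOW_GLOB)):
        agent = shadow.parent.parent.name
        canonical = home / "agents" / agent / "sessions" / "sessions.json"
        # A shadow path that already resolves to the canonical file
        # (for example via a symlinked sessions dir) is not a split.
        if canonical.exists() and shadow.resolve() == canonical.resolve():
            continue
        if _digest(shadow) != _digest(canonical):
            divergent.append((agent, shadow, canonical))
    return divergent
```

Printing the returned paths before any write would also cover the criterion that a reset names the exact store it is about to mutate.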

