Summary
We hit a production incident that started with the same Telegram/Codex stall symptoms tracked in #82274, but recovery exposed a separate state-path bug:
When OpenClaw CLI/session recovery is run from an agent/Codex environment whose HOME points at an isolated Codex home, session commands can read/write a shadow OpenClaw state tree instead of the gateway's real per-agent session store. The operator thinks sessions were reset, but the running gateway continues reading the canonical store and agents remain stuck in typing... / status=running.
This creates a nasty recovery split-brain during exactly the failure mode where operators are trying to unstick agents.
Environment
- 2026.5.12
- 127.0.0.1:18789
- 0.130.0
- openai/gpt-5.5
Observed evidence
During the incident the gateway logs showed repeated runtime stalls:
codex app-server turn idle timed out waiting for terminal event
codex app-server client retired after timed-out turn
timeout-compaction
active-memory timeout after 15000ms
before_prompt_build handler from active-memory failed: timed out after 15000ms
A recovery attempt then reset sessions from a CLI/tool context where HOME resolved under the Codex isolated home. That touched a shadow path like:
$OPENCLAW_HOME/agents/<coordinator>/agent/codex-home/home/.openclaw/agents/<target-agent>/sessions/sessions.json
But the running gateway was using the canonical per-agent store:
$OPENCLAW_HOME/agents/<target-agent>/sessions/sessions.json
Result: the shadow store looked reset, while the canonical store still had stale status=running sessions. Only after forcing the recovery with canonical env/path did the stuck agents clear.
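To make the split concrete, here is a minimal sketch of how the same session command can land on two different stores depending on the process environment. It assumes the state root is taken from OPENCLAW_CONFIG_DIR when set and otherwise falls back to $HOME/.openclaw; that resolution order is an assumption for illustration, not confirmed from the OpenClaw source. session_store_path is a hypothetical helper and the agent names and paths are examples.

```sh
# Hypothetical helper (not an OpenClaw command): print the sessions.json path
# a session command would touch for the agent named in $1.
# ASSUMPTION: state root = OPENCLAW_CONFIG_DIR if set, else $HOME/.openclaw.
session_store_path() {
  printf '%s/agents/%s/sessions/sessions.json\n' \
    "${OPENCLAW_CONFIG_DIR:-$HOME/.openclaw}" "$1"
}

# Same agent, two process environments (example paths only).
# Subshells keep the caller's HOME untouched.
(
  unset OPENCLAW_CONFIG_DIR
  HOME=/root
  session_store_path target-agent
)
(
  unset OPENCLAW_CONFIG_DIR
  HOME=/root/.openclaw/agents/coordinator/agent/codex-home/home
  session_store_path target-agent
)
```

Under that assumption the first call prints /root/.openclaw/agents/target-agent/sessions/sessions.json and the second prints /root/.openclaw/agents/coordinator/agent/codex-home/home/.openclaw/agents/target-agent/sessions/sessions.json, which has the same shape as the shadow path above.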
Why this is dangerous
- Recovery tools can report success against the wrong state tree.
- A stalled agent lane can survive an apparent reset.
- Operators can create multiple divergent session histories for the same agent.
- The failure mode is easy to trigger from ACP/Codex launcher contexts because HOME is intentionally isolated.
Local mitigation applied
We mitigated by:
- Backing up the shadow session directories.
- Replacing the shadow codex-home/home/.openclaw/agents/<agent>/sessions directories with symlinks to the canonical $OPENCLAW_HOME/agents/<agent>/sessions directories.
- Running future offline recovery with explicit canonical env:
HOME=/root \
OPENCLAW_CONFIG_DIR=/root/.openclaw \
OPENCLAW_CONFIG_PATH=/root/.openclaw/openclaw.json \
openclaw ...
We also moved compaction/fallback away from the same Codex runtime and tightened active-memory timeouts as local blast-radius reduction, but that does not fix the underlying state-path split.
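The backup-and-symlink step can be sketched roughly as follows. link_shadow_sessions is a hypothetical helper, not part of OpenClaw; it assumes the directory layout shown in the observed evidence and takes the canonical state root, the coordinator agent id, and the target agent id as arguments.

```sh
# Hypothetical helper, not part of OpenClaw. Arguments:
#   $1 = canonical state root (the $OPENCLAW_HOME used in this report)
#   $2 = coordinator agent id whose codex-home holds the shadow tree
#   $3 = target agent id
# Note: plain POSIX sh, so the variables below are global.
link_shadow_sessions() {
  root=$1
  coordinator=$2
  agent=$3
  canonical="$root/agents/$agent/sessions"
  shadow="$root/agents/$coordinator/agent/codex-home/home/.openclaw/agents/$agent/sessions"

  # Refuse to link to a canonical store that does not exist.
  if [ ! -d "$canonical" ]; then
    echo "missing canonical store: $canonical" >&2
    return 1
  fi

  # Already a symlink: nothing to do.
  if [ -L "$shadow" ]; then
    return 0
  fi

  # Back up the divergent shadow history instead of deleting it.
  if [ -e "$shadow" ]; then
    mv "$shadow" "$shadow.bak.$(date +%Y%m%d%H%M%S)" || return 1
  fi

  mkdir -p "$(dirname "$shadow")" || return 1
  ln -s "$canonical" "$shadow"
}
```

A call would look like link_shadow_sessions "$OPENCLAW_HOME" <coordinator> <target-agent>, repeated per affected agent. Moving the shadow directory aside rather than removing it keeps the divergent history available for later inspection.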
Expected behavior
OpenClaw should not silently read/write a new state tree just because the process HOME is inside an isolated Codex/ACP home.
At minimum, session/state commands should do one of:
- derive the gateway home/config from OPENCLAW_CONFIG_DIR / OPENCLAW_CONFIG_PATH / the running gateway, not raw ~;
- warn/refuse when ~/.openclaw resolves under agent/*/codex-home/home/.openclaw;
- expose an explicit --home / --config-dir / --store for recovery tools;
- make doctor detect shadow codex-home/home/.openclaw/agents/*/sessions directories and offer a safe repair or warning.
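As a rough illustration of the warn/refuse option, a recovery wrapper could run a pre-flight check like the one below. check_state_root is a hypothetical function, not an existing OpenClaw feature, and it again assumes the state root is OPENCLAW_CONFIG_DIR when set and $HOME/.openclaw otherwise. It matches on the codex-home/home/.openclaw suffix only and does not resolve symlinks.

```sh
# Hypothetical pre-flight check for a recovery script; not an OpenClaw feature.
# ASSUMPTION: state root = OPENCLAW_CONFIG_DIR if set, else $HOME/.openclaw.
check_state_root() {
  state_root=${OPENCLAW_CONFIG_DIR:-$HOME/.openclaw}
  case "$state_root" in
    */codex-home/home/.openclaw | */codex-home/home/.openclaw/*)
      echo "refusing to touch shadow state tree: $state_root" >&2
      echo "set HOME / OPENCLAW_CONFIG_DIR to the gateway's canonical values" >&2
      return 1
      ;;
  esac
  # Print the exact root before any write so the operator can confirm it.
  echo "state root: $state_root"
}
```

Used as check_state_root && openclaw ..., the wrapper stops before any session command runs from an isolated Codex home, and otherwise shows which tree is about to be modified.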
Suggested acceptance criteria
- Running openclaw sessions from a Codex/ACP isolated HOME does not create or mutate a shadow session store by default.
- openclaw doctor detects divergent canonical vs shadow session stores.
- A recovery/reset command prints the exact session store path it will mutate before writing.
- Docs clarify the canonical gateway state path for systemd/root installs and ACP/Codex isolated homes.
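For the doctor criterion, the detection could look something like this sketch. find_shadow_session_dirs is a hypothetical helper and only approximates what a real check would need: it assumes the layout from the observed evidence and compares sessions.json byte for byte, treating a missing file on either side as divergent.

```sh
# Hypothetical detection pass; not the actual openclaw doctor implementation.
# $1 = canonical state root. Prints one tab-separated line per shadow store:
#   <identical|DIVERGENT> <shadow dir> <canonical dir>
find_shadow_session_dirs() {
  root=$1
  for shadow in "$root"/agents/*/agent/codex-home/home/.openclaw/agents/*/sessions; do
    # Skip the literal pattern when the glob matched nothing.
    [ -e "$shadow" ] || continue
    # Skip directories already replaced by a symlink.
    if [ -L "$shadow" ]; then
      continue
    fi
    agent=$(basename "$(dirname "$shadow")")
    canonical="$root/agents/$agent/sessions"
    # cmp -s is silent; a missing file on either side also counts as divergent.
    if cmp -s "$shadow/sessions.json" "$canonical/sessions.json"; then
      state=identical
    else
      state=DIVERGENT
    fi
    printf '%s\t%s\t%s\n' "$state" "$shadow" "$canonical"
  done
  return 0
}
```

Any DIVERGENT line means a reset may have been applied to one tree and not the other, which is the condition an operator needs to see before trusting a recovery result.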