2026.5.12: session reset reuses stale sessionFile; codex-acp orphans can leave agents stuck #82364

@chac4l

Summary

We hit a production incident on OpenClaw 2026.5.12 where multiple agent sessions appeared stuck in Telegram and the OpenClaw frontend displayed transcripts that did not match the real Telegram topic/DM.

Two separate but compounding problems showed up:

  1. Session reset rotates sessionId but can keep the old sessionFile path. The session store ends up with a fresh sessionId pointing at an unrelated/stale transcript file. The frontend then renders the wrong chat history for that session key.
  2. Failed Codex ACP/acpx launches can leave orphan codex-acp processes parented to PID 1. OpenClaw reported no active ACP tasks, but the OS process listing still showed orphan codex-acp processes. These correlated with sluggish/stuck agent behavior and had to be cleaned up manually.

This looks related to #82274 / #82343 for Codex delivery stalls and #82318 for recovery state split-brain, but the sessionFile reuse below is a more direct reset-path bug.

Environment

  • OpenClaw: 2026.5.12 (f066dd2)
  • Codex CLI/app-server: 0.130.0
  • Node: v22.22.0
  • OS: Ubuntu Linux 5.15.0-173-generic x86_64
  • Gateway: local systemd runtime, openclaw-gateway active
  • Channels affected: Telegram direct and Telegram topic sessions
  • Agents affected: multiple configured agents sharing the same gateway
  • Install path inspected: npm/global OpenClaw package under /usr/lib/node_modules/openclaw

Private chat IDs/session keys are intentionally redacted below.

Bug A: reset creates a new sessionId but preserves stale sessionFile

During incident recovery, two agent session entries had this bad shape:

sessionKey=<redacted telegram topic session>
sessionId=<new UUID>
sessionFile=<old UUID>.jsonl

In one concrete case, the sessionFile pointed at a completely unrelated older transcript from a cron/review task. The current frontend session URL for the Telegram topic therefore showed an unrelated transcript instead of the messages in that Telegram topic.

After manually backing up and rewriting those entries so sessionFile matched the new sessionId, the frontend history stopped showing the wrong transcript.

Source-level evidence

In the installed 2026.5.12 bundle, performGatewaySessionReset generates nextSessionId, but then passes the current entry's existing sessionFile back into resolveSessionFilePath:

oldSessionId = currentEntry?.sessionId;
oldSessionFile = currentEntry?.sessionFile;
const now = Date.now();
const nextSessionId = randomUUID();
const nextEntry = {
  sessionId: nextSessionId,
  sessionFile: resolveSessionFilePath(nextSessionId, currentEntry?.sessionFile ? { sessionFile: currentEntry.sessionFile } : void 0, resolveSessionFilePathOptions({
    storePath,
    agentId: sessionAgentId
  })),
  updatedAt: now,
  systemSent: false,
  abortedLastRun: false,

resolveSessionFilePath trusts any provided entry.sessionFile candidate before deriving a path from sessionId:

function resolveSessionFilePath(sessionId, entry, opts) {
  const sessionsDir = resolveSessionsDir(opts);
  const candidate = entry?.sessionFile?.trim();
  if (candidate) try {
    return resolvePathWithinSessionsDir(sessionsDir, candidate, { agentId: opts?.agentId });
  } catch {}
  return resolveSessionTranscriptPathInDir(sessionId, sessionsDir);
}

So on reset, nextSessionId changes but sessionFile can remain the previous transcript path.

There is already a helper elsewhere that seems intended to solve exactly this class of problem:

function rewriteSessionFileForNewSessionId(params) {
  const trimmed = normalizeOptionalString(params.sessionFile);
  if (!trimmed) return;
  const base = path.basename(trimmed);
  if (!base.endsWith(".jsonl")) return;
  const withoutExt = base.slice(0, -6);
  if (withoutExt === params.previousSessionId) return path.join(path.dirname(trimmed), `${params.nextSessionId}.jsonl`);
  if (withoutExt.startsWith(`${params.previousSessionId}-topic-`)) return path.join(path.dirname(trimmed), `${params.nextSessionId}${base.slice(params.previousSessionId.length)}`);
}

But the reset service path above does not appear to use it.
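
As an illustration of the expected mapping, here is a minimal sketch of how the reset path could derive the next sessionFile from the helper quoted above. The helper body is copied from the bundle snippet. normalizeOptionalString is a stand-in written for this example (assumed to trim and return undefined for empty input), and nextSessionFileForReset is a hypothetical wrapper, not an existing OpenClaw function.

```javascript
const path = require("node:path");

// Stand-in for the bundle's normalizeOptionalString.
// ASSUMPTION: it trims strings and returns undefined for empty/non-string input.
function normalizeOptionalString(value) {
  if (typeof value !== "string") return undefined;
  const trimmed = value.trim();
  return trimmed ? trimmed : undefined;
}

// Copied from the 2026.5.12 bundle snippet quoted in this issue.
function rewriteSessionFileForNewSessionId(params) {
  const trimmed = normalizeOptionalString(params.sessionFile);
  if (!trimmed) return;
  const base = path.basename(trimmed);
  if (!base.endsWith(".jsonl")) return;
  const withoutExt = base.slice(0, -6);
  if (withoutExt === params.previousSessionId) return path.join(path.dirname(trimmed), `${params.nextSessionId}.jsonl`);
  if (withoutExt.startsWith(`${params.previousSessionId}-topic-`)) return path.join(path.dirname(trimmed), `${params.nextSessionId}${base.slice(params.previousSessionId.length)}`);
}

// HYPOTHETICAL wrapper: returns the rewritten path, or undefined when the old
// sessionFile does not belong to the old sessionId. In that case the caller
// would derive a fresh path from nextSessionId instead of reusing the old file.
function nextSessionFileForReset(currentEntry, nextSessionId) {
  if (!currentEntry?.sessionId) return undefined;
  return rewriteSessionFileForNewSessionId({
    sessionFile: currentEntry.sessionFile,
    previousSessionId: currentEntry.sessionId,
    nextSessionId,
  });
}

// Topic suffix is preserved while the UUID part is rotated.
console.log(nextSessionFileForReset({ sessionId: "old", sessionFile: "/sessions/old-topic-123.jsonl" }, "new"));
// An unrelated stale transcript is not reused.
console.log(nextSessionFileForReset({ sessionId: "old", sessionFile: "/sessions/unrelated.jsonl" }, "new"));
```

With this shape, an entry that already points at an unrelated transcript yields undefined, so the stale path would never be passed into resolveSessionFilePath as a trusted candidate.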

Expected behavior

When a session reset rotates from oldSessionId to nextSessionId, the persisted sessionFile should also be rewritten to the matching transcript path, preserving topic suffixes such as -topic-<id> when present.

The store should not persist:

sessionId=<new UUID>
sessionFile=<old UUID>.jsonl

unless this is an intentional fork/checkpoint reference and is marked as such.

Bug B: codex-acp orphan processes after failed initialize

The same incident also involved Codex ACP/acpx launches that failed before initialize. Task output showed that ACP failed before initialize and that the direct acpx@0.6.1 fallback failed the same way.

After that, OS process listing showed codex-acp processes with PPID=1, while OpenClaw reported no active ACP tasks. They were invisible to normal task accounting and had to be killed manually.

Sanitized shape of the evidence:

openclaw tasks --runtime acp --json  -> active=[]
ps -eo pid,ppid,etime,cmd | grep codex-acp -> codex-acp rows with PPID=1
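
The ps check above can be scripted. The following is a minimal sketch that parses `ps -eo pid,ppid,etime,cmd` output and keeps only codex-acp rows reparented to PID 1. findOrphanAcpProcesses is a hypothetical helper, and the sample rows are made up for illustration; they are not taken from the incident.

```javascript
// HYPOTHETICAL helper: parse `ps -eo pid,ppid,etime,cmd` output and return
// rows whose command matches `pattern` and whose parent is PID 1.
function findOrphanAcpProcesses(psOutput, pattern = /codex-acp/) {
  const orphans = [];
  for (const line of psOutput.split("\n")) {
    // Columns: PID, PPID, ELAPSED, then the full command (may contain spaces).
    const match = line.match(/^\s*(\d+)\s+(\d+)\s+(\S+)\s+(.*)$/);
    if (!match) continue; // skips the header row and blank lines
    const [, pid, ppid, etime, cmd] = match;
    if (Number(ppid) === 1 && pattern.test(cmd)) {
      orphans.push({ pid: Number(pid), ppid: Number(ppid), etime, cmd });
    }
  }
  return orphans;
}

// Made-up sample output for illustration only.
const sample = [
  "    PID    PPID     ELAPSED CMD",
  "   4321       1    02:13:45 node /opt/example/codex-acp",
  "   4400    4399       00:01 grep codex-acp",
  "   5000       1  1-03:00:00 /usr/sbin/sshd -D",
].join("\n");

console.log(findOrphanAcpProcesses(sample).map((row) => row.pid));
```

Filtering on PPID=1 also drops the grep process itself, since its parent is the invoking shell. On hosts where a subreaper adopts orphans, the parent PID would differ and the filter would need adjusting.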

Impact observed locally:

  • Telegram sessions stayed in "typing..." and never delivered a final response.
  • Gateway/message actions became very slow.
  • Manual cleanup of orphan codex-acp processes plus session store repair restored normal behavior.

Expected behavior

If ACP/acpx fails before initialize, OpenClaw should guarantee child process cleanup or track and reap failed launch descendants.

At minimum:

  • failed ACP initialize should not leave codex-acp under PID 1;
  • openclaw tasks --runtime acp or doctor should surface orphan ACP descendants;
  • gateway restart/doctor should provide a safe reap/repair path.

Why this matters

This failure is very visible to users:

  • Telegram says the bot is typing forever.
  • The OpenClaw frontend shows a different transcript than the actual Telegram conversation.
  • Recovery is confusing because the session entry can look reset while still rendering stale history.
  • Orphan ACP processes are outside normal OpenClaw task visibility, so operators can miss the real reason the host feels stuck.

Related issues

  • #82274 / #82343: Codex delivery stalls
  • #82318: recovery state split-brain

Suggested fix direction

For the session reset path:

  • Use the existing rewriteSessionFileForNewSessionId behavior in performGatewaySessionReset, or equivalent logic.
  • Add a regression test that starts with sessionId=old and sessionFile=old-topic-123.jsonl, performs a reset that produces sessionId=new, and expects sessionFile=new-topic-123.jsonl.
  • Add a store integrity check: flag entries where the UUID in the sessionFile basename does not match entry.sessionId, unless the entry is explicitly marked as checkpoint/fork/legacy.
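
A rough sketch of the integrity check, assuming the session store is a plain object that maps session keys to entries with sessionId and sessionFile fields (the real sessions.json shape may differ). findMismatchedSessionFiles and the isExempt callback are hypothetical names. How a checkpoint/fork/legacy entry would actually be marked is not known from the bundle, so that decision is left to the caller.

```javascript
const path = require("node:path");

// HYPOTHETICAL integrity check. ASSUMPTION: `store` maps sessionKey -> entry,
// where entry has string fields `sessionId` and `sessionFile`.
// `isExempt` lets the caller skip entries it considers intentional
// (checkpoint/fork/legacy); the marker itself is not defined here.
function findMismatchedSessionFiles(store, isExempt = () => false) {
  const problems = [];
  for (const [sessionKey, entry] of Object.entries(store)) {
    if (!entry || typeof entry.sessionId !== "string" || typeof entry.sessionFile !== "string") continue;
    const base = path.basename(entry.sessionFile.trim());
    if (!base.endsWith(".jsonl")) continue;
    const stem = base.slice(0, -".jsonl".length);
    // Accept "<sessionId>.jsonl" and "<sessionId>-topic-<id>.jsonl".
    const matches = stem === entry.sessionId || stem.startsWith(`${entry.sessionId}-topic-`);
    if (!matches && !isExempt(entry)) {
      problems.push({ sessionKey, sessionId: entry.sessionId, sessionFile: entry.sessionFile });
    }
  }
  return problems;
}

// Made-up store for illustration.
const exampleStore = {
  "topic-a": { sessionId: "new-1", sessionFile: "/sessions/new-1-topic-123.jsonl" },
  "topic-b": { sessionId: "new-2", sessionFile: "/sessions/old-2.jsonl" },
};
console.log(findMismatchedSessionFiles(exampleStore).map((p) => p.sessionKey));
```

A check like this could run in doctor or at gateway start and only report, leaving any rewrite to an explicit repair step.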

For ACP orphan cleanup:

  • Make failed pre-initialize launches kill the whole process group.
  • Track ACP child PIDs early enough that pre-initialize failures are still accounted for.
  • Add doctor detection for codex-acp / acpx descendants with PPID=1 or no matching OpenClaw task.
  • Consider logging a clear warning when ACP process cleanup fails.
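
One possible shape for the process-group approach, sketched with Node's child_process module. This is POSIX-only and is not how OpenClaw currently launches ACP; launchInOwnProcessGroup and killProcessGroup are hypothetical names. It relies on `detached: true` making the child the leader of a new process group, and on `process.kill` with a negative PID signalling every member of that group.

```javascript
const { spawn } = require("node:child_process");

// HYPOTHETICAL launcher: start the child as the leader of its own process
// group so that its descendants can be signalled together later (POSIX only).
function launchInOwnProcessGroup(command, args) {
  const child = spawn(command, args, { detached: true, stdio: "ignore" });
  // Spawn failures (for example ENOENT) arrive as an "error" event; they are
  // ignored here because callers can detect them via child.pid === undefined.
  child.on("error", () => {});
  return child;
}

// Signal the whole group led by `child`. Returns false when there is nothing
// to signal (spawn failed, or the group no longer exists).
function killProcessGroup(child, signal = "SIGTERM") {
  if (child.pid === undefined) return false;
  try {
    process.kill(-child.pid, signal); // negative PID targets the process group
    return true;
  } catch (err) {
    if (err.code === "ESRCH") return false; // group already gone
    throw err;
  }
}
```

A launcher built this way could call killProcessGroup from the failure branch of the initialize handshake, and record child.pid in task state right after spawn so that pre-initialize failures remain visible to task accounting.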

Local mitigation applied

We mitigated locally by:

  1. Backing up the affected sessions.json stores.
  2. Rewriting mismatched sessionFile entries to match their sessionId.
  3. Clearing stale status=running entries for sessions whose run had already died.
  4. Killing only orphan codex-acp processes with PPID=1.
  5. Revalidating that mismatched session entries were gone and no codex-acp orphan remained.

No secrets or private message content are needed to reproduce the source-level session reset bug; the snippets above should be enough to locate it.
