[Bug] Orphaned/oversized native Codex thread wedges a session permanently — chat.send returns started but no run executes, silently dropping messages #86963

@btilus

[Bug] Orphaned/oversized native Codex thread wedges a session permanently — chat.send returns "started" but no run executes, silently dropping every message

OpenClaw version: 2026.5.22 (a374c3a)
Component: Agents/Codex · Codex app-server native threads · context compaction
Severity: High — silent, permanent message loss on an affected session (no error surfaced)

Summary

A Codex (openai-codex / gpt-5.5) WebChat session became permanently wedged: it reports status=done, chat.send returns "started", lastInteractionAt advances — but no agent run ever executes, nothing is written to the transcript, and the user gets no reply. Every subsequent message is silently dropped. The session sits at ~212k tokens with compactionCount=0.

The existing Codex compaction-recovery mechanisms do not clear it:

  • The native Codex app-server restarts on dispatch (observed: a fresh app-server process spawned at the exact moment of a new send), yet the very next turn still stalls. The "restart the native app-server and retry once when server-side compaction times out" path (Recover stuck Codex compaction #85500) therefore appears to reload the same oversized thread and stall again.
  • "Rotate oversized native Codex threads before resume" (Guard Codex app-server context budgets #82981) does not appear to fire for this thread.
  • A full gateway restart does not help (the oversized native thread is persisted in codex-home and restored on resume).

Timeline / how it got into this state

The session's last successful turn ended while the agent was mid-work during a gateway reload (the assistant's final transcript line was literally "I got stuck at the gateway reload step … recovered"). After that point the native Codex thread appears orphaned/oversized, and the session never runs another turn.

Reproduction signature (from gateway log)

chat.send "started" (runId …)            # accepted in ~200-340ms
codex plugin thread config eligibility    # thread check runs
<then nothing — no run, no compaction completion, no transcript write, no error>
webchat disconnect (user reloads)

openclaw tasks audit does not flag this (there is no stuck task — it's a thread/compaction-layer stall). The only visible signature is in the session registry: status=done with lastInteractionAt advancing far past endedAt while compactionCount stays 0.
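
Until an audit covers this case, the registry signature above can be checked by hand. The following is a minimal sketch of that check, not OpenClaw code: the field names mirror the registry values quoted in this report, while the `SessionRecord` shape, the `looks_wedged` helper, the 5-minute grace period, and the 200k token floor are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class SessionRecord:
    # Hypothetical record shape; only the field meanings come from the report.
    status: str
    ended_at: datetime
    last_interaction_at: datetime
    compaction_count: int
    total_tokens: int


def looks_wedged(rec, grace=timedelta(minutes=5), token_floor=200_000):
    """Return True when a session matches the wedged signature.

    The grace period and token floor are assumed values, not OpenClaw defaults.
    """
    return (
        rec.status == "done"
        and rec.compaction_count == 0
        and rec.total_tokens >= token_floor
        # Sends are still being accepted long after the last run ended.
        and rec.last_interaction_at - rec.ended_at > grace
    )
```

A session that ended normally and then received one late message would also trip the time check, so the token floor and the compaction count are what narrow this to the oversized-thread case.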

Expected

A session must never silently drop messages. Either:

  • preflight compaction on an oversized native thread must succeed or force-rotate the thread (start a fresh native thread seeded from a summary) so the turn can proceed, or
  • if the thread cannot be made runnable, surface a visible error to the user (and ideally to tasks audit / health) rather than accepting chat.send and silently running nothing.
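
The two options above amount to a preflight decision that never ends in "accepted, but nothing runs". A rough sketch of that decision follows, under stated assumptions: `Action`, `plan_dispatch`, and the boolean inputs are hypothetical names for illustration, and the token budget is a placeholder rather than a known OpenClaw or Codex limit.

```python
from enum import Enum


class Action(Enum):
    RUN = "run"
    COMPACT_THEN_RUN = "compact_then_run"
    ROTATE_THEN_RUN = "rotate_then_run"
    REJECT_WITH_ERROR = "reject_with_error"


def plan_dispatch(thread_tokens, budget, compaction_ok, rotation_ok):
    """Decide what to do with an incoming send before reporting "started".

    compaction_ok and rotation_ok stand in for the outcomes of hypothetical
    preflight steps (compact the native thread, or start a fresh thread
    seeded from a summary). Every branch either runs the turn or fails loudly.
    """
    if thread_tokens <= budget:
        return Action.RUN
    if compaction_ok:
        return Action.COMPACT_THEN_RUN
    if rotation_ok:
        return Action.ROTATE_THEN_RUN
    # Nothing made the thread runnable: surface an error instead of
    # accepting the message and silently running nothing.
    return Action.REJECT_WITH_ERROR
```

The point of the sketch is the last branch: the failure observed in this report corresponds to a path where none of the first three outcomes holds and yet the send is still acknowledged.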

Notes
