Any Single Engine Crash Leaves Orphaned Threads That Cause a Persistent Crash/Resume Loop #16271

@raederhans

What version of the Codex App are you using (From “About Codex” dialog)?

26.325 (internal engine: 0.118.0-alpha.2)

What subscription do you have?

ChatGPT Pro

What platform is your computer?

Windows 11 Pro 10.0.26200

What issue are you seeing?

When the Codex engine process exits unexpectedly (for any reason — MCP failure, personality error, OOM, etc.), the Desktop App correctly detects the crash, auto-restarts the engine, and attempts to resume the interrupted conversation thread. However, the resume logic has a cascading failure mode that makes recovery impossible without manual intervention:

1. The interrupted thread's rollout file is never written (because the crash happened before the file was flushed).
2. On restart, the App sends thread/resume, which succeeds (thread goes to running).
3. The App then sends a second thread/resume with config overrides, but since the thread is already running, the overrides are silently dropped with WARN thread/resume overrides ignored for running thread.
4. The App attempts to thread/archive the thread, which fails: "no rollout found for thread id ".
5. Approximately 90 seconds later, the engine crashes again with exit code 3221225786 (STATUS_CONTROL_C_EXIT / 0xC000013A).
6. Return to step 1. The loop is infinite and requires manual state database deletion to break.

Relevant log sequence (timestamps from %LOCALAPPDATA%\Packages\OpenAI.Codex_*\LocalCache\Local\Codex\Logs)

19:28:09 thread/resume → success (latestTurnStatus=interrupted → running)
19:29:09 thread/resume again → WARN: "thread/resume overrides ignored for running thread
         019d4035-...: config overrides were provided and ignored while running;
         developerInstructions override was provided and ignored while running"
19:29:13 thread/archive → ERROR: "no rollout found for thread id 019d4035-..."
19:30:48 app_server_connection.closed code=3221225786
         → fatal_error_broadcasted
         → cause=start_process → reconnecting
(loop repeats)
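
For anyone matching the exit code in the log against the NTSTATUS name: the log prints the code as an unsigned decimal, while other tools may show it as hex or as a signed 32-bit integer. A small sketch of the conversion follows (the helper name `describe_exit_code` is made up for illustration and is not part of Codex):

```python
def describe_exit_code(code: int) -> dict:
    """Render a Windows process exit code as unsigned hex and signed 32-bit."""
    unsigned = code & 0xFFFFFFFF  # keep the low 32 bits
    # Values with the top bit set appear negative when read as signed 32-bit.
    signed = unsigned - 0x1_0000_0000 if unsigned & 0x8000_0000 else unsigned
    return {"hex": f"0x{unsigned:08X}", "signed": signed}


info = describe_exit_code(3221225786)
print(info["hex"])     # 0xC000013A
print(info["signed"])  # -1073741510
```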

What steps can reproduce the bug?

During our session, the initial crash that started the loop was caused by one of three separate triggers, each of which is fatal on its own (see related issues):

- WARN codex_protocol::openai_models: Model personality requested but model_messages is missing for model=gpt-5.4 personality=friendly → crash
- WARN rmcp::transport::worker: worker quit with fatal: Transport channel closed, when AuthRequired from plugin-injected Stripe MCP → crash
- WARN codex_core::mcp_connection_manager: Failed to list resource templates for MCP server 'playwright': Mcp error: -32601: Method not found → crash

Any of these triggers leaves a thread in interrupted state with no rollout file, which then trips the resume loop.

What is the expected behavior?

The engine should gracefully handle a thread that has status=interrupted but no corresponding rollout file, either by marking it as unrecoverable and skipping it or by creating an empty rollout as a placeholder. A second thread/resume call with overrides on an already-running thread should not cause instability.
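
To make the expected behavior concrete, here is a rough sketch of the check I have in mind, written in Python for brevity. Everything in it is hypothetical: `ThreadRecord`, `RecoveryAction`, and `plan_recovery` are illustrative names, not the engine's actual types or API, and the real state schema may differ.

```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import Optional


class RecoveryAction(Enum):
    RESUME = "resume"
    MARK_UNRECOVERABLE = "mark_unrecoverable"
    LEAVE_ALONE = "leave_alone"


@dataclass
class ThreadRecord:
    # Hypothetical shape; field names are assumptions, not the real schema.
    thread_id: str
    status: str                   # e.g. "interrupted" or "running"
    rollout_path: Optional[Path]  # None if no rollout location was recorded


def plan_recovery(thread: ThreadRecord, create_placeholder: bool = False) -> RecoveryAction:
    """Decide what to do with a thread after an engine restart."""
    if thread.status != "interrupted":
        # Only interrupted threads are candidates for auto-resume.
        return RecoveryAction.LEAVE_ALONE
    if thread.rollout_path is not None and thread.rollout_path.is_file():
        return RecoveryAction.RESUME
    if create_placeholder and thread.rollout_path is not None:
        # Option 2 from the text: write an empty rollout so later calls find it.
        thread.rollout_path.parent.mkdir(parents=True, exist_ok=True)
        thread.rollout_path.touch()
        return RecoveryAction.RESUME
    # Option 1 from the text: skip the thread instead of resuming it.
    return RecoveryAction.MARK_UNRECOVERABLE


orphan = ThreadRecord("example-thread", "interrupted", None)
print(plan_recovery(orphan).value)  # mark_unrecoverable
```

Either branch would stop the App from resuming a thread that can never be archived, which is the state that currently feeds the loop.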

Additional information

The crash loop is unrecoverable until the user manually deletes state_5.sqlite*, session_index.jsonl, and sessions/.
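
For others hitting this, a slightly safer variant of the manual workaround is to move those files aside instead of deleting them. The sketch below assumes the Codex App is fully closed and that you pass in the directory that actually contains these files on your machine; `codex_dir`, `backup_dir`, and `quarantine_state` are placeholder names of mine, not anything provided by Codex.

```python
import shutil
from pathlib import Path
from typing import List

# File and directory names taken from the workaround described above.
STATE_PATTERNS = ("state_5.sqlite*", "session_index.jsonl", "sessions")


def quarantine_state(codex_dir: Path, backup_dir: Path) -> List[Path]:
    """Move the listed state files out of codex_dir; return their new paths."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for pattern in STATE_PATTERNS:
        # sorted() materializes the matches before anything is moved.
        for path in sorted(codex_dir.glob(pattern)):
            target = backup_dir / path.name
            shutil.move(str(path), str(target))
            moved.append(target)
    return moved
```

Calling `quarantine_state(Path(<your state directory>), Path(<an empty backup folder>))` returns the list of moved paths, so the files can be restored later if needed. Use a fresh backup folder each time so nothing is overwritten.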

Labels: app, bug, session, windows-os
