fix(codex): prevent gateway crash when app-server subprocess terminates abruptly#67947
Conversation
Greptile SummaryFixes a gateway crash caused by unhandled Confidence Score: 5/5Safe to merge — fixes a real gateway crash with targeted, well-tested changes and no regressions to existing behavior. All three root causes are addressed correctly. closeWithError() is idempotent (guarded by if (this.closed) return), so the new transport teardown call is safe even when exit fires alongside an EPIPE error. The optional chaining on child.stdin.on?.() correctly skips registration for transports that don't expose on. No P0/P1 findings; remaining observations are cosmetic. No files require special attention. Reviews (1): Last reviewed commit: "fix(codex): handle stdin EPIPE in CodexA..." | Re-trigger Greptile |
|
Jejak |
a750b29 to
34c947e
Compare
…eway crash When the codex-acp subprocess terminates abruptly, writing to its dead stdin pipe emits an unhandled EPIPE error that crashes the entire gateway daemon. Root cause: writeMessage() wrote to child.stdin without checking the closed state, child.stdin had no error event handler, and closeWithError() did not clean up the readline interface or transport — leaving the client in a half-closed state that allowed further writes. Fix: attach an error handler on child.stdin, guard writeMessage() against writes after close, and align closeWithError() cleanup with close().
…#67947 - transport.ts: narrow stdin.on? from (event: string) to (event: "error") so the error-handler contract is explicit and misuse is caught at compile time - CHANGELOG.md: add Fixes entry for openclaw#67947 under ## Unreleased
34c947e to
b2ec11b
Compare
…es abruptly (openclaw#67947) Fixes openclaw#67886. Handles stdin EPIPE in CodexAppServerClient by attaching an error handler, guarding writeMessage against writes after close, and aligning closeWithError cleanup with close.
…es abruptly (openclaw#67947) Fixes openclaw#67886. Handles stdin EPIPE in CodexAppServerClient by attaching an error handler, guarding writeMessage against writes after close, and aligning closeWithError cleanup with close.
…es abruptly (openclaw#67947) Fixes openclaw#67886. Handles stdin EPIPE in CodexAppServerClient by attaching an error handler, guarding writeMessage against writes after close, and aligning closeWithError cleanup with close.
…es abruptly (openclaw#67947) Fixes openclaw#67886. Handles stdin EPIPE in CodexAppServerClient by attaching an error handler, guarding writeMessage against writes after close, and aligning closeWithError cleanup with close.
…es abruptly (openclaw#67947) Fixes openclaw#67886. Handles stdin EPIPE in CodexAppServerClient by attaching an error handler, guarding writeMessage against writes after close, and aligning closeWithError cleanup with close.
…es abruptly (openclaw#67947) Fixes openclaw#67886. Handles stdin EPIPE in CodexAppServerClient by attaching an error handler, guarding writeMessage against writes after close, and aligning closeWithError cleanup with close.
…es abruptly (openclaw#67947) Fixes openclaw#67886. Handles stdin EPIPE in CodexAppServerClient by attaching an error handler, guarding writeMessage against writes after close, and aligning closeWithError cleanup with close.
…es abruptly (openclaw#67947) Fixes openclaw#67886. Handles stdin EPIPE in CodexAppServerClient by attaching an error handler, guarding writeMessage against writes after close, and aligning closeWithError cleanup with close.
Summary
Problem: When the
codex-acpsubprocess terminates abruptly (e.g. misconfigured binary, interactive prompt on stdout, deprecated config error),CodexAppServerClient.writeMessage()atextensions/codex/src/app-server/client.ts:215writes a JSON-RPC payload to the dead subprocess'sstdin. Node.js emits an asynchronouswrite EPIPEerror on the stream, and because noerrorevent handler is registered onchild.stdin, the error propagates as an unhandled exception that crashes the entire OpenClaw gateway daemon — taking down all connected channels.Root Cause: Three compounding defects in
CodexAppServerClient:errorhandler onchild.stdin(constructor, line 57–77): the constructor registerserror/exithandlers on the transport (child) itself, but never onchild.stdin. When the pipe breaks, the stdin stream emits anerrorevent that has no listener, causing Node.js to throw it as an uncaught exception.writeMessage()does not checkthis.closed(line 214–216): even after the client is marked closed bycloseWithError(), async code paths likehandleServerRequest()can still resume and callwriteMessage(), triggering writes to a dead pipe.closeWithError()performs incomplete cleanup (line 298–304): unlikeclose()which closes the readline interface and shuts down the transport,closeWithError()only setsclosed = trueand rejects pending requests — leaving the readline active (processing more lines → more writes) and the transport open.Fix: Three targeted changes that address each root cause:
errorevent handler onchild.stdinin the constructor that routes errors throughcloseWithError(), preventing unhandled exceptions.writeMessage()with athis.closedcheck, preventing writes to dead pipes from async continuations.closeWithError()cleanup withclose(): close the readline interface and shut down the transport.What changed:
extensions/codex/src/app-server/transport.ts: added optionalonmethod to thestdintype to support error event registration across stdio, websocket, and test transports.extensions/codex/src/app-server/client.ts: added stdin error handler in constructor; addedclosedguard inwriteMessage(); addedlines.close()andcloseCodexAppServerTransport()tocloseWithError().extensions/codex/src/app-server/client.test.ts: added two tests — one verifying stdin EPIPE is caught and pending requests are rejected gracefully; one verifyingnotify()after child exit does not attempt writes.What did NOT change (scope boundary):
close()behavior is unchanged — it already performed full cleanup.request()call-site error handling is unchanged — its existing try/catch aroundwriteMessage()continues to work.transport-websocket.ts) is unchanged — its customWritablestdin already buffers writes safely.handleLine(),handleResponse(),handleNotification(), or any RPC protocol logic.closeCodexAppServerTransport().Reproduction
codex-acpbinary that fails during initialization (e.g. a binary that writes an interactive prompt or error text to stdout, then exits immediately).openclaw gatewayinitializepayload to the dead subprocess's stdin.Error: write EPIPEas an unhandled exception.initializerequest, and the gateway continues running.Risk / Mitigation
closeWithError()now callscloseCodexAppServerTransport()which sends SIGTERM/SIGKILL to the child process. If the child has already exited (the common case), the signal is silently ignored. If the child is still running after a stdin error, terminating it is the correct behavior.close()andcloseWithError()are guarded byif (this.closed) return, so only one can execute — no double-cleanup risk. ThecloseCodexAppServerTransport()force-kill timer isunref()'d, so it cannot keep the process alive. Two new tests validate the exact crash scenario and the post-exit write guard.Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
Fixes #67886