fix(tui): auto-recover session on unexpected gateway death (+ persist lifecycle breadcrumbs)#35893
Conversation
A backend SIGTERM (`=== SIGTERM received ===` in tui_gateway_crash.log) is always a parent action — `gw.kill()` (graceful-exit on a signal to Node, or an explicit /quit) or `start()` replacing a live child. #31051 added parent-side lifecycle breadcrumbs but left them in an in-memory CircularBuffer that dies with the process, so SIGTERM crash reports arrive with no parent context and no way to tell a signal-driven kill from a memory-critical `process.exit(137)` (which closes the child's stdin → clean EOF, not SIGTERM). Persist the death-explaining breadcrumbs (spawn / transport-exit / child-exit / replace-live-child / kill-reason / startup-timeout) plus the graceful-exit signal name and the memory-critical exit into the same crash log the Python side writes, so they interleave by timestamp next to the child's panic entry — making these recurring reports diagnosable. Gated off under VITEST so unit tests stay hermetic.
🔎 Lint report:
|
There was a problem hiding this comment.
Pull request overview
Persists TUI parent-side gateway lifecycle breadcrumbs to the same crash log used by the Python gateway ($HERMES_HOME/logs/tui_gateway_crash.log) so crash reports include the parent’s termination context alongside the child’s SIGTERM/exit messages.
Changes:
- Add
recordParentLifecycle()helper to append timestamped parent lifecycle lines totui_gateway_crash.log(best-effort; disabled underVITEST). - Route gateway “death-explaining” lifecycle breadcrumb sites (spawn/exit/kill/replace/timeout/transport-exit) through a new
GatewayClient.lifecycle()helper that writes both in-memory tail and persisted crash log. - Record parent-side graceful-exit signals and memory-critical
process.exit(137)events into the crash log; add a hermetic unit test for the new logger.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| ui-tui/src/lib/parentLog.ts | New best-effort crash-log appender for parent lifecycle breadcrumbs. |
| ui-tui/src/gatewayClient.ts | Persist lifecycle breadcrumbs to crash log via a lifecycle() helper (in addition to existing in-memory tail). |
| ui-tui/src/entry.tsx | Emit parent lifecycle breadcrumbs for signals, memory-critical exit, and error scope. |
| ui-tui/src/tests/parentLog.test.ts | Adds focused unit coverage for crash-log persistence and VITEST gating. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
When a still-owned gateway child dies while the TUI is alive (a crash, OOM process.exit, or a SIGTERM/SIGHUP forwarded to it), the app currently nulls the session and drops to an inert "gateway exited" state — the user loses a long session and has to restart + re-run everything. That single behavior is most of the "TUI doesn't survive heavy work" complaint, independent of what does the killing. The 'exit' event only reaches this handler on an *unexpected* death: a user /quit calls process.exit before it fires, and a replaced child is identity- skipped in GatewayClient. So on exit we now respawn the gateway and resume the session that was live (history is persisted in SQLite) via a one-shot recoverSidRef the next gateway.ready consults before forging a new session. The in-flight reply is lost (it died with the process) but the session survives. Bounded to GATEWAY_RECOVERY_LIMIT (3) attempts per GATEWAY_RECOVERY_WINDOW_MS (60s) so a gateway that crash-loops on startup can't spawn-storm; past the budget we fall back to the inert state.
|
bugbot run |
|
Skipping Bugbot: Unable to authenticate your request. Please make sure Bugbot is properly installed and configured for this repository. |
Address PR review: - recordParentLifecycle collapses embedded \r\n so a multi-line value (e.g. an error message) stays a single breadcrumb and can't masquerade as a separate entry or as the child's panic output sharing the crash log. - Reword the header: a backend SIGTERM is *usually* a parent action but can come straight from an external supervisor (s6, cgroup OOM, stray kill); the presence/absence of a [tui-parent] line before the child's panic is precisely what disambiguates the two.
Address PR review: - Null `sid` immediately in the gateway exit handler. While the gateway is down (busy=false) the old sid would otherwise let sid-guarded effects (the 1.5s session.active_list poll, queue drain) fire RPCs at a dead/respawning gateway. recoverSidRef carries the session forward; resumeById restores sid on ready. - Extract the respawn budget into a pure evalRecovery() (gatewayRecovery.ts) and unit-test the bound: allows GATEWAY_RECOVERY_LIMIT within the window, blocks past it, and prunes attempts older than the window so recovery re-arms.
Truncate a single persisted breadcrumb to 4096 chars (matching GatewayClient's in-memory log-line cap) so a pathological value — e.g. a giant error string — can't bloat the shared crash log or add noticeable blocking on the synchronous append during a failure path. Covered by a test.
… review) resumeById() synchronously sets status to 'resuming…' on entry, so the recovery branch now applies its 'recovering session…' label *after* calling resumeById — the distinct label sticks for the duration of the resume RPC (which later flips to 'ready') instead of being immediately clobbered. Test updated to assert the ordering.
…review) deadSid was read from getUiState().sid, which the first exit nulls — so if the respawned gateway crash-looped before gateway.ready (resumeById never restored sid), later exits saw null and abandoned the session after a single attempt, defeating the bounded retry budget. Lift the whole decision into a pure planGatewayRecovery() that falls back to the pending recoverSidRef target when the live sid is already cleared, and unit-test the crash-loop sequence (keeps retrying the same session up to the limit, then falls back to inert). Supersedes evalRecovery.
…(PR review) - Recovery branch guards on `recoverSidRef && recoverSid` so the ref write needs no `!` assertion (avoids a future unsafe refactor). - Reword the parentLog cap comment: it slices the value to 4096 chars and appends a short truncation marker (so the written line is slightly longer), rather than implying a strict 4096-byte limit.
…" (PR review) - parentLog header: a missing [tui-parent] line only *suggests* an external signal (the logger is best-effort: VITEST-disabled, failed append swallowed), not a definitive conclusion. - Recovery notice says "any in-flight reply was lost" since the gateway can also exit while idle.
… lifecycle breadcrumbs) (NousResearch#35893) * fix(tui): persist gateway lifecycle breadcrumbs to crash log A backend SIGTERM (`=== SIGTERM received ===` in tui_gateway_crash.log) is always a parent action — `gw.kill()` (graceful-exit on a signal to Node, or an explicit /quit) or `start()` replacing a live child. NousResearch#31051 added parent-side lifecycle breadcrumbs but left them in an in-memory CircularBuffer that dies with the process, so SIGTERM crash reports arrive with no parent context and no way to tell a signal-driven kill from a memory-critical `process.exit(137)` (which closes the child's stdin → clean EOF, not SIGTERM). Persist the death-explaining breadcrumbs (spawn / transport-exit / child-exit / replace-live-child / kill-reason / startup-timeout) plus the graceful-exit signal name and the memory-critical exit into the same crash log the Python side writes, so they interleave by timestamp next to the child's panic entry — making these recurring reports diagnosable. Gated off under VITEST so unit tests stay hermetic. * feat(tui): auto-recover the session when the gateway dies unexpectedly When a still-owned gateway child dies while the TUI is alive (a crash, OOM process.exit, or a SIGTERM/SIGHUP forwarded to it), the app currently nulls the session and drops to an inert "gateway exited" state — the user loses a long session and has to restart + re-run everything. That single behavior is most of the "TUI doesn't survive heavy work" complaint, independent of what does the killing. The 'exit' event only reaches this handler on an *unexpected* death: a user /quit calls process.exit before it fires, and a replaced child is identity- skipped in GatewayClient. So on exit we now respawn the gateway and resume the session that was live (history is persisted in SQLite) via a one-shot recoverSidRef the next gateway.ready consults before forging a new session. The in-flight reply is lost (it died with the process) but the session survives. Bounded to GATEWAY_RECOVERY_LIMIT (3) attempts per GATEWAY_RECOVERY_WINDOW_MS (60s) so a gateway that crash-loops on startup can't spawn-storm; past the budget we fall back to the inert state. * fix(tui): sanitize newlines + soften SIGTERM-cause claim in parentLog Address PR review: - recordParentLifecycle collapses embedded \r\n so a multi-line value (e.g. an error message) stays a single breadcrumb and can't masquerade as a separate entry or as the child's panic output sharing the crash log. - Reword the header: a backend SIGTERM is *usually* a parent action but can come straight from an external supervisor (s6, cgroup OOM, stray kill); the presence/absence of a [tui-parent] line before the child's panic is precisely what disambiguates the two. * fix(tui): clear sid during recovery + extract/test the recovery budget Address PR review: - Null `sid` immediately in the gateway exit handler. While the gateway is down (busy=false) the old sid would otherwise let sid-guarded effects (the 1.5s session.active_list poll, queue drain) fire RPCs at a dead/respawning gateway. recoverSidRef carries the session forward; resumeById restores sid on ready. - Extract the respawn budget into a pure evalRecovery() (gatewayRecovery.ts) and unit-test the bound: allows GATEWAY_RECOVERY_LIMIT within the window, blocks past it, and prunes attempts older than the window so recovery re-arms. * fix(tui): cap parent-log breadcrumb length (PR review) Truncate a single persisted breadcrumb to 4096 chars (matching GatewayClient's in-memory log-line cap) so a pathological value — e.g. a giant error string — can't bloat the shared crash log or add noticeable blocking on the synchronous append during a failure path. Covered by a test. * fix(tui): keep "recovering session…" status visible during resume (PR review) resumeById() synchronously sets status to 'resuming…' on entry, so the recovery branch now applies its 'recovering session…' label *after* calling resumeById — the distinct label sticks for the duration of the resume RPC (which later flips to 'ready') instead of being immediately clobbered. Test updated to assert the ordering. * fix(tui): keep recovery budget alive across a startup crash-loop (PR review) deadSid was read from getUiState().sid, which the first exit nulls — so if the respawned gateway crash-looped before gateway.ready (resumeById never restored sid), later exits saw null and abandoned the session after a single attempt, defeating the bounded retry budget. Lift the whole decision into a pure planGatewayRecovery() that falls back to the pending recoverSidRef target when the live sid is already cleared, and unit-test the crash-loop sequence (keeps retrying the same session up to the limit, then falls back to inert). Supersedes evalRecovery. * chore(tui): drop non-null assertion + clarify breadcrumb cap comment (PR review) - Recovery branch guards on `recoverSidRef && recoverSid` so the ref write needs no `!` assertion (avoids a future unsafe refactor). - Reword the parentLog cap comment: it slices the value to 4096 chars and appends a short truncation marker (so the written line is slightly longer), rather than implying a strict 4096-byte limit. * chore(tui): soften "absence ⇒ external signal" + "any in-flight reply" (PR review) - parentLog header: a missing [tui-parent] line only *suggests* an external signal (the logger is best-effort: VITEST-disabled, failed append swallowed), not a definitive conclusion. - Recovery notice says "any in-flight reply was lost" since the gateway can also exit while idle.
Context
Investigating @bertl's recurring TUI death reports — gateway dies within minutes of heavy work; CLI is fine in the same env. The shared
tui_gateway_crash.logalways shows the Python side's=== SIGTERM received ===with the main thread idle onsys.stdin, i.e. the gateway was killed from outside while perfectly healthy. A backend SIGTERM is always a parent (Node) action:gw.kill()(graceful-exit on a signal to Node — e.g. SSH/terminal drop — or/quit) orstart()replacing a live child.process.exit(137)from the memory monitor is the other likely death (closes the child's stdin → clean EOF, not SIGTERM).The actual fix (commit 2)
The real pain isn't that the gateway got a signal — it's that when it dies the TUI nulls the session and drops to an inert
gateway exitedhusk, so the user loses a long session and must restart + re-run everything. That single behavior is most of the "doesn't survive heavy work" complaint, regardless of the killer.The
'exit'event only reaches the app handler on an unexpected death (a user/quitcallsprocess.exitbefore it fires; a replaced child is identity-skipped inGatewayClient). So on exit we now:recoverSidRefthe nextgateway.readyconsults before forging a new session.The in-flight reply is lost (it died with the process) but the session survives — the TUI self-heals instead of stranding the user. Bounded to 3 attempts / 60s so a gateway that crash-loops on startup can't spawn-storm (past the budget it falls back to the old inert state).
Diagnostics (commit 1)
#31051 added parent-side lifecycle breadcrumbs but left them in an in-memory
CircularBufferthat dies with the process, so SIGTERM reports arrive with no parent context (gateway.log= "file not found") and no way to tell a signal-driven kill from a memory-criticalprocess.exit(137). We persist the death-explaining breadcrumbs (spawn / transport-exit / child-exit / replace-live-child / kill-reason / startup-timeout) + the graceful-exit signal name + the memory-critical exit into the same crash log the Python side writes, so they interleave by timestamp next to the child's panic:That pinpoints SIGHUP (SSH/terminal gone) vs real SIGTERM (s6/cgroup) vs
/quitvs OOM for the next report — so the trigger can be fixed at the source too. Gated off underVITESTso tests stay hermetic.Changes
lib/parentLog.ts(new) — best-effort sync append to the gateway crash log;VITEST-gated.gatewayClient.ts— route death-explaining breadcrumbs through alifecycle()helper (in-memory and persisted).entry.tsx— record graceful-exit signal, memory-criticalprocess.exit(137), uncaught errors.app/useMainApp.ts— bounded auto-recovery in the gateway exit handler (recoverSidRef+ retry budget).app/createGatewayEventHandler.ts—gateway.readyresumes a recovered session before forging a new one.app/interfaces.ts— optionalrecoverSidRefon the handler context.Test plan
createGatewayEventHandler.test.ts— new: gateway.ready after a crash resumes the recovered sid once, skips forge, clears the ref. (50 pass)parentLog.test.ts— new hermetic: writes timestamped breadcrumb when enabled; no-op + no file under VITEST.gatewayClient.test.ts— existing lifecycle-tail assertions still pass throughlifecycle().ui-tuisuite: 907 pass; only pre-existing failures remain (virtualHeights.test.ts,packages/hermes-inktypecheck — both reproduce onmain).