fix(tui): auto-recover session on unexpected gateway death (+ persist lifecycle breadcrumbs) by OutThisLife · Pull Request #35893 · NousResearch/hermes-agent

OutThisLife · 2026-05-31T14:22:31Z

Context

Investigating @bertl's recurring TUI death reports — gateway dies within minutes of heavy work; CLI is fine in the same env. The shared tui_gateway_crash.log always shows the Python side's === SIGTERM received === with the main thread idle on sys.stdin, i.e. the gateway was killed from outside while perfectly healthy. A backend SIGTERM is always a parent (Node) action: gw.kill() (graceful-exit on a signal to Node — e.g. SSH/terminal drop — or /quit) or start() replacing a live child. process.exit(137) from the memory monitor is the other likely death (closes the child's stdin → clean EOF, not SIGTERM).

The actual fix (commit 2)

The real pain isn't that the gateway got a signal — it's that when it dies the TUI nulls the session and drops to an inert gateway exited husk, so the user loses a long session and must restart + re-run everything. That single behavior is most of the "doesn't survive heavy work" complaint, regardless of the killer.

The 'exit' event only reaches the app handler on an unexpected death (a user /quit calls process.exit before it fires; a replaced child is identity-skipped in GatewayClient). So on exit we now:

respawn the gateway, and
resume the session that was live (history is persisted in SQLite) via a one-shot recoverSidRef the next gateway.ready consults before forging a new session.

The in-flight reply is lost (it died with the process) but the session survives — the TUI self-heals instead of stranding the user. Bounded to 3 attempts / 60s so a gateway that crash-loops on startup can't spawn-storm (past the budget it falls back to the old inert state).

Diagnostics (commit 1)

#31051 added parent-side lifecycle breadcrumbs but left them in an in-memory CircularBuffer that dies with the process, so SIGTERM reports arrive with no parent context (gateway.log = "file not found") and no way to tell a signal-driven kill from a memory-critical process.exit(137). We persist the death-explaining breadcrumbs (spawn / transport-exit / child-exit / replace-live-child / kill-reason / startup-timeout) + the graceful-exit signal name + the memory-critical exit into the same crash log the Python side writes, so they interleave by timestamp next to the child's panic:

[tui-parent] 2026-05-28T14:09:25Z graceful-exit received signal=SIGHUP -> killing gateway
=== SIGTERM received . 2026-05-28 14:09:27 ===   (Python side)

That pinpoints SIGHUP (SSH/terminal gone) vs real SIGTERM (s6/cgroup) vs /quit vs OOM for the next report — so the trigger can be fixed at the source too. Gated off under VITEST so tests stay hermetic.

Changes

lib/parentLog.ts (new) — best-effort sync append to the gateway crash log; VITEST-gated.
gatewayClient.ts — route death-explaining breadcrumbs through a lifecycle() helper (in-memory and persisted).
entry.tsx — record graceful-exit signal, memory-critical process.exit(137), uncaught errors.
app/useMainApp.ts — bounded auto-recovery in the gateway exit handler (recoverSidRef + retry budget).
app/createGatewayEventHandler.ts — gateway.ready resumes a recovered session before forging a new one.
app/interfaces.ts — optional recoverSidRef on the handler context.

Test plan

createGatewayEventHandler.test.ts — new: gateway.ready after a crash resumes the recovered sid once, skips forge, clears the ref. (50 pass)
parentLog.test.ts — new hermetic: writes timestamped breadcrumb when enabled; no-op + no file under VITEST.
gatewayClient.test.ts — existing lifecycle-tail assertions still pass through lifecycle().
eslint clean on all changed files; no new typecheck errors.
Full ui-tui suite: 907 pass; only pre-existing failures remain (virtualHeights.test.ts, packages/hermes-ink typecheck — both reproduce on main).

A backend SIGTERM (`=== SIGTERM received ===` in tui_gateway_crash.log) is always a parent action — `gw.kill()` (graceful-exit on a signal to Node, or an explicit /quit) or `start()` replacing a live child. #31051 added parent-side lifecycle breadcrumbs but left them in an in-memory CircularBuffer that dies with the process, so SIGTERM crash reports arrive with no parent context and no way to tell a signal-driven kill from a memory-critical `process.exit(137)` (which closes the child's stdin → clean EOF, not SIGTERM). Persist the death-explaining breadcrumbs (spawn / transport-exit / child-exit / replace-live-child / kill-reason / startup-timeout) plus the graceful-exit signal name and the memory-critical exit into the same crash log the Python side writes, so they interleave by timestamp next to the child's panic entry — making these recurring reports diagnosable. Gated off under VITEST so unit tests stay hermetic.

github-actions · 2026-05-31T14:23:17Z

🔎 Lint report: `bb/tui-gateway-parent-lifecycle-log` vs `origin/main`

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 9539 on HEAD, 9539 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 4950 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

Copilot

Pull request overview

Persists TUI parent-side gateway lifecycle breadcrumbs to the same crash log used by the Python gateway ($HERMES_HOME/logs/tui_gateway_crash.log) so crash reports include the parent’s termination context alongside the child’s SIGTERM/exit messages.

Changes:

Add recordParentLifecycle() helper to append timestamped parent lifecycle lines to tui_gateway_crash.log (best-effort; disabled under VITEST).
Route gateway “death-explaining” lifecycle breadcrumb sites (spawn/exit/kill/replace/timeout/transport-exit) through a new GatewayClient.lifecycle() helper that writes both in-memory tail and persisted crash log.
Record parent-side graceful-exit signals and memory-critical process.exit(137) events into the crash log; add a hermetic unit test for the new logger.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File	Description
ui-tui/src/lib/parentLog.ts	New best-effort crash-log appender for parent lifecycle breadcrumbs.
ui-tui/src/gatewayClient.ts	Persist lifecycle breadcrumbs to crash log via a `lifecycle()` helper (in addition to existing in-memory tail).
ui-tui/src/entry.tsx	Emit parent lifecycle breadcrumbs for signals, memory-critical exit, and error scope.
ui-tui/src/tests/parentLog.test.ts	Adds focused unit coverage for crash-log persistence and VITEST gating.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

When a still-owned gateway child dies while the TUI is alive (a crash, OOM process.exit, or a SIGTERM/SIGHUP forwarded to it), the app currently nulls the session and drops to an inert "gateway exited" state — the user loses a long session and has to restart + re-run everything. That single behavior is most of the "TUI doesn't survive heavy work" complaint, independent of what does the killing. The 'exit' event only reaches this handler on an *unexpected* death: a user /quit calls process.exit before it fires, and a replaced child is identity- skipped in GatewayClient. So on exit we now respawn the gateway and resume the session that was live (history is persisted in SQLite) via a one-shot recoverSidRef the next gateway.ready consults before forging a new session. The in-flight reply is lost (it died with the process) but the session survives. Bounded to GATEWAY_RECOVERY_LIMIT (3) attempts per GATEWAY_RECOVERY_WINDOW_MS (60s) so a gateway that crash-loops on startup can't spawn-storm; past the budget we fall back to the inert state.

OutThisLife · 2026-05-31T14:39:17Z

bugbot run

cursor · 2026-05-31T14:39:19Z

Skipping Bugbot: Unable to authenticate your request. Please make sure Bugbot is properly installed and configured for this repository.

Address PR review: - recordParentLifecycle collapses embedded \r\n so a multi-line value (e.g. an error message) stays a single breadcrumb and can't masquerade as a separate entry or as the child's panic output sharing the crash log. - Reword the header: a backend SIGTERM is *usually* a parent action but can come straight from an external supervisor (s6, cgroup OOM, stray kill); the presence/absence of a [tui-parent] line before the child's panic is precisely what disambiguates the two.

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Address PR review: - Null `sid` immediately in the gateway exit handler. While the gateway is down (busy=false) the old sid would otherwise let sid-guarded effects (the 1.5s session.active_list poll, queue drain) fire RPCs at a dead/respawning gateway. recoverSidRef carries the session forward; resumeById restores sid on ready. - Extract the respawn budget into a pure evalRecovery() (gatewayRecovery.ts) and unit-test the bound: allows GATEWAY_RECOVERY_LIMIT within the window, blocks past it, and prunes attempts older than the window so recovery re-arms.

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

Truncate a single persisted breadcrumb to 4096 chars (matching GatewayClient's in-memory log-line cap) so a pathological value — e.g. a giant error string — can't bloat the shared crash log or add noticeable blocking on the synchronous append during a failure path. Covered by a test.

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

… review) resumeById() synchronously sets status to 'resuming…' on entry, so the recovery branch now applies its 'recovering session…' label *after* calling resumeById — the distinct label sticks for the duration of the resume RPC (which later flips to 'ready') instead of being immediately clobbered. Test updated to assert the ordering.

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

…review) deadSid was read from getUiState().sid, which the first exit nulls — so if the respawned gateway crash-looped before gateway.ready (resumeById never restored sid), later exits saw null and abandoned the session after a single attempt, defeating the bounded retry budget. Lift the whole decision into a pure planGatewayRecovery() that falls back to the pending recoverSidRef target when the live sid is already cleared, and unit-test the crash-loop sequence (keeps retrying the same session up to the limit, then falls back to inert). Supersedes evalRecovery.

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

…(PR review) - Recovery branch guards on `recoverSidRef && recoverSid` so the ref write needs no `!` assertion (avoids a future unsafe refactor). - Reword the parentLog cap comment: it slices the value to 4096 chars and appends a short truncation marker (so the written line is slightly longer), rather than implying a strict 4096-byte limit.

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

…" (PR review) - parentLog header: a missing [tui-parent] line only *suggests* an external signal (the logger is best-effort: VITEST-disabled, failed append swallowed), not a definitive conclusion. - Recovery notice says "any in-flight reply was lost" since the gateway can also exit while idle.

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated no new comments.

… lifecycle breadcrumbs) (NousResearch#35893) * fix(tui): persist gateway lifecycle breadcrumbs to crash log A backend SIGTERM (`=== SIGTERM received ===` in tui_gateway_crash.log) is always a parent action — `gw.kill()` (graceful-exit on a signal to Node, or an explicit /quit) or `start()` replacing a live child. NousResearch#31051 added parent-side lifecycle breadcrumbs but left them in an in-memory CircularBuffer that dies with the process, so SIGTERM crash reports arrive with no parent context and no way to tell a signal-driven kill from a memory-critical `process.exit(137)` (which closes the child's stdin → clean EOF, not SIGTERM). Persist the death-explaining breadcrumbs (spawn / transport-exit / child-exit / replace-live-child / kill-reason / startup-timeout) plus the graceful-exit signal name and the memory-critical exit into the same crash log the Python side writes, so they interleave by timestamp next to the child's panic entry — making these recurring reports diagnosable. Gated off under VITEST so unit tests stay hermetic. * feat(tui): auto-recover the session when the gateway dies unexpectedly When a still-owned gateway child dies while the TUI is alive (a crash, OOM process.exit, or a SIGTERM/SIGHUP forwarded to it), the app currently nulls the session and drops to an inert "gateway exited" state — the user loses a long session and has to restart + re-run everything. That single behavior is most of the "TUI doesn't survive heavy work" complaint, independent of what does the killing. The 'exit' event only reaches this handler on an *unexpected* death: a user /quit calls process.exit before it fires, and a replaced child is identity- skipped in GatewayClient. So on exit we now respawn the gateway and resume the session that was live (history is persisted in SQLite) via a one-shot recoverSidRef the next gateway.ready consults before forging a new session. The in-flight reply is lost (it died with the process) but the session survives. Bounded to GATEWAY_RECOVERY_LIMIT (3) attempts per GATEWAY_RECOVERY_WINDOW_MS (60s) so a gateway that crash-loops on startup can't spawn-storm; past the budget we fall back to the inert state. * fix(tui): sanitize newlines + soften SIGTERM-cause claim in parentLog Address PR review: - recordParentLifecycle collapses embedded \r\n so a multi-line value (e.g. an error message) stays a single breadcrumb and can't masquerade as a separate entry or as the child's panic output sharing the crash log. - Reword the header: a backend SIGTERM is *usually* a parent action but can come straight from an external supervisor (s6, cgroup OOM, stray kill); the presence/absence of a [tui-parent] line before the child's panic is precisely what disambiguates the two. * fix(tui): clear sid during recovery + extract/test the recovery budget Address PR review: - Null `sid` immediately in the gateway exit handler. While the gateway is down (busy=false) the old sid would otherwise let sid-guarded effects (the 1.5s session.active_list poll, queue drain) fire RPCs at a dead/respawning gateway. recoverSidRef carries the session forward; resumeById restores sid on ready. - Extract the respawn budget into a pure evalRecovery() (gatewayRecovery.ts) and unit-test the bound: allows GATEWAY_RECOVERY_LIMIT within the window, blocks past it, and prunes attempts older than the window so recovery re-arms. * fix(tui): cap parent-log breadcrumb length (PR review) Truncate a single persisted breadcrumb to 4096 chars (matching GatewayClient's in-memory log-line cap) so a pathological value — e.g. a giant error string — can't bloat the shared crash log or add noticeable blocking on the synchronous append during a failure path. Covered by a test. * fix(tui): keep "recovering session…" status visible during resume (PR review) resumeById() synchronously sets status to 'resuming…' on entry, so the recovery branch now applies its 'recovering session…' label *after* calling resumeById — the distinct label sticks for the duration of the resume RPC (which later flips to 'ready') instead of being immediately clobbered. Test updated to assert the ordering. * fix(tui): keep recovery budget alive across a startup crash-loop (PR review) deadSid was read from getUiState().sid, which the first exit nulls — so if the respawned gateway crash-looped before gateway.ready (resumeById never restored sid), later exits saw null and abandoned the session after a single attempt, defeating the bounded retry budget. Lift the whole decision into a pure planGatewayRecovery() that falls back to the pending recoverSidRef target when the live sid is already cleared, and unit-test the crash-loop sequence (keeps retrying the same session up to the limit, then falls back to inert). Supersedes evalRecovery. * chore(tui): drop non-null assertion + clarify breadcrumb cap comment (PR review) - Recovery branch guards on `recoverSidRef && recoverSid` so the ref write needs no `!` assertion (avoids a future unsafe refactor). - Reword the parentLog cap comment: it slices the value to 4096 chars and appends a short truncation marker (so the written line is slightly longer), rather than implying a strict 4096-byte limit. * chore(tui): soften "absence ⇒ external signal" + "any in-flight reply" (PR review) - parentLog header: a missing [tui-parent] line only *suggests* an external signal (the logger is best-effort: VITEST-disabled, failed append swallowed), not a definitive conclusion. - Recovery notice says "any in-flight reply was lost" since the gateway can also exit while idle.

OutThisLife requested a review from Copilot May 31, 2026 14:29

Copilot started reviewing on behalf of OutThisLife May 31, 2026 14:29 View session