Skip to content

fix(tui): auto-recover session on unexpected gateway death (+ persist lifecycle breadcrumbs)#35893

Merged
OutThisLife merged 9 commits into
mainfrom
bb/tui-gateway-parent-lifecycle-log
May 31, 2026
Merged

fix(tui): auto-recover session on unexpected gateway death (+ persist lifecycle breadcrumbs)#35893
OutThisLife merged 9 commits into
mainfrom
bb/tui-gateway-parent-lifecycle-log

Conversation

@OutThisLife

@OutThisLife OutThisLife commented May 31, 2026

Copy link
Copy Markdown
Collaborator

Context

Investigating @bertl's recurring TUI death reports — gateway dies within minutes of heavy work; CLI is fine in the same env. The shared tui_gateway_crash.log always shows the Python side's === SIGTERM received === with the main thread idle on sys.stdin, i.e. the gateway was killed from outside while perfectly healthy. A backend SIGTERM is always a parent (Node) action: gw.kill() (graceful-exit on a signal to Node — e.g. SSH/terminal drop — or /quit) or start() replacing a live child. process.exit(137) from the memory monitor is the other likely death (closes the child's stdin → clean EOF, not SIGTERM).

The actual fix (commit 2)

The real pain isn't that the gateway got a signal — it's that when it dies the TUI nulls the session and drops to an inert gateway exited husk, so the user loses a long session and must restart + re-run everything. That single behavior is most of the "doesn't survive heavy work" complaint, regardless of the killer.

The 'exit' event only reaches the app handler on an unexpected death (a user /quit calls process.exit before it fires; a replaced child is identity-skipped in GatewayClient). So on exit we now:

  1. respawn the gateway, and
  2. resume the session that was live (history is persisted in SQLite) via a one-shot recoverSidRef the next gateway.ready consults before forging a new session.

The in-flight reply is lost (it died with the process) but the session survives — the TUI self-heals instead of stranding the user. Bounded to 3 attempts / 60s so a gateway that crash-loops on startup can't spawn-storm (past the budget it falls back to the old inert state).

Diagnostics (commit 1)

#31051 added parent-side lifecycle breadcrumbs but left them in an in-memory CircularBuffer that dies with the process, so SIGTERM reports arrive with no parent context (gateway.log = "file not found") and no way to tell a signal-driven kill from a memory-critical process.exit(137). We persist the death-explaining breadcrumbs (spawn / transport-exit / child-exit / replace-live-child / kill-reason / startup-timeout) + the graceful-exit signal name + the memory-critical exit into the same crash log the Python side writes, so they interleave by timestamp next to the child's panic:

[tui-parent] 2026-05-28T14:09:25Z graceful-exit received signal=SIGHUP -> killing gateway
=== SIGTERM received . 2026-05-28 14:09:27 ===   (Python side)

That pinpoints SIGHUP (SSH/terminal gone) vs real SIGTERM (s6/cgroup) vs /quit vs OOM for the next report — so the trigger can be fixed at the source too. Gated off under VITEST so tests stay hermetic.

Changes

  • lib/parentLog.ts (new) — best-effort sync append to the gateway crash log; VITEST-gated.
  • gatewayClient.ts — route death-explaining breadcrumbs through a lifecycle() helper (in-memory and persisted).
  • entry.tsx — record graceful-exit signal, memory-critical process.exit(137), uncaught errors.
  • app/useMainApp.ts — bounded auto-recovery in the gateway exit handler (recoverSidRef + retry budget).
  • app/createGatewayEventHandler.tsgateway.ready resumes a recovered session before forging a new one.
  • app/interfaces.ts — optional recoverSidRef on the handler context.

Test plan

  • createGatewayEventHandler.test.ts — new: gateway.ready after a crash resumes the recovered sid once, skips forge, clears the ref. (50 pass)
  • parentLog.test.ts — new hermetic: writes timestamped breadcrumb when enabled; no-op + no file under VITEST.
  • gatewayClient.test.ts — existing lifecycle-tail assertions still pass through lifecycle().
  • eslint clean on all changed files; no new typecheck errors.
  • Full ui-tui suite: 907 pass; only pre-existing failures remain (virtualHeights.test.ts, packages/hermes-ink typecheck — both reproduce on main).

A backend SIGTERM (`=== SIGTERM received ===` in tui_gateway_crash.log) is
always a parent action — `gw.kill()` (graceful-exit on a signal to Node, or an
explicit /quit) or `start()` replacing a live child. #31051 added parent-side
lifecycle breadcrumbs but left them in an in-memory CircularBuffer that dies
with the process, so SIGTERM crash reports arrive with no parent context and no
way to tell a signal-driven kill from a memory-critical `process.exit(137)`
(which closes the child's stdin → clean EOF, not SIGTERM).

Persist the death-explaining breadcrumbs (spawn / transport-exit / child-exit /
replace-live-child / kill-reason / startup-timeout) plus the graceful-exit
signal name and the memory-critical exit into the same crash log the Python
side writes, so they interleave by timestamp next to the child's panic entry —
making these recurring reports diagnosable.

Gated off under VITEST so unit tests stay hermetic.
@github-actions

github-actions Bot commented May 31, 2026

Copy link
Copy Markdown
Contributor

🔎 Lint report: bb/tui-gateway-parent-lifecycle-log vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 9539 on HEAD, 9539 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 4950 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Persists TUI parent-side gateway lifecycle breadcrumbs to the same crash log used by the Python gateway ($HERMES_HOME/logs/tui_gateway_crash.log) so crash reports include the parent’s termination context alongside the child’s SIGTERM/exit messages.

Changes:

  • Add recordParentLifecycle() helper to append timestamped parent lifecycle lines to tui_gateway_crash.log (best-effort; disabled under VITEST).
  • Route gateway “death-explaining” lifecycle breadcrumb sites (spawn/exit/kill/replace/timeout/transport-exit) through a new GatewayClient.lifecycle() helper that writes both in-memory tail and persisted crash log.
  • Record parent-side graceful-exit signals and memory-critical process.exit(137) events into the crash log; add a hermetic unit test for the new logger.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
ui-tui/src/lib/parentLog.ts New best-effort crash-log appender for parent lifecycle breadcrumbs.
ui-tui/src/gatewayClient.ts Persist lifecycle breadcrumbs to crash log via a lifecycle() helper (in addition to existing in-memory tail).
ui-tui/src/entry.tsx Emit parent lifecycle breadcrumbs for signals, memory-critical exit, and error scope.
ui-tui/src/tests/parentLog.test.ts Adds focused unit coverage for crash-log persistence and VITEST gating.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread ui-tui/src/lib/parentLog.ts Outdated
Comment thread ui-tui/src/lib/parentLog.ts
When a still-owned gateway child dies while the TUI is alive (a crash, OOM
process.exit, or a SIGTERM/SIGHUP forwarded to it), the app currently nulls the
session and drops to an inert "gateway exited" state — the user loses a long
session and has to restart + re-run everything. That single behavior is most of
the "TUI doesn't survive heavy work" complaint, independent of what does the
killing.

The 'exit' event only reaches this handler on an *unexpected* death: a user
/quit calls process.exit before it fires, and a replaced child is identity-
skipped in GatewayClient. So on exit we now respawn the gateway and resume the
session that was live (history is persisted in SQLite) via a one-shot
recoverSidRef the next gateway.ready consults before forging a new session. The
in-flight reply is lost (it died with the process) but the session survives.

Bounded to GATEWAY_RECOVERY_LIMIT (3) attempts per GATEWAY_RECOVERY_WINDOW_MS
(60s) so a gateway that crash-loops on startup can't spawn-storm; past the
budget we fall back to the inert state.
@OutThisLife OutThisLife changed the title fix(tui): persist gateway lifecycle breadcrumbs to crash log fix(tui): auto-recover session on unexpected gateway death (+ persist lifecycle breadcrumbs) May 31, 2026
@OutThisLife

Copy link
Copy Markdown
Collaborator Author

bugbot run

@cursor

cursor Bot commented May 31, 2026

Copy link
Copy Markdown

Skipping Bugbot: Unable to authenticate your request. Please make sure Bugbot is properly installed and configured for this repository.

@alt-glitch alt-glitch added type/bug Something isn't working comp/tui Terminal UI (ui-tui/ + tui_gateway/) P2 Medium — degraded but workaround exists labels May 31, 2026
Address PR review:
- recordParentLifecycle collapses embedded \r\n so a multi-line value (e.g. an
  error message) stays a single breadcrumb and can't masquerade as a separate
  entry or as the child's panic output sharing the crash log.
- Reword the header: a backend SIGTERM is *usually* a parent action but can come
  straight from an external supervisor (s6, cgroup OOM, stray kill); the
  presence/absence of a [tui-parent] line before the child's panic is precisely
  what disambiguates the two.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Comment thread ui-tui/src/app/useMainApp.ts Outdated
Comment thread ui-tui/src/app/useMainApp.ts Outdated
Address PR review:
- Null `sid` immediately in the gateway exit handler. While the gateway is down
  (busy=false) the old sid would otherwise let sid-guarded effects (the 1.5s
  session.active_list poll, queue drain) fire RPCs at a dead/respawning gateway.
  recoverSidRef carries the session forward; resumeById restores sid on ready.
- Extract the respawn budget into a pure evalRecovery() (gatewayRecovery.ts) and
  unit-test the bound: allows GATEWAY_RECOVERY_LIMIT within the window, blocks
  past it, and prunes attempts older than the window so recovery re-arms.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

Comment thread ui-tui/src/lib/parentLog.ts Outdated
Truncate a single persisted breadcrumb to 4096 chars (matching GatewayClient's
in-memory log-line cap) so a pathological value — e.g. a giant error string —
can't bloat the shared crash log or add noticeable blocking on the synchronous
append during a failure path. Covered by a test.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

Comment thread ui-tui/src/app/createGatewayEventHandler.ts Outdated
… review)

resumeById() synchronously sets status to 'resuming…' on entry, so the
recovery branch now applies its 'recovering session…' label *after* calling
resumeById — the distinct label sticks for the duration of the resume RPC
(which later flips to 'ready') instead of being immediately clobbered. Test
updated to assert the ordering.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

Comment thread ui-tui/src/app/useMainApp.ts
…review)

deadSid was read from getUiState().sid, which the first exit nulls — so if the
respawned gateway crash-looped before gateway.ready (resumeById never restored
sid), later exits saw null and abandoned the session after a single attempt,
defeating the bounded retry budget.

Lift the whole decision into a pure planGatewayRecovery() that falls back to the
pending recoverSidRef target when the live sid is already cleared, and unit-test
the crash-loop sequence (keeps retrying the same session up to the limit, then
falls back to inert). Supersedes evalRecovery.
@OutThisLife OutThisLife requested a review from Copilot May 31, 2026 15:12

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Comment thread ui-tui/src/app/createGatewayEventHandler.ts
Comment thread ui-tui/src/lib/parentLog.ts Outdated
…(PR review)

- Recovery branch guards on `recoverSidRef && recoverSid` so the ref write needs
  no `!` assertion (avoids a future unsafe refactor).
- Reword the parentLog cap comment: it slices the value to 4096 chars and
  appends a short truncation marker (so the written line is slightly longer),
  rather than implying a strict 4096-byte limit.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Comment thread ui-tui/src/lib/parentLog.ts Outdated
Comment thread ui-tui/src/app/useMainApp.ts Outdated
…" (PR review)

- parentLog header: a missing [tui-parent] line only *suggests* an external
  signal (the logger is best-effort: VITEST-disabled, failed append swallowed),
  not a definitive conclusion.
- Recovery notice says "any in-flight reply was lost" since the gateway can also
  exit while idle.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated no new comments.

@OutThisLife OutThisLife merged commit a726e8a into main May 31, 2026
21 checks passed
@OutThisLife OutThisLife deleted the bb/tui-gateway-parent-lifecycle-log branch May 31, 2026 15:36
KKT-OPT pushed a commit to KKT-OPT/hermes-agent that referenced this pull request May 31, 2026
… lifecycle breadcrumbs) (NousResearch#35893)

* fix(tui): persist gateway lifecycle breadcrumbs to crash log

A backend SIGTERM (`=== SIGTERM received ===` in tui_gateway_crash.log) is
always a parent action — `gw.kill()` (graceful-exit on a signal to Node, or an
explicit /quit) or `start()` replacing a live child. NousResearch#31051 added parent-side
lifecycle breadcrumbs but left them in an in-memory CircularBuffer that dies
with the process, so SIGTERM crash reports arrive with no parent context and no
way to tell a signal-driven kill from a memory-critical `process.exit(137)`
(which closes the child's stdin → clean EOF, not SIGTERM).

Persist the death-explaining breadcrumbs (spawn / transport-exit / child-exit /
replace-live-child / kill-reason / startup-timeout) plus the graceful-exit
signal name and the memory-critical exit into the same crash log the Python
side writes, so they interleave by timestamp next to the child's panic entry —
making these recurring reports diagnosable.

Gated off under VITEST so unit tests stay hermetic.

* feat(tui): auto-recover the session when the gateway dies unexpectedly

When a still-owned gateway child dies while the TUI is alive (a crash, OOM
process.exit, or a SIGTERM/SIGHUP forwarded to it), the app currently nulls the
session and drops to an inert "gateway exited" state — the user loses a long
session and has to restart + re-run everything. That single behavior is most of
the "TUI doesn't survive heavy work" complaint, independent of what does the
killing.

The 'exit' event only reaches this handler on an *unexpected* death: a user
/quit calls process.exit before it fires, and a replaced child is identity-
skipped in GatewayClient. So on exit we now respawn the gateway and resume the
session that was live (history is persisted in SQLite) via a one-shot
recoverSidRef the next gateway.ready consults before forging a new session. The
in-flight reply is lost (it died with the process) but the session survives.

Bounded to GATEWAY_RECOVERY_LIMIT (3) attempts per GATEWAY_RECOVERY_WINDOW_MS
(60s) so a gateway that crash-loops on startup can't spawn-storm; past the
budget we fall back to the inert state.

* fix(tui): sanitize newlines + soften SIGTERM-cause claim in parentLog

Address PR review:
- recordParentLifecycle collapses embedded \r\n so a multi-line value (e.g. an
  error message) stays a single breadcrumb and can't masquerade as a separate
  entry or as the child's panic output sharing the crash log.
- Reword the header: a backend SIGTERM is *usually* a parent action but can come
  straight from an external supervisor (s6, cgroup OOM, stray kill); the
  presence/absence of a [tui-parent] line before the child's panic is precisely
  what disambiguates the two.

* fix(tui): clear sid during recovery + extract/test the recovery budget

Address PR review:
- Null `sid` immediately in the gateway exit handler. While the gateway is down
  (busy=false) the old sid would otherwise let sid-guarded effects (the 1.5s
  session.active_list poll, queue drain) fire RPCs at a dead/respawning gateway.
  recoverSidRef carries the session forward; resumeById restores sid on ready.
- Extract the respawn budget into a pure evalRecovery() (gatewayRecovery.ts) and
  unit-test the bound: allows GATEWAY_RECOVERY_LIMIT within the window, blocks
  past it, and prunes attempts older than the window so recovery re-arms.

* fix(tui): cap parent-log breadcrumb length (PR review)

Truncate a single persisted breadcrumb to 4096 chars (matching GatewayClient's
in-memory log-line cap) so a pathological value — e.g. a giant error string —
can't bloat the shared crash log or add noticeable blocking on the synchronous
append during a failure path. Covered by a test.

* fix(tui): keep "recovering session…" status visible during resume (PR review)

resumeById() synchronously sets status to 'resuming…' on entry, so the
recovery branch now applies its 'recovering session…' label *after* calling
resumeById — the distinct label sticks for the duration of the resume RPC
(which later flips to 'ready') instead of being immediately clobbered. Test
updated to assert the ordering.

* fix(tui): keep recovery budget alive across a startup crash-loop (PR review)

deadSid was read from getUiState().sid, which the first exit nulls — so if the
respawned gateway crash-looped before gateway.ready (resumeById never restored
sid), later exits saw null and abandoned the session after a single attempt,
defeating the bounded retry budget.

Lift the whole decision into a pure planGatewayRecovery() that falls back to the
pending recoverSidRef target when the live sid is already cleared, and unit-test
the crash-loop sequence (keeps retrying the same session up to the limit, then
falls back to inert). Supersedes evalRecovery.

* chore(tui): drop non-null assertion + clarify breadcrumb cap comment (PR review)

- Recovery branch guards on `recoverSidRef && recoverSid` so the ref write needs
  no `!` assertion (avoids a future unsafe refactor).
- Reword the parentLog cap comment: it slices the value to 4096 chars and
  appends a short truncation marker (so the written line is slightly longer),
  rather than implying a strict 4096-byte limit.

* chore(tui): soften "absence ⇒ external signal" + "any in-flight reply" (PR review)

- parentLog header: a missing [tui-parent] line only *suggests* an external
  signal (the logger is best-effort: VITEST-disabled, failed append swallowed),
  not a definitive conclusion.
- Recovery notice says "any in-flight reply was lost" since the gateway can also
  exit while idle.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/tui Terminal UI (ui-tui/ + tui_gateway/) P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants