fix(desktop): recover chat after sleep/wake by revalidating a stale remote backend#41350

Merged: teknium1 merged 2 commits into main from hermes/hermes-24f2854b on Jun 8, 2026
Conversation

@teknium1

@teknium1 teknium1 commented Jun 7, 2026

Contributor

Summary

Desktop chat recovers on its own after sleep/wake instead of locking on "Starting Hermes…" until a quit+reopen — reimplemented from #40135 on a smaller, more interpretable path (63 added lines vs 555).

Root cause (diagnosed by @AlchemistChaos): a remote/global-remote primary backend has no child process, so the 'exit'/'error' handlers that would clear the main process's cached connectionPromise never fire. Once the remote becomes unreachable across a sleep/wake, the renderer re-dials the same dead descriptor forever.
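The failure mode can be sketched in a few lines. This is a minimal illustration and not the code in electron/main.cjs: `ChildLike`, `Backend`, `cacheBackend`, `hasCachedConnection`, and `makeFakeChild` are hypothetical names, and only `connectionPromise` and the 'exit'/'error' events come from the description above.

```typescript
// Hypothetical model of the main-process cache. Only `connectionPromise` and the
// 'exit'/'error' events are taken from the PR description; everything else is assumed.
type ChildEvent = "exit" | "error";

interface ChildLike {
  on(event: ChildEvent, handler: () => void): void;
}

interface Backend {
  baseUrl: string;
  child?: ChildLike; // present for a local backend, absent for a remote one
}

let connectionPromise: Promise<Backend> | null = null;

function cacheBackend(backend: Backend): void {
  connectionPromise = Promise.resolve(backend);
  // Only a local backend has a child whose death can invalidate the cache.
  const clear = (): void => {
    connectionPromise = null;
  };
  backend.child?.on("exit", clear);
  backend.child?.on("error", clear);
}

function hasCachedConnection(): boolean {
  return connectionPromise !== null;
}

// A tiny stand-in for a child process so the sketch runs without spawning anything.
function makeFakeChild(): ChildLike & { emit(event: ChildEvent): void } {
  const handlers: Record<ChildEvent, Array<() => void>> = { exit: [], error: [] };
  return {
    on(event, handler) {
      handlers[event].push(handler);
    },
    emit(event) {
      handlers[event].forEach((h) => h());
    },
  };
}
```

With a fake child, emitting 'exit' clears the cache. A backend cached without a child has nothing that can ever clear it, which is the stuck state described above.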

Changes

  • electron/main.cjs: new hermes:connection:revalidate IPC. Liveness-probes the cached remote backend's public /api/status (2.5s); on failure drops the cache via resetHermesConnection() (remote-only — no child to SIGTERM) so the next getConnection() rebuilds a reachable descriptor. Local backends are never touched (they self-heal via the child 'exit' handler).
  • electron/preload.cjs + src/global.d.ts: expose/type revalidateConnection().
  • src/app/gateway/hooks/use-gateway-boot.ts: the existing backoff-paced reconnect loop calls revalidateConnection() before re-dialing; dismisses the boot-progress overlay on the post-rebuild 'open' so an in-place rebuild can't leave it stuck at ~94%.
  • scripts/release.py: map co-author email.
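The main-process side of the change can be sketched roughly as below. This is an illustration under stated assumptions, not the merged code: the `Backend` shape, the injected `probe` function, the `RevalidateResult` labels, and `createConnectionCache` are all hypothetical. Only `getConnection()`, `resetHermesConnection()`, `connectionPromise`, `/api/status`, and the 2.5s budget come from the list above. Treating a throwing probe as "unreachable" is also an assumption.

```typescript
// Hypothetical shapes: the PR text does not show the descriptor type or the probe helper.
type Backend = { kind: "local" | "remote"; baseUrl: string };
type Probe = (statusUrl: string, timeoutMs: number) => Promise<boolean>;
type RevalidateResult = "no-cache" | "local-untouched" | "alive" | "dropped";

const PROBE_TIMEOUT_MS = 2500; // the 2.5s budget mentioned in the change list

function createConnectionCache(build: () => Promise<Backend>, probe: Probe) {
  let connectionPromise: Promise<Backend> | null = null;

  function getConnection(): Promise<Backend> {
    // Build once, then hand out the cached promise until it is reset.
    if (connectionPromise === null) connectionPromise = build();
    return connectionPromise;
  }

  function resetHermesConnection(): void {
    // Remote-only path in this sketch: there is no child process to SIGTERM.
    connectionPromise = null;
  }

  async function revalidateConnection(): Promise<RevalidateResult> {
    const cached = connectionPromise;
    if (cached === null) return "no-cache";

    const backend = await cached;
    // Local backends are left alone; they recover via the child 'exit' handler.
    if (backend.kind === "local") return "local-untouched";

    // Assumption: a probe that throws is treated the same as an unreachable backend.
    const alive = await probe(backend.baseUrl + "/api/status", PROBE_TIMEOUT_MS).catch(() => false);
    if (alive) return "alive";

    // Drop the cache only if nobody replaced it while the probe was in flight.
    if (connectionPromise === cached) resetHermesConnection();
    return "dropped";
  }

  return { getConnection, revalidateConnection };
}
```

Injecting `probe` keeps the sketch free of network calls; in a real main process it would presumably be an HTTP GET with an abort timeout. After a "dropped" result, the next `getConnection()` call runs `build()` again.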

Why simpler than #40135

The renderer's reconnect loop already provides retry pacing and rides out transient post-wake blips (exponential backoff, fires only on closed/error). The loop is the retry mechanism, so none of the following is needed: a failure-streak counter, episode-window timestamps, an extracted helper module, or module-level state. A transient probe miss just leaves the cache in place; the next backoff tick re-probes. Same recovery behavior with roughly 9x less code (63 added lines vs 555), and far easier to reason about.
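A rough sketch of how a backoff-paced loop can double as the retry mechanism is shown below. The delay constants, `ReconnectDeps`, `backoffDelayMs`, and `reconnectUntilOpen` are hypothetical; the PR does not state the real backoff parameters or the shape of the loop in use-gateway-boot.ts. The sketch also ignores the closed/error trigger and simply loops.

```typescript
// Assumed constants; the actual backoff parameters are not given in the PR.
const BASE_DELAY_MS = 500;
const MAX_DELAY_MS = 30000;

// Exponential backoff: 500, 1000, 2000, ... capped at MAX_DELAY_MS.
function backoffDelayMs(attempt: number): number {
  return Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** attempt);
}

// Dependencies are injected so the loop can run without a real gateway or timers.
interface ReconnectDeps {
  revalidateConnection(): Promise<unknown>; // asks the main process to check its cache
  dial(): Promise<boolean>; // resolves true once the connection is open
  sleep(ms: number): Promise<void>;
}

// Returns the number of attempts used, or -1 if maxAttempts was exhausted.
async function reconnectUntilOpen(deps: ReconnectDeps, maxAttempts: number): Promise<number> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    await deps.sleep(backoffDelayMs(attempt));
    // Revalidate before every re-dial; a failed revalidation call is not fatal,
    // because the next tick simply tries again.
    await deps.revalidateConnection().catch(() => undefined);
    if (await deps.dial()) return attempt + 1;
  }
  return -1;
}
```

Because every tick revalidates before dialing, the loop itself carries the retry state, and no streak or episode counters have to be kept elsewhere.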

Validation

Check                             Result
npm run type-check (tsc -b)       clean
eslint (changed files)            clean
npm run test:desktop:platforms    102/102 pass

Reimplements #40135. Original diagnosis and fix by @AlchemistChaos; co-authorship preserved on the fix commit.

Infographic

desktop-sleep-wake-recovery

@github-actions

github-actions Bot commented Jun 7, 2026

Contributor

🔎 Lint report: hermes/hermes-24f2854b vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 10014 on HEAD, 10014 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 5196 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

@alt-glitch alt-glitch added the type/bug (Something isn't working) and P3 (Low — cosmetic, nice to have) labels Jun 7, 2026
teknium1 and others added 2 commits June 7, 2026 10:04
…emote backend

After sleep/wake, a remote (global-remote) primary backend can become
unreachable, but it has no child process whose 'exit' clears the main
process's cached connectionPromise. The renderer then re-dials the same
dead remote forever and the composer stays stuck on "Starting Hermes…";
only a quit+reopen recovered.

Fix: the renderer's existing backoff-paced reconnect loop now asks the
main process to revalidate the cached connection before re-dialing. The
main process liveness-probes the cached REMOTE backend's public
/api/status and, if unreachable, drops the cache (resetHermesConnection
only nulls connectionPromise for a remote — no child to SIGTERM) so the
next getConnection() rebuilds a reachable descriptor. Local backends are
never touched here; they self-heal via the child 'exit' handler. The
renderer's loop already provides retry pacing and rides out transient
blips, so no streak/episode bookkeeping is needed in the main process.

The boot hook dismisses the boot-progress overlay on the post-rebuild
'open' so an in-place rebuild can't leave it stuck at ~94%.

Reimplements #40135 by @AlchemistChaos on a smaller, more interpretable
path (63 added lines vs 555): no extracted helper module, no
failure-streak / episode-window state, the renderer's backoff loop is
the retry mechanism. Original diagnosis and fix by @AlchemistChaos.

Co-authored-by: AlchemistChaos <alchemistchaos@protonmail.com>
@teknium1 teknium1 force-pushed the hermes/hermes-24f2854b branch from 66c6398 to a4d843c June 7, 2026 17:04
@teknium1 teknium1 merged commit 1c7ae46 into main Jun 8, 2026
23 checks passed
@teknium1 teknium1 deleted the hermes/hermes-24f2854b branch June 8, 2026 00:29