fix(desktop): recover chat after sleep/wake by revalidating a stale remote backend#41350
Merged
…emote backend

After sleep/wake, a remote (global-remote) primary backend can become unreachable, but it has no child process whose 'exit' clears the main process's cached connectionPromise. The renderer then re-dials the same dead remote forever and the composer stays stuck on "Starting Hermes…"; only a quit+reopen recovered.

Fix: the renderer's existing backoff-paced reconnect loop now asks the main process to revalidate the cached connection before re-dialing. The main process liveness-probes the cached REMOTE backend's public /api/status and, if unreachable, drops the cache (resetHermesConnection only nulls connectionPromise for a remote, since there is no child to SIGTERM) so the next getConnection() rebuilds a reachable descriptor. Local backends are never touched here; they self-heal via the child 'exit' handler.

The renderer's loop already provides retry pacing and rides out transient blips, so no streak/episode bookkeeping is needed in the main process. The boot hook dismisses the boot-progress overlay on the post-rebuild 'open' so an in-place rebuild can't leave it stuck at ~94%.

Reimplements #40135 by @AlchemistChaos on a smaller, more interpretable path (63 added lines vs 555): no extracted helper module, no failure-streak / episode-window state, the renderer's backoff loop is the retry mechanism. Original diagnosis and fix by @AlchemistChaos.

Co-authored-by: AlchemistChaos <alchemistchaos@protonmail.com>
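The main-process side of the fix can be sketched as below. This is a minimal illustration under stated assumptions, not the code in electron/main.cjs: /api/status and the idea of a reset callback (resetHermesConnection in the PR) come from the commit message, while the descriptor shape ({ kind, baseUrl }), the helper names probeStatus and revalidateConnection as plain functions, the result object, and the 2500 ms probe timeout are hypothetical.

```javascript
// Hypothetical sketch: every name here except /api/status is an assumption.
const PROBE_TIMEOUT_MS = 2500; // assumed probe timeout

// Resolve true only if the status endpoint answers with a 2xx before the timeout.
async function probeStatus(baseUrl, fetchImpl = fetch, timeoutMs = PROBE_TIMEOUT_MS) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetchImpl(new URL('/api/status', baseUrl), {
      signal: controller.signal,
    });
    return Boolean(res.ok);
  } catch {
    // Network error, DNS failure, or abort on timeout: treat as unreachable.
    return false;
  } finally {
    clearTimeout(timer);
  }
}

// Drop the cached descriptor only when it is remote AND the probe fails.
// `cached` is the resolved descriptor (or null); `reset` stands in for the
// real cache-clearing function.
async function revalidateConnection({ cached, reset, probe = probeStatus }) {
  if (!cached) return { dropped: false, reason: 'no-cache' };
  // Local backends are left alone; their child 'exit' handler clears the cache.
  if (cached.kind !== 'remote') return { dropped: false, reason: 'local' };
  if (await probe(cached.baseUrl)) return { dropped: false, reason: 'reachable' };
  reset();
  return { dropped: true, reason: 'unreachable' };
}
```

Because a failed probe only clears the cache, nothing is retried here; the caller's next connection attempt is what rebuilds the descriptor.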
Summary
Desktop chat recovers on its own after sleep/wake instead of locking on "Starting Hermes…" until a quit+reopen — reimplemented from #40135 on a smaller, more interpretable path (63 added lines vs 555).
Root cause (diagnosed by @AlchemistChaos): a remote/global-remote primary backend has no child process, so the 'exit'/'error' handlers that would clear the main process's cached connectionPromise never fire. Once the remote becomes unreachable across a sleep/wake, the renderer re-dials the same dead descriptor forever.

Changes
- electron/main.cjs: new hermes:connection:revalidate IPC. Liveness-probes the cached remote backend's public /api/status (2.5s); on failure drops the cache via resetHermesConnection() (remote-only: no child to SIGTERM) so the next getConnection() rebuilds a reachable descriptor. Local backends are never touched (they self-heal via the child 'exit' handler).
- electron/preload.cjs + src/global.d.ts: expose/type revalidateConnection().
- src/app/gateway/hooks/use-gateway-boot.ts: the existing backoff-paced reconnect loop calls revalidateConnection() before re-dialing; dismisses the boot-progress overlay on the post-rebuild 'open' so an in-place rebuild can't leave it stuck at ~94%.
- scripts/release.py: map co-author email.

Why simpler than #40135
The renderer's reconnect loop already provides retry pacing and rides out transient post-wake blips (exponential backoff, fires only on closed/error). So no failure-streak counter, no episode-window timestamps, no extracted helper module, and no module-level state is needed: the loop is the retry mechanism. A transient probe miss just leaves the cache in place; the next backoff tick re-probes. Same recovery behavior, ~9x less code, far easier to reason about.

Validation
- npm run type-check (tsc -b)
- eslint (changed files)
- npm run test:desktop:platforms

Reimplements #40135. Original diagnosis and fix by @AlchemistChaos; co-authorship preserved on the fix commit.
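To illustrate why the renderer's loop can serve as the only retry mechanism, here is a rough sketch of a backoff-paced reconnect loop that revalidates before each re-dial. It is not the code in use-gateway-boot.ts: the function names, the injected revalidate/dial/sleep callbacks, the attempt cap, and the 500 ms base / 15 s ceiling are all hypothetical.

```javascript
// Hypothetical sketch: names and timing constants are assumptions.

// Exponential backoff: base, 2x base, 4x base, ... capped at maxMs.
function backoffDelay(attempt, baseMs = 500, maxMs = 15000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Runs after the socket reports closed/error. Returns the number of attempts
// it took to reconnect, or -1 if maxAttempts was exhausted.
async function reconnectLoop({ revalidate, dial, sleep, maxAttempts = 8 }) {
  for (let attempt = 0; attempt < maxAttempts; attempt += 1) {
    await sleep(backoffDelay(attempt));
    try {
      // Ask the main process to drop a dead cached remote before re-dialing.
      await revalidate();
    } catch {
      // A failed revalidation is not fatal; the next tick asks again.
    }
    if (await dial()) return attempt + 1;
  }
  return -1;
}
```

A call might look like `reconnectLoop({ revalidate, dial, sleep: (ms) => new Promise((r) => setTimeout(r, ms)) })`. Since every tick revalidates again, a transient probe miss needs no extra state: the cache stays in place and the next tick simply re-probes.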