fix(desktop): recover chat after sleep/wake by revalidating the cached backend #40135
Detailed pre-fix investigation of a bug where, after the Mac sleeps and wakes, the desktop chat composer stays disabled on the "Starting Hermes…" placeholder and never recovers — only a full app quit + reopen fixes it. Root cause: in remote / global-remote mode, startHermes() (main.cjs:4322) returns a cached connectionPromise with no liveness check, and the cache is only invalidated by the local backend child's 'exit'/'error' handlers. A remote primary spawns no child process (main.cjs:4328-4348), so the cached descriptor is never invalidated for the life of the main process. After sleep the renderer's (sound) reconnect loop keeps re-dialing the same dead remote endpoint forever; a relaunch works only because it resets the module-level connectionPromise. The renderer reconnect loop, gateway state machine, and the exact "connecting"-pinned placeholder logic are all traced with file:line evidence. Root cause confirmed by three independent analyses (0.95–0.98). The fix lands in a separate commit so the diagnosis can be reviewed on its own. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d backend
After the Mac slept and woke, the chat composer stayed disabled on the
"Starting Hermes…" placeholder until the app was fully relaunched. Root
cause (see docs/bugs/desktop-sleep-wake-reconnect-stale-backend.md):
startHermes() returns a cached connectionPromise with no liveness check,
and that cache is only cleared by the local backend child's 'exit'/'error'
handlers. A remote / global-remote primary spawns no child process, so the
cached descriptor is never invalidated for the life of the main process —
the renderer's (sound) reconnect loop re-dials the same dead remote forever.
Only a relaunch, which resets the module-level connectionPromise, recovered.
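The root cause above can be reproduced in miniature. The following is a minimal sketch, not the real main.cjs: every name except connectionPromise (startBackend, resetCache, fakeChild) is hypothetical, and the backend is reduced to a resolved promise. It shows why a remote connection, which spawns no child, never has its cache cleared.

```javascript
// Miniature of the caching pattern described above. Every name except
// connectionPromise is hypothetical; this is not the real main.cjs.
let connectionPromise = null; // module-level cache

function resetCache() {
  connectionPromise = null;
}

// `spawnChild` stands in for launching the local backend process.
function startBackend(mode, spawnChild) {
  if (connectionPromise) return connectionPromise; // cache hit, no liveness check
  if (mode === 'local') {
    const child = spawnChild();
    // The only invalidation path: a local child exiting or erroring.
    child.once('exit', resetCache);
    child.once('error', resetCache);
  }
  // Remote mode spawns nothing, so nothing will ever call resetCache().
  connectionPromise = Promise.resolve({ mode });
  return connectionPromise;
}

// Tiny stand-in for a child process exposing once()/emit().
function fakeChild() {
  const handlers = {};
  return {
    once(event, cb) { handlers[event] = cb; },
    emit(event) { if (handlers[event]) handlers[event](); },
  };
}

const first = startBackend('remote', fakeChild);
const second = startBackend('remote', fakeChild);
console.log(first === second); // true: same cached promise, even if the remote died
```

In this sketch only a process restart (or an explicit resetCache() call) would give a remote caller a new descriptor, which matches the "only a relaunch recovered" symptom.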
Fix: revalidate-on-reconnect. The renderer's backoff-paced attemptReconnect
now calls getConnection(profile, { revalidate: true }); on a cache hit
startHermes() fast-probes the public /api/status (token-free, ~2.5s) and, if
the backend is unreachable, tears the stale connection down via
resetHermesConnection() and rebuilds — so the existing reconnect loop gets a
fresh, reachable descriptor with no app restart. Recovery lands within a
couple of backoff ticks (typically seconds; up to ~35s only if connect
attempts hit their 15s timeout).
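The revalidate-on-reconnect flow described above can be sketched as follows. This is an illustration under stated assumptions, not the shipped code: the deps object (probe, buildConnection), the cached variable, and the example URL are hypothetical, and the guard rails listed below are deliberately left out. Only the /api/status path and the ~2.5s budget come from the text. probeStatus assumes a runtime with global fetch and AbortSignal.timeout (Node 18+).

```javascript
// Sketch of revalidate-on-reconnect. `deps.probe` and `deps.buildConnection`
// are injected stand-ins (assumptions), not the real main.cjs internals.
let cached = null;

// Token-free liveness probe with a hard timeout; resolves true/false, never throws.
async function probeStatus(baseUrl, timeoutMs = 2500) {
  try {
    const res = await fetch(new URL('/api/status', baseUrl), {
      signal: AbortSignal.timeout(timeoutMs),
    });
    return res.ok;
  } catch {
    return false; // timeout, DNS failure, connection refused, ...
  }
}

async function getConnection({ revalidate = false } = {}, deps) {
  if (cached && revalidate) {
    const { baseUrl } = await cached;
    if (!(await deps.probe(baseUrl))) cached = null; // drop the stale descriptor
  }
  if (!cached) cached = deps.buildConnection(); // rebuild (or first build)
  return cached;
}

// Usage with fakes instead of a real network probe.
async function demo() {
  let alive = true;
  let builds = 0;
  const deps = {
    probe: async () => alive,
    buildConnection: async () => ({ baseUrl: 'http://remote.example:8080', id: ++builds }),
  };
  const a = await getConnection({}, deps);                   // cold boot: build #1
  alive = false;                                             // remote dies during sleep
  const b = await getConnection({}, deps);                   // no revalidate: stale #1
  const c = await getConnection({ revalidate: true }, deps); // probe fails: build #2
  return [a.id, b.id, c.id]; // [1, 1, 2]
}
```

In real use, deps.probe would be probeStatus; keeping it injectable is what lets the decision logic run without a network.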
Hardening (per fix red-team + code review):
- revalidate is opt-in and ONLY wired into use-gateway-boot's backoff-paced
reconnect — not use-gateway-request, which fires on any transient blip and
could needlessly SIGTERM a healthy local child.
- Only mode==='remote' connections are torn down; local backends self-heal
via the child 'exit' handler, so a probe miss there is treated as "WS not
reattached yet", not "backend dead".
- A teardown requires 2 consecutive probe failures *within one reconnect
episode* (time-windowed streak), so a single captive-portal / VPN-on-wake
blip doesn't trigger a respawn, and a stale miss from an earlier,
since-recovered episode can't pre-load the counter.
- Concurrency: after the probe we re-check connectionPromise===cached with no
intervening await (resetHermesConnection is synchronous). If a peer rebuilt
we return their fresh connection; if a backend 'exit' or a rejected cache
nulled it, we fall through and build fresh instead of returning null.
- Steady-state and cold boot stay revalidate-off → zero added latency.
- The renderer dismisses the boot-progress overlay on the post-boot 'open'
transition, so an in-place rebuild (which re-drives boot progress) can't
leave the overlay stuck at ~94%.
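The remote-only rule and the time-windowed failure streak from the list above can be expressed as one pure function. This is a hedged sketch: recordProbe, the state shape, and the 60-second episode window are assumptions made for illustration and are not the contents of hardening.cjs; only the two-consecutive-failures threshold and the remote-only rule come from the text.

```javascript
// Hypothetical pure helper; the name, state shape, and 60s window are assumptions.
const EPISODE_WINDOW_MS = 60_000;
const FAILURES_TO_REBUILD = 2; // from the text: two consecutive probe misses

// Returns the next streak state plus whether to tear the connection down.
function recordProbe(state, { mode, alive, now }) {
  // Local backends are never torn down here; a successful probe clears the streak.
  if (mode !== 'remote' || alive) {
    return { state: { misses: 0, lastMissAt: null }, rebuild: false };
  }
  const sameEpisode =
    state.lastMissAt !== null && now - state.lastMissAt <= EPISODE_WINDOW_MS;
  const misses = sameEpisode ? state.misses + 1 : 1; // stale misses do not carry over
  const rebuild = misses >= FAILURES_TO_REBUILD;
  return {
    state: rebuild ? { misses: 0, lastMissAt: null } : { misses, lastMissAt: now },
    rebuild,
  };
}

const start = { misses: 0, lastMissAt: null };
const miss1 = recordProbe(start, { mode: 'remote', alive: false, now: 0 });
const miss2 = recordProbe(miss1.state, { mode: 'remote', alive: false, now: 5_000 });
console.log(miss1.rebuild, miss2.rebuild); // false true
```

Keeping the clock as an explicit `now` argument is what makes the window behaviour testable without timers.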
Liveness/decision/episode logic is extracted into pure, unit-tested helpers in
hardening.cjs (probeBackendAlive, shouldRebuildStaleConnection,
isFreshRevalidateEpisode) since main.cjs can't be loaded headlessly. New tests
added to hardening.test.cjs; full electron/*.test.cjs suite passes (84/84).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…emote backend
After sleep/wake, a remote (global-remote) primary backend can become unreachable, but it has no child process whose 'exit' clears the main process's cached connectionPromise. The renderer then re-dials the same dead remote forever and the composer stays stuck on "Starting Hermes…"; only a quit+reopen recovered.
Fix: the renderer's existing backoff-paced reconnect loop now asks the main process to revalidate the cached connection before re-dialing. The main process liveness-probes the cached REMOTE backend's public /api/status and, if unreachable, drops the cache (resetHermesConnection only nulls connectionPromise for a remote, since there is no child to SIGTERM) so the next getConnection() rebuilds a reachable descriptor. Local backends are never touched here; they self-heal via the child 'exit' handler. The renderer's loop already provides retry pacing and rides out transient blips, so no streak/episode bookkeeping is needed in the main process. The boot hook dismisses the boot-progress overlay on the post-rebuild 'open' so an in-place rebuild can't leave it stuck at ~94%.
Reimplements #40135 by @AlchemistChaos on a smaller, more interpretable path (63 added lines vs 555): no extracted helper module, no failure-streak / episode-window state, the renderer's backoff loop is the retry mechanism. Original diagnosis and fix by @AlchemistChaos.
Co-authored-by: AlchemistChaos <alchemistchaos@protonmail.com>
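The idea that the renderer's backoff loop is itself the retry mechanism can be sketched like this. Everything here is illustrative: backoffMs and its 1s base / 15s cap, the injected getConnection, dial, and sleep functions, and maxAttempts (added only so the sketch terminates; the text says the real loop retries forever) are assumptions, not the use-gateway-boot implementation.

```javascript
// Sketch of a backoff-paced reconnect loop; all names and timings are hypothetical.
function backoffMs(attempt, baseMs = 1000, capMs = 15000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

async function reconnectLoop({ getConnection, dial, sleep, maxAttempts = 10 }) {
  for (let attempt = 0; attempt < maxAttempts; attempt += 1) {
    // Ask the main process to liveness-check its cached descriptor first.
    const conn = await getConnection({ revalidate: true });
    if (await dial(conn)) return { attempt, conn }; // gateway reached 'open'
    await sleep(backoffMs(attempt)); // the loop itself provides the retry pacing
  }
  return null;
}

// Usage with fakes: the main process hands back a stale descriptor twice.
async function demoLoop() {
  let calls = 0;
  const delays = [];
  const result = await reconnectLoop({
    getConnection: async () => ({ fresh: ++calls >= 3 }), // stale until the 3rd ask
    dial: async (conn) => conn.fresh,
    sleep: async (ms) => { delays.push(ms); },
  });
  return { attempt: result.attempt, delays }; // { attempt: 2, delays: [1000, 2000] }
}
```

Because a failed probe only drops the cache and the next tick retries, a transient post-wake blip costs one extra backoff interval rather than needing streak bookkeeping.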
Merged via #41350 (commit cadb74a on main). Your diagnosis was spot-on: the remote backend has no child process, so nothing clears the cached connectionPromise. We reimplemented the fix on a smaller, more interpretable path (64 added lines vs 555): instead of a failure-streak counter + episode-window timestamps + an extracted helper module, the renderer's existing backoff reconnect loop just asks the main process to liveness-probe the cached remote and drop it if dead. The loop is already the retry mechanism, so it rides out transient post-wake blips on its own. Same recovery behavior. Your authorship is preserved as a co-author on the fix commit. Thanks for the careful write-up and red-team; it made the simplification easy to verify.
Thanks for merging it in! Next time I'll be more careful about how complex my proposed fixes are.
What & why
After the Mac sleeps and wakes, the desktop chat composer stays disabled on the "Starting Hermes…" placeholder and never recovers — only a full quit + reopen fixes it.
The renderer's reconnect loop is actually sound (it retries forever with backoff). The lockout is in the main process:
startHermes() returns a cached connectionPromise with no liveness check, and that cache is only invalidated by the local backend child's 'exit'/'error' handlers. In remote / global-remote mode there is no child process (hermesProcess stays null), so the cached descriptor is never invalidated for the life of the main process, and the renderer re-dials the same dead remote endpoint forever. A relaunch works only because it resets the module-level connectionPromise.
(The placeholder is a precise signal: "Starting Hermes…" shows only when the gateway is stuck in connecting, never reaching open; see composer/index.tsx + chat/index.tsx disabled={!gatewayOpen}.)
A detailed write-up is included in the first commit: docs/bugs/desktop-sleep-wake-reconnect-stale-backend.md.
The fix: revalidate-on-reconnect
The renderer's backoff-paced attemptReconnect now calls getConnection(profile, { revalidate: true }). On a cache hit with that flag, startHermes() fast-probes the public /api/status (token-free fetchPublicJson, ~2.5s) and, if the backend is unreachable, tears the stale connection down via resetHermesConnection() and rebuilds, so the renderer's existing loop gets a fresh, reachable descriptor without an app restart. Recovery lands within a couple of backoff ticks (typically seconds; up to ~35s only if connect attempts hit their 15s timeout).
Guard rails (deliberate, to avoid regressions)
- revalidate is wired solely into use-gateway-boot's reconnect, not use-gateway-request, which fires on any transient request blip and could needlessly SIGTERM a healthy local child.
- Only mode === 'remote' connections are rebuilt; local backends self-heal via the child 'exit' handler, so a probe miss there is treated as "WS not reattached yet", not "backend dead".
- After the probe we re-check connectionPromise === cached with no intervening await (resetHermesConnection is synchronous). A peer rebuild is returned as-is; a cache nulled by a concurrent 'exit'/rejection falls through to a fresh build instead of returning null.
- Steady-state and cold boot add zero latency (revalidate-off).
- The renderer dismisses the boot-progress overlay on the post-boot 'open' transition so an in-place rebuild can't leave the overlay stuck at ~94%.
The liveness/decision/episode logic is extracted into pure, unit-tested helpers in hardening.cjs (probeBackendAlive, shouldRebuildStaleConnection, isFreshRevalidateEpisode), since main.cjs can't be loaded headlessly.
How to test
Repro (before this change):
… global-remote mode).
With this change: on wake, the composer recovers on its own within a couple of backoff ticks, with no restart.
Edge cases to exercise: a transient WS drop where the remote is still alive (should reconnect with no rebuild/overlay, since the probe succeeds); a genuinely dead/moved remote (should rebuild + recover); a local-mode backend (a probe miss must never SIGTERM a healthy child).
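The three edge cases above can be written down as a small table-driven check. This is a sketch under assumptions: decide and its 'keep' / 'rebuild' return values are hypothetical names for the drop-or-keep rule, not functions from the codebase.

```javascript
// Hypothetical decision rule plus the three edge cases as a table.
function decide({ mode, probeOk }) {
  if (mode !== 'remote') return 'keep'; // local children self-heal via 'exit'
  return probeOk ? 'keep' : 'rebuild';
}

const cases = [
  { name: 'transient WS drop, remote alive', mode: 'remote', probeOk: true, want: 'keep' },
  { name: 'remote dead or moved', mode: 'remote', probeOk: false, want: 'rebuild' },
  { name: 'local backend, probe miss', mode: 'local', probeOk: false, want: 'keep' },
];

// Collect the names of any cases whose decision differs from the expectation.
const failures = cases.filter((c) => decide(c) !== c.want).map((c) => c.name);
console.log(failures.length === 0 ? 'all edge cases hold' : failures);
```

A table like this covers only the decision; the "no overlay on a successful probe" part of the first case still needs a manual or end-to-end check.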
Tests / platforms
node --test electron/*.test.cjs (the test:desktop:platforms suite): 84/84 pass, including new pure-helper tests in hardening.test.cjs.
Honest caveats: this was developed/verified on macOS. I was not able to run tsc -b / eslint in my environment (desktop node_modules not installed). The TypeScript change is a backward-compatible optional-param widening on getConnection plus one matching call site, verified by auditing all call sites; please let CI confirm. I also could not run a live multi-OS Electron sleep/wake repro. Change is desktop/Electron (JS) only; no Python touched, so the pytest suite is unaffected.
Related
Surfaces in the recent global-remote / multi-profile work (e.g. #39921, #39993), where a remote primary backend is the common configuration.
🤖 Generated with Claude Code