
fix(desktop): recover chat after sleep/wake by revalidating the cached backend #40135

Closed
AlchemistChaos wants to merge 2 commits into NousResearch:main from AlchemistChaos:fix/desktop-sleep-wake-reconnect-stale-backend

Conversation

@AlchemistChaos


What & why

After the Mac sleeps and wakes, the desktop chat composer stays disabled on the "Starting Hermes…" placeholder and never recovers — only a full quit + reopen fixes it.

The renderer's reconnect loop is actually sound (it retries forever with backoff). The lockout is in the main process: startHermes() returns a cached connectionPromise with no liveness check, and that cache is only invalidated by the local backend child's 'exit'/'error' handlers. In remote / global-remote mode there is no child process (hermesProcess stays null), so the cached descriptor is never invalidated for the life of the main process — the renderer re-dials the same dead remote endpoint forever. A relaunch works only because it resets the module-level connectionPromise.
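A minimal sketch of the caching behaviour described above, assuming a heavily simplified main process. Only connectionPromise, hermesProcess, startHermes() and the child 'exit' handler come from the description; createMainProcessModel, buildDescriptor, getChild and the EventEmitter stand-in for the spawned child are hypothetical.

```javascript
const { EventEmitter } = require('node:events');

// Hypothetical model of the pre-fix cache. Not the real main.cjs code.
function createMainProcessModel() {
  let connectionPromise = null; // module-level cache in the real app
  let hermesProcess = null;     // stays null in remote / global-remote mode

  function startHermes(mode, buildDescriptor) {
    // Cache hit: returned as-is, with no liveness check.
    if (connectionPromise) return connectionPromise;

    if (mode === 'local') {
      // Stand-in for the spawned backend child (assumption).
      const child = new EventEmitter();
      hermesProcess = child;
      // The only invalidation path: the local child's 'exit' handler.
      child.once('exit', () => {
        connectionPromise = null;
        hermesProcess = null;
      });
    }

    connectionPromise = Promise.resolve(buildDescriptor());
    return connectionPromise;
  }

  return { startHermes, getChild: () => hermesProcess };
}
```

In remote mode a second call returns the same promise no matter what a rebuild would produce, because nothing can clear the cache. In local mode, emitting 'exit' on the child clears it and the next call builds a fresh descriptor.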

(The placeholder is a precise signal: "Starting Hermes…" shows only when the gateway is stuck in connecting and never reaches open; see composer/index.tsx and chat/index.tsx, disabled={!gatewayOpen}.)

A detailed write-up is included in the first commit: docs/bugs/desktop-sleep-wake-reconnect-stale-backend.md.

The fix — revalidate-on-reconnect

The renderer's backoff-paced attemptReconnect now calls getConnection(profile, { revalidate: true }). On a cache hit with that flag, startHermes() fast-probes the public /api/status (token-free fetchPublicJson, ~2.5s) and, if the backend is unreachable, tears the stale connection down via resetHermesConnection() and rebuilds — so the renderer's existing loop gets a fresh, reachable descriptor without an app restart. Recovery lands within a couple of backoff ticks (typically seconds; up to ~35s only if connect attempts hit their 15s timeout).
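The revalidate-on-reconnect flow can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in main.cjs: createConnectionManager, buildConnection and probeAlive are hypothetical, the probe is injected rather than calling /api/status, and the remote-only and failure-streak rules are left out for brevity.

```javascript
// Hypothetical sketch of revalidating a cached connection on request.
function createConnectionManager({ buildConnection, probeAlive }) {
  let connectionPromise = null;

  // Synchronous, so no other task can run between the re-check and the reset.
  function resetHermesConnection() {
    connectionPromise = null;
  }

  async function startHermes({ revalidate = false } = {}) {
    const cached = connectionPromise;

    if (cached && revalidate) {
      let descriptor = null;
      try {
        descriptor = await cached;
      } catch {
        // A rejected cache is treated like a dead backend.
      }
      const alive = descriptor ? await probeAlive(descriptor) : false;
      // Re-check identity after the awaits: a peer may already have rebuilt.
      if (!alive && connectionPromise === cached) resetHermesConnection();
    }

    // Cold boot, or the cache was just dropped: build a fresh descriptor.
    if (!connectionPromise) {
      connectionPromise = Promise.resolve().then(buildConnection);
    }
    return connectionPromise;
  }

  return { startHermes, resetHermesConnection };
}
```

Calls without revalidate keep the old fast path (a plain cache hit), which is why cold boot and steady state pay no extra latency in this model.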

Guard rails (deliberate, to avoid regressions)

  • Opt-in, backoff-paced only. revalidate is wired solely into use-gateway-boot's reconnect — not use-gateway-request, which fires on any transient request blip and could needlessly SIGTERM a healthy local child.
  • Remote-only teardown. Only mode === 'remote' connections are rebuilt; local backends self-heal via the child 'exit' handler, so a probe miss there is treated as "WS not reattached yet", not "backend dead".
  • 2 consecutive failures within one episode (time-windowed streak) before teardown, so a single captive-portal / VPN-re-establishing blip on wake doesn't trigger a respawn, and a stale miss from an earlier (since-recovered) episode can't pre-load the counter.
  • Concurrency-safe. After the probe we re-check connectionPromise === cached with no intervening await (resetHermesConnection is synchronous). A peer rebuild is returned as-is; a cache nulled by a concurrent 'exit'/rejection falls through to a fresh build instead of returning null.
  • Zero added latency on cold boot and steady state (both stay revalidate-off).
  • The renderer dismisses the boot-progress overlay on the post-boot 'open' transition so an in-place rebuild can't leave the overlay stuck at ~94%.
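The remote-only and two-consecutive-failures rules above could be expressed as a small, clock-injected decision function. This is a sketch, not the PR's code: createStreakTracker, onProbeResult and both constants are hypothetical, and the window length in particular is a guess, since the description does not state it.

```javascript
// Hypothetical values: the PR description does not give the window length.
const REQUIRED_MISSES = 2;
const EPISODE_WINDOW_MS = 60_000;

function createStreakTracker(now = Date.now) {
  let misses = 0;
  let lastMissAt = 0;

  return {
    // Returns true when the cached connection should be torn down.
    onProbeResult({ mode, alive }) {
      if (alive) {
        misses = 0; // a successful probe ends the episode
        return false;
      }
      // Local backends self-heal via the child 'exit' handler.
      if (mode !== 'remote') return false;

      const t = now();
      // A miss from an earlier episode must not pre-load the counter.
      if (t - lastMissAt > EPISODE_WINDOW_MS) misses = 0;
      misses += 1;
      lastMissAt = t;

      if (misses < REQUIRED_MISSES) return false;
      misses = 0; // start clean after a teardown
      return true;
    },
  };
}
```

Injecting the clock keeps the function pure enough to test the windowing without timers.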

The liveness/decision/episode logic is extracted into pure, unit-tested helpers in hardening.cjs (probeBackendAlive, shouldRebuildStaleConnection, isFreshRevalidateEpisode), since main.cjs can't be loaded headlessly.
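As an illustration of why a pure helper is easy to test headlessly, here is one way a liveness probe with a timeout might look. The name probeBackendAlive comes from the PR, but the signature, the injected fetchImpl, the use of AbortController instead of fetchPublicJson, and the "HTTP ok means alive" rule are all assumptions.

```javascript
// Hypothetical signature. fetchImpl is injected so no network or Electron is needed;
// on Node 18+ the global fetch could be passed in.
async function probeBackendAlive(baseUrl, { fetchImpl, timeoutMs = 2500 } = {}) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    // /api/status is the public, token-free endpoint named in the description.
    const url = new URL('/api/status', baseUrl);
    const res = await fetchImpl(url, { signal: controller.signal });
    return Boolean(res && res.ok);
  } catch {
    // Network error, timeout abort, or a malformed URL all count as "not alive".
    return false;
  } finally {
    clearTimeout(timer);
  }
}
```

A test can pass a fake fetchImpl that resolves, rejects, or hangs until aborted, which covers the reachable, unreachable and timed-out cases without opening a socket.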

How to test

Repro (before this change):

  1. Run Hermes Desktop against a remote gateway (global-remote mode).
  2. Open a chat; confirm the composer is enabled.
  3. Sleep the Mac (or drop the network long enough to tear the remote WS down), then wake it later.
  4. Observe the composer stuck disabled on "Starting Hermes…" indefinitely; quit + reopen restores it.

With this change: on wake, the composer recovers on its own within a couple of backoff ticks — no restart.

Edge cases to exercise: a transient WS drop where the remote is still alive (should reconnect with no rebuild/overlay, since the probe succeeds); a genuinely dead/moved remote (should rebuild + recover); a local-mode backend (a probe miss must never SIGTERM a healthy child).

Tests / platforms

  • node --test electron/*.test.cjs (the test:desktop:platforms suite): 84/84 pass, including new pure-helper tests in hardening.test.cjs.
  • Code reviewed and red-teamed; findings (streak-reset lifecycle, concurrency, null/rejected-cache handling) addressed.

Honest caveats: this was developed/verified on macOS. I was not able to run tsc -b / eslint in my environment (desktop node_modules not installed) — the TypeScript change is a backward-compatible optional-param widening on getConnection plus one matching call site, verified by auditing all call sites; please let CI confirm. I also could not run a live multi-OS Electron sleep/wake repro. Change is desktop/Electron (JS) only — no Python touched, so the pytest suite is unaffected.

Related

Surfaces in the recent global-remote / multi-profile work (e.g. #39921, #39993), where a remote primary backend is the common configuration.

🤖 Generated with Claude Code

AlchemistChaos and others added 2 commits June 5, 2026 22:01
Detailed pre-fix investigation of a bug where, after the Mac sleeps and
wakes, the desktop chat composer stays disabled on the "Starting Hermes…"
placeholder and never recovers — only a full app quit + reopen fixes it.

Root cause: in remote / global-remote mode, startHermes() (main.cjs:4322)
returns a cached connectionPromise with no liveness check, and the cache is
only invalidated by the local backend child's 'exit'/'error' handlers. A
remote primary spawns no child process (main.cjs:4328-4348), so the cached
descriptor is never invalidated for the life of the main process. After
sleep the renderer's (sound) reconnect loop keeps re-dialing the same dead
remote endpoint forever; a relaunch works only because it resets the
module-level connectionPromise.

The renderer reconnect loop, gateway state machine, and the exact
"connecting"-pinned placeholder logic are all traced with file:line
evidence. Root cause confirmed by three independent analyses (0.95–0.98).

The fix lands in a separate commit so the diagnosis can be reviewed on its
own.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d backend

After the Mac slept and woke, the chat composer stayed disabled on the
"Starting Hermes…" placeholder until the app was fully relaunched. Root
cause (see docs/bugs/desktop-sleep-wake-reconnect-stale-backend.md):
startHermes() returns a cached connectionPromise with no liveness check,
and that cache is only cleared by the local backend child's 'exit'/'error'
handlers. A remote / global-remote primary spawns no child process, so the
cached descriptor is never invalidated for the life of the main process —
the renderer's (sound) reconnect loop re-dials the same dead remote forever.
Only a relaunch, which resets the module-level connectionPromise, recovered.

Fix: revalidate-on-reconnect. The renderer's backoff-paced attemptReconnect
now calls getConnection(profile, { revalidate: true }); on a cache hit
startHermes() fast-probes the public /api/status (token-free, ~2.5s) and, if
the backend is unreachable, tears the stale connection down via
resetHermesConnection() and rebuilds — so the existing reconnect loop gets a
fresh, reachable descriptor with no app restart. Recovery lands within a
couple of backoff ticks (typically seconds; up to ~35s only if connect
attempts hit their 15s timeout).

Hardening (per fix red-team + code review):
- revalidate is opt-in and ONLY wired into use-gateway-boot's backoff-paced
  reconnect — not use-gateway-request, which fires on any transient blip and
  could needlessly SIGTERM a healthy local child.
- Only mode==='remote' connections are torn down; local backends self-heal
  via the child 'exit' handler, so a probe miss there is treated as "WS not
  reattached yet", not "backend dead".
- A teardown requires 2 consecutive probe failures *within one reconnect
  episode* (time-windowed streak), so a single captive-portal / VPN-on-wake
  blip doesn't trigger a respawn, and a stale miss from an earlier,
  since-recovered episode can't pre-load the counter.
- Concurrency: after the probe we re-check connectionPromise===cached with no
  intervening await (resetHermesConnection is synchronous). If a peer rebuilt
  we return their fresh connection; if a backend 'exit' or a rejected cache
  nulled it, we fall through and build fresh instead of returning null.
- Steady-state and cold boot stay revalidate-off → zero added latency.
- The renderer dismisses the boot-progress overlay on the post-boot 'open'
  transition, so an in-place rebuild (which re-drives boot progress) can't
  leave the overlay stuck at ~94%.

Liveness/decision/episode logic is extracted into pure, unit-tested helpers in
hardening.cjs (probeBackendAlive, shouldRebuildStaleConnection,
isFreshRevalidateEpisode) since main.cjs can't be loaded headlessly. New tests
added to hardening.test.cjs; full electron/*.test.cjs suite passes (84/84).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
alt-glitch added the type/bug (Something isn't working), P3 (Low: cosmetic, nice to have), and comp/tui (Terminal UI: ui-tui/ + tui_gateway/) labels Jun 5, 2026
teknium1 added a commit that referenced this pull request Jun 7, 2026
…emote backend

After sleep/wake, a remote (global-remote) primary backend can become
unreachable, but it has no child process whose 'exit' clears the main
process's cached connectionPromise. The renderer then re-dials the same
dead remote forever and the composer stays stuck on "Starting Hermes…";
only a quit+reopen recovered.

Fix: the renderer's existing backoff-paced reconnect loop now asks the
main process to revalidate the cached connection before re-dialing. The
main process liveness-probes the cached REMOTE backend's public
/api/status and, if unreachable, drops the cache (resetHermesConnection
only nulls connectionPromise for a remote — no child to SIGTERM) so the
next getConnection() rebuilds a reachable descriptor. Local backends are
never touched here; they self-heal via the child 'exit' handler. The
renderer's loop already provides retry pacing and rides out transient
blips, so no streak/episode bookkeeping is needed in the main process.

The boot hook dismisses the boot-progress overlay on the post-rebuild
'open' so an in-place rebuild can't leave it stuck at ~94%.

Reimplements #40135 by @AlchemistChaos on a smaller, more interpretable
path (63 added lines vs 555): no extracted helper module, no
failure-streak / episode-window state, the renderer's backoff loop is
the retry mechanism. Original diagnosis and fix by @AlchemistChaos.

Co-authored-by: AlchemistChaos <alchemistchaos@protonmail.com>
teknium1 added a commit that referenced this pull request Jun 8, 2026
@teknium1

teknium1 commented Jun 8, 2026


Merged via #41350 (commit cadb74a on main). Your diagnosis was spot-on — the remote backend has no child process, so the 'exit' handler that would clear the cached connectionPromise never fires, and the renderer re-dials the dead endpoint forever.

We reimplemented the fix on a smaller, more interpretable path (64 added lines vs 555): instead of a failure-streak counter + episode-window timestamps + an extracted helper module, the renderer's existing backoff reconnect loop just asks the main process to liveness-probe the cached remote and drop it if dead — the loop is already the retry mechanism, so it rides out transient post-wake blips on its own. Same recovery behavior.

Your authorship is preserved as a co-author on the fix commit. Thanks for the careful write-up and red-team — it made the simplification easy to verify.
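For readers following the simplification, the idea that the renderer's backoff loop is itself the retry mechanism can be sketched as below. This is a hypothetical model, not the merged code: reconnectWithBackoff, dial, sleep and the delay values are assumptions; only getConnection(profile, { revalidate: true }) is taken from the PR description.

```javascript
// Hypothetical renderer-side loop: every failed attempt just waits and retries,
// so a transient post-wake blip costs one or two ticks and needs no extra state.
async function reconnectWithBackoff({
  profile,
  getConnection,
  dial,
  sleep,
  initialDelayMs = 500, // assumed value
  maxDelayMs = 8000,    // assumed value
}) {
  let delayMs = initialDelayMs;
  for (let attempt = 1; ; attempt += 1) {
    try {
      // Ask the main process to liveness-check its cache before re-dialing.
      const descriptor = await getConnection(profile, { revalidate: true });
      const socket = await dial(descriptor);
      return { socket, attempts: attempt };
    } catch {
      await sleep(delayMs);
      delayMs = Math.min(delayMs * 2, maxDelayMs);
    }
  }
}
```

Because each tick re-asks for a revalidated descriptor, a single failed probe only drops the cache; the pacing and the tolerance for short outages both come from the loop.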

@teknium1 teknium1 closed this Jun 8, 2026
agogo233 added a commit to agogo233/hermes-agent that referenced this pull request Jun 8, 2026
* upstream/main: (430 commits)
  fix(yuanbao): bound ws.close() so an idle server can't stall shutdown ~5s (NousResearch#40607)
  docs: add Urdu translation of README (NousResearch#40578)
  fix(hindsight): send only new-turn delta on append retains instead of whole session (NousResearch#40605)
  feat(gateway): render terminal tool calls as native bash code blocks on markdown platforms (NousResearch#41215)
  feat(desktop): stop the chat viewport from following streaming output (NousResearch#41414)
  chore(release): map AlchemistChaos co-author email for NousResearch#40135 salvage
  fix(desktop): recover chat after sleep/wake by revalidating a stale remote backend
  fix(web): make _has_env config-aware so SEARXNG_URL auto-detect honors Hermes config
  fix(web): honor Hermes config-aware SEARXNG_URL lookup
  install.sh: hint at root-owned npm cache when desktop npm install fails (NousResearch#39688)
  fix(tools): percent-encode non-ascii URL components
  fix(skills): browse shows full catalog, not first 5000 (NousResearch#41413)
  feat(desktop+gateway): remote media relay — attach images/PDFs and display gateway images over the network
  feat(desktop): full tool-backend config (pickers + per-backend settings) in Settings (NousResearch#41232)
  hardening(api-server): scan cron prompts on REST create/update for parity with the agent tool
  fix: skip MCP preflight content-type probe on reconnect when already ready (NousResearch#40604)
  fix(kanban): sweep deferred scratch parent on non-scratch child completion + tests
  fix: defer scratch workspace cleanup when task has active children (NousResearch#33774)
  feat(onboarding): opt-in structured profile-build path on first contact (NousResearch#41114)
  feat(compression): temporal anchoring in compaction summaries (NousResearch#41102)
  test(discord): align clarify/model-picker tests with fail-closed component auth (NousResearch#41338)
  chore(release): map Dusk1e and LaPhilosophie for approval fail-closed salvage (NousResearch#33844, NousResearch#33866, NousResearch#30964)
  fix(discord): fail closed for component button auth when no allowlist set
  fix(feishu): fail closed for update prompt card actions
  fix(slack): re-check gateway auth on approval and slash-confirm buttons
  fix: guard int(os.getenv()) casts against malformed env vars (NousResearch#40598)
  fix: respect Honcho env var fallback in doctor and honcho status
  chore(release): add synapsesx to AUTHOR_MAP for NousResearch#40495 salvage
  fix(research): keep tool_call/tool_response pairs intact when compressing trajectories
  fix(simplex): accept display name in SIMPLEX_ALLOWED_USERS
  fix(desktop): make the running-turn timer per-session (NousResearch#41182)
  test(approval): regression for shell-escape denylist bypass (NousResearch#36846, NousResearch#36847)
  fix(security): strip shell escapes in denylist normalizer; fail-closed on missing approval module
  fix(stream+output-cap): guard empty streams and parse OpenRouter output-cap errors (NousResearch#40589)
  fix(desktop): bootstrap falls back to installed agent install.sh on GitHub 404
  feat(dashboard): change UI font from the theme picker, independent of theme (NousResearch#41145)
  fix(cli): return bool (not None) when a destructive-slash confirmation is cancelled (NousResearch#40583)
  fix(desktop): preserve configured base_url on same-provider model switch (NousResearch#41121)
  fix(desktop): stop bare-URL autolinker swallowing trailing emphasis asterisks (NousResearch#41093)
  fix(cron): bound the desktop run-history query to one job (NousResearch#41088)
  fix(desktop): scope in-session /model switch per-session, stop process-env leak (NousResearch#41120)
  chore: map bmoore210 author email for PR NousResearch#40550 salvage
  fix(desktop): scope session list to active profile + longer timeout
  fix: harden gateway startup and turn persistence
  fix(computer_use): honor custom vision routing
  fix(aux): honor model.default_headers on auxiliary client too (NousResearch#40033)
  fix(agent): honor model.default_headers for custom OpenAI-compatible providers (NousResearch#40033)
  docs(i18n): port deep-audit corrections to zh-Hans mirror (NousResearch#41104)
  fix(compression): don't overwrite the -1 post-compression sentinel in preflight seed (NousResearch#36718)
  chore(release): map singhsanidhya741@gmail.com to sanidhyasin (NousResearch#41094)
  ...
@AlchemistChaos


Thanks for merging it in! Next time I'll be more careful about how complex my proposed fixes are.

changman pushed a commit to changman/hermes-agent that referenced this pull request Jun 10, 2026
