Context
hermes gateway restart on macOS (launchd) prints "⚠ Gateway PID N still running after 60.0s — restart may fail" on successful restarts and then runs launchctl kickstart -k (SIGKILL) against a gateway that was already shutting down cleanly. The force-kill races the gateway's cleanup tail, leaving sessions marked as auto-resumable when the original shutdown was valid.
Related to #17198 (same file, sibling site, fixed by PR #17292 — still open). Filing as an issue rather than a competing PR to defer to maintainer preference on whether to fold this into #17292 or accept a follow-up.
Observation
hermes_cli/gateway.py, launchd_restart(), around L3008:
exited = _wait_for_gateway_exit(timeout=drain_timeout, force_after=None)
if not exited:
    print(f"⚠ Gateway drain timed out after {drain_timeout:.0f}s — forcing launchd restart")
    subprocess.run(["launchctl", "kickstart", "-k", target], ...)
CLI deadline = drain_timeout (default 60s, from HERMES_RESTART_DRAIN_TIMEOUT).
Gateway-side drain budget = also 60s, plus ~1.2s for adapter-disconnect / SessionDB close / atexit / final exit.
The two deadlines collide. Any drain that runs close to its full budget loses the race even on a clean shutdown.
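The collision can be modelled with a few lines of arithmetic. The sketch below is illustrative only: `cli_forces_kill` and its parameters are hypothetical names, not part of hermes_cli, and it assumes the CLI starts its timer at the moment the gateway receives SIGTERM.

```python
def cli_forces_kill(drain_s: float, cleanup_tail_s: float, cli_deadline_s: float) -> bool:
    """Return True if the CLI deadline expires before the gateway process exits.

    Hypothetical model: the gateway exits drain_s + cleanup_tail_s seconds
    after SIGTERM, and the CLI stops waiting at cli_deadline_s.
    """
    gateway_exit_at = drain_s + cleanup_tail_s
    return gateway_exit_at > cli_deadline_s


# A drain that uses 59.5s of its 60s budget, plus the ~1.2s tail, exits at 60.7s.
print(cli_forces_kill(59.5, 1.2, 60.0))  # True: clean shutdown, but force-killed
# The same shutdown against a deadline with headroom is left alone.
print(cli_forces_kill(59.5, 1.2, 75.0))  # False
```

The 75s deadline in the second call is an arbitrary value chosen for illustration, not a proposed default.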
Evidence from a live log (~/.hermes/logs/gateway.log)
2026-05-14 16:19:59 Received SIGTERM — initiating shutdown
2026-05-14 16:20:49 Shutdown phase: drain done at +50.09s
(drain took 49.36s, timed_out=False, active_at_start=1, active_now=0)
Clean drain finished at +50.09s, about 10s before the CLI deadline at +60s. Three back-to-back hermes gateway restart invocations on the same machine all printed the false-alarm warning even though timed_out=False for two of them.
When the CLI does cross 60s, it falls through to launchctl kickstart -k mid-cleanup → no .clean_shutdown marker → next boot:
INFO gateway.run: Marked 1 in-flight session(s) as resumable from previous run
INFO gateway.run: Scheduled auto-resume for 1 restart-interrupted session(s)
Sessions get auto-resume-tagged on what was actually a graceful shutdown.
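To find such cases in an existing log, the drain summary line can be parsed directly. This is a minimal sketch that assumes every drain summary follows the format in the excerpt above; `near_deadline_drains`, `budget_s`, and `window_s` are hypothetical names, and the 15s window is an arbitrary choice.

```python
import re

# Matches the drain summary shown in the excerpt above, e.g.
# "(drain took 49.36s, timed_out=False, active_at_start=1, active_now=0)"
DRAIN_RE = re.compile(
    r"drain took (?P<secs>\d+(?:\.\d+)?)s, timed_out=(?P<timed_out>True|False)"
)


def near_deadline_drains(lines, budget_s=60.0, window_s=15.0):
    """Return (seconds, timed_out) for drains within window_s of budget_s, or over it."""
    hits = []
    for line in lines:
        m = DRAIN_RE.search(line)
        if m is None:
            continue
        secs = float(m.group("secs"))
        if secs >= budget_s - window_s:
            hits.append((secs, m.group("timed_out") == "True"))
    return hits


sample = [
    "2026-05-14 16:20:49 Shutdown phase: drain done at +50.09s",
    "(drain took 49.36s, timed_out=False, active_at_start=1, active_now=0)",
]
print(near_deadline_drains(sample))  # [(49.36, False)]
```

Against a real log, the same function can be fed an open file handle for ~/.hermes/logs/gateway.log instead of the sample list.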
Analysis
The CLI deadline must exceed the gateway deadline by enough to cover the cleanup tail (~5–15s observed). Today it equals the gateway deadline, which guarantees a false-alarm warning and force-kill whenever the drain uses its full budget.
PR #17292 fixed an analogous bug at service_stop() (L2946):
# Before #17292
_wait_for_gateway_exit(timeout=10.0, force_after=5.0)
# After #17292
_wait_for_gateway_exit(timeout=max(_drain, 20.0), force_after=min(_drain * 0.5, 10.0))
The launchd_restart() call site wasn't updated in that PR.
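For reference, the values that the #17292 shape produces can be tabulated. The helper name `stop_wait_params` is hypothetical; only the two expressions are taken from the snippet above.

```python
def stop_wait_params(drain_timeout: float) -> tuple[float, float]:
    """Compute (timeout, force_after) using the expressions quoted from #17292."""
    timeout = max(drain_timeout, 20.0)
    force_after = min(drain_timeout * 0.5, 10.0)
    return timeout, force_after


for drain in (10.0, 60.0, 120.0):
    print(drain, stop_wait_params(drain))
# 10.0 (20.0, 5.0)
# 60.0 (60.0, 10.0)
# 120.0 (120.0, 10.0)
```

Note that for the default 60s drain timeout the computed wait is still 60.0s, so whether this shape alone leaves room for the cleanup tail depends on how force_after is handled at the call site.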
Possible approaches
1. Extend the CLI deadline past the gateway deadline. Smallest fix: ~2 LOC, same file, launchd_restart() only.
2. Apply #17292's pattern at this site. After #17292 (fix: use configured drain timeout for gateway restart wait (#17198)) merges, reuse the max(_drain, 20.0) / min(_drain * 0.5, 10.0) shape for consistency. ~2 LOC.
3. Emit the warning only when actually force-killing. The current message conflates "still draining (fine)" with "wedged (bad)". Move/reword. ~5 LOC.
1 and 3 are independent; both are small.
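Approaches 1 and 3 could be combined roughly as follows. This is a sketch under stated assumptions, not the actual hermes_cli code: `wait_then_maybe_force`, `is_running`, `force_restart`, and `CLEANUP_MARGIN_S` are hypothetical names, and the 15s margin is simply the upper end of the cleanup tail quoted in the Analysis.

```python
import time
from typing import Callable

CLEANUP_MARGIN_S = 15.0  # assumption: upper end of the ~5-15s observed tail


def wait_then_maybe_force(
    is_running: Callable[[], bool],
    force_restart: Callable[[], None],
    drain_timeout: float,
    margin: float = CLEANUP_MARGIN_S,
    poll_interval: float = 0.1,
) -> bool:
    """Wait up to drain_timeout + margin for exit; return True if a force-restart was needed."""
    deadline = time.monotonic() + drain_timeout + margin
    while time.monotonic() < deadline:
        if not is_running():
            return False  # exited on its own: no warning, no force-kill
        time.sleep(poll_interval)
    if not is_running():
        return False  # exited during the final sleep
    # Only now is the process treated as wedged, so warn and force together.
    print(f"Gateway still running after {drain_timeout + margin:.0f}s; forcing restart")
    force_restart()
    return True


# Usage with stand-ins: a process that has already exited is never force-restarted.
forced = wait_then_maybe_force(lambda: False, lambda: None, drain_timeout=0.2, margin=0.1)
print(forced)  # False
```

In the real call site, `force_restart` would wrap the existing launchctl kickstart -k invocation, and `is_running` would be whatever liveness check _wait_for_gateway_exit already performs.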
Reproduction
On macOS with at least one mid-tool-call agent session active, run hermes gateway restart.
Check gateway.log for "drain took Ns" with N close to but under 60; those are the false alarms. If any go over 60, you'll also see the auto-resume tag on a session that didn't need it.
Questions