hermes gateway restart: CLI deadline equals gateway drain budget, false-fires 'still running after 60s' warning and force-kills mid-cleanup

### Context

`hermes gateway restart` on macOS (launchd) prints `⚠ Gateway PID N still running after 60.0s — restart may fail` on **successful** restarts and then runs `launchctl kickstart -k` (SIGKILL) against a gateway that was already shutting down cleanly. The force-kill races the gateway's cleanup tail, leaving sessions marked as auto-resumable when the original shutdown was valid.

Related to #17198 (same file, sibling call site; addressed by PR #17292, which is still open). I'm filing this as an issue rather than a competing PR so maintainers can decide whether to fold it into #17292 or accept a follow-up.

### Observation

`hermes_cli/gateway.py` `launchd_restart()` around L3008:

```python
exited = _wait_for_gateway_exit(timeout=drain_timeout, force_after=None)
if not exited:
    print(f"⚠ Gateway drain timed out after {drain_timeout:.0f}s — forcing launchd restart")
subprocess.run(["launchctl", "kickstart", "-k", target], ...)
```

CLI deadline = `drain_timeout` (default **60s**, from `HERMES_RESTART_DRAIN_TIMEOUT`).
Gateway-side drain budget = also **60s**, plus ~1.2s for adapter-disconnect / SessionDB close / atexit / final exit.

The two deadlines collide. Any drain that runs close to its full budget loses the race even on a clean shutdown.
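The collision can be checked with a few lines of arithmetic. This is a minimal sketch, not code from `hermes_cli`: the constants are the figures quoted above, and the function name `gateway_exits_before_cli_deadline` is made up for illustration.

```python
# Illustrative timing model only. Constants are the figures quoted in this
# issue; the function name is hypothetical and does not exist in hermes_cli.
DRAIN_BUDGET_S = 60.0   # gateway-side drain budget
CLEANUP_TAIL_S = 1.2    # adapter disconnect, SessionDB close, atexit, final exit
CLI_DEADLINE_S = 60.0   # drain_timeout passed to _wait_for_gateway_exit()


def gateway_exits_before_cli_deadline(
    drain_used_s: float,
    cleanup_tail_s: float = CLEANUP_TAIL_S,
    cli_deadline_s: float = CLI_DEADLINE_S,
) -> bool:
    """Return True if drain plus the cleanup tail fits inside the CLI deadline."""
    return drain_used_s + cleanup_tail_s <= cli_deadline_s


print(gateway_exits_before_cli_deadline(58.0))            # True: 59.2s <= 60s
print(gateway_exits_before_cli_deadline(DRAIN_BUDGET_S))  # False: 61.2s > 60s
```

With equal deadlines, any positive cleanup tail makes a full-budget drain lose the race, whatever the exact tail length is.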

### Evidence from a live log (`~/.hermes/logs/gateway.log`)

```
2026-05-14 16:19:59  Received SIGTERM — initiating shutdown
2026-05-14 16:20:49  Shutdown phase: drain done at +50.09s
                     (drain took 49.36s, timed_out=False, active_at_start=1, active_now=0)
```

The drain finished cleanly at +50.09s, roughly 10s before the CLI deadline at +60s. Three back-to-back `hermes gateway restart` invocations on the same machine all printed the false-alarm warning, even though `timed_out=False` for two of them.

When the CLI does cross 60s, it falls through to `launchctl kickstart -k` mid-cleanup → no `.clean_shutdown` marker → next boot:

```
INFO gateway.run: Marked 1 in-flight session(s) as resumable from previous run
INFO gateway.run: Scheduled auto-resume for 1 restart-interrupted session(s)
```

Sessions get auto-resume-tagged on what was actually a graceful shutdown.

### Analysis

The CLI deadline must exceed the gateway deadline by enough to cover the cleanup tail (~5–15s observed). Today it equals the gateway deadline, so the CLI loses the race whenever drain uses its full budget.

PR #17292 fixed an analogous bug at `service_stop()` (L2946):

```python
# Before #17292
_wait_for_gateway_exit(timeout=10.0, force_after=5.0)
# After #17292
_wait_for_gateway_exit(timeout=max(_drain, 20.0), force_after=min(_drain * 0.5, 10.0))
```

The `launchd_restart()` call site wasn't updated in that PR.
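To see what that argument shape yields in practice, it can be evaluated for a few drain values. The helper name `service_stop_wait_args` is hypothetical; it only mirrors the two expressions quoted above and does not call any real `hermes_cli` code.

```python
def service_stop_wait_args(drain: float) -> dict:
    """Mirror the timeout/force_after expressions quoted from #17292."""
    return {
        "timeout": max(drain, 20.0),
        "force_after": min(drain * 0.5, 10.0),
    }


for drain in (10.0, 30.0, 60.0):
    print(drain, service_stop_wait_args(drain))
# 10.0 {'timeout': 20.0, 'force_after': 5.0}
# 30.0 {'timeout': 30.0, 'force_after': 10.0}
# 60.0 {'timeout': 60.0, 'force_after': 10.0}
```

At the default 60s drain, `timeout` is still 60s, so this shape on its own adds no headroom beyond the drain budget. Whether that matters depends on how `_wait_for_gateway_exit` treats `force_after`, which is not shown in this issue.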

### Possible approaches

1. **Extend the CLI deadline past the gateway deadline.** Smallest fix:
   ```python
   exited = _wait_for_gateway_exit(timeout=drain_timeout + 15.0, force_after=drain_timeout + 10.0)
   ```
   ~2 LOC, same file, `launchd_restart()` only.

2. **Apply #17292's pattern at this site.** After #17292 merges, reuse the `max(_drain, 20.0)` / `min(_drain * 0.5, 10.0)` shape for consistency. ~2 LOC.

3. **Emit the warning only when actually force-killing.** The current message conflates "still draining (fine)" with "wedged (bad)". Move/reword. ~5 LOC.

Approaches 1 and 3 are independent, and both are small.
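Approaches 1 and 3 could be combined roughly as follows. This is a sketch under stated assumptions, not a patch: `wait_for_exit` and `kickstart` are injected stand-ins for `_wait_for_gateway_exit` and the `launchctl kickstart -k` call, the function name is invented, and the 15s headroom is an assumed value taken from the upper end of the observed cleanup tail.

```python
from typing import Callable

CLEANUP_HEADROOM_S = 15.0  # assumption: upper end of the observed cleanup tail


def restart_with_headroom(
    wait_for_exit: Callable[[float], bool],  # stand-in for _wait_for_gateway_exit
    kickstart: Callable[[], None],           # stand-in for `launchctl kickstart -k`
    drain_timeout: float = 60.0,
    warn: Callable[[str], None] = print,
) -> bool:
    """Wait past the gateway's drain budget; warn only if it is still running."""
    deadline = drain_timeout + CLEANUP_HEADROOM_S
    exited = wait_for_exit(deadline)
    if not exited:
        # Only this branch force-kills a live gateway, so only it warns.
        warn(f"Gateway still running after {deadline:.0f}s; forcing launchd restart")
    kickstart()
    return exited
```

Injecting the two callables keeps the sketch runnable without launchd and makes the warn-only-on-force-kill behaviour easy to unit-test.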

### Reproduction

On macOS with at least one mid-tool-call agent session active:

```bash
hermes gateway restart
hermes gateway restart
hermes gateway restart
```

Check `gateway.log` for `drain took Ns` with N close to but under 60 — those are the false alarms. If any go over 60, you'll also see the auto-resume tag on a session that didn't need it.
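The log check can be scripted. The sketch below assumes the `drain took Ns, timed_out=...` format shown in the log excerpt earlier; the function names and the 15s margin are illustrative choices, not part of Hermes.

```python
import re

# Matches the fragment "drain took 49.36s, timed_out=False" from gateway.log.
DRAIN_RE = re.compile(
    r"drain took (?P<secs>\d+(?:\.\d+)?)s, timed_out=(?P<timed_out>True|False)"
)


def find_drains(log_text: str) -> list:
    """Return (seconds, timed_out) for every drain summary in the log text."""
    return [
        (float(m["secs"]), m["timed_out"] == "True")
        for m in DRAIN_RE.finditer(log_text)
    ]


def near_budget(drains: list, budget: float = 60.0, margin: float = 15.0) -> list:
    """Clean drains that finished within `margin` seconds of the budget."""
    return [
        secs for secs, timed_out in drains
        if not timed_out and budget - margin <= secs < budget
    ]


sample = (
    "2026-05-14 16:20:49  Shutdown phase: drain done at +50.09s\n"
    "    (drain took 49.36s, timed_out=False, active_at_start=1, active_now=0)\n"
)
print(near_budget(find_drains(sample)))  # [49.36]
```

To run it against a real log, read the text with `pathlib.Path("~/.hermes/logs/gateway.log").expanduser().read_text()` and pass it to `find_drains`.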

### Questions

1. Is this site intentionally left for a follow-up to #17292, or did the PR just miss it?
2. Want me to PR the ~2 LOC fix once #17292 merges, or fold into a single combined PR?
3. The deeper "tool call won't release during drain" cause (real 60s timeout 2026-05-14 00:05:43) — separate issue, or covered by #14176 / #20694?

