Skip to content

fix(gateway): drain in-flight work before restart#7290

Closed
bobashopcashier wants to merge 4 commits into
NousResearch:mainfrom
bobashopcashier:aquaright1/fix-gateway-restart-drain
Closed

fix(gateway): drain in-flight work before restart#7290
bobashopcashier wants to merge 4 commits into
NousResearch:mainfrom
bobashopcashier:aquaright1/fix-gateway-restart-drain

Conversation

@bobashopcashier

@bobashopcashier bobashopcashier commented Apr 10, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

Fixes gateway self-restart so in-flight agent work is drained before shutdown instead of being torn down mid-response.

This adds a drain-aware /restart path inside the gateway, teaches shutdown/restart to wait for active agents up to agent.restart_drain_timeout, and updates service-manager restart behavior so hermes gateway restart requests graceful shutdown before forcing a replacement.

Related Issue

Fixes #7286

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactor (no functional changes, no api changes)
  • Performance improvement
  • Test improvement
  • Chore (build, tooling, dependencies)

Changes Made

  • add agent.restart_drain_timeout config/defaults and expose drain/restart state through gateway runtime status
  • add gateway /restart command and drain-aware runner shutdown that preserves the current response path when possible
  • prevent drain-time follow-up messages from interrupting active work; queue them only when display.busy_input_mode: queue is set
  • update launchd and systemd restart behavior so service restarts attempt graceful drain first
  • add targeted gateway and service tests for drain, timeout, /restart, and service restart behavior

How to Test

.venv/bin/python -m pytest tests/gateway/test_gateway_shutdown.py tests/gateway/test_restart_drain.py tests/gateway/test_session_race_guard.py tests/gateway/test_session_boundary_hooks.py tests/gateway/test_queue_consumption.py tests/hermes_cli/test_gateway_service.py -q
.venv/bin/python -m pytest tests/gateway/test_runner_startup_failures.py tests/gateway/test_platform_reconnect.py -q

Platform Tested

  • macOS
  • Linux
  • Windows

Checklist

  • I have read the contributing guide
  • My commit messages follow the Conventional Commits format
  • I have checked for duplicate PRs and verified this is not already covered
  • This PR only includes changes related to the issue described above
  • I've run pytest tests/ -q and all tests pass
  • I have added or updated tests for the changes in this PR
  • I have noted which platform(s) I tested on

@kshitijk4poor kshitijk4poor left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review — #7290: fix(gateway): drain in-flight work before restart

Verdict: Comment (observations and suggestions, nothing blocking)
Tests: 4123 passed, 11 failed (all 11 pre-existing on upstream/main). 0 new failures.
Live test: N/A (gateway runtime — tested via unit tests)

Warnings

  1. gateway/run.py: _drain_active_agents status thrashing_update_runtime_status("draining") is called 4 times in this method, including inside the 0.1s poll loop. Each call does a full JSON read-modify-write cycle to disk via write_runtime_status(). Over a 60s drain timeout, that's ~600 file read + 600 file write operations. Consider throttling to once per second or only when active_agents count changes.

  2. Duplicated config parser: _load_restart_drain_timeout() vs _get_restart_drain_timeout()gateway/run.py and hermes_cli/gateway.py both parse agent.restart_drain_timeout from env/YAML independently. The run.py version hardcodes 60.0 as default in two places, while the gateway.py version correctly references DEFAULT_CONFIG["agent"]["restart_drain_timeout"]. If the default changes in config.py, the run.py version silently drifts.

  3. Photo-merge logic copied a third time_queue_or_replace_pending_event() replicates the photo-burst media merging pattern from base.py (and already duplicated at run.py:2311). This is now in 3 places. Consider extracting a shared helper.

  4. Exit code 75 is an undocumented magic number — Used in _stop_impl and matched by RestartForceExitStatus=75 in the systemd unit. The convention is sound (EX_TEMPFAIL from sysexits.h), but the number appears in 3 locations with no named constant or comment explaining why 75 was chosen.

Suggestions

  1. Shared _make_runner() test fixturetest_gateway_shutdown.py and test_restart_drain.py each have their own _make_runner() that constructs a GatewayRunner via object.__new__() with ~18 manually-set attributes. test_session_boundary_hooks.py and test_session_race_guard.py also needed to add the 9 new attributes inline. A shared fixture in tests/gateway/conftest.py would prevent this N-file maintenance burden when new attributes are added.

  2. getattr(self, "_session_model_overrides", {}) band-aid — The _session_model_overrides change in _apply_session_model_overrides is a defensive workaround for the object.__new__() test pattern. The proper fix is adding _session_model_overrides: Dict[str, Dict[str, str]] = {} to the class-level defaults alongside the other 9 attributes added in this PR. This would be consistent with the stated purpose of those class-level defaults.

  3. Test coverage gaps for new methods — The following are untested: _launch_detached_restart_command() (complex subprocess orchestration), _load_busy_input_mode() and _load_restart_drain_timeout() (config loading with env/YAML/default fallback), and the exit_code = 75 path through stop(restart=True, service_restart=True). The exit code path is particularly important since it's the mechanism that makes systemd auto-restart work.

  4. request_restart() only tested via mock — Its actual implementation (idempotency guard via _restart_task_started, background task creation) is never exercised. A test calling it twice and asserting the second returns False would verify the guard.

Looks Good

  • Core drain-before-disconnect logic is well-structured: stop()_drain_active_agents() → optional _interrupt_running_agents()_finalize_shutdown_agents() → adapter disconnect. Clean separation of concerns.
  • Re-entrancy guard on stop() via _stop_task is a good pattern — second call awaits the same task instead of double-stopping.
  • /restart command correctly registered in COMMAND_REGISTRY (gateway-only) and added to the active-session bypass list in base.py.
  • Busy-session handler integration through BasePlatformAdapter is cleanly layered — the adapter delegates to the runner's handler without coupling to gateway internals.
  • Systemd changes are solid: ExecReload=/bin/kill -USR1 $MAINPID for graceful restart, RestartForceExitStatus=75 for the exit-code restart signal, reload-or-restart instead of hard restart.
  • Launchd restart now does SIGTERM → wait-for-drain → kickstart instead of the previous atomic kill+restart. The _wait_for_gateway_exit refactor (returning bool, optional force_after) is clean.
  • restart_drain_timeout config properly added to DEFAULT_CONFIG, cli-config.yaml.example, and exposed via env var HERMES_RESTART_DRAIN_TIMEOUT.
  • All new test files pass. The PR fixes some pre-existing test failures rather than introducing any.

Comment thread gateway/run.py
Comment thread gateway/run.py Outdated
Comment thread gateway/run.py
Comment thread gateway/run.py
Comment thread gateway/run.py Outdated
@bobashopcashier

Copy link
Copy Markdown
Contributor Author

Addressed the review feedback in 78618f35.

Implemented:

  • throttled drain-time runtime status updates so _drain_active_agents() only persists on meaningful count changes plus a 1s heartbeat/final update
  • moved restart timeout parsing/defaults onto a shared helper and shared default source
  • replaced the restart exit-code magic number with a named shared constant used by both the runner and generated systemd units
  • extracted shared pending-photo merge logic and reused it from the adapter path and gateway restart/drain queue paths
  • replaced the _session_model_overrides getattr(..., {}) workaround with a consistent class-level default and removed the remaining ad-hoc fallback checks
  • added targeted coverage for busy-input / drain-timeout loading, detached restart launching, request_restart() idempotency, and the service-restart exit-code path
  • introduced a narrow shared restart/shutdown test helper for the files touched by this PR, without expanding into a broader gateway test-fixture refactor

Verification:

  • .venv/bin/python -m pytest tests/gateway/test_gateway_shutdown.py tests/gateway/test_restart_drain.py tests/gateway/test_session_boundary_hooks.py tests/gateway/test_session_race_guard.py tests/hermes_cli/test_gateway_service.py -q
  • result: 88 passed

I kept the fixture cleanup intentionally scoped to the restart/shutdown-related tests in this PR rather than turning this branch into a wider gateway test-helper consolidation.

@teknium1

Copy link
Copy Markdown
Contributor

Merged via PR #7503. Your commits were cherry-picked onto current main with your authorship preserved in git log. Added agent.close() cleanup that was missing from the refactored stop() and fixed the /restart CommandDef category. Thanks for the solid contribution, Kenny!

@teknium1 teknium1 closed this Apr 11, 2026
teknium1 added a commit that referenced this pull request Apr 13, 2026
Three problems fixed:

1. bobashopcashier missing from v0.9.0 contributor list despite
   authoring the gateway drain PR (#7290, salvaged into #7503).
   Their email (kennyx102@gmail.com) was missing from AUTHOR_MAP.

2. release.py only scanned git commit authors, missing Co-authored-by
   trailers. Now parse_coauthors() extracts trailers from commit bodies.

3. No mechanism to detect contributors from salvaged PRs (where original
   author only appears in PR description, not git log).

Changes:
- scripts/release.py: add kennyx102@gmail.com to AUTHOR_MAP, enhance
  get_commits() to parse Co-authored-by trailers, filter AI assistants
  (Claude, Copilot, Cursor Agent) from co-author lists
- scripts/contributor_audit.py: new script that cross-references git
  authors, co-author trailers, and salvaged PR descriptions. Reports
  unknown emails and contributors missing from release notes.
- RELEASE_v0.9.0.md: add bobashopcashier to community contributors

Usage:
  python scripts/contributor_audit.py --since-tag v2026.4.8
  python scripts/contributor_audit.py --since-tag v2026.4.8 --release-file RELEASE_v0.9.0.md
teknium1 added a commit that referenced this pull request Apr 13, 2026
Three problems fixed:

1. bobashopcashier missing from v0.9.0 contributor list despite
   authoring the gateway drain PR (#7290, salvaged into #7503).
   Their email (kennyx102@gmail.com) was missing from AUTHOR_MAP.

2. release.py only scanned git commit authors, missing Co-authored-by
   trailers. Now parse_coauthors() extracts trailers from commit bodies.

3. No mechanism to detect contributors from salvaged PRs (where original
   author only appears in PR description, not git log).

Changes:
- scripts/release.py: add kennyx102@gmail.com to AUTHOR_MAP, enhance
  get_commits() to parse Co-authored-by trailers, filter AI assistants
  (Claude, Copilot, Cursor Agent) from co-author lists
- scripts/contributor_audit.py: new script that cross-references git
  authors, co-author trailers, and salvaged PR descriptions. Reports
  unknown emails and contributors missing from release notes.
- RELEASE_v0.9.0.md: add bobashopcashier to community contributors

Usage:
  python scripts/contributor_audit.py --since-tag v2026.4.8
  python scripts/contributor_audit.py --since-tag v2026.4.8 --release-file RELEASE_v0.9.0.md
ulasbilgen pushed a commit to ulasbilgen/hermes-adhd-agent that referenced this pull request May 1, 2026
…arch#9264)

Three problems fixed:

1. bobashopcashier missing from v0.9.0 contributor list despite
   authoring the gateway drain PR (NousResearch#7290, salvaged into NousResearch#7503).
   Their email (kennyx102@gmail.com) was missing from AUTHOR_MAP.

2. release.py only scanned git commit authors, missing Co-authored-by
   trailers. Now parse_coauthors() extracts trailers from commit bodies.

3. No mechanism to detect contributors from salvaged PRs (where original
   author only appears in PR description, not git log).

Changes:
- scripts/release.py: add kennyx102@gmail.com to AUTHOR_MAP, enhance
  get_commits() to parse Co-authored-by trailers, filter AI assistants
  (Claude, Copilot, Cursor Agent) from co-author lists
- scripts/contributor_audit.py: new script that cross-references git
  authors, co-author trailers, and salvaged PR descriptions. Reports
  unknown emails and contributors missing from release notes.
- RELEASE_v0.9.0.md: add bobashopcashier to community contributors

Usage:
  python scripts/contributor_audit.py --since-tag v2026.4.8
  python scripts/contributor_audit.py --since-tag v2026.4.8 --release-file RELEASE_v0.9.0.md
aj-nt pushed a commit to aj-nt/hermes-agent that referenced this pull request May 1, 2026
…arch#9264)

Three problems fixed:

1. bobashopcashier missing from v0.9.0 contributor list despite
   authoring the gateway drain PR (NousResearch#7290, salvaged into NousResearch#7503).
   Their email (kennyx102@gmail.com) was missing from AUTHOR_MAP.

2. release.py only scanned git commit authors, missing Co-authored-by
   trailers. Now parse_coauthors() extracts trailers from commit bodies.

3. No mechanism to detect contributors from salvaged PRs (where original
   author only appears in PR description, not git log).

Changes:
- scripts/release.py: add kennyx102@gmail.com to AUTHOR_MAP, enhance
  get_commits() to parse Co-authored-by trailers, filter AI assistants
  (Claude, Copilot, Cursor Agent) from co-author lists
- scripts/contributor_audit.py: new script that cross-references git
  authors, co-author trailers, and salvaged PR descriptions. Reports
  unknown emails and contributors missing from release notes.
- RELEASE_v0.9.0.md: add bobashopcashier to community contributors

Usage:
  python scripts/contributor_audit.py --since-tag v2026.4.8
  python scripts/contributor_audit.py --since-tag v2026.4.8 --release-file RELEASE_v0.9.0.md
02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
…arch#9264)

Three problems fixed:

1. bobashopcashier missing from v0.9.0 contributor list despite
   authoring the gateway drain PR (NousResearch#7290, salvaged into NousResearch#7503).
   Their email (kennyx102@gmail.com) was missing from AUTHOR_MAP.

2. release.py only scanned git commit authors, missing Co-authored-by
   trailers. Now parse_coauthors() extracts trailers from commit bodies.

3. No mechanism to detect contributors from salvaged PRs (where original
   author only appears in PR description, not git log).

Changes:
- scripts/release.py: add kennyx102@gmail.com to AUTHOR_MAP, enhance
  get_commits() to parse Co-authored-by trailers, filter AI assistants
  (Claude, Copilot, Cursor Agent) from co-author lists
- scripts/contributor_audit.py: new script that cross-references git
  authors, co-author trailers, and salvaged PR descriptions. Reports
  unknown emails and contributors missing from release notes.
- RELEASE_v0.9.0.md: add bobashopcashier to community contributors

Usage:
  python scripts/contributor_audit.py --since-tag v2026.4.8
  python scripts/contributor_audit.py --since-tag v2026.4.8 --release-file RELEASE_v0.9.0.md
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
…arch#9264)

Three problems fixed:

1. bobashopcashier missing from v0.9.0 contributor list despite
   authoring the gateway drain PR (NousResearch#7290, salvaged into NousResearch#7503).
   Their email (kennyx102@gmail.com) was missing from AUTHOR_MAP.

2. release.py only scanned git commit authors, missing Co-authored-by
   trailers. Now parse_coauthors() extracts trailers from commit bodies.

3. No mechanism to detect contributors from salvaged PRs (where original
   author only appears in PR description, not git log).

Changes:
- scripts/release.py: add kennyx102@gmail.com to AUTHOR_MAP, enhance
  get_commits() to parse Co-authored-by trailers, filter AI assistants
  (Claude, Copilot, Cursor Agent) from co-author lists
- scripts/contributor_audit.py: new script that cross-references git
  authors, co-author trailers, and salvaged PR descriptions. Reports
  unknown emails and contributors missing from release notes.
- RELEASE_v0.9.0.md: add bobashopcashier to community contributors

Usage:
  python scripts/contributor_audit.py --since-tag v2026.4.8
  python scripts/contributor_audit.py --since-tag v2026.4.8 --release-file RELEASE_v0.9.0.md
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
…arch#9264)

Three problems fixed:

1. bobashopcashier missing from v0.9.0 contributor list despite
   authoring the gateway drain PR (NousResearch#7290, salvaged into NousResearch#7503).
   Their email (kennyx102@gmail.com) was missing from AUTHOR_MAP.

2. release.py only scanned git commit authors, missing Co-authored-by
   trailers. Now parse_coauthors() extracts trailers from commit bodies.

3. No mechanism to detect contributors from salvaged PRs (where original
   author only appears in PR description, not git log).

Changes:
- scripts/release.py: add kennyx102@gmail.com to AUTHOR_MAP, enhance
  get_commits() to parse Co-authored-by trailers, filter AI assistants
  (Claude, Copilot, Cursor Agent) from co-author lists
- scripts/contributor_audit.py: new script that cross-references git
  authors, co-author trailers, and salvaged PR descriptions. Reports
  unknown emails and contributors missing from release notes.
- RELEASE_v0.9.0.md: add bobashopcashier to community contributors

Usage:
  python scripts/contributor_audit.py --since-tag v2026.4.8
  python scripts/contributor_audit.py --since-tag v2026.4.8 --release-file RELEASE_v0.9.0.md
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug(gateway): busy_input_mode=queue is not honored during Discord self-restart

3 participants