fix(gateway): shutdown + restart hygiene (drain timeout, false-fatal, success log) by teknium1 · Pull Request #18761 · NousResearch/hermes-agent

teknium1 · 2026-05-02T09:07:23Z

Summary

Gateway shutdowns and restarts no longer emit false-positive error or success log lines, and /restart no longer force-interrupts agents that are mid-API-call under realistic conversation loads.

Three issues from a real restart chain on 2026-05-02 (three cascading restarts in the user's gateway.log), all fixed here.

Changes

gateway/run.py — _send_restart_notification() now inspects result.success before logging. Previously logged Sent restart notification to <chat> at INFO unconditionally, even when adapter.send() returned SendResult(success=False) (e.g. Telegram 'Chat not found'). Failures now log WARNING with the underlying error.
gateway/platforms/whatsapp.py — disconnect() sets self._shutting_down = True before SIGTERMing the bridge; _check_managed_bridge_exit() returns None for returncode in (0, -2, -15) while shutting down. Previously every planned shutdown logged ERROR ... WhatsApp bridge process exited unexpectedly (code -15) plus Fatal whatsapp adapter error (whatsapp_bridge_exited) just before ✓ whatsapp disconnected. OOM-kill (137) and other abnormal exits still hit the fatal path.
hermes_cli/config.py — agent.restart_drain_timeout default 60 → 180. A real /restart on 2026-05-02 01:43:27 interrupted three agents with 82s/112s/154s in-flight API calls because the 60s budget expired. Explicit user values in config.yaml are preserved by deep-merge.
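The first change above can be sketched as follows. This is a minimal illustration, not the code in gateway/run.py: the `SendResult` dataclass is a stand-in whose `error` field is an assumption, and `log_restart_notification` is a hypothetical helper name.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger("gateway.sketch")


@dataclass
class SendResult:
    """Hypothetical stand-in for the adapter's result type."""
    success: bool
    error: Optional[str] = None  # assumed field name for the provider error


def log_restart_notification(result: SendResult, chat: str) -> bool:
    """Log INFO only when the send succeeded; otherwise log WARNING."""
    if result.success:
        logger.info("Sent restart notification to %s", chat)
        return True
    # adapter.send() returns a failed result instead of raising,
    # so the caller has to inspect it explicitly.
    logger.warning(
        "Restart notification to %s was not delivered: %s", chat, result.error
    )
    return False
```

Returning a boolean is only a convenience for the sketch; the point is that the success line is gated on `result.success`.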

Validation

| Scenario | Before | After |
|---|---|---|
| Restart notification fail | INFO 'Sent restart notification to …' (lie) | WARNING 'Restart notification to … was not delivered: Chat not found' |
| WhatsApp planned shutdown | ERROR 'bridge process exited unexpectedly' + 'Fatal whatsapp adapter error' | INFO 'Bridge exited during shutdown (code -15)' |
| WhatsApp real crash during shutdown | ERROR + fatal path | unchanged (still ERROR + fatal for returncodes outside {0,-2,-15}) |
| /restart with 3 active agents | drain timed out at 60s, all interrupted | 180s drain; typical conversations finish |
| Explicit user `restart_drain_timeout: 45` | 45 | 45 (unchanged) |

Targeted tests: 139/139 pass (tests/gateway/test_restart_notification.py, tests/gateway/test_restart_drain.py, tests/hermes_cli/test_gateway_service.py, the 4 relevant TestBridgeRuntimeFailure cases).

E2E: isolated HERMES_HOME verified all four config paths (default new install = 180, explicit user value preserved = 45, DEFAULT_CONFIG exports 180, DEFAULT_GATEWAY_RESTART_DRAIN_TIMEOUT = 180.0).
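The "explicit user value preserved" path relies on deep-merge semantics. The sketch below shows one common way such a merge behaves; `deep_merge` and the `DEFAULTS` dict are hypothetical names for illustration, not the functions in hermes_cli/config.py.

```python
from copy import deepcopy
from typing import Any, Dict

# Illustrative defaults only; the real DEFAULT_CONFIG has many more keys.
DEFAULTS: Dict[str, Any] = {"agent": {"restart_drain_timeout": 180}}


def deep_merge(defaults: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return defaults with user overrides layered on top, key by key."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Recurse so sibling defaults inside the same section survive.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Explicit user value wins over the default.
            merged[key] = deepcopy(value)
    return merged
```

With this shape, a config.yaml that sets `restart_drain_timeout: 45` keeps 45, while a config that omits the key picks up the new default of 180.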

Tests added

test_send_restart_notification_logs_warning_on_sendresult_failure — returns SendResult(success=False), asserts no INFO 'Sent restart notification' line and a WARNING with the error string.
test_send_restart_notification_logs_info_on_sendresult_success — returns SendResult(success=True), asserts INFO line is present.
test_shutdown_suppresses_fatal_on_planned_bridge_exit (parametrized over returncode in [0, -2, -15]) — _shutting_down=True + terminating returncodes → _check_managed_bridge_exit() returns None, no fatal handler fired.
test_shutdown_still_surfaces_nonzero_crash — _shutting_down=True + returncode 137 → still fatal.

The whatsapp suppression uses getattr(self, '_shutting_down', False) so existing _make_adapter() helpers that bypass __init__ (AGENTS.md pitfall #17) keep working unmodified.
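A rough sketch of the suppression logic, assuming a simplified adapter: `BridgeAdapterSketch`, `check_bridge_exit`, and the returned error string are hypothetical, and only the `_shutting_down` flag, the `getattr` default, and the returncode set come from the description above.

```python
from typing import Optional

# Exit codes treated as planned while shutting down: clean exit,
# SIGINT (-2), and SIGTERM (-15) in subprocess returncode convention.
PLANNED_EXIT_CODES = frozenset({0, -2, -15})


class BridgeAdapterSketch:
    def __init__(self) -> None:
        self._shutting_down = False

    def disconnect(self) -> None:
        # Set the flag before terminating the bridge so the exit check
        # can tell a planned stop from a crash.
        self._shutting_down = True

    def check_bridge_exit(self, returncode: Optional[int]) -> Optional[str]:
        """Return None when nothing is wrong, else a fatal error message."""
        if returncode is None:
            return None  # bridge is still running
        # getattr with a default keeps instances built without __init__ working.
        if getattr(self, "_shutting_down", False) and returncode in PLANNED_EXIT_CODES:
            return None
        return f"bridge process exited unexpectedly (code {returncode})"
```

An instance created with `BridgeAdapterSketch.__new__(BridgeAdapterSketch)` has no `_shutting_down` attribute, so the check falls back to `False` and still reports the exit as fatal.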

…ettings

Regression from the silent config→env bridge. The bridge at module import time is correct for max_turns (unconditional overwrite), but every other agent.*, display.*, timezone, and security bridge key was guarded by 'if X not in os.environ', so a stale .env entry from an old 'hermes setup' run would shadow the user's current config.yaml indefinitely.

Symptom: agent.max_turns: 500 in config.yaml, HERMES_MAX_ITERATIONS=60 in .env from an old setup, and the gateway silently capped at 60 iterations per turn. Gateway logs confirmed api_calls never exceeded 60.

Three changes:

1. gateway/run.py: drop the 'not in os.environ' guards for all agent.*, display.*, timezone, and security.* bridge keys. config.yaml is now authoritative for these settings, with the same semantics already in place for max_turns, terminal.*, and auxiliary.*. Also surface the bridge failure (previously 'except Exception: pass') to stderr so operators see bridge errors instead of silently falling back to .env.
2. gateway/run.py: INFO-log the resolved max_iterations at gateway start so operators can verify the config→env bridge did the right thing instead of chasing a phantom budget ceiling.
3. hermes_cli/setup.py: stop writing HERMES_MAX_ITERATIONS to .env in the setup wizard. config.yaml is the single source of truth. Also clean up any stale .env entry left behind by pre-fix setups.

Regression tests in tests/gateway/test_config_env_bridge_authority.py guard each config→env key against the 'stale .env shadows config' bug.
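The difference between a guarded and an authoritative bridge can be sketched like this. Only the agent.max_turns to HERMES_MAX_ITERATIONS pairing is taken from the commit message; `BRIDGE_KEYS` and `bridge_config_to_env` are hypothetical names, and the real bridge in gateway/run.py covers more keys.

```python
from typing import Any, Mapping, MutableMapping, Tuple

# Illustrative mapping from a config path to the env var it feeds.
BRIDGE_KEYS: Mapping[Tuple[str, ...], str] = {
    ("agent", "max_turns"): "HERMES_MAX_ITERATIONS",
}


def bridge_config_to_env(config: Mapping[str, Any], env: MutableMapping[str, str]) -> None:
    """Copy configured values into env, overwriting any stale entries."""
    for path, env_name in BRIDGE_KEYS.items():
        node: Any = config
        for part in path:
            if not isinstance(node, Mapping) or part not in node:
                node = None
                break
            node = node[part]
        if node is not None:
            # No 'if env_name not in env' guard: the config value wins.
            env[env_name] = str(node)
```

Passing `os.environ` as `env` would apply the values to the process environment; keys absent from the config leave the existing environment untouched.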

… success log)

Three issues observed in production gateway.log during a rapid restart chain on 2026-05-02, all fixed here.

1. _send_restart_notification logged unconditional success

adapter.send() catches provider errors (e.g. Telegram 'Chat not found') and returns SendResult(success=False); it never raises. The caller ignored the return value and always logged 'Sent restart notification to <chat>' at INFO, producing a misleading success line directly below the 'Failed to send Telegram message' traceback on every boot. Now inspects result.success and logs WARNING with the error otherwise.

2. WhatsApp bridge SIGTERM on shutdown classified as fatal error

_check_managed_bridge_exit() saw the bridge's returncode -15 (our own SIGTERM from disconnect()) and fired the full fatal-error path, producing 'ERROR ... WhatsApp bridge process exited unexpectedly' plus 'Fatal whatsapp adapter error (whatsapp_bridge_exited)' on every planned shutdown, immediately before the normal '✓ whatsapp disconnected'. Adds a _shutting_down flag that disconnect() sets before the terminate, and _check_managed_bridge_exit() returns None for returncode in {0, -2, -15} while shutting down. OOM-kill (137) and other non-signal exits still hit the fatal path.

3. restart_drain_timeout default 60s → 180s

On 2026-05-02 01:43:27 a user /restart fired while three agents were mid-API-call (82s, 112s, 154s into their turns). The 60s drain budget expired and all three were force-interrupted. 180s covers realistic in-flight agent turns; users on very-long-reasoning models can still raise it further via agent.restart_drain_timeout in config.yaml. Existing explicit user values are preserved by deep-merge.

Tests

- tests/gateway/test_restart_notification.py: two new tests assert INFO is only logged on SendResult(success=True) and WARNING with the error string is logged on SendResult(success=False).
- tests/gateway/test_whatsapp_connect.py: parametrized test for returncode in {0, -2, -15} proves shutdown-time exits are suppressed; separate test proves returncode 137 (SIGKILL/OOM) still surfaces as fatal even when _shutting_down is set.
- _check_managed_bridge_exit() reads _shutting_down via getattr-with-default so existing _make_adapter() test helpers that bypass __init__ (pitfall #17 in AGENTS.md) keep working unmodified.
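To illustrate why the drain budget matters for item 3, here is a generic asyncio sketch of "wait up to a timeout, then interrupt whatever is left". It assumes in-flight agent turns are represented as asyncio tasks, which the commit message does not confirm; `drain_or_interrupt` is a hypothetical name and not the gateway's actual drain routine.

```python
import asyncio
from typing import Iterable, Set


async def drain_or_interrupt(tasks: Iterable[asyncio.Task], timeout: float) -> Set[asyncio.Task]:
    """Wait up to `timeout` seconds, then cancel tasks that are still running.

    Returns the set of tasks that had to be interrupted.
    """
    pending_tasks = set(tasks)
    if not pending_tasks:
        return set()  # asyncio.wait() rejects an empty collection
    _done, pending = await asyncio.wait(pending_tasks, timeout=timeout)
    for task in pending:
        task.cancel()
    if pending:
        # Let the cancellations settle without raising CancelledError here.
        await asyncio.gather(*pending, return_exceptions=True)
    return pending
```

Under this model, turns that were 82s, 112s, and 154s in flight may outlast a 60s budget and be cancelled, whereas a 180s budget gives them more room to finish.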

github-actions · 2026-05-02T09:07:46Z

🚨 CRITICAL Supply Chain Risk Detected

This PR contains a pattern that has been used in real supply chain attacks. A maintainer must review the flagged code carefully before merging.

🚨 CRITICAL: Install-hook file added or modified

These files can execute code during package installation or interpreter startup.

Files:

hermes_cli/setup.py

Scanner only fires on high-signal indicators: .pth files, base64+exec/eval combos, subprocess with encoded commands, or install-hook files. Low-signal warnings were removed intentionally — if you're seeing this comment, the finding is worth inspecting.

… success log) (NousResearch#18761)

* fix(gateway): config.yaml wins over .env for agent/display/timezone settings
* fix(gateway): shutdown + restart hygiene (drain timeout, false-fatal, success log)

teknium1 added 2 commits May 2, 2026 01:48

teknium1 merged commit 1dce908 into main May 2, 2026
8 of 10 checks passed

teknium1 deleted the hermes/hermes-6bbfa865 branch May 2, 2026 09:08

alt-glitch added the type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), comp/gateway (Gateway runner, session dispatch, delivery), and platform/whatsapp (WhatsApp Business adapter) labels May 2, 2026

This was referenced May 3, 2026

fix(gateway): preserve home-channel thread targets across restart notifications (salvage #18440) #19271

Merged

fix(gateway): preserve home-channel thread targets across restart notifications #18440

Closed

BrewTestBot mentioned this pull request May 7, 2026

hermes-agent 2026.5.7 Homebrew/homebrew-core#281437

Merged

1 task

github-actions Bot mentioned this pull request May 8, 2026

chore: bump NousResearch/hermes-agent version from v2026.4.30 to v2026.5.7 Docker-Hub-sirmark/docker-hermes-agent#5

Merged

hoobnn mentioned this pull request May 29, 2026

docs(config): sync stale defaults in cli-config.yaml.example (restart_drain_timeout, streaming) #34627

Open
