fix: harden gateway systemd restart behavior#13689

Closed
camaragon wants to merge 3 commits into
NousResearch:mainfrom
camaragon:fix/gateway-systemd-restart-hardening

Conversation

@camaragon camaragon commented Apr 21, 2026

Contributor

Summary

  • add systemd PID-file cleanup hooks for generated gateway units
  • switch generated gateway units from KillMode=mixed to KillMode=control-group
  • align tests and restart-related comments with the hardened behavior
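As a rough illustration of the two unit changes listed above, the sketch below renders a unit file that clears the PID file before start and after stop and uses KillMode=control-group. The function name render_gateway_unit, the ExecStart command, and the PID-file path are hypothetical; the PR's actual generator and the exact hook directives it emits are not shown in this conversation.

```python
def render_gateway_unit(exec_start: str, pid_file: str) -> str:
    """Render an illustrative systemd unit with PID-file cleanup hooks.

    Hypothetical helper: not the generator used by hermes_cli.
    Assumes pid_file contains no whitespace (no quoting is applied).
    """
    # A leading "-" tells systemd to ignore a non-zero exit from the command.
    cleanup = f"-/bin/rm -f {pid_file}"
    lines = [
        "[Unit]",
        "Description=Gateway service (illustrative)",
        "",
        "[Service]",
        f"ExecStartPre={cleanup}",   # drop a stale PID file before starting
        f"ExecStart={exec_start}",
        f"ExecStopPost={cleanup}",   # drop the PID file after any stop
        "KillMode=control-group",    # signal every process in the unit's cgroup
        "",
        "[Install]",
        "WantedBy=default.target",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Both arguments are placeholder values for demonstration.
    print(render_gateway_unit("/usr/bin/hermes gateway run", "/tmp/gateway.pid"))
```

With KillMode=control-group, systemd sends the stop signal to all remaining processes in the unit's control group, whereas KillMode=mixed sends SIGTERM only to the main process and reserves SIGKILL for the rest.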

Why

This improves gateway restart reliability for Hermes users running under systemd. In live debugging, stale gateway.pid files and brittle cgroup restart behavior caused restart races and stop-timeout loops. The generated units should be more defensive by default.
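To make the stale gateway.pid problem concrete, here is a minimal sketch of how a PID file can be checked for staleness on a POSIX system. The helper name pid_file_is_stale is hypothetical and this is not necessarily how Hermes checks its PID file; it only shows the common "signal 0" probe.

```python
import os
from pathlib import Path


def pid_file_is_stale(pid_file: Path) -> bool:
    """Return True if pid_file exists but does not name a live process.

    Hypothetical helper for illustration. POSIX only: on Windows,
    os.kill(pid, 0) does not act as a harmless existence probe.
    """
    try:
        text = pid_file.read_text().strip()
    except FileNotFoundError:
        return False  # no PID file at all, so nothing is stale

    try:
        pid = int(text)
    except ValueError:
        return True  # unparseable contents cannot refer to a process

    if pid <= 0:
        return True  # 0 and negative values address process groups, not a PID

    try:
        os.kill(pid, 0)  # signal 0 checks existence without delivering a signal
    except ProcessLookupError:
        return True  # no such process: the file is left over from an old run
    except PermissionError:
        return False  # process exists but belongs to another user
    return False
```

A live PID does not prove the process is the gateway, since PIDs can be reused, which is one reason removing the file from unit hooks is more robust than relying on the check alone.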

Testing

  • python -m pytest tests/hermes_cli/test_gateway_service.py tests/hermes_cli/test_update_gateway_restart.py tests/gateway/test_gateway_shutdown.py tests/gateway/test_restart_notification.py tests/gateway/test_restart_drain.py -q
  • result: 164 passed

Notes

  • full test suite currently has unrelated pre-existing failures outside this diff; targeted gateway/service regressions for this change are green.

@alt-glitch alt-glitch added the type/bug, comp/gateway, and comp/cli labels Apr 21, 2026
@camaragon camaragon force-pushed the fix/gateway-systemd-restart-hardening branch 7 times, most recently from d481b0f to 8646456 Compare April 24, 2026 07:39
@camaragon

Contributor Author

Branch refreshed onto latest main. Local verification passed: venv/bin/python -m pytest -q tests/hermes_cli/test_gateway_service.py tests/hermes_cli/test_update_gateway_restart.py. GitHub Actions on the new head 8646456 still show action_required with zero jobs across workflows, so the remaining blocker looks like a fork-run approval/policy gate rather than a code failure.

@camaragon camaragon force-pushed the fix/gateway-systemd-restart-hardening branch 2 times, most recently from 1b5bfc4 to aa544a1 Compare April 24, 2026 11:54
@camaragon

Contributor Author

Branch refreshed onto latest main again. Local targeted verification passed on new head aa544a1: venv/bin/python -m pytest -q tests/hermes_cli/test_gateway_service.py tests/hermes_cli/test_update_gateway_restart.py (146 passed). PR is now behind_by=0; GitHub Actions check runs have started on the refreshed head, so CI is now actually running rather than sitting behind the earlier approval gate with no jobs.

@camaragon camaragon force-pushed the fix/gateway-systemd-restart-hardening branch from aa544a1 to 58fa8c0 Compare April 24, 2026 14:05
@camaragon

Contributor Author

Branch refreshed onto latest main again. Local targeted verification passed on new head 58fa8c0: .venv/bin/python -m pytest -q tests/hermes_cli/test_gateway_service.py tests/hermes_cli/test_update_gateway_restart.py (146 passed). PR compare is now behind_by=0; fresh Actions runs have started on the refreshed head.

@camaragon camaragon force-pushed the fix/gateway-systemd-restart-hardening branch from 58fa8c0 to 013a5f1 Compare April 24, 2026 16:13
@camaragon

Contributor Author

Branch refreshed onto latest main again. Local targeted verification passed on new head 013a5f1: .venv/bin/python -m pytest -q tests/hermes_cli/test_gateway_service.py tests/hermes_cli/test_update_gateway_restart.py (146 passed). PR compare is now behind_by=0; fresh Actions runs are in progress on the refreshed head.

@camaragon camaragon force-pushed the fix/gateway-systemd-restart-hardening branch 2 times, most recently from 3a088f1 to 8d922ed Compare April 24, 2026 20:30
@camaragon

Contributor Author

Refreshed branch onto latest main again: 8d922ed7

Local targeted verification on refreshed head:
pytest -q -n 0 tests/hermes_cli/test_gateway_service.py tests/hermes_cli/test_update_gateway_restart.py
146 passed

Current CI is still blocked by repo-level failures outside this PR diff (test_custom_provider_model_switch, test_pty_bridge, test_hindsight_provider, test_web_server, test_ctx_halving_fix), plus the Nix error "Unable to authenticate to FlakeHub". No PR-local fix was applied beyond the refresh.

@camaragon camaragon force-pushed the fix/gateway-systemd-restart-hardening branch from 8d922ed to 8d2b502 Compare April 25, 2026 00:52
@camaragon

Contributor Author

Refreshed branch onto current main and force-pushed latest head 8d2b502.

Local verification on refreshed branch:

  • uv run python -m pytest -q tests/hermes_cli/test_gateway_service.py tests/hermes_cli/test_update_gateway_restart.py tests/gateway/test_gateway_shutdown.py tests/gateway/test_restart_notification.py tests/gateway/test_restart_drain.py
  • result: 177 passed

GitHub Actions reran on new head; status still pending at time of refresh.

@camaragon camaragon force-pushed the fix/gateway-systemd-restart-hardening branch from 8d2b502 to 7728354 Compare April 25, 2026 03:02
@camaragon

Contributor Author

Maintenance refresh: branch refreshed onto latest main and force-pushed at 7728354. Local targeted verification passed with uv run pytest tests/hermes_cli/test_gateway_service.py tests/hermes_cli/test_update_gateway_restart.py tests/gateway/test_gateway_shutdown.py tests/gateway/test_restart_notification.py tests/gateway/test_restart_drain.py (177 passed). Compare is now behind_by=0. Fresh Actions started on the new head. At the time of the refresh check the PR test lane was still in progress, so any failures could not yet be attributed to this PR; the latest Tests run on main was already red.

@camaragon camaragon force-pushed the fix/gateway-systemd-restart-hardening branch from 7728354 to 5a76cec Compare April 25, 2026 05:15
@camaragon

Contributor Author

Maintenance refresh: rebased onto current main and force-pushed head 5a76cec.

Local targeted verification passed:
uv run python -m pytest -q tests/hermes_cli/test_gateway_service.py tests/hermes_cli/test_update_gateway_restart.py tests/gateway/test_gateway_shutdown.py tests/gateway/test_restart_notification.py tests/gateway/test_restart_drain.py
177 passed

Fresh Actions have started on the new head; compare is now behind_by=0.

@camaragon camaragon force-pushed the fix/gateway-systemd-restart-hardening branch 2 times, most recently from 1a49861 to 0131a18 Compare April 25, 2026 13:40
@camaragon camaragon force-pushed the fix/gateway-systemd-restart-hardening branch 26 times, most recently from 7059599 to da8ad20 Compare May 3, 2026 18:10

Labels

  • comp/cli: CLI entry point, hermes_cli/, setup wizard
  • comp/gateway: Gateway runner, session dispatch, delivery
  • type/bug: Something isn't working

2 participants