
fix(gateway): use --no-block for systemctl restart to avoid D-Bus hang#30297

Open
konsisumer wants to merge 1 commit into
NousResearch:main from
konsisumer:fix/gateway-restart-dbus-hang


@konsisumer

Contributor

What changed and why

Calling systemctl [--user] restart without --no-block makes systemctl wait synchronously, over D-Bus, for the restart job to finish. When the D-Bus session bus loses its connection during the restart handoff, that wait never completes and the command blocks indefinitely. This reproduces reliably on Ubuntu 24.04 ARM64 with a Linuxbrew-installed Hermes user unit (issue #29421).

Per the reporter's diagnostics on 2026-05-21: both hermes gateway restart and the raw systemctl --user restart hermes-gateway hang, while systemctl --user --no-block restart hermes-gateway returns immediately (exit 0), pointing squarely at the D-Bus job-wait.
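For anyone re-running the reporter's comparison, here is a minimal Python sketch (not part of this PR; probe and compare_restart_modes are hypothetical helpers) that runs the blocking and --no-block invocations under a hard timeout and classifies each result as ok, a non-zero exit, a missing binary, or a hang. Calling compare_restart_modes() really restarts the unit, twice.

```python
import subprocess


def probe(argv: list[str], timeout: float) -> str:
    """Run argv once and classify it as 'ok', 'exit <code>', 'missing' or 'hang'."""
    try:
        result = subprocess.run(argv, capture_output=True, timeout=timeout)
    except FileNotFoundError:
        return "missing"  # e.g. systemctl is not on PATH
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising TimeoutExpired.
        return "hang"
    return "ok" if result.returncode == 0 else f"exit {result.returncode}"


def compare_restart_modes(unit: str = "hermes-gateway", timeout: float = 30.0) -> dict[str, str]:
    """Hypothetical helper: blocking vs. --no-block restart of a systemd user unit."""
    variants = {
        "blocking": ["systemctl", "--user", "restart", unit],
        "no-block": ["systemctl", "--user", "--no-block", "restart", unit],
    }
    return {name: probe(argv, timeout) for name, argv in variants.items()}
```

On an affected host the reporter's observations would correspond to {"blocking": "hang", "no-block": "ok"}; the timeout keeps the diagnostic itself from hanging.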

Added --no-block to every systemctl restart invocation inside systemd_restart() (three call sites: post-graceful-drain, forced-restart fallback, and cold-start), so systemctl returns as soon as the restart job is queued. The existing _wait_for_systemd_service_restart() poll loop already waits for the new PID and prints progress messages, so operator-visible behaviour is unchanged: the command still waits for the restart, but by polling rather than through the D-Bus job-wait.
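A minimal sketch of the queue-then-poll pattern described above. It is not the code in hermes_cli/gateway.py: restart_argv, main_pid and wait_for_new_pid are hypothetical stand-ins for systemd_restart() and _wait_for_systemd_service_restart(), and reading MainPID via systemctl show is an assumption about how the new PID could be detected.

```python
import subprocess
import time
from typing import Callable, Optional


def restart_argv(unit: str, user: bool = True) -> list[str]:
    """Hypothetical helper: build a non-blocking systemctl restart command line."""
    argv = ["systemctl"]
    if user:
        argv.append("--user")
    # --no-block: verify and enqueue the job, then return without waiting on it.
    argv += ["--no-block", "restart", unit]
    return argv


def main_pid(unit: str, user: bool = True) -> int:
    """Read the unit's MainPID; systemd reports 0 while no main process runs."""
    argv = ["systemctl"] + (["--user"] if user else []) + ["show", "-p", "MainPID", "--value", unit]
    out = subprocess.run(argv, capture_output=True, text=True, timeout=5, check=True).stdout
    return int(out.strip() or 0)


def wait_for_new_pid(
    get_pid: Callable[[], int],
    old_pid: int,
    timeout: float = 30.0,
    interval: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Poll until a non-zero PID different from old_pid appears; None on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        pid = get_pid()
        if pid and pid != old_pid:
            return pid
        if time.monotonic() >= deadline:
            return None
        sleep(interval)
```

Hypothetical usage: old = main_pid("hermes-gateway"), then subprocess.run(restart_argv("hermes-gateway"), check=True, timeout=10), then wait_for_new_pid(lambda: main_pid("hermes-gateway"), old). Injecting get_pid and sleep keeps the polling loop testable without systemd.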

How to test

  1. Install Hermes as a systemd user unit on Ubuntu 24.04 ARM64 (or any host that exhibits the D-Bus session bus drop during restart).
  2. Run hermes gateway restart — it should complete and print ✓ User service restarted (PID N) instead of hanging.
  3. Alternatively, run strace -f -e trace=sendmsg,recvmsg hermes gateway restart (-f is needed so strace follows the systemctl child process) and confirm that the systemctl child exits promptly instead of blocking on D-Bus receives.
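Step 2 can also be scripted as a smoke test with a hard deadline, so a regression of the hang fails the check instead of leaving a stuck terminal. A sketch, assuming the CLI prints the success line from step 2 verbatim (no ANSI colour codes inside it) to stdout or stderr; restarted_pid and smoke_test_restart are hypothetical names.

```python
import re
import subprocess
from typing import Optional

# Success line quoted in step 2; assumed to be printed verbatim.
RESTARTED = re.compile(r"✓ User service restarted \(PID (\d+)\)")


def restarted_pid(output: str) -> Optional[int]:
    """Return N from a '✓ User service restarted (PID N)' line, or None if absent."""
    match = RESTARTED.search(output)
    return int(match.group(1)) if match else None


def smoke_test_restart(timeout: float = 60.0) -> int:
    """Run hermes gateway restart with a deadline and return the new PID."""
    try:
        proc = subprocess.run(
            ["hermes", "gateway", "restart"],
            capture_output=True,
            encoding="utf-8",
            errors="replace",
            timeout=timeout,
        )
    except subprocess.TimeoutExpired as exc:
        raise AssertionError(f"restart still blocked after {timeout}s") from exc
    pid = restarted_pid(proc.stdout + proc.stderr)
    if proc.returncode != 0 or pid is None:
        raise AssertionError(f"unexpected result (exit {proc.returncode}):\n{proc.stdout}{proc.stderr}")
    return pid
```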

What platforms tested on

  • macOS 14 (unit tests only; systemd tests skipped due to platform)
  • CI (Linux): existing test suite passes; 5 pre-existing macOS-only systemd preflight failures are unrelated and unchanged from main

Fixes #29421

@alt-glitch added the type/bug, comp/cli, comp/gateway, and P2 labels on May 22, 2026
@konsisumer

Contributor Author

Rebased onto current origin/main to resolve a merge conflict caused by a stale base: upstream had reformatted the third systemctl restart call site in hermes_cli/gateway.py. Re-applied --no-block at all three restart call sites, preserving upstream's formatting. The diff is limited to hermes_cli/gateway.py and tests/hermes_cli/test_gateway_service.py; ruff lint and the Windows-footgun check pass.
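Because the conflict came from upstream reshaping a call site, a guard like the following could catch a future edit that drops --no-block again. It is a sketch, not the contents of tests/hermes_cli/test_gateway_service.py: blocking_restarts is a hypothetical helper that inspects the call_args_list of a mocked subprocess.run, and it assumes commands are passed as argv lists rather than shell strings.

```python
import os
from unittest import mock


def blocking_restarts(calls) -> list[list[str]]:
    """Return each recorded systemctl ... restart argv that lacks --no-block."""
    offenders = []
    for c in calls:
        # Accept both run(argv, ...) and run(args=argv, ...) call styles.
        argv = c.args[0] if c.args else c.kwargs.get("args", [])
        if isinstance(argv, str):
            continue  # shell-string commands are outside this sketch's scope
        argv = [str(part) for part in argv]
        is_systemctl = bool(argv) and os.path.basename(argv[0]) == "systemctl"
        if is_systemctl and "restart" in argv and "--no-block" not in argv:
            offenders.append(argv)
    return offenders


def example_check() -> list[list[str]]:
    """Feed a stand-in mock three calls and report the blocking restart."""
    fake_run = mock.Mock()
    fake_run(["systemctl", "--user", "--no-block", "restart", "hermes-gateway"])
    fake_run(["/usr/bin/systemctl", "--user", "restart", "hermes-gateway"], check=True)
    fake_run(args=["systemctl", "--user", "show", "-p", "MainPID", "--value", "hermes-gateway"])
    return blocking_restarts(fake_run.call_args_list)
```

In a real test one would patch subprocess.run where the module under test looks it up (for example mock.patch("hermes_cli.gateway.subprocess.run"), which assumes gateway.py calls subprocess.run), drive each of the three restart paths, and assert that blocking_restarts(fake_run.call_args_list) == [].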

The remaining failing test check is in tests/tools/test_transcription*.py and tests/hermes_cli/test_web_server.py, neither of which this PR modifies; those are pre-existing upstream failures unrelated to this change.

@konsisumer force-pushed the fix/gateway-restart-dbus-hang branch from 4873eaf to f201148 on June 2, 2026 05:53


Development

Successfully merging this pull request may close these issues.

[Bug]: Gateway restart hangs indefinitely — D-Bus race with systemd user unit
