Bug
hermes update auto-restarts the gateway service after pulling new code (main.py ~L3800), but only checks that systemctl restart returned 0. It does not verify that the service actually stayed running afterward.
If the new process dies on startup (e.g. transient import error, stale module cache) and systemd records the exit as clean (SIGTERM), Restart=on-failure does not trigger a retry. The gateway stays dead, with no warning, until the user notices.
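Why Restart=on-failure stays quiet here can be illustrated with a simplified model of the "clean exit" rule described in systemd.service(5). This is only a sketch: the function names are hypothetical, and it ignores timeouts, watchdog events, SuccessExitStatus= and RestartPreventExitStatus=.

```python
import signal

# Signals systemd counts as a clean termination for non-oneshot services,
# per systemd.service(5). Unit-level overrides are not modeled here.
CLEAN_SIGNALS = {signal.SIGHUP, signal.SIGINT, signal.SIGTERM, signal.SIGPIPE}


def is_clean_exit(exit_code=None, killed_by=None):
    """Return True if this termination would count as clean (simplified)."""
    if killed_by is not None:
        return killed_by in CLEAN_SIGNALS
    return exit_code == 0


def on_failure_restarts(exit_code=None, killed_by=None):
    """Simplified Restart=on-failure: restart only after an unclean exit."""
    return not is_clean_exit(exit_code, killed_by)
```

Under this model, an exit with status 0 or a death by SIGTERM never triggers a restart, while a non-zero exit status or a SIGSEGV does.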
Reproduction
Observed on Linux (systemd system-level service). After hermes update on Apr 7, the gateway restarted but died in 26ms. systemd journal:
Apr 07 05:25:15 roger systemd[1]: Started Hermes Agent Gateway
Apr 07 05:25:15 roger systemd[1]: hermes-gateway.service: Deactivated successfully.
Duration: 26ms. Service stayed dead for 2 days until manually restarted.
Root cause
In hermes_cli/main.py around line 3800:
    restart = subprocess.run(
        scope_cmd + ["restart", svc_name],
        capture_output=True, text=True, timeout=15,
    )
    if restart.returncode == 0:
        restarted_services.append(svc_name)
systemctl restart returns 0 even if the service immediately crashes after starting — it only confirms the restart command was accepted, not that the process survived.
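To see how the unit ended up, rather than only whether the restart command was accepted, the updater could query a few unit properties with systemctl show. The sketch below is an assumption, not existing Hermes code: parse_show_output and unit_state are hypothetical names, and scope_cmd is assumed to be the same systemctl prefix used in the snippet above.

```python
import subprocess


def parse_show_output(text):
    """Parse the KEY=VALUE lines printed by `systemctl show` into a dict."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip blank or malformed lines
            props[key] = value
    return props


def unit_state(scope_cmd, svc_name, run=subprocess.run):
    """Return ActiveState, SubState and Result for svc_name."""
    result = run(
        scope_cmd + ["show", svc_name, "--property=ActiveState,SubState,Result"],
        capture_output=True, text=True, timeout=5,
    )
    return parse_show_output(result.stdout)
```

A unit that is not reported as ActiveState=active shortly after a restart is a sign that the new process did not survive, even though the restart command returned 0.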
Suggested fix
Add a post-restart health check with a brief sleep + is-active verification, and retry once on failure:
    import time  # better placed with the module-level imports

    if restart.returncode == 0:
        # Give a crashing process a moment to exit before checking.
        time.sleep(2)
        verify = subprocess.run(
            scope_cmd + ["is-active", svc_name],
            capture_output=True, text=True, timeout=5,
        )
        if verify.stdout.strip() == "active":
            restarted_services.append(svc_name)
        else:
            print(f" ⚠ {svc_name} restarted but died immediately — retrying...")
            # Retry once, then verify again.
            retry = subprocess.run(
                scope_cmd + ["restart", svc_name],
                capture_output=True, text=True, timeout=15,
            )
            time.sleep(2)
            verify2 = subprocess.run(
                scope_cmd + ["is-active", svc_name],
                capture_output=True, text=True, timeout=5,
            )
            if verify2.stdout.strip() == "active":
                restarted_services.append(svc_name)
            else:
                print(f" ⚠ {svc_name} failed to stay running after update")
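The same check could also be folded into a small helper, so the restart and verify steps are not written out twice. This is a sketch under assumptions, not a drop-in patch: restart_and_verify, attempts and settle_seconds are hypothetical names, and the run and sleep parameters exist only so the logic can be exercised without a real systemd.

```python
import subprocess
import time


def restart_and_verify(scope_cmd, svc_name, run=subprocess.run,
                       sleep=time.sleep, attempts=2, settle_seconds=2.0):
    """Restart svc_name and confirm it is still active after a short wait.

    Returns True as soon as one attempt leaves the unit "active",
    False if every attempt fails.
    """
    for _ in range(attempts):
        restart = run(
            scope_cmd + ["restart", svc_name],
            capture_output=True, text=True, timeout=15,
        )
        if restart.returncode != 0:
            continue  # the restart command itself failed; try again
        sleep(settle_seconds)  # give a crashing process time to exit
        verify = run(
            scope_cmd + ["is-active", svc_name],
            capture_output=True, text=True, timeout=5,
        )
        if verify.stdout.strip() == "active":
            return True
    return False
```

The caller would append svc_name to restarted_services only when the helper returns True, and print the warning otherwise. A fixed two-second wait remains a heuristic; a service that crashes later than that would still be missed.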
Environment
- Hermes v0.8.0
- Linux (Arch), systemd system-level service
- Restart=on-failure in unit file (correct default, but insufficient for this edge case)