What happened
After a gateway restart (SIGTERM → drain → exit code 1), systemd failed to clean up the cgroup because an orphan adb process was still running inside it. This caused a 6-minute delay before Restart=always could bring the gateway back, leaving all platforms (Telegram, Discord, WhatsApp) and cron jobs completely dead.
Journal evidence
Jun 02 09:28:37 systemd[1983]: hermes-gateway.service: Main process exited, code=exited, status=1/FAILURE
Jun 02 09:28:37 systemd[1983]: hermes-gateway.service: Failed to kill control group /user.slice/.../hermes-gateway.service, ignoring: Invalid argument
Jun 02 09:28:37 systemd[1983]: hermes-gateway.service: Unit process 42104 (adb) remains running after unit stopped.
Jun 02 09:28:37 systemd[1983]: Stopped hermes-gateway.service
Jun 02 09:34:48 systemd[2039]: Started hermes-gateway.service ← 6 min 11 sec later
Root cause chain
- Gateway spawns processes during normal operation (terminal tool subprocesses, platform bridges, Android debug bridge, etc.)
- With KillMode=mixed, systemd sends SIGTERM only to the main PID; child processes get no SIGTERM and are only sent SIGKILL by systemd at the end of the stop sequence
- On shutdown, the gateway cleans up most subprocesses, but an adb process remained
- systemd tried to kill the cgroup but got Invalid argument, likely because the process was in an uninterruptible state or had already been re-parented
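- systemd then appears to have entered some retry/recovery state: the journal excerpt shows no activity until the unit was started 6 minutes 11 seconds later, by a different systemd PID (2039 instead of 1983)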
- During this entire window, Restart=always could not restart the service
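To confirm which processes are still sitting in the unit's cgroup after a stop, something like the following could be used. This is only a diagnostic sketch: it assumes a cgroup v2 (unified) hierarchy, the function name leftover_processes and its proc_root parameter are made up for illustration, and the cgroup directory has to be supplied by the caller because the full path is truncated in the journal output above.

```python
from pathlib import Path


def leftover_processes(cgroup_dir, proc_root="/proc"):
    """Return (pid, command name) pairs for every process still in a cgroup.

    cgroup_dir is the unit's directory under the cgroup v2 mount; the exact
    path depends on the host and is NOT taken from the report.
    """
    # cgroup v2 lists member PIDs one per line in cgroup.procs.
    procs_file = Path(cgroup_dir) / "cgroup.procs"
    leftovers = []
    for token in procs_file.read_text().split():
        pid = int(token)
        comm_file = Path(proc_root) / str(pid) / "comm"
        try:
            name = comm_file.read_text().strip()
        except OSError:
            name = "?"  # process exited between the two reads
        leftovers.append((pid, name))
    return leftovers
```

An empty list means the cgroup is clean; an entry such as (42104, "adb") would correspond to the "remains running after unit stopped" line in the journal.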
Impact
- 6+ minutes of complete Hermes outage across all platforms
- All cron jobs missed their windows during the outage
- User had to manually reboot the machine to recover
Environment
- Hermes Agent v0.15.1
- systemd user service: KillMode=mixed, Restart=always, TimeoutStopSec=90
- Linux 6.8.0-124-generic
Suggested fix
- Change to KillMode=control-group so systemd signals every process in the cgroup on stop, preventing orphan processes (KillMode=process would not help here: it signals only the main process)
- Or add a pre-stop cleanup that explicitly kills known orphan-prone processes (adb, node bridges, etc.)
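- Or add an ExecStopPost= cleanup such as ExecStopPost=-/usr/bin/pkill -P $$ to ensure all children are cleaned up (as written this needs adjusting: systemd does not run the command through a shell and treats $$ as a literal $, so no parent PID is passed)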
- The "Skipping .clean_shutdown marker" logic should also explicitly reap remaining subprocesses before exit
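As a rough illustration of the last point, explicit reaping before exit could look like the sketch below. It is not the gateway's actual shutdown code: reap_children and grace_seconds are hypothetical names, and it assumes the gateway keeps the subprocess.Popen handle of every child it spawns.

```python
import subprocess
import time


def reap_children(children, grace_seconds=5.0):
    """Terminate every tracked child, escalating to SIGKILL after a grace period.

    children is an iterable of subprocess.Popen objects (assumed to be tracked
    by the caller). Returns the PIDs that had to be force-killed.
    """
    children = list(children)

    # First pass: politely ask every still-running child to exit.
    for child in children:
        if child.poll() is None:
            child.terminate()  # SIGTERM on POSIX

    # Second pass: wait against one shared deadline, then force-kill stragglers.
    deadline = time.monotonic() + grace_seconds
    force_killed = []
    for child in children:
        remaining = max(0.0, deadline - time.monotonic())
        try:
            child.wait(timeout=remaining)
        except subprocess.TimeoutExpired:
            child.kill()  # SIGKILL on POSIX
            child.wait()  # collect the exit status so no zombie is left
            force_killed.append(child.pid)
    return force_killed
```

This only covers direct children. A helper that daemonizes itself, as the adb server does, is no longer reachable through its Popen handle and would need its own shutdown command (for adb, adb kill-server) or the cgroup-wide kill from KillMode=control-group.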