gateway: orphan processes in cgroup block systemd restart for 6+ minutes #37454

@brian-doherty

What happened

After a gateway restart (SIGTERM → drain → exit code 1), systemd failed to clean up the cgroup because an orphan adb process was still running inside it. This caused a 6-minute delay before Restart=always could bring the gateway back, leaving all platforms (Telegram, Discord, WhatsApp) and cron jobs completely dead.

Journal evidence

Jun 02 09:28:37 systemd[1983]: hermes-gateway.service: Main process exited, code=exited, status=1/FAILURE
Jun 02 09:28:37 systemd[1983]: hermes-gateway.service: Failed to kill control group /user.slice/.../hermes-gateway.service, ignoring: Invalid argument
Jun 02 09:28:37 systemd[1983]: hermes-gateway.service: Unit process 42104 (adb) remains running after unit stopped.
Jun 02 09:28:37 systemd[1983]: Stopped hermes-gateway.service
Jun 02 09:34:48 systemd[2039]: Started hermes-gateway.service  ← 6 min 11 sec later
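
The length of the gap can be checked directly from the two journal timestamps. A minimal sketch follows; the year is absent from the log lines, so `ASSUMED_YEAR` is an arbitrary placeholder used only to make the strings parseable.

```python
from datetime import datetime

# The journal lines carry no year, so this value is an assumption that
# exists only so strptime can parse them; it does not affect the difference.
ASSUMED_YEAR = 2000
FMT = "%Y %b %d %H:%M:%S"


def parse(ts):
    """Parse a journal-style timestamp such as 'Jun 02 09:28:37'."""
    return datetime.strptime(f"{ASSUMED_YEAR} {ts}", FMT)


stopped = parse("Jun 02 09:28:37")  # "Stopped hermes-gateway.service"
started = parse("Jun 02 09:34:48")  # "Started hermes-gateway.service"

gap = started - stopped
minutes, seconds = divmod(int(gap.total_seconds()), 60)
print(f"{minutes} min {seconds} sec")
```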

Root cause chain

  1. Gateway spawns processes during normal operation (terminal tool subprocesses, platform bridges, Android debug bridge, etc.)
  2. KillMode=mixed sends SIGTERM only to the main PID; child processes get no SIGTERM and rely on the follow-up SIGKILL that systemd sends to the rest of the cgroup
  3. On shutdown, the gateway cleans up most subprocesses but an adb process remained
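  4. systemd tried to kill the cgroup but the kill failed with Invalid argument. The cause is unconfirmed; one guess is that the process was in an uninterruptible state (re-parenting alone would not move it out of the cgroup)
  5. What systemd did during the next 6 minutes 11 seconds is not visible in the excerpt above. The manager PID changes from systemd[1983] to systemd[2039] across the gap, which may indicate that a new user manager instance started the service (for example after the reboot noted under Impact)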
  6. During this entire window, Restart=always could not restart the service
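
To confirm which processes survive in the unit's cgroup after the main process exits, the cgroup's `cgroup.procs` file can be read directly. The following is a sketch only, assuming a cgroup v2 hierarchy; `list_cgroup_processes` is a hypothetical helper name, and the cgroup directory must be supplied by the caller because the full path is truncated in the journal output above.

```python
from pathlib import Path


def list_cgroup_processes(cgroup_dir, proc_root="/proc"):
    """Return (pid, name) pairs for every process listed in a cgroup v2 directory.

    cgroup_dir is the unit's directory under the cgroup filesystem; it is a
    parameter here because the real path is not known from the report.
    """
    procs = []
    for token in (Path(cgroup_dir) / "cgroup.procs").read_text().split():
        pid = int(token)
        try:
            # /proc/<pid>/comm holds the short command name, e.g. "adb".
            name = (Path(proc_root) / str(pid) / "comm").read_text().strip()
        except OSError:
            name = "?"  # the process exited between the two reads
        procs.append((pid, name))
    return procs
```

Running this against the service's cgroup directory right after a stop would show whether adb (or anything else) is still a member, independent of what its parent PID has become.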

Impact

  • 6+ minutes of complete Hermes outage across all platforms
  • All cron jobs missed their windows during the outage
  • User had to manually reboot the machine to recover

Environment

  • Hermes Agent v0.15.1
  • systemd user service: KillMode=mixed, Restart=always, TimeoutStopSec=90
  • Linux 6.8.0-124-generic

Suggested fix

  1. Change to KillMode=control-group so systemd signals every process in the cgroup on stop, preventing orphan processes. KillMode=process would not help here: it signals only the main process and leaves the children running
  2. Or add a pre-stop cleanup that explicitly kills known orphan-prone processes (adb, node bridges, etc.)
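  3. Or add an ExecStopPost= cleanup command to ensure all children are cleaned up. The first idea, ExecStopPost=-/usr/bin/pkill -P $$, would not work as written: systemd does not run the command through a shell and treats $$ as a literal $, and by the time ExecStopPost= runs the main process has exited, so its former children no longer have it as their parent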
  4. The "Skipping .clean_shutdown marker" shutdown logic should also explicitly reap any remaining subprocesses before exit
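
Item 4 could be approached by starting each helper in its own process group and signalling the whole group at shutdown. This is a minimal sketch under stated assumptions: it assumes the gateway spawns helpers from Python, which the report does not confirm, and `spawn_in_own_group` and `reap_group` are hypothetical names rather than existing gateway functions.

```python
import os
import signal
import subprocess


def spawn_in_own_group(argv):
    """Start a helper as leader of a new session, and so of a new process group."""
    return subprocess.Popen(argv, start_new_session=True)


def reap_group(proc, grace_seconds=5.0):
    """SIGTERM the helper's whole process group, then SIGKILL what is left."""
    try:
        os.killpg(proc.pid, signal.SIGTERM)
    except ProcessLookupError:
        pass  # the group is already gone
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        pass  # leader ignored SIGTERM; fall through to SIGKILL
    try:
        # Catches the leader if it is still alive, plus any group members
        # that outlived it.
        os.killpg(proc.pid, signal.SIGKILL)
    except ProcessLookupError:
        pass
    proc.wait()
    return proc.returncode
```

A helper that daemonizes into its own session escapes this group. If adb's background server behaves that way (an assumption, not verified here), an explicit `adb kill-server` or a cgroup-level kill would still be needed alongside this.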

Metadata

    Labels

    P1 (High: major feature broken, no workaround), area/config (Config system, migrations, profiles), comp/gateway (Gateway runner, session dispatch, delivery), type/bug (Something isn't working)
