gateway: orphan processes in cgroup block systemd restart for 6+ minutes #37454

## What happened

After a gateway restart (SIGTERM → drain → exit code 1), systemd could not kill the unit's control group, and an orphaned `adb` process remained running inside it. The gateway did not come back for **6 minutes 11 seconds**, even though the unit sets `Restart=always`. During that time all platforms (Telegram, Discord, WhatsApp) were down and no cron jobs ran.

## Journal evidence

```
Jun 02 09:28:37 systemd[1983]: hermes-gateway.service: Main process exited, code=exited, status=1/FAILURE
Jun 02 09:28:37 systemd[1983]: hermes-gateway.service: Failed to kill control group /user.slice/.../hermes-gateway.service, ignoring: Invalid argument
Jun 02 09:28:37 systemd[1983]: hermes-gateway.service: Unit process 42104 (adb) remains running after unit stopped.
Jun 02 09:28:37 systemd[1983]: Stopped hermes-gateway.service
Jun 02 09:34:48 systemd[2039]: Started hermes-gateway.service  ← 6 min 11 sec later
```

## Root cause chain

1. Gateway spawns processes during normal operation (terminal tool subprocesses, platform bridges, Android debug bridge, etc.)
2. With `KillMode=mixed`, systemd sends SIGTERM only to the main PID; the remaining processes in the cgroup get no graceful stop signal and are only sent SIGKILL afterwards
3. On shutdown, the gateway cleans up most subprocesses but an `adb` process remained
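4. systemd tried to kill the cgroup but got `Invalid argument`. The cause is unconfirmed; one hypothesis is that the process was in an uninterruptible state. Re-parenting alone would not explain it, since a re-parented process stays in its cgroup (the journal still lists PID 42104 as a unit process)
5. The service did not start again for **6 minutes 11 seconds**. The `Started` line was logged by a different user manager (`systemd[2039]` instead of `systemd[1983]`), which is consistent with the manual reboot noted under Impact and not with an automatic restart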
6. During this entire window, `Restart=always` could not restart the service

## Impact

- **6+ minutes of complete Hermes outage** across all platforms
- All cron jobs missed their windows during the outage
- User had to manually reboot the machine to recover

## Environment

- Hermes Agent v0.15.1
- systemd user service: `KillMode=mixed`, `Restart=always`, `TimeoutStopSec=90`
- Linux 6.8.0-124-generic

## Suggested fix

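1. Change to `KillMode=control-group` so systemd sends SIGTERM to every process in the cgroup on stop, followed by SIGKILL after the stop timeout. Do not use `KillMode=process`: it signals only the main process and leaves children running, which makes orphans more likely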
2. Or add a pre-stop cleanup that explicitly kills known orphan-prone processes (adb, node bridges, etc.)
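3. Or add an `ExecStopPost=` cleanup command. Note that `ExecStopPost=-/usr/bin/pkill -P $$` would not work as written: the command is not run through a shell, systemd treats `$$` as an escaped literal `$`, and by the time it runs the orphaned children have been re-parented, so matching on parent PID would miss them. Matching by name (for example `pkill -x adb`) is more likely to work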
4. The `Skipping .clean_shutdown marker` logic should also explicitly reap remaining subprocesses before exit
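
As a rough illustration of options 1 and 3, a drop-in along these lines could be tried. This is an untested sketch, not a confirmed fix: the drop-in path assumes the standard user-unit location, `pkill -x adb` matches every `adb` process owned by the user (not only those in this unit's cgroup), and it is unclear whether any `KillMode=` setting helps while the cgroup kill itself fails with `Invalid argument`.

```ini
# Assumed location: ~/.config/systemd/user/hermes-gateway.service.d/override.conf
[Service]
# SIGTERM every process in the unit's cgroup on stop, not just the main PID.
KillMode=control-group
# After this timeout, systemd sends SIGKILL to whatever is still running.
TimeoutStopSec=90
# Best-effort cleanup of a known straggler.
# The leading '-' tells systemd to ignore a non-zero exit status
# (pkill exits 1 when nothing matched).
ExecStopPost=-/usr/bin/pkill -x adb
```

After creating the file, `systemctl --user daemon-reload` followed by `systemctl --user restart hermes-gateway.service` would apply it.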
