Gateway self-restart from active WhatsApp chat can self-block graceful drain and always hit drain timeout #20694

@herbalizer404

Bug Description

When a gateway self-restart is triggered from the same active WhatsApp DM session currently being served by Hermes, graceful drain appears to keep counting that requesting session as an active agent.

In practice, the restart path waits for the currently-running chat turn to finish, but that turn is itself the one requesting restart. The result is that graceful drain appears to wait on itself until the drain timeout expires.
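A minimal sketch of the suspected pattern, not taken from the Hermes source. `drain` and `simulate_self_blocking_restart` are hypothetical names, and the real gateway may track active agents differently. The sketch assumes the drain waits on a set of active tasks that still contains the task requesting the restart:

```python
import asyncio


async def drain(active: set[asyncio.Task], timeout: float) -> bool:
    """Wait for every task in `active`; return True if all finished in time."""
    if not active:
        return True
    _done, pending = await asyncio.wait(active, timeout=timeout)
    return not pending


async def simulate_self_blocking_restart(timeout: float = 0.2) -> bool:
    """Return whether the drain completed when requested from an active turn."""
    active: set[asyncio.Task] = set()

    async def chat_turn() -> bool:
        # Hypothetical restart request: the turn awaits the drain while it is
        # still registered as active, so the drain cannot see it finish.
        return await drain(active, timeout)

    turn = asyncio.create_task(chat_turn())
    active.add(turn)  # registered before the task gets a chance to run
    return await turn


if __name__ == "__main__":
    drained = asyncio.run(simulate_self_blocking_restart())
    print("drain completed" if drained else "drain timed out waiting on itself")
```

Under these assumptions the drain can only end through its timeout, which would match the "always hit drain timeout" symptom.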

Observed user-facing behavior:
• restart initiated from WhatsApp appears slow or stuck
• gateway enters draining with `1 active agent(s)`
• drain does not complete promptly
• drain eventually times out and interrupts remaining work
• systemd then relaunches the service
• resumed context may be degraded depending on timing and resume path

This looks different from pure service-manager restart-policy failures because the gateway does come back; the problem is that the restart request from the active chat seems to self-block graceful drain.

Exact Environment

Hermes:
• Hermes Agent version: `v0.12.0 (2026.4.30)`
• Repo branch: `main`
• Repo HEAD: `629d8b843d8d8507925fd35344f57de776cb1490`
• Git describe: `v2026.4.30-641-g629d8b843`
• Latest commit subject at reproduction time: `fix(browser): tighten Lightpanda fallback edge cases`
• Local state at investigation time: clean checkout, no persistent local code patches

Python:
• `Python 3.11.15`

OS / kernel:
• OS: `Ubuntu 25.04 (Plucky Puffin)`
• Kernel: `Linux 6.17.2-2-pve x86_64 GNU/Linux`

systemd:
• `systemd 257 (257.4-1ubuntu3.2)`
• Running as a user service

Gateway service characteristics at reproduction time:
• systemd-managed gateway
• WhatsApp connected platform
• restart path initiated from the same active WhatsApp DM being served
• observed with `RestartForceExitStatus=75`

Effective service settings during investigation:
• `Restart=on-failure`
• `RestartUSec=30s`
• `RestartSteps=0`
• `RestartMaxDelayUSec=infinity`
• `TimeoutStopUSec=1min 30s`
• `RestartForceExitStatus=75`
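For anyone comparing their own setup, the effective values above can be read back from the user service manager with `systemctl --user show`. A small sketch follows. The helper names `parse_show_output` and `read_restart_settings` are hypothetical, and the unit name is assumed to be `hermes-gateway.service` as in the `journalctl` command used in this report:

```python
import subprocess


def parse_show_output(text: str) -> dict[str, str]:
    """Parse `Key=Value` lines as printed by `systemctl show`."""
    settings: dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without a key/value separator
            settings[key] = value
    return settings


def read_restart_settings(unit: str = "hermes-gateway.service") -> dict[str, str]:
    """Query the user service manager for restart-related properties."""
    props = ["Restart", "RestartUSec", "TimeoutStopUSec", "RestartForceExitStatus"]
    cmd = ["systemctl", "--user", "show", unit]
    for prop in props:
        cmd += ["-p", prop]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_show_output(result.stdout)


if __name__ == "__main__":
    for key, value in read_restart_settings().items():
        print(f"{key}={value}")
```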

Notes:
• During earlier investigation, the upstream-generated service behavior also showed the same family of drain/self-restart symptoms before local restart-policy mitigation was applied.
• The local override only reduced restart backoff noise; it did not explain the underlying “active session waits on itself” drain pattern.

Steps to Reproduce

1. Run Hermes gateway under systemd as a user service.
2. Connect WhatsApp as a platform.
3. Start a live WhatsApp DM conversation so the gateway is actively serving that session.
4. From that same WhatsApp DM, trigger a gateway restart path (for example via a command or action that results in gateway self-restart).
5. Observe logs in:

  • `journalctl --user -u hermes-gateway.service`
  • `~/.hermes/logs/gateway.log`
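To narrow the gateway log down to the relevant entries, something like the following may help. It is only a sketch: the exact log wording is not assumed, so it matches any line mentioning "drain" regardless of case, and `drain_related` and `scan_gateway_log` are hypothetical helper names.

```python
from pathlib import Path
from typing import Iterable, Iterator

# Log path as given in the reproduction steps.
DEFAULT_LOG = Path.home() / ".hermes" / "logs" / "gateway.log"


def drain_related(lines: Iterable[str]) -> Iterator[str]:
    """Yield lines that mention draining, ignoring case."""
    for line in lines:
        if "drain" in line.lower():
            yield line.rstrip("\n")


def scan_gateway_log(path: Path = DEFAULT_LOG) -> list[str]:
    """Return all drain-related lines from the gateway log file."""
    with path.open(encoding="utf-8", errors="replace") as handle:
        return list(drain_related(handle))


if __name__ == "__main__":
    for entry in scan_gateway_log():
        print(entry)
```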

Expected Behavior

A restart requested from an active chat session should not have to wait for that same session to complete before restart can proceed.

Any of these outcomes would be acceptable:
• the restart-requesting control session is excluded from drain accounting
• the restart control path is detached from the active user turn before drain starts
• the requesting session is immediately marked interrupted/resume-pending so drain can complete promptly

The key expectation: a self-restart initiated from chat should not reliably consume the full drain timeout waiting on itself.

Actual Behavior

The gateway repeatedly logs a drain timeout with one active agent, then interrupts remaining work and restarts.

Representative journal excerpt from a real reproduction:

Metadata

Labels:
• P2 (Medium: degraded but workaround exists)
• comp/gateway (Gateway runner, session dispatch, delivery)
• platform/whatsapp (WhatsApp Business adapter)
• type/bug (Something isn't working)
