Exec-type quick commands are blocked when gateway is draining #28663

@HH1162

Bug: Exec-type quick commands are blocked when gateway is draining

Problem

When the gateway enters a draining state (e.g., after SIGTERM, or after repeated LLM backend failures), user-defined quick commands with type: exec are blocked by the _draining guard before reaching the exec handler.

These commands run independent shell scripts via asyncio.create_subprocess_shell and do not depend on the agent loop or LLM backend — there's no reason to block them during draining. In practice, this means that when the system is most unhealthy and you need ops commands the most, those are the ones that get blocked.

(Example: custom quick commands like /cleanup and /restart-sglang defined in config.yaml with type: exec — intended for infrastructure recovery — were unusable during a real sglang crash incident.)
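To illustrate why an exec-type command has no dependency on the agent loop or the LLM backend, here is a minimal sketch of an exec handler built on asyncio.create_subprocess_shell. The function name run_exec_quick_command and the timeout parameter are assumptions made for illustration; the actual handler in gateway/run.py may look different.

```python
import asyncio


async def run_exec_quick_command(shell_command: str, timeout: float = 30.0) -> str:
    """Run a shell command and return its combined output.

    Hypothetical helper: the name, signature, and timeout are not taken
    from the gateway source.
    """
    proc = await asyncio.create_subprocess_shell(
        shell_command,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,  # merge stderr into stdout
    )
    try:
        stdout, _ = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        # Do not leave a stuck recovery script running forever.
        proc.kill()
        await proc.wait()
        return f"Command timed out after {timeout}s"

    output = stdout.decode(errors="replace").strip()
    if proc.returncode == 0:
        return output
    return f"[exit {proc.returncode}] {output}"
```

Nothing in this path touches the LLM client or the agent loop: the only inputs are the shell string and the event loop, which is why such a command could in principle still run while the backend is down.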

Root Cause

In gateway/run.py, the _draining check runs before quick command parsing:

if self._draining:
    return "⏳ Gateway is draining..."

# User-defined quick commands
if command:
    # qcmd is the quick command entry for `command` (lookup elided in this excerpt)
    if qcmd.get("type") == "exec":
        # execute shell command
        ...

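Because of this order, when _draining is true the guard returns before quick commands are parsed, so the exec handler is never reached and the shell command is not executed. As reported under Reproduction, the message ends up handled like a regular conversation turn rather than a command, which cannot succeed because the LLM is unreachable.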
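The effect of the guard order can be shown with a toy model. This is a sketch that only mirrors the order of checks in the excerpt; the class GatewayModel, the quick_commands dict shape, and the "command" key are assumptions for illustration, not the real gateway code.

```python
class GatewayModel:
    """Toy stand-in for the dispatcher; not the real gateway/run.py."""

    def __init__(self, quick_commands, draining=False):
        # Hypothetical shape: name -> {"type": "exec", "command": "<shell string>"}
        self.quick_commands = quick_commands
        self._draining = draining
        self.executed = []  # shell strings that reached the exec handler

    def handle(self, text):
        parts = text[1:].split() if text.startswith("/") else []
        command = parts[0] if parts else None

        # Guard first, as in the excerpt: nothing below runs while draining.
        if self._draining:
            return "⏳ Gateway is draining..."

        # User-defined quick commands
        qcmd = self.quick_commands.get(command) if command else None
        if qcmd and qcmd.get("type") == "exec":
            # The real handler would spawn a subprocess here.
            self.executed.append(qcmd["command"])
            return f"ran: {qcmd['command']}"

        return "forwarded to agent loop"


gateway = GatewayModel(
    {"cleanup": {"type": "exec", "command": "./cleanup.sh"}},
    draining=True,
)
reply = gateway.handle("/cleanup")
```

With draining=True, gateway.executed stays empty: the exec branch is unreachable no matter how the command is configured.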

Reproduction

  1. Let sglang crash or become unreachable (e.g., GPU OOM)
  2. Gateway enters draining/degraded state
  3. Send any /your-exec-command defined in config.yaml with type: exec
  4. Message is treated as a conversation turn instead of executing the shell command

Expected Behavior

Exec-type quick commands should be parsed and executed regardless of gateway draining state, since they operate on external processes and don't depend on any gateway internals.

Fix

Move quick command parsing before the _draining check in gateway/run.py:

-        if self._draining:
-            return f"⏳ Gateway is {self._status_action_gerund()}..."
-
         # User-defined quick commands (bypass agent loop, no LLM call)
+        # MUST be checked BEFORE _draining so ops commands (cleanup, restart)
+        # still execute even when gateway is shutting down or draining.
         if command:
             ...
+
+        if self._draining:
+            return f"⏳ Gateway is {self._status_action_gerund()}..."

Impact Assessment

  • exec type: Gains a privileged channel and executes regardless of draining state. Safe because exec commands run independent shell scripts that don't depend on gateway state.
  • alias type: Unchanged — still falls through to _draining check since alias rewrites event.text and continues normal dispatch.
  • Plugin commands / skill commands / regular messages: Unchanged — still blocked by _draining.
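The three cases above can be checked against a small model of the reordered dispatch. This is a sketch under stated assumptions: FixedGatewayModel, the quick_commands dict shape, and the "command" and "target" keys are hypothetical and only mimic the order of checks proposed in the diff.

```python
class FixedGatewayModel:
    """Toy dispatcher with quick commands parsed before the draining guard."""

    def __init__(self, quick_commands, draining=False):
        self.quick_commands = quick_commands  # hypothetical config shape
        self._draining = draining
        self.executed = []  # shell strings that reached the exec handler

    def handle(self, text):
        parts = text[1:].split() if text.startswith("/") else []
        command = parts[0] if parts else None

        # Quick commands first (the proposed order).
        qcmd = self.quick_commands.get(command) if command else None
        if qcmd:
            if qcmd.get("type") == "exec":
                # The real handler would spawn a subprocess here.
                self.executed.append(qcmd["command"])
                return f"ran: {qcmd['command']}"
            if qcmd.get("type") == "alias":
                # An alias only rewrites the text, then continues normal dispatch.
                text = qcmd["target"]

        # Everything that did not return above is still subject to the guard.
        if self._draining:
            return "⏳ Gateway is draining..."
        return f"dispatched: {text}"


commands = {
    "cleanup": {"type": "exec", "command": "./cleanup.sh"},
    "st": {"type": "alias", "target": "/status"},
}
draining_gateway = FixedGatewayModel(commands, draining=True)
exec_reply = draining_gateway.handle("/cleanup")   # exec: runs while draining
alias_reply = draining_gateway.handle("/st")       # alias: still blocked
chat_reply = draining_gateway.handle("hello")      # regular message: still blocked
```

Only the exec branch returns before the guard, so the change is limited to that one command type; aliases and ordinary messages reach the draining check exactly as before.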

Metadata

Labels

  • P2: Medium (degraded but workaround exists)
  • comp/gateway: Gateway runner, session dispatch, delivery
  • type/bug: Something isn't working
