Bug: Exec-type quick commands are blocked when gateway is draining
Problem
When the gateway enters a draining state (e.g., after SIGTERM, or after repeated LLM backend failures), user-defined quick commands with type: exec are blocked by the _draining guard before reaching the exec handler.
These commands run independent shell scripts via asyncio.create_subprocess_shell and do not depend on the agent loop or the LLM backend, so there is no reason to block them during draining. In practice, the ops commands are blocked exactly when the system is least healthy and they are needed most.
(Example: custom quick commands like /cleanup and /restart-sglang defined in config.yaml with type: exec — intended for infrastructure recovery — were unusable during a real sglang crash incident.)
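For context, here is a minimal sketch of what an exec-type handler built on asyncio.create_subprocess_shell might look like. The function name run_exec_command and the timeout parameter are assumptions for illustration, not the gateway's actual code. The point is that nothing here touches the agent loop or the LLM backend.

```python
import asyncio


async def run_exec_command(shell_command: str, timeout: float = 30.0) -> str:
    """Run a shell command and return its combined stdout/stderr as text.

    Hypothetical helper: the real exec handler in gateway/run.py may differ.
    """
    proc = await asyncio.create_subprocess_shell(
        shell_command,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,  # merge stderr into stdout
    )
    try:
        output, _ = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        # Do not leave a runaway recovery script behind.
        proc.kill()
        await proc.wait()
        return f"command timed out after {timeout}s"
    return output.decode(errors="replace").strip()


if __name__ == "__main__":
    print(asyncio.run(run_exec_command("echo hello")))
```

The subprocess only needs a running event loop, so it should still work while the LLM backend is down.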
Root Cause
In gateway/run.py, the _draining check runs before quick command parsing:
    if self._draining:
        return "⏳ Gateway is draining..."
    # User-defined quick commands
    if command:
        if qcmd.get("type") == "exec":
            # execute shell command
Because of this order, when _draining is true, the message falls through to the agent loop as a regular conversation turn — which then fails because the LLM is unreachable.
Reproduction
- Let sglang crash or become unreachable (e.g., GPU OOM)
- Gateway enters draining/degraded state
- Send any /your-exec-command defined in config.yaml with type: exec
- Message is treated as a conversation turn instead of executing the shell command
Expected Behavior
Exec-type quick commands should be parsed and executed regardless of gateway draining state, since they operate on external processes and don't depend on any gateway internals.
Fix
Move quick command parsing before the _draining check in gateway/run.py:
-    if self._draining:
-        return f"⏳ Gateway is {self._status_action_gerund()}..."
-
     # User-defined quick commands (bypass agent loop, no LLM call)
+    # MUST be checked BEFORE _draining so ops commands (cleanup, restart)
+    # still execute even when gateway is shutting down or draining.
     if command:
         ...
+
+    if self._draining:
+        return f"⏳ Gateway is {self._status_action_gerund()}..."
Impact Assessment
- exec type: Gains a privileged channel and executes regardless of draining state. Safe because exec commands run independent shell scripts that don't depend on gateway state.
- alias type: Unchanged. Still falls through to the _draining check, since alias rewrites event.text and continues normal dispatch.
- Plugin commands / skill commands / regular messages: Unchanged. Still blocked by _draining.
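The intended ordering and its impact can be sketched with a toy handler. Everything here is a simplified stand-in: GatewaySketch, the "command" and "target" config keys, and the injected run_shell callable are assumptions for illustration, not the real structure of gateway/run.py.

```python
import asyncio


class GatewaySketch:
    """Toy stand-in for the gateway message handler (hypothetical)."""

    def __init__(self, quick_commands, run_shell):
        self.quick_commands = quick_commands  # name -> config dict
        self.run_shell = run_shell            # async callable; injected for testing
        self._draining = False

    async def handle(self, text: str) -> str:
        parts = text[1:].split() if text.startswith("/") else []
        qcmd = self.quick_commands.get(parts[0]) if parts else None

        # Exec-type quick commands are checked BEFORE the draining guard:
        # they bypass the agent loop and make no LLM call.
        if qcmd and qcmd.get("type") == "exec":
            return await self.run_shell(qcmd["command"])

        # Alias-type commands only rewrite the text and continue normal
        # dispatch, so they still hit the draining guard below.
        if qcmd and qcmd.get("type") == "alias":
            text = qcmd["target"]

        if self._draining:
            return "⏳ Gateway is draining..."

        return f"agent turn: {text}"


async def fake_shell(cmd: str) -> str:
    # Stub so the sketch does not spawn real processes.
    return f"ran: {cmd}"


gateway = GatewaySketch(
    {
        "cleanup": {"type": "exec", "command": "./cleanup.sh"},
        "hi": {"type": "alias", "target": "hello"},
    },
    fake_shell,
)
gateway._draining = True
```

With _draining set, handle("/cleanup") still reaches the exec branch, while handle("/hi") and plain messages get the draining reply. A regression test along these lines would guard against the check order being reversed again.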