fix(#30170): protect in-flight subagents from busy-mode interrupts #30183
Closed
xxxigm wants to merge 2 commits into
…ousResearch#30170)

When a user sends a conversational follow-up while delegate_task is running, gateway/run.py calls running_agent.interrupt(event.text) on the PARENT agent. AIAgent.interrupt() then cascades synchronously through self._active_children and calls interrupt() on every child subagent, aborting in-flight delegate_task work. The user sees the fallback cascade with no root cause in the gateway log, and minutes of subagent progress are destroyed: the exact failure mode reported in NousResearch#30170.

Add GatewayRunner._agent_has_active_subagents(running_agent), a static helper that returns True iff the parent is currently driving subagents via delegate_task. The helper is type-defensive: it ignores truthy MagicMock auto-attributes (so this doesn't accidentally fire in every test mock that hits the busy path), the _AGENT_PENDING_SENTINEL placeholder, and missing locks.

Wire the helper into both interrupt branches:

1. _handle_active_session_busy_message, the adapter-level busy handler. When busy_input_mode == 'interrupt' AND the parent has active subagents, demote to 'queue' semantics: skip the parent.interrupt() call, merge the message into the pending queue, and surface a dedicated ack ("⏳ Subagent working — your message is queued for when it finishes (use /stop to cancel everything).") so the operator knows the message wasn't lost and discovers the explicit escape hatch.
2. The PRIORITY interrupt branch inside _handle_message, the non-command fast path. Same rationale, same demotion. Routes through _queue_or_replace_pending_event so the next-turn pickup stays unchanged.

Explicit /stop and /new commands take a completely different path (_interrupt_and_clear_session in the slash-command dispatch at line ~6771) and are NOT affected by this guard, so the operator still has a way to force-cancel everything when they actually mean it.

Configured 'queue' and 'steer' modes are also untouched: 'queue' already does the right thing, and 'steer' goes through running_agent.steer(), which does NOT cascade to children (so subagents survive a steer too).

This is Phase 1 of the fix outlined in NousResearch#30170: the minimum viable change that stops subagent loss. Phase 2 (delegation-aware steer forwarding to active children) and Phase 3 (async delegation, NousResearch#11508) are intentionally out of scope.

Refs NousResearch#30170.
…rupt protection

17 new tests in tests/gateway/test_subagent_protection_30170.py pin down both the detection helper and the demotion behaviour:

* TestAgentHasActiveSubagents: 11 cases covering the precision and defensiveness of _agent_has_active_subagents:
  - returns False for None, _AGENT_PENDING_SENTINEL, and stub agents that lack the _active_children attribute;
  - returns False for an empty list (the steady state of an idle AIAgent);
  - returns True for one or many children;
  - works when _active_children_lock is None (test stubs);
  - rejects truthy MagicMock auto-attributes; this is the regression guard for "every MagicMock-based gateway test suddenly demotes to queue mode" (which is how this was originally found);
  - accepts list/tuple/set as the children container.
* TestBusyHandlerDemotesInterruptForSubagents: 6 cases driving _handle_active_session_busy_message directly:
  - parent.interrupt is NOT called when subagents are active, and the message is still merged into the pending queue;
  - ack copy mentions "Subagent working", "queued", and the /stop escape hatch, and does NOT mention "Interrupting";
  - with no subagents, behaviour is byte-identical to the pre-NousResearch#30170 interrupt path (parent.interrupt called with the user text, ack says "Interrupting");
  - configured queue mode keeps its vanilla "Queued for the next turn" ack (the NousResearch#30170 demotion-specific copy must NOT fire);
  - configured steer mode still routes to running_agent.steer() even when subagents are active (the guard is interrupt-only);
  - _AGENT_PENDING_SENTINEL does not trigger demotion.

Refs NousResearch#30170.
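The MagicMock regression guard mentioned in the test list is easiest to see in isolation. The sketch below is illustrative only: `naive_has_children` and `strict_has_children` are hypothetical names, not functions from the PR, and the two test functions only mirror the spirit of the listed cases.

```python
from unittest.mock import MagicMock


def naive_has_children(agent) -> bool:
    # Truthiness alone: a MagicMock auto-attribute counts as "has children".
    return bool(getattr(agent, "_active_children", None))


def strict_has_children(agent) -> bool:
    # Require a real container before looking at its length.
    children = getattr(agent, "_active_children", None)
    return isinstance(children, (list, tuple, set)) and len(children) > 0


def test_magicmock_auto_attribute_is_rejected():
    mock_agent = MagicMock()  # _active_children is auto-created on access
    assert naive_has_children(mock_agent) is True
    assert strict_has_children(mock_agent) is False


def test_real_containers_are_accepted():
    for container in ([object()], (object(),), {object()}):
        agent = MagicMock()
        agent._active_children = container
        assert strict_has_children(agent) is True
```

A plain truthiness check would demote every mock-based gateway test to queue mode, which is the failure the type check is there to prevent.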
Contributor
Reviewed in the #30170 triage — this PR is the strongest candidate. It covers both interrupt paths (warm |
What does this PR do?
Stops in-flight delegate_task subagents from being killed when a user sends a conversational follow-up while delegation is running. The gateway now demotes busy_input_mode='interrupt' to queue semantics whenever the parent agent is currently driving subagents, so the message is queued for the next turn instead of cascading interrupt() through AIAgent._active_children and aborting every subagent.
Before:
1. …delegate_task is in flight.
2. …running_agent.interrupt(text) on the parent.
3. AIAgent.interrupt() cascades synchronously to every entry in _active_children and calls child.interrupt(message).
4. …status="interrupted", minutes of work are lost.
After:
1. …_agent_has_active_subagents(running_agent); the parent is delegating → demote interrupt to queue.
2. parent.interrupt() is NOT called. The message is merged into the pending queue and surfaced as "⏳ Subagent working — your message is queued for when it finishes (use /stop to cancel everything)."
3. /stop and /new still cascade the full interrupt for operators who actually want to cancel.
Related Issue
Fixes #30170.
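The Before/After flow can be modelled with a toy agent. Everything here is a hypothetical sketch: `ToyAgent`, `route_busy_message`, and the returned action labels are illustrative names, steer mode is omitted, and appending to the pending list on every path is an assumption drawn from the statement that the message is "still" merged into the pending queue.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyAgent:
    """Hypothetical stand-in for AIAgent; only the interrupt cascade is modelled."""

    name: str
    _active_children: List["ToyAgent"] = field(default_factory=list)
    interrupted_with: Optional[str] = None

    def interrupt(self, message: str) -> None:
        # Record the interrupt, then cascade synchronously to every child.
        self.interrupted_with = message
        for child in list(self._active_children):
            child.interrupt(message)


def route_busy_message(agent: ToyAgent, text: str, mode: str, pending: List[str]) -> str:
    """Return an action label; interrupt mode is demoted while children are active."""
    pending.append(text)  # assumption: the message is kept for the next turn on every path
    if mode == "interrupt":
        if agent._active_children:
            return "queued_for_subagents"  # demotion: parent.interrupt() is skipped
        agent.interrupt(text)
        return "interrupted"
    return "queued"


child = ToyAgent("child")
parent = ToyAgent("parent", [child])
pending: List[str] = []
action = route_busy_message(parent, "any update?", "interrupt", pending)
```

In this model the child is never interrupted while the parent is delegating, whereas calling `parent.interrupt(...)` directly (the explicit cancel path) still reaches it.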
Type of Change
Changes Made
gateway/run.py, implementation (+81 lines, 1 file):
- GatewayRunner._agent_has_active_subagents(running_agent) static helper. Type-defensive: returns True iff _active_children is a non-empty real list/tuple/set. Rejects None, _AGENT_PENDING_SENTINEL, missing attributes, and truthy MagicMock auto-attributes (so test mocks don't accidentally fire the demotion).
- _handle_active_session_busy_message (~L2930): the adapter-level busy handler the issue cites. When busy_input_mode == 'interrupt' and the parent has active subagents, demote to queue: skip parent.interrupt(), merge the event into the pending queue, and surface a dedicated ack with the /stop escape hatch.
- _handle_message (~L7050): the non-command fast path. Same guard, same demotion. Routes through _queue_or_replace_pending_event.
tests/gateway/test_subagent_protection_30170.py, regression tests (+348 lines, new file):
- TestAgentHasActiveSubagents: 11 cases pinning the precision of the detection helper (None / sentinel / missing attribute / empty list / single child / many children / no-lock variant / truthy MagicMock regression guard / list-tuple-set acceptance).
- TestBusyHandlerDemotesInterruptForSubagents: 6 cases driving _handle_active_session_busy_message directly (interrupt NOT called when subagents active; ack mentions "Subagent working", "queued", and /stop; baseline behaviour preserved when no subagents; configured queue/steer modes unchanged; _AGENT_PENDING_SENTINEL does not trigger demotion).
No other production code touched. Explicit /stop, /new, configured queue mode, configured steer mode, and AIAgent.interrupt() itself are all byte-identical to before.
How to Test
1. … .venv is set up: python3 -m venv .venv && source .venv/bin/activate && pip install -e ".[all,dev]"
2. … display.busy_input_mode: interrupt and a Telegram (or any) adapter.
3. … delegate_task (e.g. "spawn a subagent to summarize file X").
4. … "⏳ Subagent working — your message is queued for when it finishes (use /stop to cancel everything)."; your follow-up is processed after the delegation finishes.
5. … /stop instead still hard-stops the subagent.
Checklist
Code
- …fix(gateway): ... and test(gateway): ...)
- …scripts/run_tests.sh tests/gateway/test_subagent_protection_30170.py and all tests pass
Documentation & Housekeeping
- …docs/, docstrings): N/A (user-facing ack copy IS the documentation; helper docstring and inline comments cite [Bug]: Sending a message while delegate_task is running kills the subagent — interrupt propagates unconditionally to children #30170)
- …cli-config.yaml.example if I added/changed config keys: N/A (no new config key; this is a behavioural refinement of the existing display.busy_input_mode: interrupt)
- …CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows: N/A
- …delegate_task schema is unchanged; only the gateway's busy-message routing changed)
Screenshots / Logs