fix(#30170): protect in-flight subagents from busy-mode interrupts#30183

Closed
xxxigm wants to merge 2 commits into
NousResearch:mainfrom
xxxigm:fix/30170-protect-subagents-from-interrupt

Conversation

xxxigm commented May 22, 2026

Contributor

What does this PR do?

Stops in-flight delegate_task subagents from being killed when a user sends a conversational follow-up while delegation is running. The gateway now demotes busy_input_mode='interrupt' to queue semantics whenever the parent agent is currently driving subagents, so the message is queued for the next turn instead of cascading interrupt() through AIAgent._active_children and aborting every subagent.

Before:

  • User sends a follow-up while delegate_task is in flight.
  • Gateway calls running_agent.interrupt(text) on the parent.
  • AIAgent.interrupt() cascades synchronously to every entry in _active_children and calls child.interrupt(message).
  • Subagent tool calls abort, subagent future resolves with status="interrupted", minutes of work are lost.
  • User only sees the fallback cascade with no actionable root cause in the gateway log.
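The cascade described above can be pictured with a minimal sketch. `SketchAgent` is a hypothetical stand-in for `AIAgent`, not the real class; only the `_active_children` and `_active_children_lock` names come from this PR, and the snapshot-then-cascade ordering is an assumption made for illustration.

```python
import threading


class SketchAgent:
    """Hypothetical stand-in for AIAgent; only the interrupt cascade is modelled."""

    def __init__(self, name):
        self.name = name
        self.interrupted_with = None
        self._active_children = []
        self._active_children_lock = threading.Lock()

    def interrupt(self, message=None):
        # Record the interrupt on this agent first.
        self.interrupted_with = message
        # Snapshot the children under the lock, then cascade outside it
        # (assumed ordering; the real method may differ).
        with self._active_children_lock:
            children = list(self._active_children)
        for child in children:
            child.interrupt(message)


parent = SketchAgent("parent")
child = SketchAgent("subagent")
parent._active_children.append(child)

# A conversational follow-up interrupts the parent and reaches the child too.
parent.interrupt("quick follow-up")
print(child.interrupted_with)
```

Because the parent forwards the interrupt to every child it tracks, any caller that interrupts the parent also stops the delegated work, which is why the fix sits in the gateway rather than in `interrupt()` itself.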

After:

  • Same scenario.
  • Gateway checks _agent_has_active_subagents(running_agent); the parent is delegating → demote interrupt to queue.
  • parent.interrupt() is NOT called. The message is merged into the pending queue and surfaced as ⏳ Subagent working — your message is queued for when it finishes (use /stop to cancel everything).
  • Subagent keeps running. /stop and /new still cascade the full interrupt for operators who actually want to cancel.
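The demotion rule can be expressed as a small pure function. This is a sketch under stated assumptions: `BusyAction` and `resolve_busy_action` are hypothetical names invented here, and the real gateway makes this decision inline in its busy handlers rather than through a helper like this.

```python
from enum import Enum


class BusyAction(Enum):
    """The three busy_input_mode values named in this PR."""

    INTERRUPT = "interrupt"
    QUEUE = "queue"
    STEER = "steer"


def resolve_busy_action(configured_mode: str, has_active_subagents: bool) -> BusyAction:
    """Hypothetical helper: pick the effective action for a busy-session message."""
    mode = BusyAction(configured_mode)
    # Only 'interrupt' is demoted; configured 'queue' and 'steer' pass through.
    if mode is BusyAction.INTERRUPT and has_active_subagents:
        return BusyAction.QUEUE
    return mode


print(resolve_busy_action("interrupt", True))
print(resolve_busy_action("interrupt", False))
print(resolve_busy_action("steer", True))
```

Explicit `/stop` and `/new` are deliberately absent from this function, since the PR states they take a separate path that still cascades the full interrupt.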

Related Issue

Fixes #30170.

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • gateway/run.py — implementation (+81 lines, 1 file):
    • New GatewayRunner._agent_has_active_subagents(running_agent) static helper. Type-defensive: returns True iff _active_children is a non-empty real list/tuple/set. Rejects None, _AGENT_PENDING_SENTINEL, missing attributes, and truthy MagicMock auto-attributes (so test mocks don't accidentally fire the demotion).
    • _handle_active_session_busy_message (~L2930) — the adapter-level busy handler the issue cites. When busy_input_mode == 'interrupt' and the parent has active subagents, demote to queue: skip parent.interrupt(), merge the event into the pending queue, and surface a dedicated ack with the /stop escape hatch.
    • PRIORITY interrupt branch inside _handle_message (~L7050) — the non-command fast path. Same guard, same demotion. Routes through _queue_or_replace_pending_event.
  • tests/gateway/test_subagent_protection_30170.py — regression tests (+348 lines, new file):
    • TestAgentHasActiveSubagents — 11 cases pinning the precision of the detection helper (None / sentinel / missing attribute / empty list / single child / many children / no-lock variant / truthy MagicMock regression guard / list-tuple-set acceptance).
    • TestBusyHandlerDemotesInterruptForSubagents — 6 cases driving _handle_active_session_busy_message directly (interrupt NOT called when subagents active; ack mentions "Subagent working", "queued", and /stop; baseline behaviour preserved when no subagents; configured queue/steer modes unchanged; _AGENT_PENDING_SENTINEL does not trigger demotion).

No other production code touched. Explicit /stop, /new, configured queue mode, configured steer mode, and AIAgent.interrupt() itself are all byte-identical to before.

How to Test

  1. Check out this branch and ensure .venv is set up: python3 -m venv .venv && source .venv/bin/activate && pip install -e ".[all,dev]"
  2. Run the new regression tests on their own:
    scripts/run_tests.sh tests/gateway/test_subagent_protection_30170.py -v
    
    Expected: 17 passed.
  3. Run the full busy + subagent sweep to confirm no cross-file regressions:
    scripts/run_tests.sh tests/gateway/test_busy_session_ack.py tests/gateway/test_busy_session_auth_bypass.py tests/gateway/test_subagent_protection_30170.py tests/cli/test_busy_input_mode_command.py tests/agent/test_subagent_stop_hook.py tests/cli/test_cli_interrupt_subagent.py
    
    Expected: 52 passed.
  4. Manual end-to-end repro (matches issue body):
    • Configure gateway with default display.busy_input_mode: interrupt and a Telegram (or any) adapter.
    • Send a prompt that triggers delegate_task (e.g. "spawn a subagent to summarize file X").
    • While the subagent is actively making API calls / running tools, send any follow-up message.
    • Expected: subagent keeps working; you receive ⏳ Subagent working — your message is queued for when it finishes (use /stop to cancel everything).; your follow-up is processed after the delegation finishes.
    • Expected: sending /stop instead still hard-stops the subagent.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(gateway): ... and test(gateway): ...)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run scripts/run_tests.sh tests/gateway/test_subagent_protection_30170.py and all tests pass
  • I've added tests for my changes
  • I've tested on my platform: macOS 15.2 (Darwin 24.6.0), Python 3.12

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — N/A (user-facing ack copy IS the documentation; helper docstring and inline comments cite [Bug]: Sending a message while delegate_task is running kills the subagent — interrupt propagates unconditionally to children #30170)
  • I've updated cli-config.yaml.example if I added/changed config keys — N/A (no new config key; this is a behavioural refinement of the existing display.busy_input_mode: interrupt)
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — pure in-process logic, no OS-specific calls; tests are hermetic (MagicMock + asyncio, no real adapters)
  • I've updated tool descriptions/schemas if I changed tool behavior — N/A (delegate_task schema is unchanged; only the gateway's busy-message routing changed)

Screenshots / Logs

$ scripts/run_tests.sh tests/gateway/test_subagent_protection_30170.py -v
collected 17 items

tests/gateway/test_subagent_protection_30170.py::TestAgentHasActiveSubagents::test_returns_false_for_none PASSED
tests/gateway/test_subagent_protection_30170.py::TestAgentHasActiveSubagents::test_returns_false_for_pending_sentinel PASSED
tests/gateway/test_subagent_protection_30170.py::TestAgentHasActiveSubagents::test_returns_false_when_attribute_missing PASSED
tests/gateway/test_subagent_protection_30170.py::TestAgentHasActiveSubagents::test_returns_false_for_empty_list PASSED
tests/gateway/test_subagent_protection_30170.py::TestAgentHasActiveSubagents::test_returns_true_for_single_child PASSED
tests/gateway/test_subagent_protection_30170.py::TestAgentHasActiveSubagents::test_returns_true_for_many_children PASSED
tests/gateway/test_subagent_protection_30170.py::TestAgentHasActiveSubagents::test_works_without_lock PASSED
tests/gateway/test_subagent_protection_30170.py::TestAgentHasActiveSubagents::test_rejects_truthy_non_collection_attribute PASSED
tests/gateway/test_subagent_protection_30170.py::TestAgentHasActiveSubagents::test_accepts_list_tuple_set[tuple] PASSED
tests/gateway/test_subagent_protection_30170.py::TestAgentHasActiveSubagents::test_accepts_list_tuple_set[set] PASSED
tests/gateway/test_subagent_protection_30170.py::TestAgentHasActiveSubagents::test_accepts_list_tuple_set[list] PASSED
tests/gateway/test_subagent_protection_30170.py::TestBusyHandlerDemotesInterruptForSubagents::test_does_not_call_interrupt_when_subagents_active PASSED
tests/gateway/test_subagent_protection_30170.py::TestBusyHandlerDemotesInterruptForSubagents::test_ack_explains_the_demotion PASSED
tests/gateway/test_subagent_protection_30170.py::TestBusyHandlerDemotesInterruptForSubagents::test_interrupt_still_fires_when_no_subagents PASSED
tests/gateway/test_subagent_protection_30170.py::TestBusyHandlerDemotesInterruptForSubagents::test_queue_mode_unchanged_with_subagents PASSED
tests/gateway/test_subagent_protection_30170.py::TestBusyHandlerDemotesInterruptForSubagents::test_steer_mode_still_routes_through_running_agent_steer PASSED
tests/gateway/test_subagent_protection_30170.py::TestBusyHandlerDemotesInterruptForSubagents::test_pending_sentinel_does_not_demote PASSED

============================== 17 passed in 0.21s ==============================

$ scripts/run_tests.sh tests/gateway/test_busy_session_ack.py tests/gateway/test_busy_session_auth_bypass.py tests/gateway/test_subagent_protection_30170.py tests/cli/test_busy_input_mode_command.py tests/agent/test_subagent_stop_hook.py tests/cli/test_cli_interrupt_subagent.py
....................................................                     [100%]
52 passed in 1.14s

xxxigm added 2 commits May 22, 2026 09:39
…ousResearch#30170)

When a user sends a conversational follow-up while delegate_task is
running, gateway/run.py calls running_agent.interrupt(event.text) on
the PARENT agent. AIAgent.interrupt() then cascades synchronously
through self._active_children and calls interrupt() on every child
subagent, aborting in-flight delegate_task work. The user sees the
fallback cascade with no root-cause in the gateway log, and minutes of
subagent progress are destroyed — the exact failure mode reported in
NousResearch#30170.

Add GatewayRunner._agent_has_active_subagents(running_agent) — a
static helper that returns True iff the parent is currently driving
subagents via delegate_task. The helper is type-defensive: it ignores
truthy MagicMock auto-attributes (so this doesn't accidentally fire
in every test mock that hits the busy path), the _AGENT_PENDING_SENTINEL
placeholder, and missing locks.

Wire the helper into both interrupt branches:

  1. _handle_active_session_busy_message — the adapter-level busy
     handler. When busy_input_mode == 'interrupt' AND the parent has
     active subagents, demote to 'queue' semantics: skip the
     parent.interrupt() call, merge the message into the pending
     queue, and surface a dedicated ack ("⏳ Subagent working — your
     message is queued for when it finishes (use /stop to cancel
     everything).") so the operator knows the message wasn't lost and
     discovers the explicit escape hatch.

  2. The PRIORITY interrupt branch inside _handle_message — the
     non-command fast path. Same rationale, same demotion. Routes
     through _queue_or_replace_pending_event so the next-turn pickup
     stays unchanged.

Explicit /stop and /new commands take a completely different path
(_interrupt_and_clear_session in the slash-command dispatch at line
~6771) and are NOT affected by this guard — the operator still has a
way to force-cancel everything when they actually mean it. Configured
'queue' and 'steer' modes are also untouched: 'queue' already does the
right thing, and 'steer' goes through running_agent.steer() which does
NOT cascade to children (so subagents survive a steer too).

This is Phase 1 of the fix outlined in NousResearch#30170 — the minimum viable
change that stops subagent loss. Phase 2 (delegation-aware steer
forwarding to active children) and Phase 3 (async delegation, NousResearch#11508)
are intentionally out of scope.

Refs NousResearch#30170.
…rupt protection

17 new tests in tests/gateway/test_subagent_protection_30170.py pin
down both the detection helper and the demotion behaviour:

  * TestAgentHasActiveSubagents — 11 cases covering the precision and
    defensiveness of _agent_has_active_subagents:
      - returns False for None, _AGENT_PENDING_SENTINEL, and stub
        agents that lack the _active_children attribute;
      - returns False for an empty list (the steady state of an idle
        AIAgent);
      - returns True for one or many children;
      - works when _active_children_lock is None (test stubs);
      - rejects truthy MagicMock auto-attributes — this is the
        regression-guard for "every MagicMock-based gateway test
        suddenly demotes to queue mode" (which is how this was
        originally found);
      - accepts list/tuple/set as the children container.

  * TestBusyHandlerDemotesInterruptForSubagents — 6 cases driving
    _handle_active_session_busy_message directly:
      - parent.interrupt is NOT called when subagents are active,
        message is still merged into the pending queue;
      - ack copy mentions "Subagent working", "queued", and the
        /stop escape hatch — and does NOT mention "Interrupting";
      - with no subagents, behaviour is byte-identical to the
        pre-NousResearch#30170 interrupt path (parent.interrupt called with the
        user text, ack says "Interrupting");
      - configured queue mode keeps its vanilla "Queued for the next
        turn" ack (the NousResearch#30170 demotion-specific copy must NOT fire);
      - configured steer mode still routes to running_agent.steer()
        even when subagents are active (the guard is interrupt-only);
      - _AGENT_PENDING_SENTINEL does not trigger demotion.

Refs NousResearch#30170.

daimon-nous Bot commented May 25, 2026

Contributor

Reviewed in the #30170 triage — this PR is the strongest candidate. It covers both interrupt paths (warm _handle_active_session_busy_message + cold _handle_message PRIORITY block), has robust MagicMock regression guards, tests steer-mode preservation, and includes a /stop hint in the ack message. 17 tests, all passing. CI failure is an unrelated test_tui_npm_install.py flake. See #30170 (comment) for the full triage.

daimon-nous Bot commented May 25, 2026

Contributor

Merged via PR #32076. Your commits were cherry-picked onto current main with your authorship preserved in git log. Thanks for the thorough implementation: both interrupt paths guarded, MagicMock defense, comprehensive test suite. 🙏

@alt-glitch alt-glitch closed this May 25, 2026

Labels

  • comp/gateway: Gateway runner, session dispatch, delivery
  • P2: Medium, degraded but workaround exists
  • tool/delegate: Subagent delegation
  • type/bug: Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Sending a message while delegate_task is running kills the subagent — interrupt propagates unconditionally to children

2 participants