[Feature]: Decouple subagent lifecycle notifications from session-lane batching #66638

@pidgeon777

Summary
Introduce per-run, per-event subagent notifications with immediate spawn/progress/completion delivery, instead of batching announcements behind the parent session lane.

Problem to solve
Today, subagent notifications can be delayed until the parent workflow or session lane yields, which creates visible latency on real-time channels such as WhatsApp. In practice, this causes two related problems:

  1. Forced batching
    Announcements that are logically independent events (spawn acknowledged, progress update, completion) are often delivered together or only after a later scheduler/session opportunity, instead of when the event actually happened.

  2. Lane starvation
    If the parent session is busy, blocked on a long turn, or otherwise holding the lane, child-run announcements wait behind unrelated work. This makes the system appear silent even when the subagent is actively running and producing meaningful milestones.

This is especially painful on user-facing channels where timing is part of the UX. A WhatsApp user expects an immediate acknowledgement that a detached/subagent task was spawned, intermediate progress while it runs, and prompt completion when it finishes. Delaying these behind the session lane undermines trust, makes long-running tasks feel stalled, and forces workarounds in authoring layers.

This issue appears connected to the broader notification and session-delivery class of problems tracked in:

The common pattern is that event production and channel delivery are too tightly coupled to a lane/session completion boundary, so user-visible notifications inherit backpressure from unrelated work.

Proposed solution
I propose an architectural change that treats subagent and user-facing announces as first-class event delivery, not as side effects that must wait for a parent session turn to finish.

Concretely:

  1. Immediate spawn ACK at framework level
    When a subagent run is created successfully, emit and deliver a framework-level spawn acknowledgement immediately.

    • This should not depend on the parent session yielding.
    • It should be generated by the runtime/framework itself, not left to each authoring layer to simulate manually.
    • This gives the user an immediate confirmation that the task exists and is running.
  2. Independent lifecycle hooks per run
    Add explicit lifecycle hooks / event types for each run, for example:

    • run_spawned
    • run_progress
    • run_completed
    • run_failed
    • run_cancelled

    These should be emitted independently for every child run and should not be blocked behind the parent lane. The delivery path should preserve ordering per run, but not require waiting for unrelated session output.

  3. Channel-direct fallback for announces
    If normal announce routing would be delayed by the session lane, provide a channel-direct fallback path for framework-generated status notifications.

    • Example: if the target channel is WhatsApp, a progress/completion announce should be able to bypass the congested parent lane and still reach the configured destination safely.
    • This should be scoped to framework/system status events, not arbitrary model output, so the safety and routing model stays understandable.
  4. Persisted announce queue
    Add a durable announce queue for run lifecycle notifications.

    • If the channel listener/bridge is temporarily disconnected, events should be persisted and retried.
    • Delivery should be at-least-once with deduplication/idempotency keys at the announce level.
    • This would make subagent notifications resilient to transient channel outages instead of silently dropping or indefinitely delaying them.
  5. Separation between event creation and channel delivery
    More broadly, the architecture should separate:

    • event generation (runtime/run lifecycle)
    • event persistence/queueing
    • channel delivery

    That would avoid having the parent conversational lane act as the implicit transport for all child-run visibility.
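
To make the proposed lifecycle events and the per-run ordering requirement concrete, here is a minimal sketch. The event names come from the list above; everything else (`RunEvent`, `RunEventBus`, the `seq` field, the `deliver` callback) is hypothetical and only illustrates the shape, not an existing framework API.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class RunEventType(Enum):
    # Event names taken from the proposal.
    RUN_SPAWNED = "run_spawned"
    RUN_PROGRESS = "run_progress"
    RUN_COMPLETED = "run_completed"
    RUN_FAILED = "run_failed"
    RUN_CANCELLED = "run_cancelled"


@dataclass(frozen=True)
class RunEvent:
    run_id: str
    seq: int  # hypothetical per-run sequence number, assigned at emission
    type: RunEventType
    detail: str = ""


class RunEventBus:
    """Hypothetical bus: emission hands the event straight to delivery.

    There is no parent-lane queue in between, so one run's events never
    wait behind another run or behind unrelated session output.
    """

    def __init__(self, deliver: Callable[[RunEvent], None]) -> None:
        self._deliver = deliver
        self._next_seq: Dict[str, int] = defaultdict(int)

    def emit(self, run_id: str, type: RunEventType, detail: str = "") -> RunEvent:
        seq = self._next_seq[run_id]
        self._next_seq[run_id] += 1
        event = RunEvent(run_id, seq, type, detail)
        self._deliver(event)
        return event


delivered = []
bus = RunEventBus(delivered.append)
bus.emit("run-a", RunEventType.RUN_SPAWNED)
bus.emit("run-b", RunEventType.RUN_SPAWNED)
bus.emit("run-a", RunEventType.RUN_PROGRESS, "halfway")
bus.emit("run-a", RunEventType.RUN_COMPLETED)
```

The sequence number is scoped to a run, so a consumer can check ordering for `run-a` without caring how its events interleave with `run-b`.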

Suggested behavioral contract:

  • Spawn succeeds -> user gets immediate ACK
  • Child emits progress -> user gets progress soon after emission, even if parent lane is busy
  • Child completes/fails -> user gets completion/failure promptly
  • Temporary channel outage -> event is queued durably and delivered when transport recovers
  • Per-run ordering is preserved, but unrelated parent output cannot starve child lifecycle notifications
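
The outage and ordering parts of this contract can be sketched with a small persisted queue. This is an illustration under assumptions, not a proposed implementation: the `DurableAnnounceQueue` class, the `run_id:seq` idempotency key format, and the `send` callback are all hypothetical, and SQLite from the Python standard library stands in for whatever store the framework would use.

```python
import sqlite3
from typing import Callable


class DurableAnnounceQueue:
    """Hypothetical at-least-once announce queue with idempotency keys."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS announces ("
            " id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " idem_key TEXT UNIQUE NOT NULL,"
            " run_id TEXT NOT NULL,"
            " body TEXT NOT NULL,"
            " delivered INTEGER NOT NULL DEFAULT 0)"
        )
        self._db.commit()

    def enqueue(self, run_id: str, seq: int, body: str) -> bool:
        """Persist an announce. Returns False if the key was already queued."""
        cur = self._db.execute(
            "INSERT OR IGNORE INTO announces (idem_key, run_id, body)"
            " VALUES (?, ?, ?)",
            (f"{run_id}:{seq}", run_id, body),
        )
        self._db.commit()
        return cur.rowcount == 1

    def flush(self, send: Callable[[str, str], bool]) -> int:
        """Attempt delivery of pending announces in insertion order.

        `send(idem_key, body)` returns True on success. Once a send fails
        for a run, later announces for that run are held back so per-run
        order is preserved; other runs keep flowing.
        """
        rows = self._db.execute(
            "SELECT id, idem_key, run_id, body FROM announces"
            " WHERE delivered = 0 ORDER BY id"
        ).fetchall()
        blocked = set()
        sent = 0
        for row_id, key, run_id, body in rows:
            if run_id in blocked:
                continue
            if send(key, body):
                self._db.execute(
                    "UPDATE announces SET delivered = 1 WHERE id = ?", (row_id,)
                )
                self._db.commit()
                sent += 1
            else:
                blocked.add(run_id)
        return sent
```

Because a row is marked delivered only after `send` succeeds, a crash between the two steps leads to a resend on the next flush. That is the at-least-once behavior; the receiver would use the idempotency key to drop the duplicate.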

Alternatives considered

  1. Keep current behavior and improve prompting/authoring guidance
    This does not solve the core problem, because the bottleneck is architectural. Better prompts cannot guarantee timely delivery if the framework batches announces behind the lane.

  2. Let each integration/channel implement custom side-band notifications
    This creates duplicated logic, inconsistent behavior across channels, and weakens the contract for subagent lifecycle visibility.

  3. Polling-based status refresh from the parent session
    Polling is less efficient, adds latency, and still fails to provide true push semantics for spawn/progress/completion.

  4. Only add completion notifications
    This helps somewhat, but still leaves the most important UX gap: immediate spawn acknowledgement and timely in-flight progress.

Impact
Affected users/systems/channels:

  • Any workflow using subagents or detached child runs
  • Real-time user-facing channels such as WhatsApp and similar chat transports
  • Authoring layers that rely on background execution while keeping the parent session responsive

Severity:
High for real-time orchestration UX. It does not always break correctness, but it makes the system feel unreliable or unresponsive and pushes authors toward brittle workarounds.

Frequency:
Common whenever long-running or parallel subagent tasks are used.

Consequence:

  • Delayed or missing user feedback after spawning a task
  • Progress updates arriving too late to be useful
  • Completion notifications appearing batched with unrelated output
  • Extra manual status checks and polling
  • Lower operator trust in background execution
  • More complexity in plugins/authoring layers trying to compensate for framework-level timing issues

Evidence/examples
Observed failure mode:

  • A child run is spawned successfully, but the user does not receive an immediate acknowledgement.
  • Intermediate progress exists, but it is not surfaced promptly because the parent lane is still occupied.
  • Completion may arrive only when the parent session yields, making several lifecycle events appear as one delayed batch.

This strongly suggests a coupling between lifecycle announce delivery and session-lane availability rather than true event-driven push.

Related issues / context:

The proposal above is intended as a structural fix that unifies those symptoms under a clearer event-delivery model.

Additional information
A minimal viable implementation could start with:

  1. framework-generated immediate spawn ACK
  2. per-run lifecycle event bus
  3. durable announce queue
  4. channel-direct fallback for framework lifecycle events only

That would already remove most of the visible batching/lane-starvation pain while preserving backward compatibility for normal session output.
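
The last item, a channel-direct fallback limited to framework lifecycle events, comes down to a small routing decision. The sketch below assumes a hypothetical `lane_busy` signal and hypothetical `Origin` / `Route` / `Announce` types; none of these names come from the existing codebase.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    FRAMEWORK = "framework"  # runtime-generated lifecycle status
    MODEL = "model"          # arbitrary model output


class Route(Enum):
    SESSION_LANE = "session_lane"
    CHANNEL_DIRECT = "channel_direct"


@dataclass(frozen=True)
class Announce:
    origin: Origin
    body: str


def choose_route(announce: Announce, lane_busy: bool) -> Route:
    """Pick a delivery path for one announce.

    Only framework-generated status may bypass a busy lane. Model output
    always stays on the session lane, which keeps normal session output
    backward compatible.
    """
    if lane_busy and announce.origin is Origin.FRAMEWORK:
        return Route.CHANNEL_DIRECT
    return Route.SESSION_LANE
```

Keeping the bypass keyed on the origin of the announce, rather than on its content, is what keeps the safety and routing model easy to reason about.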

I am intentionally omitting any personal identifiers or channel-specific private data here. This request is about the delivery architecture, not a user-specific setup.

Labels

  • P2: Normal backlog priority with limited blast radius.
  • clawsweeper:fix-shape-clear: ClawSweeper found a clear likely implementation shape for this issue.
  • clawsweeper:needs-live-repro: ClawSweeper needs live local, crabbox, or manual validation to confirm this issue.
  • clawsweeper:needs-maintainer-review: ClawSweeper marked this issue as needing maintainer review before automation.
  • clawsweeper:needs-product-decision: ClawSweeper marked this issue as needing a product or behavior decision.
  • clawsweeper:no-new-fix-pr: ClawSweeper does not recommend queueing a new automated fix PR for this issue.
  • impact:message-loss: Channel message delivery can be lost, duplicated, or misrouted.
  • impact:session-state: Session, memory, transcript, context, or agent state can drift or corrupt.
  • issue-rating: 🐚 platinum hermit: Good issue quality with a plausible reproduction path needing some confirmation.
  • stale: Marked as stale due to inactivity.