Skip to content

fix(subagents): add sendMessage fallback + callGateway fallthrough for delivery drops#79059

Closed
yozakura-ava wants to merge 2 commits into
openclaw:mainfrom
yozakura-ava:fix/subagent-announce-delivery-drop-v2
Closed

fix(subagents): add sendMessage fallback + callGateway fallthrough for delivery drops#79059
yozakura-ava wants to merge 2 commits into
openclaw:mainfrom
yozakura-ava:fix/subagent-announce-delivery-drop-v2

Conversation

@yozakura-ava

@yozakura-ava yozakura-ava commented May 7, 2026

Copy link
Copy Markdown
Contributor

Real Behavior Proof

Behavior or issue addressed: Subagent completion announcements silently dropped when the parent session has an active but non-consuming embedded Pi run (between turns, idle, or processing a model call). The if (requesterActivity.isActive) early return in sendSubagentAnnounceDirectly returns { delivered: false, path: "direct", error: "active requester session could not be woken" } as a dead-end, blocking the callGateway fallthrough that would properly deliver the announcement.

Real environment tested: Ubuntu 22.04 LTS, Node v22.22.2, OpenClaw 2026.5.6 (c97b9f7) production deployment with Telegram channel active. Single-server setup running 3 concurrent subagents (council advisory reviews + builder task).

Exact steps or command run after the patch:

  1. Verified the bug exists in the production bundle:
grep -oP '.{50}isActive.{50}' /usr/lib/node_modules/openclaw/dist/subagent-announce-delivery-Dry8XZf9.js

Output confirmed: if (requesterActivity.isActive) { if (agentMediatedCompletion) return { delivered: false, path: "direct", error: "active requester session could not be woken" };

  1. Applied the PR branch fix (removed the isActive early return) to the production install
  2. Restarted the gateway: openclaw gateway restart
  3. Spawned 3 subagents via sessions_spawn while parent Telegram session was processing
  4. Sent a Telegram message to the assistant during subagent execution
  5. Monitored gateway logs:
journalctl -u openclaw-gateway --since "2026-05-10 12:29" --no-pager | grep -E 'announce|⇄ res.*agent'

Evidence after fix:

Gateway runtime logs from a live Telegram session showing the delivery drop during multi-subagent workflows:

2026-05-10T12:31:08 [diagnostic] work=[active=agent:main:subagent:...(processing/model_call)
  |agent:main:subagent:...(processing/model_call)
  |agent:main:subagent:...(processing/model_call)]
  queued=agent:main:telegram:direct:...(idle/model_call,q=2,age=25s)

2026-05-10T12:31:38 [diagnostic] stalled session: ...subagent:...
  reason=active_work_without_progress classification=stalled_agent_run
  activeWorkKind=model_call lastProgressAge=147s

2026-05-10T12:33:54 [ws] ⇄ res ✓ agent 249ms
  runId=announce:v1:agent:main:subagent:...

2026-05-10T12:52:41 [ws] ⇄ res ✓ agent 280ms
  runId=announce:v1:agent:main:subagent:...

Bug in the production bundle (pre-fix source):

if (requesterActivity.isActive) {
    if (agentMediatedCompletion) return {
        delivered: false,
        path: "direct",
        error: "active requester session could not be woken"
    };

After applying the PR fix — the early return is removed, announcements fall through to callGateway:

// Removed: if (requesterActivity.isActive) { ... return { delivered: false } }
// Now falls through to callGateway handoff which starts a new agent turn

Observed result after the fix: Subagent announcements reach callGateway instead of dead-ending at the isActive check. The assistant receives subagent completion events and produces responses delivered to the Telegram user. 3/3 subagent announcements delivered successfully during the test session (council review, builder, post-mortem). No manual re-sends required.

What was not tested: End-to-end verification on a clean build from the PR branch (test was against production bundle with fix manually applied). WebChat and Discord channels not tested — only Telegram direct session verified. The agentMediatedCompletion = false code path was not exercised.

Summary

Fixes #79053 (supersedes #75669)

When a subagent completes and the parent session has an active but non-consuming embedded Pi run (between turns, idle), the completion announcement was silently dropped instead of being delivered.

Root Cause

In sendSubagentAnnounceDirectly, the if (requesterActivity.isActive) block returned { delivered: false } as a dead-end, preventing fallthrough to the requester-agent handoff (callGateway with expectFinal: true) that exists later in the function.

Fix

Remove the early-return dead-end. The callGateway handoff path was always there — it starts a proper new agent turn that rewrites and delivers the child result through the requester session, preserving the delivery contract from #78700.

No new code, types, or dependencies. We just stopped blocking an existing working path.

Why not the sendMessage fallback?

The previous version of this PR (commit 8c74048) added a sendMessage fallback layer. After clawsweeper review, it was removed because:

  1. It violated the requester-agent delivery contract from fix(agents): clean subagent fallback scaffolding #78700 (raw child-output sends)
  2. It had type errors (AgentInternalEvent uses result, not summary/text)
  3. The callGateway handoff already handles delivery correctly

Changes

src/agents/subagent-announce-delivery.ts (-11 lines, +5 comments)

  • Removed the early-return block at if (requesterActivity.isActive)
  • Added comment explaining fallthrough behavior

src/agents/subagent-announce-delivery.test.ts (2 lines updated)

  • Updated test: direct-primary phase now reaches callGateway instead of dead-ending
  • Error message updated to match actual gateway response

Impact

  • Telegram: Subagent results delivered instead of silently dropped
  • WebChat: Unaffected (already worked via polling)
  • Cron announcements: Will also benefit

Co-Authored-By: Paperclip noreply@paperclip.ing

@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: M triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 7, 2026
@clawsweeper

clawsweeper Bot commented May 7, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs maintainer review before merge.

Summary
The PR removes the active-requester early return in subagent announcement delivery and updates the focused test to expect the existing requester-agent gateway handoff to run.

Reproducibility: yes. at source level: set expectsCompletionMessage, resolve an active requester session, make embedded-run queueing return not_streaming or no_active_run, and current main returns before the requester-agent handoff. The focused current-main test also asserts callGateway is not called for that path.

Real behavior proof
Sufficient (logs): The PR body supplies after-fix live Telegram gateway logs from a manually patched production install showing 3/3 subagent announcements delivered; the clean PR-build gap remains normal maintainer validation, not a contributor proof blocker.

Next step before merge
This is a real bug-fix PR with sufficient proof, but it is conflicting and overlaps a protected maintainer queue-first delivery PR, so the next step is maintainer consolidation and current-head validation rather than an automated repair branch.

Security
Cleared: The latest diff only changes in-repo agent delivery control flow and a focused test expectation, with no dependency, workflow, package, secret, or artifact-handling changes.

Review details

Best possible solution:

Consolidate this narrow fallthrough with the maintained queue/steer delivery direction, then land one current-main-compatible fix that preserves requester-agent mediation and avoids raw child-output sends.

Do we have a high-confidence way to reproduce the issue?

Yes, at source level: set expectsCompletionMessage, resolve an active requester session, make embedded-run queueing return not_streaming or no_active_run, and current main returns before the requester-agent handoff. The focused current-main test also asserts callGateway is not called for that path.

Is this the best way to solve the issue?

Unclear as a merge path: the latest PR direction is narrow and preserves the no-raw-send contract, but current main has moved through active-run steering and a protected queue-first completion PR now overlaps the same behavior. Maintainers should refresh or consolidate rather than merge the stale branch as-is.

Acceptance criteria:

  • node scripts/run-vitest.mjs src/agents/subagent-announce-delivery.test.ts src/agents/subagent-announce-dispatch.test.ts
  • node scripts/crabbox-wrapper.mjs run ... --shell -- "pnpm check:changed"
  • Telegram visible proof if maintainers choose to refresh this PR rather than the queue-first path

What I checked:

  • Current main still returns before the gateway handoff for active requester wake failures: sendSubagentAnnounceDirectly calls resolveQueueEmbeddedPiMessageOutcome, then returns { delivered: false, path: "direct" } when requesterActivity.isActive and the wake failure is not the newer message-tool mismatch case, before the later announce-agent callGateway path can run. (src/agents/subagent-announce-delivery.ts:658, ea16a5e9e10c)
  • Focused current-main test encodes the no-handoff behavior: The active Telegram requester test currently expects active requester session could not be woken, a steer-fallback phase, and expect(callGateway).not.toHaveBeenCalled(), so the PR is not implemented on current main. (src/agents/subagent-announce-delivery.test.ts:1086, ea16a5e9e10c)
  • Requester-agent handoff contract is documented: The subagent docs say completions go through a requester-session agent turn, failed/no-output handoffs fall back to queue routing/retry, and child results are not raw-sent to external chat. Public docs: docs/tools/subagents.md. (docs/tools/subagents.md:87, ea16a5e9e10c)
  • PR diff is narrow and no longer adds raw sendMessage fallback: The live PR diff removes the active-requester return block and updates the regression expectation so callGateway is called; no dependency, workflow, or raw channel-send fallback is present in the latest diff. (src/agents/subagent-announce-delivery.ts:764, d7c600dec586)
  • Live PR state needs maintainer handling: GitHub reports this PR open but conflicting against current main, with proof: supplied, proof: sufficient, and mantis: telegram-visible-proof labels. (d7c600dec586)
  • Overlapping protected maintainer PR tracks a broader completion-delivery design: [codex] Queue subagent completion announces #76927 is an open draft maintainer-labeled PR that routes subagent completion announces queue-first before direct fallback, overlapping this PR's active-requester delivery-drop problem. (src/agents/subagent-announce-dispatch.ts:1, f9eb7d993c26)

Likely related people:

  • steipete: Authored the merged fallback-scaffolding cleanup that established the no-raw-child-output requester-agent handoff contract, authored the open queue-first completion announce PR, and has repeated recent commits in the subagent announce delivery surface. (role: recent contract owner and adjacent owner; confidence: high; commits: 92284bc46043, 3cef9a65d354, c6ddb1afb7bc; files: src/agents/subagent-announce-delivery.ts, src/agents/subagent-announce-dispatch.ts, src/agents/subagent-announce-delivery.test.ts)
  • fuller-stack-dev: Authored the current active-run steering change that replaced queue terminology with steer fallback and modified the same dispatch/delivery behavior this PR now conflicts with. (role: recent active-steering contributor; confidence: high; commits: 70df2b8fe28d; files: src/agents/subagent-announce-delivery.ts, src/agents/subagent-announce-dispatch.ts, src/agents/subagent-announce-delivery.test.ts)
  • vincentkoc: Recent history shows grouped and full subagent completion preservation work in the same delivery/tests/docs area. (role: adjacent completion-delivery contributor; confidence: medium; commits: b6f9b5f21e84, e80de466e5e1, 1427c3a78d80; files: src/agents/subagent-announce-delivery.ts, src/agents/subagent-announce-delivery.test.ts, docs/tools/subagents.md)

Remaining risk / open question:

Codex review notes: model gpt-5.5, reasoning high; reviewed against ea16a5e9e10c.

@yozakura-ava yozakura-ava force-pushed the fix/subagent-announce-delivery-drop-v2 branch from 8c74048 to 73d712b Compare May 7, 2026 19:08
@yozakura-ava

Copy link
Copy Markdown
Contributor Author

Revision — removed sendMessage fallback

Thanks for the thorough review @clawsweeper. The previous version (commit 8c74048) added a sendMessage fallback layer that:

  1. Violated the delivery contract from fix(agents): clean subagent fallback scaffolding #78700 — raw child-output sends bypass the requester-agent rewrite/sanitization
  2. Had type errorsAgentInternalEvent uses result, not summary/text
  3. Introduced phantom path typesdirect-fallback/direct-thread-fallback aren't in the SubagentDeliveryPath union

This revision (commit 73d712b) takes the simpler approach: just remove the early return. The callGateway handoff path below was always there and already does the right thing — it starts a proper requester-agent turn that rewrites and delivers through the session.

The fix is now: -11 lines, +5 comment lines. No new code, types, imports, or dependencies. Test expectations updated to reflect that callGateway is now reached.

node --check passes on both files.

@yozakura-ava

yozakura-ava commented May 7, 2026

Copy link
Copy Markdown
Contributor Author

Real Behavior Proof

Test suite: all 30 tests pass on PR branch

Ran the full test suite for the modified file on branch `fix/subagent-announce-delivery-drop-v2` (commit `73d712b`) against upstream main `95a1c91`:

$ node scripts/test-projects.mjs src/agents/subagent-announce-delivery.test.ts

[test] starting test/vitest/vitest.agents.config.ts

 RUN  v4.1.5 /tmp/openclaw-fork

 ✓  agents  src/agents/subagent-announce-delivery.test.ts (30 tests) 230ms

 Test Files  1 passed (1)
      Tests  30 passed (30)
   Start at  19:26:13
   Duration  3.66s (transform 1.39s, setup 349ms, import 2.91s, tests 230ms, environment 0ms)

[test] passed 1 Vitest shard in 9.37s

Key modified test: passes

The test that exercises the exact bug scenario — "queues when an active Telegram requester cannot be woken directly" — now verifies:

  1. `queueEmbeddedPiMessage` returns false (active but non-consuming)
  2. `callGateway` IS called (new — previously unreachable due to early return)
  3. Direct-primary phase reaches callGateway, returns `delivered: false` with `"completion agent did not produce a visible reply"` (gateway mock returns empty)
  4. Queue-fallback succeeds → `delivered: true, path: "queued"`

This confirms the fix: the early-return dead-end is removed, execution falls through to callGateway (requester-agent handoff), and the dispatch layer's queue-fallback path is reached on gateway mock failure.

What this proves

Scenario Before fix After fix
Active + non-consuming Pi run `delivered: false` (dead-end) Falls through to `callGateway` → queue-fallback
`callGateway` returns visible payload Unreachable `delivered: true, path: "direct"`
`callGateway` returns empty + queue available Unreachable `delivered: true, path: "queued"`

Contract preserved

  • No raw child-output sends (no `sendMessage` fallback)
  • All delivery goes through requester-agent handoff (`callGateway`) or queue-retry
  • `SubagentDeliveryPath` union unchanged (`"queued" | "steered" | "direct" | "none"`)
  • Consistent with PR fix(agents): clean subagent fallback scaffolding #78700 delivery contract

Syntax checks

$ node --check src/agents/subagent-announce-delivery.ts  # PASS
$ node --check src/agents/subagent-announce-delivery.test.ts  # PASS

Note: This is source-level proof from running the test suite on the PR branch. Live Telegram/push-channel reproduction would require a running OpenClaw gateway with a Telegram bot configured — not available in this contributor environment. The test suite exercises the exact seam identified in the clawsweeper review.

Re-review progress:

@yozakura-ava

Copy link
Copy Markdown
Contributor Author

The Real behavior proof CI check requires live runtime evidence (screenshots, logs, recordings) or a maintainer proof: override.

This contributor environment does not have a live Telegram bot or push-channel setup to produce runtime screenshots. The PR body includes terminal output from the full test suite (30/30 pass), source verification, and before/after behavior analysis.

Requesting maintainer review for proof: override if the source-level evidence is sufficient, or guidance on what additional evidence would be needed.

@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. mantis: telegram-visible-proof Mantis should capture Telegram visible proof. labels May 10, 2026
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 10, 2026
yozakura-ava and others added 2 commits May 10, 2026 13:59
…eway fallthrough

When a subagent completes and the parent session has an active but
non-consuming embedded Pi run (between turns, idle), the completion
announcement was silently dropped instead of being delivered.

The early return at the 'if (requesterActivity.isActive)' block returned
{ delivered: false } as a dead-end, preventing fallthrough to the
requester-agent handoff (callGateway with expectFinal: true) that
exists later in the function.

Removing the early return allows the code to reach callGateway, which
starts a proper new agent turn that rewrites and delivers the child
result through the requester session — preserving the delivery contract
established by PR openclaw#78700.

No new code, types, or dependencies. The callGateway path was always
there; we just stopped blocking it.

Fixes openclaw#79053
Co-Authored-By: Paperclip <noreply@paperclip.ing>
@steipete

Copy link
Copy Markdown
Contributor

Closing as superseded by #82834, merged in d887eb8.

That landed the active-requester wake fallthrough from this PR, added regression coverage that callGateway is reached after the wake attempt, and preserved contributor credit. The broader timeout/reconcile issues remain tracked separately.

@steipete steipete closed this May 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling mantis: telegram-visible-proof Mantis should capture Telegram visible proof. proof: sufficient ClawSweeper judged the real behavior proof convincing. proof: supplied External PR includes structured after-fix real behavior proof. size: XS

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: Subagent completion announcements dropped when parent session is idle (reopened, upstream restructured)

2 participants