fix: Subagent completion direct announce often fails with no visible reply #82804

Closed
galiniliev wants to merge 3 commits into openclaw:main from galiniliev:bug-001-subagent-completion-direct-announce

@galiniliev galiniliev commented May 17, 2026

Contributor

Summary

  • Problem: completed subagent announcements could fail with "completion agent did not produce a visible reply" after the requester wake path hit a stale session id (queue_message_failed reason=no_active_run).
  • Why it matters: the child run may have completed with usable output, but the requester can still see no visible completion update if both wake routing and the automatic direct handoff dead-end.
  • What changed: when the initial requester wake fails with no_active_run and the automatic completion-agent handoff returns no visible payload, OpenClaw retries the requester-agent handoff once with sourceReplyDeliveryMode: "message_tool_only" and deliver: false.
  • What did NOT change (scope boundary): no raw child completion output is sent to external chat; the requester agent still mediates the final visible update, grouped child-result guardrails remain mediated, and generated-media message-tool enforcement keeps its existing contract.
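The retry condition described above can be sketched as a small pure function. This is a minimal illustration, not the code in this PR: `WakeOutcome`, `HandoffResult`, `decideNextStep`, and the `"unchanged_path"` label are hypothetical names, and only the `no_active_run` reason and the message-tool-only retry come from the description.

```typescript
// Hypothetical types for illustration; only the "no_active_run" reason
// is taken from the PR description.
type WakeOutcome =
  | { queued: true }
  | { queued: false; reason: "no_active_run" | "other" };

interface HandoffResult {
  // Visible reply payloads produced by the automatic completion-agent handoff.
  payloads: unknown[];
}

type NextStep = "delivered" | "retry_message_tool_only" | "unchanged_path";

function decideNextStep(
  wake: WakeOutcome,
  firstHandoff: HandoffResult,
  alreadyRetried: boolean,
): NextStep {
  // A visible payload means the first handoff already produced an update.
  if (firstHandoff.payloads.length > 0) return "delivered";

  // Only the stale-wake case qualifies for the stricter retry, and only once.
  const staleWake = wake.queued === false && wake.reason === "no_active_run";
  if (staleWake && !alreadyRetried) return "retry_message_tool_only";

  // Every other empty-handoff case keeps whatever handling existed before.
  return "unchanged_path";
}
```

Keeping the decision separate from the I/O makes the "retry once, stale wake only" rule easy to assert without a live provider.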

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

Real behavior proof (required for external PRs)

Behavior addressed: Completed subagent direct announces no longer dead-end at the reported no_active_run plus no-visible-payload path. After the stale requester wake fails and the automatic direct handoff has no visible output, the runtime now retries a mediated requester-agent handoff that requires message-tool delivery instead of raw-sending child output.

Real environment tested: Windows local Codex worktree based on origin/main c2e9091, Node v24.15.0. Dependencies were partially installed after pnpm install hit an esbuild postinstall spawn EPERM; the direct Vitest entry was available and used for the focused delivery seam.

Exact steps or command run after this patch: $env:OPENCLAW_VITEST_MAX_WORKERS='1'; node node_modules\vitest\vitest.mjs run src/agents/subagent-announce-delivery.test.ts --reporter=dot

Evidence after fix: Copied terminal capture from the post-review focused delivery run:

RUN  v4.1.6 C:/OpenClaw/worktrees/bug-001-subagent-completion-direct-announce

Test Files  2 passed (2)
Tests  86 passed (86)
Start at  18:50:08
Duration  27.03s (transform 12.67s, setup 967ms, import 15.41s, tests 10.11s, environment 0ms)

The updated assertions simulate queueEmbeddedPiMessageWithOutcome returning reason: "no_active_run", then an automatic direct completion handoff with empty payloads. The runtime performs a second requester-agent handoff with deliver: false, sourceReplyDeliveryMode: "message_tool_only", and a :message-tool idempotency key; the test only marks delivery successful when the second handoff reports committed message-tool evidence. The fallback sendMessage mock is not called.
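As a rough sketch of what the second handoff request and its success gate might look like: the field names `deliver` and `sourceReplyDeliveryMode` and the `:message-tool` key fragment come from the paragraph above, while `HandoffRequest`, `RetryResult`, `messageToolCommitted`, both function names, and the choice to append `:message-tool` as a suffix are assumptions made for illustration.

```typescript
// Hypothetical request/result shapes; not the actual OpenClaw types.
interface HandoffRequest {
  idempotencyKey: string;
  deliver: boolean;
  sourceReplyDeliveryMode?: "message_tool_only";
}

interface RetryResult {
  // Assumed flag: true when the requester agent committed a send
  // through the message tool during the retry.
  messageToolCommitted: boolean;
}

function buildMessageToolRetry(first: HandoffRequest): HandoffRequest {
  return {
    ...first,
    deliver: false,
    sourceReplyDeliveryMode: "message_tool_only",
    // Assumption: a distinct key so the retry is not deduplicated
    // against the first, empty handoff.
    idempotencyKey: `${first.idempotencyKey}:message-tool`,
  };
}

function isDelivered(result: RetryResult): boolean {
  // No committed message-tool evidence means delivery stays failed.
  return result.messageToolCommitted;
}
```

Under these assumptions, a retry that returns without committed message-tool evidence is reported as not delivered, which matches the behavior described in the observed result below.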

Observed result after fix: single completed subagent thread/channel cases with stale requester runs now complete through a mediated message-tool-only retry when the first direct handoff is empty. If the retry still lacks message-tool evidence, delivery remains failed and queued for the existing retry/give-up machinery instead of raw-sending child output.

What was not tested: no live gateway/provider/channel rerun was performed. The after-fix proof is local delivery-seam execution, not a private live session replay.

Before evidence: raw runtime log excerpt from the affected gateway trace that this patch addresses:

Trace/proof:
- gateway-dev.log:27070
  "Subagent completion direct announce failed for run c73d9446-0a7f-422d-a904-4f0a5e92b556: completion agent did not produce a visible reply"
  traceId=be3befc660e5cba4364d3d60bdbcc9a9 spanId=c0e47fb6cf0e786b
- Neighboring same trace:
  gateway-dev.log:27069 "queue message failed: sessionId=4d1ec534-2295-41cf-b55a-9300cc14f1f1 reason=no_active_run"

Root Cause (if applicable)

  • Root cause: sendSubagentAnnounceDirectly detected the empty automatic completion-agent handoff but did not distinguish the reported stale requester wake (no_active_run) from ordinary no-visible output, so the path could fail without trying the stricter message-tool-only mediated handoff.
  • Missing detection / guardrail: coverage previously proved only the already-existing direct-then-steer branch. It did not cover no_active_run followed by an empty automatic direct handoff.
  • Contributing context (if known): prior fallback scaffolding that raw-sent child output was removed in 92284bc (fix(agents): clean subagent fallback scaffolding, #78700); this patch keeps the repair within the requester-agent delivery contract documented for no-output handoffs.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/agents/subagent-announce-delivery.test.ts
  • Scenario the test should lock in: a stale requester wake returns no_active_run, the automatic requester-agent completion handoff returns empty payloads, and the runtime retries the same mediated handoff with sourceReplyDeliveryMode: "message_tool_only" without raw-sending child completion text.
  • Why this is the smallest reliable guardrail: it exercises the delivery decision seam directly without requiring a live provider to intentionally produce an empty final response.
  • Existing test that already covers this (if any): existing no-visible-output tests covered the failure and the direct-then-steer fallback; this PR adds the stale-run message-tool retry behavior.
  • If no new test is added, why not: N/A

User-visible / Behavior Changes

Completed subagent announcements that previously dead-ended after a stale requester wake and empty automatic handoff can now be retried through a message-tool-only requester-agent handoff, producing a visible update when the requester agent sends through the message tool.

Diagram (if applicable)

Before:
[subagent completed] -> [wake requester: no_active_run] -> [automatic direct handoff: empty payload] -> [delivery failure]

After:
[subagent completed] -> [wake requester: no_active_run] -> [automatic direct handoff: empty payload] -> [message-tool-only requester handoff] -> [visible requester update]
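The two flows in the diagram can be reproduced with a small simulation that records each step. This is a sketch under stated assumptions: `Deps`, `HandoffOutcome`, `announceCompletion`, the `retryEnabled` switch, and the `"automatic"` mode label are all hypothetical and exist only to contrast the before and after paths.

```typescript
// Hypothetical stand-ins for the real wake and handoff calls.
interface HandoffOutcome {
  payloads: string[];
  messageToolCommitted: boolean;
}

interface Deps {
  wakeRequester: () => { queued: boolean; reason?: string };
  handoff: (mode: "automatic" | "message_tool_only") => HandoffOutcome;
}

// Returns the sequence of steps taken, mirroring the diagram boxes.
function announceCompletion(deps: Deps, retryEnabled: boolean): string[] {
  const trace: string[] = ["subagent completed"];

  const wake = deps.wakeRequester();
  if (wake.queued) {
    trace.push("wake requester: queued");
    return trace;
  }
  trace.push(`wake requester: ${wake.reason ?? "failed"}`);

  const first = deps.handoff("automatic");
  if (first.payloads.length > 0) {
    trace.push("automatic direct handoff: visible payload");
    return trace;
  }
  trace.push("automatic direct handoff: empty payload");

  // After this patch: one stricter retry, only for the stale-wake case.
  if (retryEnabled && wake.reason === "no_active_run") {
    const retry = deps.handoff("message_tool_only");
    trace.push("message-tool-only requester handoff");
    trace.push(
      retry.messageToolCommitted ? "visible requester update" : "delivery failure",
    );
    return trace;
  }

  trace.push("delivery failure");
  return trace;
}
```

Calling `announceCompletion` with `retryEnabled` set to `false` and then `true`, using stubs that return `no_active_run` and an empty first handoff, yields the "Before" and "After" step sequences respectively.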

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: Windows local worktree for the regression test; original gateway log OS not grounded.
  • Runtime/container: Node v24.15.0 for local Vitest; OpenClaw current main c2e9091 before fix.
  • Model/provider: NOT_ENOUGH_INFO from the original log evidence.
  • Integration/channel (if any): regression covers Slack-style thread/channel delivery helpers; original log channel not grounded.
  • Relevant config (redacted): NOT_ENOUGH_INFO

Steps

  1. Run the focused delivery regression file.
  2. In the single-completion cases, mock the requester wake queue attempt as reason: "no_active_run".
  3. Mock the automatic requester-agent direct handoff as { result: { payloads: [] } }.
  4. Verify the runtime makes a second requester-agent handoff with deliver: false and sourceReplyDeliveryMode: "message_tool_only".
  5. Verify delivery succeeds only when that second handoff reports committed message-tool evidence, and verify the raw sendMessage fallback mock is not called.

Expected

  • Stale requester wake plus empty automatic direct handoff retries through a mediated message-tool-only requester-agent handoff.
  • Raw child output is not sent directly to external chat.
  • Grouped and media completion guardrails remain enforced.

Actual

  • Matches expected after this patch.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios: focused delivery regression file passed locally with 86 tests across both configured Vitest projects.
  • Edge cases checked: stale-run no-visible single thread/channel completions retry through message-tool-only handoff; grouped child-result fallback remains mediated and does not raw-send; generated-media message-tool enforcement remains covered by existing tests in the same file.
  • What you did not verify: live provider/channel behavior requiring private sessions or credentials.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No
  • If yes, exact upgrade steps: N/A

Risks and Mitigations

  • Risk: the message-tool-only retry can still fail if the requester agent does not send through the message tool.
    • Mitigation: that failure remains explicit and feeds the existing retry/give-up machinery instead of bypassing the requester-agent contract with raw child output.

openclaw-barnacle Bot added the labels agents (Agent runtime and tooling), size: S, and maintainer (Maintainer-authored PR) on May 17, 2026
clawsweeper Bot commented May 17, 2026

Contributor

Codex review: needs real behavior proof before merge.

Summary
Review failed before ClawSweeper could summarize the requested change.

Reproducibility: unclear. The review failed before ClawSweeper could establish a reproduction path.

Real behavior proof
Not applicable: Real behavior proof was not assessed because the Codex review failed.

Next step before merge
Review did not complete, so no work-lane recommendation was made.

Review details

Best possible solution:

Retry the Codex review after fixing the execution failure.

Do we have a high-confidence way to reproduce the issue?

Unclear. The review failed before ClawSweeper could establish a reproduction path.

Is this the best way to solve the issue?

Unclear. Retry the review first so ClawSweeper can evaluate the actual issue and fix direction.

What I checked:

  • failure reason: codex execution failed.
  • codex failure detail: Codex review failed for this PR with exit 1.
  • codex stdout: Per-item Codex failure; continuing with the rest of the shard.

Likely related people:

  • unknown: Codex failed before it could trace repository history. (role: review did not complete; confidence: low)

Remaining risk / open question:

  • No close action taken because the review did not complete.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 9e67f53b913a.

clawsweeper Bot added the labels mantis: telegram-visible-proof (Mantis should capture Telegram visible proof.) and P1 (High-priority user-facing bug, regression, or broken workflow.) on May 17, 2026
galiniliev changed the title from "fix: fallback subagent completion announces" to "fix: Subagent completion direct announce often fails with no visible reply" on May 17, 2026
galiniliev force-pushed the bug-001-subagent-completion-direct-announce branch from 72c68a2 to 97ee119 on May 17, 2026 02:11
clawsweeper Bot removed the label mantis: telegram-visible-proof (Mantis should capture Telegram visible proof.) on May 17, 2026
@steipete

Contributor

Closing as superseded by #82834.

Both PRs target #82803 and the same subagent announce delivery path. #82834 keeps the no-visible-reply fallback, adds the broader mediated/message-tool completion coverage, and updates docs/changelog, so that is the canonical review target.

Labels

  • agents: Agent runtime and tooling
  • maintainer: Maintainer-authored PR
  • P1: High-priority user-facing bug, regression, or broken workflow.
  • size: M

Development

Successfully merging this pull request may close these issues.

[Bug]: Subagent completion direct announce fails with no visible reply

2 participants