Skip to content

fix(agent): suppress tool-use assistant text#71589

Open
deepkilo wants to merge 7 commits into
openclaw:mainfrom
deepkilo:fix/feishu-group-final-only
Open

fix(agent): suppress tool-use assistant text#71589
deepkilo wants to merge 7 commits into
openclaw:mainfrom
deepkilo:fix/feishu-group-final-only

Conversation

@deepkilo

Copy link
Copy Markdown
Contributor

Summary

  • suppress assistant messages that are tool-use continuations (stopReason: "toolUse" or tool-call content) from user-visible assistant output
  • keep explicit final_answer phase output deliverable
  • add regression coverage for unphased tool-use assistant text so it does not emit block/partial replies or final assistant text

Fixes #71588.

Notes

The Feishu group trigger policy (groupPolicy=open / requireMention=false) can make this easier to hit because more group messages start agent runs, but it is not the direct root cause. The direct issue is one agent run exposing both intermediate tool-use assistant text and the later final answer as user-visible replies.

Validation

  • corepack pnpm vitest run src/agents/pi-embedded-subscribe.subscribe-embedded-pi-session.suppresses-commentary-phase-output.test.ts
  • corepack pnpm tsgo:core
  • corepack pnpm exec oxlint --tsconfig tsconfig.oxlint.core.json src/agents/pi-embedded-subscribe.handlers.messages.ts src/agents/pi-embedded-subscribe.subscribe-embedded-pi-session.suppresses-commentary-phase-output.test.ts
  • corepack pnpm exec oxfmt --check src/agents/pi-embedded-subscribe.handlers.messages.ts src/agents/pi-embedded-subscribe.subscribe-embedded-pi-session.suppresses-commentary-phase-output.test.ts

Full corepack pnpm lint:core currently fails on an unrelated pre-existing no-array-sort finding in src/agents/cli-runner.spawn.test.ts:1029.

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: S labels Apr 25, 2026
@greptile-apps

greptile-apps Bot commented Apr 25, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes a bug where agent runs were emitting both intermediate tool-use assistant text and the final answer as user-visible replies. It expands shouldSuppressAssistantVisibleOutput to handle unphased messages by checking for stopReason: "toolUse" or tool-call content blocks, and updates the mid-stream second gate in handleMessageUpdate to use the same predicate. A regression test is added for the unphased tool-use path.

Confidence Score: 4/5

Safe to merge; the fix is well-scoped and the two P2 findings are non-blocking.

Only P2-level findings: a gap in test coverage for the text_end channel and a theoretical miss of snake_case content-type variants. The core suppression logic is correct and the existing commentary phase behaviour is preserved.

No files require special attention beyond the P2 notes above.

Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/agents/pi-embedded-subscribe.subscribe-embedded-pi-session.suppresses-commentary-phase-output.test.ts
Line: 10-40

Comment:
**Missing `blockReplyBreak: "text_end"` coverage**

The new regression test only exercises the `blockReplyBreak: "message_end"` path. The `handleMessageUpdate` code has a separate `text_end`-channel flush at lines 617–634 (`ctx.blockChunker?.drain` and the deferred `flushBlockReplyBuffer`). A parallel fixture with `blockReplyBreak: "text_end"` would confirm those paths are also suppressed for unphased tool-use messages and guard against future regressions there.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/agents/pi-embedded-subscribe.handlers.messages.ts
Line: 40

Comment:
**Snake-case content-type variants not covered**

`TOOL_CALL_CONTENT_TYPES` contains `"toolCall"`, `"toolUse"`, and `"functionCall"` (camelCase). Some providers (e.g. OpenAI-compatible adapters) may emit `"tool_call"`, `"tool_use"`, or `"function_call"` (snake_case). If any upstream adapter normalises content block types to snake_case before emitting events, the content-block branch of `hasToolUseContinuation` would miss them. The `stopReason === "toolUse"` path would still fire, so this only matters when `stopReason` is absent and the detection falls back to content scanning.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix(agent): suppress tool-use assistant ..." | Re-trigger Greptile

Comment on lines +10 to +40
it("suppresses assistant messages that continue into tool use without phase metadata", () => {
const onBlockReply = vi.fn();
const onPartialReply = vi.fn();
const { emit, subscription } = createSubscribedSessionHarness({
runId: "run",
onBlockReply,
onPartialReply,
blockReplyBreak: "message_end",
});

const toolUseMessage = {
role: "assistant",
content: [
{ type: "text", text: "I'll check this now." },
{ type: "toolCall", id: "call-1", name: "read", input: { path: "README.md" } },
],
stopReason: "toolUse",
} as AssistantMessage;

emit({ type: "message_start", message: toolUseMessage });
emit({
type: "message_update",
message: toolUseMessage,
assistantMessageEvent: { type: "text_delta", delta: "I'll check this now." },
});
emit({ type: "message_end", message: toolUseMessage });

expect(onBlockReply).not.toHaveBeenCalled();
expect(onPartialReply).not.toHaveBeenCalled();
expect(subscription.assistantTexts).toEqual([]);
});

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Missing blockReplyBreak: "text_end" coverage

The new regression test only exercises the blockReplyBreak: "message_end" path. The handleMessageUpdate code has a separate text_end-channel flush at lines 617–634 (ctx.blockChunker?.drain and the deferred flushBlockReplyBuffer). A parallel fixture with blockReplyBreak: "text_end" would confirm those paths are also suppressed for unphased tool-use messages and guard against future regressions there.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/agents/pi-embedded-subscribe.subscribe-embedded-pi-session.suppresses-commentary-phase-output.test.ts
Line: 10-40

Comment:
**Missing `blockReplyBreak: "text_end"` coverage**

The new regression test only exercises the `blockReplyBreak: "message_end"` path. The `handleMessageUpdate` code has a separate `text_end`-channel flush at lines 617–634 (`ctx.blockChunker?.drain` and the deferred `flushBlockReplyBuffer`). A parallel fixture with `blockReplyBreak: "text_end"` would confirm those paths are also suppressed for unphased tool-use messages and guard against future regressions there.

How can I resolve this? If you propose a fix, please make it concise.

@@ -37,8 +37,37 @@ import {
promoteThinkingTagsToBlocks,
} from "./pi-embedded-utils.js";

const TOOL_CALL_CONTENT_TYPES = new Set(["toolCall", "toolUse", "functionCall"]);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Snake-case content-type variants not covered

TOOL_CALL_CONTENT_TYPES contains "toolCall", "toolUse", and "functionCall" (camelCase). Some providers (e.g. OpenAI-compatible adapters) may emit "tool_call", "tool_use", or "function_call" (snake_case). If any upstream adapter normalises content block types to snake_case before emitting events, the content-block branch of hasToolUseContinuation would miss them. The stopReason === "toolUse" path would still fire, so this only matters when stopReason is absent and the detection falls back to content scanning.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/agents/pi-embedded-subscribe.handlers.messages.ts
Line: 40

Comment:
**Snake-case content-type variants not covered**

`TOOL_CALL_CONTENT_TYPES` contains `"toolCall"`, `"toolUse"`, and `"functionCall"` (camelCase). Some providers (e.g. OpenAI-compatible adapters) may emit `"tool_call"`, `"tool_use"`, or `"function_call"` (snake_case). If any upstream adapter normalises content block types to snake_case before emitting events, the content-block branch of `hasToolUseContinuation` would miss them. The `stopReason === "toolUse"` path would still fire, so this only matters when `stopReason` is absent and the detection falls back to content scanning.

How can I resolve this? If you propose a fix, please make it concise.

@deepkilo deepkilo force-pushed the fix/feishu-group-final-only branch from 8aa46ee to 83f7a5d Compare April 25, 2026 13:48
@deepkilo

Copy link
Copy Markdown
Contributor Author

Rebased this branch on current main to pick up unrelated CI fixes that landed upstream after the previous run (notably the Array#toSorted() lint fix).\n\nNew head: 83f7a5dc1a.\n\nLocal validation after rebase:\n- corepack pnpm vitest run src/agents/pi-embedded-subscribe.subscribe-embedded-pi-session.suppresses-commentary-phase-output.test.ts\n- corepack pnpm tsgo:core\n- corepack pnpm lint:core\n- corepack pnpm exec oxfmt --check src/agents/pi-embedded-subscribe.handlers.messages.ts src/agents/pi-embedded-subscribe.subscribe-embedded-pi-session.suppresses-commentary-phase-output.test.ts

@deepkilo

Copy link
Copy Markdown
Contributor Author

Pushed a follow-up commit: e8f55349fe (fix(agent): clear suppressed commentary buffers).

This addresses the remaining gap behind #71588. The existing suppression path covers commentary/tool-use assistant text once the phase is already resolved, but it does not fully cover the delayed-phase stream case:

  • text_delta can arrive before the stream exposes phase: "commentary" via text_end / textSignature.
  • In that window, the visible assistant text may already have been appended into assistantTexts, blockBuffer, or partial/block chunking state.
  • Returning early in handleMessageEnd is then too late: the user-visible buffers can already be contaminated, which can still produce an intermediate/tool-use-looking reply alongside the final answer.

The new commit keeps the narrow suppression behavior, but additionally clears suppressed delta/block/chunker buffers and rolls assistantTexts back to the pre-block baseline when a commentary/tool-use assistant block is suppressed. It also expands the tool-call block guard to include snake_case variants (tool_call, tool_use, function_call).

Local validation:

  • corepack pnpm vitest run src/agents/pi-embedded-subscribe.subscribe-embedded-pi-session.suppresses-commentary-phase-output.test.ts
  • corepack pnpm tsgo:core
  • corepack pnpm lint:core
  • corepack pnpm exec oxfmt --check src/agents/pi-embedded-subscribe.handlers.messages.ts src/agents/pi-embedded-subscribe.subscribe-embedded-pi-session.suppresses-commentary-phase-output.test.ts

@deepkilo deepkilo force-pushed the fix/feishu-group-final-only branch 2 times, most recently from ba51b30 to 144cc6c Compare April 26, 2026 06:06
@openclaw-barnacle openclaw-barnacle Bot added the app: macos App: macos label Apr 26, 2026
@deepkilo deepkilo force-pushed the fix/feishu-group-final-only branch from 144cc6c to 3d3a202 Compare April 26, 2026 06:23
@openclaw-barnacle openclaw-barnacle Bot added size: M and removed app: macos App: macos size: S labels Apr 26, 2026
@clawsweeper

clawsweeper Bot commented Apr 26, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs real behavior proof before merge.

Latest ClawSweeper review: 2026-05-20 18:37 UTC / May 20, 2026, 2:37 PM ET.

Workflow note: Future ClawSweeper reviews update this same comment in place.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

Summary
The PR expands embedded agent assistant-output suppression for tool-use/commentary turns, preserves explicit final_answer streaming, and adds related agent, qa-lab, status-test, and lockfile changes.

Reproducibility: yes. Source inspection gives a high-confidence path: emit an assistant text_delta before phase/tool metadata is known while partial replies or text_end block delivery are enabled, then classify the same assistant turn as commentary/tool-use later.

PR rating
Overall: 🧂 unranked krab
Proof: 🧂 unranked krab
Patch quality: 🦪 silver shellfish
Summary: The PR targets a real bug, but it is not quality-ready because real behavior proof is missing, the branch conflicts with main, and the patch still has concrete message-delivery gaps.

Rank-up moves:

  • Fix the phase-pending delivery ordering and cover tool_calls/toolCalls structured content.
  • Rebase or otherwise resolve the main conflict and trim or justify unrelated qa-lab/status/lockfile maintenance.
  • Add redacted real channel proof showing the intermediate tool-use text is not emitted and the final answer still is.
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

Real behavior proof
Needs real behavior proof before merge: The PR body and comments provide tests and checks only; an external PR still needs redacted after-fix real Feishu, Telegram, or equivalent channel proof before merge. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Mantis proof suggestion
A real channel transcript would materially help prove the visible message-delivery behavior that unit tests cannot show. A maintainer can ask Mantis to capture proof by posting a new PR comment that starts with the OpenClaw Mantis account mention, followed by:

telegram live QA: verify a tool-use assistant prelude is not sent to chat while the final answer still arrives.

Risk before merge

  • A delayed-phase text_delta can still reach partial reply or text_end block delivery before a later event marks the same assistant turn as commentary/tool-use.
  • The fallback tool-use predicate misses supported tool_calls and toolCalls array shapes when stopReason is absent.
  • No redacted Feishu, Telegram, or equivalent real channel proof shows that intermediate tool-use text is suppressed while the final answer still arrives.
  • The live PR is conflicting with main and includes adjacent qa-lab, status-test, and lockfile changes beyond the central agent suppression fix.

Maintainer options:

  1. Quarantine Before Delivery (recommended)
    Hold phase-pending assistant text out of partial/block callbacks until commentary/tool-use versus final-answer visibility is known, then prove the real channel transcript only receives the final answer.
  2. Supersede With Canonical Leak Work
    If maintainers prefer a broader fix for the open tool/message leak cluster, preserve this PR’s Feishu reproduction and tests in that canonical path before closing the branch.

Next step before merge
Contributor proof and maintainer review are needed because the external PR lacks real channel proof, conflicts with main, and still has message-delivery gaps.

Security
Cleared: No concrete security or supply-chain regression was found; the lockfile change only adds a workspace importer and the remaining leak concern is covered by functional message-delivery findings.

Review findings

  • [P2] Quarantine unclassified chunks before emitting — src/agents/pi-embedded-subscribe.handlers.messages.ts:560-562
  • [P2] Cover array-shaped tool-call content — src/agents/pi-embedded-subscribe.handlers.messages.ts:87
Review details

Best possible solution:

Land one rebased embedded-subscribe suppression path that quarantines unclassified assistant text until visibility is known, recognizes the repo’s structured tool-call shapes, preserves explicit final_answer delivery, and includes focused regression coverage plus redacted real channel proof.

Do we have a high-confidence way to reproduce the issue?

Yes. Source inspection gives a high-confidence path: emit an assistant text_delta before phase/tool metadata is known while partial replies or text_end block delivery are enabled, then classify the same assistant turn as commentary/tool-use later.

Is this the best way to solve the issue?

No. The handler is the right surface, but clearing buffers only after classification is too late for already emitted chunks; the safer fix is to quarantine phase-pending text before visible callbacks and cover the full structured tool-call contract.

Label justifications:

  • P2: This is a normal-priority user-visible message-delivery bug fix with limited blast radius but real channel impact.
  • merge-risk: 🚨 message-delivery: The PR changes when assistant text is emitted or suppressed, so a wrong predicate can leak or drop visible channel replies despite green unit tests.
  • rating: 🧂 unranked krab: Current PR rating is 🧂 unranked krab because proof is 🧂 unranked krab, patch quality is 🦪 silver shellfish, and The PR targets a real bug, but it is not quality-ready because real behavior proof is missing, the branch conflicts with main, and the patch still has concrete message-delivery gaps.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR body and comments provide tests and checks only; an external PR still needs redacted after-fix real Feishu, Telegram, or equivalent channel proof before merge. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.
  • mantis: telegram-visible-proof: Mantis should capture Telegram visible proof. The generic agent delivery change affects visible Telegram chat behavior and can be demonstrated with a short real transcript even though it does not edit Telegram-specific files.

Full review comments:

  • [P2] Quarantine unclassified chunks before emitting — src/agents/pi-embedded-subscribe.handlers.messages.ts:560-562
    The patch only clears buffers after shouldSuppressAssistantVisibleOutput(partialAssistant) becomes true. A delayed-phase text_delta without phase/tool metadata can still emit through partial replies or text_end block chunking before a later text_end/message_end marks the same assistant message as commentary/tool-use, so users can still see the intermediate reply this PR is meant to suppress.
    Confidence: 0.86
  • [P2] Cover array-shaped tool-call content — src/agents/pi-embedded-subscribe.handlers.messages.ts:87
    The fallback content scan only checks block.type, but current diagnostics also treat tool_calls and toolCalls arrays as structured tool invocation evidence. If an adapter emits those shapes without stopReason, this PR still treats the prelude as visible assistant text.
    Confidence: 0.78

Overall correctness: patch is incorrect
Overall confidence: 0.86

What I checked:

Likely related people:

  • steipete: Recent GitHub history shows repeated work on the embedded subscribe handler, tool-text diagnostics, streaming/display projection, and adjacent reply delivery behavior. (role: recent area contributor; confidence: high; commits: 694ca50e9775, f9181835e870, 5f2273e81efc; files: src/agents/pi-embedded-subscribe.handlers.messages.ts, src/agents/pi-embedded-subscribe.tool-text-diagnostics.ts)
  • mbelinky: GitHub history ties the commentary-phase suppression test path to commit 4175cae, which is the current suppression behavior this PR extends. (role: introduced related suppression baseline; confidence: medium; commits: 4175caee6e95; files: src/agents/pi-embedded-subscribe.subscribe-embedded-pi-session.suppresses-commentary-phase-output.test.ts)
  • pashpashpash: Recent history shows adjacent work on streamed media and channel delivery behavior in the same handler family, which overlaps the block-reply delivery surface affected here. (role: adjacent streaming/reply contributor; confidence: medium; commits: 5404bbbb7196, 1a5548203e09; files: src/agents/pi-embedded-subscribe.handlers.messages.ts, docs/concepts/streaming.md)

Codex review notes: model gpt-5.5, reasoning high; reviewed against 447a3643c69b.

@openclaw-barnacle openclaw-barnacle Bot added extensions: qa-lab channel: telegram Channel integration: telegram commands Command implementations labels Apr 26, 2026
@deepkilo deepkilo force-pushed the fix/feishu-group-final-only branch from 844c674 to edddde9 Compare April 26, 2026 08:22
@openclaw-barnacle openclaw-barnacle Bot removed the channel: telegram Channel integration: telegram label Apr 26, 2026
@deepkilo deepkilo force-pushed the fix/feishu-group-final-only branch from d02542f to 07b0b62 Compare April 26, 2026 08:32
@deepkilo deepkilo force-pushed the fix/feishu-group-final-only branch from 07b0b62 to a6f2ffc Compare April 26, 2026 09:30
@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P2 Normal backlog priority with limited blast radius. merge-risk: 🚨 message-delivery 🚨 May drop, duplicate, misroute, suppress, or wrongly target messages. labels May 20, 2026
@clawsweeper

clawsweeper Bot commented May 20, 2026

Copy link
Copy Markdown
Contributor

ClawSweeper PR egg

🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat.

Where did the egg go?
  • The egg game starts only after the PR passes the real-behavior proof check.
  • Before that, no creature or rarity is rolled. The treat waits for real proof.
  • This is still just collectible flavor: proof affects review readiness, not creature quality.

@openclaw-barnacle openclaw-barnacle Bot added the triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. label May 20, 2026
@clawsweeper clawsweeper Bot added the mantis: telegram-visible-proof Mantis should capture Telegram visible proof. label May 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling commands Command implementations extensions: qa-lab mantis: telegram-visible-proof Mantis should capture Telegram visible proof. merge-risk: 🚨 message-delivery 🚨 May drop, duplicate, misroute, suppress, or wrongly target messages. P2 Normal backlog priority with limited blast radius. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. size: M status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feishu group can double reply when tool-use turns emit visible assistant text

1 participant