fix(tts): propagate audioAsVoice from tool results to delivery payload #46535

Closed
azade-c wants to merge 1 commit into openclaw:main from azade-c:fix/tts-tool-audio-as-voice

azade-c (Contributor) commented Mar 14, 2026

Problem

When the TTS tool returns [[audio_as_voice]] alongside a MEDIA: path in its tool result, the voice-bubble flag is lost during tool result extraction. This causes TTS audio to be sent as a file attachment instead of a voice bubble on Telegram (and other voice-capable channels like WhatsApp/Feishu).

Root cause

extractToolResultMediaPaths() only extracts file paths from tool results, discarding the audioAsVoice field that splitMediaFromOutput() correctly parses from [[audio_as_voice]] tags.

The extracted media is then passed to onToolResult({ mediaUrls }) without the voice flag, so Telegram's sendVoice path is never triggered — it falls through to sendAudio (file attachment).

Two code paths are affected:

  1. Non-verbose path (emitToolResultOutput in handlers.tools.ts): uses extractToolResultMediaPaths → loses audioAsVoice
  2. Verbose path (emitToolResultMessage in pi-embedded-subscribe.ts): calls parseReplyDirectives which returns audioAsVoice, but the field was not propagated to the onToolResult payload
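To make the root cause concrete, here is a minimal sketch of the delivery decision. The `ToolResultPayload` shape and the `pickTelegramAudioMethod` helper are assumptions made up for illustration, not the project's actual delivery code; `sendVoice` and `sendAudio` are the two methods named above.

```typescript
// Hypothetical payload shape, modelled on onToolResult({ mediaUrls, audioAsVoice }).
interface ToolResultPayload {
  mediaUrls: string[];
  audioAsVoice?: boolean;
}

type TelegramAudioMethod = "sendVoice" | "sendAudio";

// Hypothetical helper: a voice bubble is chosen only when the flag is present.
function pickTelegramAudioMethod(payload: ToolResultPayload): TelegramAudioMethod {
  return payload.audioAsVoice === true ? "sendVoice" : "sendAudio";
}

// Before the fix the payload carried only mediaUrls, so the flag was never seen.
const beforeFix = pickTelegramAudioMethod({ mediaUrls: ["/tmp/speech.mp3"] });
const afterFix = pickTelegramAudioMethod({
  mediaUrls: ["/tmp/speech.mp3"],
  audioAsVoice: true,
});
console.log(beforeFix, afterFix);
```

Under these assumptions, any extraction step that drops `audioAsVoice` forces the file-attachment branch regardless of what the tool emitted.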

Fix

  • Add extractToolResultMedia() that returns { paths: string[], audioAsVoice?: boolean }
  • Keep extractToolResultMediaPaths() as a backward-compatible wrapper
  • Propagate audioAsVoice to the onToolResult payload in both code paths
  • Add tests covering [[audio_as_voice]] detection in tool results
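The shape of the fix can be sketched as follows. This is a simplified illustration, not the PR's code: it assumes the input is plain text and uses a naive line-based parser as a stand-in for splitMediaFromOutput(), whereas the real function operates on a tool result object.

```typescript
interface ExtractedToolResultMedia {
  paths: string[];
  audioAsVoice?: boolean;
}

const VOICE_TAG = "[[audio_as_voice]]";
const MEDIA_PREFIX = "MEDIA:";

// Simplified stand-in: collect MEDIA: paths and note whether the voice tag appears.
function extractToolResultMedia(
  text: string | null | undefined,
): ExtractedToolResultMedia {
  if (!text) {
    return { paths: [] };
  }
  const audioAsVoice = text.includes(VOICE_TAG);
  const paths: string[] = [];
  for (const line of text.split("\n")) {
    const trimmed = line.trim();
    if (trimmed.startsWith(MEDIA_PREFIX)) {
      const path = trimmed.slice(MEDIA_PREFIX.length).trim();
      if (path) {
        paths.push(path);
      }
    }
  }
  // Attach the flag only when set, so the result shape is unchanged otherwise.
  return { paths, ...(audioAsVoice ? { audioAsVoice: true } : {}) };
}

// Backward-compatible wrapper: existing callers keep receiving string[].
function extractToolResultMediaPaths(text: string | null | undefined): string[] {
  return extractToolResultMedia(text).paths;
}
```

Keeping the wrapper means call sites that only need paths do not have to change, while the two affected code paths can switch to the richer return value.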

Testing

  • All 29 existing + 4 new tests pass in pi-embedded-subscribe.tools.media.test.ts
  • Manually verified: a TTS tool result of [[audio_as_voice]]\nMEDIA:path.mp3 previously arrived as a file attachment on Telegram; after the fix, audioAsVoice: true is correctly set on the delivery payload

When the TTS tool returns [[audio_as_voice]] alongside a MEDIA: path,
the voice-bubble flag was lost during tool result extraction. The
extractToolResultMediaPaths function only returned file paths, discarding
the audioAsVoice directive parsed by splitMediaFromOutput.

This caused TTS audio to be sent as a file attachment instead of a voice
bubble on Telegram (and other voice-capable channels).

Changes:
- Add extractToolResultMedia() returning { paths, audioAsVoice? }
- Keep extractToolResultMediaPaths() as backward-compatible wrapper
- Propagate audioAsVoice in emitToolResultOutput (non-verbose path)
- Propagate audioAsVoice in emitToolResultMessage (verbose path)
- Add tests for extractToolResultMedia audioAsVoice detection
openclaw-barnacle (bot) added the labels "agents" (Agent runtime and tooling) and "size: S" on Mar 14, 2026
greptile-apps (bot, Contributor) commented Mar 14, 2026

Greptile Summary

This PR fixes a bug where the audioAsVoice flag (emitted by the TTS tool via [[audio_as_voice]] tags) was lost during tool result processing, causing TTS audio to be delivered as a file attachment rather than a voice bubble on Telegram and other voice-capable channels.

Changes:

  • Introduces extractToolResultMedia() in pi-embedded-subscribe.tools.ts that returns both paths and audioAsVoice, with extractToolResultMediaPaths() retained as a backward-compatible wrapper — clean, minimal API surface change.
  • Non-verbose path (handlers.tools.ts): switches from extractToolResultMediaPaths to extractToolResultMedia and propagates audioAsVoice to the onToolResult payload.
  • Verbose path (pi-embedded-subscribe.ts): adds audioAsVoice to the parseReplyDirectives destructuring and propagates it to onToolResult.
  • Adds 4 targeted unit tests covering tag detection, tag absence, multi-block detection, and null input.

Minor observations:

  • The old JSDoc that described the extraction strategy for extractToolResultMediaPaths is now positioned above the ExtractedToolResultMedia interface after the refactor, making it read as interface documentation when it was originally function documentation. The new extractToolResultMedia function has its own correct JSDoc, so the block above the interface is redundant.
  • The audioAsVoice flag is silently dropped in the image details.path fallback branch when no MEDIA: paths are found. This is not a real issue for TTS (which always emits a MEDIA: path alongside the tag), but it is a latent inconsistency for future callers.

Confidence Score: 4/5

  • Safe to merge — the fix is well-scoped, both affected code paths are addressed, and existing tests pass alongside 4 new targeted tests.
  • The core logic change is correct and minimal. The non-verbose and verbose paths are both fixed consistently. The only concerns are a misplaced JSDoc comment (cosmetic) and a missing audioAsVoice propagation in an unlikely image-fallback edge case, neither of which affects the primary TTS use case described in the PR.
  • src/agents/pi-embedded-subscribe.tools.ts — minor: orphaned JSDoc above ExtractedToolResultMedia and audioAsVoice not carried through the details.path fallback branch.

Comments Outside Diff (1)

  1. src/agents/pi-embedded-subscribe.tools.ts, line 185-199 (link)

    Orphaned JSDoc comment above the interface

    The docblock that was originally written for extractToolResultMediaPaths (describing the extraction strategy and the "returns an empty array" note) now sits above ExtractedToolResultMedia rather than above either function. The new extractToolResultMedia function already has its own correct JSDoc, so the block above the interface is now redundant and misleading — it reads as documentation for the interface when it was really describing function behavior.

    Consider removing the old comment from above the interface (or relocating it), leaving only the JSDoc that's already above extractToolResultMedia.

```suggestion
export interface ExtractedToolResultMedia {
  paths: string[];
  audioAsVoice?: boolean;
}
```


Last reviewed commit: 5133827

Comment on lines 250 to 264:

      if (paths.length > 0) {
    -   return paths;
    +   return { paths, ...(audioAsVoice ? { audioAsVoice: true } : {}) };
      }

      // Fall back to details.path when image content exists but no MEDIA: text.
      if (hasImageContent) {
        const details = record.details as Record<string, unknown> | undefined;
        const p = typeof details?.path === "string" ? details.path.trim() : "";
        if (p) {
    -     return [p];
    +     return { paths: [p] };
        }
      }

    - return [];
    + return { paths: [] };
    }
audioAsVoice silently dropped in the image-fallback path

When paths.length === 0 but audioAsVoice was detected (e.g. a content block contains [[audio_as_voice]] but no valid MEDIA: token), the code falls through to the details.path image-fallback branch. Both return { paths: [p] } and the final return { paths: [] } discard the audioAsVoice flag.

The TTS tool always emits a MEDIA: path alongside the tag, so this won't bite in today's usage. But as an invariant it's fragile — future tools that emit [[audio_as_voice]] without a MEDIA: line (or where the MEDIA: token fails validation) would silently lose the flag. A defensive approach would be to carry audioAsVoice through:

  if (paths.length > 0) {
    return { paths, ...(audioAsVoice ? { audioAsVoice: true } : {}) };
  }

  // Fall back to details.path when image content exists but no MEDIA: text.
  if (hasImageContent) {
    const details = record.details as Record<string, unknown> | undefined;
    const p = typeof details?.path === "string" ? details.path.trim() : "";
    if (p) {
      return { paths: [p], ...(audioAsVoice ? { audioAsVoice: true } : {}) };
    }
  }

  return { paths: [], ...(audioAsVoice ? { audioAsVoice: true } : {}) };
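
One way to avoid repeating the conditional spread three times is a small helper. The sketch below is an assumption layered on the suggestion above: `withVoiceFlag` and `finishExtraction` are hypothetical names, and the surrounding function is reduced to its tail so the example is self-contained.

```typescript
interface ExtractedToolResultMedia {
  paths: string[];
  audioAsVoice?: boolean;
}

// Hypothetical helper: attach the flag only when it was detected.
function withVoiceFlag(
  paths: string[],
  audioAsVoice: boolean,
): ExtractedToolResultMedia {
  return audioAsVoice ? { paths, audioAsVoice: true } : { paths };
}

// Hypothetical tail of the extraction, with its inputs passed in explicitly.
function finishExtraction(
  paths: string[],
  audioAsVoice: boolean,
  hasImageContent: boolean,
  details: Record<string, unknown> | undefined,
): ExtractedToolResultMedia {
  if (paths.length > 0) {
    return withVoiceFlag(paths, audioAsVoice);
  }
  // Fall back to details.path when image content exists but no MEDIA: text.
  if (hasImageContent) {
    const raw = details?.["path"];
    const p = typeof raw === "string" ? raw.trim() : "";
    if (p) {
      return withVoiceFlag([p], audioAsVoice);
    }
  }
  return withVoiceFlag([], audioAsVoice);
}
```

With this arrangement every return path goes through the same helper, so a new branch cannot drop the flag by accident.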
chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5133827a52

    - void ctx.params.onToolResult({ mediaUrls: mediaPaths });
    + void ctx.params.onToolResult({
    +   mediaUrls: mediaPaths,
    +   ...(extracted.audioAsVoice ? { audioAsVoice: true } : {}),
P2: Gate audioAsVoice on surviving media after URL filtering

audioAsVoice is propagated even though media URLs are filtered separately, so the flag can survive while its intended audio file is removed. In a mixed tool result (for example [[audio_as_voice]] with both a filtered local path and a remaining remote image URL), the remaining non-audio media is still marked as voice and downstream delivery can take the wrong path; Discord unconditionally routes audioAsVoice payloads through its voice API (extensions/discord/src/monitor/reply-delivery.ts:352), which can fail for non-audio media. Please clear/recompute audioAsVoice after filtering so only retained audio media keep the voice flag.
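
A rough sketch of what recomputing the flag after filtering could look like. Everything here is an assumption for illustration: the `buildPayloadAfterFiltering` and `looksLikeAudio` names, the `isAllowed` callback, and the extension-based audio check (real code might inspect MIME types instead).

```typescript
interface ToolResultPayload {
  mediaUrls: string[];
  audioAsVoice?: boolean;
}

// Assumption: audio is recognised by file extension only.
const AUDIO_EXTENSIONS = [".mp3", ".ogg", ".opus", ".m4a", ".wav"];

function looksLikeAudio(url: string): boolean {
  // Ignore any query string or fragment before checking the extension.
  const withoutQuery = url.split(/[?#]/)[0] ?? "";
  const lower = withoutQuery.toLowerCase();
  return AUDIO_EXTENSIONS.some((ext) => lower.endsWith(ext));
}

// Filter first, then keep the voice flag only if some audio file survived.
function buildPayloadAfterFiltering(
  extractedPaths: string[],
  audioAsVoice: boolean,
  isAllowed: (url: string) => boolean,
): ToolResultPayload {
  const mediaUrls = extractedPaths.filter(isAllowed);
  const keepVoice = audioAsVoice && mediaUrls.some(looksLikeAudio);
  return keepVoice ? { mediaUrls, audioAsVoice: true } : { mediaUrls };
}
```

In the mixed case described above (a filtered local audio path plus a retained remote image), this sketch returns the image without the voice flag. A stricter variant could require every retained URL to be audio rather than at least one.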

@steipete (Contributor)

Maintainer deep-review update from current main:

This is mostly superseded for the built-in tts tool, but not exactly identical to the original generic legacy-path fix.

Current code now returns structured media from the TTS tool: src/agents/tools/tts-tool.ts puts the audio path under details.media.mediaUrl, with trustedLocalMedia: true and audioAsVoice: true when the synthesized result is voice-compatible. src/agents/pi-embedded-subscribe.tools.ts reads that structured media in extractToolResultMediaArtifact(), including audioAsVoice. The focused agent tests pass on current main:

pnpm test src/agents/pi-embedded-subscribe.handlers.tools.media.test.ts src/agents/pi-embedded-subscribe.handlers.tools.test.ts

Result: 2 files / 46 tests passed.
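
For readers unfamiliar with the structured shape described above, here is a minimal sketch of reading it defensively. The field names (details.media.mediaUrl, trustedLocalMedia, audioAsVoice) come from the comment; the `readStructuredMedia` function, the `MediaArtifact` type, and the validation rules are assumptions, not the implementation of extractToolResultMediaArtifact().

```typescript
interface MediaArtifact {
  mediaUrls: string[];
  audioAsVoice?: boolean;
  trustedLocalMedia?: boolean;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Hypothetical reader: returns undefined when no structured media is present.
function readStructuredMedia(toolResult: unknown): MediaArtifact | undefined {
  if (!isRecord(toolResult)) {
    return undefined;
  }
  const details = toolResult["details"];
  if (!isRecord(details)) {
    return undefined;
  }
  const media = details["media"];
  if (!isRecord(media)) {
    return undefined;
  }
  const mediaUrl = media["mediaUrl"];
  if (typeof mediaUrl !== "string" || mediaUrl.trim() === "") {
    return undefined;
  }
  return {
    mediaUrls: [mediaUrl.trim()],
    // Flags are copied only when explicitly true.
    ...(media["audioAsVoice"] === true ? { audioAsVoice: true } : {}),
    ...(media["trustedLocalMedia"] === true ? { trustedLocalMedia: true } : {}),
  };
}
```

Because the flag travels as a typed field rather than a text tag, this route does not depend on the legacy text parser at all.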

Caveat: the legacy text-only path still calls splitMediaFromOutput(entry.text) but currently keeps only parsed.mediaUrls; it does not propagate parsed.audioAsVoice. So if this PR is meant to support arbitrary legacy tool-result text like [[audio_as_voice]]\nMEDIA:/tmp/file.opus, that part is still not fully covered by main.

Recommended next step: either close this as superseded by structured TTS media if the original production issue was only the built-in tts tool, or narrow/rebase it to add the missing legacy parsed.audioAsVoice propagation plus one focused regression test.

@steipete (Contributor)

Thanks @azade-c. I deep-reviewed this against current main and landed the remaining real gap directly:

60f9358348

What landed:

  • extractToolResultMediaArtifact() now preserves [[audio_as_voice]] from legacy trusted tool-result text that also emits MEDIA:.
  • The flag is also preserved when the voice hint and media path arrive in separate text blocks.
  • The delivery handler now has regression coverage proving trusted tts legacy MEDIA: output queues pendingToolAudioAsVoice=true.
  • Rich output protocol docs and changelog were updated.
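
The separate-blocks case can be illustrated with a small sketch. It is not the landed patch: the `ContentBlock` types, the `collectLegacyMedia` name, and the line-based parsing are assumptions, and the trust check mentioned above is omitted.

```typescript
interface TextBlock {
  type: "text";
  text: string;
}
interface ImageBlock {
  type: "image";
  data: string;
}
type ContentBlock = TextBlock | ImageBlock;

// Hypothetical aggregation: the tag and the MEDIA: line may sit in different blocks.
function collectLegacyMedia(
  content: ContentBlock[],
): { mediaUrls: string[]; audioAsVoice: boolean } {
  const mediaUrls: string[] = [];
  let sawVoiceTag = false;
  for (const block of content) {
    if (block.type !== "text") {
      continue;
    }
    if (block.text.includes("[[audio_as_voice]]")) {
      sawVoiceTag = true;
    }
    for (const line of block.text.split("\n")) {
      const trimmed = line.trim();
      if (trimmed.startsWith("MEDIA:")) {
        const path = trimmed.slice("MEDIA:".length).trim();
        if (path) {
          mediaUrls.push(path);
        }
      }
    }
  }
  // Keep the flag only when some MEDIA: path was emitted as well.
  return { mediaUrls, audioAsVoice: sawVoiceTag && mediaUrls.length > 0 };
}
```

Tracking the tag across the whole loop, rather than per block, is what lets a hint in one block apply to a path in another.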

Why I closed instead of merging this PR: current main already had the structured details.media.audioAsVoice path, and the only remaining gap was the legacy text parser path. The landed patch keeps the fix scoped to the existing parser contract instead of adding a parallel ad hoc path.

Validation:

  • pnpm test src/agents/pi-embedded-subscribe.tools.media.test.ts src/agents/pi-embedded-subscribe.handlers.tools.media.test.ts -> 2 files / 63 tests passed
  • pnpm test src/agents/pi-embedded-subscribe.tools.media.test.ts src/agents/pi-embedded-subscribe.handlers.tools.media.test.ts src/agents/pi-embedded-subscribe.handlers.messages.test.ts -> 3 files / 86 tests passed
  • pnpm tsgo:core and pnpm tsgo:core:test completed in pnpm check:changed before the known local oxlint lock wedge
  • pnpm lint:core
  • pnpm check:docs
  • git diff --check
  • live .profile OpenAI TTS smoke: pnpm openclaw infer tts convert --local --json --model openai/gpt-4o-mini-tts --voice alloy ... succeeded and wrote a 58K MP3

Note: pnpm check:changed still self-deadlocked at lint:core because its oxlint child waited on the heavy-check lock held by the wrapper. I decomposed and ran the same relevant lanes directly.

@steipete (Contributor)

Superseded by current main in 60f9358348.

steipete closed this on Apr 25, 2026