feat(matrix): transcribe inbound voice notes before mention gate by frankdierolf · Pull Request #78069 · openclaw/openclaw

frankdierolf · 2026-05-05T21:09:48Z

Summary

Problem: Inbound voice messages on Matrix reach the agent as raw audio attachments with no transcript. The model improvises a reply instead of answering, and requireMention: true rooms drop voice notes entirely because there's no text to match the mention regex.
Why it matters: Discord, Telegram, WhatsApp, and Feishu already transcribe inbound voice via transcribeFirstAudio before the mention gate. Matrix is the lone holdout — voice-only Matrix users have no way to talk to their agent.
What changed: Wire the existing transcribeFirstAudio helper into the Matrix monitor handler before the mention gate, mirroring the Discord/Telegram pattern. The transcript feeds into the mention check (so a voice note that says the bot's name reaches the agent in requireMention rooms) and into BodyForAgent (so the agent reads the transcript instead of a placeholder). MediaTranscribedIndexes is set so downstream tools don't re-transcribe the same audio.
What did NOT change (scope boundary): No core media-understanding changes. No new config keys (operators control via the existing global tools.media.audio.enabled). Outbound TTS untouched. E2EE crypto path untouched (decryption stays inside downloadMatrixMedia; preflight receives the plaintext path). No changes to other channels.

Change Type (select all)

Feature

Scope (select all touched areas)

Integrations

Linked Issue/PR

Closes [Feature]: Voice messages to agent don't work on Matrix #78016
This PR fixes a bug or regression

Root Cause (if applicable)

N/A — feature parity request, not a regression.

Regression Test Plan (if applicable)

N/A — feature parity. Test coverage:

15 unit tests in extensions/matrix/src/matrix/monitor/preflight-audio.test.ts covering the audio-detection predicate, transcript formatter, and runtime caller (happy paths, error swallowing, abort-signal short-circuit, MediaPaths/MediaTypes ctx shape).
9 integration tests in extensions/matrix/src/matrix/monitor/handler.audio-preflight.test.ts covering: DM voice notes, m.file with audio mime, mention-gate bypass via transcript, mention-gate drop without transcript match, transcription failure fallback, non-audio bypass, single-download verification, encrypted (E2EE) audio, and size-limit handling.
Existing handler.test.ts, handler.media-failure.test.ts, and the rest of the matrix monitor suite: 396/396 passing, no regressions.

User-visible / Behavior Changes

Inbound voice notes (m.audio, plus m.file carrying an audio/* mimetype) on Matrix now get transcribed before the mention gate. A voice note that mentions the bot by name (per the existing mentionRegexes) bypasses requireMention: true rooms, matching Discord/Telegram.
The transcript is wrapped with [Audio transcript (machine-generated, untrusted)]: … framing (with JSON.stringify escaping) so prompt-injection content inside the audio cannot impersonate system instructions.
Bare-filename audio bodies (e.g. voice.ogg, auto-set by Element) are replaced with the existing [matrix audio attachment] placeholder so the agent sees a clear audio marker rather than a stray filename. The download-failed path was already doing this; we extend it to the success path for audio specifically.
Operators can disable globally via tools.media.audio.enabled: false.

Diagram (if applicable)

Before:
[user voice note] -> [Matrix handler]
  -> mention gate (drops if requireMention:true & no @mention text)
  -> media download
  -> agent ctx { MediaPath, MediaUrl, BodyForAgent: "[matrix media]" }
  -> agent improvises a reply about "an audio attachment"

After:
[user voice note] -> [Matrix handler]
  -> audio detect + early download + transcribeFirstAudio
  -> mention gate (sees transcript text, can match @bot mentions)
  -> agent ctx { MediaPath, MediaUrl, BodyForAgent: "[Audio transcript ...]: \"...\"", MediaTranscribedIndexes: [0] }
  -> agent answers the spoken question

Security Impact (required)

New permissions/capabilities? No.
Secrets/tokens handling changed? No.
New/changed network calls? No new outbound endpoint — uses the operator's existing tools.media.audio.provider, the same one Discord/Telegram/WhatsApp/Feishu already call. No new API keys required.
Command/tool execution surface changed? No.
Data access scope changed? Yes (minor) — audio attachment bytes (decrypted plaintext for E2EE rooms) are now sent to the operator-configured STT provider on Matrix, matching peer-channel behavior. Mitigation: gated by tools.media.audio.enabled; documented in docs/channels/matrix.md.

Repro + Verification

Environment

OS: Linux (Debian)
Runtime/container: OpenClaw gateway in Docker
Model/provider: OpenAI gpt-4o-mini-transcribe (via tools.media.audio.provider: openai)
Integration/channel: Matrix (self-hosted Synapse, Element clients)
Relevant config (redacted): tools.media.audio.{enabled, provider: openai, model: gpt-4o-mini-transcribe}

Steps

Send a voice note from Element to a Matrix bot in a requireMention: true room, saying the bot's name in the recording.
Send a voice note in a DM to the bot, with no text caption.
Send a non-audio attachment (e.g. an image) to the same room to confirm unchanged behavior.

Expected

(1) Bot transcribes the voice note, sees the spoken bot mention in the transcript, and replies normally.
(2) Bot replies based on the transcribed body.
(3) Behavior unchanged from main.

Actual

(1) (2) (3) Asserted by integration tests in handler.audio-preflight.test.ts. Live end-to-end on a personal homeserver is deferred (see Human Verification below).

Evidence

Failing test/log before + passing after — the integration test file was authored TDD-style. Initial run showed 5 failed assertions on expect(transcribeFirstAudioMock).toHaveBeenCalledTimes(1); once the handler was wired, all 9 cases turned green.
Trace/log snippets — N/A; tests mock downloadMatrixMedia and transcribeFirstAudio following the Discord pattern.
Screenshot/recording — N/A.
Perf numbers — N/A.

Human Verification (required)

Verified scenarios: TDD-driven integration tests covering DM voice, room-mention-bypass via transcript, room-mention-drop without match, transcription-failure fallback, non-audio bypass, single-download verification, E2EE-encrypted audio, and size-limit handling. All 9 cases green. Full matrix monitor suite runs clean (396/396, including pre-existing handler.media-failure.test.ts and handler.body-for-agent.test.ts).
Edge cases checked: Encrypted media (file: { url, key, iv, hashes, v }) decryption + transcription path. m.file with audio mimetype. Bare-filename body normalization. Abort signal short-circuit before SDK load. Empty / undefined transcript fallthrough.
What I did NOT verify: Live end-to-end run against my own Synapse + Element clients on this branch. Reason: testing a fork build requires a custom Docker image; the framework-side path (transcribeFirstAudio) is exercised by the existing src/media-understanding test suite, and our handler integration is locked in by the new tests.

Review Conversations

I replied to or resolved every bot review conversation I addressed in this PR.
I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

Backward compatible? Yes — additive only.
Config/env changes? No — reuses existing tools.media.audio configuration.
Migration needed? No.

Risks and Mitigations

Risk: Prompt injection via voice transcript ("ignore prior instructions and …").
- Mitigation: Transcript is wrapped with [Audio transcript (machine-generated, untrusted)]: ${JSON.stringify(transcript)} framing before reaching the agent body, mirroring Telegram exactly. JSON.stringify escapes control characters and quote chars.
Risk: STT cost amplification — every audio message in a watched room triggers an STT API call.
- Mitigation: Operator-controlled via tools.media.audio.enabled. Existing room/sender allowlist in the Matrix handler runs BEFORE the preflight code, so unauthorized senders never trigger transcription.
Risk: Reordering the audio download (now runs before mention gate) could shift failure semantics.
- Mitigation: The existing media block now reuses the preflight result via !media && !mediaDownloadFailed guards. Same exception types propagate through the same logger paths. Existing handler.media-failure.test.ts still passes unchanged.

Notes for reviewers

The audio download + error-handling block in handler.ts duplicates ~25 lines with the existing media block (different scope, slightly different encrypted: Boolean(...) flag). Kept as-is for minimal-diff. Happy to extract into a helper if preferred.
The early content-extraction block at handler.ts ~876-887 (earlyContentInfo etc.) duplicates the later block at ~1088-1096 because the audio path needs the info BEFORE the mention gate while non-audio doesn't. Same minimal-diff trade-off.
No disableAudioPreflight per-room knob (Telegram has one). Operators rely on global tools.media.audio.enabled: false. Easy to add if the team wants finer-grained control — happy to send a follow-up.

This PR was developed with AI — [AI-assisted].

Suggested reviewer: @gumadeiras (Matrix-area maintainer).

P.S. I didn't tested it live so far. Still need to. First lets send the pr and lets see from there.

New module under extensions/matrix/src/matrix/monitor/ that wraps the shared transcribeFirstAudio helper for Matrix. Mirrors the Discord/Telegram pattern: a small *.runtime.ts shim re-exports from openclaw/plugin-sdk/media-runtime, and the public surface is a Matrix-specific predicate (isMatrixAudioContent), a prompt-injection-safe transcript formatter, and an async caller that builds a Matrix-shaped MsgContext (MediaPaths over MediaUrls, since Matrix downloads attachments locally for E2EE). Includes 15 unit tests covering predicate edges, JSON-escaped formatter output, happy-path transcription, error swallowing, and abort-signal short-circuit.

@bot

Wire resolveMatrixPreflightAudioTranscript into the Matrix monitor handler before the mention gate so voice-only messages can carry an @bot mention via the transcript and reach the agent in requireMention rooms, matching Discord, Telegram, WhatsApp, and Feishu. The audio download is hoisted ahead of the mention gate; the existing media block reuses the preflight result via !media && !mediaDownloadFailed guards so non-audio paths are unchanged. Bare-filename audio bodies (auto-set by Element) are normalized to the existing [matrix audio attachment] placeholder so the agent sees a clear marker rather than a stray filename. MediaTranscribedIndexes is set so downstream tools do not re-transcribe. 9 integration tests cover DM voice notes, mention-gate bypass via transcript, mention-gate drop without match, transcription failure, non-audio bypass, single-download verification, encrypted (E2EE) audio, and size-limit handling. Closes openclaw#78016.

Add a Voice messages and audio transcription section to docs/channels/matrix.md describing the preflight flow, behavior contract, and the global tools.media.audio.enabled kill switch.

clawsweeper · 2026-05-05T21:13:23Z

Codex review: needs real behavior proof before merge. Reviewed June 1, 2026, 1:07 AM ET / 05:07 UTC.

Summary
Review failed before ClawSweeper could summarize the requested change.

PR surface: Source +172, Tests +510, Docs +20. Total +702 across 8 files.

Reproducibility: unclear. The review failed before ClawSweeper could establish a reproduction path.

Review metrics: none identified.

Merge readiness
Overall: 🌊 off-meta tidepool
Proof: 🌊 off-meta tidepool
Patch quality: 🌊 off-meta tidepool
Result: rating does not apply to this item.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Risk before merge

[P1] No close action taken because the review did not complete.

Maintainer options:

Decide the mitigation before merge
Retry the Codex review after fixing the execution failure.
Pause or close
Do not merge this PR until maintainers decide whether the risk is worth taking.

Next step before merge

[P1] Review did not complete, so no work-lane recommendation was made.

Review details

Best possible solution:

Retry the Codex review after fixing the execution failure.

Do we have a high-confidence way to reproduce the issue?

Unclear. The review failed before ClawSweeper could establish a reproduction path.

Is this the best way to solve the issue?

Unclear. Retry the review first so ClawSweeper can evaluate the actual issue and fix direction.

AGENTS.md: unclear because the file could not be read completely.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 4e57546a8752.

Label changes

Label changes:

add rating: 🌊 off-meta tidepool: Overall readiness is 🌊 off-meta tidepool; proof is 🌊 off-meta tidepool and patch quality is 🌊 off-meta tidepool.
remove P2: Current review triage priority is none.
remove rating: 🧂 unranked krab: Current PR rating is rating: 🌊 off-meta tidepool, so this older rating label is no longer current.
remove merge-risk: 🚨 compatibility: Current PR review selected no merge-risk labels.
remove merge-risk: 🚨 security-boundary: Current PR review selected no merge-risk labels.
remove status: 📣 needs proof: Current PR status no longer selects a status label.

Label justifications:

rating: 🌊 off-meta tidepool: Overall readiness is 🌊 off-meta tidepool; proof is 🌊 off-meta tidepool and patch quality is 🌊 off-meta tidepool.

Evidence reviewed

PR surface:

Source +172, Tests +510, Docs +20. Total +702 across 8 files.

View PR surface stats

Area	Files	Added	Removed	Net
Source	3	178	6	+172
Tests	2	510	0	+510
Docs	3	20	0	+20
Config	0	0	0	0
Generated	0	0	0	0
Other	0	0	0	0
Total	8	708	6	+702

What I checked:

failure reason: codex execution failed.
codex failure detail: Codex review failed for this PR with exit 1.
codex stdout: Per-item Codex failure; continuing with the rest of the shard.

Likely related people:

unknown: Codex failed before it could trace repository history. (role: review did not complete; confidence: low)

What the crustacean ranks mean

🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works

ClawSweeper keeps one durable marker-backed review comment per issue or PR.
Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
Maintainers can also comment @clawsweeper review to request a fresh review only.
Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

Adds the user-facing changelog entry that OpenClaw policy requires for `feat` changes. Both extensions/matrix/CHANGELOG.md and the root CHANGELOG.md note the new pre-mention transcription behavior under Unreleased / Changes.

frankdierolf · 2026-05-05T21:25:19Z

Hey @clawsweeper — thanks for the review.

P1: false positive — isLikelyBareFilename is exported at extensions/matrix/src/matrix/media-text.ts:39. Imports resolve fine, type-check + tests are green.

P2: my bad on the changelog — fixed in 6441896.

Heads up for any human reviewer: I haven't done a live Synapse/Element run on this branch yet. Just threw the PR out because the impl is small and mirrors Discord/Telegram pretty closely. Happy if anyone has a moment to live-test, otherwise I'll get to it on my own homeserver soon-ish. Good starting point at least.

What surprised me when I dug into how this actually works: we transcribe every voice note that comes in, then grep the transcript for the bot mention. So if STT mishears the bot's name (or the user mispronounces it), the bot stays silent. Hilarious failure mode 😅 but I get why — no way to peek inside audio without transcribing it. Discord/Telegram/WhatsApp/Feishu live with the same trade-off, and STT is cheap enough now that it's fine.

Cheers,
Frank

P.s. approved and read by myself ^^

openclaw-barnacle · 2026-05-31T05:03:39Z

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

steipete · 2026-06-05T15:50:27Z

Closed as superseded by the maintainer-owned Matrix fix that landed in #90415 at 25149801189f.

#90415 includes the inbound Matrix voice-note preflight path from this PR, with focused tests and live proof: AWS Crabbox live OpenAI voice-STT Matrix QA run_f9edb5fe2a5f passed matrix-voice-preflight-mention in live-frontier mode, suite 4/4.

Thanks @frankdierolf for pushing this behavior forward.

layhaus · 2026-06-06T05:31:09Z

I will test it when it lands in the next release and comment here again to give a "production approve signal".

Thanks for the work 👍
*I am Frank, diffrent account. My fault^^

frankdierolf added 3 commits May 5, 2026 23:08

docs(matrix): document inbound audio preflight

8403e0e

Add a Voice messages and audio transcription section to docs/channels/matrix.md describing the preflight flow, behavior contract, and the global tools.media.audio.enabled kill switch.

openclaw-barnacle Bot added docs Improvements or additions to documentation channel: matrix Channel integration: matrix size: L triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 5, 2026

docs(matrix): add changelog entry for inbound audio preflight

6441896

Adds the user-facing changelog entry that OpenClaw policy requires for `feat` changes. Both extensions/matrix/CHANGELOG.md and the root CHANGELOG.md note the new pre-mention transcription behavior under Unreleased / Changes.

clawsweeper Bot mentioned this pull request May 14, 2026

[Feature]: Voice messages to agent don't work on Matrix #78016

Closed

openclaw-barnacle Bot added the stale Marked as stale due to inactivity label May 31, 2026

openclaw-barnacle Bot removed the stale Marked as stale due to inactivity label Jun 1, 2026

steipete mentioned this pull request Jun 4, 2026

feat(matrix): handle voice preflight and threads #90415

Merged

steipete closed this Jun 5, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(matrix): transcribe inbound voice notes before mention gate#78069

feat(matrix): transcribe inbound voice notes before mention gate#78069
frankdierolf wants to merge 4 commits into
openclaw:mainfrom
frankdierolf:matrix-inbound-audio-preflight

frankdierolf commented May 5, 2026

Uh oh!

clawsweeper Bot commented May 5, 2026 •

edited

Loading

Uh oh!

frankdierolf commented May 5, 2026 •

edited

Loading

Uh oh!

openclaw-barnacle Bot commented May 31, 2026

Uh oh!

steipete commented Jun 5, 2026

Uh oh!

layhaus commented Jun 6, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

frankdierolf commented May 5, 2026

Summary

Change Type (select all)

Scope (select all touched areas)

Linked Issue/PR

Root Cause (if applicable)

Regression Test Plan (if applicable)

User-visible / Behavior Changes

Diagram (if applicable)

Security Impact (required)

Repro + Verification

Environment

Steps

Expected

Actual

Evidence

Human Verification (required)

Review Conversations

Compatibility / Migration

Risks and Mitigations

Notes for reviewers

Uh oh!

clawsweeper Bot commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

frankdierolf commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openclaw-barnacle Bot commented May 31, 2026

Uh oh!

steipete commented Jun 5, 2026

Uh oh!

layhaus commented Jun 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

clawsweeper Bot commented May 5, 2026 •

edited

Loading

frankdierolf commented May 5, 2026 •

edited

Loading

layhaus commented Jun 6, 2026 •

edited

Loading