feat(matrix): transcribe inbound voice notes before mention gate#78069
feat(matrix): transcribe inbound voice notes before mention gate#78069frankdierolf wants to merge 4 commits into
Conversation
New module under extensions/matrix/src/matrix/monitor/ that wraps the shared transcribeFirstAudio helper for Matrix. Mirrors the Discord/Telegram pattern: a small *.runtime.ts shim re-exports from openclaw/plugin-sdk/media-runtime, and the public surface is a Matrix-specific predicate (isMatrixAudioContent), a prompt-injection-safe transcript formatter, and an async caller that builds a Matrix-shaped MsgContext (MediaPaths over MediaUrls, since Matrix downloads attachments locally for E2EE). Includes 15 unit tests covering predicate edges, JSON-escaped formatter output, happy-path transcription, error swallowing, and abort-signal short-circuit.
Wire resolveMatrixPreflightAudioTranscript into the Matrix monitor handler before the mention gate so voice-only messages can carry an @bot mention via the transcript and reach the agent in requireMention rooms, matching Discord, Telegram, WhatsApp, and Feishu. The audio download is hoisted ahead of the mention gate; the existing media block reuses the preflight result via !media && !mediaDownloadFailed guards so non-audio paths are unchanged. Bare-filename audio bodies (auto-set by Element) are normalized to the existing [matrix audio attachment] placeholder so the agent sees a clear marker rather than a stray filename. MediaTranscribedIndexes is set so downstream tools do not re-transcribe. 9 integration tests cover DM voice notes, mention-gate bypass via transcript, mention-gate drop without match, transcription failure, non-audio bypass, single-download verification, encrypted (E2EE) audio, and size-limit handling. Closes openclaw#78016.
Add a Voice messages and audio transcription section to docs/channels/matrix.md describing the preflight flow, behavior contract, and the global tools.media.audio.enabled kill switch.
|
Codex review: needs real behavior proof before merge. Reviewed June 1, 2026, 1:07 AM ET / 05:07 UTC. Summary PR surface: Source +172, Tests +510, Docs +20. Total +702 across 8 files. Reproducibility: unclear. The review failed before ClawSweeper could establish a reproduction path. Review metrics: none identified. Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Risk before merge
Maintainer options:
Next step before merge
Review detailsBest possible solution: Retry the Codex review after fixing the execution failure. Do we have a high-confidence way to reproduce the issue? Unclear. The review failed before ClawSweeper could establish a reproduction path. Is this the best way to solve the issue? Unclear. Retry the review first so ClawSweeper can evaluate the actual issue and fix direction. AGENTS.md: unclear because the file could not be read completely. Codex review notes: model gpt-5.5, reasoning high; reviewed against 4e57546a8752. Label changesLabel changes:
Label justifications:
Evidence reviewedPR surface: Source +172, Tests +510, Docs +20. Total +702 across 8 files. View PR surface stats
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
Adds the user-facing changelog entry that OpenClaw policy requires for `feat` changes. Both extensions/matrix/CHANGELOG.md and the root CHANGELOG.md note the new pre-mention transcription behavior under Unreleased / Changes.
|
Hey @clawsweeper — thanks for the review. P1: false positive — P2: my bad on the changelog — fixed in 6441896. Heads up for any human reviewer: I haven't done a live Synapse/Element run on this branch yet. Just threw the PR out because the impl is small and mirrors Discord/Telegram pretty closely. Happy if anyone has a moment to live-test, otherwise I'll get to it on my own homeserver soon-ish. Good starting point at least. What surprised me when I dug into how this actually works: we transcribe every voice note that comes in, then grep the transcript for the bot mention. So if STT mishears the bot's name (or the user mispronounces it), the bot stays silent. Hilarious failure mode 😅 but I get why — no way to peek inside audio without transcribing it. Discord/Telegram/WhatsApp/Feishu live with the same trade-off, and STT is cheap enough now that it's fine. Cheers, P.s. approved and read by myself ^^ |
|
This pull request has been automatically marked as stale due to inactivity. |
|
Closed as superseded by the maintainer-owned Matrix fix that landed in #90415 at #90415 includes the inbound Matrix voice-note preflight path from this PR, with focused tests and live proof: AWS Crabbox live OpenAI voice-STT Matrix QA Thanks @frankdierolf for pushing this behavior forward. |
|
I will test it when it lands in the next release and comment here again to give a "production approve signal". Thanks for the work 👍 |
Summary
requireMention: truerooms drop voice notes entirely because there's no text to match the mention regex.transcribeFirstAudiobefore the mention gate. Matrix is the lone holdout — voice-only Matrix users have no way to talk to their agent.transcribeFirstAudiohelper into the Matrix monitor handler before the mention gate, mirroring the Discord/Telegram pattern. The transcript feeds into the mention check (so a voice note that says the bot's name reaches the agent inrequireMentionrooms) and intoBodyForAgent(so the agent reads the transcript instead of a placeholder).MediaTranscribedIndexesis set so downstream tools don't re-transcribe the same audio.tools.media.audio.enabled). Outbound TTS untouched. E2EE crypto path untouched (decryption stays insidedownloadMatrixMedia; preflight receives the plaintext path). No changes to other channels.Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
Root Cause (if applicable)
N/A — feature parity request, not a regression.
Regression Test Plan (if applicable)
N/A — feature parity. Test coverage:
extensions/matrix/src/matrix/monitor/preflight-audio.test.tscovering the audio-detection predicate, transcript formatter, and runtime caller (happy paths, error swallowing, abort-signal short-circuit, MediaPaths/MediaTypes ctx shape).extensions/matrix/src/matrix/monitor/handler.audio-preflight.test.tscovering: DM voice notes,m.filewith audio mime, mention-gate bypass via transcript, mention-gate drop without transcript match, transcription failure fallback, non-audio bypass, single-download verification, encrypted (E2EE) audio, and size-limit handling.handler.test.ts,handler.media-failure.test.ts, and the rest of the matrix monitor suite: 396/396 passing, no regressions.User-visible / Behavior Changes
m.audio, plusm.filecarrying anaudio/*mimetype) on Matrix now get transcribed before the mention gate. A voice note that mentions the bot by name (per the existingmentionRegexes) bypassesrequireMention: truerooms, matching Discord/Telegram.[Audio transcript (machine-generated, untrusted)]: …framing (withJSON.stringifyescaping) so prompt-injection content inside the audio cannot impersonate system instructions.voice.ogg, auto-set by Element) are replaced with the existing[matrix audio attachment]placeholder so the agent sees a clear audio marker rather than a stray filename. The download-failed path was already doing this; we extend it to the success path for audio specifically.tools.media.audio.enabled: false.Diagram (if applicable)
Security Impact (required)
tools.media.audio.provider, the same one Discord/Telegram/WhatsApp/Feishu already call. No new API keys required.tools.media.audio.enabled; documented indocs/channels/matrix.md.Repro + Verification
Environment
gpt-4o-mini-transcribe(viatools.media.audio.provider: openai)tools.media.audio.{enabled, provider: openai, model: gpt-4o-mini-transcribe}Steps
requireMention: trueroom, saying the bot's name in the recording.Expected
Actual
handler.audio-preflight.test.ts. Live end-to-end on a personal homeserver is deferred (see Human Verification below).Evidence
expect(transcribeFirstAudioMock).toHaveBeenCalledTimes(1); once the handler was wired, all 9 cases turned green.downloadMatrixMediaandtranscribeFirstAudiofollowing the Discord pattern.Human Verification (required)
handler.media-failure.test.tsandhandler.body-for-agent.test.ts).file: { url, key, iv, hashes, v }) decryption + transcription path.m.filewith audio mimetype. Bare-filename body normalization. Abort signal short-circuit before SDK load. Empty / undefined transcript fallthrough.transcribeFirstAudio) is exercised by the existingsrc/media-understandingtest suite, and our handler integration is locked in by the new tests.Review Conversations
Compatibility / Migration
tools.media.audioconfiguration.Risks and Mitigations
[Audio transcript (machine-generated, untrusted)]: ${JSON.stringify(transcript)}framing before reaching the agent body, mirroring Telegram exactly.JSON.stringifyescapes control characters and quote chars.tools.media.audio.enabled. Existing room/sender allowlist in the Matrix handler runs BEFORE the preflight code, so unauthorized senders never trigger transcription.!media && !mediaDownloadFailedguards. Same exception types propagate through the same logger paths. Existinghandler.media-failure.test.tsstill passes unchanged.Notes for reviewers
handler.tsduplicates ~25 lines with the existing media block (different scope, slightly differentencrypted: Boolean(...)flag). Kept as-is for minimal-diff. Happy to extract into a helper if preferred.earlyContentInfoetc.) duplicates the later block at ~1088-1096 because the audio path needs the info BEFORE the mention gate while non-audio doesn't. Same minimal-diff trade-off.disableAudioPreflightper-room knob (Telegram has one). Operators rely on globaltools.media.audio.enabled: false. Easy to add if the team wants finer-grained control — happy to send a follow-up.This PR was developed with AI —
[AI-assisted].Suggested reviewer: @gumadeiras (Matrix-area maintainer).
P.S. I didn't tested it live so far. Still need to. First lets send the pr and lets see from there.