Skip to content

feat(matrix): transcribe inbound voice notes before mention gate#78069

Closed
frankdierolf wants to merge 4 commits into
openclaw:mainfrom
frankdierolf:matrix-inbound-audio-preflight
Closed

feat(matrix): transcribe inbound voice notes before mention gate#78069
frankdierolf wants to merge 4 commits into
openclaw:mainfrom
frankdierolf:matrix-inbound-audio-preflight

Conversation

@frankdierolf

Copy link
Copy Markdown
Contributor

Summary

  • Problem: Inbound voice messages on Matrix reach the agent as raw audio attachments with no transcript. The model improvises a reply instead of answering, and requireMention: true rooms drop voice notes entirely because there's no text to match the mention regex.
  • Why it matters: Discord, Telegram, WhatsApp, and Feishu already transcribe inbound voice via transcribeFirstAudio before the mention gate. Matrix is the lone holdout — voice-only Matrix users have no way to talk to their agent.
  • What changed: Wire the existing transcribeFirstAudio helper into the Matrix monitor handler before the mention gate, mirroring the Discord/Telegram pattern. The transcript feeds into the mention check (so a voice note that says the bot's name reaches the agent in requireMention rooms) and into BodyForAgent (so the agent reads the transcript instead of a placeholder). MediaTranscribedIndexes is set so downstream tools don't re-transcribe the same audio.
  • What did NOT change (scope boundary): No core media-understanding changes. No new config keys (operators control via the existing global tools.media.audio.enabled). Outbound TTS untouched. E2EE crypto path untouched (decryption stays inside downloadMatrixMedia; preflight receives the plaintext path). No changes to other channels.

Change Type (select all)

  • Feature

Scope (select all touched areas)

  • Integrations

Linked Issue/PR

Root Cause (if applicable)

N/A — feature parity request, not a regression.

Regression Test Plan (if applicable)

N/A — feature parity. Test coverage:

  • 15 unit tests in extensions/matrix/src/matrix/monitor/preflight-audio.test.ts covering the audio-detection predicate, transcript formatter, and runtime caller (happy paths, error swallowing, abort-signal short-circuit, MediaPaths/MediaTypes ctx shape).
  • 9 integration tests in extensions/matrix/src/matrix/monitor/handler.audio-preflight.test.ts covering: DM voice notes, m.file with audio mime, mention-gate bypass via transcript, mention-gate drop without transcript match, transcription failure fallback, non-audio bypass, single-download verification, encrypted (E2EE) audio, and size-limit handling.
  • Existing handler.test.ts, handler.media-failure.test.ts, and the rest of the matrix monitor suite: 396/396 passing, no regressions.

User-visible / Behavior Changes

  • Inbound voice notes (m.audio, plus m.file carrying an audio/* mimetype) on Matrix now get transcribed before the mention gate. A voice note that mentions the bot by name (per the existing mentionRegexes) bypasses requireMention: true rooms, matching Discord/Telegram.
  • The transcript is wrapped with [Audio transcript (machine-generated, untrusted)]: … framing (with JSON.stringify escaping) so prompt-injection content inside the audio cannot impersonate system instructions.
  • Bare-filename audio bodies (e.g. voice.ogg, auto-set by Element) are replaced with the existing [matrix audio attachment] placeholder so the agent sees a clear audio marker rather than a stray filename. The download-failed path was already doing this; we extend it to the success path for audio specifically.
  • Operators can disable globally via tools.media.audio.enabled: false.

Diagram (if applicable)

Before:
[user voice note] -> [Matrix handler]
  -> mention gate (drops if requireMention:true & no @mention text)
  -> media download
  -> agent ctx { MediaPath, MediaUrl, BodyForAgent: "[matrix media]" }
  -> agent improvises a reply about "an audio attachment"

After:
[user voice note] -> [Matrix handler]
  -> audio detect + early download + transcribeFirstAudio
  -> mention gate (sees transcript text, can match @bot mentions)
  -> agent ctx { MediaPath, MediaUrl, BodyForAgent: "[Audio transcript ...]: \"...\"", MediaTranscribedIndexes: [0] }
  -> agent answers the spoken question

Security Impact (required)

  • New permissions/capabilities? No.
  • Secrets/tokens handling changed? No.
  • New/changed network calls? No new outbound endpoint — uses the operator's existing tools.media.audio.provider, the same one Discord/Telegram/WhatsApp/Feishu already call. No new API keys required.
  • Command/tool execution surface changed? No.
  • Data access scope changed? Yes (minor) — audio attachment bytes (decrypted plaintext for E2EE rooms) are now sent to the operator-configured STT provider on Matrix, matching peer-channel behavior. Mitigation: gated by tools.media.audio.enabled; documented in docs/channels/matrix.md.

Repro + Verification

Environment

  • OS: Linux (Debian)
  • Runtime/container: OpenClaw gateway in Docker
  • Model/provider: OpenAI gpt-4o-mini-transcribe (via tools.media.audio.provider: openai)
  • Integration/channel: Matrix (self-hosted Synapse, Element clients)
  • Relevant config (redacted): tools.media.audio.{enabled, provider: openai, model: gpt-4o-mini-transcribe}

Steps

  1. Send a voice note from Element to a Matrix bot in a requireMention: true room, saying the bot's name in the recording.
  2. Send a voice note in a DM to the bot, with no text caption.
  3. Send a non-audio attachment (e.g. an image) to the same room to confirm unchanged behavior.

Expected

  • (1) Bot transcribes the voice note, sees the spoken bot mention in the transcript, and replies normally.
  • (2) Bot replies based on the transcribed body.
  • (3) Behavior unchanged from main.

Actual

  • (1) (2) (3) Asserted by integration tests in handler.audio-preflight.test.ts. Live end-to-end on a personal homeserver is deferred (see Human Verification below).

Evidence

  • Failing test/log before + passing after — the integration test file was authored TDD-style. Initial run showed 5 failed assertions on expect(transcribeFirstAudioMock).toHaveBeenCalledTimes(1); once the handler was wired, all 9 cases turned green.
  • Trace/log snippets — N/A; tests mock downloadMatrixMedia and transcribeFirstAudio following the Discord pattern.
  • Screenshot/recording — N/A.
  • Perf numbers — N/A.

Human Verification (required)

  • Verified scenarios: TDD-driven integration tests covering DM voice, room-mention-bypass via transcript, room-mention-drop without match, transcription-failure fallback, non-audio bypass, single-download verification, E2EE-encrypted audio, and size-limit handling. All 9 cases green. Full matrix monitor suite runs clean (396/396, including pre-existing handler.media-failure.test.ts and handler.body-for-agent.test.ts).
  • Edge cases checked: Encrypted media (file: { url, key, iv, hashes, v }) decryption + transcription path. m.file with audio mimetype. Bare-filename body normalization. Abort signal short-circuit before SDK load. Empty / undefined transcript fallthrough.
  • What I did NOT verify: Live end-to-end run against my own Synapse + Element clients on this branch. Reason: testing a fork build requires a custom Docker image; the framework-side path (transcribeFirstAudio) is exercised by the existing src/media-understanding test suite, and our handler integration is locked in by the new tests.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes — additive only.
  • Config/env changes? No — reuses existing tools.media.audio configuration.
  • Migration needed? No.

Risks and Mitigations

  • Risk: Prompt injection via voice transcript ("ignore prior instructions and …").
    • Mitigation: Transcript is wrapped with [Audio transcript (machine-generated, untrusted)]: ${JSON.stringify(transcript)} framing before reaching the agent body, mirroring Telegram exactly. JSON.stringify escapes control characters and quote chars.
  • Risk: STT cost amplification — every audio message in a watched room triggers an STT API call.
    • Mitigation: Operator-controlled via tools.media.audio.enabled. Existing room/sender allowlist in the Matrix handler runs BEFORE the preflight code, so unauthorized senders never trigger transcription.
  • Risk: Reordering the audio download (now runs before mention gate) could shift failure semantics.
    • Mitigation: The existing media block now reuses the preflight result via !media && !mediaDownloadFailed guards. Same exception types propagate through the same logger paths. Existing handler.media-failure.test.ts still passes unchanged.

Notes for reviewers

  • The audio download + error-handling block in handler.ts duplicates ~25 lines with the existing media block (different scope, slightly different encrypted: Boolean(...) flag). Kept as-is for minimal-diff. Happy to extract into a helper if preferred.
  • The early content-extraction block at handler.ts ~876-887 (earlyContentInfo etc.) duplicates the later block at ~1088-1096 because the audio path needs the info BEFORE the mention gate while non-audio doesn't. Same minimal-diff trade-off.
  • No disableAudioPreflight per-room knob (Telegram has one). Operators rely on global tools.media.audio.enabled: false. Easy to add if the team wants finer-grained control — happy to send a follow-up.

This PR was developed with AI — [AI-assisted].

Suggested reviewer: @gumadeiras (Matrix-area maintainer).


P.S. I didn't tested it live so far. Still need to. First lets send the pr and lets see from there.

New module under extensions/matrix/src/matrix/monitor/ that wraps
the shared transcribeFirstAudio helper for Matrix. Mirrors the
Discord/Telegram pattern: a small *.runtime.ts shim re-exports
from openclaw/plugin-sdk/media-runtime, and the public surface
is a Matrix-specific predicate (isMatrixAudioContent), a
prompt-injection-safe transcript formatter, and an async caller
that builds a Matrix-shaped MsgContext (MediaPaths over
MediaUrls, since Matrix downloads attachments locally for E2EE).

Includes 15 unit tests covering predicate edges, JSON-escaped
formatter output, happy-path transcription, error swallowing,
and abort-signal short-circuit.
Wire resolveMatrixPreflightAudioTranscript into the Matrix
monitor handler before the mention gate so voice-only messages
can carry an @bot mention via the transcript and reach the
agent in requireMention rooms, matching Discord, Telegram,
WhatsApp, and Feishu.

The audio download is hoisted ahead of the mention gate; the
existing media block reuses the preflight result via
!media && !mediaDownloadFailed guards so non-audio paths are
unchanged. Bare-filename audio bodies (auto-set by Element)
are normalized to the existing [matrix audio attachment]
placeholder so the agent sees a clear marker rather than a
stray filename. MediaTranscribedIndexes is set so downstream
tools do not re-transcribe.

9 integration tests cover DM voice notes, mention-gate bypass
via transcript, mention-gate drop without match, transcription
failure, non-audio bypass, single-download verification,
encrypted (E2EE) audio, and size-limit handling.

Closes openclaw#78016.
Add a Voice messages and audio transcription section to
docs/channels/matrix.md describing the preflight flow,
behavior contract, and the global tools.media.audio.enabled
kill switch.
@openclaw-barnacle openclaw-barnacle Bot added docs Improvements or additions to documentation channel: matrix Channel integration: matrix size: L triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 5, 2026
@clawsweeper

clawsweeper Bot commented May 5, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs real behavior proof before merge. Reviewed June 1, 2026, 1:07 AM ET / 05:07 UTC.

Summary
Review failed before ClawSweeper could summarize the requested change.

PR surface: Source +172, Tests +510, Docs +20. Total +702 across 8 files.

Reproducibility: unclear. The review failed before ClawSweeper could establish a reproduction path.

Review metrics: none identified.

Merge readiness
Overall: 🌊 off-meta tidepool
Proof: 🌊 off-meta tidepool
Patch quality: 🌊 off-meta tidepool
Result: rating does not apply to this item.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Risk before merge

  • [P1] No close action taken because the review did not complete.

Maintainer options:

  1. Decide the mitigation before merge
    Retry the Codex review after fixing the execution failure.
  2. Pause or close
    Do not merge this PR until maintainers decide whether the risk is worth taking.

Next step before merge

  • [P1] Review did not complete, so no work-lane recommendation was made.
Review details

Best possible solution:

Retry the Codex review after fixing the execution failure.

Do we have a high-confidence way to reproduce the issue?

Unclear. The review failed before ClawSweeper could establish a reproduction path.

Is this the best way to solve the issue?

Unclear. Retry the review first so ClawSweeper can evaluate the actual issue and fix direction.

AGENTS.md: unclear because the file could not be read completely.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 4e57546a8752.

Label changes

Label changes:

  • add rating: 🌊 off-meta tidepool: Overall readiness is 🌊 off-meta tidepool; proof is 🌊 off-meta tidepool and patch quality is 🌊 off-meta tidepool.
  • remove P2: Current review triage priority is none.
  • remove rating: 🧂 unranked krab: Current PR rating is rating: 🌊 off-meta tidepool, so this older rating label is no longer current.
  • remove merge-risk: 🚨 compatibility: Current PR review selected no merge-risk labels.
  • remove merge-risk: 🚨 security-boundary: Current PR review selected no merge-risk labels.
  • remove status: 📣 needs proof: Current PR status no longer selects a status label.

Label justifications:

  • rating: 🌊 off-meta tidepool: Overall readiness is 🌊 off-meta tidepool; proof is 🌊 off-meta tidepool and patch quality is 🌊 off-meta tidepool.
Evidence reviewed

PR surface:

Source +172, Tests +510, Docs +20. Total +702 across 8 files.

View PR surface stats
Area Files Added Removed Net
Source 3 178 6 +172
Tests 2 510 0 +510
Docs 3 20 0 +20
Config 0 0 0 0
Generated 0 0 0 0
Other 0 0 0 0
Total 8 708 6 +702

What I checked:

  • failure reason: codex execution failed.
  • codex failure detail: Codex review failed for this PR with exit 1.
  • codex stdout: Per-item Codex failure; continuing with the rest of the shard.

Likely related people:

  • unknown: Codex failed before it could trace repository history. (role: review did not complete; confidence: low)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

Adds the user-facing changelog entry that OpenClaw policy
requires for `feat` changes. Both extensions/matrix/CHANGELOG.md
and the root CHANGELOG.md note the new pre-mention transcription
behavior under Unreleased / Changes.
@frankdierolf

frankdierolf commented May 5, 2026

Copy link
Copy Markdown
Contributor Author

Hey @clawsweeper — thanks for the review.

P1: false positive — isLikelyBareFilename is exported at extensions/matrix/src/matrix/media-text.ts:39. Imports resolve fine, type-check + tests are green.

P2: my bad on the changelog — fixed in 6441896.

Heads up for any human reviewer: I haven't done a live Synapse/Element run on this branch yet. Just threw the PR out because the impl is small and mirrors Discord/Telegram pretty closely. Happy if anyone has a moment to live-test, otherwise I'll get to it on my own homeserver soon-ish. Good starting point at least.

What surprised me when I dug into how this actually works: we transcribe every voice note that comes in, then grep the transcript for the bot mention. So if STT mishears the bot's name (or the user mispronounces it), the bot stays silent. Hilarious failure mode 😅 but I get why — no way to peek inside audio without transcribing it. Discord/Telegram/WhatsApp/Feishu live with the same trade-off, and STT is cheap enough now that it's fine.

Cheers,
Frank

P.s. approved and read by myself ^^

@openclaw-barnacle

Copy link
Copy Markdown

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

@openclaw-barnacle openclaw-barnacle Bot added the stale Marked as stale due to inactivity label May 31, 2026
@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P2 Normal backlog priority with limited blast radius. merge-risk: 🚨 compatibility 🚨 May break existing users, config, migrations, defaults, or upgrade paths. merge-risk: 🚨 security-boundary 🚨 May affect sandboxing, authorization, credentials, or sensitive data. labels May 31, 2026
@openclaw-barnacle openclaw-barnacle Bot removed the stale Marked as stale due to inactivity label Jun 1, 2026
@clawsweeper clawsweeper Bot added rating: 🌊 off-meta tidepool PR readiness rating does not apply to this item. and removed rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels Jun 1, 2026
@steipete

steipete commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Closed as superseded by the maintainer-owned Matrix fix that landed in #90415 at 25149801189f.

#90415 includes the inbound Matrix voice-note preflight path from this PR, with focused tests and live proof: AWS Crabbox live OpenAI voice-STT Matrix QA run_f9edb5fe2a5f passed matrix-voice-preflight-mention in live-frontier mode, suite 4/4.

Thanks @frankdierolf for pushing this behavior forward.

@steipete steipete closed this Jun 5, 2026
@layhaus

layhaus commented Jun 6, 2026

Copy link
Copy Markdown

I will test it when it lands in the next release and comment here again to give a "production approve signal".

Thanks for the work 👍
*I am Frank, diffrent account. My fault^^

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

channel: matrix Channel integration: matrix docs Improvements or additions to documentation merge-risk: 🚨 compatibility 🚨 May break existing users, config, migrations, defaults, or upgrade paths. merge-risk: 🚨 security-boundary 🚨 May affect sandboxing, authorization, credentials, or sensitive data. P2 Normal backlog priority with limited blast radius. rating: 🌊 off-meta tidepool PR readiness rating does not apply to this item. size: L triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Voice messages to agent don't work on Matrix

3 participants