Problem or Use Case
Current behavior
When a user sends a voice message, Hermes automatically transcribes it to text and injects the transcript into the conversation context. The agent never sees or receives the actual audio file.
Example of what the agent currently receives:
[The user sent a voice message~ Here's what they said: "Ja, we moeten even..."]
Problem
This makes it impossible for the agent to:
- Run speaker diarization (e.g. with parakeet-rs) to identify who spoke when
- Perform audio quality checks or noise analysis
- Extract speaker snippets to ask the user "who is Speaker 0?"
- Use custom transcription pipelines (different models, languages, formatting)
- Archive the original audio alongside transcripts
Desired behavior
Provide a way (per-message or per-conversation) to receive the voice message as an audio file instead of (or in addition to) the automatic transcript.
Ideally, the agent should receive something like:
[The user sent a voice message: /path/to/downloaded/audio.ogg (duration: 12:34)]
The agent can then decide what to do with it: transcribe it itself, run diarization, store it, etc.
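To make the proposed format concrete, here is a minimal sketch of how an agent-side helper could pull the file path and duration out of such a message. The names `VOICE_MARKER`, `VoiceAttachment`, and `parse_voice_marker` are hypothetical and not part of Hermes; the marker layout is simply the example above, not an existing format.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical pattern matching the message layout proposed in this issue:
# [The user sent a voice message: <path> (duration: MM:SS)]
VOICE_MARKER = re.compile(
    r"\[The user sent a voice message: (?P<path>.+?) "
    r"\(duration: (?P<minutes>\d+):(?P<seconds>\d{2})\)\]"
)


@dataclass
class VoiceAttachment:
    """Audio file reference extracted from the conversation context."""
    path: str
    duration_seconds: int


def parse_voice_marker(text: str) -> Optional[VoiceAttachment]:
    """Return the attachment described in `text`, or None if there is none."""
    match = VOICE_MARKER.search(text)
    if match is None:
        return None
    seconds = int(match["minutes"]) * 60 + int(match["seconds"])
    return VoiceAttachment(path=match["path"], duration_seconds=seconds)
```

With a parsed path in hand, the agent could pass the file to its own diarization or transcription tooling instead of relying on the built-in transcript.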
Use case / motivation
I maintain a personal knowledge wiki where meeting recordings are diarized with parakeet-rs, speaker identities are resolved by extracting audio snippets, and transcripts are stored via a structured ingestion pipeline. The automatic STT bypasses this entire workflow and provides no speaker separation.
Proposed Solution
Possible solutions
- Per-message opt-out: A user prefix or command (e.g. /voice or !nostt) that tells Hermes "send me the file, not the transcript"
- Agent-side preference: A setting the agent can toggle: "for this conversation, request raw audio for voice messages"
- Always provide both: Send the audio file path and the transcript, letting the agent choose which to use
- Platform-level config: A setting in Hermes config to disable automatic STT globally or per-platform
- On-demand transcription: An MCP tool or skill the agent can call to transcribe the audio when it sees fit
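As a rough illustration of how the config-style options could fit together, the sketch below models a single setting with three values (transcript only, audio only, both) and renders the context message accordingly. Everything here is an assumption for discussion: `VoiceMode`, `render_voice_message`, and the combined "both" layout do not exist in Hermes, and the real setting name and message wording would be up to the maintainers.

```python
from enum import Enum
from typing import Optional


class VoiceMode(Enum):
    """Hypothetical setting values; not an existing Hermes option."""
    TRANSCRIPT = "transcript"  # current behavior
    AUDIO = "audio"            # file path only, no automatic STT
    BOTH = "both"              # file path plus transcript


def format_duration(seconds: int) -> str:
    """Format a duration in seconds as M:SS."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}:{secs:02d}"


def render_voice_message(
    mode: VoiceMode,
    audio_path: str,
    duration_seconds: int,
    transcript: Optional[str] = None,
) -> str:
    """Build the text injected into the conversation for one voice message."""
    file_part = f"{audio_path} (duration: {format_duration(duration_seconds)})"
    if mode is VoiceMode.AUDIO:
        return f"[The user sent a voice message: {file_part}]"
    if transcript is None:
        raise ValueError(f"mode {mode.value!r} requires a transcript")
    if mode is VoiceMode.BOTH:
        # Assumed combined layout; the issue does not specify one.
        return (
            f"[The user sent a voice message: {file_part}. "
            f"Here's what they said: \"{transcript}\"]"
        )
    return f"[The user sent a voice message~ Here's what they said: \"{transcript}\"]"
```

A per-conversation or per-platform override would then only need to choose which `VoiceMode` is passed in, and the default could stay `TRANSCRIPT` so existing setups are unaffected.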
Alternatives Considered
No response
Feature Type
Configuration option
Scope
Small (single file, < 50 lines)
Contribution
Debug Report (optional)