Context
Part of #415 Phase 3, item 9. Speech-to-text (STT) is a middleware concern, like TTS: it should work identically regardless of which CLI runtime is configured. It is currently embedded in the media-understanding subsystem, which is being decomposed.
Rationale
STT is runtime-agnostic: Voice messages need transcription before ANY runtime can process them. Even Gemini (which supports audio natively) benefits from middleware STT for consistent behavior.
Image/video understanding is runtime-dependent: Some runtimes handle images natively (Gemini, Claude), others need middleware fallback. This is NOT the same concern as STT.
TTS precedent: TTS already exists as a standalone src/tts/ module. STT should mirror this structure.
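The split described in the rationale can be sketched as a small routing decision. This is only an illustration: `MediaKind`, `RuntimeCapabilities`, `MediaPlan`, and `planMediaHandling` are hypothetical names that do not come from the codebase, and the capability flags are assumed inputs rather than an existing API.

```typescript
// Hypothetical types for illustration; none of these names exist in the repo.
type MediaKind = "audio" | "image" | "video";

interface RuntimeCapabilities {
  name: string;
  nativeImages: boolean; // assumed flag: runtime accepts images directly
  nativeVideo: boolean;  // assumed flag: runtime accepts video directly
}

type MediaPlan = "stt-middleware" | "runtime-native" | "middleware-fallback";

function planMediaHandling(kind: MediaKind, runtime: RuntimeCapabilities): MediaPlan {
  // Audio never consults the runtime: transcription happens in middleware first,
  // so behavior is the same for every configured runtime.
  if (kind === "audio") return "stt-middleware";

  // Image/video handling depends on what the runtime supports natively.
  const native = kind === "image" ? runtime.nativeImages : runtime.nativeVideo;
  return native ? "runtime-native" : "middleware-fallback";
}
```

The point of the sketch is that the audio branch ignores `runtime` entirely, while the image/video branch cannot be decided without it, which is why the two concerns are separated.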
Scope
Create src/stt/ module:
Extract audio transcription logic from src/media-understanding/
Provider support: Deepgram, OpenAI Whisper, Google (whatever media-understanding currently supports for audio)
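One possible shape for the module's provider layer is sketched below. The issue does not specify the `src/stt/` API, so every name here (`SttProviderId`, `SttRequest`, `SttProvider`, `selectProviders`, `transcribeWithFallback`) is hypothetical. The try-the-next-provider fallback is also an assumption for illustration and may not match what media-understanding does today.

```typescript
// Hypothetical shapes; the real src/stt/ API is not specified in this issue.
type SttProviderId = "deepgram" | "openai" | "google";

interface SttRequest {
  audio: Uint8Array;
  mimeType: string;
}

interface SttProvider {
  id: SttProviderId;
  transcribe(request: SttRequest): Promise<string>;
}

// Return registered providers in preference order, skipping any without an API key.
function selectProviders(
  preference: SttProviderId[],
  registry: Map<SttProviderId, SttProvider>,
  hasApiKey: (id: SttProviderId) => boolean,
): SttProvider[] {
  const selected: SttProvider[] = [];
  for (const id of preference) {
    const provider = registry.get(id);
    if (provider && hasApiKey(id)) selected.push(provider);
  }
  return selected;
}

// Assumed policy: try each provider in turn; the first successful transcript wins.
async function transcribeWithFallback(
  providers: SttProvider[],
  request: SttRequest,
): Promise<string> {
  const errors: string[] = [];
  for (const provider of providers) {
    try {
      return await provider.transcribe(request);
    } catch (err) {
      errors.push(`${provider.id}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`No STT provider succeeded (${errors.join("; ") || "none configured"})`);
}
```

Passing `hasApiKey` in as a callback keeps the sketch independent of how credentials are actually resolved.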
resolveApiKeyForProvider from src/auth/ (after the #419 relocation, "refactor(auth): relocate auth-profiles from src/agents/ to src/auth/")
Wire into ChannelBridge media routing (#387, "feat(middleware): propagate multimodal media through ChannelBridge and auto-reply"):
Preserve existing behavior:
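As a rough illustration of the ChannelBridge wiring item, the sketch below transcribes audio attachments before a message reaches any runtime. ChannelBridge's real types are not shown in this issue, so `Attachment`, `InboundMessage`, `Transcribe`, `partitionAudio`, and `applyStt` are all hypothetical. Detecting audio by an `audio/` MIME prefix and appending transcripts to the message text are assumptions, not documented behavior.

```typescript
// Hypothetical message shape; ChannelBridge's real types are not shown in this issue.
interface Attachment {
  mimeType: string;
  data: Uint8Array;
}

interface InboundMessage {
  text: string;
  attachments: Attachment[];
}

// Injected so the sketch does not depend on any particular STT provider.
type Transcribe = (audio: Attachment) => Promise<string>;

// Separate audio attachments from everything else by MIME type (assumed convention).
function partitionAudio(attachments: Attachment[]): { audio: Attachment[]; rest: Attachment[] } {
  const audio: Attachment[] = [];
  const rest: Attachment[] = [];
  for (const a of attachments) {
    (a.mimeType.startsWith("audio/") ? audio : rest).push(a);
  }
  return { audio, rest };
}

// Replace audio attachments with transcript text; non-audio media passes through untouched.
async function applyStt(message: InboundMessage, transcribe: Transcribe): Promise<InboundMessage> {
  const { audio, rest } = partitionAudio(message.attachments);
  if (audio.length === 0) return message;
  const transcripts = await Promise.all(audio.map((a) => transcribe(a)));
  const text = [message.text, ...transcripts].filter((s) => s.length > 0).join("\n");
  return { text, attachments: rest };
}
```

Because `applyStt` leaves image and video attachments in place, whichever runtime-dependent path handles them can run afterwards unchanged.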
Files to extract from
src/media-understanding/runner.ts: audio handling paths
src/media-understanding/runner.entries.ts: audio entry processing
src/media-understanding/providers/: audio provider implementations
Tests
Depends on
Related