refactor(media): extract STT from media-understanding into src/stt/ #424

## Context

Part of #415 Phase 3, item 9. Speech-to-text is a middleware concern, like TTS: it should work identically regardless of which CLI runtime is configured. It is currently embedded in the media-understanding subsystem, which is being decomposed.

## Rationale

- **STT is runtime-agnostic**: Voice messages need transcription before ANY runtime can process them. Even Gemini (which supports audio natively) benefits from middleware STT for consistent behavior.
- **Image/video understanding is runtime-dependent**: Some runtimes handle images natively (Gemini, Claude); others need a middleware fallback. This is NOT the same concern as STT.
- **TTS precedent**: TTS already exists as a standalone `src/tts/` module. STT should mirror this structure.

## Scope

1. Create `src/stt/` module:
   - Extract audio transcription logic from `src/media-understanding/`
   - Provider support: parity with whatever `src/media-understanding/` currently supports for audio (Deepgram, OpenAI Whisper, Google)
   - API key resolution via `resolveApiKeyForProvider` from `src/auth/` (after #419 relocation)

2. Wire into ChannelBridge media routing (#387):
   - When audio arrives and the runtime doesn't accept audio natively → run STT → prepend the transcript to the prompt
   - When the runtime accepts audio natively → pass the audio through (skip STT)

3. Preserve existing behavior:
   - Voice messages should continue working for all runtimes
   - Transcription quality and provider selection unchanged
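
A rough sketch of the routing decision in item 2, to make the intended behavior concrete. Every name here (`RuntimeCapabilities`, `AudioMedia`, `routeAudio`, `prependTranscript`, the transcript header text) is hypothetical; the actual ChannelBridge types and the final `src/stt/` entry point may look different.

```typescript
// Hypothetical shapes; the real ChannelBridge and runtime types may differ.
interface RuntimeCapabilities {
  acceptsAudio: boolean;
}

interface AudioMedia {
  kind: "audio";
  path: string;
}

// Assumed signature for the extracted STT entry point.
type Transcribe = (media: AudioMedia) => Promise<string>;

type RoutedInput =
  | { mode: "passthrough"; prompt: string; media: AudioMedia }
  | { mode: "transcribed"; prompt: string };

// Prepend the transcript so the runtime sees the spoken content first.
function prependTranscript(prompt: string, transcript: string): string {
  const header = `[Voice message transcript]\n${transcript.trim()}`;
  return prompt.length > 0 ? `${header}\n\n${prompt}` : header;
}

async function routeAudio(
  media: AudioMedia,
  prompt: string,
  runtime: RuntimeCapabilities,
  transcribe: Transcribe,
): Promise<RoutedInput> {
  if (runtime.acceptsAudio) {
    // Runtime handles audio natively: skip STT and pass the media through.
    return { mode: "passthrough", prompt, media };
  }
  // Runtime cannot take audio: transcribe and fold the text into the prompt.
  const transcript = await transcribe(media);
  return { mode: "transcribed", prompt: prependTranscript(prompt, transcript) };
}
```

Passing `transcribe` in as a parameter keeps the routing logic independent of any specific provider, which should also make the integration test easier to write.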

## Files to extract from

- `src/media-understanding/runner.ts` — audio handling paths
- `src/media-understanding/runner.entries.ts` — audio entry processing
- `src/media-understanding/providers/` — audio provider implementations

## Tests

- STT produces transcript from audio file (unit, mocked provider)
- STT provider selection follows config
- STT credentials resolved from auth profiles
- Integration: voice message → STT → text prompt → runtime
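
For the first two test cases, something along these lines might work as a starting point. This is only a sketch: `SttProvider`, `SttConfig`, `selectProvider`, and `createMockProvider` are illustrative names, not the existing provider API in `src/media-understanding/providers/`.

```typescript
// Hypothetical provider interface; names are illustrative, not the project's API.
interface SttProvider {
  id: string;
  transcribe(audioPath: string): Promise<string>;
}

interface SttConfig {
  provider: string;
}

// Resolve the configured provider from a registry, failing loudly when unknown.
function selectProvider(
  config: SttConfig,
  registry: Map<string, SttProvider>,
): SttProvider {
  const provider = registry.get(config.provider);
  if (!provider) {
    throw new Error(`STT provider not registered: ${config.provider}`);
  }
  return provider;
}

// Mocked provider that records calls instead of hitting a network API.
function createMockProvider(id: string, transcript: string) {
  const calls: string[] = [];
  const provider: SttProvider = {
    id,
    async transcribe(audioPath: string): Promise<string> {
      calls.push(audioPath);
      return transcript;
    },
  };
  return { provider, calls };
}
```

A unit test could register two mocks under different ids, point the config at one of them, and assert that only that mock's `calls` array receives the audio path.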

## Depends on

- #419 — auth relocation (for credential resolution)
- #387 — middleware multimodal propagation (for the routing decision)

## Related

- #415 — implementation plan (parent)
- #400 — limitation notices (when STT is not configured)
