Overview
Comprehensive plan for unified credential management and multimodal media support across all CLI runtimes. Successor to #376 (which captured the research, spikes, and architectural evolution).
Background
#376 started as "gut auth profiles + media understanding" and evolved through several rounds of analysis into a restructuring plan:
- Auth profiles are mis-wired, not useless — should be middleware-wide, not agent-specific
- Media understanding should be decomposed: STT → middleware, image/video → runtime-dependent
- TTS has its own credential silo that bypasses auth profiles (upstream design gap)
- CLI runtimes can accept multimodal input (Gemini best, Claude images-only, Codex/OpenCode blocked upstream)
Key decisions are documented in the #376 comments.
Phase 1: Auth Foundation ✅
Unified credential management with per-agent key rotation.
| # | Task | Issue | Status |
|---|------|-------|--------|
| 1 | Relocate src/agents/auth-profiles/ → src/auth/ | #419 | done ✅ |
| 2 | Add auth config field (auth?: false \| string \| string[]) | #421 | done ✅ |
| 3 | Wire auth profile → CLI env injection | #422 | done ✅ |
| 4 | Retry with rotated key on rate-limit | #423 | done ✅ |
| 5 | Adapt onboarding wizard | #417 | done ✅ |
| 6 | Adapt OpenClaw import | #427 | done ✅ |
| 7 | Relocate auth store to global path + strip legacy migration | #438 | done ✅ |
| 8 | Import: consolidate per-agent auth into global store | #439 | done ✅ |
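To illustrate how tasks 2 and 4 might fit together, here is a minimal sketch, not the actual implementation. It assumes the `auth` field normalizes to an ordered list of profile ids and that a rate-limit failure is distinguishable from other errors. `AuthField`, `normalizeAuth`, `RateLimitError`, and `runWithRotation` are hypothetical names introduced for this example only.

```typescript
// Hypothetical shape of the per-agent auth field (auth?: false | string | string[]).
type AuthField = false | string | string[] | undefined;

// Normalize the field into an ordered list of profile ids.
// Assumption: both `false` and an omitted field mean "inject no credentials".
function normalizeAuth(auth: AuthField): string[] {
  if (auth === false || auth === undefined) return [];
  return typeof auth === "string" ? [auth] : [...auth];
}

// Hypothetical error type standing in for however a runtime reports a rate limit.
class RateLimitError extends Error {}

// Try each profile in order, moving on to the next one only after a rate-limit error.
// Any other error is rethrown immediately.
async function runWithRotation<T>(
  profiles: string[],
  run: (profile: string) => Promise<T>,
): Promise<T> {
  let lastError: unknown = new Error("no auth profiles configured");
  for (const profile of profiles) {
    try {
      return await run(profile);
    } catch (err) {
      if (!(err instanceof RateLimitError)) throw err;
      lastError = err;
    }
  }
  throw lastError;
}
```

With `auth: ["primary", "backup"]`, a rate-limited call on `primary` would be retried once with `backup`. If every profile is rate-limited, the last rate-limit error surfaces to the caller.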
Phase 2: Multimodal Contract + Routing ✅
AgentRuntime multimodal contract and ChannelBridge media routing.
| # | Task | Issue | Status |
|---|------|-------|--------|
| 9 | AgentRuntime multimodal contract (MediaAttachment, mediaCapabilities) | #385 | done ✅ |
| 10 | Fix buildChannelMessage mediaUrls wiring | #384 | done ✅ |
| 11 | ChannelBridge media routing (capability check → passthrough or fallback) | #387 | done ✅ |
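The "capability check → passthrough or fallback" routing in task 11 can be sketched roughly as follows. The type shapes here are assumptions made for illustration: the real `MediaAttachment` and `mediaCapabilities` definitions from #385 may look different, and `RuntimeLike`, `RoutedMedia`, and `routeMedia` are hypothetical names.

```typescript
// Hypothetical media kinds; the real contract may use different categories.
type MediaKind = "image" | "audio" | "video" | "document";

// Assumed minimal attachment shape, not the actual MediaAttachment type.
interface MediaAttachment {
  kind: MediaKind;
  url: string;
}

// Assumed runtime surface: a set of media kinds the runtime accepts natively.
interface RuntimeLike {
  mediaCapabilities: ReadonlySet<MediaKind>;
}

interface RoutedMedia {
  passthrough: MediaAttachment[]; // handed to the runtime as-is
  fallback: MediaAttachment[];    // needs middleware handling or a user-facing notice
}

// Split attachments by whether the runtime declares support for their kind.
function routeMedia(runtime: RuntimeLike, attachments: MediaAttachment[]): RoutedMedia {
  const routed: RoutedMedia = { passthrough: [], fallback: [] };
  for (const attachment of attachments) {
    const target = runtime.mediaCapabilities.has(attachment.kind)
      ? routed.passthrough
      : routed.fallback;
    target.push(attachment);
  }
  return routed;
}
```

Under this sketch, an images-only runtime such as Claude would pass an image through and send a voice note to the fallback path, where STT or a limitation message can take over.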
Phase 3: Gemini Multimodal ✅
Gemini gets full native multimodal — images, audio, video, PDF.
| # | Task | Issue | Status |
|---|------|-------|--------|
| 12 | Gemini runtime multimodal (@path syntax, temp files) | #397 | done ✅ |
Phase 4: Claude Multimodal ✅
Claude gets native image support via stdin stream-json refactor.
| # | Task | Issue | Status |
|---|------|-------|--------|
| 13 | Claude runtime multimodal (stdin stream-json, images) | #396 | done ✅ |
Phase 5: STT + User Feedback ✅
Voice messages work end-to-end for all runtimes. Clear feedback when media can't be processed.
| # | Task | Issue | Blocked by | Status |
|---|------|-------|------------|--------|
| 14 | Extract STT from src/media-understanding/ → src/stt/ | #424 | — | done ✅ |
| 15 | Communicate multimodal limitations to users | #400 | #424 | done ✅ |
Phase 6: TTS Credential Unification ✅
TTS joins unified auth. All credentials through one system.
| # | Task | Issue | Blocked by | Status |
|---|------|-------|------------|--------|
| 16 | Add ElevenLabs to auth provider system | #403 | — | done ✅ |
| 17 | TTS uses resolveApiKeyForProvider from src/auth/ | #402 | #403 | done ✅ |
Phase 7: Voice Channel Validation ✅
Voice-only channels require STT/TTS auth credentials.
| # | Task | Issue | Blocked by | Status |
|---|------|-------|------------|--------|
| 18 | Require STT/TTS auth for voice-only channels | #471 | #424, #402, #403 | done ✅ |
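As a rough illustration of the check in task 18, the sketch below rejects a voice-only channel when either STT or TTS credentials cannot be resolved. Everything here is hypothetical: `ChannelLike`, `CredentialLookup`, and `validateVoiceChannel` are invented for the example, and the lookup callback merely stands in for whatever credential resolution src/auth/ actually provides.

```typescript
// Hypothetical channel shape; only the fields this check needs.
interface ChannelLike {
  name: string;
  voiceOnly: boolean;
}

// Stand-in for the real credential resolution; returns undefined when nothing is configured.
type CredentialLookup = (capability: "stt" | "tts") => string | undefined;

// Return a list of human-readable problems; an empty list means the channel is valid.
function validateVoiceChannel(channel: ChannelLike, lookup: CredentialLookup): string[] {
  if (!channel.voiceOnly) return []; // text-capable channels can degrade gracefully
  const problems: string[] = [];
  for (const capability of ["stt", "tts"] as const) {
    if (!lookup(capability)) {
      problems.push(`${channel.name}: missing ${capability.toUpperCase()} credentials`);
    }
  }
  return problems;
}
```

Returning a list rather than throwing on the first failure lets startup validation report both missing credentials in one pass.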
Phase 8: Cleanup ✅
Remove dead code after all phases land.
| # | Task | Issue | Blocked by | Status |
|---|------|-------|------------|--------|
| 19 | Remove dead media understanding code (multi-provider vision runner) | #425 | #424 ✅ | done ✅ |
| 20 | Remove dead auth profile consumers (session overrides, directive handlers) | #426 | #402 | done ✅ |
Parallelization
Phase 1 ✅ ── Phase 2 ✅
             │
             ├── Phase 3 ✅ (Gemini #397) ───────────┐
             ├── Phase 4 ✅ (Claude #396) ───────────┤
             ├── Phase 5 ✅ (STT #424 → #400) ───────┼── Phase 7 ✅ (voice #471)
             └── Phase 6 ✅ (TTS auth #403 → #402) ──┤
                                                    └── Phase 8 ✅ (cleanup #425, #426)
Phases 3, 4, 5, 6 are all independent and can run in parallel. Phase 7 needs Phases 5+6. Phase 8 needs all of them.
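The ordering described above can be checked mechanically. The sketch below encodes the phase dependencies as a plain map and groups phases into waves that can run in parallel. The map is one reading of the diagram and the sentence above: it assumes Phase 2 follows Phase 1, Phases 3 to 6 each follow Phase 2, and "all of them" for Phase 8 includes Phase 7. `phaseDeps` and `parallelWaves` are names made up for this example.

```typescript
// Phase number -> phases that must land first (an interpretation of the diagram).
const phaseDeps: Record<number, number[]> = {
  1: [],
  2: [1],
  3: [2],
  4: [2],
  5: [2],
  6: [2],
  7: [5, 6],
  8: [3, 4, 5, 6, 7],
};

// Group phases into waves: every phase in a wave has all its dependencies
// satisfied by earlier waves, so phases within one wave can run in parallel.
function parallelWaves(deps: Record<number, number[]>): number[][] {
  const all = Object.keys(deps).map(Number);
  const done = new Set<number>();
  const waves: number[][] = [];
  while (done.size < all.length) {
    const ready = all.filter(
      (phase) => !done.has(phase) && deps[phase].every((d) => done.has(d)),
    );
    if (ready.length === 0) throw new Error("dependency cycle detected");
    ready.forEach((phase) => done.add(phase));
    waves.push(ready);
  }
  return waves;
}

console.log(JSON.stringify(parallelWaves(phaseDeps)));
// prints [[1],[2],[3,4,5,6],[7],[8]]
```

The third wave is the parallel window: Gemini, Claude, STT, and TTS auth work have no edges between them.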
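Out of scope (for now)
- AgentMediaEvent: no runtime emits media today; add when needed
- remoteclaw auth add CLI command: natural follow-up, not scoped here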
All related issues
| Issue | Title | Phase | Status |
|-------|-------|-------|--------|
| #375 | runtimeEnv config field | prereq | done ✅ |
| #376 | Auth/media research and architectural evolution | predecessor | superseded |
| #384 | buildChannelMessage never populates mediaUrls | 2 | done ✅ |
| #385 | AgentRuntime multimodal contract | 2 | done ✅ |
| #386 | Per-runtime multimodal (tracking) | 3-4 | tracking |
| #387 | Middleware multimodal propagation | 2 | done ✅ |
| #396 | Claude runtime multimodal | 4 | done ✅ |
| #397 | Gemini runtime multimodal | 3 | done ✅ |
| #398 | Codex runtime multimodal (blocked upstream) | — | out of scope |
| #399 | OpenCode runtime multimodal (blocked upstream) | — | out of scope |
| #400 | Communicate multimodal limitations | 5 | done ✅ |
| #402 | TTS auth profile integration | 6 | done ✅ |
| #403 | ElevenLabs auth provider | 6 | done ✅ |
| #417 | Onboarding wizard adaptation | 1 | done ✅ |
| #419 | Auth profiles relocation | 1 | done ✅ |
| #421 | Per-agent auth config field | 1 | done ✅ |
| #422 | Auth profile → CLI env injection | 1 | done ✅ |
| #423 | Retry with rotated key on rate-limit | 1 | done ✅ |
| #424 | STT extraction to src/stt/ | 5 | done ✅ |
| #425 | Remove dead media understanding code | 8 | done ✅ |
| #426 | Remove dead auth profile consumers | 8 | done ✅ |
| #427 | OpenClaw import adaptation | 1 | done ✅ |
| #438 | Auth store global relocation | 1 | done ✅ |
| #439 | Import: consolidate per-agent auth | 1 | done ✅ |
| #471 | Voice channel STT/TTS validation | 7 | done ✅ |
| #478 | Wire auxiliary provider auth flags | 6 | done ✅ |
| #497 | Plugin SDK: custom STT providers | — | done ✅ |
| #498 | Plugin SDK: custom TTS providers | — | done ✅ |
| #374 | CLIRuntimeBase stderr swallowing | — | independent |