feat: unified auth + multimodal AgentRuntime — implementation plan #415

@alexey-pelykh

Overview

Comprehensive plan for unified credential management and multimodal media support across all CLI runtimes. Successor to #376 (which captured the research, spikes, and architectural evolution).

Background

#376 started as "gut auth profiles + media understanding" and evolved through several rounds of analysis into a restructuring plan:

  • Auth profiles are mis-wired, not useless — should be middleware-wide, not agent-specific
  • Media understanding should be decomposed: STT → middleware, image/video → runtime-dependent
  • TTS has its own credential silo that bypasses auth profiles (upstream design gap)
  • CLI runtimes can accept multimodal input (Gemini best, Claude images-only, Codex/OpenCode blocked upstream)

Key decisions are documented in the comments on #376.


Phase 1: Auth Foundation ✅

Unified credential management with per-agent key rotation.

| # | Task | Issue | Status |
| --- | --- | --- | --- |
| 1 | Relocate src/agents/auth-profiles/ → src/auth/ | #419 | done ✅ |
| 2 | Add auth config field (auth?: false \| string \| string[]) | #421 | done ✅ |
| 3 | Wire auth profile → CLI env injection | #422 | done ✅ |
| 4 | Retry with rotated key on rate-limit | #423 | done ✅ |
| 5 | Adapt onboarding wizard | #417 | done ✅ |
| 6 | Adapt OpenClaw import | #427 | done ✅ |
| 7 | Relocate auth store to global path + strip legacy migration | #438 | done ✅ |
| 8 | Import: consolidate per-agent auth into global store | #439 | done ✅ |
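
A minimal sketch of how the `auth` field and rate-limit rotation could fit together. Only the field type `false | string | string[]` comes from this plan. The helper names, the `"default"` profile id, the meaning of an omitted field, and the idea that rotation walks the listed profiles in order are all assumptions made for illustration.

```typescript
// Type of the per-agent `auth` config field, as stated in the plan.
type AuthConfig = false | string | string[] | undefined;

// Normalize the field into an ordered list of profile ids.
// ASSUMPTION: `false` disables credential injection, and an omitted
// field falls back to a hypothetical "default" profile.
function resolveProfileIds(auth: AuthConfig, defaultProfile = "default"): string[] {
  if (auth === false) return [];
  if (auth === undefined) return [defaultProfile];
  return typeof auth === "string" ? [auth] : auth.slice();
}

// Try each profile in order, moving to the next one only when the
// failure is classified as a rate limit. Other errors propagate at once.
// `run` and `isRateLimit` are hypothetical callbacks supplied by the caller.
async function runWithRotation<T>(
  profileIds: string[],
  run: (profileId: string) => Promise<T>,
  isRateLimit: (err: unknown) => boolean,
): Promise<T> {
  let lastError: unknown = new Error("no auth profiles configured");
  for (const id of profileIds) {
    try {
      return await run(id);
    } catch (err) {
      if (!isRateLimit(err)) throw err;
      lastError = err; // remember the rate-limit error and rotate
    }
  }
  throw lastError;
}
```

With `auth: ["work", "personal"]`, a rate-limited run under `work` would be retried once under `personal`. If every profile is rate-limited, the last rate-limit error is rethrown.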

Phase 2: Multimodal Contract + Routing ✅

AgentRuntime multimodal contract and ChannelBridge media routing.

| # | Task | Issue | Status |
| --- | --- | --- | --- |
| 9 | AgentRuntime multimodal contract (MediaAttachment, mediaCapabilities) | #385 | done ✅ |
| 10 | Fix buildChannelMessage mediaUrls wiring | #384 | done ✅ |
| 11 | ChannelBridge media routing (capability check → passthrough or fallback) | #387 | done ✅ |
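
The capability check in task 11 can be sketched roughly as below. `MediaAttachment` and `mediaCapabilities` are names from this plan, but their fields, the `MediaKind` union, `AgentRuntimeLike`, and `routeMedia` are hypothetical and not the actual contract from #385.

```typescript
// Hypothetical media categories; the real contract may slice these differently.
type MediaKind = "image" | "audio" | "video" | "document";

// Assumed shape of an attachment passed from a channel to a runtime.
interface MediaAttachment {
  kind: MediaKind;
  mimeType: string;
  url: string;
}

// Assumed minimal view of a runtime: which kinds it accepts natively.
interface AgentRuntimeLike {
  name: string;
  mediaCapabilities: MediaKind[];
}

interface RoutedMedia {
  passthrough: MediaAttachment[]; // handed to the runtime as-is
  fallback: MediaAttachment[];    // needs middleware handling or user feedback
}

// Split attachments by whether the runtime declares support for their kind.
function routeMedia(runtime: AgentRuntimeLike, attachments: MediaAttachment[]): RoutedMedia {
  const routed: RoutedMedia = { passthrough: [], fallback: [] };
  for (const attachment of attachments) {
    const supported = runtime.mediaCapabilities.indexOf(attachment.kind) !== -1;
    (supported ? routed.passthrough : routed.fallback).push(attachment);
  }
  return routed;
}
```

Under these assumptions, an images-only runtime would receive a photo directly, while a voice note sent to the same runtime would land in `fallback`, where STT or a limitation message could take over.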

Phase 3: Gemini Multimodal ✅

Gemini gets full native multimodal — images, audio, video, PDF.

| # | Task | Issue | Status |
| --- | --- | --- | --- |
| 12 | Gemini runtime multimodal (@path syntax, temp files) | #397 | done ✅ |
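
As an illustration of the @path approach, the sketch below builds a prompt that references already-written temp files. The helper name and the decision to reject paths containing whitespace are assumptions; writing and cleaning up the temp files is left out.

```typescript
// Hypothetical helper: prefix each temp file path with "@" so the CLI
// treats it as a file reference, then append the user's text.
// ASSUMPTION: temp paths are generated without whitespace, so no escaping
// is attempted here; a path with whitespace is rejected instead.
function buildGeminiPrompt(text: string, tempFilePaths: string[]): string {
  const refs = tempFilePaths.map((path) => {
    if (/\s/.test(path)) {
      throw new Error("temp file path must not contain whitespace: " + path);
    }
    return "@" + path;
  });
  return refs.concat([text]).join(" ");
}

// buildGeminiPrompt("Describe this image", ["/tmp/upload-1.png"])
// returns "@/tmp/upload-1.png Describe this image"
```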

Phase 4: Claude Multimodal ✅

Claude gets native image support via stdin stream-json refactor.

| # | Task | Issue | Status |
| --- | --- | --- | --- |
| 13 | Claude runtime multimodal (stdin stream-json, images) | #396 | done ✅ |
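
A rough sketch of what one stdin line could look like when images travel as stream-json. The text and image content blocks follow the Anthropic Messages API shape. The outer `{ type: "user", message: ... }` envelope and the one-JSON-object-per-line framing are assumptions about what the CLI expects and should be checked against its documentation. `ImageInput` and `buildUserMessageLine` are hypothetical names.

```typescript
// Hypothetical input: an image already read and base64-encoded by the caller.
interface ImageInput {
  mediaType: string;  // e.g. "image/png"
  base64Data: string; // base64 payload without a data: URL prefix
}

// Serialize one user turn as a single newline-terminated JSON line.
function buildUserMessageLine(text: string, images: ImageInput[]): string {
  const imageBlocks = images.map((img) => ({
    type: "image",
    source: { type: "base64", media_type: img.mediaType, data: img.base64Data },
  }));
  const content = [{ type: "text", text }, ...imageBlocks];
  // ASSUMED envelope; only the content blocks mirror a documented API shape.
  return JSON.stringify({ type: "user", message: { role: "user", content } }) + "\n";
}
```

The runtime would write this line to the child process's stdin instead of passing the prompt as a command-line argument, which is what makes binary attachments feasible.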

Phase 5: STT + User Feedback ✅

Voice messages work end-to-end for all runtimes. Clear feedback when media can't be processed.

| # | Task | Issue | Blocked by | Status |
| --- | --- | --- | --- | --- |
| 14 | Extract STT from src/media-understanding/ → src/stt/ | #424 | | done ✅ |
| 15 | Communicate multimodal limitations to users | #400 | #424 | done ✅ |

Phase 6: TTS Credential Unification ✅

TTS joins unified auth. All credentials through one system.

| # | Task | Issue | Blocked by | Status |
| --- | --- | --- | --- | --- |
| 16 | Add ElevenLabs to auth provider system | #403 | | done ✅ |
| 17 | TTS uses resolveApiKeyForProvider from src/auth/ | #402 | #403 | done ✅ |

Phase 7: Voice Channel Validation ✅

Voice-only channels require STT/TTS auth credentials.

| # | Task | Issue | Blocked by | Status |
| --- | --- | --- | --- | --- |
| 18 | Require STT/TTS auth for voice-only channels | #471 | #424, #402, #403 | done ✅ |
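
The validation rule could look something like the sketch below. Every name here (`ChannelLike`, `SpeechCredentials`, `validateVoiceChannels`) and the config shape are hypothetical. Only the rule itself, that voice-only channels need both STT and TTS credentials, comes from this plan.

```typescript
// Hypothetical view of a configured channel.
interface ChannelLike {
  id: string;
  voiceOnly: boolean;
}

// Hypothetical result of resolving speech credentials through unified auth.
interface SpeechCredentials {
  sttApiKey?: string;
  ttsApiKey?: string;
}

// Return one error message per missing credential per voice-only channel.
// An empty array means the configuration passes this check.
function validateVoiceChannels(channels: ChannelLike[], creds: SpeechCredentials): string[] {
  const errors: string[] = [];
  for (const channel of channels) {
    if (!channel.voiceOnly) continue; // text-capable channels can degrade gracefully
    if (!creds.sttApiKey) errors.push(channel.id + ": voice-only channel requires STT credentials");
    if (!creds.ttsApiKey) errors.push(channel.id + ": voice-only channel requires TTS credentials");
  }
  return errors;
}
```

Collecting all errors instead of throwing on the first one lets a startup check report every misconfigured channel in a single pass.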

Phase 8: Cleanup ✅

Remove dead code after all phases land.

| # | Task | Issue | Blocked by | Status |
| --- | --- | --- | --- | --- |
| 19 | Remove dead media understanding code (multi-provider vision runner) | #425 | #424 | done ✅ |
| 20 | Remove dead auth profile consumers (session overrides, directive handlers) | #426 | #402 | done ✅ |

Parallelization

Phase 1 ✅ ── Phase 2 ✅
                │
                ├── Phase 3 ✅ (Gemini #397) ──────────┐
                ├── Phase 4 ✅ (Claude #396) ───────────┤
                ├── Phase 5 ✅ (STT #424 → #400) ────┼── Phase 7 ✅ (voice #471)
                └── Phase 6 ✅ (TTS auth #403 → #402) ──┤
                                                      └── Phase 8 ✅ (cleanup #425, #426)

Phases 3, 4, 5, and 6 are independent of each other and can run in parallel once Phase 2 lands. Phase 7 needs Phases 5 and 6. Phase 8 needs all of them.
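
The ordering described here can be double-checked with a small sketch that groups phases into waves, where every phase in a wave has all of its dependencies satisfied by earlier waves. The dependency map is read off the diagram and the sentence above. `phaseDeps` and `parallelWaves` are illustrative names, not part of the codebase.

```typescript
// Phase -> phases it depends on, as read from the diagram and text.
const phaseDeps: Record<number, number[]> = {
  1: [],
  2: [1],
  3: [2],
  4: [2],
  5: [2],
  6: [2],
  7: [5, 6],
  8: [3, 4, 5, 6, 7],
};

// Repeatedly collect every pending phase whose dependencies are all done.
function parallelWaves(deps: Record<number, number[]>): number[][] {
  let pending = Object.keys(deps).map(Number).sort((a, b) => a - b);
  const done: number[] = [];
  const waves: number[][] = [];
  while (pending.length > 0) {
    const ready = pending.filter((phase) =>
      deps[phase].every((dep) => done.indexOf(dep) !== -1),
    );
    if (ready.length === 0) throw new Error("dependency cycle detected");
    waves.push(ready);
    done.push(...ready);
    pending = pending.filter((phase) => ready.indexOf(phase) === -1);
  }
  return waves;
}

// parallelWaves(phaseDeps) returns [[1], [2], [3, 4, 5, 6], [7], [8]]
```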

Out of scope (for now)

All related issues

| Issue | Title | Phase | Status |
| --- | --- | --- | --- |
| #375 | runtimeEnv config field | prereq | done ✅ |
| #376 | Auth/media research and architectural evolution | predecessor | superseded |
| #384 | buildChannelMessage never populates mediaUrls | 2 | done ✅ |
| #385 | AgentRuntime multimodal contract | 2 | done ✅ |
| #386 | Per-runtime multimodal (tracking) | 3-4 | tracking |
| #387 | Middleware multimodal propagation | 2 | done ✅ |
| #396 | Claude runtime multimodal | 4 | done ✅ |
| #397 | Gemini runtime multimodal | 3 | done ✅ |
| #398 | Codex runtime multimodal (blocked upstream) | | out of scope |
| #399 | OpenCode runtime multimodal (blocked upstream) | | out of scope |
| #400 | Communicate multimodal limitations | 5 | done ✅ |
| #402 | TTS auth profile integration | 6 | done ✅ |
| #403 | ElevenLabs auth provider | 6 | done ✅ |
| #417 | Onboarding wizard adaptation | 1 | done ✅ |
| #419 | Auth profiles relocation | 1 | done ✅ |
| #421 | Per-agent auth config field | 1 | done ✅ |
| #422 | Auth profile → CLI env injection | 1 | done ✅ |
| #423 | Retry with rotated key on rate-limit | 1 | done ✅ |
| #424 | STT extraction to src/stt/ | 5 | done ✅ |
| #425 | Remove dead media understanding code | 8 | done ✅ |
| #426 | Remove dead auth profile consumers | 8 | done ✅ |
| #427 | OpenClaw import adaptation | 1 | done ✅ |
| #438 | Auth store global relocation | 1 | done ✅ |
| #439 | Import: consolidate per-agent auth | 1 | done ✅ |
| #471 | Voice channel STT/TTS validation | 7 | done ✅ |
| #478 | Wire auxiliary provider auth flags | 6 | done ✅ |
| #497 | Plugin SDK: custom STT providers | | done ✅ |
| #498 | Plugin SDK: custom TTS providers | | done ✅ |
| #374 | CLIRuntimeBase stderr swallowing | independent | |
