feat: unified auth + multimodal AgentRuntime — implementation plan

## Overview

Comprehensive plan for unified credential management and multimodal media support across all CLI runtimes. Successor to #376 (which captured the research, spikes, and architectural evolution).

## Background

#376 started as "gut auth profiles + media understanding" and evolved through several rounds of analysis into a restructuring plan:

- Auth profiles are **mis-wired, not useless** — should be middleware-wide, not agent-specific
- Media understanding should be **decomposed**: STT → middleware, image/video → runtime-dependent
- TTS has its own credential silo that bypasses auth profiles (upstream design gap)
- CLI runtimes can accept multimodal input (Gemini best, Claude images-only, Codex/OpenCode blocked upstream)

Key decisions, documented in comments on #376 and on this issue:
- [Spike results: CLI media capabilities](https://github.com/remoteclaw/remoteclaw/issues/376#issuecomment-4017564597)
- [Revised architecture: repurpose not gut](https://github.com/remoteclaw/remoteclaw/issues/376#issuecomment-4018450250)
- [TTS credential fragmentation: upstream gap](https://github.com/remoteclaw/remoteclaw/issues/376#issuecomment-4018451810)
- [Per-agent auth config design](https://github.com/remoteclaw/remoteclaw/issues/376#issuecomment-4018528312)
- [Onboarding wizard impact: stays single-key simple](https://github.com/remoteclaw/remoteclaw/issues/415#issuecomment-4018552007)

---

## Phase 1: Auth Foundation ✅

*Unified credential management with per-agent key rotation.*

| # | Task | Issue | Status |
|---|------|-------|--------|
| 1 | Relocate `src/agents/auth-profiles/` → `src/auth/` | #419 | done ✅ |
| 2 | Add `auth` config field (`auth?: false \| string \| string[]`) | #421 | done ✅ |
| 3 | Wire auth profile → CLI env injection | #422 | done ✅ |
| 4 | Retry with rotated key on rate-limit | #423 | done ✅ |
| 5 | Adapt onboarding wizard | #417 | done ✅ |
| 6 | Adapt OpenClaw import | #427 | done ✅ |
| 7 | Relocate auth store to global path + strip legacy migration | #438 | done ✅ |
| 8 | Import: consolidate per-agent auth into global store | #439 | done ✅ |
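
Tasks 2 and 4 can be pictured together with a minimal TypeScript sketch. Only the `auth?: false | string | string[]` shape comes from the plan. The helper names `normalizeAuth` and `nextProfile`, the `"default"` fallback, and the meaning assigned to each value are assumptions for illustration, not the project's actual API.

```typescript
// Shape of the per-agent `auth` field as given in task 2.
type AuthField = false | string | string[];

// Hypothetical helper: turn the field into an ordered list of profile IDs.
// Assumed semantics: `false` disables key injection, a missing field falls
// back to a default profile, a string names one profile, and an array lists
// profiles in rotation order.
function normalizeAuth(auth: AuthField | undefined, fallback = "default"): string[] {
  if (auth === false) return [];
  if (auth === undefined) return [fallback];
  return typeof auth === "string" ? [auth] : auth.slice();
}

// Hypothetical helper: pick the first profile that has not been rate-limited.
// Returns undefined once every profile is exhausted.
function nextProfile(
  profiles: readonly string[],
  rateLimited: ReadonlySet<string>,
): string | undefined {
  return profiles.find((id) => !rateLimited.has(id));
}
```

A retry loop built on this would call `nextProfile`, run the CLI with that profile's key injected into its environment, and add the profile to `rateLimited` whenever the run fails with a rate-limit error.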

---

## Phase 2: Multimodal Contract + Routing ✅

*AgentRuntime multimodal contract and ChannelBridge media routing.*

| # | Task | Issue | Status |
|---|------|-------|--------|
| 9 | AgentRuntime multimodal contract (`MediaAttachment`, `mediaCapabilities`) | #385 | done ✅ |
| 10 | Fix `buildChannelMessage` mediaUrls wiring | #384 | done ✅ |
| 11 | ChannelBridge media routing (capability check → passthrough or fallback) | #387 | done ✅ |
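
The capability check in task 11 can be sketched roughly as follows. The names `MediaAttachment` and `mediaCapabilities` come from the plan, but every field, the `MediaKind` union, and the `routeMedia` helper are assumptions made for illustration.

```typescript
// Hypothetical shapes: only the names `MediaAttachment` and
// `mediaCapabilities` appear in the plan, the fields are assumed.
type MediaKind = "image" | "audio" | "video" | "pdf";

interface MediaAttachment {
  kind: MediaKind;
  mimeType: string;
  url: string;
}

interface RuntimeLike {
  name: string;
  mediaCapabilities: ReadonlySet<MediaKind>;
}

interface RoutedMedia {
  passthrough: MediaAttachment[]; // handed to the runtime natively
  fallback: MediaAttachment[]; // needs another path, e.g. STT or a notice to the user
}

// Split attachments by whether the runtime declares support for their kind.
function routeMedia(
  runtime: RuntimeLike,
  attachments: readonly MediaAttachment[],
): RoutedMedia {
  const routed: RoutedMedia = { passthrough: [], fallback: [] };
  for (const attachment of attachments) {
    const supported = runtime.mediaCapabilities.has(attachment.kind);
    (supported ? routed.passthrough : routed.fallback).push(attachment);
  }
  return routed;
}
```

With an images-only runtime, an image attachment lands in `passthrough` while a voice note lands in `fallback`.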

---

## Phase 3: Gemini Multimodal ✅

*Gemini gets full native multimodal — images, audio, video, PDF.*

| # | Task | Issue | Status |
|---|------|-------|--------|
| 12 | Gemini runtime multimodal (`@path` syntax, temp files) | #397 | done ✅ |
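
As a rough illustration of the `@path` approach in task 12, the sketch below builds a prompt that references attachments already written to temp files. The helper name `buildGeminiPrompt`, the placement of the references before the text, and the rejection of paths containing whitespace are assumptions, not the runtime's actual behavior.

```typescript
// Hypothetical helper: prefix the prompt with one `@path` reference per temp file.
// Assumes the files already exist on disk. Paths with whitespace are rejected
// here rather than escaped, to keep the sketch simple.
function buildGeminiPrompt(text: string, tempPaths: readonly string[]): string {
  for (const path of tempPaths) {
    if (/\s/.test(path)) {
      throw new Error(`path contains whitespace: ${path}`);
    }
  }
  const refs = tempPaths.map((path) => `@${path}`);
  return refs.concat([text]).join(" ");
}
```

For example, `buildGeminiPrompt("Describe this", ["/tmp/rc-1/photo.png"])` yields `@/tmp/rc-1/photo.png Describe this`.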

---

## Phase 4: Claude Multimodal ✅

*Claude gets native image support via stdin stream-json refactor.*

| # | Task | Issue | Status |
|---|------|-------|--------|
| 13 | Claude runtime multimodal (stdin stream-json, images) | #396 | done ✅ |
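
The stdin stream-json approach in task 13 amounts to writing one JSON object per line to the CLI's stdin. The sketch below shows one plausible shape for such a line. The image and text content blocks follow the Anthropic Messages API format, while the outer `{ type: "user", message }` envelope and the helper name `buildClaudeStdinLine` are assumptions that should be checked against the CLI's documentation.

```typescript
interface ImageInput {
  mediaType: "image/png" | "image/jpeg" | "image/gif" | "image/webp";
  base64: string; // image bytes, already base64-encoded
}

// Hypothetical helper: serialize one user turn as a single newline-terminated
// JSON line. The outer envelope is an assumed shape for stream-json input.
function buildClaudeStdinLine(text: string, images: readonly ImageInput[]): string {
  const imageBlocks = images.map((img) => ({
    type: "image",
    source: { type: "base64", media_type: img.mediaType, data: img.base64 },
  }));
  const content: object[] = [...imageBlocks, { type: "text", text }];
  const envelope = { type: "user", message: { role: "user", content } };
  return JSON.stringify(envelope) + "\n";
}
```

Because `JSON.stringify` escapes newlines inside strings, each turn stays on one line even when the user's text spans several.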

---

## Phase 5: STT + User Feedback ✅

*Voice messages work end-to-end for all runtimes. Clear feedback when media can't be processed.*

| # | Task | Issue | Blocked by | Status |
|---|------|-------|------------|--------|
| 14 | Extract STT from `src/media-understanding/` → `src/stt/` | #424 | — | done ✅ |
| 15 | Communicate multimodal limitations to users | #400 | #424 | done ✅ |

---

## Phase 6: TTS Credential Unification ✅

*TTS joins unified auth. All credentials through one system.*

| # | Task | Issue | Blocked by | Status |
|---|------|-------|------------|--------|
| 16 | Add ElevenLabs to auth provider system | #403 | — | done ✅ |
| 17 | TTS uses `resolveApiKeyForProvider` from `src/auth/` | #402 | #403 | done ✅ |

---

## Phase 7: Voice Channel Validation ✅

*Voice-only channels require STT/TTS auth credentials.*

| # | Task | Issue | Blocked by | Status |
|---|------|-------|------------|--------|
| 18 | Require STT/TTS auth for voice-only channels | #471 | #424, #402, #403 | done ✅ |
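
The rule in task 18 can be expressed as a small validation step. This is a sketch only: the `ChannelAuthState` shape, the `validateVoiceChannel` name, and the message wording are hypothetical and not taken from the codebase.

```typescript
// Hypothetical view of a channel's configuration after auth resolution.
interface ChannelAuthState {
  channelId: string;
  voiceOnly: boolean;
  hasSttCredential: boolean;
  hasTtsCredential: boolean;
}

// Return a list of problems. An empty list means the channel is valid.
// Channels that are not voice-only are never rejected by this check.
function validateVoiceChannel(state: ChannelAuthState): string[] {
  if (!state.voiceOnly) return [];
  const problems: string[] = [];
  if (!state.hasSttCredential) {
    problems.push(`${state.channelId}: voice-only channel needs an STT credential`);
  }
  if (!state.hasTtsCredential) {
    problems.push(`${state.channelId}: voice-only channel needs a TTS credential`);
  }
  return problems;
}
```

Returning every problem at once, rather than throwing on the first, lets a config check report missing STT and TTS credentials together.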

---

## Phase 8: Cleanup ✅

*Remove dead code after all phases land.*

| # | Task | Issue | Blocked by | Status |
|---|------|-------|------------|--------|
| 19 | Remove dead media understanding code (multi-provider vision runner) | #425 | #424 | done ✅ |
| 20 | Remove dead auth profile consumers (session overrides, directive handlers) | #426 | #402 | done ✅ |

---

## Parallelization

```
Phase 1 ✅ ── Phase 2 ✅
                │
                ├── Phase 3 ✅ (Gemini #397) ───────────┐
                ├── Phase 4 ✅ (Claude #396) ───────────┤
                ├── Phase 5 ✅ (STT #424 → #400) ───────┼── Phase 7 ✅ (voice #471)
                └── Phase 6 ✅ (TTS auth #403 → #402) ──┤
                                                      └── Phase 8 ✅ (cleanup #425, #426)
```

Phases 3, 4, 5, and 6 are independent of one another and can run in parallel once Phase 2 has landed. Phase 7 needs Phases 5 and 6. Phase 8 needs all earlier phases.
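
The ordering described here can be checked mechanically by grouping phases into waves, where each wave contains every phase whose dependencies are already complete. The sketch below encodes the dependencies as read from this section. The `phaseDeps` table and `computeWaves` helper are illustrative and not part of the codebase.

```typescript
// Phase dependencies as described in this section: 2 follows 1, phases 3 to 6
// follow 2, 7 needs 5 and 6, and 8 needs every earlier phase.
const phaseDeps: Record<number, number[]> = {
  1: [],
  2: [1],
  3: [2],
  4: [2],
  5: [2],
  6: [2],
  7: [5, 6],
  8: [3, 4, 5, 6, 7],
};

// Group phases into waves: a phase joins the first wave in which all of its
// dependencies are already done. Phases within one wave can run in parallel.
function computeWaves(deps: Record<number, number[]>): number[][] {
  const remaining = new Set<number>(Object.keys(deps).map(Number));
  const done = new Set<number>();
  const waves: number[][] = [];
  while (remaining.size > 0) {
    const ready = Array.from(remaining)
      .filter((phase) => (deps[phase] ?? []).every((dep) => done.has(dep)))
      .sort((a, b) => a - b);
    if (ready.length === 0) throw new Error("dependency cycle");
    for (const phase of ready) {
      remaining.delete(phase);
      done.add(phase);
    }
    waves.push(ready);
  }
  return waves;
}

console.log(JSON.stringify(computeWaves(phaseDeps)));
// prints [[1],[2],[3,4,5,6],[7],[8]]
```

The third wave is the parallel window: Gemini, Claude, STT, and TTS auth work can proceed side by side.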

## Out of scope (for now)

- Codex multimodal (#398) — blocked upstream ([codex#5773](https://github.com/openai/codex/issues/5773))
- OpenCode multimodal (#399) — blocked upstream (hardcoded text/plain MIME)
- Outbound `AgentMediaEvent` — no runtime emits media today, add when needed
- Per-request key rotation inside a CLI session — session-level rotation is sufficient
- Multi-key onboarding UX — wizard stays single-key, rotation configured post-wizard
- `remoteclaw auth add` CLI command — natural follow-up, not scoped here

## All related issues

| Issue | Title | Phase | Status |
|-------|-------|-------|--------|
| #375 | `runtimeEnv` config field | prereq | done ✅ |
| #376 | Auth/media research and architectural evolution | predecessor | superseded |
| #384 | `buildChannelMessage` never populates `mediaUrls` | 2 | done ✅ |
| #385 | AgentRuntime multimodal contract | 2 | done ✅ |
| #386 | Per-runtime multimodal (tracking) | 3-4 | tracking |
| #387 | Middleware multimodal propagation | 2 | done ✅ |
| #396 | Claude runtime multimodal | 4 | done ✅ |
| #397 | Gemini runtime multimodal | 3 | done ✅ |
| #398 | Codex runtime multimodal (blocked upstream) | — | out of scope |
| #399 | OpenCode runtime multimodal (blocked upstream) | — | out of scope |
| #400 | Communicate multimodal limitations | 5 | done ✅ |
| #402 | TTS auth profile integration | 6 | done ✅ |
| #403 | ElevenLabs auth provider | 6 | done ✅ |
| #417 | Onboarding wizard adaptation | 1 | done ✅ |
| #419 | Auth profiles relocation | 1 | done ✅ |
| #421 | Per-agent auth config field | 1 | done ✅ |
| #422 | Auth profile → CLI env injection | 1 | done ✅ |
| #423 | Retry with rotated key on rate-limit | 1 | done ✅ |
| #424 | STT extraction to `src/stt/` | 5 | done ✅ |
| #425 | Remove dead media understanding code | 8 | done ✅ |
| #426 | Remove dead auth profile consumers | 8 | done ✅ |
| #427 | OpenClaw import adaptation | 1 | done ✅ |
| #438 | Auth store global relocation | 1 | done ✅ |
| #439 | Import: consolidate per-agent auth | 1 | done ✅ |
| #471 | Voice channel STT/TTS validation | 7 | done ✅ |
| #478 | Wire auxiliary provider auth flags | 6 | done ✅ |
| #497 | Plugin SDK: custom STT providers | — | done ✅ |
| #498 | Plugin SDK: custom TTS providers | — | done ✅ |
| #374 | CLIRuntimeBase stderr swallowing | — | independent |


feat: unified auth + multimodal AgentRuntime — implementation plan #415
