Phase 1 (#139) lands a synchronous chat completion: replies block until the full LLM response arrives, typically 3–10s. Streaming over Server-Sent Events (SSE) would let tokens land incrementally so the user sees a typing animation backed by real content.
Scope
- Add LlmClient.generateStream(prompt, options) returning AsyncIterable<string> chunks. Anthropic and OpenAI both support SSE on their completion endpoints.
- Refactor AssistantService.reply() to optionally return an async iterable instead of a string (or add a sibling replyStream()).
- Refactor POST /api/assistant/messages to upgrade to SSE when the client requests it (Accept: text/event-stream). Backward-compatible: existing JSON path still works.
- Web client: detect SSE, render tokens as they arrive, swap the typing-dots bubble for the streaming text bubble.
- Persist the assistant message AFTER the stream closes (use the accumulated content).
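A rough sketch of how the server-side pieces above might fit together, for discussion only. Everything beyond LlmClient.generateStream(prompt, options) is an assumption: the GenerateOptions shape, the fakeLlmClient stand-in, the "token" and "done" event names, and the wantsSse, sseEvent, and streamReply helpers are all hypothetical and not taken from the codebase.

```typescript
// Hypothetical shapes: the issue only names LlmClient.generateStream(prompt, options).
interface GenerateOptions {
  maxTokens?: number;
}

interface LlmClient {
  generateStream(prompt: string, options?: GenerateOptions): AsyncIterable<string>;
}

// Stand-in client that yields fixed chunks so the sketch runs without a provider.
function fakeLlmClient(chunks: string[]): LlmClient {
  return {
    async *generateStream() {
      for (const chunk of chunks) yield chunk;
    },
  };
}

// Content negotiation: stream only when the client asks for it,
// otherwise the existing JSON path is used.
function wantsSse(acceptHeader: string | undefined): boolean {
  return (acceptHeader ?? "").toLowerCase().includes("text/event-stream");
}

// Encode one SSE frame. JSON-encoding the payload keeps chunks that contain
// newlines on a single "data:" line. The event names are assumptions.
function sseEvent(event: string, data: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}

// Forward each chunk as it arrives, accumulate the full text, and persist
// only after the provider stream has closed.
async function streamReply(
  llm: LlmClient,
  prompt: string,
  write: (frame: string) => void,
  persist: (content: string) => Promise<void>,
): Promise<string> {
  let content = "";
  for await (const chunk of llm.generateStream(prompt)) {
    content += chunk;
    write(sseEvent("token", chunk));
  }
  await persist(content);
  write(sseEvent("done", { length: content.length }));
  return content;
}
```

In a real handler, write would wrap the HTTP response's write method and persist would save the assistant message. If the provider stream throws partway through, this sketch persists nothing; whether to keep partial content is left open here.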
Out of scope
- Provider-side cancellation when the user closes the page mid-stream (nice but adds complexity).
- Tool-use streaming (phase 2c).
Why
Single biggest UX improvement available for the assistant. Users have low tolerance for waiting on chat replies after the ChatGPT/Claude UX baseline.