feat(android): streaming TTS via ElevenLabs WebSocket for voice screen #29521
obviyus merged 9 commits into openclaw:main
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9c4cacd258
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
apps/android/app/src/main/java/ai/openclaw/android/voice/MicCaptureManager.kt
apps/android/app/src/main/java/ai/openclaw/android/voice/TalkModeManager.kt
apps/android/app/src/main/java/ai/openclaw/android/voice/TalkModeManager.kt
Greptile Summary
Adds real-time streaming TTS to the Android voice screen via the ElevenLabs WebSocket API. The implementation introduces
Key changes:
Known limitations (documented by author for follow-up):
The implementation handles WebSocket timing correctly and includes proper lifecycle management (barge-in, tab switching, backgrounding). Testing on OnePlus Android 15 confirms the feature works for sequential voice interactions.
Confidence Score: 4/5
Last reviewed commit: 9c4cacd
Hey @obviyus, voice screen TTS is working well for the core flow. Three known limitations are documented (mic rapid-toggle race, STT first-word cutoff, sendText thread-safety). Happy to fix those before merge if you'd prefer, or we can track them separately.
@gregmousseau thanks for the PR! I'll manually test it and merge it if everything seems good.
💡 Codex Review
Reviewed commit: 70bf53770f
apps/android/app/src/main/java/ai/openclaw/android/voice/TalkModeManager.kt
apps/android/app/src/main/java/ai/openclaw/android/voice/TalkModeManager.kt
💡 Codex Review
Reviewed commit: 64749866e2
apps/android/app/src/main/java/ai/openclaw/android/voice/TalkModeManager.kt
💡 Codex Review
Reviewed commit: e0b310125a
apps/android/app/src/main/java/ai/openclaw/android/voice/TalkModeManager.kt
@obviyus
- Merged with main (adopted playbackToken cancellation pattern)
Happy to split into smaller PRs, but that would create a dependency chain.
💡 Codex Review
Reviewed commit: b50a89cdfb
apps/android/app/src/main/java/ai/openclaw/android/voice/TalkModeManager.kt
apps/android/app/src/main/java/ai/openclaw/android/voice/ElevenLabsStreamingTts.kt
💡 Codex Review
Reviewed commit: 43f4ed1156
apps/android/app/src/main/java/ai/openclaw/android/chat/ChatController.kt
apps/android/app/src/main/java/ai/openclaw/android/voice/TalkModeManager.kt
6447032 to 96a4117
…PCM playback
Streams text to the ElevenLabs WebSocket API and plays audio in real-time via AudioTrack (PCM 24kHz). Key design points:
- sendText(fullText) takes the full accumulated text and only transmits the new suffix, detecting divergence for restart
- Chunks are queued if the WebSocket isn't yet connected; flushed in onOpen
- finish() sends EOS to ElevenLabs; deferred if called before onOpen fires
- sendText returns true (not false) when finished=true to avoid treating a normal end-of-stream as a diverge restart
- finishStreamingTts coroutine uses identity check before nulling streamingTts to prevent a mid-drain restart from orphaning a live TTS session
- eleven_v3 does NOT support WebSocket streaming; use eleven_flash_v2_5
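To make the design points above concrete, here is a minimal sketch of suffix-only sending with a pre-open queue and a deferred EOS. It is not the PR's `ElevenLabsStreamingTts`: the class name `SuffixStreamer`, the `transport` lambda (standing in for the WebSocket send), and the use of an empty string as the EOS marker are all assumptions made for illustration. Stale (shorter) text is not special-cased here.

```kotlin
// Hypothetical sketch; SuffixStreamer and transport are illustrative names,
// not identifiers from the PR.
class SuffixStreamer(private val transport: (String) -> Unit) {
    private val pending = ArrayDeque<String>() // chunks queued before the socket opens
    private var open = false
    private var finished = false
    private var finishRequested = false
    private var sentFullText = ""

    /** Returns false only when fullText no longer extends what was already accepted. */
    @Synchronized
    fun sendText(fullText: String): Boolean {
        // After finish(), late calls are a normal end-of-stream, not a divergence.
        if (finished || finishRequested) return true
        if (!fullText.startsWith(sentFullText)) return false // caller should restart
        val suffix = fullText.substring(sentFullText.length)
        sentFullText = fullText
        if (suffix.isEmpty()) return true
        if (open) transport(suffix) else pending.addLast(suffix)
        return true
    }

    /** Called when the socket is ready: flush the queue, then honour a deferred finish(). */
    @Synchronized
    fun onOpen() {
        while (pending.isNotEmpty()) transport(pending.removeFirst())
        open = true
        if (finishRequested) sendEos()
    }

    /** Sends EOS now if the socket is open, otherwise defers it until onOpen(). */
    @Synchronized
    fun finish() {
        if (finished) return
        if (open) sendEos() else finishRequested = true
    }

    private fun sendEos() {
        finished = true
        transport("") // assumption: an empty string stands in for the EOS message
    }
}
```

Because the queue is flushed inside the same lock that `sendText` takes, chunks accepted before and after `onOpen` reach the transport in order.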
… TTS
TalkModeManager is instantiated lazily in NodeRuntime and drives ElevenLabs streaming TTS for all assistant responses when the voice screen is active. MicCaptureManager continues to own STT and chat.send; TalkModeManager is TTS-only (ttsOnAllResponses = true, setEnabled never called).
- talkMode.ttsOnAllResponses = true when mic is enabled or voice screen active
- Barge-in: tapping the mic button calls stopTts() before re-enabling mic
- Lifecycle: PostOnboardingTabs LaunchedEffect + VoiceTabScreen onDispose both call setVoiceScreenActive(false) so TTS stops cleanly on tab switch or app backgrounding
- applyMainSessionKey wires the session key into TalkModeManager so it subscribes to the correct chat session for TTS
…oice
ChatController:
- final/aborted/error run events now trigger a history refresh regardless of whether the runId is in pendingRuns; only delta events require the run to be tracked (prevents voice-initiated responses from being silently dropped)
MicCaptureManager:
- Don't auto-send on onResults silence detection; accumulate transcript segments and send when mic is toggled off, giving the recognizer time to finish processing buffered audio
- Capture any partial live transcript if no final segments arrived (2s drain window before stop)
- Join multi-segment transcripts with sentence-ending punctuation to avoid run-on text sent to the gateway
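The segment-joining rule can be sketched as follows. This is an illustration only: `joinSegments` is a hypothetical helper, and appending a period to any segment that does not already end in `.`, `!`, or `?` is an assumed policy; the PR may choose punctuation differently.

```kotlin
// Hypothetical helper; the punctuation policy is an assumption, not the PR's exact rule.
fun joinSegments(segments: List<String>): String =
    segments
        .map { it.trim() }
        .filter { it.isNotEmpty() }
        .joinToString(" ") { seg ->
            // Close each segment with a period unless it already ends a sentence.
            if (seg.last() in ".!?") seg else "$seg."
        }
```

For example, `joinSegments(listOf("turn on the lights", " what time is it? "))` yields a two-sentence string rather than one run-on phrase.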
…cooldown
Bug fixes:
- @Synchronized on ElevenLabsStreamingTts.sendText/finish to prevent sentFullText/sentTextLength races across OkHttp and caller threads
- Pre-set pendingRunId via onRunIdKnown callback before chat.send to eliminate race where gateway events arrive before runId is stored
- Track drain coroutine as Job; cancel prior on rapid mic toggle to prevent duplicate TTS and stale transcript sends
- Mic button disabled during 2s drain cooldown (micCooldown StateFlow)
Codex review fixes:
- Gate agent streaming TTS on sessionKey to prevent cross-session audio leaks (P1)
- Clear ElevenLabs credentials when talk.provider is not elevenlabs; gate streaming TTS on activeProviderIsElevenLabs (P2)
System TTS fallback fixes:
- Null streamingTts immediately in finishStreamingTts so next response gets a fresh TTS instance
- Add hasReceivedAudio flag to ElevenLabsStreamingTts to detect when WebSocket connects but returns no audio (invalid key, network error)
- Fall back to playTtsForText when streaming TTS produced no audio
- Track ttsJob to cleanly cancel prior playTtsForText on new response
- Re-throw CancellationException instead of cascading into fallback attempts that also get cancelled
- Codex P1: streamAndPlayMp3 was computed but never called after PCM failure. Now properly invoked as fallback.
- Codex P2: MicCaptureManager.speakAssistantReply now skipped when TalkModeManager.ttsOnAllResponses is active, preventing both pipelines from speaking the same assistant reply.
- Codex P1: setSpeakerEnabled now syncs talkMode.setPlaybackEnabled so muting the speaker works when ttsOnAllResponses is active.
- Codex P2: abandonAudioFocus() called in stopSpeaking to prevent audio focus leak after TTS completes or is interrupted.
Agent events arrive on multiple threads concurrently. A stale event with shorter accumulated text was falsely triggering 'text diverged', causing the streaming TTS to restart with a new WebSocket. The result was multiple simultaneous ElevenLabs connections (2-3 voices) and eventual system TTS fallback when hasReceivedAudio was false.
Fix: if sentFullText.startsWith(fullText), the event is stale (we already have this text), not diverged. Accept and ignore it.
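The stale-versus-diverged distinction in this fix can be written as a small pure function. The `TextUpdate` enum and `classify` name are hypothetical; only the two `startsWith` checks come from the commit message.

```kotlin
// Hypothetical names; the prefix checks mirror the fix described above.
enum class TextUpdate { STALE, APPEND, DIVERGED }

fun classify(sentFullText: String, fullText: String): TextUpdate = when {
    // Already-sent text covers the incoming text: an old event arrived late.
    sentFullText.startsWith(fullText) -> TextUpdate.STALE
    // Incoming text extends what was sent: only the new suffix needs sending.
    fullText.startsWith(sentFullText) -> TextUpdate.APPEND
    // Neither is a prefix of the other: the text really changed.
    else -> TextUpdate.DIVERGED
}
```

The order of the branches matters: checking for staleness first means an event carrying identical text is also ignored rather than treated as an empty append.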
…ailure
- Codex P2: drain coroutine now only clears drainingTts if it's the same instance (=== check), preventing a newer drain from being unreachable by stopTts.
- Codex P2: set stopped=true on WebSocket onFailure so subsequent sendText calls are rejected and stale state doesn't persist.
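A minimal sketch of the identity-guarded clear, assuming a holder object that tracks the currently draining session. `Session` and `DrainSlot` are hypothetical stand-ins; the PR applies the same `===` comparison inside its drain coroutine.

```kotlin
// Hypothetical types; only the === guard reflects the fix described above.
class Session(val id: Int)

class DrainSlot {
    private var drainingTts: Session? = null

    @Synchronized
    fun begin(s: Session) { drainingTts = s }

    /** Clears the slot only if it still holds this exact instance. */
    @Synchronized
    fun end(s: Session): Boolean {
        if (drainingTts === s) {
            drainingTts = null
            return true
        }
        return false // a newer drain owns the slot; leave it reachable
    }

    @Synchronized
    fun current(): Session? = drainingTts
}
```

If an older drain finishes after a newer one has started, its `end` call is a no-op, so whatever stops TTS can still find the newer session.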
96a4117 to 1763daa
Landed via temp rebase onto main. Changes added in this landing pass:
💡 Codex Review
Reviewed commit: 1763daa9dc
})
}
webSocket.send(config.toString())
wsReady = true
Flush queued text before setting WebSocket ready state
onOpen flips wsReady to true before draining pendingText, so a concurrent sendText() call can send a newer chunk directly while older queued chunks are still waiting in the flush loop. When assistant deltas arrive quickly during socket startup, this reorders text sent to ElevenLabs and can produce garbled/truncated speech for the turn.
delay(2000L)
stop()
Bypass drain delay when turning mic off from voice screen
Mic disable now always waits 2 seconds before calling stop(), and setVoiceScreenActive(false) uses this same path when the user leaves the voice tab. That means the recognizer can keep listening after the screen is closed/backgrounded and still flush/send transcript text, which is an unexpected capture window for users who just exited voice mode.
Adds real-time ElevenLabs streaming TTS to the Android voice screen, so the assistant speaks responses as they stream in rather than waiting for the full reply.
What changed
ElevenLabsStreamingTts (new class)
Streams text chunks to the ElevenLabs WebSocket API and plays audio via AudioTrack (PCM 24kHz). Text is sent incrementally as agent delta events arrive; EOS is sent when the response finalizes. Handles WebSocket connect timing: chunks are queued before onOpen fires, and finish() defers EOS if called before the socket is ready.
TalkModeManager wired into NodeRuntime
TalkModeManager runs in TTS-only mode (ttsOnAllResponses = true) alongside MicCaptureManager, which continues to own STT and chat.send. Barge-in (mic tap stops active TTS), voice screen lifecycle (TTS stops on tab switch or backgrounding), and session key wiring are all handled.
ChatController fix
final/aborted/error run events now refresh chat history regardless of whether the runId is tracked in pendingRuns. Previously, voice-initiated runs were silently dropped because they weren't registered — responses would play via TTS but never appear in the chat UI.
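The gating rule can be sketched as below. `RunEventGate`, `RunState`, and `Action` are hypothetical names; the real ChatController is more involved, and removing the run from `pendingRuns` on a terminal event is an assumption of this sketch.

```kotlin
// Hypothetical sketch of the gating rule; not the actual ChatController code.
enum class RunState { DELTA, FINAL, ABORTED, ERROR }
enum class Action { APPLY_DELTA, REFRESH_HISTORY, IGNORE }

class RunEventGate {
    private val pendingRuns = mutableSetOf<String>()

    fun track(runId: String) { pendingRuns.add(runId) }

    fun handle(runId: String, state: RunState): Action = when (state) {
        // Deltas are only meaningful for runs this client started and is tracking.
        RunState.DELTA ->
            if (runId in pendingRuns) Action.APPLY_DELTA else Action.IGNORE
        // Terminal events always refresh history, tracked or not.
        RunState.FINAL, RunState.ABORTED, RunState.ERROR -> {
            pendingRuns.remove(runId) // assumption: terminal events end tracking
            Action.REFRESH_HISTORY
        }
    }
}
```

With this rule, a voice-initiated run that was never registered still causes a history refresh when it finishes, so the reply shows up in the chat UI.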
MicCaptureManager improvements
No longer auto-sends on silence: segments accumulate and are sent when the mic toggles off, with a 2s drain window to catch buffered audio. Multi-segment transcripts are joined with sentence-ending punctuation.
Testing
Tested on OnePlus CPH2581 (Android 15). Voice → STT → response → TTS confirmed working for sequential messages.
Known limitations (follow-up)
- Mic rapid-toggle race: … stop() fires into a live session and the previous response replays
- STT first-word cutoff: … onResults means words spoken immediately after the window are lost
- sendText thread-safety: sentFullText/sentTextLength are accessed from multiple threads without synchronization; rare false "text diverged" restart possible under concurrent delta events