Summary
Improve OpenHuman's text-to-speech and speech-to-text experience so the mascot can act as a polished voice-first companion, with higher-quality speaking, listening, feedback, and turn-taking behavior.
This issue is about making voice interaction feel deliberate and productized rather than merely functional.
Problem
OpenHuman can support voice-related workflows, but the mascot does not yet feel like a refined conversational presence.
Current gaps likely include some combination of:
TTS that does not yet feel expressive, branded, or tightly tied to the mascot persona
STT flows that need better responsiveness, error handling, and user feedback while listening
Weak transition states between listening, thinking, and speaking
Limited visual/audio affordances that make it obvious what the mascot is doing at any given moment
Missing product-level guidance for interruption, retry, silence handling, and partial transcript behavior
Without this polish, voice interaction risks feeling bolted on instead of becoming a differentiated part of the product experience.
Solution (optional)
Design and implement a more complete mascot-centered voice experience across both TTS and STT.
Areas of improvement should include:
Mascot-aligned TTS voice behavior and presentation
Better STT capture flow and transcript reliability
Clear UI states: idle, listening, processing, speaking, interrupted, and failed
Turn-taking rules for when the mascot should start/stop listening or speaking
Interruption behavior so the user can stop speech or speak over the mascot intentionally
Better fallback and recovery paths when speech services fail, time out, or return low-confidence results
This work should cover both the technical voice pipeline and the user-facing interaction model.
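One way to make the states and turn-taking rules above concrete is a small transition table. The sketch below is illustrative only: the state names come from the list above, while the event names (`startListening`, `userInterrupted`, and so on) and the specific transitions are assumptions, not OpenHuman's existing API or an agreed design.

```typescript
// States taken from the issue text.
type VoiceState =
  | "idle"
  | "listening"
  | "processing"
  | "speaking"
  | "interrupted"
  | "failed";

// Hypothetical events; the names are illustrative assumptions.
type VoiceEvent =
  | "startListening"
  | "speechEnded"
  | "responseReady"
  | "speechFinished"
  | "userInterrupted"
  | "cancel"
  | "error"
  | "retry";

// Allowed transitions. Any (state, event) pair not listed here is ignored,
// so a late or duplicate event cannot push the UI into a stale state.
const transitions: Record<VoiceState, Partial<Record<VoiceEvent, VoiceState>>> = {
  idle: { startListening: "listening" },
  listening: { speechEnded: "processing", cancel: "idle", error: "failed" },
  processing: { responseReady: "speaking", cancel: "idle", error: "failed" },
  speaking: {
    speechFinished: "idle",
    userInterrupted: "interrupted",
    cancel: "idle",
    error: "failed",
  },
  interrupted: { startListening: "listening", cancel: "idle" },
  failed: { retry: "listening", cancel: "idle" },
};

// Returns the next state, or the current state if the event is not allowed.
function nextState(state: VoiceState, event: VoiceEvent): VoiceState {
  return transitions[state][event] ?? state;
}

// Folds a sequence of events over a starting state.
function runEvents(start: VoiceState, events: VoiceEvent[]): VoiceState {
  return events.reduce<VoiceState>((s, e) => nextState(s, e), start);
}
```

Keeping the rules in one table means the UI can render directly from the current `VoiceState`, and each transition can be unit-tested without touching audio hardware.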
Acceptance criteria
The mascot has a clearly defined TTS experience that feels consistent with its persona and product role
STT capture flow is improved for responsiveness, reliability, and recoverability
Voice interaction states are clearly represented in the UI, including listening, processing, speaking, interrupted, and error states
The mascot can transition cleanly between listening and speaking without confusing or stale UI state
Interrupting or cancelling mascot speech is supported and behaves predictably
Silence, timeout, retry, and low-confidence recognition cases are handled intentionally
Partial transcript and final transcript behavior is defined and implemented where appropriate
Relevant desktop-specific edge cases are handled across supported platforms where feasible
Debug logging is sufficient to trace voice session lifecycle, state transitions, and failures without leaking sensitive content
Documentation is updated to describe the mascot voice interaction model and any configuration or platform constraints
Unit and integration coverage is added for the changed behavior
Diff coverage ≥ 80% — the implementing PR meets the changed-lines coverage gate (Vitest + cargo-llvm-cov, enforced by .github/workflows/coverage.yml).
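The silence, timeout, retry, low-confidence, and partial/final transcript criteria above could be captured in a single pure decision function. This is a minimal sketch under stated assumptions: the `SttResult` shape, the action names, and the threshold and retry values are hypothetical placeholders, not values taken from OpenHuman or from any particular speech engine.

```typescript
// Hypothetical shape of a recognition result; real engines differ.
interface SttResult {
  text: string;
  confidence: number; // assumed to be in the range 0..1
  isFinal: boolean;
}

// Hypothetical actions the interaction layer could take.
type SttAction =
  | { kind: "showPartial"; text: string }
  | { kind: "submit"; text: string }
  | { kind: "askToRepeat" }
  | { kind: "promptOnSilence" }
  | { kind: "giveUp" };

interface SttPolicy {
  minConfidence: number;
  maxRetries: number;
}

// Placeholder values chosen for illustration only.
const defaultPolicy: SttPolicy = { minConfidence: 0.6, maxRetries: 2 };

// Decides what to do with one recognition result.
// `result === null` models silence or a timeout with no speech detected.
function decideSttAction(
  result: SttResult | null,
  retriesSoFar: number,
  policy: SttPolicy = defaultPolicy,
): SttAction {
  const retriesExhausted = retriesSoFar >= policy.maxRetries;

  if (result === null) {
    return retriesExhausted ? { kind: "giveUp" } : { kind: "promptOnSilence" };
  }
  // Partial transcripts are only displayed, never acted on.
  if (!result.isFinal) {
    return { kind: "showPartial", text: result.text };
  }
  const text = result.text.trim();
  if (text === "") {
    return retriesExhausted ? { kind: "giveUp" } : { kind: "promptOnSilence" };
  }
  if (result.confidence < policy.minConfidence) {
    return retriesExhausted ? { kind: "giveUp" } : { kind: "askToRepeat" };
  }
  return { kind: "submit", text };
}
```

Because the function has no side effects, each acceptance criterion maps to a plain unit test, which also helps with the diff coverage gate.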
Related
Follow-up issue for UI-triggered and hotkey-triggered voice conversation entry points
Existing mascot motion / expression work
Existing voice pipeline and desktop audio capture work