Polish mascot-driven TTS and STT voice experience

## Summary

Improve OpenHuman's text-to-speech and speech-to-text experience so the mascot can act as a polished voice-first companion, with higher-quality speaking, listening, feedback, and turn-taking behavior.

This issue is about making voice interaction feel deliberate and productized rather than merely functional.

## Problem

OpenHuman can support voice-related workflows, but the mascot does not yet feel like a refined conversational presence.

Current gaps likely include some combination of:

- TTS that does not yet feel expressive, branded, or tightly tied to the mascot persona
- STT flows that need better responsiveness, error handling, and user feedback while listening
- Weak transition states between listening, thinking, and speaking
- Limited visual/audio affordances that make it obvious what the mascot is doing at any given moment
- Missing product-level guidance for interruption, retry, silence handling, and partial transcript behavior

Without this polish, voice interaction risks feeling bolted on instead of becoming a differentiated part of the product experience.

## Solution (optional)

Design and implement a more complete mascot-centered voice experience across both TTS and STT.

Areas of improvement should include:

- Mascot-aligned TTS voice behavior and presentation
- Better STT capture flow and transcript reliability
- Clear UI states: idle, listening, processing, speaking, interrupted, and failed
- Turn-taking rules for when the mascot should start/stop listening or speaking
- Interruption behavior so the user can deliberately stop the mascot's speech or speak over it
- Better fallback and recovery paths when speech services fail, time out, or return low-confidence results

This work should cover both the technical voice pipeline and the user-facing interaction model.
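
As a starting point for discussion, the UI states and turn-taking rules listed above could be modeled as a small explicit state machine. This is a minimal sketch, not the existing implementation: the event names, the transition table, and `nextState` are all hypothetical, and only the six state names come from this issue.

```typescript
// Sketch only: the six states come from this issue; events and transitions are assumptions.
type VoiceState =
  | "idle"
  | "listening"
  | "processing"
  | "speaking"
  | "interrupted"
  | "failed";

// Hypothetical events emitted by the voice pipeline and the UI.
type VoiceEvent =
  | "start_listening"
  | "final_transcript"
  | "response_ready"
  | "speech_finished"
  | "user_interrupt"
  | "error"
  | "reset";

// Allowed transitions per state. Any event not listed for a state is ignored.
const transitions: Record<VoiceState, Partial<Record<VoiceEvent, VoiceState>>> = {
  idle: { start_listening: "listening", error: "failed" },
  listening: { final_transcript: "processing", user_interrupt: "idle", error: "failed" },
  processing: { response_ready: "speaking", user_interrupt: "idle", error: "failed" },
  speaking: { speech_finished: "idle", user_interrupt: "interrupted", error: "failed" },
  interrupted: { start_listening: "listening", reset: "idle", error: "failed" },
  failed: { start_listening: "listening", reset: "idle" },
};

// Pure transition function: returns the next state, or the current state
// unchanged when the event is not valid in that state.
function nextState(state: VoiceState, event: VoiceEvent): VoiceState {
  return transitions[state][event] ?? state;
}
```

Keeping the transition function pure means the UI can render from a single source of truth, and late or out-of-order events (for example a `speech_finished` arriving after the user has already interrupted) leave the state unchanged instead of producing stale UI.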

## Acceptance criteria

- [ ] The mascot has a clearly defined TTS experience that feels consistent with its persona and product role
- [ ] STT capture flow is improved for responsiveness, reliability, and recoverability
- [ ] Voice interaction states are clearly represented in the UI, including idle, listening, processing, speaking, interrupted, and failed states
- [ ] The mascot can transition cleanly between listening and speaking without confusing or stale UI state
- [ ] Interrupting or cancelling mascot speech is supported and behaves predictably
- [ ] Silence, timeout, retry, and low-confidence recognition cases are handled intentionally
- [ ] Partial transcript and final transcript behavior is defined and implemented where appropriate
- [ ] Relevant desktop-specific edge cases are handled across supported platforms where feasible
- [ ] Debug logging is sufficient to trace voice session lifecycle, state transitions, and failures without leaking sensitive content
- [ ] Documentation is updated to describe the mascot voice interaction model and any configuration or platform constraints
- [ ] Unit and integration coverage is added for the changed behavior
- [ ] Diff coverage ≥ 80% — the implementing PR meets the changed-lines coverage gate (Vitest + cargo-llvm-cov, enforced by [`.github/workflows/coverage.yml`](.github/workflows/coverage.yml)).
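
To make the silence, timeout, retry, and low-confidence criterion above concrete, one option is a single pure decision function that maps each recognition outcome to an explicit next step. This is a sketch under stated assumptions: the `SttResult` shape, the action names, and the threshold values are hypothetical placeholders, and real STT engines report outcomes and confidence differently.

```typescript
// Hypothetical result shape; not taken from any specific STT engine.
type SttResult =
  | { kind: "final"; text: string; confidence: number }
  | { kind: "silence" }
  | { kind: "timeout" }
  | { kind: "service_error" };

// Hypothetical next steps the mascot could take after a recognition attempt.
type NextAction =
  | { action: "accept"; text: string }  // use the transcript as-is
  | { action: "confirm"; text: string } // ask the user to confirm a low-confidence transcript
  | { action: "reprompt" }              // nothing was heard, ask the user to speak again
  | { action: "retry" }                 // transient failure, try capturing again
  | { action: "give_up" };              // retries exhausted, surface the failed state

interface Policy {
  minConfidence: number; // below this, ask for confirmation instead of accepting
  maxRetries: number;    // attempts allowed before giving up
}

// Placeholder values; real thresholds would need tuning per engine and platform.
const defaultPolicy: Policy = { minConfidence: 0.6, maxRetries: 2 };

// `attempt` is the zero-based count of attempts already made in this turn.
function decide(
  result: SttResult,
  attempt: number,
  policy: Policy = defaultPolicy,
): NextAction {
  switch (result.kind) {
    case "final":
      return result.confidence >= policy.minConfidence
        ? { action: "accept", text: result.text }
        : { action: "confirm", text: result.text };
    case "silence":
      return attempt < policy.maxRetries ? { action: "reprompt" } : { action: "give_up" };
    case "timeout":
    case "service_error":
      return attempt < policy.maxRetries ? { action: "retry" } : { action: "give_up" };
  }
}
```

Because the function is pure, every branch can be covered by unit tests without audio hardware, which may help with the coverage criterion in this list.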

## Related

- Follow-up issue for UI-triggered and hotkey-triggered voice conversation entry points
- Existing mascot motion / expression work
- Existing voice pipeline and desktop audio capture work

