Polish mascot-driven TTS and STT voice experience #1206

@senamakel

Summary

Improve OpenHuman's text-to-speech and speech-to-text experience so the mascot can act as a polished voice-first companion, with higher-quality speaking, listening, feedback, and turn-taking behavior.

This issue is about making voice interaction feel deliberate and productized rather than merely functional.

Problem

OpenHuman can support voice-related workflows, but the mascot does not yet feel like a refined conversational presence.

Current gaps likely include some combination of:

  • TTS that does not yet feel expressive, branded, or tightly tied to the mascot persona
  • STT flows that need better responsiveness, error handling, and user feedback while listening
  • Weak transition states between listening, thinking, and speaking
  • Limited visual/audio affordances that make it obvious what the mascot is doing at any given moment
  • Missing product-level guidance for interruption, retry, silence handling, and partial transcript behavior

Without this polish, voice interaction risks feeling bolted on instead of becoming a differentiated part of the product experience.

Solution (optional)

Design and implement a more complete mascot-centered voice experience across both TTS and STT.

Areas of improvement should include:

  • Mascot-aligned TTS voice behavior and presentation
  • Better STT capture flow and transcript reliability
  • Clear UI states: idle, listening, processing, speaking, interrupted, and failed
  • Turn-taking rules for when the mascot should start/stop listening or speaking
  • Interruption behavior so the user can stop speech or speak over the mascot intentionally
  • Better fallback and recovery paths when speech services fail, time out, or return low-confidence results
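
The UI states and turn-taking rules listed above could be modeled as a small finite-state machine. The following is only a minimal sketch: the names `VoiceState`, `VoiceEvent`, and `transition`, and the specific set of allowed transitions, are assumptions made for illustration and are not taken from the OpenHuman codebase.

```typescript
// Hypothetical names throughout; this is a sketch, not the project's API.
type VoiceState =
  | "idle"
  | "listening"
  | "processing"
  | "speaking"
  | "interrupted"
  | "failed";

type VoiceEvent =
  | "START_LISTENING"
  | "FINAL_TRANSCRIPT"
  | "REPLY_READY"
  | "SPEECH_DONE"
  | "USER_INTERRUPT"
  | "ERROR"
  | "RESET";

// Allowed transitions per state. Any event not listed for a state is ignored,
// which keeps the UI from jumping into a stale or contradictory state.
const transitions: Record<VoiceState, Partial<Record<VoiceEvent, VoiceState>>> = {
  idle: { START_LISTENING: "listening" },
  listening: { FINAL_TRANSCRIPT: "processing", ERROR: "failed", RESET: "idle" },
  processing: { REPLY_READY: "speaking", ERROR: "failed", RESET: "idle" },
  // Barge-in: the user may interrupt only while the mascot is speaking.
  speaking: { SPEECH_DONE: "idle", USER_INTERRUPT: "interrupted", ERROR: "failed" },
  interrupted: { START_LISTENING: "listening", RESET: "idle" },
  failed: { START_LISTENING: "listening", RESET: "idle" },
};

// Returns the next state, or the current state if the event is not allowed.
function transition(state: VoiceState, event: VoiceEvent): VoiceState {
  return transitions[state][event] ?? state;
}
```

Keeping the transition table as data makes it straightforward to unit test every state/event pair and to drive the mascot's visual state from a single source of truth.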

This work should cover both the technical voice pipeline and the user-facing interaction model.
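
On the pipeline side, the fallback and recovery bullet above could be expressed as a pure decision function that maps a recognition outcome to a next step. This is a hedged sketch: `RecognitionOutcome`, `RecoveryPolicy`, `decideNextStep`, the step names, and the threshold values in the usage note are all hypothetical and would need to be chosen by the implementing PR.

```typescript
// Hypothetical types; not taken from any speech service or the OpenHuman code.
type RecognitionOutcome =
  | { kind: "final"; text: string; confidence: number }
  | { kind: "silence" }
  | { kind: "timeout" }
  | { kind: "service_error" };

type NextStep =
  | "accept"
  | "confirm_with_user"
  | "reprompt"
  | "retry"
  | "fall_back_to_text";

interface RecoveryPolicy {
  minConfidence: number; // below this, ask the user to confirm the transcript
  maxRetries: number; // attempts allowed before falling back to text input
}

// `attempt` is the number of attempts already made in this turn (0-based).
function decideNextStep(
  outcome: RecognitionOutcome,
  attempt: number,
  policy: RecoveryPolicy,
): NextStep {
  switch (outcome.kind) {
    case "final":
      return outcome.confidence >= policy.minConfidence
        ? "accept"
        : "confirm_with_user";
    case "silence":
      // The user said nothing: prompt again rather than silently retrying.
      return attempt < policy.maxRetries ? "reprompt" : "fall_back_to_text";
    case "timeout":
    case "service_error":
      return attempt < policy.maxRetries ? "retry" : "fall_back_to_text";
  }
}
```

For example, with an assumed policy of `{ minConfidence: 0.6, maxRetries: 2 }`, a final transcript at confidence 0.3 yields `confirm_with_user`, and a third consecutive timeout yields `fall_back_to_text`.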

Acceptance criteria

  • The mascot has a clearly defined TTS experience that feels consistent with its persona and product role
  • STT capture flow is improved for responsiveness, reliability, and recoverability
  • Voice interaction states are clearly represented in the UI, including listening, processing, speaking, interrupted, and error states
  • The mascot can transition cleanly between listening and speaking without confusing or stale UI state
  • Interrupting or cancelling mascot speech is supported and behaves predictably
  • Silence, timeout, retry, and low-confidence recognition cases are handled intentionally
  • Partial transcript and final transcript behavior is defined and implemented where appropriate
  • Relevant desktop-specific edge cases are handled across supported platforms where feasible
  • Debug logging is sufficient to trace voice session lifecycle, state transitions, and failures without leaking sensitive content
  • Documentation is updated to describe the mascot voice interaction model and any configuration or platform constraints
  • Unit and integration coverage is added for the changed behavior
  • Diff coverage ≥ 80% — the implementing PR meets the changed-lines coverage gate (Vitest + cargo-llvm-cov, enforced by .github/workflows/coverage.yml).
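
One way to satisfy the debug-logging criterion above is to log lifecycle metadata while never writing transcript text. The sketch below assumes a hypothetical `VoiceLogEvent` shape and `toLogLine` helper; the field names and line format are illustrative only, and `errorCode` is assumed to be a short machine-readable code rather than a free-form message.

```typescript
// Hypothetical log event shape; not an existing OpenHuman type.
interface VoiceLogEvent {
  sessionId: string;
  from: string; // previous voice state
  to: string; // new voice state
  transcript?: string; // user speech; must never be written to logs
  errorCode?: string; // assumed to be a short code, not a raw error message
}

// Builds a log line that traces the state transition without leaking content.
function toLogLine(e: VoiceLogEvent): string {
  const parts = [`session=${e.sessionId}`, `transition=${e.from}->${e.to}`];
  if (e.transcript !== undefined) {
    // Record only the length so transcript handling can still be debugged.
    parts.push(`transcript_chars=${e.transcript.length}`);
  }
  if (e.errorCode !== undefined) {
    parts.push(`error=${e.errorCode}`);
  }
  return parts.join(" ");
}
```

Logging the transcript length rather than its text keeps session traces useful for diagnosing empty or truncated recognition results while the spoken content stays out of log files.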

Related

  • Follow-up issue for UI-triggered and hotkey-triggered voice conversation entry points
  • Existing mascot motion / expression work
  • Existing voice pipeline and desktop audio capture work

Metadata

Assignees

No one assigned

    Labels

    enhancement
    react-ui: React app work in app/src: pages, components, providers, store, and UX.
    task: Work item that is not primarily a bug or a feature.
    voice: Voice features and audio workflows in src/openhuman/voice/ and app/src/features/voice/.

    Projects

    Status
    Done
