About

Voice shouldn't be a setting. It should be an identity.

Inspiration

Most voice agents sound identical — flat, generic, forgettable. Whether it's a customer service bot, a podcast narrator, or a voice assistant, synthetic voice has no identity. It's technically impressive but humanly unmemorable.

We're entering a world where voice is becoming the primary interface for hands-free computing, AI narration is replacing human voiceover, and synthetic speech is everywhere. Yet nobody has solved voice identity — the idea that how you sound is as expressive and personal as how you look.

The insight that drove Aria: identity in voice isn't just tone. It's the words you choose, the rhythm of your sentences, your energy, your age, your emotion, your accent. A calm narrator doesn't just speak quietly — they choose measured, precise words. A radio host doesn't just speak loudly — they speak in punchy fragments. These are different identities, not just different volumes.

That question became Aria: can we transform voice identity at every layer simultaneously?


What it does

Aria is a real-time voice identity playground. You speak once. Aria transforms your voice into a completely different persona — not just the tone, but the actual words adapt to match each persona's identity and character.

The experience

  1. Record your voice or upload an audio file
  2. Pick a persona: Calm Narrator, Radio Host, Elder Storyteller, or Playful Kid
  3. Aria transcribes your speech, rewrites it to match the persona's linguistic identity, and synthesizes it in that persona's voice
  4. A live 120-spike SVG audio visualizer reacts to both your raw input and the transformed output in real time
  5. Switch personas instantly — color, character, phrasing, and voice all change

Radio Host supports 4 live accent variants: American, British, Australian, Indian

Same meaning. Completely different person.


How we built it

Three AI systems work in sequence on every request:

1. Google Gemini 2.5 Flash — Speech Understanding

Transcribes raw audio to text in real time using inline base64 audio. We chose Gemini for its speed and accuracy on short conversational clips.
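In code, the transcription step looks roughly like this. It's a sketch against Google's public REST endpoint; the prompt wording and the `audio/webm` MIME type are illustrative, not our exact production values.

```typescript
// One request part carries the base64 audio inline, the other the instruction.
type Part = { inlineData: { mimeType: string; data: string } } | { text: string };

function buildTranscriptionParts(base64Audio: string, mimeType = "audio/webm"): Part[] {
  return [
    { inlineData: { mimeType, data: base64Audio } },
    { text: "Transcribe this audio verbatim. Return only the transcript." },
  ];
}

// Network call (not executed in this sketch); model name per the pipeline above.
async function transcribe(base64Audio: string): Promise<string> {
  const url =
    "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent" +
    `?key=${process.env.GEMINI_API_KEY}`;
  const res = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ contents: [{ parts: buildTranscriptionParts(base64Audio) }] }),
  });
  const json = await res.json();
  return json.candidates[0].content.parts[0].text;
}
```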

2. Featherless (Meta Llama 3.1 8B Instruct) — Persona Rewriting

This is the core insight that separates Aria from a TTS wrapper. After transcription, we send the text to Llama 3.1 8B via Featherless's OpenAI-compatible API with a carefully engineered persona director prompt.

The LLM rewrites the transcript to match the persona's linguistic identity:

  • Calm Narrator: formal, measured, precise phrasing
  • Radio Host: punchy, clipped, high-energy fragments
  • Elder Storyteller: warm, unhurried, reflective sentences
  • Playful Kid: simple, exclamatory, bouncy rhythm

The words themselves change — not just the voice.
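A minimal sketch of the rewrite step, with the style notes lifted straight from the list above. The base URL, model ID, and prompt wording here are illustrative stand-ins for our actual persona director prompt.

```typescript
// Linguistic identity per persona, as described above.
const PERSONA_STYLES: Record<string, string> = {
  calmNarrator: "formal, measured, precise phrasing",
  radioHost: "punchy, clipped, high-energy fragments",
  elderStoryteller: "warm, unhurried, reflective sentences",
  playfulKid: "simple, exclamatory, bouncy rhythm",
};

function buildDirectorPrompt(persona: string, transcript: string): string {
  return [
    `You are a persona director. Rewrite the transcript in this style: ${PERSONA_STYLES[persona]}.`,
    "Preserve the speaker's meaning exactly. Return only the rewritten text.",
    `Transcript: ${transcript}`,
  ].join("\n");
}

// OpenAI-compatible chat completion against Featherless (endpoint assumed).
async function rewriteForPersona(persona: string, transcript: string): Promise<string> {
  const res = await fetch("https://api.featherless.ai/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.FEATHERLESS_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "meta-llama/Meta-Llama-3.1-8B-Instruct",
      messages: [{ role: "user", content: buildDirectorPrompt(persona, transcript) }],
    }),
  });
  const json = await res.json();
  return json.choices[0].message.content;
}
```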

3. ElevenLabs eleven_multilingual_v2 — Voice Synthesis

Synthesizes the rewritten transcript using custom Voice Design voices created for each persona with age, accent, and tone prompts.

Each persona has individually tuned voice settings:

  • stability — controls consistency vs. expressiveness
  • style_exaggeration — controls how “acted” the delivery feels
  • similarity_boost — controls adherence to the reference voice
  • speaking_rate — controls pacing

Radio Host has 4 distinct Voice Design voices for American, British, Australian, and Indian accents — same character, different cultural identity.
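A sketch of the synthesis call. The voice IDs and setting values below are placeholders, not our actual tuning; note that in the ElevenLabs REST API the "style exaggeration" knob is the `style` field of `voice_settings`.

```typescript
// Per-persona voice configuration (values illustrative, not the shipped tuning).
interface PersonaVoice {
  voiceId: string;         // placeholder IDs, not real Voice Design voices
  stability: number;       // consistency vs. expressiveness
  style: number;           // how "acted" the delivery feels
  similarityBoost: number; // adherence to the reference voice
}

const VOICES: Record<string, PersonaVoice> = {
  calmNarrator: { voiceId: "VOICE_ID_NARRATOR", stability: 0.8, style: 0.2, similarityBoost: 0.75 },
  radioHost:    { voiceId: "VOICE_ID_RADIO",    stability: 0.3, style: 0.9, similarityBoost: 0.75 },
};

async function synthesize(persona: string, text: string): Promise<ArrayBuffer> {
  const v = VOICES[persona];
  const res = await fetch(`https://api.elevenlabs.io/v1/text-to-speech/${v.voiceId}`, {
    method: "POST",
    headers: {
      "xi-api-key": process.env.ELEVENLABS_API_KEY!,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      text,
      model_id: "eleven_multilingual_v2",
      voice_settings: {
        stability: v.stability,
        style: v.style,
        similarity_boost: v.similarityBoost,
      },
    }),
  });
  return res.arrayBuffer(); // raw audio bytes, streamed back to the client
}
```

The intuition: a calm narrator wants high stability and low style, a radio host the opposite.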


Research foundation

Our persona voice design and synthesis approach is inspired by:

  • Amphion / Vevo (open-source controllable voice conversion framework) — specifically its approach to separating timbre, style, and emotion as independently controllable dimensions. We implemented a lightweight API-first version of this concept using ElevenLabs voice settings as the control surface.

  • ElevenLabs Voice Design v3 prompting methodology — describing voices by age, accent, timbre, and emotional register rather than by acoustic parameters directly.


Full stack

  • Backend: Next.js App Router + TypeScript, single /api/transform endpoint
  • Frontend: Vite + React + Tailwind CSS
  • Audio: WebAudio API AnalyserNode driving a custom 120-spike SVG visualizer
  • Personas: 7 custom ElevenLabs Voice Design voices (4 personas + 3 Radio Host accent variants)

Challenges we ran into

WebAudio InvalidStateError

createMediaElementSource can only be called once per HTMLAudioElement. Calling it again on persona switch throws an unrecoverable error.

We fixed this by creating the analyser source node once on component mount and routing all subsequent audio through the same node. Mic stream and playback audio both flow through the same analyser so the visualizer reacts to everything.

Stale React closure bugs

The persona selected at record time had to match the persona at API call time across async boundaries. Standard useState closures captured the wrong value.

We fixed this with a selectedPersonaRef that always mirrors current state and is read at call time, not capture time.
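The bug and the fix, reduced to a plain-TypeScript analogue (React omitted so the capture-time vs. read-time distinction is visible in isolation):

```typescript
// A ref is just a mutable box whose .current is read when needed.
type Ref<T> = { current: T };

function demoStaleVsRef(): { stale: string; fresh: string } {
  const personaRef: Ref<string> = { current: "calmNarrator" };
  const capturedAtRecordTime = personaRef.current; // what a useState closure sees
  personaRef.current = "radioHost";                // user switches before the API call
  return { stale: capturedAtRecordTime, fresh: personaRef.current };
}

// In React, the equivalent is:
//   const selectedPersonaRef = useRef(selectedPersona);
//   useEffect(() => { selectedPersonaRef.current = selectedPersona; }, [selectedPersona]);
// ...and reading selectedPersonaRef.current inside the async handler.
```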

Pipeline latency

The Gemini → Featherless → ElevenLabs round trip takes 3–5 seconds.

We built explicit state transitions (idle → recording → transforming → playing) with visual feedback at every step so the UI never feels broken or frozen.
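Those transitions can be sketched as a typed map, so an illegal move fails loudly instead of leaving the UI in a half-state. The exact edge set below (e.g. falling back to idle on failure) is an assumed illustration:

```typescript
type UiState = "idle" | "recording" | "transforming" | "playing";

// Legal moves per state; illustrative edges, not the exact shipped graph.
const TRANSITIONS: Record<UiState, UiState[]> = {
  idle: ["recording"],
  recording: ["transforming", "idle"],
  transforming: ["playing", "idle"], // back to idle on pipeline failure
  playing: ["idle", "recording"],
};

function transition(from: UiState, to: UiState): UiState {
  if (!TRANSITIONS[from].includes(to)) {
    throw new Error(`illegal transition ${from} -> ${to}`);
  }
  return to;
}
```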

Graceful degradation

The Featherless rewrite step is wrapped in a silent try/catch. If the LLM call fails for any reason, the route falls back to the original transcript without breaking the demo. The user never sees an error from an upstream API failure.
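The fallback amounts to a small wrapper around the rewrite call, sketched here with the rewrite function injected so the behavior is easy to see:

```typescript
// If the persona rewrite fails for any reason, degrade silently to the
// original transcript so synthesis always receives valid text.
async function rewriteOrFallback(
  transcript: string,
  rewrite: (t: string) => Promise<string>,
): Promise<string> {
  try {
    return await rewrite(transcript);
  } catch {
    return transcript; // the user never sees the upstream failure
  }
}
```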


Accomplishments we're proud of

  • The pipeline is fully real — no pre-recorded clips, no mocked responses, no smoke and mirrors
  • The moment a judge hears their own words played back as four completely different identities is genuinely surprising every single time
  • Identity transformation works at two layers simultaneously:
    • Linguistic (Featherless rewrites the words)
    • Acoustic (ElevenLabs voices the new identity)
  • The visualizer reacts to both mic input during recording and transformed audio during playback through the same WebAudio analyser node — one continuous visual experience
  • Built and shipped end-to-end in 4 hours on two MacBook Airs with zero GPU infrastructure

What we learned

  • API-first design with strong visuals beats trying to run heavy models locally on consumer laptops
  • The most impressive demos are not the ones with the most models — they are the ones where the judge immediately understands the before and the after
  • ElevenLabs Voice Design + per-persona settings creates dramatically different perceived identities without any model training or fine-tuning
  • Featherless makes open LLM inference a drop-in, OpenAI-compatible building block — we integrated Llama 3.1 8B in under 30 minutes
  • Separating concerns cleanly (Gemini for understanding, Featherless for identity, ElevenLabs for synthesis) made the system easy to debug and extend under time pressure

What's next

  • Continuous emotion, age, and intensity sliders mapped to ElevenLabs stability and style controls — letting users tune identity in real time
  • Backboard integration for streaming, interruptible persona switching with sub-second latency
  • Amphion Vevo on-device voice conversion as an alternative synthesis path for privacy-first use cases
  • Comparison mode: one recording generates all 4 persona outputs simultaneously via Promise.all, displayed side by side
  • Accent variants for all 4 personas, not just Radio Host
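The planned comparison mode is simple in shape: the same transform fanned out across every persona concurrently. A sketch, with the transform function injected (names assumed):

```typescript
// Run one recording through all personas in parallel and key results by persona.
async function transformAll<T>(
  personas: string[],
  transform: (persona: string) => Promise<T>,
): Promise<Record<string, T>> {
  const results = await Promise.all(personas.map((p) => transform(p)));
  return Object.fromEntries(personas.map((p, i) => [p, results[i]]));
}
```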
