About
Voice shouldn't be a setting. It should be an identity.
Inspiration
Most voice agents sound identical — flat, generic, forgettable. Whether it's a customer service bot, a podcast narrator, or a voice assistant, synthetic voice has no identity. It's technically impressive but humanly unmemorable.
We're entering a world where voice is becoming the primary interface for hands-free computing, AI narration is replacing human voiceover, and synthetic speech is everywhere. Yet nobody has solved voice identity — the idea that how you sound is as expressive and personal as how you look.
The insight that drove Aria: identity in voice isn't just tone. It's the words you choose, the rhythm of your sentences, your energy, your age, your emotion, your accent. A calm narrator doesn't just speak quietly — they choose measured, precise words. A radio host doesn't just speak loudly — they speak in punchy fragments. These are different identities, not just different volumes.
That question — can we transform voice identity at every layer simultaneously — became Aria.
What it does
Aria is a real-time voice identity playground. You speak once. Aria transforms your voice into a completely different persona — not just the tone, but the actual words adapt to match each persona's identity and character.
The experience
- Record your voice or upload an audio file
- Pick a persona: Calm Narrator, Radio Host, Elder Storyteller, or Playful Kid
- Aria transcribes your speech, rewrites it to match the persona's linguistic identity, and synthesizes it in that persona's voice
- A live 120-spike SVG audio visualizer reacts to both your raw input and the transformed output in real time
- Switch personas instantly — color, character, phrasing, and voice all change
Radio Host supports 4 live accent variants: American, British, Australian, Indian
Same meaning. Completely different person.
How we built it
Three AI systems work in sequence on every request:
1. Google Gemini 2.5 Flash — Speech Understanding
Transcribes raw audio to text in real time using inline base64 audio. We chose Gemini for its speed and accuracy on short conversational clips.
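The inline-base64 approach can be sketched as a request-body builder. This is a minimal sketch assuming Gemini's REST `generateContent` payload shape; the prompt wording and the `buildTranscriptionPayload` helper name are illustrative, not our exact code.

```typescript
// Shape of a Gemini generateContent request that carries the audio inline
// as base64 — one HTTP call, no separate file upload. Prompt text is
// illustrative.
interface GeminiPart {
  text?: string;
  inlineData?: { mimeType: string; data: string };
}

function buildTranscriptionPayload(audio: Uint8Array, mimeType: string) {
  return {
    contents: [
      {
        parts: [
          { text: "Transcribe this audio clip verbatim." },
          // Inline base64 audio keeps the whole step to a single request.
          {
            inlineData: {
              mimeType,
              data: Buffer.from(audio).toString("base64"),
            },
          },
        ] as GeminiPart[],
      },
    ],
  };
}
```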
2. Featherless (Meta Llama 3.1 8B Instruct) — Persona Rewriting
This is the core insight that separates Aria from a TTS wrapper. After transcription, we send the text to Llama 3.1 8B via Featherless's OpenAI-compatible API with a carefully engineered persona director prompt.
The LLM rewrites the transcript to match the persona's linguistic identity:
- Calm Narrator: formal, measured, precise phrasing
- Radio Host: punchy, clipped, high-energy fragments
- Elder Storyteller: warm, unhurried, reflective sentences
- Playful Kid: simple, exclamatory, bouncy rhythm
The words themselves change — not just the voice.
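The persona director prompt can be sketched as below. The style descriptions mirror the list above; the persona keys, the exact prompt wording, and the `buildRewritePrompt` helper are illustrative. The resulting prompt goes to Featherless as a standard chat completion (e.g. via the `openai` client with a custom `baseURL`).

```typescript
// Persona "director" prompt builder for the rewrite step. Style strings
// mirror the persona list above; exact wording is illustrative.
const PERSONA_STYLE: Record<string, string> = {
  calmNarrator: "formal, measured, precise phrasing",
  radioHost: "punchy, clipped, high-energy fragments",
  elderStoryteller: "warm, unhurried, reflective sentences",
  playfulKid: "simple, exclamatory, bouncy rhythm",
};

function buildRewritePrompt(persona: string, transcript: string): string {
  const style = PERSONA_STYLE[persona];
  if (!style) throw new Error(`unknown persona: ${persona}`);
  return (
    `Rewrite the transcript below in the voice of a ${persona}: ${style}. ` +
    `Preserve the meaning exactly; change only the wording and rhythm.\n\n` +
    transcript
  );
}
```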
3. ElevenLabs eleven_multilingual_v2 — Voice Synthesis
Synthesizes the rewritten transcript using custom Voice Design voices created for each persona with age, accent, and tone prompts.
Each persona has individually tuned voice settings:
- `stability` — controls consistency vs. expressiveness
- `style_exaggeration` — controls how “acted” the delivery feels
- `similarity_boost` — controls adherence to the reference voice
- `speaking_rate` — controls pacing
Radio Host has 4 distinct Voice Design voices for American, British, Australian, and Indian accents — same character, different cultural identity.
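The per-persona tuning can be sketched as a config map. The voice IDs are placeholders and the numeric values are illustrative, not our tuned production settings; the remaining personas and accent variants follow the same shape.

```typescript
// Per-persona ElevenLabs voice settings. Voice IDs are placeholders and
// the numbers are illustrative — the point is that each identity gets its
// own tuning, e.g. Radio Host trades stability for exaggerated style.
interface VoiceConfig {
  voiceId: string;
  stability: number;         // consistency vs. expressiveness
  styleExaggeration: number; // how "acted" the delivery feels
  similarityBoost: number;   // adherence to the reference voice
  speakingRate: number;      // pacing
}

const VOICES: Record<string, VoiceConfig> = {
  calmNarrator: {
    voiceId: "VOICE_NARRATOR",
    stability: 0.8, styleExaggeration: 0.2, similarityBoost: 0.9, speakingRate: 0.9,
  },
  "radioHost.american": {
    voiceId: "VOICE_RH_US",
    stability: 0.3, styleExaggeration: 0.8, similarityBoost: 0.8, speakingRate: 1.15,
  },
  "radioHost.british": {
    voiceId: "VOICE_RH_UK", // same character, different cultural identity
    stability: 0.3, styleExaggeration: 0.8, similarityBoost: 0.8, speakingRate: 1.15,
  },
  // ...elderStoryteller, playfulKid, and the remaining accent variants
};
```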
Research foundation
Our persona voice design and synthesis approach is inspired by:
Amphion / Vevo (open-source controllable voice conversion framework) — specifically its approach to separating timbre, style, and emotion as independently controllable dimensions. We implemented a lightweight API-first version of this concept using ElevenLabs voice settings as the control surface.
ElevenLabs Voice Design v3 prompting methodology — describing voices by age, accent, timbre, and emotional register rather than by acoustic parameters directly.
Full stack
- Backend: Next.js App Router + TypeScript, single `/api/transform` endpoint
- Frontend: Vite + React + Tailwind CSS
- Audio: WebAudio API `AnalyserNode` driving a custom 120-spike SVG visualizer
- Personas: 7 custom ElevenLabs Voice Design voices (4 personas + 3 Radio Host accent variants)
Challenges we ran into
WebAudio InvalidStateError
`createMediaElementSource` can only be called once per `HTMLAudioElement`. Calling it again on persona switch throws an unrecoverable `InvalidStateError`.
We fixed this by creating the analyser source node once on component mount and routing all subsequent audio through the same node. Mic stream and playback audio both flow through the same analyser so the visualizer reacts to everything.
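The fix boils down to memoizing the source node per element. WebAudio itself can't run outside the browser, so this sketch keeps the node types generic; the `getOrCreateSource` helper name is illustrative.

```typescript
// createMediaElementSource throws InvalidStateError on a second call for
// the same element, so cache the source node per element and reuse it.
// In the browser, `create` would wrap audioCtx.createMediaElementSource.
const sourceCache = new WeakMap<object, unknown>();

function getOrCreateSource<T extends object, S>(
  element: T,
  create: (el: T) => S
): S {
  if (!sourceCache.has(element)) {
    sourceCache.set(element, create(element)); // runs exactly once per element
  }
  return sourceCache.get(element) as S;
}
```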
Stale React closure bugs
The persona selected at record time had to match the persona at API call time across async boundaries. Standard `useState` closures captured the wrong value.
We fixed this with a `selectedPersonaRef` that always mirrors current state and is read at call time, not capture time.
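Why the ref fix works can be shown without React: a closure freezes the value it captured at creation time, while a ref is dereferenced when the call actually fires. This is a plain-TypeScript stand-in for `useRef`; the variable names match the app, the `demo` wrapper is illustrative.

```typescript
// Capture-time vs. call-time reads — the root of the stale closure bug.
function demo() {
  let selectedPersona = "calmNarrator";
  const selectedPersonaRef = { current: selectedPersona };

  // Closure created while "calmNarrator" is selected, like an async
  // callback queued at record time.
  const captured = selectedPersona;
  const callApi = () => ({
    stale: captured,                    // value frozen at capture time
    fresh: selectedPersonaRef.current,  // value read at call time
  });

  // User switches persona before the async call fires.
  selectedPersona = "radioHost";
  selectedPersonaRef.current = selectedPersona;

  return callApi();
}
```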
Pipeline latency
The Gemini → Featherless → ElevenLabs round trip takes 3–5 seconds.
We built explicit state transitions (idle → recording → transforming → playing) with visual feedback at every step so the UI never feels broken or frozen.
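The state machine can be sketched as a transition table. The four states come straight from the flow above; the cancel/error edges back to `idle` are assumptions about our error handling, and the `transition` helper is illustrative.

```typescript
// Explicit UI state machine: only listed transitions are legal, so the
// interface always reflects exactly one pipeline stage.
type UiState = "idle" | "recording" | "transforming" | "playing";

const TRANSITIONS: Record<UiState, UiState[]> = {
  idle: ["recording"],
  recording: ["transforming", "idle"],  // back to idle = user cancelled
  transforming: ["playing", "idle"],    // back to idle = pipeline error
  playing: ["idle"],
};

function transition(from: UiState, to: UiState): UiState {
  if (!TRANSITIONS[from].includes(to)) {
    throw new Error(`illegal transition: ${from} -> ${to}`);
  }
  return to;
}
```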
Graceful degradation
The Featherless rewrite step is wrapped in a silent try/catch. If the LLM call fails for any reason, the route falls back to the original transcript without breaking the demo. The user never sees an error from an upstream API failure.
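The fallback shape is simple enough to show in full. This is a sketch of the pattern, not our route handler; `rewriteWithFallback` and its callback parameter are illustrative names.

```typescript
// Graceful degradation: if the LLM rewrite fails for any reason, return
// the original transcript instead of surfacing the upstream error.
async function rewriteWithFallback(
  transcript: string,
  rewrite: (text: string) => Promise<string>
): Promise<string> {
  try {
    return await rewrite(transcript);
  } catch {
    return transcript; // demo keeps working on any upstream failure
  }
}
```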
Accomplishments we're proud of
- The pipeline is fully real — no pre-recorded clips, no mocked responses, no smoke and mirrors
- The moment a judge hears their own words played back as four completely different identities is genuinely surprising every single time
- Identity transformation works at two layers simultaneously:
  - Linguistic (Featherless rewrites the words)
  - Acoustic (ElevenLabs voices the new identity)
- The visualizer reacts to both mic input during recording and transformed audio during playback through the same WebAudio analyser node — one continuous visual experience
- Built and shipped end-to-end in 4 hours on two MacBook Airs with zero GPU infrastructure
What we learned
- API-first design with strong visuals beats trying to run heavy models locally on consumer laptops
- The most impressive demos are not the ones with the most models — they are the ones where the judge immediately understands the before and the after
- ElevenLabs Voice Design + per-persona settings creates dramatically different perceived identities without any model training or fine-tuning
- Featherless makes open LLM inference a drop-in, OpenAI-compatible building block — we integrated Llama 3.1 8B in under 30 minutes
- Separating concerns cleanly (Gemini for understanding, Featherless for identity, ElevenLabs for synthesis) made the system easy to debug and extend under time pressure
What's next
- Continuous emotion, age, and intensity sliders mapped to ElevenLabs stability and style controls — letting users tune identity in real time
- Backboard integration for streaming, interruptible persona switching with sub-second latency
- Amphion Vevo on-device voice conversion as an alternative synthesis path for privacy-first use cases
- Comparison mode: one recording generates all 4 persona outputs simultaneously via `Promise.all`, displayed side by side
- Accent variants for all 4 personas, not just Radio Host
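The comparison-mode fan-out can be sketched as below. `compareAll` and `transformForPersona` are illustrative stand-ins for the real `/api/transform` call; the point is that the personas run concurrently rather than in sequence.

```typescript
// Comparison mode: fan one transcript out to every persona at once with
// Promise.all, so total latency is the slowest persona, not the sum.
async function compareAll(
  transcript: string,
  personas: string[],
  transformForPersona: (persona: string, text: string) => Promise<string>
): Promise<Record<string, string>> {
  const outputs = await Promise.all(
    personas.map(
      async (p) => [p, await transformForPersona(p, transcript)] as const
    )
  );
  return Object.fromEntries(outputs);
}
```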
Built With
- elevenlabs
- featherless
- gemini
- next.js
- react
- tailwind
- three.js
- typescript
- webaudio
