About

Voice shouldn't be a setting. It should be an identity.

Inspiration

Most voice agents sound identical — flat, generic, forgettable. Whether it's a customer service bot, a podcast narrator, or a voice assistant, synthetic voice has no identity. It's technically impressive but humanly unmemorable.

We're entering a world where voice is becoming the primary interface for hands-free computing, AI narration is replacing human voiceover, and synthetic speech is everywhere. Yet nobody has solved voice identity — the idea that how you sound is as expressive and personal as how you look.

The insight that drove Aria: identity in voice isn't just tone. It's the words you choose, the rhythm of your sentences, your energy, your age, your emotion, your accent. A calm narrator doesn't just speak quietly — they choose measured, precise words. A radio host doesn't just speak loudly — they speak in punchy fragments. These are different identities, not just different volumes.

That question became Aria: can we transform voice identity at every layer simultaneously?


What it does

Aria is a real-time voice identity playground. You speak once. Aria transforms your voice into a completely different persona — not just the tone, but the actual words adapt to match each persona's identity and character.

The experience

  1. Record your voice or upload an audio file
  2. Pick a persona: Calm Narrator, Radio Host, Elder Storyteller, or Playful Kid
  3. Aria transcribes your speech, rewrites it to match the persona's linguistic identity, and synthesizes it in that persona's voice
  4. A live 120-spike SVG audio visualizer reacts to both your raw input and the transformed output in real time
  5. Switch personas instantly — color, character, phrasing, and voice all change

Radio Host supports 4 live accent variants: American, British, Australian, Indian

Same meaning. Completely different person.


How we built it

Three AI systems work in sequence on every request:

1. Google Gemini 2.5 Flash — Speech Understanding

Transcribes raw audio to text in real time using inline base64 audio. We chose Gemini for its speed and accuracy on short conversational clips.
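In code, the transcription step looks roughly like this. It's a sketch against Google's public REST endpoint; the prompt wording and the `audio/webm` MIME type are illustrative, not our exact production values.

```typescript
// One request part carries the base64 audio inline, the other the instruction.
type Part = { inlineData: { mimeType: string; data: string } } | { text: string };

function buildTranscriptionParts(base64Audio: string, mimeType = "audio/webm"): Part[] {
  return [
    { inlineData: { mimeType, data: base64Audio } },
    { text: "Transcribe this audio verbatim. Return only the transcript." },
  ];
}

// Network call (not executed in this sketch); model name per the pipeline above.
async function transcribe(base64Audio: string): Promise<string> {
  const url =
    "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent" +
    `?key=${process.env.GEMINI_API_KEY}`;
  const res = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ contents: [{ parts: buildTranscriptionParts(base64Audio) }] }),
  });
  const json = await res.json();
  return json.candidates[0].content.parts[0].text;
}
```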

2. Featherless (Meta Llama 3.1 8B Instruct) — Persona Rewriting

This is the core insight that separates Aria from a TTS wrapper. After transcription, we send the text to Llama 3.1 8B via Featherless's OpenAI-compatible API with a carefully engineered persona director prompt.

The LLM rewrites the transcript to match the persona's linguistic identity:

  • Calm Narrator: formal, measured, precise phrasing
  • Radio Host: punchy, clipped, high-energy fragments
  • Elder Storyteller: warm, unhurried, reflective sentences
  • Playful Kid: simple, exclamatory, bouncy rhythm

The words themselves change — not just the voice.
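A minimal sketch of the rewrite step, with the style notes lifted straight from the list above. The base URL, model ID, and prompt wording here are illustrative stand-ins for our actual persona director prompt.

```typescript
// Linguistic identity per persona, as described above.
const PERSONA_STYLES: Record<string, string> = {
  calmNarrator: "formal, measured, precise phrasing",
  radioHost: "punchy, clipped, high-energy fragments",
  elderStoryteller: "warm, unhurried, reflective sentences",
  playfulKid: "simple, exclamatory, bouncy rhythm",
};

function buildDirectorPrompt(persona: string, transcript: string): string {
  return [
    `You are a persona director. Rewrite the transcript in this style: ${PERSONA_STYLES[persona]}.`,
    "Preserve the speaker's meaning exactly. Return only the rewritten text.",
    `Transcript: ${transcript}`,
  ].join("\n");
}

// OpenAI-compatible chat completion against Featherless (endpoint assumed).
async function rewriteForPersona(persona: string, transcript: string): Promise<string> {
  const res = await fetch("https://api.featherless.ai/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.FEATHERLESS_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "meta-llama/Meta-Llama-3.1-8B-Instruct",
      messages: [{ role: "user", content: buildDirectorPrompt(persona, transcript) }],
    }),
  });
  const json = await res.json();
  return json.choices[0].message.content;
}
```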

3. ElevenLabs eleven_multilingual_v2 — Voice Synthesis

Synthesizes the rewritten transcript using custom Voice Design voices created for each persona with age, accent, and tone prompts.

Each persona has individually tuned voice settings:

  • stability — controls consistency vs. expressiveness
  • style_exaggeration — controls how “acted” the delivery feels
  • similarity_boost — controls adherence to the reference voice
  • speaking_rate — controls pacing

Radio Host has 4 distinct Voice Design voices for American, British, Australian, and Indian accents — same character, different cultural identity.
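A sketch of the synthesis call. The voice IDs and setting values below are placeholders, not our actual tuning; note that in the ElevenLabs REST API the "style exaggeration" knob is the `style` field of `voice_settings`.

```typescript
// Per-persona voice configuration (values illustrative, not the shipped tuning).
interface PersonaVoice {
  voiceId: string;         // placeholder IDs, not real Voice Design voices
  stability: number;       // consistency vs. expressiveness
  style: number;           // how "acted" the delivery feels
  similarityBoost: number; // adherence to the reference voice
}

const VOICES: Record<string, PersonaVoice> = {
  calmNarrator: { voiceId: "VOICE_ID_NARRATOR", stability: 0.8, style: 0.2, similarityBoost: 0.75 },
  radioHost:    { voiceId: "VOICE_ID_RADIO",    stability: 0.3, style: 0.9, similarityBoost: 0.75 },
};

async function synthesize(persona: string, text: string): Promise<ArrayBuffer> {
  const v = VOICES[persona];
  const res = await fetch(`https://api.elevenlabs.io/v1/text-to-speech/${v.voiceId}`, {
    method: "POST",
    headers: {
      "xi-api-key": process.env.ELEVENLABS_API_KEY!,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      text,
      model_id: "eleven_multilingual_v2",
      voice_settings: {
        stability: v.stability,
        style: v.style,
        similarity_boost: v.similarityBoost,
      },
    }),
  });
  return res.arrayBuffer(); // raw audio bytes, streamed back to the client
}
```

The intuition: a calm narrator wants high stability and low style, a radio host the opposite.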


Research foundation

Our persona voice design and synthesis approach is inspired by:

  • Amphion / Vevo (open-source controllable voice conversion framework) — specifically its approach to separating timbre, style, and emotion as independently controllable dimensions. We implemented a lightweight API-first version of this concept using ElevenLabs voice settings as the control surface.

  • ElevenLabs Voice Design v3 prompting methodology — describing voices by age, accent, timbre, and emotional register rather than by acoustic parameters directly.


Full stack

  • Backend: Next.js App Router + TypeScript, single /api/transform endpoint
  • Frontend: Vite + React + Tailwind CSS
  • Audio: WebAudio API AnalyserNode driving a custom 120-spike SVG visualizer
  • Personas: 7 custom ElevenLabs Voice Design voices (4 personas + 3 Radio Host accent variants)

Challenges we ran into

WebAudio InvalidStateError

createMediaElementSource can only be called once per HTMLAudioElement. Calling it again on persona switch throws an unrecoverable error.

We fixed this by creating the analyser source node once on component mount and routing all subsequent audio through the same node. Mic stream and playback audio both flow through the same analyser so the visualizer reacts to everything.

Stale React closure bugs

The persona selected at record time had to match the persona at API call time across async boundaries. Standard useState closures captured the wrong value.

We fixed this with a selectedPersonaRef that always mirrors current state and is read at call time, not capture time.
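The bug and the fix, reduced to a plain-TypeScript analogue (React omitted so the capture-time vs. read-time distinction is visible in isolation):

```typescript
// A ref is just a mutable box whose .current is read when needed.
type Ref<T> = { current: T };

function demoStaleVsRef(): { stale: string; fresh: string } {
  const personaRef: Ref<string> = { current: "calmNarrator" };
  const capturedAtRecordTime = personaRef.current; // what a useState closure sees
  personaRef.current = "radioHost";                // user switches before the API call
  return { stale: capturedAtRecordTime, fresh: personaRef.current };
}

// In React, the equivalent is:
//   const selectedPersonaRef = useRef(selectedPersona);
//   useEffect(() => { selectedPersonaRef.current = selectedPersona; }, [selectedPersona]);
// ...and reading selectedPersonaRef.current inside the async handler.
```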

Pipeline latency

The Gemini → Featherless → ElevenLabs round trip takes 3–5 seconds.

We built explicit state transitions (idle → recording → transforming → playing) with visual feedback at every step so the UI never feels broken or frozen.
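Those transitions can be sketched as a typed map, so an illegal move fails loudly instead of leaving the UI in a half-state. The exact edge set below (e.g. falling back to idle on failure) is an assumed illustration:

```typescript
type UiState = "idle" | "recording" | "transforming" | "playing";

// Legal moves per state; illustrative edges, not the exact shipped graph.
const TRANSITIONS: Record<UiState, UiState[]> = {
  idle: ["recording"],
  recording: ["transforming", "idle"],
  transforming: ["playing", "idle"], // back to idle on pipeline failure
  playing: ["idle", "recording"],
};

function transition(from: UiState, to: UiState): UiState {
  if (!TRANSITIONS[from].includes(to)) {
    throw new Error(`illegal transition ${from} -> ${to}`);
  }
  return to;
}
```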

Graceful degradation

The Featherless rewrite step is wrapped in a silent try/catch. If the LLM call fails for any reason, the route falls back to the original transcript without breaking the demo. The user never sees an error from an upstream API failure.
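The fallback amounts to a small wrapper around the rewrite call, sketched here with the rewrite function injected so the behavior is easy to see:

```typescript
// If the persona rewrite fails for any reason, degrade silently to the
// original transcript so synthesis always receives valid text.
async function rewriteOrFallback(
  transcript: string,
  rewrite: (t: string) => Promise<string>,
): Promise<string> {
  try {
    return await rewrite(transcript);
  } catch {
    return transcript; // the user never sees the upstream failure
  }
}
```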


Accomplishments we're proud of

  • The pipeline is fully real — no pre-recorded clips, no mocked responses, no smoke and mirrors
  • The moment a judge hears their own words played back as four completely different identities is genuinely surprising every single time
  • Identity transformation works at two layers simultaneously:
    • Linguistic (Featherless rewrites the words)
    • Acoustic (ElevenLabs voices the new identity)
  • The visualizer reacts to both mic input during recording and transformed audio during playback through the same WebAudio analyser node — one continuous visual experience
  • Built and shipped end-to-end in 4 hours on two MacBook Airs with zero GPU infrastructure

What we learned

  • API-first design with strong visuals beats trying to run heavy models locally on consumer laptops
  • The most impressive demos are not the ones with the most models — they are the ones where the judge immediately understands the before and the after
  • ElevenLabs Voice Design + per-persona settings creates dramatically different perceived identities without any model training or fine-tuning
  • Featherless makes open LLM inference a drop-in, OpenAI-compatible building block — we integrated Llama 3.1 8B in under 30 minutes
  • Separating concerns cleanly (Gemini for understanding, Featherless for identity, ElevenLabs for synthesis) made the system easy to debug and extend under time pressure

What's next

  • Continuous emotion, age, and intensity sliders mapped to ElevenLabs stability and style controls — letting users tune identity in real time
  • Backboard integration for streaming, interruptible persona switching with sub-second latency
  • Amphion Vevo on-device voice conversion as an alternative synthesis path for privacy-first use cases
  • Comparison mode: one recording generates all 4 persona outputs simultaneously via Promise.all, displayed side by side
  • Accent variants for all 4 personas, not just Radio Host
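The planned comparison mode is simple in shape: the same transform fanned out across every persona concurrently. A sketch, with the transform function injected (names assumed):

```typescript
// Run one recording through all personas in parallel and key results by persona.
async function transformAll<T>(
  personas: string[],
  transform: (persona: string) => Promise<T>,
): Promise<Record<string, T>> {
  const results = await Promise.all(personas.map((p) => transform(p)));
  return Object.fromEntries(personas.map((p, i) => [p, results[i]]));
}
```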
