Audio-driven avatar lipsync

Talk to the avatar.

Live mic → MFCCs → small MLP → ARKit blendshapes → 3D mouth. Runs in your browser. Train it on your own voice in the Train tab, or use the formant heuristic out of the box.

Live

click to grant mic access

mapper checking for model…

Viseme weights

jaw open

consonant

Pipeline

mic → Web Audio FFT → 13-D MFCCs (src/mfcc.js) → 9-frame window → MLP 117 → 128 → 64 → 52 (src/model.js, TF.js) → blendshape→viseme reduce → 3D mouth (src/avatar.js).
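The 117-wide input in the pipeline above is 9 frames × 13 MFCCs. A minimal sketch of that windowing step follows. FrameWindow is a hypothetical helper, not the API of src/mfcc.js, and it assumes a trailing window (the 9 most recent frames) flattened oldest-first; the project may centre or order the window differently.

```javascript
// Sketch only: FrameWindow is a hypothetical helper, not the project's actual code.
const NUM_MFCC = 13;     // coefficients per frame (from the pipeline description)
const WINDOW_FRAMES = 9; // frames per model input (from the pipeline description)

class FrameWindow {
  constructor(numCoeffs = NUM_MFCC, numFrames = WINDOW_FRAMES) {
    this.numCoeffs = numCoeffs;
    this.numFrames = numFrames;
    this.frames = []; // oldest frame first
  }

  // Add one MFCC frame. Returns a flat Float32Array once the window is full, else null.
  push(mfcc) {
    if (mfcc.length !== this.numCoeffs) {
      throw new Error(`expected ${this.numCoeffs} coefficients, got ${mfcc.length}`);
    }
    this.frames.push(Float32Array.from(mfcc));
    if (this.frames.length > this.numFrames) this.frames.shift(); // drop the oldest
    if (this.frames.length < this.numFrames) return null;         // still warming up

    // Concatenate the frames into one model input vector.
    const out = new Float32Array(this.numCoeffs * this.numFrames);
    this.frames.forEach((frame, i) => out.set(frame, i * this.numCoeffs));
    return out;
  }
}

// Usage with synthetic frames: frame t is filled with the value t.
const win = new FrameWindow();
let input = null;
for (let t = 0; t < WINDOW_FRAMES; t++) {
  input = win.push(new Array(NUM_MFCC).fill(t));
}
console.log(input.length); // 117
```

The first 8 pushes return null, so a caller would skip inference (or fall back to the heuristic) until the window has filled.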

Training data comes from MediaPipe FaceLandmarker: your webcam labels the audio for you by recording the blendshape coefficients of your face as you speak. Training also runs in the browser (TF.js). The same architecture is implemented in training/train_jax.py, which exports to LiteRT for shipping to Android XR or Quest.
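One way to picture how the webcam "labels" the audio is to pair each audio frame with the face sample closest to it in time. The sketch below is an assumption about how such pairing could work, not the project's capture code: alignLabels, the field names t, mfcc and blendshapes, and the 50 ms tolerance are all hypothetical.

```javascript
// Sketch only: function name, field names and tolerance are hypothetical.
// Both arrays are assumed to be sorted by timestamp t (milliseconds).
function alignLabels(audioFrames, faceSamples, maxGapMs = 50) {
  const pairs = [];
  let j = 0; // index of the current nearest face sample
  for (const frame of audioFrames) {
    // Advance while the next face sample is at least as close in time.
    while (
      j + 1 < faceSamples.length &&
      Math.abs(faceSamples[j + 1].t - frame.t) <= Math.abs(faceSamples[j].t - frame.t)
    ) {
      j++;
    }
    const face = faceSamples[j];
    // Drop audio frames with no face sample nearby (e.g. tracking was lost).
    if (face && Math.abs(face.t - frame.t) <= maxGapMs) {
      pairs.push({ mfcc: frame.mfcc, blendshapes: face.blendshapes });
    }
  }
  return pairs;
}

// Usage with synthetic data: audio every 10 ms, video at roughly 30 fps.
const audio = [0, 10, 20, 100].map((t) => ({ t, mfcc: new Array(13).fill(0) }));
const face = [
  { t: 0, blendshapes: "A" },
  { t: 33, blendshapes: "B" },
];
const pairs = alignLabels(audio, face);
console.log(pairs.map((p) => p.blendshapes)); // [ 'A', 'A', 'B' ]
```

The audio frame at 100 ms is dropped because the nearest face sample is 67 ms away, beyond the assumed tolerance.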

Capture

click start to grant permissions

frames captured: 0

How to record

Do each of the 3 takes below once — they cover different phonetic territory, so reading distinct material gives the model much more variety than repeating the same script. Each take is ~75 seconds. Vary your volume and pace; move your head a bit.

Take 1 — sentences + silence + vowels

"The birch canoe slid on the smooth planks. Glue the sheet to the dark blue background. It's easy to tell the depth of a well. These days a chicken leg is a rare dish. Rice is often served in round bowls. The juice of lemons makes fine punch."

[pause 4 seconds — silent]

"Aaaaaa eeeeee iiiiii oooooo uuuuuu. Boot, beet, bit, bait, but, bot, bought, bout. Now backwards: ooo, oh, awe, ah, eh, ih, ee."

[pause 4 seconds — silent]

──────────────

Take 2 — consonants + tongue-twisters

"Sssssss, ffffff, shhhhh, zzzzzz, vvvvvv, mmmmm, nnnnnn. Pa pa pa, ta ta ta, ka ka ka, ba ba ba, da da da, ga ga ga. Pip, top, kick, big, dot, get."

[pause 4 seconds — silent]

"Peter Piper picked a peck of pickled peppers. She sells sea shells by the sea shore. How now brown cow. Red lorry, yellow lorry, red lorry, yellow lorry. Unique New York, unique New York. The Leith police dismisseth us."

[pause 4 seconds — silent]

──────────────

Take 3 — pangrams + casual speech

"The quick brown fox jumps over the lazy dog. Pack my box with five dozen liquor jugs. Sphinx of black quartz, judge my vow. Bright vixens jump; dozy fowl quack. Five quacking zephyrs jolt my wax bed."

[pause 4 seconds — silent]

Then, in a casual voice, as if chatting to a friend, read:

"I went to the shop this morning, and they were completely out of milk, which is really annoying because I was planning to make pancakes for breakfast and now I have to walk all the way to the other shop on the corner. Anyway, how are you doing today? Anything interesting happen?"

[pause 4 seconds — silent]

Drop all three downloads on the Train tab when done.

Training data

Drop JSON files here, or

frames loaded: 0 (need ~1000+ to train; 30k+ for good quality)

no model saved

drag in JSON files to begin

What this does

Trains a small MLP (117 → 128 → 64 → 52) in your browser with TensorFlow.js. Inputs are 9-frame windows of MFCCs; targets are the 52 ARKit blendshape coefficients MediaPipe captured for that audio.
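To give a sense of how small this network is, the sketch below counts its trainable parameters. It assumes plain fully connected layers with one bias per output unit; the layer sizes come from the description above, while countParams is a hypothetical helper rather than part of src/model.js.

```javascript
// Sketch only: assumes dense layers with biases; countParams is a hypothetical helper.
const LAYER_SIZES = [117, 128, 64, 52]; // 117 = 9 frames × 13 MFCCs; 52 = ARKit blendshapes

function countParams(sizes) {
  let total = 0;
  for (let i = 0; i + 1 < sizes.length; i++) {
    // weights (in × out) plus biases (out) for each layer
    total += sizes[i] * sizes[i + 1] + sizes[i + 1];
  }
  return total;
}

const params = countParams(LAYER_SIZES);
const float32Bytes = params * 4; // 4 bytes per float32 weight

console.log(params);       // 26740
console.log(float32Bytes); // 106960
```

Under these assumptions the model has about 27k parameters, roughly 107 KB as raw float32 values before any encoding overhead.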

Saves the trained weights to your browser's localStorage. The Demo tab auto-loads the saved model on every visit; the mapper indicator flips to learned.

For shipping to Android XR or Quest, the same architecture is implemented in training/train_jax.py, a JAX/Flax trainer that exports to LiteRT (.tflite).