Doceo

AI piano practice coach for turning one performance video into score alignment, posture checks, and actionable feedback.

Inspiration

I wanted to build something closer to a real piano teacher than a generic score checker.

Most practice tools only tell you whether a note was right or wrong. I wanted a system that could also answer:

Did I play the right notes, at the right time?
Was my tempo stable?
Did my posture look healthy while I was playing?
What should I fix next, in plain language?

That idea became Doceo: a performance tutor that combines audio analysis, video analysis, and an AI-generated coaching layer.

What It Does

Doceo takes in:

a reference MIDI file
a performance video

Then it:

extracts audio from the video
transcribes the performance into note events
aligns the played notes against the reference score
analyzes posture from sampled video frames
generates feedback and drills
renders the result in a web interface

Output:

score annotations
piano-roll fallback
video playback with pose overlay
reference audio playback
AI tutor feedback
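As an illustration of the posture step, here is a minimal sketch of one kind of check that could run on a sampled frame. It assumes three 2D landmarks in normalized image coordinates. The landmark choice (elbow, wrist, knuckle), the function names, and the 150-degree threshold are all hypothetical and are not taken from the Doceo codebase.

```python
import math
from typing import Tuple

Point = Tuple[float, float]  # (x, y) in normalized image coordinates


def joint_angle(a: Point, b: Point, c: Point) -> float:
    """Return the angle at b, in degrees, between segments b->a and b->c."""
    ang_a = math.atan2(a[1] - b[1], a[0] - b[0])
    ang_c = math.atan2(c[1] - b[1], c[0] - b[0])
    deg = abs(math.degrees(ang_a - ang_c))
    # Fold reflex angles back into the 0..180 range.
    return 360.0 - deg if deg > 180.0 else deg


def wrist_flag(elbow: Point, wrist: Point, knuckle: Point,
               min_deg: float = 150.0) -> str:
    """Hypothetical rule: flag the frame when the wrist is strongly bent."""
    angle = joint_angle(elbow, wrist, knuckle)
    return "ok" if angle >= min_deg else "check wrist"
```

A real posture score would likely aggregate a check like this over many sampled frames rather than judging a single frame.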

How I Built It

Frontend

upload reference MIDI
upload performance video
run analysis
view results

The results page shows:

the annotated score
a piano-roll fallback
A/B playback with the user video and synthesized reference audio
posture overlay on top of the video
focus areas and drills
AI tutor feedback

Backend

The backend lives in api/ and is built with FastAPI.

Pipeline:

/midi parses the reference MIDI, exports MusicXML, and synthesizes a reference audio track
/video stores the performance video and extracts a mono WAV file with ffmpeg
/analyze transcribes the extracted audio with Basic Pitch
/align matches the played notes to the reference score with DTW
/pose samples video frames with OpenCV + MediaPipe Pose/Hands
/tutor feeds the analysis into Gemini and ElevenLabs for spoken coaching
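The mono WAV extraction in the /video step can be sketched as a small wrapper around the ffmpeg CLI. The function names and the 22050 Hz sample rate are assumptions for illustration, and the actual backend may pass different flags.

```python
import subprocess
from pathlib import Path
from typing import List


def build_extract_command(video: Path, wav: Path,
                          sample_rate: int = 22050) -> List[str]:
    """Build an ffmpeg command that writes a mono WAV from a video file."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", str(video),         # input video
        "-vn",                    # drop the video stream
        "-ac", "1",               # downmix to one audio channel (mono)
        "-ar", str(sample_rate),  # resample (assumed rate, not from the repo)
        str(wav),
    ]


def extract_audio(video: Path, wav: Path) -> None:
    """Run ffmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_extract_command(video, wav), check=True)
```

Keeping the command builder separate from the call to subprocess.run makes the flags easy to inspect and test without having ffmpeg installed.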

Core Libraries

FastAPI for the API layer
Next.js and React for the frontend
ffmpeg for video-to-audio extraction
basic-pitch for audio-to-note transcription
pretty_midi and music21 for MIDI/score handling
fastdtw for note alignment
OpenCV for frame sampling
MediaPipe for pose and hand landmark detection
Gemini for tutor script generation
ElevenLabs for voice output

Tech Stack

Frontend

Next.js 16
React 19
TypeScript
OpenSheetMusicDisplay

Backend

FastAPI
Uvicorn
Python
pretty_midi
music21
basic-pitch
fastdtw
OpenCV
MediaPipe
ffmpeg

AI and Feedback

Gemini for written tutor feedback
ElevenLabs for audio narration

Challenge: Converting Video to MIDI With High Accuracy

This was the hardest part of the project.

The raw audio coming from a performance video is messy:

room noise leaks into the signal
pedal resonance blurs note boundaries
timing is not perfectly quantized
transcribers can miss short notes or create duplicates
performance tempo can drift compared with the reference

To make the transcription usable, I had to add multiple cleanup layers:

extract a clean mono track from the video
run Basic Pitch to get candidate note events
smooth and merge near-duplicate transcribed notes
estimate timing offset against synthesized reference audio
guide the note cleanup with the reference score itself
write both raw and cleaned MIDI outputs for debugging
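The "smooth and merge near-duplicate transcribed notes" layer above could look roughly like the sketch below. It assumes each transcribed note is a (pitch, start, end) event and that two events of the same pitch separated by a very small gap are one note split by the transcriber. The NoteEvent type and the 30 ms gap are hypothetical choices, not values from the project.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class NoteEvent:
    pitch: int    # MIDI pitch number
    start: float  # seconds
    end: float    # seconds


def merge_near_duplicates(notes: List[NoteEvent],
                          max_gap: float = 0.03) -> List[NoteEvent]:
    """Merge same-pitch notes whose gap is at most max_gap seconds."""
    merged: List[NoteEvent] = []
    last_by_pitch: Dict[int, int] = {}  # pitch -> index of its latest note in merged
    for note in sorted(notes, key=lambda n: (n.start, n.pitch)):
        idx = last_by_pitch.get(note.pitch)
        if idx is not None and note.start - merged[idx].end <= max_gap:
            # Overlapping or nearly touching: extend the earlier note.
            merged[idx].end = max(merged[idx].end, note.end)
        else:
            last_by_pitch[note.pitch] = len(merged)
            merged.append(NoteEvent(note.pitch, note.start, note.end))
    return merged
```

A threshold that is too large would also merge intentionally repeated notes, which is one place where the reference score can help decide.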

Even after transcription, note alignment still needed a second pass. I used DTW to compare the played note stream against the reference, then labeled each event as correct, wrong pitch, missed, extra, early, late, or on-time.
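To make the alignment pass concrete, here is a dependency-free sketch. The project uses fastdtw; this example swaps in a plain exact DTW written in pure Python so it runs on its own. The cost function, the 50 ms timing tolerance, and the greedy one-to-one matching along the warping path are assumptions for illustration, not the project's actual rules.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Note:
    pitch: int    # MIDI pitch number
    onset: float  # seconds from the start of the take


def note_cost(ref: Note, played: Note) -> float:
    # Assumed cost: a pitch mismatch dominates, onset distance breaks ties.
    pitch_penalty = 0.0 if ref.pitch == played.pitch else 1.0
    return pitch_penalty + 0.1 * abs(ref.onset - played.onset)


def dtw_path(ref: List[Note], played: List[Note]) -> List[Tuple[int, int]]:
    """Exact DTW; returns the warping path as (ref_index, played_index) pairs."""
    n, m = len(ref), len(played)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
            acc[i][j] = note_cost(ref[i - 1], played[j - 1]) + best_prev
    path, i, j = [], n, m
    while i > 0 and j > 0:  # backtrack from the end to the start
        path.append((i - 1, j - 1))
        _, i, j = min((acc[i - 1][j - 1], i - 1, j - 1),
                      (acc[i - 1][j], i - 1, j),
                      (acc[i][j - 1], i, j - 1))
    path.reverse()
    return path


Label = Tuple[Optional[int], Optional[int], str, Optional[str]]


def label_events(ref: List[Note], played: List[Note],
                 tol: float = 0.05) -> List[Label]:
    """Label notes as (ref_index, played_index, pitch_status, timing_status)."""
    used_ref, used_played, labels = set(), set(), []
    # Accept the cheapest pairs on the path first so each note matches once.
    pairs = sorted(dtw_path(ref, played),
                   key=lambda p: note_cost(ref[p[0]], played[p[1]]))
    for i, j in pairs:
        if i in used_ref or j in used_played:
            continue
        used_ref.add(i)
        used_played.add(j)
        delta = played[j].onset - ref[i].onset
        pitch = "correct" if ref[i].pitch == played[j].pitch else "wrong pitch"
        timing = "on-time" if abs(delta) <= tol else ("early" if delta < 0 else "late")
        labels.append((i, j, pitch, timing))
    labels += [(i, None, "missed", None) for i in range(len(ref)) if i not in used_ref]
    labels += [(None, j, "extra", None) for j in range(len(played)) if j not in used_played]
    return labels
```

Exact DTW is quadratic in the number of notes, which is fine for short excerpts; an approximate variant such as fastdtw trades some accuracy for speed on longer takes.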

That combination of transcription plus alignment is what makes the output feel coherent instead of noisy.

What I Learned

Audio transcription needs cleanup and alignment to be useful.
Reference-aware heuristics matter when performances drift from the score.
Audio and video analysis solve different parts of the problem.
Specific feedback beats a raw score.
Structured inputs make AI feedback much better.
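To illustrate the last point, here is one way the analysis could be condensed into a structured payload before it reaches the language model. The field names, the prompt wording, and the helper functions are hypothetical, and no Gemini API call is shown.

```python
import json
from collections import Counter
from typing import Dict, List


def build_tutor_payload(events: List[Dict], max_examples: int = 3) -> str:
    """Summarize labeled note events as compact JSON for a prompt."""
    counts = Counter(event["status"] for event in events)
    problems = [e for e in events if e["status"] != "correct"][:max_examples]
    payload = {
        "total_notes": len(events),
        "counts": dict(counts),
        "examples": problems,  # a few concrete mistakes to talk about
    }
    return json.dumps(payload, indent=2, sort_keys=True)


def build_prompt(payload_json: str) -> str:
    """Wrap the JSON summary in a short instruction (wording is illustrative)."""
    return (
        "You are a piano teacher. Using only the analysis below, "
        "name the two most important fixes and suggest one drill for each.\n\n"
        + payload_json
    )
```

Passing counts and a few concrete examples, rather than a raw note dump, keeps the prompt short and gives the model specific bars or beats to refer to.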

Next Steps to Scale This Project

improve transcription accuracy on noisier recordings
support longer performances and bigger excerpts
separate left-hand and right-hand analysis
improve posture scoring
support multi-song sessions
track progress across practice attempts
deploy the pipeline so it runs reliably at scale
personalize feedback from recurring mistakes

Local Development

Backend

cd api
pip install -r requirements.txt
uvicorn main:app --reload --port 8000

Or, from an existing virtual environment (adjust the path to your checkout):

cd /Users/dzkchen/doceo/api
source .venv/bin/activate
python -m uvicorn main:app --host 127.0.0.1 --port 8000

Frontend

cd web
npm install
npm run dev

Open the app and upload a reference MIDI plus a performance video.

Repo Structure

api/ - FastAPI backend and analysis pipeline
web/ - Next.js frontend
api/storage/ - session outputs, transcriptions, alignments, and posture results

License

No license has been specified yet.
