Inspiration
Every music student knows the struggle: you practice for hours, but without a teacher in the room, you can't tell if you're actually improving. Lessons are expensive and time-limited. I set out to solve this problem with AI. What if an AI could sit beside you during every practice session, comparing your playing to a reference performance and coaching you in real time? That's how Virtuoso AI was born.
What it does
Virtuoso AI is a web-based music practice coach. You upload two clips: a teacher's reference performance and your own practice attempt. The AI then analyzes the differences across four dimensions: rhythm & timing, pitch & intonation, phrasing & dynamics, and physical technique. It returns timestamped feedback pinpointing exactly where you deviate, highlights your strengths, and generates personalized practice drills complete with repetition targets and success criteria, so you know when you've nailed it.
How I built it
I built Virtuoso AI as a React + TypeScript single-page application using Vite for fast development. The UI is styled with Tailwind CSS. On the AI side, I integrated Google's Gemini 2.0 Pro model through the @google/genai SDK, leveraging its multimodal capabilities to process both video and audio inputs simultaneously. Media files are converted to base64 and sent alongside a carefully engineered prompt that instructs Gemini to act as a world-class conservatory professor. The frontend includes a custom markdown renderer to display structured, color-coded feedback with icons for each section.
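To make the request structure concrete, here is a minimal sketch of how the base64-encoded clips and the coaching prompt are assembled into one multimodal payload. The `inlineData` part shape follows the @google/genai SDK's conventions, but the model id, prompt text, and function names here are illustrative, not copied from the project:

```typescript
// Illustrative sketch: assembling a multimodal Gemini request body.
// COACH_PROMPT and buildAnalysisRequest are my own names; the real
// prompt is a much longer "conservatory professor" instruction.
const COACH_PROMPT =
  "You are a world-class conservatory professor. Compare the reference " +
  "performance (first clip) with the student's attempt (second clip) and " +
  "give timestamped feedback on rhythm, pitch, phrasing, and technique.";

interface MediaClip {
  mimeType: string; // e.g. "video/webm" or "audio/mp3"
  base64Data: string; // file contents, base64-encoded in the browser
}

function buildAnalysisRequest(reference: MediaClip, attempt: MediaClip) {
  return {
    model: "gemini-2.0-pro-exp", // model id is an assumption
    contents: [
      {
        role: "user",
        parts: [
          { text: COACH_PROMPT },
          { inlineData: { mimeType: reference.mimeType, data: reference.base64Data } },
          { inlineData: { mimeType: attempt.mimeType, data: attempt.base64Data } },
        ],
      },
    ],
  };
}
```

The actual call would then pass this object to the SDK's `generateContent` method; keeping the payload construction in a pure function makes it easy to unit-test without touching the network.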
Challenges I ran into
API quota limits on the Gemini Pro free tier were a constant bottleneck during testing - I had to carefully manage rate limits and experiment with different model versions. Crafting the right prompt to get consistently structured, timestamped, and actionable feedback also took several iterations. Performance is another ongoing challenge: longer clips take noticeably more time to analyze since the entire file is converted to base64 and sent in a single API call, and optimizing how media is processed and chunked before reaching the model is something I'd like to improve.
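One generic way to cope with quota errors is a small retry wrapper with exponential backoff. This is a sketch (`withBackoff` is my own name, not code from the project), assuming the API surfaces rate-limit failures as rejected promises:

```typescript
// Hypothetical helper for working under free-tier quota limits:
// retry a failing request with exponentially growing delays.
async function withBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 1000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err; // out of retries, surface the error
      // Wait baseDelayMs, 2x, 4x, ... before the next attempt.
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Usage would look like `await withBackoff(() => callGeminiApi(request))`, so a transient 429 turns into a short wait instead of a failed analysis.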
Limitations / What would make it better
Audio Processing
- A dedicated audio analysis layer (like Essentia or librosa on a backend) for quantitative pitch/timing data, combined with Gemini for qualitative technique feedback
- Audio alignment algorithms (like dynamic time warping) to sync the two clips before comparison
- Waveform visualization so students can see where they diverge

The current approach is great for a zero-backend prototype, but a production version would benefit from pairing the AI with actual digital signal processing.
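As a sketch of the alignment idea mentioned above, a textbook dynamic time warping pass over two per-frame feature sequences (say, pitch estimates for the reference and the student clip, extracted by a library like librosa) would look like this:

```typescript
// Dynamic time warping (DTW) over two 1-D feature sequences.
// Returns the minimal cumulative alignment cost: 0 means one clip
// is a stretched/compressed version of the other.
function dtwDistance(a: number[], b: number[]): number {
  const n = a.length;
  const m = b.length;
  // cost[i][j] = cheapest way to align a[0..i-1] with b[0..j-1]
  const cost: number[][] = Array.from({ length: n + 1 }, () =>
    new Array<number>(m + 1).fill(Infinity),
  );
  cost[0][0] = 0;
  for (let i = 1; i <= n; i++) {
    for (let j = 1; j <= m; j++) {
      const d = Math.abs(a[i - 1] - b[j - 1]);
      // Extend the best of: insertion, deletion, or match.
      cost[i][j] =
        d + Math.min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]);
    }
  }
  return cost[n][m];
}
```

For example, `dtwDistance([1, 2, 3], [1, 2, 2, 3])` is 0, because DTW absorbs the student holding a note slightly longer; a plain frame-by-frame comparison would flag it as an error.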
Media Processing
- Entire files are converted to base64 in the browser, which is memory-intensive and can crash tabs with large files
- The 50MB limit is a soft cap — there's no server-side processing or compression
- No audio extraction from video, so the model processes raw video even when only audio matters
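One mitigation for the memory problem would be to encode the file in slices rather than as one giant string. This is a sketch of the idea, not the project's current code: keeping each slice a multiple of 3 bytes means its base64 output has no padding, so the chunks concatenate into one valid base64 string:

```typescript
// Chunked base64 encoding: never materialize the whole binary string.
// btoa is available in browsers (and modern Node for testing).
function encodeBase64Chunked(bytes: Uint8Array, chunkSize = 3 * 1024 * 1024): string {
  if (chunkSize % 3 !== 0) throw new Error("chunkSize must be a multiple of 3");
  const parts: string[] = [];
  for (let offset = 0; offset < bytes.length; offset += chunkSize) {
    const slice = bytes.subarray(offset, offset + chunkSize);
    // Build the binary string per slice: O(chunk) memory, not O(file).
    let binary = "";
    for (const byte of slice) binary += String.fromCharCode(byte);
    parts.push(btoa(binary));
  }
  return parts.join("");
}
```

The last slice may be shorter than `chunkSize` and carry `=` padding, which is fine because it is also the last chunk of the final string. Real compression or audio extraction would still need a backend or WebAssembly codec.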
Accomplishments that I'm proud of
I'm proud of how polished the end-to-end experience feels - from the dual media input (upload or record directly in-browser) to the color-coded, section-by-section analysis output. The timestamped feedback makes it feel like a real teacher is watching your performance frame by frame.
What I learned
I learned how powerful multimodal AI has become - Gemini can genuinely analyze musical performances across audio and video simultaneously. I also learned that prompt engineering is as much art as science; small wording changes dramatically affect output quality. On the technical side, I deepened my understanding of Vite's environment variable handling, the MediaRecorder API for in-browser recording, and building lightweight markdown renderers without heavy dependencies.
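The renderer lesson is easy to illustrate. Below is a minimal line-based sketch in the same spirit as the one described (the real component also maps sections to colors and icons; the rules shown here are just headings, bullets, and bold):

```typescript
// Lightweight markdown-to-HTML renderer: no dependencies, one pass per line.
function renderMarkdown(md: string): string {
  return md
    .split("\n")
    .map((line) => {
      // "# ", "## ", "### " headings
      const heading = line.match(/^(#{1,3})\s+(.*)$/);
      if (heading) {
        const level = heading[1].length;
        return `<h${level}>${heading[2]}</h${level}>`;
      }
      // Inline bold: **text** -> <strong>text</strong>
      const body = line.replace(/\*\*(.+?)\*\*/g, "<strong>$1</strong>");
      // "- " bullets become list items
      return line.startsWith("- ") ? `<li>${body.slice(2)}</li>` : body;
    })
    .join("\n");
}
```

A regex-per-line approach like this handles structured AI output well precisely because the prompt constrains the model to a small, predictable subset of markdown.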
What's next for Virtuoso AI
First, I'd love to build an AI avatar: a virtual teacher figure that watches and listens to your practice in real time, offering feedback as you play, almost like having a private instructor in the room with you. Longer term, I envision a teacher library with pre-loaded reference performances and streaming AI responses, so feedback appears as the model generates it.
Built With
- gemini
- vite
