Inspiration
- We love singing and especially love the vibe of Asian-style karaoke bars, but they are too hard to come by and too expensive for students. Therefore, KaraokeJump was born, as a means of singing whatever you want, whenever you want, and wherever you want.
Core Technologies
- Speech Recognition with Whisper: Converts audio into text using advanced speech-to-text technology.
- Audio Separation through Spleeter: Isolates vocals and accompaniment from MP3 files, creating a true karaoke experience by splitting tracks just like in a real karaoke bar.
- Audio Compression: Automatically compresses every uploaded file to reduce load times between front-end and back-end.
- Microservice Architecture: Transcription servers run separately from the main back-end to avoid dependency conflicts.
- React: Used to build the front-end graphical user interface.
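A minimal sketch of how the separation and transcription steps might be wired together, assuming the stock `spleeter` and `openai-whisper` Python packages. The function names, the `spleeter:2stems` model, and the `base` Whisper model are illustrative assumptions, not the project's actual configuration. The heavy imports are deferred into each function so the two steps could live in separate services with separate dependencies.

```python
from pathlib import Path


def stem_paths(audio_path: str, out_dir: str) -> dict:
    """Return where Spleeter's default layout writes the two stems.

    With default settings Spleeter writes <out_dir>/<track name>/<stem>.wav.
    """
    track_dir = Path(out_dir) / Path(audio_path).stem
    return {
        "vocals": track_dir / "vocals.wav",
        "accompaniment": track_dir / "accompaniment.wav",
    }


def separate(audio_path: str, out_dir: str) -> dict:
    """Split a song into vocals and accompaniment (assumed 2-stem model)."""
    # Imported lazily so this module loads even where Spleeter is absent.
    from spleeter.separator import Separator

    separator = Separator("spleeter:2stems")
    separator.separate_to_file(audio_path, out_dir)
    return stem_paths(audio_path, out_dir)


def transcribe(vocals_path: str, model_name: str = "base") -> str:
    """Transcribe the isolated vocals with Whisper (assumed model size)."""
    # Imported lazily so the transcription service owns this dependency.
    import whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(str(vocals_path))
    return result["text"].strip()
```

Calling `separate("song.mp3", "stems")` and then `transcribe(paths["vocals"])` would yield the lyrics as plain text, under the assumptions above.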
What it does
The user-uploaded audio file is passed through an audio separation engine that returns the vocals and the accompaniment. The accompaniment is played when the user chooses to sing, whereas the vocals are used to measure the accuracy of the user's performance.
How we built it
We built KaraokeJump as a single-page application using React and MaterialUI on the front-end, and FastAPI on the back-end for compatibility with our audio processing tools. The user-uploaded song is sent to the back-end, where audio compression and vocals-accompaniment separation happen. The vocals are then passed to a separate transcription server, which returns the lyrics in text form to the front-end. As the user sings, the server measures accuracy against the vocals and displays the score once the user is done.
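The scoring step could be sketched as follows. This is an illustrative assumption, not the project's actual metric: it compares a transcript of the user's singing with the reference lyrics word by word, using Python's standard `difflib`, rather than comparing audio directly. The names `normalize` and `lyric_accuracy` are hypothetical.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> list:
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def lyric_accuracy(reference: str, performance: str) -> float:
    """Return the share of reference words matched in order (0.0 to 1.0)."""
    ref_words = normalize(reference)
    sung_words = normalize(performance)
    if not ref_words:
        return 0.0
    # autojunk=False keeps frequent words such as "the" from being ignored.
    matcher = SequenceMatcher(None, ref_words, sung_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)
```

For example, `lyric_accuracy("the air was cold", "the air is cold")` matches three of the four reference words and returns 0.75.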
Challenges we ran into
- The main challenge was finding a way to fine-tune our speech-to-text models, since most existing commercial speech-to-text APIs (e.g. Google Cloud Speech-to-Text or OpenAI Speech-To-Text) are optimized for everyday conversation, not song lyrics. Our approach was to first separate the audio into vocals and accompaniment, then apply noise reduction and filtering to the vocals to bring them as close to normal speech as possible.
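The write-up does not say which filters were used, so the following is only a simplified sketch of the kind of clean-up that could be applied to the separated vocals before transcription. It assumes mono audio as a list of floats in the range -1.0 to 1.0. The three helpers, and their threshold and `alpha` defaults, are hypothetical.

```python
def noise_gate(samples, threshold=0.02):
    """Silence samples whose magnitude falls below the threshold."""
    return [s if abs(s) >= threshold else 0.0 for s in samples]


def high_pass(samples, alpha=0.95):
    """One-pole high-pass filter: y[n] = alpha * (y[n-1] + x[n] - x[n-1]).

    Attenuates low-frequency rumble left over from the accompaniment.
    """
    if not samples:
        return []
    out = [samples[0]]
    for n in range(1, len(samples)):
        out.append(alpha * (out[-1] + samples[n] - samples[n - 1]))
    return out


def normalize_peak(samples, peak=1.0):
    """Scale the signal so its loudest sample has magnitude `peak`."""
    loudest = max((abs(s) for s in samples), default=0.0)
    if loudest == 0.0:
        return list(samples)
    return [s * peak / loudest for s in samples]


def clean_vocals(samples):
    """Chain the three steps: filter, gate, then normalize."""
    return normalize_peak(noise_gate(high_pass(samples)))
```

A production pipeline would more likely use a dedicated audio library and spectral noise reduction. The per-sample gate here is deliberately crude and only illustrates the idea.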
Accomplishments that we're proud of
- Achieved an average lyric detection rate of 75% across different music genres (yes, including Rap :D) by leveraging advanced amplitude modulation, feature extraction and noise reduction.
- We also optimized existing Spleeter models for a ~15% improvement in audio separation time (in our testing, the 10-minute version of All Too Well took 4 seconds to separate, though this varies with CPU/GPU).
What we learned
- Audio processing, optimization and different techniques for speeding up webpages
What's next for KaraokeJump
In the future, we plan to further optimize our existing Whisper models to better capture the lyrics in a song's vocals (think ~95% accuracy). We also plan to add lyric support for more non-English languages. On the user experience side, we want users to be able to create their own profiles, record performances, and compete with other users, either through direct PvP or through online local/global contests.