Inspiration
As CS students, we know firsthand how stressful our lives can be. Whether it's schoolwork, projects, or even just applying to internships, we know that every little thing we do could change the trajectory of our lives. And this can come across in our voices. But what if there was a way that we could detect how a person was feeling through their voice, and give them a song that helps their current mood?
What it does
Moodulate takes an audio recording of the user speaking and analyzes features such as tone, pitch, and modulation to determine the user’s mood, then recommends a playlist of five songs on Spotify. Our goal is to help users regulate their emotions and listen to music suited to their state of mind at any given moment. For example, if the user’s mood is detected to be “happy,” the app recommends five happy songs to sustain that mood; if the mood is detected to be “sad,” it still recommends five happy songs to cheer the user up. Ultimately, Moodulate serves as a mental wellness app that aims to brighten the user’s mood.
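The mood-to-playlist logic above (sad maps to happy songs, not sad ones) can be sketched as a small lookup. This is a minimal illustration: the mood labels, target mappings, and song titles here are placeholders, not the app's actual data.

```python
# Illustrative mood -> target-mood mapping; a detected mood that should
# be "lifted" maps to a happier target playlist (labels are examples).
TARGET_MOOD = {
    "happy": "happy",
    "sad": "happy",    # cheer the user up rather than match the sadness
    "angry": "calm",
    "neutral": "happy",
}

# Placeholder song lists standing in for the real catalog.
PLAYLISTS = {
    "happy": ["Song A", "Song B", "Song C", "Song D", "Song E", "Song F"],
    "calm": ["Song G", "Song H", "Song I", "Song J", "Song K"],
}

def recommend(detected_mood: str, n: int = 5) -> list[str]:
    """Return up to n songs from the playlist for the target mood."""
    target = TARGET_MOOD.get(detected_mood, "happy")
    return PLAYLISTS[target][:n]
```

Keeping the mapping in a table like this makes the "cheer-up" policy easy to tweak per mood without touching the recommendation code.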
How we built it
We used a pre-trained Wav2Vec2 speech emotion recognition model from Hugging Face, trained on a combination of datasets (including RAVDESS, SAVEE, TESS, and URDU) for maximum robustness. We used Flask for our backend and React to build the UI. We deployed our AI pipeline on Modal and our frontend on Aedify.
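The inference step looks roughly like the sketch below: a Hugging Face audio-classification pipeline returns a list of label/score dictionaries, and the backend picks the top label. The model identifier in the comment is a stand-in, not the exact checkpoint we used.

```python
# Assumed setup (commented out because it downloads a model):
# from transformers import pipeline
# classifier = pipeline("audio-classification",
#                       model="some/wav2vec2-emotion-model")  # hypothetical id
# scores = classifier("recording.wav")

def top_emotion(scores: list[dict]) -> str:
    """Pick the highest-scoring label from pipeline output shaped like
    [{"label": "happy", "score": 0.81}, {"label": "sad", "score": 0.12}, ...]."""
    return max(scores, key=lambda s: s["score"])["label"]
```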
Challenges we ran into
The number one challenge we ran into was that the first model we used for detecting emotion from voice was decent but inconsistent. It was pre-trained solely on the RAVDESS dataset, a corpus of emotional audio samples. Those samples are very short (around 2-3 seconds) and were recorded by only a few professional voice actors under studio noise conditions, so the model could not generalize to real-world examples. With so little data of people speaking normally, its predictions were humorously different from what we expected (for example, outputting "happy" when we pretended to cry, and "angry" when we were calm). We investigated different solutions, including splitting our recorded audio into 3-second overlapping intervals, but nothing worked, though we did find that shorter audio clips were much easier for the model to analyze than longer ones. Eventually we decided the only solution was to change the pre-trained model. After extensive research and experimentation, we landed on a Wav2Vec2 speech emotion recognition model trained on multiple datasets, not just RAVDESS, that had achieved high classification accuracy. These included the SAVEE, TESS, and URDU datasets. This model worked much better!
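The overlapping-window idea we tried can be sketched as a simple chunker over the raw samples. The sample rate and 50% overlap below are illustrative defaults, not the exact values we used.

```python
def split_overlapping(samples: list[float], sr: int = 16000,
                      window_s: float = 3.0, overlap: float = 0.5):
    """Yield fixed-length windows of `window_s` seconds that overlap
    by `overlap` fraction (e.g. 0.5 -> each window shares half its
    samples with the previous one). Trailing partial windows are dropped."""
    win = int(window_s * sr)                    # samples per window
    hop = max(1, int(win * (1 - overlap)))      # samples to advance each step
    for start in range(0, max(1, len(samples) - win + 1), hop):
        yield samples[start:start + win]
```

Each window would then be classified separately, with the per-window predictions aggregated (e.g. by majority vote) into one overall mood; in our case this still did not rescue the RAVDESS-only model.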
Accomplishments that we're proud of
- Developing a user-friendly interface that is interactive and dynamic based on the user’s mood
- Successfully implementing an audio→emotion model
- Using Modal to host our backend pipeline and our emotion model
- Utilizing Aedify.ai to host our frontend
What we learned
We explored different methods to improve model robustness and optimized parameters for high classification accuracy, especially on noisy, real-world laptop recordings with ambient background noise. We learned the importance of using a reliable, multi-corpus Transformer-based model already trained on a wide range of conditions, especially for tasks such as speech emotion recognition, and of using serverless infrastructure such as Modal to greatly accelerate inference and reduce latency. Through this, we learned how to take a research dataset and turn it into a responsive web application. Working on this project has been super fun, and we hope to improve upon it much more in the future!
What's next for Moodulate
One possible avenue for expansion for Moodulate would be to grow our dataset of songs. Right now we have a dataset of 64 songs, and while that’s a decently robust size for our needs, more songs would always be better for our users. One way to do this would be to find (or maybe even build!) a model that evaluates the mood and tone of songs. Combining this with something like the Spotify API, we could find any number of songs that would help the user’s state of mind. Additionally, we could use another model that actually generates music based on the user’s mood.
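Pulling mood-matched tracks via the Spotify Web API could look like the sketch below, using the spotipy client (authentication setup omitted; the keyword-search approach is one possible strategy, not a finished design). The helper just extracts display strings from a search response.

```python
# Assumed setup (commented out because it needs API credentials):
# import spotipy
# from spotipy.oauth2 import SpotifyClientCredentials
# sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials())
# resp = sp.search(q="happy", type="track", limit=5)

def track_names(resp: dict) -> list[str]:
    """Pull 'Artist - Title' strings out of a Spotify track-search
    response, which nests results under resp["tracks"]["items"]."""
    items = resp.get("tracks", {}).get("items", [])
    return [f'{t["artists"][0]["name"]} - {t["name"]}' for t in items]
```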