Final Writeup
https://docs.google.com/document/d/1hWGAkPgRPcQybHF-Z3VS-jpbBnh9sOkuKfIvYAEhg1U/edit?usp=sharing
Music Mood Classification
Final Project for CSCI 2470
Team Members
- Breese Sherman (bsherma3)
- Nathan Plano (nplano)
- Elizabeth Chen (ecc1)
- Qingyang Xie (qxie6)
Introduction
We are implementing a deep learning system that can automatically detect the emotion of a given piece of music. Specifically, we will classify each song into one of four mood classes {happy, angry, sad, relaxed}, making use of both the audio signal and the lyrics. We chose this topic because all of us were interested in applying deep learning to music processing, and the way music evokes emotion in listeners appealed to us; we also judged a classification task to be feasible within the scope of the course. Following the paper "Multi-Modal Song Mood Detection with Deep Learning," we first plan to implement a model that reproduces the paper's results. Next, we would like to extend the model with additional spectral and rhythm features, such as the tempogram and the Fourier tempogram. Finally, to address some drawbacks mentioned in the paper, we are also considering collecting more data with the Spotify API and modifying the structure of the model to achieve better accuracy.
Related Work
The research paper that inspired us––entitled "Multi-Modal Song Mood Detection with Deep Learning"––examines the success of using audio and lyrics separately versus concurrently for detecting emotions in songs with deep learning. The authors posit that despite the many factors contributing to the perception and intensity of emotions evoked by music––including cultural background, life experiences, listening environment, and more––the emotions evoked by a given song tend to have a universally interpretable mood associated with them. Given this, the goal of their research was to combine Natural Language Processing methods applied to lyrics with Digital Signal Processing methods applied to audio segments, forming a unified analysis system that is more successful than either method individually. The authors use the Circumplex model of human emotions to define four primary mood categories: each quadrant of a two-dimensional valence-arousal (x-y) space represents Happy, Relaxed, Sad, or Angry. For lyrics, the authors tested a fully-connected neural network using Bag of Words and TF-IDF embeddings, a recurrent neural network with LSTM cells using Word2Vec and GloVe embeddings, and one pre-trained network that uses BERT embeddings. For audio analysis, the authors chose a convolutional neural network with three pairs of convolutional and pooling layers and two fully-connected classifier layers, tuning hyperparameters to find the optimal combination of values. They took the best-performing individual systems for lyrics and audio––the NN with BERT and the CNN with optimal hyperparameters––and combined them into a unified system for emotion classification. The authors ultimately found that multi-modal detection systems are far superior to uni-modal systems, indicating that both lyrics and audio contain useful information for training deep learning models.
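The Circumplex quadrant assignment described above can be sketched as a small helper. This is a minimal illustration, assuming valence and arousal are centered at zero (positive valence = pleasant, positive arousal = energetic); the function name and the boundary convention at zero are our own choices, not the paper's.

```python
# Hypothetical sketch: mapping a point in the valence-arousal plane
# (Circumplex model) to one of the four mood quadrants.

def circumplex_mood(valence: float, arousal: float) -> str:
    """Return the mood quadrant for a (valence, arousal) point."""
    if valence >= 0 and arousal >= 0:
        return "happy"      # high valence, high arousal
    if valence < 0 and arousal >= 0:
        return "angry"      # low valence, high arousal
    if valence < 0:
        return "sad"        # low valence, low arousal
    return "relaxed"        # high valence, low arousal
```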
Links to Related Work
- https://www.mdpi.com/1424-8220/22/3/1065 (Original Paper)
- https://github.com/konpyro/DeepMoodDetection
- https://github.com/teju97/Multi-Modal-Mood-Detection-of-Music
Data
We'll be working with the MoodyLyrics dataset, which can be downloaded from http://softeng.polito.it/erion/MoodyLyrics.zip. This is the same dataset used in the paper mentioned in the introduction, and it contains over 2000 songs of varying popularity, each with a title, artist, and mood. To comply with copyright restrictions, the songs themselves (the sound files) are not included in the dataset, nor are the lyrics to the respective songs. Therefore, we'll have to do some web scraping to obtain the required data.
We'll first write a scraper that fetches, in sequence, the lyrics (if there are any) of each song in the dataset. We'll then do additional processing on the lyrics that condenses them to a smaller, unified vocabulary and removes any extraneous metadata.
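The post-processing step might look like the sketch below. The section-marker format (`[Chorus]`, `[Verse 1]`, etc.) is an assumption about what lyrics sites typically embed; the exact cleaning rules will depend on what our scraper actually returns.

```python
import re

# Hypothetical sketch of lyric post-processing: strip section markers like
# "[Chorus]" (an assumed artifact of scraped lyric pages), lowercase the
# text, and reduce it to a simple whitespace-separated token list.

def clean_lyrics(raw: str) -> list[str]:
    text = re.sub(r"\[[^\]]*\]", " ", raw)          # drop [Verse 1], [Chorus], ...
    text = re.sub(r"[^a-z' ]", " ", text.lower())   # keep letters and apostrophes
    return text.split()
```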
The more difficult part will be obtaining pertinent audio files for all ~2000 songs in the dataset. Our current plan is to use the famous youtube-dl command-line tool in a similar manner to our lyric scraper — downloading each song, or more likely, a segment of each song, off of YouTube if it's available. This may take some time, but we will try to parallelize the process among the four of us as much as possible.
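The download step could be driven from Python roughly as follows. The `-x`, `--audio-format`, and `-o` flags and the `ytsearch1:` prefix are standard youtube-dl options; the output template and the helper function itself are our own sketch, not a finished pipeline.

```python
# Hypothetical sketch: build a youtube-dl command that searches YouTube for
# a song and extracts its audio track as an mp3.

def build_download_cmd(title: str, artist: str) -> list[str]:
    query = f"ytsearch1:{artist} {title}"   # take the first search result
    return [
        "youtube-dl",
        "-x",                               # extract audio only
        "--audio-format", "mp3",
        "-o", "audio/%(title)s.%(ext)s",    # our chosen output template
        query,
    ]

# e.g. subprocess.run(build_download_cmd("Imagine", "John Lennon"), check=True)
```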
Methodology
As stated in the introduction, we'd like to expand upon the methodology used in the original paper by exploring additional features of the dataset.
Our architecture is a simple classifier on the surface, but we'll be experimenting with focusing on different features of the audio to definitively say what features best determine the mood of any particular song. More specifically, we'll extract available audio features from the Spotify Web API using the Tekore Python library, and use those to better inform our decision. However, in the end, we'd like to be able to make well-informed decisions without relying on the Spotify API's more in-depth features — ideally, we'd like to use only tempo, since the rest of them (danceability, valence…) would do too much heavy lifting for us.
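A rough picture of the tempo-only plan: Tekore exposes Spotify's per-track audio features (tempo, valence, danceability, etc.), and we would keep just tempo as a model input. The Tekore calls below are shown as comments for context rather than executed; `normalize_tempo` is our own hypothetical helper, and its 200-BPM ceiling is an arbitrary assumption.

```python
# Hypothetical sketch of pulling Spotify features via Tekore and keeping
# only tempo. The commented lines require client credentials:
#
#   import tekore as tk
#   token = tk.request_client_token(CLIENT_ID, CLIENT_SECRET)
#   spotify = tk.Spotify(token)
#   features = spotify.track_audio_features(track_id)  # has .tempo, .valence, ...

def normalize_tempo(tempo_bpm: float, max_bpm: float = 200.0) -> float:
    """Scale a raw tempo in BPM into [0, 1] for use as a model input."""
    return min(max(tempo_bpm / max_bpm, 0.0), 1.0)
```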
We'd also like to bring 2D convolution into the picture by analyzing the visual spectrogram of each song. This'll require getting the spectrogram, or some version of it, in the first place — the librosa Python library provides helpful functions that pull spectral features out of sound files. Our pipeline will most likely run a short-time Fourier transform on (a segment of) each song, transforming it into an array or tensor on which we can run our classic Conv2D functions.
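To make the waveform-to-image step concrete, here is a minimal NumPy-only sketch of a short-time Fourier transform (in practice we plan to use librosa's spectral feature functions, e.g. its mel spectrogram, rather than rolling our own). The frame and hop sizes are arbitrary illustrative choices.

```python
import numpy as np

# Minimal sketch: an STFT turns a 1-D waveform into a 2-D time-frequency
# magnitude array that a Conv2D stack can consume like an image.

def stft_magnitude(signal: np.ndarray, frame: int = 512, hop: int = 256) -> np.ndarray:
    """Return an (n_frames, frame // 2 + 1) magnitude spectrogram."""
    window = np.hanning(frame)
    frames = [
        np.abs(np.fft.rfft(signal[i : i + frame] * window))
        for i in range(0, len(signal) - frame + 1, hop)
    ]
    return np.stack(frames)
```

Each row is the spectrum of one windowed slice of audio, so a 440 Hz sine wave produces a bright horizontal band near the corresponding frequency bin.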
Ultimately, our methodology will be reliant on training using features of different types — auditory, visual, and lyrical — that we can pull out of the dataset.
Metrics
Our metrics for success are very straightforward. We want to be able to accurately classify the mood of a song. Thus we can score ourselves based on how many moods we predict correctly for songs. We would like our accuracy to be above 90% but this may be a stretch goal as other papers have lower accuracies when dealing with the MoodyLyrics database.
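The metric above is plain classification accuracy over the four mood labels; a trivial sketch, with the function name being our own:

```python
# Sketch of our success metric: fraction of songs whose predicted mood
# (one of "happy", "angry", "sad", "relaxed") matches the label.

def mood_accuracy(predicted: list[str], actual: list[str]) -> float:
    assert len(predicted) == len(actual) and actual
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```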
Ethics
There is not much to consider when thinking about the ethics of determining the mood of a song: a bad algorithm would, at worst, recommend a sad song for a happy playlist. Even so, we will strive to make our project as ethical as it can be.
Division of Labor
- Breese: Data / preprocessing
- Qingyang: Model — spectrogram branch
- Elizabeth: Model — audio and lyrics branches
- Nathan: Training code