Final Writeup

https://docs.google.com/document/d/1hWGAkPgRPcQybHF-Z3VS-jpbBnh9sOkuKfIvYAEhg1U/edit?usp=sharing

Music Mood Classification

Final Project for CSCI 2470

Team Members

  • Breese Sherman (bsherma3)
  • Nathan Plano (nplano)
  • Elizabeth Chen (ecc1)
  • Qingyang Xie (qxie6)

Introduction

We aim to implement a Deep Learning system that can automatically detect the emotion of a given piece of music. Specifically, we will classify each song into one of four mood classes {happy, angry, sad, relax}, making use of both the audio signal and the lyrics. We chose this topic because all of us are interested in music processing with DL, and we were drawn to the way music evokes emotional responses in its listeners; a classification task also seemed like a feasible one for us to complete successfully. Following the paper "Multi-Modal Song Mood Detection with Deep Learning," we plan to first implement a model that reproduces the expected results. Next, we would like to extend the model by adding more spectral and rhythm features, such as the tempogram and the Fourier tempogram. Finally, to overcome some drawbacks mentioned in the paper, we are also considering collecting more data with the Spotify API and modifying the structure of the model to achieve better accuracy.

Related Work

The research paper that inspired us, "Multi-Modal Song Mood Detection with Deep Learning," examines the success of using audio and lyrics separately versus concurrently for detecting emotions in songs with Deep Learning. The authors posit that although many factors contribute to the perception and intensity of the emotions evoked by music, including cultural background, life experiences, and listening environment, a given song generally tends to have a universally interpretable mood associated with it. Given this, the goal of their research was to combine Natural Language Processing methods applied to lyrics with Digital Signal Processing methods applied to audio segments, forming a unified analysis system that is more successful than either method individually. They use the Circumplex model of human emotions to define four primary mood categories: each quadrant of a two-dimensional valence-arousal (x-y) space represents Happy, Relaxed, Sad, or Angry. For lyrics, the authors tested a Fully-Connected Neural Network using Bag of Words and TF-IDF embeddings, a Recurrent Neural Network with LSTM cells using Word2Vec and GloVe embeddings, and a pre-trained network using BERT embeddings. For audio analysis, they chose a Convolutional Neural Network with three pairs of convolutional and pooling layers and two fully-connected classifier layers, tuning hyperparameters to find the optimal combination of values. They then combined the best-performing individual systems, the NN with BERT embeddings for lyrics and the CNN with optimal hyperparameters for audio, into a unified emotion-classification system. The authors ultimately found that the multi-modal detection system was far superior to the uni-modal systems, indicating that both lyrics and audio contain useful information for training Deep Learning models.

Data

We'll be working with the MoodyLyrics dataset, which can be downloaded from http://softeng.polito.it/erion/MoodyLyrics.zip. This is the same dataset used in the paper mentioned in the introduction, and it contains over 2000 songs of varying popularity, each with a title, artist, and mood. To comply with copyright restrictions, the songs themselves (the sound files) are not included in the dataset, nor are the lyrics to the respective songs. Therefore, we'll have to do some web scraping to obtain the required data.

We'll first write a scraper that fetches, in sequence, the lyrics (if any exist) for each song in the dataset. We'll then do additional processing on the lyrics to condense them to a smaller, unified vocabulary and to remove any extraneous metadata.
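The cleanup step might look like the following sketch. The helper names (`clean_lyrics`, `build_vocab`) and the exact rules (strip bracketed section markers, drop punctuation, keep words seen at least twice) are our assumptions, not a finalized pipeline:

```python
import re

# Hypothetical lyric-cleanup helpers: lowercase, strip bracketed
# metadata like "[Chorus]", drop punctuation, and keep only words
# that occur at least `min_count` times across the whole corpus.
def clean_lyrics(text):
    text = re.sub(r"\[.*?\]", " ", text)              # remove [Verse 1], [Chorus], ...
    text = re.sub(r"[^a-zA-Z' ]", " ", text.lower())  # keep letters and apostrophes
    return text.split()

def build_vocab(tokenized_songs, min_count=2):
    counts = {}
    for tokens in tokenized_songs:
        for tok in tokens:
            counts[tok] = counts.get(tok, 0) + 1
    return {tok for tok, c in counts.items() if c >= min_count}

print(clean_lyrics("[Chorus]\nHello, hello darkness!"))  # ['hello', 'hello', 'darkness']
```

Thresholding by count is one simple way to get the "smaller, unified vocabulary"; rare words would map to an unknown token at training time.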

The more difficult part will be obtaining pertinent audio files for all ~2000 songs in the dataset. Our current plan is to use the famous youtube-dl command-line tool in a similar manner to our lyric scraper — downloading each song, or more likely, a segment of each song, off of YouTube if it's available. This may take some time, but we will try to parallelize the process among the four of us as much as possible.

Methodology

As stated in the introduction, we'd like to expand upon the methodology used in the original paper by exploring additional features of the dataset.

Our architecture is a simple classifier on the surface, but we'll be experimenting with focusing on different features of the audio to definitively say what features best determine the mood of any particular song. More specifically, we'll extract available audio features from the Spotify Web API using the Tekore Python library, and use those to better inform our decision. However, in the end, we'd like to be able to make well-informed decisions without relying on the Spotify API's more in-depth features — ideally, we'd like to use only tempo, since the rest of them (danceability, valence…) would do too much heavy lifting for us.
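The restriction to tempo could be enforced with a small filter between the API and the model. Tekore's `track_audio_features` call is real, but the `select_features` helper and the client setup shown in comments are our own sketch:

```python
# Hypothetical helper that restricts the Spotify feature dict to the
# keys we allow the model to see (just tempo, per the plan above).
def select_features(features, keys=("tempo",)):
    return {k: features[k] for k in keys if k in features}

# With real credentials (assumption: client-credentials flow):
#   import tekore as tk
#   spotify = tk.Spotify(app_token)
#   af = spotify.track_audio_features(track_id)   # Tekore API call
#   x = select_features(vars(af), keys=("tempo",))

print(select_features({"tempo": 120.0, "danceability": 0.8, "valence": 0.4}))
```

Keeping the filter explicit makes it easy to run ablations: widening `keys` to include `danceability` or `valence` shows how much heavy lifting those features would do.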

We'd also like to bring 2D convolution into the picture by analyzing the visual spectrogram of each song. This'll require getting the spectrogram, or some version of it, in the first place — the librosa Python library provides helpful functions that pull spectral features out of sound files. Our convolutional architecture will do so, most likely running a Fourier transform on (a segment of) each song, transforming it into an array or tensor on which we can run our classic Conv2D functions.
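The signal-to-image step can be illustrated with a minimal windowed-FFT sketch using only numpy; in practice we would use librosa's `stft` or `feature.melspectrogram` on the downloaded .wav files, so the frame length and hop below are illustrative defaults, not chosen hyperparameters:

```python
import numpy as np

# Minimal STFT-magnitude sketch: slide a Hann window over the signal,
# take the FFT of each frame, and stack the magnitude spectra into a
# 2-D (frequency x time) array suitable for Conv2D.
def spectrogram(signal, frame_len=256, hop=128):
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        frames.append(np.abs(np.fft.rfft(frame)))  # magnitude spectrum
    return np.stack(frames, axis=1)                # shape: (freq_bins, time_frames)

sr = 8000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))    # 1 s of a 440 Hz tone
print(spec.shape)                                  # (129, 61)
```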

Ultimately, our methodology will be reliant on training using features of different types — auditory, visual, and lyrical — that we can pull out of the dataset.

Metrics

Our metrics for success are straightforward: we want to accurately classify the mood of a song, so we can score ourselves by how many songs' moods we predict correctly. We would like our accuracy to be above 90%, though this may be a stretch goal, as other papers report lower accuracies on the MoodyLyrics dataset.
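The scoring itself is just plain classification accuracy over the four mood labels:

```python
# Fraction of songs whose predicted mood matches the true label.
def accuracy(predicted, actual):
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

preds  = ["happy", "sad", "angry", "relax"]
labels = ["happy", "sad", "relax", "relax"]
print(accuracy(preds, labels))  # 0.75
```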

Ethics

There is not much to consider when thinking about the ethics of determining the mood of a song: at worst, a bad algorithm would recommend a sad song for a happy playlist. Still, we will strive to make our project as ethical as it can be.

Division of Labor

  • Breese: Data / preprocessing
  • Qingyang: Model (visual spectrogram part)
  • Elizabeth: Model (audio and lyrics)
  • Nathan: Training code

Updates

Update - Final Project Reflection

May 2, 2022

Challenges

Most of the work put into this project so far has been obtaining the different datasets required for training. This required writing two custom web scrapers — one for downloading song lyrics and one for getting sound files — the legality and effectiveness of which were hotly debated amongst our team.

The lyrical scraper was written first, and it turned out to be the more challenging of the two. The paper that provided the MoodyLyrics dataset, where our songs and labels come from, did not provide the lyrics for each song directly, but rather suggested a way to scrape the lyrics for each song from LyricWiki. We prepared to implement this ourselves, but then found, to our dismay, that LyricWiki was shut down in 2020, throwing that possibility out the window. We then turned to scraping from Google, but Google disallows requests that do not come from its web page. As a last resort, we turned to Genius, building URLs for all 2595 songs in the dataset and hoping that each would correspond to a registered set of lyrics on Genius. This mostly worked, although a small set of lyrics is still missing.
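The URL construction was roughly as follows. Genius lyric pages generally follow the pattern `genius.com/<Artist>-<title>-lyrics` with spaces hyphenated and only the first letter capitalized; the `genius_url` helper is our sketch of that rule, and edge cases (featured artists, unusual punctuation) are exactly why a handful of songs come back missing:

```python
import re

# Hypothetical slug builder for Genius lyric-page URLs.
def genius_url(artist, title):
    slug = f"{artist} {title}".strip()
    slug = re.sub(r"[^a-zA-Z0-9 ]", "", slug)  # drop punctuation
    slug = re.sub(r"\s+", "-", slug)           # spaces -> hyphens
    return f"https://genius.com/{slug.capitalize()}-lyrics"

print(genius_url("Queen", "Bohemian Rhapsody"))
```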

The challenges that came with extracting 2595 separate audio snippets mostly revolved around the time needed to download them from YouTube using the youtube-dl command line tool. YouTube unfortunately throttles download speeds significantly, so each song took about 1-2 minutes to download. Luckily, this process was parallelizable, so we overcame this challenge by spawning 10 separate youtube-dl crawlers, speeding up the procedure by a factor of 10. Still, the effort took several days in total, but we now have every single song of the 2595 in the dataset.
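The 10-way parallelization amounts to a worker pool over the song list. Here is a sketch using Python's standard `concurrent.futures`; `download_song` is a stand-in for the real youtube-dl subprocess call, which is the assumption that makes this runnable:

```python
from concurrent.futures import ThreadPoolExecutor

# Run up to `workers` download jobs at once; results come back in
# the same order as the input song list.
def download_all(songs, download_song, workers=10):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(download_song, songs))

# Dummy run: "downloading" is just echoing the song name.
results = download_all(["Song A", "Song B", "Song C"], lambda s: f"{s}.wav")
print(results)  # ['Song A.wav', 'Song B.wav', 'Song C.wav']
```

Threads (rather than processes) are a reasonable fit here because each worker mostly waits on network I/O from YouTube.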

Insights

We do not have any concrete results yet as we just finished downloading all of the audio files over the weekend. We are looking forward to working with the data and gaining interesting insights in the coming days.

Plan

Going forward, during the next nine days our team plans to dedicate time to preprocessing our audio data. This will include creating spectrograms for all the audio files we have. Up until this point, much of our work has been scraping the web to get audio samples and lyric samples for songs labeled with their mood. Now that we have acquired this data, it is time to preprocess it so that we can extract as much information as possible for classifying mood correctly. This means creating spectrograms showing the Fourier analysis of song snippets, i.e., 2D representations of our data, on which CNNs will be used to extract mood information during training and testing. The second set of data comes from the lyrics, where a word2vec approach will be used to determine the mood of songs. Our third set of data will be tempograms created from our .wav files, which will give tempo-based information about our dataset; again, a CNN will be very useful for mood classification on this data. We feel that this plan will allow us to be successful.
