Omer Cohen • Jonathan Nir Shalit
Our Project for the Technion's EE 046211 course "Deep Learning"
As our final project in the Deep Learning course, we were asked to choose a problem and solve it using neural networks and deep learning techniques. We chose to implement a DL algorithm that classifies the genre of a music track.
The algorithm's input is a 30-second music track, and the output is one of the following genres: Blues, Rock, Classical, Reggae, Disco, Country, Hip-Hop, Metal, Jazz, and Pop. Throughout our work we experimented with several approaches to this problem, both via the data and via the architecture.
We used the widely used GTZAN dataset. The dataset includes 10 classes of music genres, each containing 100 tracks of 30 seconds. Therefore, we faced a low-amount-of-data problem.
To enlarge our dataset, we used data augmentations. To apply these augmentations easily, we used the Librosa package. We used the following augmentations:
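As a hedged sketch of this kind of waveform augmentation (the functions below are illustrative NumPy implementations of common audio augmentations, not the project's actual `audio_augmentation` module):

```python
import numpy as np

def add_noise(y, noise_level=0.005):
    """Additive Gaussian noise, scaled by the signal's standard deviation."""
    return y + noise_level * np.std(y) * np.random.randn(len(y))

def random_gain(y, low=0.8, high=1.2):
    """Random volume change by a uniform factor."""
    return y * np.random.uniform(low, high)

def time_shift(y, max_shift=4000):
    """Circularly shift the waveform by a random number of samples."""
    return np.roll(y, np.random.randint(-max_shift, max_shift))

y = np.random.randn(22050)  # stand-in for a loaded waveform (1 s at 22.05 kHz)
augmented = add_noise(random_gain(time_shift(y)))
```

Each augmentation returns a waveform of the same length, so augmented copies can be added to the dataset alongside the originals.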
Usage:

```python
import audio_augmentation
audio_augmentation.main_reduced()
```

Our first attempt to improve the model's performance was to work with the raw data and use a 1D convnet. We tried two architectures that yielded the same performance:
first:
second:
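As a minimal sketch of this approach (the layer sizes and kernel widths below are illustrative assumptions, not the project's exact architectures), a 1D convnet over the raw waveform might look like:

```python
import torch
import torch.nn as nn

class RawAudio1DConvNet(nn.Module):
    """Illustrative 1D convnet over raw audio; sizes are assumed, not the project's."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=4), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=32, stride=2), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=16, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # collapse the time axis regardless of input length
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):  # x: (batch, 1, n_samples)
        z = self.features(x).squeeze(-1)
        return self.classifier(z)

model = RawAudio1DConvNet()
logits = model(torch.randn(2, 1, 22050))  # two 1-second clips at 22.05 kHz
```

The adaptive pooling at the end makes the network independent of the clip length, which is convenient when experimenting with chopped sub-tracks of different lengths.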
We tested our model on the 10-class dataset and got a poor 10% accuracy (random prediction). We tried using chopped sub-tracks of different lengths, and still the performance did not improve. At this point we concluded:
- Working with the raw music signal (without any pre-processing) is more difficult and requires more sophisticated architectures.
- Working with 2D data allows us to use known computer-vision architectures and techniques.
Working with 2D input means transforming the data into the time-frequency space of a mel-spectrogram. We used Librosa tools to transform the data. Here is an illustration of the transform:

Usage:

```python
import feature_extraction
feature_extraction.main()
```

We used a resnet18 architecture with dropout. To tune our hyper-parameters we used Optuna. This model achieved 62.4% accuracy on the test set.
We tried to boost our performance by using an ensemble of classifiers. In this method we chop each track into sub-tracks and predict a label for each sub-track independently. We tried both 'soft' and 'hard' ensembles: 'soft' means summing the output vectors and then taking the arg-max as the final prediction; 'hard' means building a histogram of all the mini-predictions and taking the label that received the majority of the mini-predictions as the final prediction.
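The two voting schemes above can be sketched in NumPy (the per-sub-track scores below are made up for illustration, chosen so the two schemes disagree):

```python
import numpy as np

# rows: per-sub-track class scores (e.g. softmax outputs); columns: classes
scores = np.array([
    [0.9, 0.1, 0.0],   # this sub-track is very confident in class 0
    [0.3, 0.4, 0.3],   # these two lean slightly toward class 1
    [0.3, 0.4, 0.3],
])

# 'soft' ensemble: sum the score vectors, then arg-max
soft_pred = np.argmax(scores.sum(axis=0))        # -> class 0

# 'hard' ensemble: arg-max each sub-track first, then take the majority vote
votes = np.argmax(scores, axis=1)                # per-sub-track predictions
hard_pred = np.bincount(votes).argmax()          # -> class 1
```

Note that the two schemes can disagree: the soft ensemble weighs confidence, while the hard ensemble counts only votes.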
Using the soft ensemble, this method yielded the following performance:
8 classes:
10 classes: