For our final project in the Technion Deep Learning course (046211), we chose to classify music genres on the GTZAN dataset.
Our approach uses a pre-trained Wav2Vec2 transformer model.
Unlike most existing models, the transformer operates on the raw time-series audio, which is why we expected an improvement over existing methods.
We used the facebook/wav2vec2-large-100k-voxpopuli model from Hugging Face:
Facebook's Wav2Vec2 model pre-trained on the 100k-hour unlabeled subset of the VoxPopuli speech data.
We used the well-known GTZAN dataset.
The dataset consists of 1000 audio tracks each 30 seconds long.
It contains 10 genres, each represented by 100 tracks: blues, classical, country, disco, hiphop, jazz, metal, pop, reggae, and rock.
The tracks are all 22050Hz Mono 16-bit audio files in .wav format.
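Since every track is 30 seconds at a 22050 Hz sample rate, each file decodes to a fixed-length array; a quick sanity check:

```python
SAMPLE_RATE = 22050  # Hz, the GTZAN sample rate
DURATION_S = 30      # seconds per track

n_samples = SAMPLE_RATE * DURATION_S
print(n_samples)  # 661500 samples per track
```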
| File | Purpose |
|---|---|
| img | Contains images for the README.md file |
| train_30s_model.py | Train the model on 30 s tracks |
| train_15s_model.py | Train the model on 15 s tracks |
| train_10s_model.py | Train the model on 10 s tracks |
| eval_model.py | Evaluate the model |
| rolling_stones.wav | Example audio file |
The model was trained on 30 s tracks.

Performance:
- 87% accuracy on the validation set
- 77% accuracy on the test set

The model was trained on 15 s tracks; each 30 s track was divided into two 15 s sub-tracks.

Performance:
- 78.85% accuracy on the validation set
- 75.5% accuracy on the test set

The model was trained on 10 s tracks; each 30 s track was divided into three 10 s sub-tracks.

Performance:
- 78% accuracy on the validation set
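The division into fixed-length sub-tracks described above can be sketched as follows (a minimal NumPy sketch; the project's training scripts may implement it differently):

```python
import numpy as np

SAMPLE_RATE = 22050  # Hz, the GTZAN sample rate

def split_track(audio, segment_seconds, sample_rate=SAMPLE_RATE):
    """Split a 1-D waveform into non-overlapping fixed-length segments,
    dropping any trailing samples shorter than one segment."""
    seg_len = int(segment_seconds * sample_rate)
    n_segments = len(audio) // seg_len
    return [audio[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]

# A 30 s track yields two 15 s sub-tracks or three 10 s sub-tracks.
track = np.zeros(30 * SAMPLE_RATE)
print(len(split_track(track, 15)))  # 2
print(len(split_track(track, 10)))  # 3
```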
The project is intended to run in the Hugging Face Docker image.
For instructions on how to install Docker, see:
https://docs.docker.com/engine/install/

Replace `train_30s_model.py` with your chosen training script:

```shell
docker run --name gtzan --rm -it --ipc=host --gpus=all -v $PWD:/home huggingface/transformers-pytorch-gpu python3 /home/train_30s_model.py
```

This command spins up a Docker container from the official Hugging Face image, mounts the repository directory, and runs the training script.
Open the model on Hugging Face.
Note that the Hugging Face inference server supports tracks only up to 2-3 minutes long.

To run inference locally, start a container (with GPU support, or CPU-only):

```shell
# With GPU support
docker run --name gtzan --rm -it --ipc=host --gpus=all -v $PWD:/home huggingface/transformers-pytorch-gpu
# CPU only
docker run --name gtzan --rm -it -v $PWD:/home huggingface/transformers-pytorch-gpu
```

In the container, use either a Python script file or the interactive interpreter:
```python
from transformers import pipeline
import torchaudio

MODEL_NAME = 'adamkatav/wav2vec2_100k_gtzan_30s_model'
SONG_IN_REPO_DIR_PATH = '/home/rolling_stones.wav'

pipe = pipeline(model=MODEL_NAME)
audio_array, sample_freq = torchaudio.load(SONG_IN_REPO_DIR_PATH)
# Wav2Vec2 expects 16 kHz mono input: resample, then downmix to mono
resample = torchaudio.transforms.Resample(orig_freq=sample_freq, new_freq=16000)
audio_array = resample(audio_array).mean(axis=0).squeeze().numpy()
output = pipe(audio_array)
print(output)
```
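The pipeline returns a list of label/score dictionaries; picking the top prediction is then straightforward (the scores below are made-up placeholder values, not real model output):

```python
# Hypothetical output shape from the audio-classification pipeline
output = [
    {'label': 'rock', 'score': 0.62},
    {'label': 'blues', 'score': 0.21},
    {'label': 'metal', 'score': 0.17},
]

# Take the entry with the highest score as the predicted genre
top = max(output, key=lambda d: d['score'])
print(top['label'])  # rock
```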





