This project aims to predict song popularity using a multi-modal approach, combining audio spectrograms processed with Convolutional Neural Networks (CNN) and textual metadata processed using Transformer-based models.
- Overview
- How To Run
- Datasets
- Challenges and Solutions
- Model Architecture
- Training Approach
- Training Details
- Preliminary Results
- Conclusions
This repository contains the implementation for predicting song popularity using a combination of audio features (spectrograms) and textual metadata. A Convolutional Neural Network (CNN) is used to extract features from audio spectrograms, while a Transformer network processes metadata (artist name, song name, and release year). The two model outputs are then fused for a combined prediction.
- `Popularity_Analysis.ipynb` – Contains the full training process and evaluation. You can modify hyperparameters and load previously trained models.
- `Trained_models/` – Stores pre-trained models, including naive and optimized combined models.
- `datasets/` – Contains a small dataset of 150 samples for testing. The full dataset (~3,000 samples) couldn't be uploaded due to storage limitations.
  - Files are in pickle format (`.pkl`) and must be loaded in the notebook's initial cells.
- `dataset_generation/` – Includes scripts for:
  - Generating down-sampled spectrograms (~3k samples).
  - Preprocessing the Spotify dataset (~550k samples).
- If needed, you can use the Kaggle API JSON file to fetch datasets.
- Upload the two `.pkl` dataset files into the notebook.
  - For convenience, it's recommended to mount Google Drive when running in Google Colab.
- Run all notebook cells to train and evaluate the model.
- If you wish to generate your own dataset (Spotify metadata processing or different spectrogram down-sampling techniques), use the scripts in `dataset_generation/`.
Enjoy! 🚀
- Source: Kaggle - Billboard Hot 100 Spectrograms
- Details: The dataset contains over 3,000 CWT (Continuous Wavelet Transform) spectrograms for the CNN model.
- Spectrogram Properties: 200 pixels per second; each spectrogram is roughly 30,000-40,000 pixels wide.
- Source: Kaggle - Spotify Dataset
- Details: The dataset includes metadata for songs, including artist name, song title, release date, and popularity, which are used as input for the Transformer model.
The two datasets are merged to create a final dataset of ~3,000 samples; the overlapping songs are then removed from the Spotify dataset so that the pre-training data and the combined-model data share no samples.
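The merge-and-exclude step above could be sketched with pandas. The column names and toy rows below are hypothetical stand-ins for the real Kaggle datasets:

```python
import pandas as pd

# Hypothetical toy frames standing in for the two Kaggle datasets;
# the real column names may differ.
spectro_df = pd.DataFrame({
    "artist": ["A", "B"],
    "title": ["x", "y"],
    "spectrogram_path": ["a_x.npy", "b_y.npy"],
})
spotify_df = pd.DataFrame({
    "artist": ["A", "B", "C"],
    "title": ["x", "y", "z"],
    "year": [1999, 2005, 2010],
    "popularity": [71, 44, 12],
})

# Inner join keeps only songs present in both sources (the ~3k merged set).
merged = spectro_df.merge(spotify_df, on=["artist", "title"], how="inner")

# Songs already in the merged set are excluded from the Spotify-only set,
# so Transformer pre-training and the combined model share no samples.
overlap = spotify_df.merge(spectro_df[["artist", "title"]],
                           on=["artist", "title"], how="left", indicator=True)
spotify_only = overlap[overlap["_merge"] == "left_only"].drop(columns="_merge")
```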
- Down-sampling: The spectrograms are down-sampled using FFT-interpolations to a constant size of 1024x256 pixels. This standardization ensures a consistent size across all samples while retaining important patterns and features.
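A minimal sketch of FFT-based down-sampling using `scipy.signal.resample`, assuming the target size is 256 rows by 1024 columns; the toy input below is narrower than a real 30-40K-pixel-wide spectrogram to keep the example fast:

```python
import numpy as np
from scipy.signal import resample

def downsample_spectrogram(spec, target_h=256, target_w=1024):
    """FFT-based resampling of a spectrogram to a fixed (height, width).

    scipy.signal.resample interpolates in the frequency domain, which
    preserves global structure better than cropping random regions.
    """
    spec = resample(spec, target_h, axis=0)   # frequency axis
    spec = resample(spec, target_w, axis=1)   # time axis
    return spec

# Synthetic stand-in for a CWT spectrogram (real ones are much wider).
raw = np.random.rand(300, 7000)
small = downsample_spectrogram(raw)
```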
- Data Augmentation: To enhance the dataset, data augmentation techniques are applied, including:
  - Adding random noise.
  - Applying random cover patches to simulate missing portions of the spectrogram.
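The two augmentation steps above could look like the following NumPy sketch; the noise level and patch sizes are illustrative assumptions, not the project's exact values:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(spec, noise_std=0.05, n_patches=3, patch_size=(32, 64)):
    """Add Gaussian noise, then zero out random rectangles to
    simulate missing spectrogram regions."""
    out = spec + rng.normal(0.0, noise_std, spec.shape)
    h, w = spec.shape
    ph, pw = patch_size
    for _ in range(n_patches):
        top = rng.integers(0, h - ph)
        left = rng.integers(0, w - pw)
        out[top:top + ph, left:left + pw] = 0.0  # cover patch
    return out

spec = np.random.rand(256, 1024)
aug = augment(spec)
```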
- Training the Transformer on Larger Textual Data: The Transformer network is trained on a much larger dataset of textual metadata only (without spectrograms), which enables it to learn more robust features and improves model performance.
- Layers:
- Convolution + Pooling Layers
- Batch Normalization (to stabilize training)
- Dropout (to avoid overfitting)
- Fully Connected (FC) Layers
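The CNN branch above can be sketched in PyTorch as follows; channel counts, kernel sizes, and the dropout rate are illustrative assumptions, with only the 64-feature output taken from the text:

```python
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    """Sketch of the CNN branch: convolution + pooling blocks with
    batch normalization and dropout, ending in FC layers that emit
    64 features."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16),            # stabilizes training
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),  # fixed-size feature map
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.3),               # guards against overfitting
            nn.Linear(32 * 4 * 4, 64),
        )

    def forward(self, x):                  # x: (batch, 1, 256, 1024)
        return self.fc(self.features(x))

feats = SpectrogramCNN()(torch.randn(2, 1, 256, 1024))
```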
- Input: Tokenized metadata (artist name, song title, and release year).
- Transformer Encoder: Produces contextual embeddings.
- Output: A pooled embedding representing the textual features.
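A sketch of the Transformer branch: tokenized metadata is embedded, passed through a Transformer encoder, and pooled into a 64-dimensional feature vector. The vocabulary size, encoder depth, and mean-pooling choice are assumptions:

```python
import torch
import torch.nn as nn

class MetadataEncoder(nn.Module):
    """Token IDs -> embeddings -> Transformer encoder -> pooled features."""

    def __init__(self, vocab_size=10000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids):            # (batch, seq_len)
        h = self.encoder(self.embed(token_ids))
        return h.mean(dim=1)                 # pooled (batch, 64)

pooled = MetadataEncoder()(torch.randint(0, 10000, (2, 16)))
```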
- Input: CNN output (64 features) and Transformer output (64 features).
- Fusion: The two outputs are concatenated into a 128-dimensional vector, followed by a 64-unit hidden layer that feeds the final output.
- Output: A single regression output representing song popularity.
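The fusion stage described above can be sketched as follows. The layer widths (64 + 64 -> 128 -> 64 -> 1) follow the text; the activation choice is an assumption:

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Concatenate 64-dim CNN features with 64-dim Transformer
    features and regress a single popularity score."""

    def __init__(self):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, cnn_feat, text_feat):
        fused = torch.cat([cnn_feat, text_feat], dim=1)  # (batch, 128)
        return self.head(fused).squeeze(1)               # (batch,)

model = FusionHead()
out = model(torch.randn(4, 64), torch.randn(4, 64))
```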
This approach explores three different training configurations:
- Transformers only (without CNN)
- CNN only (without Transformers)
- Combined CNN and Transformer
Problems:
- Small datasets limit the effectiveness of Transformers.
- The number of training epochs the Transformer needs differs from what the CNN needs.
A better approach involves pre-training both models separately before combining them.
- Pre-train Transformers on a large dataset (~550k samples).
- Pre-train CNN using augmentation techniques on spectrograms.
- Freeze pre-trained layers and train the model based on the FC layers only.
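The freezing step above can be sketched in PyTorch. The modules here are hypothetical stand-ins for the real pre-trained CNN and Transformer; only the freezing mechanics are the point:

```python
import torch.nn as nn

# Hypothetical pre-trained backbones standing in for the real models.
cnn = nn.Sequential(nn.Conv2d(1, 8, 3), nn.AdaptiveAvgPool2d(1),
                    nn.Flatten(), nn.Linear(8, 64))
text_encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
fc_head = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))

# Freeze the pre-trained feature extractors...
for module in (cnn, text_encoder):
    for p in module.parameters():
        p.requires_grad = False

# ...so the optimizer only updates the fusion FC layers.
trainable = [p for p in fc_head.parameters() if p.requires_grad]
```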
Advantages:
- Optimized training time for both models.
- More expressive and effective feature representation.
- Datasets:
  - Separate large dataset for pre-training the Transformer network.
  - Merged dataset for the CNN and the combined model (with no common samples).
- Loss Function: Mean Squared Error (MSE).
- Optimizer: Adam (with a different learning rate for each model).
- Hyperparameters: Optimized using Optuna, including batch size, number of epochs, and learning rate.
- A multi-modal approach combining audio spectrograms and text metadata improves prediction accuracy over individual models.
- Down-sampling the spectrograms captures spatial and temporal features better than cropping random parts of the spectrogram.
- Augmentation on small datasets improves model accuracy.
- Separate pre-training for large and small datasets is beneficial, especially when dealing with partial and additional information.
- Epoch optimization may require separate training phases for the different models to achieve better performance.