This project aims to predict song popularity using a multi-modal approach, combining audio spectrograms processed with Convolutional Neural Networks (CNN) and textual metadata processed using Transformer-based models.
- Overview
- How To Run
- Datasets
- Challenges and Solutions
- Model Architecture
- Training Approach
- Training Details
- Preliminary Results
- Conclusions
This repository contains the implementation for predicting song popularity using a combination of audio features (spectrograms) and textual metadata. A Convolutional Neural Network (CNN) is used to extract features from audio spectrograms, while a Transformer network processes metadata (artist name, song name, and release year). The two model outputs are then fused for a combined prediction.
- `Popularity_Analysis.ipynb` – Contains the full training process and evaluation. You can modify hyperparameters and load previously trained models.
- `Trained_models/` – Stores pre-trained models, including naive and optimized combined models.
- `datasets/` – Contains a small dataset of 150 samples for testing. The full dataset (~3,000 samples) couldn't be uploaded due to storage limitations.
  - Files are in pickle format (`.pkl`) and must be loaded in the notebook's initial cells.
- `dataset_generation/` – Includes scripts for:
  - Generating down-sampled spectrograms (~3k samples).
  - Preprocessing the Spotify dataset (~550k samples).
- If needed, you can use the Kaggle API JSON file to fetch datasets.
- Upload the two `.pkl` dataset files into the notebook.
  - For convenience, it's recommended to mount Google Drive when running in Google Colab.
- Run all notebook cells to train and evaluate the model.
- If you wish to generate your own dataset (Spotify metadata processing or different spectrogram down-sampling techniques), use the scripts in `dataset_generation/`.
Enjoy! 🚀
- Source: Kaggle - Billboard Hot 100 Spectrograms
- Details: The dataset contains over 3,000 CWT (Continuous Wavelet Transform) spectrograms for the CNN model.
- Spectrogram Properties: 200 pixels per second; each spectrogram is roughly 30,000-40,000 pixels wide.
- Source: Kaggle - Spotify Dataset
- Details: The dataset includes metadata for songs, including artist name, song title, release date, and popularity, which are used as input for the Transformer model.
The two datasets are merged to create a final dataset of ~3,000 samples; the overlapping songs are then removed from the Spotify dataset so that the pre-training data and the combined-model data share no samples.
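The merge-and-exclude step above could be sketched with pandas. The column names and toy rows below are hypothetical stand-ins for the real Kaggle datasets:

```python
import pandas as pd

# Hypothetical toy frames standing in for the two Kaggle datasets;
# the real column names may differ.
spectro_df = pd.DataFrame({
    "artist": ["A", "B"],
    "title": ["x", "y"],
    "spectrogram_path": ["a_x.npy", "b_y.npy"],
})
spotify_df = pd.DataFrame({
    "artist": ["A", "B", "C"],
    "title": ["x", "y", "z"],
    "year": [1999, 2005, 2010],
    "popularity": [71, 44, 12],
})

# Inner join keeps only songs present in both sources (the ~3k merged set).
merged = spectro_df.merge(spotify_df, on=["artist", "title"], how="inner")

# Songs already in the merged set are excluded from the Spotify-only set,
# so Transformer pre-training and the combined model share no samples.
overlap = spotify_df.merge(spectro_df[["artist", "title"]],
                           on=["artist", "title"], how="left", indicator=True)
spotify_only = overlap[overlap["_merge"] == "left_only"].drop(columns="_merge")
```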
- Down-sampling: The spectrograms are down-sampled using FFT-interpolations to a constant size of 1024x256 pixels. This standardization ensures a consistent size across all samples while retaining important patterns and features.
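A minimal sketch of FFT-based down-sampling using `scipy.signal.resample`, assuming the target size is 256 rows by 1024 columns; the toy input below is narrower than a real 30-40K-pixel-wide spectrogram to keep the example fast:

```python
import numpy as np
from scipy.signal import resample

def downsample_spectrogram(spec, target_h=256, target_w=1024):
    """FFT-based resampling of a spectrogram to a fixed (height, width).

    scipy.signal.resample interpolates in the frequency domain, which
    preserves global structure better than cropping random regions.
    """
    spec = resample(spec, target_h, axis=0)   # frequency axis
    spec = resample(spec, target_w, axis=1)   # time axis
    return spec

# Synthetic stand-in for a CWT spectrogram (real ones are much wider).
raw = np.random.rand(300, 7000)
small = downsample_spectrogram(raw)
```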
- Data Augmentation: To enhance the dataset, data augmentation techniques are applied, including:
  - Adding random noise.
  - Applying random cover patches to simulate missing portions of the spectrogram.
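The two augmentation steps above could look like the following NumPy sketch; the noise level and patch sizes are illustrative assumptions, not the project's exact values:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(spec, noise_std=0.05, n_patches=3, patch_size=(32, 64)):
    """Add Gaussian noise, then zero out random rectangles to
    simulate missing spectrogram regions."""
    out = spec + rng.normal(0.0, noise_std, spec.shape)
    h, w = spec.shape
    ph, pw = patch_size
    for _ in range(n_patches):
        top = rng.integers(0, h - ph)
        left = rng.integers(0, w - pw)
        out[top:top + ph, left:left + pw] = 0.0  # cover patch
    return out

spec = np.random.rand(256, 1024)
aug = augment(spec)
```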
- Training the Transformer on Larger Textual Data: The Transformer network is trained on a much larger dataset of textual metadata only (without spectrograms), which enables it to learn more robust features and improves model performance.
- Layers:
- Convolution + Pooling Layers
- Batch Normalization (to stabilize training)
- Dropout (to avoid overfitting)
- Fully Connected (FC) Layers
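The CNN branch above can be sketched in PyTorch as follows; channel counts, kernel sizes, and the dropout rate are illustrative assumptions, with only the 64-feature output taken from the text:

```python
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    """Sketch of the CNN branch: convolution + pooling blocks with
    batch normalization and dropout, ending in FC layers that emit
    64 features."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16),            # stabilizes training
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),  # fixed-size feature map
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.3),               # guards against overfitting
            nn.Linear(32 * 4 * 4, 64),
        )

    def forward(self, x):                  # x: (batch, 1, 256, 1024)
        return self.fc(self.features(x))

feats = SpectrogramCNN()(torch.randn(2, 1, 256, 1024))
```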
- Input: Tokenized metadata (artist name, song title, and release year).
- Transformer Encoder: Produces contextual embeddings.
- Output: A pooled embedding representing the textual features.
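A sketch of the Transformer branch: tokenized metadata is embedded, passed through a Transformer encoder, and pooled into a 64-dimensional feature vector. The vocabulary size, encoder depth, and mean-pooling choice are assumptions:

```python
import torch
import torch.nn as nn

class MetadataEncoder(nn.Module):
    """Token IDs -> embeddings -> Transformer encoder -> pooled features."""

    def __init__(self, vocab_size=10000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids):            # (batch, seq_len)
        h = self.encoder(self.embed(token_ids))
        return h.mean(dim=1)                 # pooled (batch, 64)

pooled = MetadataEncoder()(torch.randint(0, 10000, (2, 16)))
```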
- Input: CNN output (64 features) and Transformer output (64 features).
- Fusion: The two outputs are concatenated into a 128-dimensional vector, followed by a 64-unit hidden layer that feeds the final output.
- Output: A single regression output representing song popularity.
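The fusion stage described above can be sketched as follows. The layer widths (64 + 64 -> 128 -> 64 -> 1) follow the text; the activation choice is an assumption:

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Concatenate 64-dim CNN features with 64-dim Transformer
    features and regress a single popularity score."""

    def __init__(self):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, cnn_feat, text_feat):
        fused = torch.cat([cnn_feat, text_feat], dim=1)  # (batch, 128)
        return self.head(fused).squeeze(1)               # (batch,)

model = FusionHead()
out = model(torch.randn(4, 64), torch.randn(4, 64))
```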
This approach explores three different training configurations:
- Transformers only (without CNN)
- CNN only (without Transformers)
- Combined CNN and Transformer
Problems:
- Small datasets limit the effectiveness of Transformers.
- The number of training epochs the Transformer needs differs from what the CNN needs.
A better approach involves pre-training both models separately before combining them.
- Pre-train Transformers on a large dataset (~550k samples).
- Pre-train CNN using augmentation techniques on spectrograms.
- Freeze pre-trained layers and train the model based on the FC layers only.
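The freezing step above can be sketched in PyTorch. The modules here are hypothetical stand-ins for the real pre-trained CNN and Transformer; only the freezing mechanics are the point:

```python
import torch.nn as nn

# Hypothetical pre-trained backbones standing in for the real models.
cnn = nn.Sequential(nn.Conv2d(1, 8, 3), nn.AdaptiveAvgPool2d(1),
                    nn.Flatten(), nn.Linear(8, 64))
text_encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
fc_head = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))

# Freeze the pre-trained feature extractors...
for module in (cnn, text_encoder):
    for p in module.parameters():
        p.requires_grad = False

# ...so the optimizer only updates the fusion FC layers.
trainable = [p for p in fc_head.parameters() if p.requires_grad]
```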
Advantages:
- Optimized training time for both models.
- More expressive and effective feature representation.
- Datasets:
  - Separate large dataset for pre-training the Transformer network.
  - Merged dataset for the CNN and the combined model (with no common samples).
- Loss Function: Mean Squared Error (MSE).
- Optimizer: Adam (with a different learning rate for each model).
- Hyperparameters: Optimized using Optuna, including batch size, number of epochs, and learning rate.
- A multi-modal approach combining audio spectrograms and text metadata improves prediction accuracy over individual models.
- Down-sampling the spectrograms captures spatial and temporal features better than cropping random parts of the spectrogram.
- Augmentation on small datasets improves model accuracy.
- Separate pre-training for large and small datasets is beneficial, especially when dealing with partial and additional information.
- Epoch optimization may require separate training phases for the different models to achieve better performance.