idanav100/Songs_Popularity_Prediction
Songs Popularity Prediction using Spectrograms and Textual Metadata

This project aims to predict song popularity using a multi-modal approach, combining audio spectrograms processed with Convolutional Neural Networks (CNN) and textual metadata processed using Transformer-based models.


Patches covering a down-sampled Spectrogram

Table of Contents

  1. Overview
  2. How To Run
  3. Datasets
  4. Challenges and Solutions
  5. Model Architecture
    1. CNN for Spectrograms
    2. Transformers for Textual Metadata
    3. Feature Fusion
  6. Training Approach
    1. Naive Approach
    2. Better Approach
  7. Training Details
  8. Preliminary Results
  9. Conclusions

Overview

This repository contains the implementation for predicting song popularity using a combination of audio features (spectrograms) and textual metadata. A Convolutional Neural Network (CNN) is used to extract features from audio spectrograms, while a Transformer network processes metadata (artist name, song name, and release year). The two model outputs are then fused for a combined prediction.

How To Run

Files Explanation:

  • Popularity_Analysis.ipynb – Contains the full training process and evaluation. You can modify hyperparameters and load previously trained models.
  • Trained_models/ – Stores pre-trained models, including naive and optimized combined models.
  • datasets/ – Contains a small dataset of 150 samples for testing. The full dataset (~3,000 samples) couldn't be uploaded due to storage limitations.
    • Files are in pickle format (.pkl), which must be loaded in the notebook's initial cells.
  • dataset_generations/ – Includes scripts for:
    • Generating down-sampled spectrograms (~3k samples).
    • Preprocessing the Spotify dataset (~550k samples).
    • If needed, you can use the Kaggle API JSON file to fetch datasets.

Steps to Run:

  1. Upload the two .pkl dataset files into the notebook.
    • For convenience, it's recommended to use Google Drive mounting in Google Colab.
  2. Run all notebook cells to train and evaluate the model.
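For reference, the .pkl files from step 1 are standard pickle dumps and can be loaded with Python's pickle module; a minimal sketch (the file path in the usage comment is a placeholder, not the repository's actual file name):

```python
import pickle

def load_pickle(path):
    """Load one pickled dataset file, as done in the notebook's initial cells."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Hypothetical usage -- substitute the actual files from datasets/:
# spectrograms = load_pickle("datasets/spectrograms.pkl")
```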

Custom Dataset Generation:

  • If you wish to generate your own dataset (Spotify metadata processing or different spectrogram down-sampling techniques), use the scripts in dataset_generations/.

Enjoy! 🚀

Datasets

1. Billboard Hot 100 1960-2020 Spectrograms

  • Source: Kaggle - Billboard Hot 100 Spectrograms
  • Details: The dataset contains over 3,000 CWT (Continuous Wavelet Transform) spectrograms for the CNN model.
  • Spectrogram Properties: 200 pixels per second; each spectrogram is roughly 30,000-40,000 pixels wide.

2. Spotify Dataset (1921-2020, 600k+ Tracks)

  • Source: Kaggle - Spotify Dataset
  • Details: The dataset includes metadata for songs, including artist name, song title, release date, and popularity, which are used as input for the Transformer model.

The datasets are merged to create a final dataset of ~3,000 samples, while excluding overlapping samples from the Spotify dataset.
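The merge step itself is not shown in the README; assuming both tables share artist and title columns (hypothetical names), a pandas sketch could look like this:

```python
import pandas as pd

def merge_datasets(spectro_df: pd.DataFrame, spotify_df: pd.DataFrame) -> pd.DataFrame:
    """Join spectrogram rows with Spotify metadata on (artist, title).

    Column names here are hypothetical placeholders for whatever keys
    the two Kaggle datasets actually share.
    """
    merged = spectro_df.merge(spotify_df, on=["artist", "title"], how="inner")
    # Drop duplicate (artist, title) pairs so overlapping Spotify rows
    # contribute only one sample each.
    return merged.drop_duplicates(subset=["artist", "title"])
```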

Challenges and Solutions

Solution for Inconsistent Spectrogram Sizes

  • Down-sampling: The spectrograms are down-sampled with FFT interpolation to a constant size of 1024x256 pixels. This standardization gives every sample the same shape while retaining the important patterns and features.
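One possible implementation of such FFT-based down-sampling, sketched with scipy.signal.resample (the repository's exact routine may differ):

```python
import numpy as np
from scipy.signal import resample

def downsample_spectrogram(spec: np.ndarray, width: int = 1024, height: int = 256) -> np.ndarray:
    """Resize a 2-D spectrogram to (height, width) via FFT-based resampling.

    Resampling is applied once per axis: frequency first, then time.
    """
    spec = resample(spec, height, axis=0)  # frequency axis
    spec = resample(spec, width, axis=1)   # time axis
    return spec
```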

Solution for Small Spectrogram Dataset

  • Data Augmentation: To enlarge the effective dataset, augmentation techniques are applied, including:

    • Adding random noise.
    • Applying random cover patches to simulate missing portions of the spectrogram.
  • Training the Transformer on Larger Textual Data: The Transformer network is pre-trained on a much larger dataset of textual metadata only (no spectrograms), which lets it learn more robust features and improves overall performance.
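A minimal numpy sketch of these two augmentations; the noise level and patch size are assumptions, not values taken from the repository:

```python
import numpy as np

def augment(spec: np.ndarray, noise_std: float = 0.05,
            n_patches: int = 3, patch_size: int = 64, rng=None) -> np.ndarray:
    """Add Gaussian noise, then zero out random rectangular cover patches."""
    if rng is None:
        rng = np.random.default_rng()
    out = spec + rng.normal(0.0, noise_std, size=spec.shape)
    h, w = out.shape
    for _ in range(n_patches):
        y = rng.integers(0, max(1, h - patch_size))
        x = rng.integers(0, max(1, w - patch_size))
        out[y:y + patch_size, x:x + patch_size] = 0.0  # simulated missing region
    return out
```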

Example: Patches covering a down-sampled Spectrogram

Model Architecture

CNN for Spectrograms

  • Layers:
    • Convolution + Pooling Layers
    • Batch Normalization (to stabilize training)
    • Dropout (to avoid overfitting)
    • Fully Connected (FC) Layers
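An illustrative PyTorch sketch of such a CNN; channel counts and layer sizes are assumptions, and only the layer types follow the list above:

```python
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    """Conv + pool blocks with batch norm and dropout, ending in FC layers.

    Produces a 64-feature vector per spectrogram, matching the fusion input.
    """
    def __init__(self, out_features: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16),          # stabilizes training
            nn.ReLU(),
            nn.MaxPool2d(4),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(4),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.3),             # guards against overfitting
            nn.Linear(32 * 4 * 4, out_features),
        )

    def forward(self, x):
        return self.fc(self.features(x))
```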

Transformers for Textual Metadata

  • Input: Tokenized metadata (artist name, song title, and release year).
  • Transformer Encoder: Produces contextual embeddings.
  • Output: Pooled output represents the textual features.
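A sketch of this text branch in PyTorch; the vocabulary size, model width, and mean-pooling choice are assumptions rather than the repository's exact configuration:

```python
import torch
import torch.nn as nn

class MetadataTransformer(nn.Module):
    """Embed tokenized metadata, encode it, and pool to a 64-feature vector."""
    def __init__(self, vocab_size: int = 10000, d_model: int = 64,
                 nhead: int = 4, num_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, tokens):            # tokens: (batch, seq_len) int64
        x = self.encoder(self.embed(tokens))
        return x.mean(dim=1)              # pooled (batch, d_model) output
```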

Feature Fusion

  • Input: CNN output (64 features) and Transformer output (64 features).
  • Fusion: The two 64-feature outputs are concatenated into a 128-unit layer, followed by a 64-unit hidden layer that feeds the final output.
  • Output: A single regression output representing song popularity.
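The fusion step follows directly from the sizes above (128-wide concatenation, 64-wide hidden layer, single regression output); a PyTorch sketch:

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Concatenate the 64-d CNN and 64-d Transformer features and regress
    popularity through a 128 -> 64 -> 1 stack."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, cnn_feats, text_feats):
        return self.mlp(torch.cat([cnn_feats, text_feats], dim=1))
```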

Training Approach

Naive Approach

This approach explores three different training configurations:

  1. Transformers only (without CNN)
  2. CNN only (without Transformers)
  3. Combined CNN and Transformer

Problems:

  • Small datasets limit the effectiveness of Transformers.
  • The number of training epochs the Transformer needs differs from the number the CNN needs.

Better Approach

A better approach involves pre-training both models separately before combining them.

  1. Pre-train Transformers on a large dataset (~550k samples).
  2. Pre-train CNN using augmentation techniques on spectrograms.
  3. Freeze the pre-trained layers and train only the fusion FC layers.
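Step 3 can be sketched in PyTorch: freezing the two backbones leaves only the fusion head's parameters trainable (the helper name is hypothetical):

```python
import torch.nn as nn

def freeze_backbones(cnn: nn.Module, transformer: nn.Module) -> None:
    """Freeze the pre-trained parameters so only the fusion FC layers train."""
    for module in (cnn, transformer):
        for p in module.parameters():
            p.requires_grad = False
    # An optimizer would then be built over the fusion head only, e.g.:
    # torch.optim.Adam(fusion_head.parameters(), lr=1e-3)
```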

Advantages:

  • Optimized training time for both models.
  • More expressive and effective feature representation.

Training Details

  • Datasets:

    • Separate large dataset for pre-training the Transformer network.
    • Merged dataset for the CNN and the combined model (sharing no samples with the pre-training set).
  • Loss Function: Mean Squared Error

  • Optimizer: Adam (with different learning rates for each model)

  • Hyperparameters: Optimized using Optuna, including batch size, epochs, and learning rate.

Preliminary Results

Naive Approach

Naive Approach Results

Better Approach

Better Approach Results

Conclusions

  • A multi-modal approach combining audio spectrograms and text metadata improves prediction accuracy over individual models.
  • Spectrogram down-sampling captures spatial and temporal features better than cropping random parts of the spectrogram.
  • Augmentation on small datasets improves model accuracy.
  • Separate pre-training for large and small datasets is beneficial, especially when dealing with partial and additional information.
  • Epoch optimization may require separate training phases for the different models to achieve better performance.

