Generating audio files of spoken digits using conditional generative architectures (cVAE, cGAN), and evaluating the results with the Inception Score.

The GAN part is based on this repository.

The purpose of the project is to create a generative model that can generate audio files. More specifically, we chose to focus on speech recordings of digits. Training a model to generate audio directly from the raw time series can be challenging, so we decided to use the STFT representation of the audio signal instead. In general, the STFT of a signal is complex-valued, so we represent it as a 2-channel image, where the first channel is the amplitude and the second is the phase. We examined two main generative architectures: a conditional VAE and a conditional WGAN-GP. For each architecture, we experimented with the methods described below:
- VAE:
  - Generating a spectrogram amplitude image only, conditioned on the digit label.
- GAN:
  - Experiment 1 - Generating a spectrogram amplitude image only, conditioned on the digit label.
  - Experiment 2 - Generating a 2-channel image of the spectrogram's amplitude and phase, conditioned on the digit label.
  - Experiment 3 - Generating a spectrogram amplitude image only, conditioned on both the digit label and a phase image compatible with that label.
  - Experiment 4 - Same idea as Experiment 3, but with a regularization term (explained in the PDF).
In addition to the generative models, we trained a digit classifier based on the spectrogram amplitude, for performance-measurement purposes.
| Library | Version |
|---|---|
| numpy | 1.19.5 |
| torch | 1.6.0 |
| librosa | 0.8.0 |
| tqdm | 4.53.0 |
| colorama | 0.4.4 |
| File name | Purpose |
|---|---|
| inception_scores_metrics.py | Evaluates a generative model's performance with the Frechet Inception Score and a 'Diversity Score' |
| pre_processing.py | Converts .wav files to .npy arrays for fitting the networks |
| weights.txt | Link to download the trained weights |
| metrics results for the exps.txt | The performance of our trained generative models |
| dataset directory | The datasets required for training and evaluating the experiments (partial only) |
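The Frechet distance at the heart of the score computed by `inception_scores_metrics.py` fits a Gaussian to classifier features of real and generated samples and compares the two. A minimal sketch of that distance (the function name and feature source are assumptions; the actual script may differ):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_fake):
    """Frechet distance between Gaussians fit to two feature matrices
    of shape (n_samples, n_features)."""
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):  # sqrtm can return tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1 + s2 - 2 * covmean))
```

A lower distance means the generated feature distribution is closer to the real one; identical feature sets give a distance of (numerically) zero.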
For each experiment, the following files are provided:
| File name | Purpose |
|---|---|
| dataset.py | Dataset class that follows the PyTorch conventions |
| eval.py | Loads a trained model and generates .wav files |
| models.py | The model (Generator & Discriminator / VAE) |
| train.py | Trains the model |
| Samples directory | Examples of samples generated by the trained model |
| final_results_example.png | 10 spectrogram amplitudes generated from each label (each column is a specific label) |
| train_log.txt | The training progress |
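A Dataset class like the one in each experiment's `dataset.py` can be sketched as below. This is an illustrative minimal version; the class name and the assumption that the digit label is the first underscore-separated token of the file name (as in the Free Spoken Digit Dataset, e.g. `7_jackson_0`) are ours, not necessarily the project's:

```python
import os
import numpy as np
import torch
from torch.utils.data import Dataset

class SpectrogramDataset(Dataset):
    """Loads pre-processed .npy spectrogram arrays from a directory.
    Assumes file names start with the digit label, e.g. '7_jackson_0.npy'."""

    def __init__(self, root):
        self.paths = sorted(
            os.path.join(root, f) for f in os.listdir(root) if f.endswith(".npy")
        )

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        path = self.paths[idx]
        spec = np.load(path).astype(np.float32)          # (C, H, W) array
        label = int(os.path.basename(path).split("_")[0])  # digit from file name
        return torch.from_numpy(spec), label
```

Such a class plugs directly into `torch.utils.data.DataLoader` for batching during training.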
The datasets (used for training the models, generating samples for the conditioned GAN in Experiments 3 and 4, and evaluating the results) should be separated as in the provided dataset directory (which is partial only) - see the explanation in How-to-use.
To run the models, one needs to download the model weights from the link specified in weights.txt. The weights have to be placed inside the relevant experiment directory, with the same name as in the link.
The datasets can be found at this GitHub, and need to be pre-processed (i.e. converted to .npy) using pre_processing.py.
The resulting files should then be placed inside the right sub-folders of the dataset directory:
| Sub-folder | Purpose |
|---|---|
| test_spectograms | .npy arrays with 2 channels (amplitude & phase) of the test set |
| train_spectograms | .npy arrays with 2 channels (amplitude & phase) of the train set |
| test_spectograms_amplitude | .npy arrays with 1 channel (amplitude only) of the test set |
| train_spectograms_amplitude | .npy arrays with 1 channel (amplitude only) of the train set |
| data_for_metrics | Contains a sub-folder for each experiment and one for the real data; each of these is separated into folders by label |
