This repository contains the implementation of MQGAN for audio synthesis. The project is structured to facilitate the entire workflow, from data preparation to model deployment.
Before you begin, ensure you have Python 3.9+ installed.
- Clone the repository:

  ```bash
  git clone https://github.com/your-repo/MQGAN.git
  cd MQGAN
  ```

- Create and activate a virtual environment:

  ```bash
  python -m venv .venv
  # On Windows
  .venv\Scripts\activate
  # On macOS/Linux
  source .venv/bin/activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
The MQGAN model operates on Mel spectrograms. You'll need to convert your raw audio files into this format.
Use `convert_spectrograms.py` to transform your audio files (e.g., WAV, FLAC) into Mel spectrograms, saved as `.npy` files.
- Prepare your audio data: Place your audio files in an input directory. The script will mirror the directory structure in the output.

- Configure spectrogram extraction: Edit the `spec_config.yaml` file to define parameters for Mel spectrogram extraction, such as `sampling_rate`, `n_mel_channels`, `filter_length`, `hop_length`, etc. Example configurations are available in `configs/`.

- Run the conversion script:

  ```bash
  python convert_spectrograms.py --config configs/spec_config_hifispeech.yaml
  ```

  You can override `input_folder` and `output_folder` directly from the command line:

  ```bash
  python convert_spectrograms.py --config configs/spec_config_hifispeech.yaml --input_folder /data/raw_audio --output_folder /data/mels
  ```

The script will create `.npy` files in the specified `output_folder`, preserving the original directory structure.
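The conversion step amounts to walking the input tree, extracting a Mel spectrogram per audio file, and saving it under the mirrored output path. The sketch below illustrates that mirroring logic; the `mirror_convert` helper and its `extract_mel` callback are hypothetical illustrations, not part of the repository:

```python
from pathlib import Path
import numpy as np

AUDIO_EXTS = {".wav", ".flac"}

def mirror_convert(input_folder, output_folder, extract_mel):
    """Convert every audio file under input_folder to a .npy Mel
    spectrogram, mirroring the directory structure in output_folder."""
    in_root, out_root = Path(input_folder), Path(output_folder)
    written = []
    for audio_path in sorted(in_root.rglob("*")):
        if audio_path.suffix.lower() not in AUDIO_EXTS:
            continue
        # Same relative path, with the audio extension swapped for .npy
        out_path = (out_root / audio_path.relative_to(in_root)).with_suffix(".npy")
        out_path.parent.mkdir(parents=True, exist_ok=True)
        np.save(out_path, extract_mel(audio_path))
        written.append(out_path)
    return written
```

In the actual script, `extract_mel` would load the audio and compute the Mel spectrogram using the parameters from `spec_config.yaml`.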
After training the PreEncoder model, you might want to re-encode your original Mel spectrograms using the trained model. This is useful for generating training data for a subsequent vocoder or for analyzing the quantized representation.
You have two options for re-encoding:
- Using a TorchScript-exported model: If you have already exported your `PreEncoder` to TorchScript (see Exporting to TorchScript), use `reencode_spectrograms.py`:

  ```bash
  python reencode_spectrograms.py \
      --model /path/to/your/exported_model_folder \
      --input_dir /path/to/your/original_mels \
      --output_dir /path/to/save/reencoded_mels \
      --device cuda  # or cpu
  ```

  The `--model` argument should point to the directory containing `model_cpu.pt` (and optionally `model_cuda.pt`) and `model_config.yaml`.

- Using a raw PyTorch checkpoint: If you prefer to use a raw `.pth` checkpoint and its corresponding `config.yaml` directly, use `reencode_spectrograms_from_checkpoint.py`:

  ```bash
  python reencode_spectrograms_from_checkpoint.py \
      --checkpoint /path/to/your/model.pth \
      --config /path/to/your/model_config.yaml \
      --input_dir /path/to/your/original_mels \
      --output_dir /path/to/save/reencoded_mels \
      --device cuda  # or cpu
  ```

  The `--config` argument here refers to the model's configuration file (e.g., `model_config_hifimusic.yaml` or `model_config_hifispeech.yaml`), not `spec_config.yaml`.
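Either way, re-encoding is essentially a loop over the saved `.npy` spectrograms that round-trips each one through the model's `encode`/`decode` pair. A minimal sketch of that loop, assuming only the `encode`/`decode` interface of `ScriptedPreEncoder` (the `reencode_dir` helper itself is hypothetical):

```python
from pathlib import Path
import numpy as np
import torch

def reencode_dir(model, input_dir, output_dir):
    """Round-trip every .npy spectrogram in input_dir through the model's
    encode/decode pair, mirroring the directory structure in output_dir.
    'model' is assumed to expose encode()/decode() like ScriptedPreEncoder."""
    in_root, out_root = Path(input_dir), Path(output_dir)
    for npy_path in sorted(in_root.rglob("*.npy")):
        mel = np.load(npy_path)                          # (seq_len, mel_channels)
        x = torch.from_numpy(mel).float().unsqueeze(0)   # add a batch dimension
        lengths = torch.tensor([x.shape[1]])
        with torch.no_grad():
            indices = model.encode(x, lengths=lengths)   # quantized tokens
            recon = model.decode(indices, lengths=lengths)
        out_path = out_root / npy_path.relative_to(in_root)
        out_path.parent.mkdir(parents=True, exist_ok=True)
        np.save(out_path, recon.squeeze(0).cpu().numpy())
```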
The PreEncoder model is trained using `train.py`. This script handles the GAN training loop, including the generator (the PreEncoder) and the discriminators.
- Prepare your training data: Ensure you have generated Mel spectrograms using `convert_spectrograms.py` and that they are located in the `data_dir` specified in your training configuration.

- Configure training parameters: Edit a training configuration file (e.g., `model_config_hifimusic.yaml` or `model_config_hifispeech.yaml`). This file defines the model architecture, training hyperparameters, loss weights, and data paths.

- Start training:

  ```bash
  python train.py --config configs/model_config_hifimusic.yaml
  ```

  You can resume training from a checkpoint:

  ```bash
  python train.py --config configs/model_config_hifimusic.yaml --pretrained checkpoints/music_preencoder/checkpoint_epoch_050.pth
  ```
Training progress, losses, and example spectrograms will be logged to Weights & Biases (WandB).
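For orientation, one iteration of a GAN loop of this kind alternates a discriminator update (real vs. reconstructed spectrograms) with a generator update combining adversarial and reconstruction terms. The following is a generic sketch, not the actual `train.py` logic: the LSGAN-style losses, the L1 reconstruction term, and the `recon_weight` parameter are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def gan_step(gen, disc, opt_g, opt_d, mel, recon_weight=1.0):
    """One generic GAN iteration on a batch of Mel spectrograms."""
    fake = gen(mel)

    # Discriminator: push real scores toward 1, fake scores toward 0 (LSGAN-style)
    loss_d = ((disc(mel) - 1) ** 2).mean() + (disc(fake.detach()) ** 2).mean()
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Generator: fool the discriminator while staying close to the input
    loss_g = ((disc(fake) - 1) ** 2).mean() + recon_weight * F.l1_loss(fake, mel)
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```

The real training loop additionally handles quantization, multiple discriminators, the configured loss weights, and WandB logging.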
Once your PreEncoder model is trained, you can export it to TorchScript for easier deployment and faster inference.
- Run the conversion script:

  ```bash
  python convert_to_torchscript.py \
      --checkpoint /path/to/your/trained_model.pth \
      --config /path/to/your/model_config.yaml \
      --output_dir /path/to/save/exported_model
  ```

  - `--checkpoint`: Path to the `.pth` checkpoint file from your training run.
  - `--config`: Path to the model's configuration YAML file (e.g., `model_config_hifimusic.yaml`) that was used for training.
  - `--output_dir`: Directory where the TorchScript models (`model_cpu.pt`, `model_cuda.pt`) and a copy of the config (`model_config.yaml`) will be saved.
- Using the exported model: The `scripted_preencoder.py` file provides a `ScriptedPreEncoder` class to easily load and use the exported TorchScript model:

  ```python
  from scripted_preencoder import ScriptedPreEncoder
  import torch
  import numpy as np

  # Load the exported model
  model_wrapper = ScriptedPreEncoder("/path/to/save/exported_model", device='cuda')  # or 'cpu'

  # Example usage:
  # Assuming 'mel_input_np' is a NumPy array of your Mel spectrogram (batch, seq_len, mel_channels)
  mel_input_tensor = torch.from_numpy(mel_input_np).float()
  lengths = torch.tensor([mel_input_np.shape[1]])  # Example for a single spectrogram

  # Encode to discrete tokens
  indices = model_wrapper.encode(mel_input_tensor, lengths=lengths)
  print(f"Encoded indices shape: {indices.shape}")

  # Decode back to spectrogram
  reconstructed_mel = model_wrapper.decode(indices, lengths=lengths)
  print(f"Reconstructed mel shape: {reconstructed_mel.shape}")
  ```
We provide a selection of pretrained MQGAN models for different audio domains. These models include both the PreEncoder (quantizer) and the iSTFTNet components.
| Model Name | Sampling Rate | Mel params (channels, fmin-max) | Link to Pretrained Models (Quantizer & iSTFTNet) | Colab Notebook Example |
|---|---|---|---|---|
| MQGAN+R-HifiSpeech-1 | 44.1 kHz | 120, 0-22050 Hz | MQGAN, ISTFTNet | Colab Link |