Arxiv: https://arxiv.org/abs/2503.16689
Our audio demo page is available at https://luotianze666.github.io/WaveFM.
Our checkpoints are available in this repository.
You can extract Mel-spectrograms for both the training and inference processes using the following command.
python dataset.py -i your_audio_path -o your_mel_saving_pathFirst modify the paths for training in src/params.py as needed, and then use python src/train.py to train WaveFM.
If trainCheckPointPath exists, the training process will resume from the corresponding checkpoints. Otherwise, new checkpoints will be saved to this path. During the training process, checkpoints are saved every trainCheckPointSavingStep steps.
Note: A coefficient
# parameters for training
trainInitialLR = 7.5e-5,
trainFinalLR = 5e-6,
trainBetas = (0.9, 0.99),
trainWeightDecay = 5e-4,
trainSteps = 1000000,
trainBatch = 24,
trainCheckPointSavingStep = 10000,
trainAudiosPath = "Your Audios",
trainMelsPath = "Your Mels",
trainCheckPointPath = "./checkpoints/WaveFM_0",
trainGPUs = [0], # [] represents CPUFirst modify the paths for inference in src/params.py as needed, and then use python src/inference.py to run inference for WaveFM.
Note: The inference time displayed on the tqdm bar doesn't include the file loading time. Additionally, the inference speed can be much faster for longer audios due to a higher degree of parallelism.
# parameters for inference
inferenceSteps = 6,
inferenceMelsPath = "Your Mels For Inference",
inferenceSavingPath = "./generation",
inferenceCheckPointPath = "./checkpoints/WaveFM_1000000",
# "./distilled_checkpoints/Distilled_WaveFM_25000",
inferenceWithGPU = True, # use cuda:0 or CPUFirst modify the paths for distillation in src/params.py as needed, and then use python src/distillation.py to distill WaveFM.
distillDelta is the time step length used in the distillation process, and distillModelPath specifies the saving path of the teacher model.
If distillCheckPointPath exists, the distillation process will resume from the corresponding checkpoints. Otherwise, new checkpoints will be saved to this path. During the distillation process, checkpoints are saved every distillCheckPointSavingStep steps.
Note: In our experiments, a distillation process of 25,000 steps is sufficient. Further increasing the distillation steps does not improve performance.
# parameters for distillation
distillInitialLR = 2e-5,
distillFinalLR = 5e-6,
distillBetas = (0.8, 0.95),
distillWeightDecay = 1e-2,
distillDeltaT = 0.01,
distillSteps = 25000,
distillBatch = 24,
distillCheckPointSavingStep = 5000,
distillModelPath = "./checkpoints/WaveFM_1000000",
distillCheckPointPath = "./distilled_checkpoints/Distilled_WaveFM_0",
distillAudiosPath = "Your Audios",
distillMelsPath = "Your Mels",
distillGPUs = [0], # [] represents CPUWaveFM has been tested on python 3.9 with the following requirements, and it should also work fine with the latest version.
torch 2.0.1+cu118
torchaudio 2.0.2@inproceedings{luo2025wavefm,
title={Wavefm: A high-fidelity and efficient vocoder based on flow matching},
author={Luo, Tianze and Miao, Xingchen and Duan, Wenbo},
booktitle={Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)},
pages={2187--2198},
year={2025}
}