Start with a directory containing the wav files to be used for pretraining (we recommend splitting each file into separate files of 10 to 30 seconds in length).
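The recommended splitting can be sketched in Python as follows. This is a minimal illustration, not part of the repository: the function name `split_wav`, the output naming scheme, and the default of 20 seconds are assumptions, and it uses the standard-library `wave` module, which only handles uncompressed PCM wav files. The last chunk of a file may be shorter than the requested length.

```python
import wave
from pathlib import Path


def split_wav(src, out_dir, chunk_seconds=20.0):
    """Split one PCM wav file into consecutive chunks of at most chunk_seconds.

    Hypothetical helper: returns the list of chunk paths it wrote.
    """
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with wave.open(str(src), "rb") as reader:
        params = reader.getparams()
        frames_per_chunk = int(chunk_seconds * params.framerate)
        if frames_per_chunk < 1:
            raise ValueError("chunk_seconds is too small for this sample rate")
        index = 0
        while True:
            frames = reader.readframes(frames_per_chunk)
            if not frames:
                break  # end of file reached
            out_path = out_dir / f"{src.stem}_{index:04d}.wav"
            with wave.open(str(out_path), "wb") as writer:
                # Keep channel count, sample width and sample rate unchanged.
                writer.setnchannels(params.nchannels)
                writer.setsampwidth(params.sampwidth)
                writer.setframerate(params.framerate)
                writer.writeframes(frames)
            written.append(out_path)
            index += 1
    return written
```

Calling `split_wav("/path/to/waves/a.wav", "/path/to/chunks")` would write `a_0000.wav`, `a_0001.wav`, and so on into the output directory.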
First, install the soundfile library:
pip install soundfile
Next, run:
$ python examples/wav2vec/wav2vec_manifest.py /path/to/waves --dest /manifest/path --ext $ext --valid-percent $valid
$ext should be set to flac, wav, or whatever format your dataset happens to use that soundfile can read.
$valid should be set to some reasonable percentage of the training data to use for validation, expressed as a fraction between 0 and 1 (like 0.01). To use a pre-defined validation set (like dev-other from librispeech), set it to 0 and then overwrite valid.tsv with a separately pre-processed manifest file.
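To make the effect of $valid concrete, here is a minimal Python sketch of a random hold-out split. It illustrates the idea under stated assumptions and is not the code of wav2vec_manifest.py: the function name `split_for_validation` and the fixed seed are hypothetical. The value is treated as a fraction, so 0.01 holds out roughly 1% of the files and 0 leaves the validation list empty.

```python
import random


def split_for_validation(paths, valid_fraction, seed=42):
    """Assign each path to the validation list with probability valid_fraction.

    Hypothetical helper: returns (train_paths, valid_paths).
    """
    if not 0.0 <= valid_fraction <= 1.0:
        raise ValueError("valid_fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, valid = [], []
    for path in sorted(paths):
        # rng.random() is in [0, 1), so a fraction of 0 never selects valid.
        if rng.random() < valid_fraction:
            valid.append(path)
        else:
            train.append(path)
    return train, valid
```

Because the split is random, the held-out share only approximates the requested fraction on small datasets.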
This configuration was used for the base model trained on the Librispeech dataset in the data2vec-aqc paper.
Note that the input is expected to be single channel, sampled at 16 kHz.
$ python fairseq_cli/hydra_train.py -m --config-dir examples/data2vec/config/audio/pretraining \
--config-name base_librispeech task.data=/path/to/manifests common.user_dir=examples/data2vec
Note: you can simulate 16 GPUs by using k GPUs and adding the command line parameters
distributed_training.distributed_world_size=k +optimization.update_freq='[x]' where x = 16/k
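The arithmetic behind this note can be sketched in Python. The helper name `simulate_gpu_overrides` is hypothetical, and the sketch assumes that the number of available GPUs divides the target evenly, so that x = 16/k is a whole number; it rejects other values instead of rounding.

```python
def simulate_gpu_overrides(available_gpus, target_gpus=16):
    """Build the two command line overrides for simulating target_gpus GPUs.

    Hypothetical helper: gradient accumulation (update_freq) makes up for
    the missing GPUs, so update_freq = target_gpus / available_gpus.
    """
    if available_gpus < 1:
        raise ValueError("available_gpus must be at least 1")
    if target_gpus % available_gpus != 0:
        raise ValueError("available_gpus must divide target_gpus evenly")
    update_freq = target_gpus // available_gpus
    return [
        f"distributed_training.distributed_world_size={available_gpus}",
        f"+optimization.update_freq='[{update_freq}]'",
    ]


# Example: 4 GPUs, each accumulating gradients over 4 steps.
print(" ".join(simulate_gpu_overrides(4)))
# prints: distributed_training.distributed_world_size=4 +optimization.update_freq='[4]'
```

The returned strings are meant to be pasted onto the shell command line; the single quotes around the list are shell quoting.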
- The cluster_factor and scale_factor parameters (for the clustering module) can be modified in the model section of the pre-training config.
- The augmentations used for data2vec-aqc require the noise set of the MUSAN dataset. Its path must be specified in the path_to_musan_noise_set variable of the __getitem__ method of the raw_audio_dataset file.
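Since the input is expected to be single channel at 16 kHz, it can be worth checking files before building manifests. The following is a minimal sketch with a hypothetical function name, `check_wav_format`. It relies on the standard-library `wave` module and therefore only reads uncompressed PCM wav files; for flac or other formats a library such as soundfile would be needed instead.

```python
import wave


def check_wav_format(path, expected_rate=16000, expected_channels=1):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper that inspects only the wav header.
    """
    problems = []
    with wave.open(str(path), "rb") as reader:
        channels = reader.getnchannels()
        rate = reader.getframerate()
    if channels != expected_channels:
        problems.append(f"expected {expected_channels} channel(s), found {channels}")
    if rate != expected_rate:
        problems.append(f"expected {expected_rate} Hz, found {rate} Hz")
    return problems
```

Files that report problems would need to be downmixed or resampled with an audio tool of your choice before pretraining.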
Fine-tuning a model requires parallel audio and label files, as well as a vocabulary file in fairseq format. A letter vocabulary can be downloaded here. An example script that generates labels for the Librispeech dataset from the tsv file produced by wav2vec_manifest.py can be used as follows:
split=train
$ python libri_labels.py /path/to/tsv --output-dir /output/dir --output-name $split
Fine-tuning on 100h of Librispeech with letter targets:
$ fairseq-hydra-train \
distributed_training.distributed_port=$PORT \
task.data=/path/to/data \
model.w2v_path=/path/to/model.pt \
--config-dir /path/to/fairseq-py/examples/wav2vec/config/finetuning \
--config-name base_100h common.user_dir=examples/data2vec
There are other config files in the config/finetuning directory that can be used to fine-tune on other splits.
You can specify the right config via the --config-name parameter.
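For readers preparing their own data, the letter targets mentioned above can be illustrated with a short Python sketch. It rests on an assumption about the label format: each utterance becomes a line of space-separated characters, with `|` marking word boundaries and a trailing `|` at the end. The function name `to_letter_target` is hypothetical; compare its output against the files written by libri_labels.py before relying on it.

```python
def to_letter_target(transcript):
    """Convert a word transcript into a space-separated letter sequence.

    Hypothetical helper: words are joined with '|' as the boundary symbol,
    every character is separated by a space, and a final '|' is appended.
    """
    words = transcript.split()
    return " ".join("|".join(words)) + " |"


print(to_letter_target("HELLO WORLD"))
# prints: H E L L O | W O R L D |
```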
Decoding with a language model during training requires flashlight python bindings (previously called wav2letter) to be installed.
If you want to use a language model, add +criterion.wer_args='[/path/to/kenlm, /path/to/lexicon, 2, -1]' to the command line.
Evaluating a CTC model with a language model requires flashlight python bindings (previously called wav2letter) to be installed.
The Fairseq transformer language model used in the wav2vec 2.0 paper can be obtained from the wav2letter model repository. Be sure to upper-case the language model vocab after downloading it.
The letter dictionary for pre-trained models can be found here.
Next, run the evaluation command:
python examples/speech_recognition/new/infer.py --config-dir examples/speech_recognition/new/conf \
--config-name infer task=audio_finetuning task.data=/path/to/manifests common.user_dir=examples/data2vec \
task.labels=ltr decoding.type=kenlm \
decoding.lmweight=${lmweight} decoding.wordscore=${wordscore} decoding.silweight=${silscore} \
decoding.lexicon=/path/to/lexicon \
decoding.lmpath=/path/to/lm decoding.unique_wer_file=True \
dataset.gen_subset=dev_clean,dev_other,test_clean,test_other \
common_eval.path=/path/to/checkpoint.pt decoding.beam=1500 distributed_training.distributed_world_size=${num_gpus}
To get raw numbers, use decoding.type=viterbi and omit the lexicon. To use the transformer language model, use decoding.type=fairseqlm.
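The three decoding variants differ only in a few overrides, which the following Python sketch collects in one place. The helper name `decoding_overrides` is hypothetical. It also assumes that fairseqlm takes the same decoding.lexicon and decoding.lmpath keys as kenlm, which the text above does not state explicitly.

```python
def decoding_overrides(decoding_type, lexicon=None, lm_path=None):
    """Build the decoding.* overrides for the evaluation command.

    Hypothetical helper. viterbi uses neither lexicon nor language model;
    kenlm and fairseqlm are assumed to need both.
    """
    if decoding_type not in {"viterbi", "kenlm", "fairseqlm"}:
        raise ValueError(f"unsupported decoding type: {decoding_type}")
    overrides = [f"decoding.type={decoding_type}"]
    if decoding_type == "viterbi":
        return overrides  # raw numbers: lexicon and LM are omitted
    if lexicon is None or lm_path is None:
        raise ValueError("lexicon and lm_path are required for LM decoding")
    overrides.append(f"decoding.lexicon={lexicon}")
    overrides.append(f"decoding.lmpath={lm_path}")
    return overrides
```

For example, `decoding_overrides("viterbi")` yields only `decoding.type=viterbi`, while the kenlm variant adds the lexicon and language model paths shown in the command above.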