data2vec-aqc

Training a new speech model with the CLI tools

Start with a directory containing the wav files to be used for pretraining. We recommend splitting each file into separate files of 10 to 30 seconds in length.
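The splitting step recommended above can be sketched as follows. This is a minimal illustration, not part of the repository: the helper names chunk_bounds and split_wav and the 20-second target length are assumptions, and because it relies on Python's standard wave module it only handles uncompressed PCM wav files.

```python
import wave
from pathlib import Path


def chunk_bounds(total_frames, sample_rate, chunk_seconds=20.0, min_seconds=10.0):
    """Return (start, end) frame ranges of roughly chunk_seconds each.

    A trailing piece shorter than min_seconds is merged into the previous
    chunk, so with the defaults every chunk of a long file is between
    10 and 30 seconds. A file shorter than min_seconds is kept as one chunk.
    """
    step = int(chunk_seconds * sample_rate)
    bounds = [(s, min(s + step, total_frames)) for s in range(0, total_frames, step)]
    if len(bounds) > 1:
        last_start, last_end = bounds[-1]
        if last_end - last_start < min_seconds * sample_rate:
            bounds.pop()
            prev_start, _ = bounds.pop()
            bounds.append((prev_start, last_end))
    return bounds


def split_wav(src, out_dir):
    """Write the chunks of one PCM wav file into out_dir and return their paths."""
    src, out_dir = Path(src), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with wave.open(str(src), "rb") as reader:
        bounds = chunk_bounds(reader.getnframes(), reader.getframerate())
        for i, (start, end) in enumerate(bounds):
            reader.setpos(start)
            frames = reader.readframes(end - start)
            out_path = out_dir / f"{src.stem}_{i:04d}.wav"
            with wave.open(str(out_path), "wb") as writer:
                # Copy the audio format of the source file unchanged.
                writer.setnchannels(reader.getnchannels())
                writer.setsampwidth(reader.getsampwidth())
                writer.setframerate(reader.getframerate())
                writer.writeframes(frames)
            written.append(out_path)
    return written
```

Calling split_wav on every file in the source directory and pointing the manifest script at the output directory would then give chunks in the recommended range. Cutting at fixed offsets can split words; cutting at silences is preferable when alignments or a voice activity detector are available.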

Prepare training data manifest:

First, install the soundfile library:

pip install soundfile

Next, run:

$ python examples/wav2vec/wav2vec_manifest.py /path/to/waves --dest /manifest/path --ext $ext --valid-percent $valid

$ext should be set to flac, wav, or whatever format your dataset happens to use that soundfile can read.

$valid should be set to some reasonable fraction (like 0.01) of the training data to use for validation. To use a pre-defined validation set (like dev-other from Librispeech), set it to 0 and then overwrite valid.tsv with a separately pre-processed manifest file.
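As a sanity check, the generated manifest can be read back and summarized. The sketch below assumes the layout written by wav2vec_manifest.py is a first line holding the root directory, followed by one line per file with the relative path, a tab, and the number of audio frames; verify this against your own train.tsv before relying on it. The function name summarize_manifest is hypothetical.

```python
from pathlib import Path


def summarize_manifest(tsv_path, sample_rate=16000):
    """Return (root, number of files, total hours) for a manifest file.

    Assumed layout: line 1 is the root directory, every later line is
    "relative/path<TAB>num_frames".
    """
    lines = Path(tsv_path).read_text().splitlines()
    root = lines[0].strip()
    total_frames = 0
    num_files = 0
    for line in lines[1:]:
        if not line.strip():
            continue  # tolerate blank lines
        _rel_path, frames = line.split("\t")
        total_frames += int(frames)
        num_files += 1
    hours = total_frames / sample_rate / 3600
    return root, num_files, hours
```

Comparing the file counts of train.tsv and valid.tsv shows whether the split roughly matches the value passed as $valid.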

Train a data2vec-aqc Base model:

This configuration was used for the base model trained on the Librispeech dataset in the data2vec-aqc paper.

Note that the input is expected to be single channel, sampled at 16 kHz.
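The format requirement above can be checked before training starts. The following is a minimal sketch using Python's standard wave module, so it covers PCM wav files only; the helper name check_wav_format is an assumption and not part of the repository.

```python
import wave


def check_wav_format(path, expected_rate=16000, expected_channels=1):
    """Return a list of problems found in a PCM wav file (empty if none)."""
    problems = []
    with wave.open(str(path), "rb") as reader:
        channels = reader.getnchannels()
        rate = reader.getframerate()
        if channels != expected_channels:
            problems.append(f"{channels} channels, expected {expected_channels}")
        if rate != expected_rate:
            problems.append(f"{rate} Hz, expected {expected_rate} Hz")
    return problems
```

Files that report problems need to be downmixed or resampled with an audio tool of your choice before they are added to the manifest.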

$ python fairseq_cli/hydra_train.py -m --config-dir examples/data2vec/config/audio/pretraining \
--config-name base_librispeech task.data=/path/to/manifests common.user_dir=examples/data2vec

Note: you can simulate 16 GPUs by using k GPUs and adding the command line parameters distributed_training.distributed_world_size=k +optimization.update_freq='[x]', where x = 16/k.
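The arithmetic in this note can be sketched as a small helper. With k GPUs, each GPU accumulates gradients over x = 16/k batches before every update, so the effective batch size matches a 16-GPU run. The function name simulation_overrides is hypothetical; the two override strings are the ones given in the note.

```python
def simulation_overrides(available_gpus, target_gpus=16):
    """Build the two command line overrides that emulate target_gpus."""
    if available_gpus < 1 or target_gpus % available_gpus != 0:
        raise ValueError("target_gpus must be a positive multiple of available_gpus")
    update_freq = target_gpus // available_gpus
    return [
        f"distributed_training.distributed_world_size={available_gpus}",
        f"+optimization.update_freq='[{update_freq}]'",
    ]


# Example: 4 GPUs accumulate 4 batches each to emulate 16 GPUs.
print(" ".join(simulation_overrides(4)))
```

The helper rejects GPU counts that do not divide 16 evenly, since a fractional update frequency cannot reproduce the same effective batch size.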

Parameters of interest

  • The cluster_factor and scale_factor parameters of the clustering module can be changed in the model section of the pre-training config.
  • The augmentations used for data2vec-aqc require the noise set of the MUSAN dataset. Specify its path in the path_to_musan_noise_set variable in the getitem method of the raw_audio_dataset file.

Fine-tune a pre-trained model with CTC:

Fine-tuning a model requires parallel audio and label files, as well as a vocabulary file in fairseq format. A letter vocabulary can be downloaded here. An example script that generates labels for the Librispeech dataset from the tsv file produced by wav2vec_manifest.py can be used as follows:

split=train
$ python libri_labels.py /path/to/tsv --output-dir /output/dir --output-name $split
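For orientation, the sketch below shows what a letter target line is assumed to look like: the characters of the transcript separated by spaces, with | marking the end of each word. This reflects a reading of the fairseq label format rather than the script's exact code, and the function name to_letter_target is hypothetical.

```python
def to_letter_target(transcript):
    """Convert 'HELLO WORLD' into 'H E L L O | W O R L D |'."""
    words = transcript.strip().split()
    # Spell out each word letter by letter and close it with a word boundary.
    return " ".join(" ".join(word) + " |" for word in words)


print(to_letter_target("HELLO WORLD"))  # H E L L O | W O R L D |
```

Every symbol that appears in the targets, including |, has to be present in the letter vocabulary used for fine-tuning.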

Fine-tuning on 100h of Librispeech with letter targets:

$ fairseq-hydra-train \
    distributed_training.distributed_port=$PORT \
    task.data=/path/to/data \
    model.w2v_path=/path/to/model.pt \
    --config-dir /path/to/fairseq-py/examples/wav2vec/config/finetuning \
    --config-name base_100h common.user_dir=examples/data2vec

There are other config files in the config/finetuning directory that can be used to fine-tune on other splits. You can specify the right config via the --config-name parameter.

Decoding with a language model during training requires the flashlight python bindings (previously called wav2letter). If you want to use a language model, add +criterion.wer_args='[/path/to/kenlm, /path/to/lexicon, 2, -1]' to the command line.

Evaluating a CTC model:

Evaluating a CTC model with a language model requires the flashlight python bindings (previously called wav2letter) to be installed.

The fairseq transformer language model used in the wav2vec 2.0 paper can be obtained from the wav2letter model repository. Be sure to upper-case the language model vocab after downloading it.

The letter dictionary for pre-trained models can be found here.

Next, run the evaluation command:

$ python examples/speech_recognition/new/infer.py --config-dir examples/speech_recognition/new/conf \
--config-name infer task=audio_finetuning task.data=/path/to/manifests common.user_dir=examples/data2vec \
task.labels=ltr decoding.type=kenlm \
decoding.lmweight=${lmweight} decoding.wordscore=${wordscore} decoding.silweight=${silscore} \
decoding.lexicon=/path/to/lexicon \
decoding.lmpath=/path/to/lm decoding.unique_wer_file=True \
dataset.gen_subset=dev_clean,dev_other,test_clean,test_other \
common_eval.path=/path/to/checkpoint.pt decoding.beam=1500 distributed_training.distributed_world_size=${num_gpus}

To get raw numbers, use decoding.type=viterbi and omit the lexicon. To use the transformer language model, use decoding.type=fairseqlm.
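The lexicon passed through decoding.lexicon maps each word to its spelling in the letter vocabulary. The following is a minimal sketch for building one from a word list, assuming the commonly used layout of one entry per line with the word, a tab, and its space-separated letters followed by |. Check this against a lexicon known to work with your decoder; build_lexicon is a hypothetical helper, not part of the repository.

```python
def build_lexicon(words):
    """Return lexicon lines for the unique words, sorted alphabetically.

    Words are upper-cased on the assumption that the letter vocabulary
    and the language model vocabulary are upper-case.
    """
    unique_words = sorted({w.strip().upper() for w in words if w.strip()})
    lines = []
    for word in unique_words:
        spelling = " ".join(word) + " |"
        lines.append(f"{word}\t{spelling}")
    return lines


for line in build_lexicon(["the", "cat", "The"]):
    print(line)
```

The word list would normally come from the vocabulary of the language model, so that every word the model can score also has a spelling the decoder can emit.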