This directory contains the training recipe for real-time audio, visual, and audio-visual speech recognition (ASR, VSR, AV-ASR) models, which is an extension of Auto-AVSR.
- Install PyTorch (`torch`, `torchvision`, `torchaudio`) along with the other required packages:

  ```bash
  pip install torch torchvision torchaudio pytorch-lightning sentencepiece
  ```

- Preprocess LRS3. See the instructions in the `data_prep` folder.
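As a quick sanity check that the installation succeeded, the following minimal Python snippet (not part of the recipe) imports each required package:

```python
# Sanity check: all required packages import cleanly.
import torch
import torchvision
import torchaudio
import pytorch_lightning
import sentencepiece

print(f"torch={torch.__version__}, torchaudio={torchaudio.__version__}")
```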
To train a model, run:

```bash
python train.py --exp-dir=[exp_dir] \
                --exp-name=[exp_name] \
                --modality=[modality] \
                --mode=[mode] \
                --root-dir=[root-dir] \
                --sp-model-path=[sp_model_path] \
                --num-nodes=[num_nodes] \
                --gpus=[gpus]
```

- `exp-dir` and `exp-name`: Directory where the checkpoints will be saved; they are stored under `[exp_dir]/[exp_name]`.
- `modality`: Type of input modality. Valid values: `video`, `audio`, and `audiovisual`.
- `mode`: Recognition mode. Valid values: `online` and `offline`.
- `root-dir`: Path to the root directory where all preprocessed files are stored.
- `sp-model-path`: Path to the SentencePiece model. Default: `./spm_unigram_1023.model`, which can be produced using `train_spm.py`.
- `num-nodes`: Number of machines used. Default: 4.
- `gpus`: Number of GPUs per machine. Default: 8.
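As an illustration, a single-node run might look like the following. The paths and experiment name are placeholders, not files shipped with the recipe; note that the defaults above assume a 4-node, 8-GPU setup:

```bash
python train.py --exp-dir=./exp \
                --exp-name=avsr_online \
                --modality=audiovisual \
                --mode=online \
                --root-dir=/data/lrs3 \
                --sp-model-path=./spm_unigram_1023.model \
                --num-nodes=1 \
                --gpus=8
```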
To evaluate a trained model, run:

```bash
python eval.py --modality=[modality] \
               --mode=[mode] \
               --root-dir=[dataset_path] \
               --sp-model-path=[sp_model_path] \
               --checkpoint-path=[checkpoint_path]
```

- `modality`: Type of input modality. Valid values: `video`, `audio`, and `audiovisual`.
- `mode`: Recognition mode. Valid values: `online` and `offline`.
- `root-dir`: Path to the root directory where all preprocessed files are stored.
- `sp-model-path`: Path to the SentencePiece model. Default: `./spm_unigram_1023.model`.
- `checkpoint-path`: Path to a pre-trained model checkpoint.
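For example, to evaluate the hypothetical training run above in streaming mode (the checkpoint path is a placeholder and depends on where your run saved its checkpoints):

```bash
python eval.py --modality=audiovisual \
               --mode=online \
               --root-dir=/data/lrs3 \
               --sp-model-path=./spm_unigram_1023.model \
               --checkpoint-path=./exp/avsr_online/model.ckpt
```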
The table below reports the word error rate (WER) of AV-ASR models trained from scratch, measured with offline evaluation.
| Model | Training dataset (hours) | WER [%] | Params (M) |
|---|---|---|---|
| Non-streaming models |  |  |  |
| AV-ASR | LRS3 (438) | 3.9 | 50 |
| Streaming models |  |  |  |
| AV-ASR | LRS3 (438) | 3.9 | 40 |
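For reference, WER is the word-level edit distance between a hypothesis and its reference transcript, divided by the number of reference words. A minimal sketch using `torchaudio.functional.edit_distance`; this helper is illustrative and not part of the recipe:

```python
import torchaudio.functional as F

def word_error_rate(hypothesis: str, reference: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    hyp_words = hypothesis.split()
    ref_words = reference.split()
    return F.edit_distance(hyp_words, ref_words) / len(ref_words)

# One substitution out of three reference words -> WER of 1/3.
print(word_error_rate("set the alarm", "set an alarm"))
```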
