TensorVox uses an RNN-based G2P model implemented in TensorFlow to convert text to phonemes before feeding it to the text2speech models.
In order to train a model, you need to prepare two things:
- A dictionary in the format `WORD \t PHONETIC SPELLING` as the dataset
- A config file (optional; one is already provided in `config/default.yaml`)
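For example, a few ARPA-style entries in such a dictionary might look like this (tab-separated, one word per line; these entries are illustrative):

```
green	G R IY1 N
hello	HH AH0 L OW1
world	W ER1 L D
```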
TensorFlow 2.0 or greater is, of course, required.
Since training is very quick on a GPU (e.g. a Tesla T4), a single script handles preprocessing, training, and exporting. If you don't have a GPU, just use Google Colab.
You can download my English dictionary (converted to tab-separated format from the LibriSpeech lexicon) here. Rename it from dict.d to dict.txt.
The command to run it is as follows:
python3 train_and_export.py --dict-path dict.txt --config-path config/default.yaml --out-path English
Arguments should be self-explanatory.
If your phoneme format does not separate phonemes with spaces (e.g. IPA), pass --char-tok-phn as an argument: by default, the script assumes all phoneme strings are ARPA-like (example: G R IY1 N) and tokenizes on spaces. One sign that the wrong tokenization is in effect is very slow training on a decent GPU.
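The difference between the two tokenization modes can be sketched as follows (a minimal illustration with a hypothetical function name, not the script's actual code):

```python
def tokenize_phonemes(spelling: str, char_tok: bool = False) -> list[str]:
    """Split a phonetic spelling into model tokens.

    By default (ARPA-style input), tokens are separated by spaces;
    with char_tok=True (the --char-tok-phn behavior), every character
    becomes its own token, which suits space-free formats like IPA.
    """
    if char_tok:
        return list(spelling)
    return spelling.split()

# ARPA-style spelling: split on spaces
print(tokenize_phonemes("G R IY1 N"))             # ['G', 'R', 'IY1', 'N']
# IPA-style spelling: one token per character
print(tokenize_phonemes("ɡɹiːn", char_tok=True))  # ['ɡ', 'ɹ', 'i', 'ː', 'n']
```

Feeding an IPA dictionary through the default space tokenizer would treat each whole spelling as one giant token, which explains the bloated vocabulary and slow training mentioned above.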
Once finished, the script writes all files required to use the model into the folder given by the --out-path argument (created if it doesn't exist).
No further action is necessary: just move the folder so that all its files end up in (executable file path)/g2p/language name, and the program will use it for phoneme conversion for all models it loads in that language. Make sure the language name folder is capitalized.
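Assuming the language is named English, the resulting layout would look something like this (the path prefix is illustrative):

```
(executable file path)/
└── g2p/
    └── English/
        ├── char2id.txt
        ├── phn2id.txt
        ├── dict.txt
        └── model/
```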
The output consists of three things:
- char2id.txt, phn2id.txt: Two text files in the format `TOKEN \t ID` that define the IDs fed into the model (char) and returned by it (phn). Automatically generated by the script.
- dict.txt: The dictionary in the format `WORD \t PHONETIC SPELLING`, used to look up phonetic spellings. Automatically re-exported (words forced to lowercase) by the script.
- model: The actual G2P model, saved in TensorFlow SavedModel format.
Because the network is not fully reliable, it is only used to guess novel words: the program first does a dictionary lookup (semi-optimized with bucketed string search) and falls back to the model only if the word is not found.
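The lookup-then-fallback logic can be sketched like this (names are hypothetical, and a plain dict stands in for the bucketed string search):

```python
def phonemize_word(word, dictionary, model_predict):
    """Return the phonetic spelling for a single word.

    Known words come straight from the dictionary (the fast, reliable
    path); only novel words are sent to the G2P model. The exported
    dict.txt stores words in lowercase, so normalize before lookup.
    """
    spelling = dictionary.get(word.lower())
    if spelling is not None:
        return spelling
    return model_predict(word)  # fall back to the neural model

# Usage: a stub stands in for the exported SavedModel
dictionary = {"green": "G R IY1 N"}
stub_model = lambda w: "?"  # placeholder for the RNN's prediction
print(phonemize_word("Green", dictionary, stub_model))  # G R IY1 N
```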
An example English model is zipped in the models/ directory.