A GPT model that converts text to phonemes with durations, suitable for feeding into a voice synthesizer.
This model converts raw text to phonemes and their durations, compatible with the Montreal Forced Aligner.
It converts a string such as "Hey, Vera, what time is it?" into a list of (phoneme, duration) tuples:
[('ç', 9), ('iː', 7), ('v', 7), ('ɛ', 8), ('ɹ', 8), ('i', 7), ('w', 6), ('ɐ', 5), ('ʔ', 3), ('tʰ', 8), ('aj', 11), ('m', 7), ('ɪ', 6), ('z', 7), ('ɪ', 6), ('ʔ', 8)]
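As a rough illustration of how a downstream synthesizer might consume this output, the sketch below takes the list shown above and derives the total length, a per-step phoneme sequence, and start/end intervals. The helper names (`total_duration`, `expand_to_steps`, `to_intervals`) are hypothetical and not part of this repository, and the unit of the duration values is not stated here, so the code treats them as abstract time steps.

```python
from typing import List, Tuple

PhonemeDurations = List[Tuple[str, int]]

# The example output quoted above for "Hey, Vera, what time is it?"
predicted: PhonemeDurations = [
    ('ç', 9), ('iː', 7), ('v', 7), ('ɛ', 8), ('ɹ', 8), ('i', 7),
    ('w', 6), ('ɐ', 5), ('ʔ', 3), ('tʰ', 8), ('aj', 11), ('m', 7),
    ('ɪ', 6), ('z', 7), ('ɪ', 6), ('ʔ', 8),
]


def total_duration(pairs: PhonemeDurations) -> int:
    """Sum of all durations, in the model's (unspecified) time unit."""
    return sum(duration for _, duration in pairs)


def expand_to_steps(pairs: PhonemeDurations) -> List[str]:
    """Repeat each phoneme once per time step, giving a step-level sequence."""
    steps: List[str] = []
    for phoneme, duration in pairs:
        steps.extend([phoneme] * duration)
    return steps


def to_intervals(pairs: PhonemeDurations) -> List[Tuple[str, int, int]]:
    """Convert durations to (phoneme, start, end) intervals, end exclusive."""
    intervals: List[Tuple[str, int, int]] = []
    start = 0
    for phoneme, duration in pairs:
        intervals.append((phoneme, start, start + duration))
        start += duration
    return intervals


if __name__ == "__main__":
    print(total_duration(predicted))    # 113
    print(to_intervals(predicted)[:2])  # [('ç', 0, 9), ('iː', 9, 16)]
```

The interval form resembles what a forced aligner reports (a label with start and end positions), while the expanded form is closer to what a step-synchronous synthesizer input might look like. Which one is actually needed depends on the synthesizer.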
This module requires extensive dataset preparation. To prepare all the data needed, run the following commands in order:
datasets sync: download the datasets
python ./datasets_prepare.py: preprocess audio files and extract texts from the datasets
./datasets_align.sh: generate alignments
python ./datasets_mix.py: mix all data together
python ./train_tokenizer.py: train the tokenizer on the alignments
python ./datasets_tokenize.py: tokenize the datasets
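Since each step depends on the output of the previous one, it can be convenient to run them from a single script that stops at the first failure. The sketch below is one possible way to do that and is not part of this repository: `run_steps` and `PREPARATION_STEPS` are hypothetical names, the commands are copied verbatim from the list above, and the script assumes it is started from the repository root. It defaults to a dry run that only prints what would be executed.

```python
import shlex
import subprocess
from typing import List, Sequence

# Commands copied from the preparation steps above, in the same order.
PREPARATION_STEPS: List[str] = [
    "datasets sync",
    "python ./datasets_prepare.py",
    "./datasets_align.sh",
    "python ./datasets_mix.py",
    "python ./train_tokenizer.py",
    "python ./datasets_tokenize.py",
]


def run_steps(steps: Sequence[str], dry_run: bool = True) -> List[List[str]]:
    """Run each command in order; return the argument lists that were planned.

    With dry_run=True nothing is executed. With dry_run=False, check=True makes
    subprocess.run raise CalledProcessError on a non-zero exit code, so later
    steps never run on top of a failed one.
    """
    planned: List[List[str]] = []
    for step in steps:
        argv = shlex.split(step)
        planned.append(argv)
        if dry_run:
            print("would run:", step)
        else:
            subprocess.run(argv, check=True)
    return planned


if __name__ == "__main__":
    # Switch to dry_run=False to actually execute the pipeline.
    run_steps(PREPARATION_STEPS, dry_run=True)
```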
To train the network, run:
./train.sh

License: MIT