Alan Baade, Puyuan Peng, David Harwath
The University of Texas at Austin
Paper Available by Request
Abstract. We introduce a method for textless any-to-any voice conversion that builds on recent progress in speech synthesis driven by neural codec language models. To disentangle speaker and linguistic information, we adapt a speaker-normalizing procedure for discrete semantic units, and then generate with an autoregressive language model for greatly improved diversity. We further improve the similarity of the output audio to the target speaker's voice by leveraging classifier-free guidance. We evaluate our techniques against current text-to-speech synthesis and voice conversion systems and compare the effectiveness of different neural codec language model pipelines. We demonstrate state-of-the-art results in speaker similarity with significantly less compute than existing codec language models such as VALL-E.
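For readers unfamiliar with classifier-free guidance, the following is a minimal sketch of the general idea as it is commonly applied to next-token logits in an autoregressive model. It is not taken from the paper: the function names `cfg_logits` and `softmax`, the `guidance_scale` parameter, and the toy logit values are all illustrative assumptions, and the paper's actual formulation may differ.

```python
import math

def cfg_logits(cond_logits, uncond_logits, guidance_scale):
    """Blend conditional and unconditional logits (generic CFG form).

    guidance_scale = 1.0 returns the conditional logits unchanged;
    larger values push the distribution further toward the condition
    (here, hypothetically, the target speaker prompt).
    """
    return [u + guidance_scale * (c - u)
            for c, u in zip(cond_logits, uncond_logits)]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy next-token logits over a 3-token vocabulary (made-up numbers).
cond = [2.0, 1.0, 0.0]     # model conditioned on the speaker prompt
uncond = [1.0, 1.0, 1.0]   # same model with the prompt dropped

probs_plain = softmax(cfg_logits(cond, uncond, 1.0))
probs_guided = softmax(cfg_logits(cond, uncond, 3.0))
```

With a scale above 1, tokens that the conditioned model favors more than the unconditioned one gain probability mass, which is the mechanism usually credited with improving adherence to the conditioning signal.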
Cherry-picked examples for voice conversion. All speakers were unseen by all models during training.
| Source | Target | Ours-Norm | Ours-Aligned | Ours-Phone | TriAAN-VC | FreeVC | DiffVC |
|---|---|---|---|---|---|---|---|
| SID 4446 to 2300 | | | | | | | |
| SID 61 to 2830 | | | | | | | |
| SID 121 to 4992 | | | | | | | |
| SID 1320 to 61 | | | | | | | |
| SID 1995 to 5639 | | | | | | | |
| SID 4507 to 2300 | | | | | | | |
| SID 6930 to 2961 | | | | | | | |
| SID 7176 to 8555 | | | | | | | |
Cherry-picked examples for voice conversion on VCTK. Speakers were unseen during training for our models and TriAAN-VC. FreeVC and DiffVC (marked with * in the table) may have seen these speakers during training.
To demonstrate the effect of accent, these examples largely convert between speakers labeled as American or Canadian and speakers who are not.
Notice that Ours-Norm converts between accents, while Ours-Aligned, TriAAN-VC, FreeVC, and DiffVC preserve the source speaker's accent almost entirely.
| Source | Target | Ours-Norm | Ours-Aligned | Ours-Phone | TriAAN-VC | FreeVC* | DiffVC* |
|---|---|---|---|---|---|---|---|
| SID p277 to p299 | | | | | | | |
| SID p294 to p231 | | | | | | | |
| SID p227 to p308 | | | | | | | |
| SID p245 to p299 | | | | | | | |
| SID p316 to p249 | | | | | | | |
| SID p343 to p285 | | | | | | | |
| SID p330 to p336 | | | | | | | |
| SID p271 to p285 | | | | | | | |