Textless Voice Conversion with Normalized Discrete Units

Alan Baade, Puyuan Peng, David Harwath

The University of Texas at Austin

Paper Available by Request

Abstract. We introduce a method for textless any-to-any voice conversion that builds on recent progress in speech synthesis driven by neural codec language models. To disentangle speaker and linguistic information, we adapt a speaker-normalizing procedure for discrete semantic units, and then generate with an autoregressive language model for greatly improved diversity. We further improve the similarity of the output audio to the target speaker's voice by leveraging classifier-free guidance. We evaluate our techniques against current text-to-speech synthesis and voice conversion systems and compare the effectiveness of different neural codec language model pipelines. We demonstrate state-of-the-art results in speaker similarity with significantly less compute than existing codec language models such as VALL-E.
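As a rough illustration of classifier-free guidance in an autoregressive token model, the sketch below uses the commonly used formulation in which per-token log-probabilities from a conditioned pass and an unconditioned pass are combined as uncond + scale * (cond - uncond). This page does not give the paper's exact formulation, so the function names, the toy logits, and the guidance scale here are all assumptions chosen for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cfg_logits(cond_logits, uncond_logits, scale):
    """Hypothetical helper: uncond + scale * (cond - uncond) in log-prob space.

    scale = 1.0 recovers the conditioned distribution; scale > 1.0 pushes
    the distribution further toward what the conditioning favors.
    """
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

# Toy vocabulary of three tokens; the values are made up for illustration.
cond_logits = [2.0, 1.0, 0.0]    # scores given a target-speaker prompt
uncond_logits = [1.0, 1.0, 1.0]  # scores with the speaker prompt dropped

probs_plain = softmax(cfg_logits(cond_logits, uncond_logits, 1.0))
probs_guided = softmax(cfg_logits(cond_logits, uncond_logits, 2.0))
```

In this toy case the guided distribution puts more mass on the token that the speaker prompt already favored, which is the intuition behind using guidance to raise similarity to the target voice.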

Voice Conversion on Unseen Speakers from LibriSpeech Test-Clean

Cherry-picked examples for voice conversion. All speakers are unseen by all models during training.

Source Target Ours-Norm Ours-Aligned Ours-Phone TriAAN-VC FreeVC DiffVC
SID 4446 to 2300
SID 61 to 2830
SID 121 to 4992
SID 1320 to 61
SID 1995 to 5639
SID 4507 to 2300
SID 6930 to 2961
SID 7176 to 8555

Voice Conversion on Speakers from VCTK

Cherry-picked examples for voice conversion on VCTK. Speakers are unseen during training for our models and TriAAN-VC. FreeVC and DiffVC (marked *) may have seen these speakers during training.
To demonstrate the effect of accent, these examples largely convert between speakers labeled as American or Canadian and speakers who are not.
Notice that Ours-Norm converts between accents, while the source speaker's accent is almost entirely preserved for Ours-Aligned, TriAAN-VC, FreeVC, and DiffVC.

Source Target Ours-Norm Ours-Aligned Ours-Phone TriAAN-VC FreeVC* DiffVC*
SID p277 to p299
SID p294 to p231
SID p227 to p308
SID p245 to p299
SID p316 to p249
SID p343 to p285
SID p330 to p336
SID p271 to p285
Website layout taken from the SPEAR-TTS Demo Page