Add UnivNet Vocoder Model for Tortoise TTS Diffusers Integration #24799
ArthurZucker merged 103 commits into huggingface:main from
Conversation
Hi @dg845, if you are planning to add it to the
For now I've added the UnivNet code to the
Nice start @dg845! Yep, fine to have it as a standalone model - we have ResNet in transformers as well, which is not strictly attention-based.
Hi @sanchit-gandhi, I think the PR is ready for review. The following are the differences between the
(*) "Outer residual block" is a bit of a misnomer, since for both blocks in question (
Also, I'm not sure why
ylacombe
left a comment
Hi @dg845,
Congratulations! This is a great start for a first PR!
Your modeling code is really clear and looks correct (at first sight) and a large part of your code is already in line with the transformers philosophy!
Here's a set of comments you should address before I make another review! Most of the comments are small things to change.
The main thing to do here is to make sure you get the same results as the original implementation, to make sure you haven't missed something in your modeling code. I've described this in detail in my comments.
Once it is done, I will look more in details into your modeling code to make sure everything is okay, but I think you did most of the job here.
With regards to your last comment, I'm afraid that I'm not sure how to address it. @sanchit-gandhi or @sgugger, do you have any ideas on how to solve this?
Thanks again for this PR!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Nice - I think this is a good design! Gently pinging @ArthurZucker for a final review here when you get the chance 🙌
ArthurZucker
left a comment
There was a problem hiding this comment.
LGTM, left a few small nits, but everything's here! Thanks for your hard work and this clean PR!
```python
# Resolve batch sizes for noise_sequence and spectrogram
spectrogram_batched = input_features.dim() == 3
```
Just FMI, is this part here to match the diffusers implementation?
I guess the purpose of this is just to make sure the spectrogram input is batched for UnivNet inference and output. Currently, UnivNetModel.forward always returns a batched waveform output for use in UnivNetFeatureExtractor.batch_decode, which will turn it into a (possibly ragged) list of unbatched waveforms. (If there's only one waveform, it will output a one-element list with that waveform.)

We could probably rewrite this without explicitly defining spectrogram_batched:

```python
if input_features.dim() == 2:
    input_features = input_features.unsqueeze(0)
```

[As a note, the spectrogram input to the UnivNetModel vocoder in the Tortoise TTS diffusers pipeline should always be batched.]
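To illustrate the proposed rewrite, here is a minimal numpy sketch of the batching behavior (`ensure_batched` is a hypothetical helper name of mine, not from the PR; the actual modeling code operates on torch tensors inline):

```python
import numpy as np

def ensure_batched(input_features):
    # Hypothetical helper mirroring the proposed rewrite: promote an
    # unbatched (num_frames, num_mel_bins) spectrogram to a batch of
    # size 1, and pass already-batched 3D inputs through unchanged.
    if input_features.ndim == 2:
        input_features = input_features[None, ...]
    return input_features

unbatched = np.zeros((100, 64))   # (num_frames, num_mel_bins)
batched = ensure_batched(unbatched)
```

Either way, downstream code can then assume a leading batch dimension.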
```python
kernels = kernel_hidden_states.contiguous().view(
    batch_size,
    self.conv_layers,
    self.conv_in_channels,
    self.conv_out_channels,
    self.conv_kernel_size,
    seq_length,
)
biases = bias_hidden_states.contiguous().view(
    batch_size,
    self.conv_layers,
    self.conv_out_channels,
    seq_length,
)
```
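For reference, the shapes involved in this view can be sketched with toy dimensions (all numbers below are illustrative, not the real UnivNet config values):

```python
import numpy as np

# Toy dimensions for illustration only
batch_size, conv_layers = 2, 3
conv_in_channels, conv_out_channels = 8, 16
conv_kernel_size, seq_length = 5, 10

# The kernel predictor emits one flat channel axis per timestep; the
# view just unpacks it into per-layer, per-position conv kernels.
flat_channels = conv_layers * conv_in_channels * conv_out_channels * conv_kernel_size
kernel_hidden_states = np.zeros((batch_size, flat_channels, seq_length))
kernels = kernel_hidden_states.reshape(
    batch_size, conv_layers, conv_in_channels,
    conv_out_channels, conv_kernel_size, seq_length,
)
```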
Should contiguous() not be called after view?
Good catch, I think kernel_hidden_states.view(...).contiguous() makes more sense here. The original implementation calls contiguous before view (see here), but this might be a typo, since there doesn't seem to be any call before view that would warrant a contiguous call.
[Removing the contiguous call altogether doesn't seem to trigger any RuntimeError: input is not contiguous errors in any of the tests in UnivNetModelTest, but I think it's probably better to leave it in as *.view(...).contiguous().]
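As a side note on the ordering: `view` requires compatibly strided (in practice, contiguous) input, so `.contiguous()` before `.view(...)` guards against non-contiguous tensors, while `.view(...).contiguous()` only succeeds if the input was already viewable. A small sketch of the failure mode:

```python
import torch

# A transposed tensor is not contiguous, so .view() on it raises,
# while .contiguous().view() copies into contiguous memory first.
x = torch.arange(6).reshape(2, 3).transpose(0, 1)
assert not x.is_contiguous()

try:
    x.view(-1)  # fails: strides incompatible with a flat view
    view_raised = False
except RuntimeError:
    view_raised = True

flattened = x.contiguous().view(-1)
```

On an already-contiguous input (as seems to be the case here), `.contiguous()` is a no-op either way, which is consistent with dropping it not breaking any tests.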
```python
def batch_decode(self, waveforms, waveform_lengths=None) -> List[np.ndarray]:
    r"""
    Removes padding from generated audio after running [`UnivNetModel.forward`]. This returns a ragged list of 1D
```
Here the waveform lengths come from the model, which just summed the padding mask returned by this class. I am guessing this is to have a seamless batch_decode(model(**input))? See my comment regarding the model not needing the masks.
Yeah, the primary reason is so that

```python
audio = model(**inputs)
audio = feature_extractor.batch_decode(**audio)
```

works seamlessly. The design is based on VitsModel and VitsModelOutput, although the VITS model is a transformer and thus VitsModel.forward uses the attention_mask argument in a non-trivial way. See #24799 (comment), #24799 (comment), #24799 (comment), #24799 (comment) for more discussion about the design.
It would be possible to calculate the original waveform lengths in UnivNetFeatureExtractor and have UnivNetModel.forward take a waveform_lengths argument instead, but this also feels weird since forward wouldn't do anything with waveform_lengths except output it. In my opinion having forward accept a padding_mask argument feels more natural in transformers (although it might be a bit confusing since it's not an attention mask).
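To make the trade-off concrete, here is a minimal numpy sketch of the design being discussed (the hop length and mask values are made up, and `trim_batch` is a hypothetical stand-in for what `batch_decode` does with the waveform lengths):

```python
import numpy as np

hop_length = 256  # assumed value, not the real feature extractor's

# padding_mask marks real (1) vs padded (0) spectrogram frames
padding_mask = np.array([[1, 1, 1, 1, 0, 0],
                         [1, 1, 1, 1, 1, 1]])

# The model can recover per-sample waveform lengths just by summing
# the mask and scaling by the hop length.
waveform_lengths = padding_mask.sum(axis=1) * hop_length

def trim_batch(waveforms, waveform_lengths):
    # Hypothetical stand-in for batch_decode: turn a padded, batched
    # waveform into a ragged list of 1D arrays.
    return [w[:l] for w, l in zip(waveforms, waveform_lengths)]

waveforms = np.zeros((2, padding_mask.shape[1] * hop_length))
ragged = trim_batch(waveforms, waveform_lengths)
```

Under this design the mask summation lives in the model's forward pass, so the feature extractor never needs to know the padded length; the alternative would move the sum into the feature extractor and thread `waveform_lengths` through forward unused.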
@ArthurZucker @sanchit-gandhi @ylacombe I think one thing left to resolve is where to put the UnivNet model checkpoint (currently at
We could reach out to maum.ai to ask them if they can create an org (if not already) and host the weights.
@ArthurZucker Sounds good! I believe that they have an org at https://huggingface.co/maum-ai.
Hi @ArthurZucker @sanchit-gandhi @ylacombe, is there anything I can do to help out with transferring the checkpoint weights? (As a note, the checkpoint weights are currently stored at
Hi @dg845, I've contacted some people from maum-ai to move the weights to their organization (without any response yet)!
Hi @ArthurZucker @sanchit-gandhi @ylacombe, would it be possible to merge this PR? @susnato and I have made a lot of progress on the tortoise-tts PR over at
Hey both! Yeah, no problem, let's use the current path for the checkpoints and merge for now, as they are slow to respond!
Last nit: would you mind rebasing on main to make sure you have the correct styling? 🙏🏻
Hi @ArthurZucker, I have rebased on
Thanks a lot!
What does this PR do?
This PR adds the UnivNet GAN vocoder model (paper, code) to transformers. UnivNet is the vocoder used in the Tortoise TTS text-to-speech model (paper, code), which is currently being integrated into diffusers. See this issue in diffusers.
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@sanchit-gandhi
@susnato