
Add UnivNet Vocoder Model for Tortoise TTS Diffusers Integration #24799

Merged
ArthurZucker merged 103 commits into huggingface:main from dg845:univnet-vocoder-model
Nov 22, 2023

Conversation

@dg845 (Contributor) commented Jul 13, 2023

What does this PR do?

This PR adds the UnivNet GAN vocoder model (paper, code) to transformers. UnivNet is the vocoder used in the Tortoise TTS text-to-speech model (paper, code), which is currently being integrated into diffusers; see this issue in diffusers.

[Figure: UnivNet model architecture]

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sanchit-gandhi
@susnato

@susnato (Contributor) commented Jul 13, 2023

Hi @dg845, if you are planning to add it to the models folder, then I think it should have a doc file (univnet.md) in the docs.

@dg845 (Contributor, Author) commented Jul 13, 2023

For now I've added the UnivNet code to the /src/transformers/models/univnet/ directory. @sanchit-gandhi, since the UnivNet model isn't technically a transformer model (in that it doesn't use any attention mechanisms), is this the best place to put it? For example, the SpeechT5HifiGan vocoder is in /src/transformers/models/speecht5/ along with the other SpeechT5 models, but I assume most of the other Tortoise TTS code will go into diffusers rather than transformers.

@sanchit-gandhi (Contributor) commented Jul 13, 2023

Nice start @dg845! Yep, it's fine to have it as a standalone model; we also have ResNet in transformers, which is not strictly attention-based.

@dg845 (Contributor, Author) commented Jul 16, 2023

Hi @sanchit-gandhi, I think the PR is ready for review.

The following are the differences between the SpeechT5HifiGan and the UnivNetGan model:

  • The SpeechT5HifiGan outer residual blocks* (that is, HifiGanResidualBlock) upsample the number of hidden channels between each outer residual block, while the UnivNetGan outer residual blocks* (UnivNetLVCBlock) keep the number of hidden channels constant.
  • Although the structures of the inner residual blocks (for UnivNet, the UnivNetLVCResidualBlock module) are similar (LReLU => dilated Conv1d => LReLU => Conv1d => skip connection), the UnivNet model uses a location variable convolutional layer followed by a gated activation unit in place of the second Conv1d layer.
  • Accordingly, each outer residual block (UnivNetLVCBlock) in UnivNet has a kernel predictor residual network (UnivNetKernelPredictor) that predicts the kernels and biases for the location variable convolutional layer in each inner residual block of the main resnet.
  • In addition to a conditioning log-mel spectrogram, UnivNet takes a noise sequence as input. The noise_waveform is the input to the "main" resnet (i.e. the stack of UnivNetLVCResidualBlocks), while the conditioning spectrogram is the input to the kernel predictor in each UnivNetLVCBlock.

(*) "Outer residual block" is a bit of a misnomer, since for both blocks in question (HifiGanResidualBlock, UnivNetLVCBlock) there's no skip connection between the input to the block and the main computation in the block.
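For reference, the gated activation unit mentioned above follows the usual tanh/sigmoid gating pattern (as in WaveNet). A minimal NumPy sketch; the split of the channel dimension into filter and gate halves is an assumption made purely for illustration, not the actual transformers layout:

```python
import numpy as np

def gated_activation(hidden_states):
    # Split the channel axis into a "filter" half and a "gate" half
    # (the exact channel layout here is an illustrative assumption).
    filter_part, gate_part = np.split(hidden_states, 2, axis=0)
    sigmoid = 1.0 / (1.0 + np.exp(-gate_part))
    return np.tanh(filter_part) * sigmoid

h = np.zeros((4, 8))  # (channels, time)
out = gated_activation(h)
assert out.shape == (2, 8)          # output has half the input channels
assert np.allclose(out, 0.0)        # tanh(0) * sigmoid(0) == 0
```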

@dg845 (Contributor, Author) commented Jul 16, 2023

Also, I'm not sure why utils/check_table.py is failing. I ran make fix-copies to create a table entry for UnivNet in docs/source/en/index.md, and then added a checkmark for PyTorch support, but for some reason check_table.py doesn't seem to like that.

@dg845 changed the title from "[WIP] Add UnivNet Vocoder Model for Tortoise TTS Diffusers Integration" to "Add UnivNet Vocoder Model for Tortoise TTS Diffusers Integration" on Jul 16, 2023
@dg845 marked this pull request as ready for review on July 16, 2023 22:24
@dg845 (Contributor, Author) commented Jul 28, 2023

> Also, I'm not sure why utils/check_table.py is failing.

utils/check_table.py is no longer failing after I merged main into the PR branch. Running make fix-copies adds an entry for UnivNet, but I'm not sure why it doesn't add a checkmark in the "PyTorch support" column; perhaps the model information is misconfigured.

@ylacombe (Contributor) left a comment

Hi @dg845,

Congratulations! This is a great start for a first PR!
Your modeling code is really clear and looks correct (at first sight) and a large part of your code is already in line with the transformers philosophy!

Here's a set of comments you should address before I do another review! Most of them are small changes.

The main thing to do here is to make sure you get the same results as the original implementation, to make sure you haven't missed something in your modeling code. I've described this in detail in my comments.

Once that is done, I will look at your modeling code in more detail to make sure everything is okay, but I think you've done most of the job here.

With regards to your last comment, I'm afraid that I'm not sure how to address it. @sanchit-gandhi or @sgugger, do you have any ideas on how to solve this?

Thanks again for this PR!

@HuggingFaceDocBuilderDev commented

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@sanchit-gandhi (Contributor) commented

Nice - I think this is a good design! Gently pinging @ArthurZucker for a final review here when you get the chance 🙌

@ArthurZucker (Collaborator) left a comment

LGTM left a few small nits but everything's here! Thanks for your hard work and this clean PR!

Comment on lines +551 to +552
# Resolve batch sizes for noise_sequence and spectrogram
spectrogram_batched = input_features.dim() == 3
@ArthurZucker (Collaborator):
Just FMI, is this part here to match the diffusers implementation?

@dg845 (Contributor, Author) commented Oct 18, 2023

I guess the purpose of this is just to make sure the spectrogram input is batched for UnivNet inference and output. Currently, UnivNetModel.forward always returns a batched waveform output for use in UnivNetFeatureExtractor.batch_decode, which will turn it into a (possibly ragged) list of unbatched waveforms. (If there's only one waveform, it will output a one-element list with that waveform.)

We could probably rewrite this without explicitly defining spectrogram_batched:

if input_features.dim() == 2:
    input_features = input_features.unsqueeze(0)

[As a note, the spectrogram input to the UnivNetModel vocoder in the Tortoise TTS diffusers pipeline should always be batched.]
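As a rough sketch of that batching check (a NumPy stand-in for the torch logic; `ensure_batched` is a hypothetical helper name, not part of the PR):

```python
import numpy as np

def ensure_batched(spectrogram):
    # If a single unbatched (seq_length, num_mels) spectrogram is passed,
    # prepend a batch dimension so downstream code can assume 3D input,
    # mirroring the input_features.dim() == 3 check discussed above.
    if spectrogram.ndim == 2:
        spectrogram = spectrogram[np.newaxis, ...]
    return spectrogram

single = np.zeros((100, 64))
batched = ensure_batched(single)
assert batched.shape == (1, 100, 64)
# An already-batched input passes through unchanged:
assert ensure_batched(batched).shape == (1, 100, 64)
```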

Comment on lines +187 to +200
kernels = kernel_hidden_states.contiguous().view(
batch_size,
self.conv_layers,
self.conv_in_channels,
self.conv_out_channels,
self.conv_kernel_size,
seq_length,
)
biases = bias_hidden_states.contiguous().view(
batch_size,
self.conv_layers,
self.conv_out_channels,
seq_length,
)
@ArthurZucker (Collaborator):

Should contiguous() not be called after view?

@dg845 (Contributor, Author):

Good catch, I think kernel_hidden_states.view(...).contiguous() makes more sense here. The original implementation calls contiguous before view (see here), but this might be a typo, since there doesn't seem to be any call before view that would warrant a contiguous call.

[Removing the contiguous call altogether doesn't seem to trigger any RuntimeError: input is not contiguous errors in any of the tests in UnivNetModelTest, but I think it's probably better to leave it in as *.view(...).contiguous().]
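For intuition on why the ordering matters, here is a minimal PyTorch sketch (unrelated to the UnivNet shapes): `.view()` requires contiguous memory, so a transposed tensor, which is a non-contiguous view of the same storage, must be made contiguous before it can be reshaped with `.view()`.

```python
import torch

x = torch.arange(6).reshape(2, 3)
t = x.t()  # transposed view: shares storage, not contiguous
assert not t.is_contiguous()

# t.view(-1) would raise a RuntimeError here because t is non-contiguous;
# .contiguous() first materializes a contiguous copy, so this works:
flat = t.contiguous().view(-1)
assert flat.tolist() == [0, 3, 1, 4, 2, 5]
```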


def batch_decode(self, waveforms, waveform_lengths=None) -> List[np.ndarray]:
r"""
Removes padding from generated audio after running [`UnivNetModel.forward`]. This returns a ragged list of 1D
@ArthurZucker (Collaborator):

Here the waveform lengths come from the model, which just summed the padding mask returned by this class.
I am guessing this is to have a seamless batch_decode(model(**input))? See my comment regarding the model not needing the masks.

@dg845 (Contributor, Author):

Yeah, the primary reason is so that

audio = model(**inputs)
audio = feature_extractor.batch_decode(**audio)

works seamlessly. The design is based on VitsModel and VitsModelOutput, although the VITS model is a transformer and thus VitsModel.forward uses the attention_mask argument in a non-trivial way. See #24799 (comment), #24799 (comment), #24799 (comment), #24799 (comment) for more discussion about the design.

It would be possible to calculate the original waveform lengths in UnivNetFeatureExtractor and have UnivNetModel.forward take a waveform_lengths argument instead, but this also feels weird since forward wouldn't do anything with waveform_lengths except output it. In my opinion having forward accept a padding_mask argument feels more natural in transformers (although it might be a bit confusing since it's not an attention mask).
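The padding removal described above can be sketched roughly as follows; `trim_padding` is a hypothetical helper for illustration, not the actual transformers implementation:

```python
import numpy as np

def trim_padding(waveforms, waveform_lengths=None):
    # waveforms: (batch_size, max_length) padded audio.
    # Returns a possibly ragged list of 1D arrays, one per batch element,
    # each truncated to its original (pre-padding) length.
    if waveform_lengths is None:
        return [waveform for waveform in waveforms]
    return [waveform[: int(length)] for waveform, length in zip(waveforms, waveform_lengths)]

batch = np.ones((2, 5))
trimmed = trim_padding(batch, waveform_lengths=[3, 5])
assert [len(w) for w in trimmed] == [3, 5]  # ragged output
```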

@dg845 (Contributor, Author) commented Oct 18, 2023

@ArthurZucker @sanchit-gandhi @ylacombe I think one thing left to resolve is where to put the UnivNet model checkpoint (currently at dg845/univnet-dev). I'm not sure which org to put the model checkpoint under since the original paper is from Kakao, but the checkpoint is from an unofficial implementation by maum.ai (see #24799 (comment), #24799 (comment)).

@ArthurZucker (Collaborator) commented

We could reach out to maum.ai to ask if they can create an org (if they don't already have one) and host the weights.

@dg845 (Contributor, Author) commented Oct 24, 2023

@ArthurZucker Sounds good! I believe that they have an org at https://huggingface.co/maum-ai.

@dg845 (Contributor, Author) commented Oct 31, 2023

Hi @ArthurZucker @sanchit-gandhi @ylacombe, is there anything I can do to help out with transferring the checkpoint weights? (As a note, the checkpoint weights are currently stored at dg845/univnet-dev [with the model card written] and this is the checkpoint identifier used e.g. in the integration tests.)

@ylacombe (Contributor) commented

Hi @dg845, I've contacted some people from maum-ai about moving the weights to their organization (no response yet)!

@dg845 (Contributor, Author) commented Nov 21, 2023

Hi @ArthurZucker @sanchit-gandhi @ylacombe, would it be possible to merge this PR? @susnato and I have made a lot of progress on the Tortoise TTS PR over at diffusers (huggingface/diffusers#4106), and it would be helpful to have this PR merged so we can test the pipeline with the UnivNet vocoder.

@ArthurZucker (Collaborator) commented

Hey both! Yeah, no problem; since they are slow to respond, let's use the current checkpoint path and merge for now!

@ArthurZucker (Collaborator) commented

Last nit is, would you mind rebasing on main to make sure you have the correct styling? 🙏🏻

@dg845 (Contributor, Author) commented Nov 22, 2023

Hi @ArthurZucker, I have rebased on main and the CI is green :).

@ArthurZucker (Collaborator) commented

Thanks a lot!

@ArthurZucker merged commit 7f6a804 into huggingface:main Nov 22, 2023