Add UnivNet Vocoder Model for Tortoise TTS Diffusers Integration #24799
ArthurZucker merged 103 commits into huggingface:main from
Conversation
Hi @dg845, if you are planning to add it to the
For now I've added the UnivNet code to the
Nice start @dg845! Yep, fine to have it as a standalone model - we have ResNet in transformers as well, which is not strictly attention-based.
Hi @sanchit-gandhi, I think the PR is ready for review. The following are the differences between the
(*) "Outer residual block" is a bit of a misnomer, since for both blocks in question (
Also, I'm not sure why
ylacombe
left a comment
Hi @dg845,
Congratulations! This is a great start for a first PR!
Your modeling code is really clear and looks correct (at first sight) and a large part of your code is already in line with the transformers philosophy!
Here's a set of comments you should address before I make another review! Most of the comments are small things to change.
The main thing to do here is to make sure you get the same results as the original implementation, to make sure you haven't missed something in your modeling code. I've described this in detail in my comments.
Once it is done, I will look more in details into your modeling code to make sure everything is okay, but I think you did most of the job here.
With regards to your last comment, I'm afraid that I'm not sure how to address it. @sanchit-gandhi or @sgugger, do you have any ideas on how to solve this?
Thanks again for this PR!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Nice - I think this is a good design! Gently pinging @ArthurZucker for a final review here when you get the chance 🙌
ArthurZucker
left a comment
There was a problem hiding this comment.
LGTM, left a few small nits, but everything's here! Thanks for your hard work and this clean PR!
```python
# Resolve batch sizes for noise_sequence and spectrogram
spectrogram_batched = input_features.dim() == 3
```
Just FMI, is this part here to match the diffusers implementation?
I guess the purpose of this is just to make sure the spectrogram input is batched for UnivNet inference and output. Currently, UnivNetModel.forward always returns a batched waveform output for use in UnivNetFeatureExtractor.batch_decode, which will turn it into a (possibly ragged) list of unbatched waveforms. (If there's only one waveform, it will output a one-element list with that waveform.)

We could probably rewrite this without explicitly defining spectrogram_batched:

```python
if input_features.dim() == 2:
    input_features = input_features.unsqueeze(0)
```

[As a note, the spectrogram input to the UnivNetModel vocoder in the Tortoise TTS diffusers pipeline should always be batched.]
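To illustrate the proposed rewrite, here is a minimal numpy sketch of the batching behavior (`ensure_batched` is a hypothetical helper name of mine, not from the PR; the actual modeling code operates on torch tensors inline):

```python
import numpy as np

def ensure_batched(input_features):
    # Hypothetical helper mirroring the proposed rewrite: promote an
    # unbatched (num_frames, num_mel_bins) spectrogram to a batch of
    # size 1, and pass already-batched 3D inputs through unchanged.
    if input_features.ndim == 2:
        input_features = input_features[None, ...]
    return input_features

unbatched = np.zeros((100, 64))   # (num_frames, num_mel_bins)
batched = ensure_batched(unbatched)
```

Either way, downstream code can then assume a leading batch dimension.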
```python
kernels = kernel_hidden_states.contiguous().view(
    batch_size,
    self.conv_layers,
    self.conv_in_channels,
    self.conv_out_channels,
    self.conv_kernel_size,
    seq_length,
)
biases = bias_hidden_states.contiguous().view(
    batch_size,
    self.conv_layers,
    self.conv_out_channels,
    seq_length,
)
```
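For reference, the shapes involved in this view can be sketched with toy dimensions (all numbers below are illustrative, not the real UnivNet config values):

```python
import numpy as np

# Toy dimensions for illustration only
batch_size, conv_layers = 2, 3
conv_in_channels, conv_out_channels = 8, 16
conv_kernel_size, seq_length = 5, 10

# The kernel predictor emits one flat channel axis per timestep; the
# view just unpacks it into per-layer, per-position conv kernels.
flat_channels = conv_layers * conv_in_channels * conv_out_channels * conv_kernel_size
kernel_hidden_states = np.zeros((batch_size, flat_channels, seq_length))
kernels = kernel_hidden_states.reshape(
    batch_size, conv_layers, conv_in_channels,
    conv_out_channels, conv_kernel_size, seq_length,
)
```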
Should contiguous() not be called after view?
Good catch, I think kernel_hidden_states.view(...).contiguous() makes more sense here. The original implementation calls contiguous before view (see here), but this might be a typo, since there doesn't seem to be any call before view that would warrant a contiguous call.
[Removing the contiguous call altogether doesn't seem to trigger any RuntimeError: input is not contiguous errors in any of the tests in UnivNetModelTest, but I think it's probably better to leave it in as *.view(...).contiguous().]
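As a side note on the ordering: `view` requires compatibly strided (in practice, contiguous) input, so `.contiguous()` before `.view(...)` guards against non-contiguous tensors, while `.view(...).contiguous()` only succeeds if the input was already viewable. A small sketch of the failure mode:

```python
import torch

# A transposed tensor is not contiguous, so .view() on it raises,
# while .contiguous().view() copies into contiguous memory first.
x = torch.arange(6).reshape(2, 3).transpose(0, 1)
assert not x.is_contiguous()

try:
    x.view(-1)  # fails: strides incompatible with a flat view
    view_raised = False
except RuntimeError:
    view_raised = True

flattened = x.contiguous().view(-1)
```

On an already-contiguous input (as seems to be the case here), `.contiguous()` is a no-op either way, which is consistent with dropping it not breaking any tests.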
```python
def batch_decode(self, waveforms, waveform_lengths=None) -> List[np.ndarray]:
    r"""
    Removes padding from generated audio after running [`UnivNetModel.forward`]. This returns a ragged list of 1D
```
Here the waveform lengths come from the model, which just summed the padding mask returned by this class. I am guessing this is to have a seamless batch_decode(model(**input))? See my comment regarding the model not needing the masks.
Yeah, the primary reason is so that

```python
audio = model(**inputs)
audio = feature_extractor.batch_decode(**audio)
```

works seamlessly. The design is based on VitsModel and VitsModelOutput, although the VITS model is a transformer and thus VitsModel.forward uses the attention_mask argument in a non-trivial way. See #24799 (comment), #24799 (comment), #24799 (comment), #24799 (comment) for more discussion about the design.
It would be possible to calculate the original waveform lengths in UnivNetFeatureExtractor and have UnivNetModel.forward take a waveform_lengths argument instead, but this also feels weird since forward wouldn't do anything with waveform_lengths except output it. In my opinion having forward accept a padding_mask argument feels more natural in transformers (although it might be a bit confusing since it's not an attention mask).
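To make the trade-off concrete, here is a minimal numpy sketch of the design being discussed (the hop length and mask values are made up, and `trim_batch` is a hypothetical stand-in for what `batch_decode` does with the waveform lengths):

```python
import numpy as np

hop_length = 256  # assumed value, not the real feature extractor's

# padding_mask marks real (1) vs padded (0) spectrogram frames
padding_mask = np.array([[1, 1, 1, 1, 0, 0],
                         [1, 1, 1, 1, 1, 1]])

# The model can recover per-sample waveform lengths just by summing
# the mask and scaling by the hop length.
waveform_lengths = padding_mask.sum(axis=1) * hop_length

def trim_batch(waveforms, waveform_lengths):
    # Hypothetical stand-in for batch_decode: turn a padded, batched
    # waveform into a ragged list of 1D arrays.
    return [w[:l] for w, l in zip(waveforms, waveform_lengths)]

waveforms = np.zeros((2, padding_mask.shape[1] * hop_length))
ragged = trim_batch(waveforms, waveform_lengths)
```

Under this design the mask summation lives in the model's forward pass, so the feature extractor never needs to know the padded length; the alternative would move the sum into the feature extractor and thread `waveform_lengths` through forward unused.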
@ArthurZucker @sanchit-gandhi @ylacombe I think one thing left to resolve is where to put the UnivNet model checkpoint (currently at
We could reach out to maum.ai to ask them if they can create an org (if not already) and host the weights.
@ArthurZucker Sounds good! I believe that they have an org at https://huggingface.co/maum-ai.
Hi @ArthurZucker @sanchit-gandhi @ylacombe, is there anything I can do to help out with transferring the checkpoint weights? (As a note, the checkpoint weights are currently stored at
Hi @dg845, I've contacted some people from maum-ai to move the weights to their organization (without any response yet)!
Hi @ArthurZucker @sanchit-gandhi @ylacombe, would it be possible to merge this PR? @susnato and I have made a lot of progress on the tortoise-tts PR over at
Hey both! Yeah, no problem, let's use the current path for the checkpoints and merge for now, as they are slow to respond!
Last nit: would you mind rebasing on main to make sure you have the correct styling? 🙏🏻
Hi @ArthurZucker, I have rebased on
Thanks a lot!
What does this PR do?
This PR adds the UnivNet GAN vocoder model (paper, code) to transformers. UnivNet is the vocoder used in the Tortoise TTS text-to-speech model (paper, code), which is currently being integrated into diffusers. See this issue in diffusers.
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@sanchit-gandhi
@susnato