Add CLVP #24745

Merged
amyeroberts merged 93 commits into huggingface:main from susnato:clvp
Nov 10, 2023

Conversation

@susnato
Contributor

@susnato susnato commented Jul 11, 2023

What does this PR do?

Adds CLVP, which is an integral part of tortoise-tts.
Required for tortoise-tts integration in HF diffusers (please see this issue).

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@susnato
Contributor Author

susnato commented Jul 11, 2023

cc: @sanchit-gandhi, @dg845

@sanchit-gandhi
Contributor

Very cool @susnato! Let me know if you have any questions / queries - more than happy to lend a hand here!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@susnato susnato changed the title [WIP] Add CLVP Add CLVP Jul 14, 2023
@susnato
Contributor Author

susnato commented Jul 14, 2023

Hi @sanchit-gandhi, this PR is ready for review!

Some notes related to this design I wanted to mention -

  1. Although CLIP has both a tokenizer and an image processor, in tortoise the text is encoded and pushed into both the autoregressive model and the CLVP model, so I think it's better to have only one tokenizer (for tortoise), encoding the text once and pushing it to both models, rather than defining a separate tokenization_clvp.py.
  2. Instead of processing images and checking which image fits the text description best, CLVP compares speech token candidates against the text. The speech tokens come from the output of the autoregressive model itself, so we don't need the image processor either!
  3. CLVP uses Rotary Position Embeddings.
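Point 3 is worth unpacking: rotary position embeddings inject position by rotating each pair of query/key channels through a position-dependent angle, instead of adding a learned position vector. A minimal numpy sketch of the idea (illustrative only, not this PR's actual implementation):

```python
import numpy as np

def build_rotary_cache(seq_len, dim, base=10000.0):
    # One frequency per channel pair, as in the RoFormer formulation.
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    angles = np.outer(np.arange(seq_len), inv_freq)  # (seq_len, dim // 2)
    return np.cos(angles), np.sin(angles)

def apply_rotary(x, cos, sin):
    # x: (seq_len, dim) -- rotate each channel pair by its position's angle.
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because the rotation angle depends only on the absolute position, the dot product between two rotated vectors depends only on their relative distance, which is what makes the scheme attractive for attention.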

@susnato susnato marked this pull request as ready for review July 14, 2023 21:01
Contributor

@sanchit-gandhi sanchit-gandhi left a comment

Thanks for working on this @susnato! The composite models are looking very clean, nice job. We can strip out any redundant code and config attributes (see comments below), as well as modules that are simply repeats. Otherwise the modelling code is nearly there!

Now regarding the pre-processing, I don't fully agree with the design decision not to include a tokenizer / feature extractor for this model: while this model is being used as part of a larger pipeline in diffusers, it should still work standalone in the transformers library. We can't really add a model to the library that doesn't fully work just for the sake of re-using the code in another library, since this breaks the transformers library. I'm sure we'll still need the tokenizer and feature extractor to some extent in the diffusers pipeline, so it's probably best to address them now.

@susnato susnato force-pushed the clvp branch 2 times, most recently from 8dd36ff to 06edb17 on August 4, 2023 19:40
@susnato
Contributor Author

susnato commented Aug 5, 2023

Hello @sanchit-gandhi I have -

  • pushed the changes that you asked for.
  • implemented the tokenizer and its tests.

For the Feature Extractor I wanted to ask some things before going forward -

CLVP compares the text embeddings generated from the tokenizer's output with the latent speech embeddings generated by an autoregressive model (here GPT-2). So IMO we need a feature extractor which generates latent embeddings; in other words, the feature extractor must take the tokenizer's outputs, apply a GPT-2 model to them, and then give us the output. But I don't think that complies with the design of the library. (I don't know of any feature extractor which uses another transformer model for processing. :( )

In CLIP it was reasonable to use a feature extractor to just read images, but I don't think it will be that straightforward in this case.

So how should we design the Feature Extractor and what are your thoughts on this?

@susnato susnato requested a review from sanchit-gandhi August 5, 2023 15:34
Contributor

@ylacombe ylacombe left a comment

Hey @susnato,

I'm taking over from @sanchit-gandhi for the rest of the process.

First of all, many thanks for the changes! It is definitely looking good and I think we might be near the end!

I'm mostly addressing your questions regarding the Feature Extractor in this review, and I'm also making some small overall comments.

I think the key to your FE question lies in this extract from the Tortoise TTS paper, section 2.2.1:

A unique design choice made with TorToise is an additional input which is provided to both the autoregressive generator and the DDPM, which I term the speech conditioning input.

The speech conditioning input starts as one or more audio clips of the same speaker as the target. These clips are converted to MEL spectrograms and fed through an encoder consisting of a stack of self-attention layers. The autoregressive generator and the DDPM have their own conditioning encoders, both of which are learned alongside their respective networks.

If you take a step back and think about the purpose of tokenizers and feature extractors, the idea is to preprocess inputs into a format that can be understood by your model. E.g., as you said for the CLIP feature extractor, the idea is to read images into a format that CLIP can understand.

In the highlighted sentence in bold, you can see that:

  1. The feature extractor can simply convert the input audio waveforms into MEL spectrograms, which I think was done by format_conditioning in the tortoise TTS repo (https://github.com/neonbjb/tortoise-tts/blob/3c4d9c51316cd2421cc2dea11ac3a7a2d3394acd/tortoise/api.py#L103-L115)
  2. The encoder which processes the MEL spectrograms is IMO an integral part of the CLVP model, and thus should be in the modeling code.

This is quite similar to what has been done in the SpeechT5 feature extractor, which processes audio in order to feed it to SpeechT5SpeechEncoderPrenet.
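To make point 1 above concrete, a feature extractor of that kind essentially reduces to a windowed STFT, a triangular mel filterbank, and a log. The sketch below is a rough numpy approximation; the n_fft / hop / n_mels / sampling-rate values are placeholders, not the parameters Tortoise actually uses:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(wave, sr=22050, n_fft=1024, hop=256, n_mels=80):
    # Frame the signal and take the magnitude STFT.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (frames, n_fft//2 + 1)

    # Triangular mel filterbank between 0 Hz and Nyquist.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fb[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[m - 1, k] = (r - k) / max(r - c, 1)

    mel = power @ fb.T                         # (frames, n_mels)
    return np.log(np.clip(mel, 1e-5, None)).T  # (n_mels, frames)
```

A real implementation would also handle padding, normalization and batching, which is exactly the kind of bookkeeping a transformers FeatureExtractor class wraps up.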

Of course don't hesitate to reach out to me if you have additional questions or doubts!

@susnato
Contributor Author

susnato commented Aug 14, 2023

Hi @ylacombe thanks a lot for the guidance!

I have a small doubt/question about the prenet part -
I believe that we must add an autoregressive model in the prenet after the MEL encoder, because

  • The speech inputs (mel spectrograms) will be passed to the prenet and then to the speech transformer (feature extractor -> prenet -> speech transformer), and since the speech transformer has an embedding layer at the front, it will give us an error if we directly pass the outputs of the mel encoder to it. So we must use the autoregressive model to convert the outputs of the mel encoder into codes and then pass those to the speech transformer.
  • Also, the CLVP speech transformer was trained to differentiate between text tokens and the codes generated by a GPT-2, so it only makes sense to have the GPT-2 model in the pipeline.

So to wrap it up I would suggest something like -

from torch import nn
from transformers import GPT2LMHeadModel    # the autoregressive model

class CLVPPrenet(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.mel_encoder = CLVPMelEncoder(cfg)
        self.autoreg = GPT2LMHeadModel(cfg)

    def forward(self, voice_mel_spectrograms, text_tokens):
        x = self.mel_encoder(voice_mel_spectrograms)
        # similar to this line - https://github.com/neonbjb/tortoise-tts/blob/3c4d9c51316cd2421cc2dea11ac3a7a2d3394acd/tortoise/api.py#L416
        codes = self.autoreg.generate(x, text_tokens)

        return codes

And the CLVPFeatureExtractor will only care about outputting mel spectrograms as you said.

Please let me know what you think of this design.

@ylacombe
Contributor

Hi @susnato,
Indeed, it makes sense to add all the submodels that are used by the model, so I agree with adding them to your modeling code.

One note though: you shouldn't mix forward and generate in your forward pass, as this mixes two different concepts, and users might want to pass some specific generate kwargs when generating.

A possible solution is to bypass CLVPPrenet and directly use CLVPMelEncoder and GPT2LMHeadModel in your final model (and, at the end of the day, use them during the final model.generate).
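That suggestion can be sketched as follows. Everything here is hypothetical (names included), and the submodels are replaced by trivial dummies so that only the composition pattern remains: the pieces sit side by side on the model and are chained inside generate(), never inside forward():

```python
class DummyMelEncoder:
    def __call__(self, mels):
        # Collapse the time axis of each (n_mels x frames) spectrogram.
        return [[sum(row) / len(row) for row in mel] for mel in mels]

class DummyAutoregressiveLM:
    def generate(self, conditioning, text_tokens, **generate_kwargs):
        # Pretend to emit max_new_tokens speech tokens per sample.
        max_new_tokens = generate_kwargs.get("max_new_tokens", 5)
        return [[0] * max_new_tokens for _ in conditioning]

class DummySpeechEncoder:
    def __call__(self, speech_ids):
        return [[float(t) for t in ids] for ids in speech_ids]

class ClvpComposite:
    def __init__(self):
        self.mel_encoder = DummyMelEncoder()
        self.autoregressive_lm = DummyAutoregressiveLM()
        self.speech_encoder = DummySpeechEncoder()

    def generate(self, mels, text_tokens, **generate_kwargs):
        conditioning = self.mel_encoder(mels)
        # generate kwargs flow straight through to the inner generate call,
        # so callers keep full control over decoding.
        speech_ids = self.autoregressive_lm.generate(
            conditioning, text_tokens, **generate_kwargs)
        return self.speech_encoder(speech_ids)
```

The design point is that no forward() ever calls generate(): the outer generate() orchestrates the three submodels, so each one remains independently usable and testable.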

Two last things to consider:

  1. you should check that the gpt2 model of the original code indeed corresponds to our GPT2LMHeadModel before using it
  2. If it corresponds, you might want to load it with the AutoModel class (which will load GPT2LMHeadModel if the config is correctly set!)

Let me know if you have more questions!

@susnato
Contributor Author

susnato commented Aug 28, 2023

Hi @ylacombe, first of all, apologies for the huge delay. I have pushed some major changes -

  • Added the feature extractor and its tests
  • Added the processor and its tests
  • Added CLVPAutoRegressiveLMHeadModel, which is just a GPT2LMHeadModel but with changes to make sure that the outputs are the same
  • Added a generate() method to CLVPModel which calls CLVPAutoRegressiveLMHeadModel's generate method to get speech_ids and then processes them using the speech model
  • Reworked the tests

@susnato susnato requested a review from ylacombe August 28, 2023 13:07
Contributor

@ylacombe ylacombe left a comment

Hi @susnato,
Many thanks for the huge changes you've made; the tests are definitely better! And they all pass, so that's really great!

However, there are still a few things to change that you'll find in my review. Here is a summary of the main points:

  1. I don't think the CLVPConfig.__init__ code should be that long and complex. See BarkConfig for an example of a nested config that doesn't use such if/else cases!
  2. Your modeling code should respect transformers' single model file policy, so you should integrate GPT2Block into your modeling code instead of importing it.
  3. I believe that you can use a single CLVPAttention class instead of two classes with two different positional embeddings (relative and rotary). A beautiful example of such a class is Wav2Vec2ConformerSelfAttention. This would be especially useful since your future CLVPBlock (GPT2Block) will also use attention.

I believe that, except for those points, you are near the end of this PR, congrats!
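On point 3, the suggested pattern looks roughly like this hypothetical numpy sketch: a single attention class that picks its positional scheme from a config flag, similar in spirit to Wav2Vec2ConformerSelfAttention (the class and flag names here are made up):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class UnifiedSelfAttention:
    """One attention class whose positional scheme is a config choice."""

    def __init__(self, dim, position_embedding_type="rotary"):
        assert position_embedding_type in ("rotary", "relative", "none")
        self.dim = dim
        self.position_embedding_type = position_embedding_type

    def _apply_rotary(self, x):
        # Rotate channel pairs by position-dependent angles.
        seq_len, dim = x.shape
        inv_freq = 1.0 / (10000.0 ** (np.arange(0, dim, 2) / dim))
        ang = np.outer(np.arange(seq_len), inv_freq)
        cos, sin = np.cos(ang), np.sin(ang)
        out = np.empty_like(x)
        out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
        out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
        return out

    def __call__(self, q, k, v, rel_bias=None):
        if self.position_embedding_type == "rotary":
            q, k = self._apply_rotary(q), self._apply_rotary(k)
        scores = q @ k.T / np.sqrt(self.dim)
        if self.position_embedding_type == "relative" and rel_bias is not None:
            scores = scores + rel_bias  # learned relative-position bias
        return softmax(scores) @ v
```

The shared machinery (projections, scaling, softmax) is written once, and only the position-handling branches differ, which is what makes a single class preferable to two near-duplicates.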

@susnato
Contributor Author

susnato commented Sep 6, 2023

Hi @ylacombe, I have addressed all of your comments except these 3 -

Please review and let me know whether more changes are needed other than those 3.

@susnato susnato requested a review from ylacombe September 6, 2023 22:19
Contributor

@ylacombe ylacombe left a comment

Hey @susnato, thanks for addressing my comments, it's a really nice PR at the moment.

Let's gently ping @ArthurZucker, @amyeroberts and @sanchit-gandhi for their help here and here.

I also think that @ArthurZucker is best placed to answer your questions on the fast tokenizer here!

Anyway, I left some small comments, mostly nits. My last request would be to verify that your model also works when passing multiple audio inputs, since I only see code snippets with mostly one sample. Does batching work correctly with CLVP?

I will approve once these few questions are answered, great work!

Collaborator

@ArthurZucker ArthurZucker left a comment

Thanks for working on this!
Most of the work left is on the modeling / conversion; the rest is looking good.

@susnato
Contributor Author

susnato commented Sep 14, 2023

Hi @ylacombe, I have pushed the changes you asked for in your last review and answered the questions in their respective threads.

Anyway, I left some small comments, mostly nits. My last request would be to verify that your model also works when passing multiple audio inputs, since I only see code snippets with mostly one sample. Does batching work correctly with CLVP?

As I see it, there are 4 possible batching scenarios:

  • 1 Text vs N Audios
    Here we just repeat the text tokens N times, so that we generate different responses to the same text with different voices.
  • N Texts vs 1 Audio
    Here we just repeat the audio N times, so that we generate responses for multiple texts with the same voice.
  • N Texts vs N Audios
    Here we generate the responses pairwise, so that the 2nd response is generated using the 2nd text and the 2nd audio.
  • N Texts vs M Audios
    This is where it gets messy. We throw a ValueError here because we don't know which text corresponds to which audio. We could improve this by considering that each text corresponds to every audio, generating N*M speech candidates and comparing them with the N texts.

So, yes, batching works for CLVP except in the last scenario (as of now).
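The first three scenarios reduce to a simple broadcasting rule. A hypothetical helper (broadcast_batch is not part of the PR) that mirrors the description above:

```python
def broadcast_batch(texts, audios):
    """Pair up text and audio batches following simple broadcasting rules."""
    if len(texts) == len(audios):
        return texts, audios                # N texts vs N audios: pair them up
    if len(texts) == 1:
        return texts * len(audios), audios  # 1 text vs N audios: repeat the text
    if len(audios) == 1:
        return texts, audios * len(texts)   # N texts vs 1 audio: repeat the audio
    raise ValueError(
        f"Cannot match {len(texts)} texts with {len(audios)} audios; "
        "pass equal-length batches or a single text/audio to broadcast.")
```

Under this rule, supporting the N-vs-M scenario later would mean replacing the ValueError branch with a cross product of texts and audios.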

@susnato
Contributor Author

susnato commented Sep 15, 2023

Hi @ArthurZucker, just to be clear on the naming part -
We will have 3 configs - ClvpTextAndSpeechConfig, ClvpDecoderConfig, and ClvpConfig to bind them.

For the models, we will have ClvpTextAndSpeechModel (which is the CLVPTransformerWithProjection), ClvpDecoder, ClvpDecoderLMHead, and we will keep ClvpModel as it is.

Do you approve of these names?

@ArthurZucker
Collaborator

Let's forget about the TextAndSpeech naming; it's just a general transformer. We could just have ClvpEncoderConfig (an encoder that can be used for both the speech and the text encoder), ClvpDecoderConfig, and ClvpConfig.
Regarding the modeling: ClvpEncoder, ClvpDecoder, ClvpPretrainedModel, ClvpModel and ClvpModelForConditionalGeneration. You should really have a look at MusicGen's modeling code 😉

@ylacombe ylacombe self-requested a review September 19, 2023 11:11
@susnato
Contributor Author

susnato commented Nov 9, 2023

Hi @ArthurZucker, @amyeroberts I just rebased and the CI is green now!

@amyeroberts
Contributor

@susnato 🙌

Thanks for your patience regarding the failing CI and your work adding the model!

@amyeroberts amyeroberts merged commit 7e9f10a into huggingface:main Nov 10, 2023
EduardoPach pushed a commit to EduardoPach/transformers that referenced this pull request Nov 19, 2023
* init commit

* attention arch done except rotary emb

* rotary emb done

* text encoder working

* outputs matching

* arch first pass done

* make commands done, tests and docs remaining

* all tests passed, only docs remaining

* docs done

* doc-builder fix

* convert script removed(not relevant)

* minor comments done

* added ckpt conversion script

* tokenizer done

* very minor fix of index.md 2

* mostly make fixup related

* all done except fe and rotary emb

* very small change

* removed unidecode dependency

* style changes

* tokenizer removed require_backends

* added require_inflect to tokenizer tests

* removed VOCAB_FILES in tokenizer test

* inflect dependency removed

* added rotary pos emb cache and simplified the apply method

* style

* little doc change

* more comments

* feature extractor added

* added processor

* auto-regressive config added

* added CLVPConditioningEncoder

* comments done except the test one

* weights added successfully (NOT tested)

* tokenizer fix with numbers

* generate outputs matching

* almost tests passing Integ tests not written

* Integ tests added

* major CUDA error fixed

* docs done

* rebase and multiple fixes

* fixed rebase overwrites

* generate code simplified and tests for AutoRegressive model added

* minor changes

* refactored gpt2 code in clvp file

* weights done and all code refactored

* mostly done except the fast_tokenizer

* doc test fix

* config file's doc fixes

* more config fix

* more comments

* tokenizer comments mostly done

* modeling file mostly refactored and can load modules

* ClvpEncoder tested

* ClvpDecoder, ClvpModel and ClvpForCausalLM tested

* integration and all tests passed

* more fixes

* docs almost done

* ckpt conversion refactored

* style and some failing tests fix

* comments

* temporary output fix but test_assisted_decoding_matches_greedy_search test fails

* majority changes done

* use_cache outputs same now! Along with the assisted_greedy_decoding test fix

* more comments

* more comments

* prepare_inputs_for_generation fixed and _prepare_model_inputs added

* style fix

* clvp.md change

* moved clvpconditionalencoder norms

* add model to new index

* added tokenizer input_ids_with_special_tokens

* small fix

* config mostly done

* added config-tester and changed conversion script

* more comments

* comments

* style fix

* some comments

* tokenizer changed back to prev state

* small comments

* added output hidden states for the main model

* style fix

* comments

* small change

* revert small change

* .

* Update clvp.md

* Update test_modeling_clvp.py

* :)

* some minor change

* new fixes

* remove to_dict from FE