Add CLVP #24745

Merged
amyeroberts merged 93 commits into huggingface:main from susnato:clvp
Nov 10, 2023

Conversation

@susnato
Contributor

@susnato susnato commented Jul 11, 2023

What does this PR do?

Adds CLVP, which is an integral part of tortoise-tts.
Required for tortoise-tts integration in HF diffusers (please see this issue).

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@susnato
Contributor Author

susnato commented Jul 11, 2023

cc: @sanchit-gandhi, @dg845

@sanchit-gandhi
Contributor

Very cool @susnato! Let me know if you have any questions / queries - more than happy to lend a hand here!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@susnato susnato changed the title [WIP] Add CLVP Add CLVP Jul 14, 2023
@susnato
Contributor Author

susnato commented Jul 14, 2023

Hi @sanchit-gandhi, this PR is ready for review!

Some notes related to this design I wanted to mention -

  1. Although CLIP has both a tokenizer and an image processor, in tortoise the text is encoded and pushed into both the autoregressive model and the CLVP model, so I think it's better to have only one tokenizer (for tortoise), encoding the text once and pushing it to both models, rather than defining a separate tokenization_clvp.py.
  2. Instead of processing images and checking which image fits the text description best, CLVP compares speech token candidates against the text. The speech tokens come from the output of the autoregressive model itself, so we don't need the image processor either!
  3. CLVP uses Rotary Position Embeddings.
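Point 3 is worth unpacking: rotary position embeddings inject position by rotating each pair of query/key channels through a position-dependent angle, instead of adding a learned position vector. A minimal numpy sketch of the idea (illustrative only, not this PR's actual implementation):

```python
import numpy as np

def build_rotary_cache(seq_len, dim, base=10000.0):
    # One frequency per channel pair, as in the RoFormer formulation.
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    angles = np.outer(np.arange(seq_len), inv_freq)  # (seq_len, dim // 2)
    return np.cos(angles), np.sin(angles)

def apply_rotary(x, cos, sin):
    # x: (seq_len, dim) -- rotate each channel pair by its position's angle.
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because the rotation angle depends only on the absolute position, the dot product between two rotated vectors depends only on their relative distance, which is what makes the scheme attractive for attention.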

@susnato susnato marked this pull request as ready for review July 14, 2023 21:01
Contributor

@sanchit-gandhi sanchit-gandhi left a comment

Thanks for working on this @susnato! The composite models are looking very clean, nice job. We can strip out any redundant code and config attributes (see comments below), as well as modules that are simply repeats. Otherwise the modelling code is nearly there!

Now regarding the pre-processing, I don't fully agree with the design decision not to include a tokenizer / feature extractor for this model: while this model is being used as part of a larger pipeline in diffusers, it should still work standalone in the transformers library. We can't really add a model to the library that doesn't fully work just for the sake of re-using the code in another library, since this breaks the transformers library. I'm sure we'll still need the tokenizer and feature extractor to some extent in the diffusers pipeline, so it's probably best to address them now.

@susnato susnato force-pushed the clvp branch 2 times, most recently from 8dd36ff to 06edb17 on August 4, 2023 19:40
@susnato
Contributor Author

susnato commented Aug 5, 2023

Hello @sanchit-gandhi I have -

  • pushed the changes that you asked for.
  • implemented the tokenizer and its tests.

For the Feature Extractor I wanted to ask some things before going forward -

CLVP compares the text embeddings generated from the tokenizer's output with the latent speech embeddings generated by an autoregressive model (here GPT-2). So IMO we need a feature extractor which generates latent embeddings; in other words, the feature extractor must take the tokenizer's outputs, apply a GPT-2 model to them, and then give us the output. But I don't think that complies with the design of the library. (I don't know of any feature extractor which uses another transformer model for processing. :( )

In CLIP it was reasonable to use a feature extractor to just read images, but I don't think it will be that straightforward in this case.

So how should we design the Feature Extractor and what are your thoughts on this?

@susnato susnato requested a review from sanchit-gandhi August 5, 2023 15:34
Contributor

@ylacombe ylacombe left a comment

Hey @susnato,

I'm taking over from @sanchit-gandhi for the rest of the process.

First of all, many thanks for the changes! It is definitely looking good and I think we might be near the end!

I'm mostly addressing your questions regarding the Feature Extractor in this review, and I'm also making some small overall comments.

I think the key to your FE question lies in this extract from the Tortoise TTS paper, section 2.2.1:

A unique design choice made with TorToise is an additional input which is provided to both the autoregressive generator and the DDPM, which I term the speech conditioning input.

The speech conditioning input starts as one or more audio clips of the same speaker as the target. These clips are converted to MEL spectrograms and fed through an encoder consisting of a stack of self-attention layers. The autoregressive generator and the DDPM have their own conditioning encoders, both of which are learned alongside their respective networks.

If you take a step back and think about the purpose of tokenizers and feature extractors, the idea is to preprocess inputs into a format that can be understood by your model. E.g., as you said for the CLIP feature extractor, the idea is to read images into a format that CLIP can understand.

In the highlighted sentence in bold, you can see that:

  1. The feature extractor can simply convert the input audio waveforms into MEL spectrograms, which I think was done by format_conditioning in the tortoise TTS repo (https://github.com/neonbjb/tortoise-tts/blob/3c4d9c51316cd2421cc2dea11ac3a7a2d3394acd/tortoise/api.py#L103-L115)
  2. The encoder which processes the MEL spectrograms is IMO an integral part of the CLVP model, and thus should be in the modeling code.

This is quite similar to what has been done in the SpeechT5 feature extractor, which processes audio in order to feed it to SpeechT5SpeechEncoderPrenet.
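To make point 1 above concrete, a feature extractor of that kind essentially reduces to a windowed STFT, a triangular mel filterbank, and a log. The sketch below is a rough numpy approximation; the n_fft / hop / n_mels / sampling-rate values are placeholders, not the parameters Tortoise actually uses:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(wave, sr=22050, n_fft=1024, hop=256, n_mels=80):
    # Frame the signal and take the magnitude STFT.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (frames, n_fft//2 + 1)

    # Triangular mel filterbank between 0 Hz and Nyquist.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fb[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[m - 1, k] = (r - k) / max(r - c, 1)

    mel = power @ fb.T                         # (frames, n_mels)
    return np.log(np.clip(mel, 1e-5, None)).T  # (n_mels, frames)
```

A real implementation would also handle padding, normalization and batching, which is exactly the kind of bookkeeping a transformers FeatureExtractor class wraps up.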

Of course don't hesitate to reach out to me if you have additional questions or doubts!

@susnato
Contributor Author

susnato commented Aug 14, 2023

Hi @ylacombe thanks a lot for the guidance!

I have a small doubt/question about the prenet part -
I believe that we must add an autoregressive model in the prenet after the MEL encoder, because

  • The speech inputs (mel spectrograms) will be passed to the prenet and then to the speech transformer (feature extractor -> prenet -> speech transformer), and since the speech transformer has an embedding layer at the front, it will give us an error if we directly pass the outputs of the mel encoder to it. So we must use the autoregressive model to convert the outputs of the mel encoder into codes and then pass those to the speech transformer.
  • Also, the CLVP speech transformer was trained to differentiate between text tokens and the codes generated by a GPT-2, so it only makes sense to have the GPT-2 model in the pipeline.

So to wrap it up I would suggest something like -

from torch import nn
from transformers import GPT2LMHeadModel    # the autoregressive model

class CLVPPrenet(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.mel_encoder = CLVPMelEncoder(cfg)
        self.autoreg = GPT2LMHeadModel(cfg)

    def forward(self, voice_mel_spectrograms, text_tokens):
        x = self.mel_encoder(voice_mel_spectrograms)
        # similar to this line - https://github.com/neonbjb/tortoise-tts/blob/3c4d9c51316cd2421cc2dea11ac3a7a2d3394acd/tortoise/api.py#L416
        codes = self.autoreg.generate(x, text_tokens)

        return codes

And the CLVPFeatureExtractor will only care about outputting mel spectrograms as you said.

Please let me know what you think of this design.

@ylacombe
Contributor

Hi @susnato,
Indeed, it makes sense to add all the submodels that are used by the model, so I agree with adding them to your modeling code.

One note though: you shouldn't mix forward and generate in your forward pass, as this mixes two different concepts, and users might want to pass some specific generate kwargs when generating.

A possible solution is to bypass CLVPPrenet and directly use CLVPMelEncoder and GPT2LMHeadModel in your final model (and, at the end of the day, use them during the final model.generate).
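That suggestion can be sketched as follows. Everything here is hypothetical (names included), and the submodels are replaced by trivial dummies so that only the composition pattern remains: the pieces sit side by side on the model and are chained inside generate(), never inside forward():

```python
class DummyMelEncoder:
    def __call__(self, mels):
        # Collapse the time axis of each (n_mels x frames) spectrogram.
        return [[sum(row) / len(row) for row in mel] for mel in mels]

class DummyAutoregressiveLM:
    def generate(self, conditioning, text_tokens, **generate_kwargs):
        # Pretend to emit max_new_tokens speech tokens per sample.
        max_new_tokens = generate_kwargs.get("max_new_tokens", 5)
        return [[0] * max_new_tokens for _ in conditioning]

class DummySpeechEncoder:
    def __call__(self, speech_ids):
        return [[float(t) for t in ids] for ids in speech_ids]

class ClvpComposite:
    def __init__(self):
        self.mel_encoder = DummyMelEncoder()
        self.autoregressive_lm = DummyAutoregressiveLM()
        self.speech_encoder = DummySpeechEncoder()

    def generate(self, mels, text_tokens, **generate_kwargs):
        conditioning = self.mel_encoder(mels)
        # generate kwargs flow straight through to the inner generate call,
        # so callers keep full control over decoding.
        speech_ids = self.autoregressive_lm.generate(
            conditioning, text_tokens, **generate_kwargs)
        return self.speech_encoder(speech_ids)
```

The design point is that no forward() ever calls generate(): the outer generate() orchestrates the three submodels, so each one remains independently usable and testable.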

Two last things to consider:

  1. you should check that the gpt2 model of the original code indeed corresponds to our GPT2LMHeadModel before using it
  2. If it corresponds, you might want to load it with the AutoModel class (which will load GPT2LMHeadModel if the config is correctly set!)

Let me know if you have more questions!

@susnato
Contributor Author

susnato commented Aug 28, 2023

Hi @ylacombe, first of all, apologies for the huge delay. I have pushed some major changes -

  • Added the feature extractor and its tests
  • Added the processor and its tests
  • Added CLVPAutoRegressiveLMHeadModel, which is just a GPT2LMHeadModel but with changes to make sure that the outputs are the same
  • Added a generate() method to CLVPModel which calls CLVPAutoRegressiveLMHeadModel's generate method to get speech_ids and then processes them using the speech model
  • Reworked the tests

@susnato susnato requested a review from ylacombe August 28, 2023 13:07
Contributor

@ylacombe ylacombe left a comment

Hi @susnato,
Many thanks for the huge changes you've made; the tests are definitely better! And they all pass, so that's really great!

However, there are still a few things to change that you'll find in my review. Here is a summary of the main points:

  1. I don't think the CLVPConfig.__init__ code should be that long and complex. See BarkConfig for an example of a nested config that doesn't use such if/else cases!
  2. Your modeling code should respect transformers' single model file policy, so you should integrate GPT2Block into your modeling code instead of importing it.
  3. I believe that you can use a single CLVPAttention class instead of two classes with two different positional embeddings (relative and rotary). A beautiful example of such a class is Wav2Vec2ConformerSelfAttention. This would be especially useful since your future CLVPBlock (GPT2Block) will also use attention.

I believe that, except for those points, you are near the end of this PR, congrats!
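On point 3, the suggested pattern looks roughly like this hypothetical numpy sketch: a single attention class that picks its positional scheme from a config flag, similar in spirit to Wav2Vec2ConformerSelfAttention (the class and flag names here are made up):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class UnifiedSelfAttention:
    """One attention class whose positional scheme is a config choice."""

    def __init__(self, dim, position_embedding_type="rotary"):
        assert position_embedding_type in ("rotary", "relative", "none")
        self.dim = dim
        self.position_embedding_type = position_embedding_type

    def _apply_rotary(self, x):
        # Rotate channel pairs by position-dependent angles.
        seq_len, dim = x.shape
        inv_freq = 1.0 / (10000.0 ** (np.arange(0, dim, 2) / dim))
        ang = np.outer(np.arange(seq_len), inv_freq)
        cos, sin = np.cos(ang), np.sin(ang)
        out = np.empty_like(x)
        out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
        out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
        return out

    def __call__(self, q, k, v, rel_bias=None):
        if self.position_embedding_type == "rotary":
            q, k = self._apply_rotary(q), self._apply_rotary(k)
        scores = q @ k.T / np.sqrt(self.dim)
        if self.position_embedding_type == "relative" and rel_bias is not None:
            scores = scores + rel_bias  # learned relative-position bias
        return softmax(scores) @ v
```

The shared machinery (projections, scaling, softmax) is written once, and only the position-handling branches differ, which is what makes a single class preferable to two near-duplicates.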

@susnato
Contributor Author

susnato commented Sep 6, 2023

Hi @ylacombe, I have addressed all of your comments except these 3 -

Please review and let me know whether more changes are needed other than those 3.

@susnato susnato requested a review from ylacombe September 6, 2023 22:19
Contributor

@ylacombe ylacombe left a comment

Hey @susnato, thanks for addressing my comments, it's a really nice PR at the moment.

Let's gently ping @ArthurZucker, @amyeroberts and @sanchit-gandhi for their help here and here.

I also think that @ArthurZucker is best placed to answer your questions on the fast tokenizer here!

Anyway, I left some small comments, mostly nits. My last request would be to verify that your model also works when passing multiple audio inputs, since I only see code snippets with mostly one sample. Does batching work correctly with CLVP?

I will approve once these few questions are answered, great work!

Collaborator

@ArthurZucker ArthurZucker left a comment

Thanks for working on this!
Most of the work left is on the modeling / conversion; the rest is looking good.

@susnato
Contributor Author

susnato commented Sep 14, 2023

Hi @ylacombe, I have pushed the changes you asked for in your last review and answered the questions in their respective threads.

Anyway, I left some small comments, mostly nits. My last request would be to verify that your model also works when passing multiple audio inputs, since I only see code snippets with mostly one sample. Does batching work correctly with CLVP?

As I see it, there are 4 possible batching scenarios:

  • 1 Text vs N Audios
    Here we just repeat the text tokens N times, so that we generate different responses to the same text with different voices.
  • N Texts vs 1 Audio
    Here we just repeat the audio N times, so that we generate responses for multiple texts with the same voice.
  • N Texts vs N Audios
    Here we generate the responses pairwise, so that the 2nd response is generated using the 2nd text and the 2nd audio.
  • N Texts vs M Audios
    This is where it gets messy. We throw a ValueError here because we don't know which text corresponds to which audio. We could improve this by considering that each text corresponds to every audio, generating N*M speech candidates and comparing them with the N texts.

So, yes, batching works for CLVP except in the last scenario (as of now).
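The first three scenarios reduce to a simple broadcasting rule. A hypothetical helper (broadcast_batch is not part of the PR) that mirrors the description above:

```python
def broadcast_batch(texts, audios):
    """Pair up text and audio batches following simple broadcasting rules."""
    if len(texts) == len(audios):
        return texts, audios                # N texts vs N audios: pair them up
    if len(texts) == 1:
        return texts * len(audios), audios  # 1 text vs N audios: repeat the text
    if len(audios) == 1:
        return texts, audios * len(texts)   # N texts vs 1 audio: repeat the audio
    raise ValueError(
        f"Cannot match {len(texts)} texts with {len(audios)} audios; "
        "pass equal-length batches or a single text/audio to broadcast.")
```

Under this rule, supporting the N-vs-M scenario later would mean replacing the ValueError branch with a cross product of texts and audios.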

@susnato
Contributor Author

susnato commented Sep 15, 2023

Hi @ArthurZucker, just to be clear on the naming part -
We will have 3 configs - ClvpTextAndSpeechConfig, ClvpDecoderConfig, and ClvpConfig to bind them.

For the models, we will have ClvpTextAndSpeechModel (which is the CLVPTransformerWithProjection), ClvpDecoder, ClvpDecoderLMHead, and we will keep ClvpModel as it is.

Do you approve of these names?

@ArthurZucker
Collaborator

Let's forget about the TextAndSpeech naming; it's just a general transformer. We could just have ClvpEncoderConfig (an encoder that can be used for both the speech and the text encoder), ClvpDecoderConfig, and ClvpConfig.
Regarding the modeling: ClvpEncoder, ClvpDecoder, ClvpPretrainedModel, ClvpModel and ClvpModelForConditionalGeneration. You should really have a look at MusicGen's modeling code 😉

@ylacombe ylacombe self-requested a review September 19, 2023 11:11
@susnato
Contributor Author

susnato commented Nov 9, 2023

Hi @ArthurZucker, @amyeroberts I just rebased and the CI is green now!

@amyeroberts
Contributor

@susnato 🙌

Thanks for your patience regarding the failing CI and your work adding the model!

@amyeroberts amyeroberts merged commit 7e9f10a into huggingface:main Nov 10, 2023
EduardoPach pushed a commit to EduardoPach/transformers that referenced this pull request Nov 19, 2023
* init commit

* attention arch done except rotary emb

* rotary emb done

* text encoder working

* outputs matching

* arch first pass done

* make commands done, tests and docs remaining

* all tests passed, only docs remaining

* docs done

* doc-builder fix

* convert script removed(not relevant)

* minor comments done

* added ckpt conversion script

* tokenizer done

* very minor fix of index.md 2

* mostly make fixup related

* all done except fe and rotary emb

* very small change

* removed unidecode dependency

* style changes

* tokenizer removed require_backends

* added require_inflect to tokenizer tests

* removed VOCAB_FILES in tokenizer test

* inflect dependency removed

* added rotary pos emb cache and simplified the apply method

* style

* little doc change

* more comments

* feature extractor added

* added processor

* auto-regressive config added

* added CLVPConditioningEncoder

* comments done except the test one

* weights added successfully (NOT tested)

* tokenizer fix with numbers

* generate outputs matching

* almost tests passing Integ tests not written

* Integ tests added

* major CUDA error fixed

* docs done

* rebase and multiple fixes

* fixed rebase overwrites

* generate code simplified and tests for AutoRegressive model added

* minor changes

* refactored gpt2 code in clvp file

* weights done and all code refactored

* mostly done except the fast_tokenizer

* doc test fix

* config file's doc fixes

* more config fix

* more comments

* tokenizer comments mostly done

* modeling file mostly refactored and can load modules

* ClvpEncoder tested

* ClvpDecoder, ClvpModel and ClvpForCausalLM tested

* integration and all tests passed

* more fixes

* docs almost done

* ckpt conversion refactored

* style and some failing tests fix

* comments

* temporary output fix but test_assisted_decoding_matches_greedy_search test fails

* majority changes done

* use_cache outputs same now! Along with the assisted_greedy_decoding test fix

* more comments

* more comments

* prepare_inputs_for_generation fixed and _prepare_model_inputs added

* style fix

* clvp.md change

* moved clvpconditionalencoder norms

* add model to new index

* added tokenizer input_ids_with_special_tokens

* small fix

* config mostly done

* added config-tester and changed conversion script

* more comments

* comments

* style fix

* some comments

* tokenizer changed back to prev state

* small comments

* added output hidden states for the main model

* style fix

* comments

* small change

* revert small change

* .

* Update clvp.md

* Update test_modeling_clvp.py

* :)

* some minor change

* new fixes

* remove to_dict from FE