Conversation
|
cc: @sanchit-gandhi, @dg845 |
|
Very cool @susnato! Let me know if you have any questions / queries - more than happy to lend a hand here! |
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. |
|
Hi @sanchit-gandhi, this PR is ready for review! Some notes related to this design I wanted to mention -
|
sanchit-gandhi
left a comment
Thanks for working on this @susnato! The composite models are looking very clean, nice job. We can strip out any redundant code and config attributes (see comments below), as well as modules that are simply repeats. Otherwise the modelling code is nearly there!
Now regarding the pre-processing, I don't fully agree with the design decision not to include a tokenizer / feature extractor for this model -> while this model is being used as part of a larger pipeline in diffusers, it should still work standalone in the transformers library. We can't really add a model to the library that doesn't fully work just for the sake of re-using the code in another library, since this breaks the transformers library. I'm sure we'll still need the tokenizer and feature extractor to some extent in the diffusers pipeline, so probably best to address them now
Force-pushed 8dd36ff to 06edb17
|
Hello @sanchit-gandhi I have -
For the
In CLIP it was reasonable to use a Feature Extractor to just read images, but I don't think it will be as straightforward in this case. So how should we design the Feature Extractor, and what are your thoughts on this? |
ylacombe
left a comment
Hey @susnato,
I'm taking over from @sanchit-gandhi for the rest of the process.
First of all, many thanks for the changes! It is definitely looking good and I think we might be near the end!
I'm mostly addressing your questions regarding the Feature Extractor with this review, and I am also making some small overall comments.
I think the key to your FE question lies in this extract from the Tortoise TTS paper, section 2.2.1:
A unique design choice made with TorToise is an additional input which is provided to both the autoregressive generator and the DDPM, which I term the speech conditioning input.
The speech conditioning input starts as one or more audio clips of the same speaker as the target. These clips are converted to MEL spectrograms and fed through an encoder consisting of a stack of self-attention layers. The autoregressive generator and the DDPM have their own conditioning encoders, both of which are learned alongside their respective networks.
If you take a step back and think about what the purpose of tokenizers and feature extractors is, the idea is to preprocess inputs into a format that can be understood by your model. E.g., as you said for the CLIP feature extractor, the idea is to read images into a format that CLIP can understand.
In the highlighted sentence in bold, you can see that:
- The Feature Extractor can simply convert the input audio waveforms into MEL Spectrograms, which I think was done by format_conditioning in the tortoise TTS repo (https://github.com/neonbjb/tortoise-tts/blob/3c4d9c51316cd2421cc2dea11ac3a7a2d3394acd/tortoise/api.py#L103-L115)
- The encoder which processes the MEL spectrograms is IMO an integral part of the CLVP model, and thus should be in the modeling code.
This is quite similar to what has been done in the SpeechT5 feature extractor, which processes audio in order to feed it to SpeechT5SpeechEncoderPrenet.
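To make the feature extractor's role concrete, here is a minimal self-contained NumPy sketch of the waveform-to-log-mel-spectrogram conversion it would perform. The function names and all parameters (sample rate, FFT size, hop, number of mel bins) are illustrative assumptions, not the actual values used by `format_conditioning` in the Tortoise repo or by the final CLVP feature extractor:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # triangular filters spaced evenly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            if c > l:
                fb[i, k] = (k - l) / (c - l)
        for k in range(c, r):
            if r > c:
                fb[i, k] = (r - k) / (r - c)
    return fb

def log_mel_spectrogram(waveform, sr=22050, n_fft=1024, hop=256, n_mels=80):
    # frame the signal, window, FFT -> power spectrum -> mel -> log
    window = np.hanning(n_fft)
    n_frames = 1 + (len(waveform) - n_fft) // hop
    frames = np.stack(
        [waveform[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=-1)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log(np.clip(mel, 1e-5, None)).T  # (n_mels, n_frames)
```

The feature extractor would wrap a transform like this (plus padding/normalization), while the conditioning encoder that consumes the spectrogram stays in the modeling code.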
Of course don't hesitate to reach out to me if you have additional questions or doubts!
|
Hi @ylacombe thanks a lot for the guidance! I have a small doubt/question about the
So to wrap it up, I would suggest something like:

```python
from transformers import GPT2LMHeadModel  # the autoregressive model

class CLVPPrenet(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.mel_encoder = CLVPMelEncoder(cfg)
        self.autoreg = GPT2LMHeadModel(cfg)

    def forward(self, voice_mel_spectrograms, text_tokens):
        x = self.mel_encoder(voice_mel_spectrograms)
        # similar to this line - https://github.com/neonbjb/tortoise-tts/blob/3c4d9c51316cd2421cc2dea11ac3a7a2d3394acd/tortoise/api.py#L416
        codes = self.autoreg.generate(x, text_tokens)
        return codes
```

And the

Please let me know what you think of this design. |
|
Hi @susnato, One note though: you shouldn't mix forward and generate in your forward pass, as this mixes two different concepts, and users might want to pass some specific generate kwargs when generating. A possible solution for that is to bypass the use of the
Two last things to consider are:
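The "don't mix forward and generate" point above can be illustrated with a toy sketch. All class and method names here are hypothetical stand-ins for illustration only, not the actual transformers API:

```python
class ToyAutoregressive:
    """Stand-in for an autoregressive LM like GPT2LMHeadModel (hypothetical)."""

    def forward(self, inputs):
        # differentiable path used for training: return "logits"
        return [x * 2 for x in inputs]

    def generate(self, inputs, max_new_tokens=3, **kwargs):
        # autoregressive decoding loop: greedily append next "token"
        out = list(inputs)
        for _ in range(max_new_tokens):
            out.append(out[-1] + 1)
        return out


class CLVPPrenet:
    """Sketch: forward() only runs the differentiable path; generation lives
    in a separate method, so users can pass arbitrary generate kwargs."""

    def __init__(self, autoreg):
        self.autoreg = autoreg

    def forward(self, mel_features):
        return self.autoreg.forward(mel_features)

    def generate_speech_codes(self, mel_features, **generate_kwargs):
        # generate kwargs (e.g. max_new_tokens) flow through untouched
        return self.autoreg.generate(mel_features, **generate_kwargs)
```

Keeping the two entry points separate means `forward` stays usable for training/loss computation while generation options remain fully configurable.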
Let me know if you have more questions! |
|
Hi @ylacombe first of all apologies for the huge delay. I have pushed some major changes -
|
Hi @susnato ,
Many thanks for the huge changes that you've made, the tests are definitely better! And they all pass so that's really great!
However, there are still a few things to change that you'll find on my review! Here is a summary of the main things:
- I don't think `CLVPConfig.__init__` code should be that long and complex. See `BarkConfig` for an example of a nested config that doesn't use such if/else cases!
- Your modeling code should respect transformers' single model file policy, so you should integrate `GPT2Block` into your modeling code instead of importing it.
- I believe that you can use a single `CLVPAttention` instead of two classes with two different positional embeddings (relative and rotary). A beautiful example of such a class is `Wav2Vec2ConformerSelfAttention`. This would be especially useful since your future `CLVPBlock` (`GPT2Block`) will also use attention.
I believe that except for those points, you are near the end of this PR, congrats!
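As a rough illustration of the single-attention-class idea, here is a NumPy sketch of a rotary position embedding helper plus a switch on the embedding type. The names, shapes, and the `UnifiedAttention` class are illustrative assumptions, not the actual `Wav2Vec2ConformerSelfAttention` or CLVP code:

```python
import numpy as np

def rotate_half(x):
    # (x1, x2) -> (-x2, x1): pairs channel i with channel i + dim/2
    x1, x2 = np.split(x, 2, axis=-1)
    return np.concatenate([-x2, x1], axis=-1)

def apply_rotary(x, positions, dim):
    # classic RoPE: rotate channel pairs by position-dependent angles
    inv_freq = 1.0 / (10000.0 ** (np.arange(0, dim, 2) / dim))
    freqs = np.outer(positions, inv_freq)          # (seq, dim/2)
    emb = np.concatenate([freqs, freqs], axis=-1)  # (seq, dim)
    return x * np.cos(emb) + rotate_half(x) * np.sin(emb)

class UnifiedAttention:
    """Hypothetical: one attention class handling both embedding types
    via a config flag instead of two near-duplicate classes."""

    def __init__(self, position_embedding_type="rotary", head_dim=8):
        self.position_embedding_type = position_embedding_type
        self.head_dim = head_dim

    def add_positions(self, q, positions):
        if self.position_embedding_type == "rotary":
            return apply_rotary(q, positions, self.head_dim)
        # a "relative" branch would instead add learned position biases
        # to the attention scores; omitted in this sketch
        return q
```

Because rotary embeddings only rotate the query/key vectors, they preserve vector norms, which makes the shared-class refactor behavior-neutral for the rotary path.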
|
Hi @ylacombe , I have addressed all of your comments except these 3 -
Please review it and let me know if more changes are needed or not other than those 3. |
ylacombe
left a comment
Hey @susnato, thanks for addressing my comments, it's a really nice PR at the moment.
Let's gently ping @ArthurZucker, @amyeroberts and @sanchit-gandhi for their help here and here.
I also think that @ArthurZucker is best placed to answer your questions on the fast tokenizer here!
Anyways, I left some small comments, mostly nits. My last request would be to verify that your model also works when passing multiple audio inputs, since I have only seen code snippets with mostly one sample. Does batching work correctly with CLVP?
I will approve once these few questions are answered, great work!
ArthurZucker
left a comment
Thanks for working on this!
Most of the work left is on the modeling / conversion, the rest is looking good
|
Hi @ylacombe, I have pushed the changes you asked in your last review and answered the questions in their respective thread.
As I see this, there are 4 possible batching scenarios:
So, yes, batching works for CLVP except for the last scenario (as of now). |
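For reference, batching variable-length audio usually comes down to padding to the longest clip plus an attention mask. A minimal sketch (the function name and conventions are illustrative, not the actual CLVP feature extractor API):

```python
import numpy as np

def pad_audio_batch(waveforms, pad_value=0.0):
    """Pad variable-length 1-D waveforms to the longest and build a mask.

    Returns (batch, mask): mask is 1 over real samples, 0 over padding,
    so downstream attention can ignore the padded positions.
    """
    max_len = max(len(w) for w in waveforms)
    batch = np.full((len(waveforms), max_len), pad_value, dtype=np.float32)
    mask = np.zeros((len(waveforms), max_len), dtype=np.int64)
    for i, w in enumerate(waveforms):
        batch[i, : len(w)] = w
        mask[i, : len(w)] = 1
    return batch, mask
```

A batching scenario then "works" if the model's per-sample outputs with this padding-plus-mask match the outputs obtained by running each sample alone.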
|
Hi @ArthurZucker, just to be clear on the naming part - For the models we will have Do you approve of this naming? |
|
Let's forget about the |
|
Hi @ArthurZucker, @amyeroberts I just rebased and the CI is green now! |
|
@susnato 🙌 Thanks for your patience regarding the failing CI and your work adding the model! |
* init commit
* attention arch done except rotary emb
* rotary emb done
* text encoder working
* outputs matching
* arch first pass done
* make commands done, tests and docs remaining
* all tests passed, only docs remaining
* docs done
* doc-builder fix
* convert script removed(not relevant)
* minor comments done
* added ckpt conversion script
* tokenizer done
* very minor fix of index.md 2
* mostly make fixup related
* all done except fe and rotary emb
* very small change
* removed unidecode dependency
* style changes
* tokenizer removed require_backends
* added require_inflect to tokenizer tests
* removed VOCAB_FILES in tokenizer test
* inflect dependency removed
* added rotary pos emb cache and simplified the apply method
* style
* little doc change
* more comments
* feature extractor added
* added processor
* auto-regressive config added
* added CLVPConditioningEncoder
* comments done except the test one
* weights added successfull(NOT tested)
* tokenizer fix with numbers
* generate outputs matching
* almost tests passing Integ tests not written
* Integ tests added
* major CUDA error fixed
* docs done
* rebase and multiple fixes
* fixed rebase overwrites
* generate code simplified and tests for AutoRegressive model added
* minor changes
* refectored gpt2 code in clvp file
* weights done and all code refactored
* mostly done except the fast_tokenizer
* doc test fix
* config file's doc fixes
* more config fix
* more comments
* tokenizer comments mostly done
* modeling file mostly refactored and can load modules
* ClvpEncoder tested
* ClvpDecoder, ClvpModel and ClvpForCausalLM tested
* integration and all tests passed
* more fixes
* docs almost done
* ckpt conversion refectored
* style and some failing tests fix
* comments
* temporary output fix but test_assisted_decoding_matches_greedy_search test fails
* majority changes done
* use_cache outputs same now! Along with the asisted_greedy_decoding test fix
* more comments
* more comments
* prepare_inputs_for_generation fixed and _prepare_model_inputs added
* style fix
* clvp.md change
* moved clvpconditionalencoder norms
* add model to new index
* added tokenizer input_ids_with_special_tokens
* small fix
* config mostly done
* added config-tester and changed conversion script
* more comments
* comments
* style fix
* some comments
* tokenizer changed back to prev state
* small commnets
* added output hidden states for the main model
* style fix
* comments
* small change
* revert small change
* .
* Update clvp.md
* Update test_modeling_clvp.py
* :)
* some minor change
* new fixes
* remove to_dict from FE
What does this PR do?
Adds CLVP, which is an integral part of `tortoise-tts`. Required for `tortoise-tts` integration in HF diffusers (please see this issue).
Before submitting
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.