[WIP] Add Tortoise TTS by susnato · Pull Request #4106 · huggingface/diffusers

susnato · 2023-07-14T21:01:54Z

What does this PR do?

Adds Tortoise TTS Pipeline and Fixes #3891

Before adding this pipeline, we need to make sure these two PRs are merged -

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline?
Did you read our philosophy doc (important for complex PRs)?
Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

susnato · 2023-07-14T21:02:34Z

cc : @sanchit-gandhi @dg845

HuggingFaceDocBuilderDev · 2023-07-14T21:08:59Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

dg845 · 2023-07-17T00:24:40Z

Hi, I pulled the tortoise_tts branch from susnato/diffusers and made my own local branch at dg845/diffusers with some changes, but I'm not sure how to make those changes show up on this PR.

(When I click the "Open pull request" button on GitHub, I'm not able to specify susnato/diffusers as the base repository, and I'm not sure setting huggingface/diffusers as the base repository will do the right thing [tortoise_tts isn't a branch on huggingface/diffusers].)

susnato · 2023-07-17T07:13:31Z

Hi @dg845, I just invited you to collaborate on my susnato/diffusers repository (you will get an email notification).
Please accept the invitation, clone susnato/diffusers, and check out the tortoise_tts branch. After you make progress on this branch (on susnato/diffusers, not on dg845/diffusers), please push it to my repository (it will be something similar to this - git push susnato/diffusers tortoise_tts).

Since I gave you access to my repo, we can now both push changes to tortoise_tts of susnato/diffusers and the changes will show on this PR!

Please let me know if this is working or not!

susnato · 2023-07-17T18:32:23Z

I have tested and added the resnet block; tomorrow I will do the Attention block, @dg845. Let's first add all the modules (those that you and I specified in Slack) and create a basic diagram until CLVP and UnivNet are merged. After that we can transfer the weights and finalize the whole model!
Later we can focus on writing docs/tests.

dg845 · 2023-07-18T00:37:31Z

> Please let me know if this is working or not!

Looks like it is working, was able to push a commit :).

dg845 · 2023-07-18T00:45:48Z

I have moved the resnet block code to modeling_tortoise_tts.py and added some initial pipeline code in pipeline_tortoise_tts.py. I think keeping only pipeline code in pipeline_tortoise_tts.py and putting new module code in modeling_*.py files (as in, e.g., the Versatile Diffusion pipeline) will make the code clearer.

susnato · 2023-07-30T20:26:05Z

I've just started on the checkpoint conversion script for the diffusion decoder model and will add the CLVP model conversion script later. (By the way, the weight-loading code is unfinished and a pure mess; I will update it in the next commit.) Also, it seems that you have done a lot of work here! I need to catch up.
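A checkpoint conversion script of this kind usually boils down to renaming parameter keys from the original layout to the new one. Here is a minimal sketch of that idea in plain Python. The key names in `PREFIX_MAP` and the helper names `convert_key` and `convert_state_dict` are hypothetical and chosen only for illustration; they are not the real Tortoise TTS or diffusers parameter names, and the actual script in this PR may work differently.

```python
# Hypothetical renaming rules: these substrings are made up for illustration
# and do not correspond to real Tortoise TTS or diffusers parameter names.
PREFIX_MAP = {
    "layers.": "blocks.",
    "attn.qkv.": "attention.to_qkv.",
}


def convert_key(old_key, prefix_map):
    """Apply every substring replacement in prefix_map to a single key."""
    new_key = old_key
    for old, new in prefix_map.items():
        new_key = new_key.replace(old, new)
    return new_key


def convert_state_dict(state_dict, prefix_map):
    """Return a new dict with renamed keys; values are left untouched."""
    converted = {}
    for old_key, value in state_dict.items():
        new_key = convert_key(old_key, prefix_map)
        if new_key in converted:
            # Two source keys collapsing into one target key is a mapping bug.
            raise KeyError(f"duplicate key after renaming: {new_key}")
        converted[new_key] = value
    return converted


# Toy "checkpoint" with lists standing in for tensors.
original = {
    "layers.0.attn.qkv.weight": [1.0, 2.0],
    "layers.0.norm.bias": [0.0],
}
converted = convert_state_dict(original, PREFIX_MAP)
print(sorted(converted))
```

Raising on duplicate target keys is a deliberate choice in this sketch: silent overwrites are a common source of weights that load without error but produce mismatched outputs.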

dg845 · 2023-07-31T06:46:55Z

Just a heads up that I have refactored modeling_tortoise_tts.py into 3 files:

modeling_common.py: modules not used in autoregressive modeling or diffusion modeling, like ConditioningEncoder and RandomLatentConverter, as well as blocks shared in common like AttentionBlock
modeling_autoregressive.py: modules used in autoregressive modeling, such as Tortoise TTS's version of GPT2 and blocks specific to autoregressive modeling (like the autoregressive version of ResBlock)
modeling_diffusion.py: modules used in diffusion modeling, such as the diffusion denoising model, and blocks specific to diffusion modeling (like the diffusion ResnetBlock1D block)

The code can probably be simplified further but I think this makes sense for now.

susnato · 2023-07-31T19:25:48Z

The diffusion decoder attention outputs now match, and the whole decoder model will probably be ready in the next 1 or 2 days since the Resnet outputs are already verified against the official repo. Also, shouldn't we place TortoiseTTSDiffusionModelAttention in modeling_diffusion.py instead of modeling_common.py? If I am not wrong, that module is specific to the decoder model only.

Please let me know what you think.

dg845 · 2023-08-01T08:42:03Z

ConditioningEncoder also uses the same attention block as the diffusion denoising model (parallel to how tortoise.models.autoregressive.ConditioningEncoder and tortoise.models.diffusion_decoder.DiffusionLayer both use tortoise.models.arch_util.AttentionBlock in the original code), so I think TortoiseTTSDiffusionModelAttention should go in modeling_common.py.

I guess one difference is that when AttentionBlock is used in autoregressive modeling the relative position embeddings aren't used, whereas they are always used in diffusion modeling. So we could potentially replace the attention block used in autoregressive modeling with a more "vanilla" attention block (possibly one already implemented in diffusers).

[For context, the only place AttentionBlock is used in autoregressive modeling is in the ConditioningEncoder. In the current design the ConditioningEncoder module is meant to work for both autoregressive and diffusion modeling; for the latter, it needs access to relative position embeddings, which is why it's intended to use TortoiseTTSDiffusionModelAttention currently.]
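To illustrate the distinction discussed above, here is a minimal pure-Python sketch of a single-head attention function where the relative position term is optional. It assumes the relative position embeddings act as an additive bias on the attention scores, indexed by the offset between key and query positions. The function names and the `rel_bias` dictionary are hypothetical, and this is not the Tortoise TTS or diffusers implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v, rel_bias=None):
    """Single-head scaled dot-product attention over lists of vectors.

    rel_bias (hypothetical): optional dict mapping a relative offset (j - i)
    to a scalar added to the score between query i and key j. With
    rel_bias=None this reduces to "vanilla" attention.
    """
    dim = len(q[0])
    out = []
    for i, qi in enumerate(q):
        scores = []
        for j, kj in enumerate(k):
            score = sum(a * b for a, b in zip(qi, kj)) / math.sqrt(dim)
            if rel_bias is not None:
                score += rel_bias.get(j - i, 0.0)
            scores.append(score)
        weights = softmax(scores)
        out.append(
            [sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(len(v[0]))]
        )
    return out


# Identical zero queries/keys: without a bias, every position averages all values.
x = [[0.0, 0.0], [0.0, 0.0]]
v = [[1.0, 0.0], [0.0, 1.0]]
plain = attention(x, x, v)
# A large bias at offset 0 makes each position attend almost only to itself.
biased = attention(x, x, v, rel_bias={0: 50.0})
```

With one code path and an optional bias, a shared block can serve both use cases, which is one way to read the design trade-off described in the comment above.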

ylacombe · 2023-12-08T15:07:00Z

Hey @susnato , I see that you are really active in integrating Tortoise!
Let me know if I can be of any help here, or if you need a first review!

susnato · 2023-12-14T19:11:41Z

Hi @ylacombe, sure, I will let you know once it's finished, and sorry it's taking so long.

github-actions · 2024-01-09T15:08:52Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

tuanh123789 · 2024-07-13T17:16:58Z

Can I work on this to add Tortoise to Diffusers? I see that Tortoise TTS is very powerful, but it's very slow.
I believe that diffusers will improve the performance of this model.

ylacombe · 2024-07-15T09:38:16Z

Hey @tuanh123789, let's first ping @susnato to make sure he doesn't want or have bandwidth to finish this PR!
If that's the case, feel free to continue the PR and ping me once you need a review!

@susnato, let us know! You've already made a great effort on this PR, would you like to finish it up? Thanks!

tuanh123789 · 2024-07-17T07:01:55Z

It seems like @susnato is busy and no longer working on this PR. Perhaps I will continue and inherit his contributions to complete this pipeline @ylacombe

susnato · 2024-07-17T08:26:32Z

Hello, I am so sorry to everyone that I couldn't finish this PR 😢 .

@tuanh123789 Please feel free to take this up. As far as I remember, @dg845 and I have already implemented CLVP and the UnivNet vocoder, and I was able to get the logits from start to vocoder within 2e-2 atol (within the acceptable range for diffusers, as far as I am aware), so it needs a bit more work to make it e2e compatible.

I can also invite you to my diffusers branch so that you can continue the work from there (if you want of course)

Also, maybe @ylacombe you could invite @tuanh123789 to our shared Slack channel (if possible) so that he can get a better idea of the current state and the issues that we were facing.
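For readers unfamiliar with the "within 2e-2 atol" wording above, here is a minimal sketch of an absolute-tolerance comparison between a reference output and a ported output. The helper names `max_abs_diff` and `within_atol` and the sample values are hypothetical; in practice such checks are normally run on tensors with the testing utilities of the array library in use.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two equal-length lists."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same length")
    return max(abs(r - c) for r, c in zip(reference, candidate))


def within_atol(reference, candidate, atol=2e-2):
    """True if every element differs by at most atol (absolute tolerance)."""
    return max_abs_diff(reference, candidate) <= atol


# Made-up values standing in for the original model's and the port's outputs.
reference_out = [0.10, -0.50, 1.25]
ported_out = [0.11, -0.49, 1.24]

loose_ok = within_atol(reference_out, ported_out, atol=2e-2)
strict_ok = within_atol(reference_out, ported_out, atol=1e-3)
```

Reporting the maximum absolute difference alongside the pass/fail result makes it easier to see which stage of a multi-stage pipeline introduces the largest drift.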

susnato · 2024-07-17T08:28:21Z

@tuanh123789 Let me know if you need any more pointers, I will try to answer as much as I can!

ylacombe · 2024-07-17T08:34:09Z

Thanks for the update @susnato! @tuanh123789 feel free to reach out on X or LI to get you on the channel!

tuanh123789 · 2024-07-17T10:10:40Z

> Hello, I am so sorry to everyone that I couldn't finish this PR 😢 .
>
> @tuanh123789 Please feel free to take this up. As far as I remember, @dg845 and I have already implemented CLVP and the UnivNet vocoder, and I was able to get the logits from start to vocoder within 2e-2 atol (within the acceptable range for diffusers, as far as I am aware), so it needs a bit more work to make it e2e compatible.
>
> I can also invite you to my diffusers branch so that you can continue the work from there (if you want of course)
>
> Also, maybe @ylacombe you could invite @tuanh123789 to our shared Slack channel (if possible) so that he can get a better idea of the current state and the issues that we were facing.

Sure, pls add me to your branch

susnato · 2024-07-17T10:22:17Z

Just did @tuanh123789 ! You should see a message in your mail.

poedator · 2024-08-12T10:48:43Z

@susnato, @tuanh123789
Thank you for this massive work!
Could you comment on what is left to do in this PR, where you may need help, and what are the chances of its successful merge within reasonable time?

files created( not even init :( )

530f144

ResnetBlock added(tested too)

6ecd4dc

Add initial pipeline code and tests based on AudioLDMPipeline and move resnet block to modeling_tortoise_tts.py.

9f74d2f

dg845 and others added 9 commits July 18, 2023 16:54

Add ConditioningEncoder module and refactor modeling_tortoise_tts.py a bit.

391926e

Add AttnEncoderBlock1d with architecture ResnetBlock1D => AttentionBlock

7f28de3

Add inital implementation of Tortoise TTS diffusion denoising model.

058686a

added Attention block for Diffusion model

d3a383d

Add rough initial code for preparing latents and diffusion loop.

732ff56

Merge branch 'main' into tortoise_tts

8d81fa4

Add RandomLatentConverter to convert Gaussian noise to random latent conditioning embeddings when no conditioning audio is supplied.

60cf5d6

Add cpu offload and start implementing autoregressive model sampling.

3272a6f

checkpoint conversion script started and attention output match partly done

b674862

Refactoring modeling files into files for common modules, autoregressive modeling, and diffusion modeling.

f29564a

dg845 and others added 2 commits July 31, 2023 01:14

Add rough initial code for autoregressive modeling.

5062861

diffusion decoder attention opts same

91c04c5

dg845 and others added 4 commits August 2, 2023 22:16

Add temp conversion script and fix RandomLatentConverter.

87b240e

Add conditioning encoder configs to conversion script and fix bug in ConditioningEncoder.

c6b83b4

Fix further bug in ConditioningEncoder.

7c71d7f

added diffusion layer

2152fea

susnato and others added 13 commits November 9, 2023 15:37

audio emb done but UnivnetFE is not working as expected

fb14ee0

minor fixes

c511263

Refactor calculating the diffusion attention mask for trimming the autoregressive audio candidates into its own method.

f57cdf3

Make diffusion attention mask computation more close to the audio trimming logic in tortoise-tts.

30bbf22

minor renaming change

e2d48f6

Add explanatory comment

c1f79d9

outputs matching till best latents

2831386

Merge remote-tracking branch 'origin/tortoise_tts' into tortoise_tts

46ea2a6

outputs same upto diffusion_cond_emb but with 1e-2!

22059ea

partially works upto model's 1st pass

a239235

more fixes

1ee3469

more fixes and outputs same for a single timestep with 1e-3

d5754d6

decoder latents with 2e-2 but vocoder denormalization is making things worse otherwise pretty good

c9462b7

dg845 mentioned this pull request Nov 21, 2023

Add UnivNet Vocoder Model for Tortoise TTS Diffusers Integration huggingface/transformers#24799

Merged

5 tasks

torchaudio resample fix

b91585d

sanchit-gandhi mentioned this pull request Dec 19, 2023

Add Tortoise TTS to HF Pipeline huggingface/transformers#28120

Closed

github-actions bot added the stale Issues that haven't received updates label Jan 9, 2024

github-actions bot closed this Jan 18, 2024

8 participants