[WIP] Add Tortoise TTS#4106

Closed
susnato wants to merge 69 commits into huggingface:main from susnato:tortoise_tts

Conversation

@susnato
Contributor

@susnato susnato commented Jul 14, 2023

What does this PR do?

Adds the Tortoise TTS pipeline and fixes #3891.

Before adding this pipeline, we need to make sure these two PRs are merged -

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@susnato
Contributor Author

susnato commented Jul 14, 2023

cc: @sanchit-gandhi @dg845

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@dg845
Collaborator

dg845 commented Jul 17, 2023

Hi, I pulled the tortoise_tts branch from susnato/diffusers and made my own local branch at dg845/diffusers with some changes, but I'm not sure how to make those changes show up on this PR.

(When I click the "Open pull request" button on GitHub, I'm not able to specify susnato/diffusers as the base repository, and I'm not sure setting huggingface/diffusers as the base repository will do the right thing [tortoise_tts isn't a branch on huggingface/diffusers].)

@susnato
Contributor Author

susnato commented Jul 17, 2023

Hi @dg845, I just invited you to collaborate on my susnato/diffusers repository (you will get an email notification).
Please accept the invitation, clone susnato/diffusers, and check out the tortoise_tts branch. After you make progress on this branch (on susnato/diffusers, not on dg845/diffusers), please push it to my repository (it will be something similar to this - git push susnato/diffusers tortoise_tts).

Since I gave you access to my repo, we can now both push changes to the tortoise_tts branch of susnato/diffusers, and the changes will show up on this PR!

Please let me know if this is working or not!

@susnato
Contributor Author

susnato commented Jul 17, 2023

I have tested and added the ResNet block; tomorrow I will do the Attention block, @dg845. Let's first add all the modules (those that you and I specified in Slack) and create a basic diagram until CLVP and UnivNet are merged; after that we can transfer the weights and finalize the whole model!
Later we can focus on writing docs/tests.

@dg845
Collaborator

dg845 commented Jul 18, 2023

Please let me know if this is working or not!

Looks like it is working; I was able to push a commit :).

@dg845
Collaborator

dg845 commented Jul 18, 2023

I have moved the ResNet block code to modeling_tortoise_tts.py and added some initial pipeline code in pipeline_tortoise_tts.py. I think keeping only pipeline code in pipeline_tortoise_tts.py and putting new module code in modeling_*.py files (as in, e.g., the Versatile Diffusion pipeline) will make the code clearer.

@susnato
Contributor Author

susnato commented Jul 30, 2023

Just started on the checkpoint conversion script for the diffusion decoder model; I will also add the CLVP model conversion script later. (By the way, the weight-loading code is unfinished and a mess; I will update it in the next commit.) Also, it seems that you have done a lot of work here! I need to catch up.
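
For readers unfamiliar with checkpoint conversion scripts: they often boil down to renaming state-dict keys from the original checkpoint layout to the new module layout. The sketch below shows only that general pattern. The key names in `KEY_MAP` and the `rename_keys` helper are hypothetical placeholders, not taken from the Tortoise TTS or diffusers code, and plain lists stand in for tensors.

```python
# Hypothetical substring mapping: original key fragment -> new key fragment.
# These names are placeholders for illustration, not the real Tortoise TTS keys.
KEY_MAP = {
    "layers.": "blocks.",
    "attn.qkv.": "attention.to_qkv.",
}


def rename_keys(state_dict, key_map):
    """Return a new dict in which every matching key fragment is replaced."""
    converted = {}
    for old_key, value in state_dict.items():
        new_key = old_key
        for old, new in key_map.items():
            new_key = new_key.replace(old, new)
        if new_key in converted:
            # Two source keys mapping to one target key would silently drop weights.
            raise ValueError(f"Key collision after renaming: {new_key}")
        converted[new_key] = value
    return converted


# Toy "checkpoint": lists stand in for weight tensors.
original = {
    "layers.0.attn.qkv.weight": [0.1, 0.2],
    "layers.0.attn.qkv.bias": [0.0],
    "out.weight": [1.0],
}
converted = rename_keys(original, KEY_MAP)
print(sorted(converted))
```

Keys that match nothing in the mapping pass through unchanged, and raising on collisions helps catch a mapping that would otherwise overwrite a weight without warning.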

@dg845
Collaborator

dg845 commented Jul 31, 2023

Just a heads up that I have refactored modeling_tortoise_tts.py into 3 files:

  • modeling_common.py: modules not specific to either autoregressive or diffusion modeling, like ConditioningEncoder and RandomLatentConverter, as well as shared blocks like AttentionBlock
  • modeling_autoregressive.py: modules used in autoregressive modeling, such as Tortoise TTS's version of GPT2 and blocks specific to autoregressive modeling (like the autoregressive version of ResBlock)
  • modeling_diffusion.py: modules used in diffusion modeling, such as the diffusion denoising model, and blocks specific to diffusion modeling (like the diffusion ResnetBlock1D block)

The code can probably be simplified further but I think this makes sense for now.

@susnato
Contributor Author

susnato commented Jul 31, 2023

The diffusion decoder attention outputs are the same now; the whole decoder model will probably be ready in the next 1 or 2 days, since the ResNet outputs are already verified against the official repo. Also, shouldn't we place TortoiseTTSDiffusionModelAttention in modeling_diffusion.py instead of modeling_common.py? If I am not wrong, that module is specific to the decoder model only.

Please let me know what you think.

@dg845
Collaborator

dg845 commented Aug 1, 2023

ConditioningEncoder also uses the same attention block as the diffusion denoising model (parallel to how tortoise.models.autoregressive.ConditioningEncoder and tortoise.models.diffusion_decoder.DiffusionLayer both use tortoise.models.arch_util.AttentionBlock in the original code), so I think TortoiseTTSDiffusionModelAttention should go in modeling_common.py.

I guess one difference is that when AttentionBlock is used in autoregressive modeling, the relative position embeddings aren't used, whereas they are always used in diffusion modeling. So we could potentially replace the attention block used in autoregressive modeling with a more "vanilla" attention block (possibly one already implemented in diffusers).

[For context, the only place AttentionBlock is used in autoregressive modeling is in the ConditioningEncoder. In the current design the ConditioningEncoder module is meant to work for both autoregressive and diffusion modeling; for the latter, it needs access to relative position embeddings, which is why it's intended to use TortoiseTTSDiffusionModelAttention currently.]
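
To illustrate the design point above, here is a toy sketch of a single attention routine that covers both cases: with no bias it is "vanilla" attention, and with a bias it adds a relative-position term to the scores. It is not the Tortoise TTS or diffusers implementation. The `attention` function, the single-head pure-Python formulation, and the offset-keyed lookup table (standing in for learned relative position embeddings) are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, rel_pos_bias=None):
    """Single-head scaled dot-product attention over lists of vectors.

    rel_pos_bias, if given, maps the offset (key_index - query_index) to a
    scalar that is added to the attention score before the softmax.
    Offsets missing from the table contribute no bias.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim = len(values[0])
    outputs = []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            score = scale * sum(a * b for a, b in zip(q, k))
            if rel_pos_bias is not None:
                score += rel_pos_bias.get(j - i, 0.0)
            scores.append(score)
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time.
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
plain = attention(x, x, x)  # "vanilla" path: no position information
biased = attention(x, x, x, rel_pos_bias={0: 2.0, -1: 1.0, 1: 1.0})
```

Making the bias optional keeps one code path for both callers, which is the trade-off discussed above: a shared block with a switch versus a separate, simpler block for the case that never uses position information.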

@ylacombe
Contributor

ylacombe commented Dec 8, 2023

Hey @susnato, I see that you are really active in integrating Tortoise!
Let me know if I can be of any help here, or if you need a first review!

@susnato
Contributor Author

susnato commented Dec 14, 2023

Hi @ylacombe, sure, I will let you know once it's finished, and sorry it's taking so long.

@github-actions
Contributor

github-actions bot commented Jan 9, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label Jan 9, 2024
@github-actions github-actions bot closed this Jan 18, 2024
@tuanh123789
Contributor

Can I work on this to add Tortoise to Diffusers? I see that Tortoise TTS is very powerful, but it's very slow.
I believe that Diffusers will improve the performance of this model.

@ylacombe
Contributor

Hey @tuanh123789, let's first ping @susnato to make sure he doesn't want to, or doesn't have the bandwidth to, finish this PR!
If that's the case, feel free to continue the PR and ping me once you need a review!

@susnato, let us know! You've already made a great effort on this PR; would you like to finish it up? Thanks!

@tuanh123789
Contributor

It seems like @susnato is busy and no longer working on this PR. Perhaps I will continue and build on his contributions to complete this pipeline, @ylacombe.

@susnato
Contributor Author

susnato commented Jul 17, 2024

Hello, I am so sorry to everyone that I couldn't finish this PR 😢.

@tuanh123789 Please feel free to take this up. As far as I remember, @dg845 and I have already implemented CLVP and the UnivNet vocoder, and I was able to get the logits from start to vocoder within 2e-2 atol (within the acceptable range for diffusers, as far as I am aware), so it needs a bit more work to make it e2e compatible.

I can also invite you to my diffusers branch so that you can continue the work from there (if you want, of course).

Also, maybe @ylacombe you could invite @tuanh123789 to our shared Slack channel (if possible) so that he can get a better idea of the current state and the issues that we were facing.
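
As a reference for what "within 2e-2 atol" means in practice, here is a minimal sketch of an absolute-tolerance comparison between reference outputs and ported outputs. The `max_abs_diff` and `within_atol` helpers and the sample numbers are made up for illustration. A real check would more likely use something like `torch.allclose` or `numpy.testing.assert_allclose`, which also take a relative tolerance.

```python
def max_abs_diff(reference, candidate):
    """Largest elementwise absolute difference between two equal-length sequences."""
    if len(reference) != len(candidate):
        raise ValueError("Sequences must have the same length")
    return max(abs(r - c) for r, c in zip(reference, candidate))


def within_atol(reference, candidate, atol=2e-2):
    """True if every element of candidate is within atol of the reference."""
    return max_abs_diff(reference, candidate) <= atol


# Made-up numbers standing in for flattened model outputs.
reference = [0.120, -0.530, 0.998]
candidate = [0.125, -0.541, 0.990]

print(max_abs_diff(reference, candidate))
print(within_atol(reference, candidate))             # default tolerance of 2e-2
print(within_atol(reference, candidate, atol=1e-2))  # stricter tolerance
```

Reporting the maximum difference alongside the pass/fail result makes it easier to see how close a port is to a stricter tolerance.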

@susnato
Contributor Author

susnato commented Jul 17, 2024

@tuanh123789 Let me know if you need any more pointers, I will try to answer as much as I can!

@ylacombe
Contributor

Thanks for the update @susnato! @tuanh123789, feel free to reach out on X or LinkedIn so I can get you on the channel!

@tuanh123789
Contributor

Hello, I am so sorry to everyone that I couldn't finish this PR 😢.

@tuanh123789 Please feel free to take this up. As far as I remember, @dg845 and I have already implemented CLVP and the UnivNet vocoder, and I was able to get the logits from start to vocoder within 2e-2 atol (within the acceptable range for diffusers, as far as I am aware), so it needs a bit more work to make it e2e compatible.

I can also invite you to my diffusers branch so that you can continue the work from there (if you want, of course).

Also, maybe @ylacombe you could invite @tuanh123789 to our shared Slack channel (if possible) so that he can get a better idea of the current state and the issues that we were facing.

Sure, please add me to your branch.

@susnato
Contributor Author

susnato commented Jul 17, 2024

Just did, @tuanh123789! You should see a message in your email.

@poedator

@susnato, @tuanh123789
Thank you for this massive work!
Could you comment on what is left to do in this PR, where you may need help, and what the chances are of a successful merge within a reasonable time?

Development

Successfully merging this pull request may close these issues.

Add Tortoise TTS as a pipeline

8 participants