Update LTX-2 Docs to Cover LTX-2.3 Models#13337
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
    num_inference_steps=30,
    guidance_scale=3.0,  # Recommended LTX-2.3 guidance parameters
    stg_scale=1.0,  # Note that 0.0 (not 1.0) means that STG is disabled (all other guidance is disabled at 1.0)
    modality_scale=3.0,
I think this refers to the modality isolation guidance?
)
```
## Prompt Enhancement
Wonder if we could low-key showcase our prompt enhancement custom block powered by Gemini?
The LTX-2.3 model seems to be quite sensitive in terms of sample quality to the input prompt. Since the current GeminiPromptExpander doesn't accept a system_prompt argument to guide the prompt expansion, I think it may not work well with LTX-2.3 because the prompts may still be out of distribution although they are expanded.
stevhliu left a comment
very nice, thanks for updating!
</div>
LTX-2 is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model. It brings together the core building blocks of modern video generation, with open weights and a focus on practical, local execution.
[LTX-2](https://arxiv.org/abs/2601.03233) is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model. It brings together the core building blocks of modern video generation, with open weights and a focus on practical, local execution.

Suggested change:
[LTX-2](https://arxiv.org/abs/2601.03233) is a DiT-based foundation model designed to generate synchronized video and audio within a single model. It brings together the core building blocks of modern video generation, with open weights and a focus on practical, local execution.
2. **Spatio-Temporal Guidance (STG)**: [STG](https://arxiv.org/pdf/2411.18664) moves away from a perturbed output created by short-cutting self-attention operations, substituting in the attention values instead. The idea is that this creates sharper videos and better spatiotemporal consistency.
3. **Modality Isolation Guidance**: this moves away from a perturbed output created by disabling cross-modality (audio-to-video and video-to-audio) cross attention. This guidance is more specific to [LTX-2.X](https://arxiv.org/pdf/2601.03233) models, with the idea that this produces better consistency between the generated audio and video.
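To make the three mechanisms concrete, here is a toy numeric sketch of how guidance terms like these are typically combined, in the spirit of classifier-free guidance. This is an illustration only, not the pipelines' actual implementation; the variable names and values are made up, and the scale conventions follow the comment quoted earlier in this thread (STG is disabled at 0.0, the other two guidances at 1.0).

```python
# Toy scalar stand-ins for the denoiser's predictions (illustrative values,
# not real model outputs; names are invented for this sketch).
cond = 2.0      # fully conditioned prediction
uncond = 1.0    # unconditional prediction (classifier-free guidance term)
stg_pert = 1.8  # prediction with self-attention short-cut (STG term)
mod_pert = 1.9  # prediction with cross-modality attention disabled

guidance_scale, stg_scale, modality_scale = 3.0, 1.0, 3.0

# Each term pushes the result away from its perturbed prediction.
# CFG and modality guidance vanish at scale 1.0; the STG term vanishes at 0.0.
guided = (
    cond
    + (guidance_scale - 1.0) * (cond - uncond)
    + stg_scale * (cond - stg_pert)
    + (modality_scale - 1.0) * (cond - mod_pert)
)
print(guided)  # ≈ 4.4 with the values above
```

High combined scales push `guided` far from `cond`, which is why the docs pair these scales with guidance rescaling to curb over-exposure.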
These are controlled by the `guidance_scale`, `stg_scale`, and `modality_scale` arguments, respectively, and can be set separately for video and audio. Additionally, for STG, the transformer block indices where self-attention is skipped need to be specified via the `spatio_temporal_guidance_blocks` argument. In addition, the LTX-2.X pipelines also support [guidance rescaling](https://arxiv.org/abs/2305.08891) to help reduce over-exposure, which can be a problem when the guidance scales are set to high values.

Suggested change:
These are controlled by the `guidance_scale`, `stg_scale`, and `modality_scale` arguments, and can be set separately for video and audio. Additionally, for STG, the transformer block indices where self-attention is skipped need to be specified via the `spatio_temporal_guidance_blocks` argument. In addition, the LTX-2.X pipelines also support [guidance rescaling](https://huggingface.co/papers/2305.08891) to help reduce over-exposure, which can be a problem when the guidance scales are set to high values.
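As a hedged sketch of how these arguments might be passed in practice: the pipeline class and checkpoint below are placeholders (check the actual LTX-2 pipeline docs for the real names), and the block indices are purely illustrative, but the keyword argument names and scale values come from the snippet quoted earlier in this thread. Collecting them in a dict keeps the call readable:

```python
# Guidance settings from the discussion above; the block indices are
# illustrative placeholders, not recommended values.
guidance_kwargs = {
    "guidance_scale": 3.0,   # classifier-free guidance; disabled at 1.0
    "stg_scale": 1.0,        # spatio-temporal guidance; 0.0 (not 1.0) disables it
    "modality_scale": 3.0,   # modality isolation guidance; disabled at 1.0
    "spatio_temporal_guidance_blocks": [1, 2],  # illustrative indices only
}

# Hypothetical usage (class and checkpoint names are assumptions):
# pipe = LTX2Pipeline.from_pretrained("<ltx-2.3-checkpoint>")
# output = pipe(prompt="...", num_inference_steps=30, **guidance_kwargs)
print(sorted(guidance_kwargs))
```

Since the scales can be set separately for video and audio, the same pattern would apply per modality; again, consult the pipeline signature for the exact argument names.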
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Merging as the CI failures are unrelated.
* Update LTX-2 docs to cover multimodal guidance and prompt enhancement

* Apply suggestions from code review

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Apply reviewer feedback

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
What does this PR do?
This PR updates the LTX-2 docs to cover multimodal guidance and prompt enhancement, which were added with LTX-2.3 model support in #13217. Additionally, the LTX-2.X official default negative prompt, T2V system prompt, and I2V system prompt have been added to `src/diffusers/pipelines/ltx2/utils.py` to make it easier to prompt the model for inference.

Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@sayakpaul
@yiyixuxu
@stevhliu