
Update LTX-2 Docs to Cover LTX-2.3 Models #13337

Merged
dg845 merged 4 commits into main from ltx2-3-update-doc-examples on Mar 27, 2026

Conversation

dg845 (Collaborator) commented Mar 26, 2026

What does this PR do?

This PR updates the LTX-2 docs to cover multimodal guidance and prompt enhancement, which were added with LTX-2.3 model support in #13217. Additionally, the LTX-2.X official default negative prompt, T2V system prompt, and I2V system prompt have been added to src/diffusers/pipelines/ltx2/utils.py to make it easier to prompt the model for inference.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sayakpaul
@yiyixuxu
@stevhliu

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@dg845 dg845 requested a review from sayakpaul March 26, 2026 06:27
Comment thread docs/source/en/api/pipelines/ltx2.md Outdated

```py
    num_inference_steps=30,
    guidance_scale=3.0,  # Recommended LTX-2.3 guidance parameters
    stg_scale=1.0,  # Note that 0.0 (not 1.0) means that STG is disabled (all other guidance is disabled at 1.0)
    modality_scale=3.0,
)
```

Member:

What is `modality_scale`?

Member:

i think this refers to the modality isolation guidance?

## Prompt Enhancement
Member:

Wonder if we could low-key showcase our prompt enhancement custom block powered by Gemini?

dg845 (Collaborator Author):

The LTX-2.3 model seems to be quite sensitive to the input prompt in terms of sample quality. Since the current GeminiPromptExpander doesn't accept a `system_prompt` argument to guide the prompt expansion, I think it may not work well with LTX-2.3, because the expanded prompts may still be out of distribution.
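To make the limitation concrete, here is a hypothetical sketch of what a system-prompt-aware expansion request could look like. All names and strings below are invented for illustration; this is not the current Diffusers `GeminiPromptExpander` API.

```python
# Hypothetical sketch -- all names and strings here are invented for
# illustration and are NOT the current Diffusers GeminiPromptExpander API.

def build_expansion_request(system_prompt: str, user_prompt: str) -> list[dict]:
    """Assemble a chat-style request whose system message steers the
    expansion toward the target model's prompt distribution."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Expand this prompt for video generation: {user_prompt}"},
    ]

# Example: steer expansion toward detailed, cinematic video prompts.
messages = build_expansion_request(
    "You rewrite short prompts into detailed, cinematic video prompts.",
    "a cat playing piano",
)
```

Without a system message like this, the expander has no way to target a specific model's preferred prompt style, which is the concern raised above.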

@stevhliu (Member) left a comment:

very nice, thanks for updating!

Comment thread docs/source/en/api/pipelines/ltx2.md Outdated

```diff
-LTX-2 is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model. It brings together the core building blocks of modern video generation, with open weights and a focus on practical, local execution.
+[LTX-2](https://arxiv.org/abs/2601.03233) is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model. It brings together the core building blocks of modern video generation, with open weights and a focus on practical, local execution.
```

Member:

Suggested change:

```diff
-[LTX-2](https://arxiv.org/abs/2601.03233) is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model. It brings together the core building blocks of modern video generation, with open weights and a focus on practical, local execution.
+[LTX-2](https://arxiv.org/abs/2601.03233) is a DiT-based foundation model designed to generate synchronized video and audio within a single model. It brings together the core building blocks of modern video generation, with open weights and a focus on practical, local execution.
```
Comment thread docs/source/en/api/pipelines/ltx2.md Outdated
2. **Spatio-Temporal Guidance (STG)**: [STG](https://arxiv.org/pdf/2411.18664) moves away from a perturbed output created by short-cutting self-attention operations by substituting in the attention values instead. The idea is that this creates sharper videos and better spatiotemporal consistency.
3. **Modality Isolation Guidance**: this moves away from a perturbed output created by disabling cross-modality (audio-to-video and video-to-audio) cross attention. This guidance is more specific to [LTX-2.X](https://arxiv.org/pdf/2601.03233) models, with the idea that this produces better consistency between the generated audio and video.

These are controlled by the `guidance_scale`, `stg_scale`, and `modality_scale` arguments, respectively, and can be set separately for video and audio. Additionally, for STG, the transformer block indices where self-attention is skipped needs to be specified via the `spatio_temporal_guidance_blocks` argument. In addition, the LTX-2.X pipelines also support [guidance rescaling](https://arxiv.org/abs/2305.08891) to help reduce over-exposure, which can be a problem when the guidance scales are set to high values.
Member:

Suggested change:

```diff
-These are controlled by the `guidance_scale`, `stg_scale`, and `modality_scale` arguments, respectively, and can be set separately for video and audio. Additionally, for STG, the transformer block indices where self-attention is skipped needs to be specified via the `spatio_temporal_guidance_blocks` argument. In addition, the LTX-2.X pipelines also support [guidance rescaling](https://arxiv.org/abs/2305.08891) to help reduce over-exposure, which can be a problem when the guidance scales are set to high values.
+These are controlled by the `guidance_scale`, `stg_scale`, and `modality_scale` arguments, and can be set separately for video and audio. Additionally, for STG, the transformer block indices where self-attention is skipped needs to be specified via the `spatio_temporal_guidance_blocks` argument. In addition, the LTX-2.X pipelines also support [guidance rescaling](https://huggingface.co/papers/2305.08891) to help reduce over-exposure, which can be a problem when the guidance scales are set to high values.
```
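As a side note for readers of this thread: guidance signals like these are commonly combined additively, with guidance rescaling applied to the result. Below is a minimal sketch of that arithmetic, assuming the usual additive combination and the rescaling rule from the linked paper; the exact formulation inside the Diffusers LTX-2 pipelines may differ.

```python
import statistics

def apply_guidance(pred_cond, pred_uncond, pred_stg, pred_modality,
                   guidance_scale, stg_scale, modality_scale):
    """Combine predictions element-wise (flat lists of floats stand in for
    latent tensors).

    pred_stg: prediction with self-attention short-cut (STG perturbation).
    pred_modality: prediction with cross-modality attention disabled.
    """
    return [
        u
        + guidance_scale * (c - u)   # classifier-free guidance
        + stg_scale * (c - s)        # spatio-temporal guidance
        + modality_scale * (c - m)   # modality isolation guidance
        for c, u, s, m in zip(pred_cond, pred_uncond, pred_stg, pred_modality)
    ]

def rescale_noise(noise_guided, noise_cond, guidance_rescale):
    """Guidance rescaling: pull the guided prediction's standard deviation
    back toward the conditional prediction's, blended by guidance_rescale
    (0.0 = no rescaling, 1.0 = full rescaling)."""
    ratio = statistics.pstdev(noise_cond) / statistics.pstdev(noise_guided)
    return [
        guidance_rescale * (x * ratio) + (1.0 - guidance_rescale) * x
        for x in noise_guided
    ]
```

Note that with `stg_scale=0.0` the STG term drops out entirely, consistent with the remark in the quoted snippet that 0.0 (not 1.0) disables STG, while the classifier-free guidance term reduces to the conditional prediction at `guidance_scale=1.0`.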

Comment thread docs/source/en/api/pipelines/ltx2.md Outdated
dg845 and others added 2 commits March 26, 2026 17:18
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
dg845 (Collaborator Author) commented Mar 27, 2026

Merging as the CI failures are unrelated.

@dg845 dg845 merged commit 7298f5b into main Mar 27, 2026
10 of 12 checks passed
@dg845 dg845 deleted the ltx2-3-update-doc-examples branch March 27, 2026 00:51
terarachang pushed a commit to terarachang/diffusers that referenced this pull request Apr 30, 2026
* Update LTX-2 docs to cover multimodal guidance and prompt enhancement

* Apply suggestions from code review

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Apply reviewer feedback

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>