
[Diffusion] LTX-2 Support PR2 #17496

Merged
mickqian merged 7 commits into sgl-project:main from gmixiaojin:pr2-ltx2
Jan 24, 2026

Conversation

@gmixiaojin
Contributor

@gmixiaojin gmixiaojin commented Jan 21, 2026

Motivation

Support LTX-2 Video & Audio Joint model.

This PR involves new config files and modeling files only.

How to use

generate

  • T2V
sglang generate \
  --model-path LTX-2_Model \
  --prompt "An astronaut hatches from a fragile egg on the surface of the Moon, the shell cracking and peeling apart in gentle low-gravity motion. Fine lunar dust lifts and drifts outward with each movement, floating in slow arcs before settling back onto the ground. The astronaut pushes free in a deliberate, weightless motion, small fragments of the egg tumbling and spinning through the air. In the background, the deep darkness of space subtly shifts as stars glide with the camera's movement, emphasizing vast depth and scale. The camera performs a smooth, cinematic slow push-in, with natural parallax between the foreground dust, the astronaut, and the distant starfield. Ultra-realistic detail, physically accurate low-gravity motion, cinematic lighting, and a breath-taking, movie-like shot." \
  --negative-prompt "shaky, glitchy, low quality, worst quality, deformed, distorted, disfigured, motion smear, motion artifacts, fused fingers, bad anatomy, weird hand, ugly, transition, static." \
  --width 768 --height 512 --num-frames 121 --fps 24 \
  --num-inference-steps 40 --guidance-scale 4.0 \
  --seed 1024 \
  --output-path ./output/ --output-file-name ltx2_sample.mp4
  • TI2V
sglang generate \
  --model-path LTX-2_Model \
  --prompt "An astronaut hatches from a fragile egg on the surface of the Moon, the shell cracking and peeling apart in gentle low-gravity motion. Fine lunar dust lifts and drifts outward with each movement, floating in slow arcs before settling back onto the ground. The astronaut pushes free in a deliberate, weightless motion, small fragments of the egg tumbling and spinning through the air. In the background, the deep darkness of space subtly shifts as stars glide with the camera's movement, emphasizing vast depth and scale. The camera performs a smooth, cinematic slow push-in, with natural parallax between the foreground dust, the astronaut, and the distant starfield. Ultra-realistic detail, physically accurate low-gravity motion, cinematic lighting, and a breath-taking, movie-like shot." \
  --image-path "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg" \
  --negative-prompt "shaky, glitchy, low quality, worst quality, deformed, distorted, disfigured, motion smear, motion artifacts, fused fingers, bad anatomy, weird hand, ugly, transition, static." \
  --width 768 --height 512 --num-frames 121 --fps 24 \
  --num-inference-steps 40 --guidance-scale 4.0 \
  --seed 1024 \
  --output-path ./output/ --output-file-name ltx2_sample.mp4
  • SP, TP & Cache DiT (8 GPUs)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
SGLANG_CACHE_DIT_ENABLED=true \
SGLANG_CACHE_DIT_SCM_PRESET=medium \
sglang generate \
  --model-path LTX-2_Model \
  --num-gpus 8 \
  --tp-size 2 \
  --sp-degree 4 \
  --ulysses-degree 2 \
  --ring-degree 2 \
  --prompt "An astronaut hatches from a fragile egg on the surface of the Moon, the shell cracking and peeling apart in gentle low-gravity motion. Fine lunar dust lifts and drifts outward with each movement, floating in slow arcs before settling back onto the ground. The astronaut pushes free in a deliberate, weightless motion, small fragments of the egg tumbling and spinning through the air. In the background, the deep darkness of space subtly shifts as stars glide with the camera's movement, emphasizing vast depth and scale. The camera performs a smooth, cinematic slow push-in, with natural parallax between the foreground dust, the astronaut, and the distant starfield. Ultra-realistic detail, physically accurate low-gravity motion, cinematic lighting, and a breath-taking, movie-like shot." \
  --image-path "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg" \
  --negative-prompt "shaky, glitchy, low quality, worst quality, deformed, distorted, disfigured, motion smear, motion artifacts, fused fingers, bad anatomy, weird hand, ugly, transition, static." \
  --width 768 --height 512 --num-frames 121 --fps 24 \
  --num-inference-steps 40 --guidance-scale 4.0 \
  --seed 1024 \
  --output-path ./output/ --output-file-name ltx2_sample.mp4
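
As a sanity check on the degrees in the 8-GPU example above, the GPU count must factor as tp_size × sp_degree, and the sequence-parallel degree itself factors as ulysses_degree × ring_degree. The sketch below is only an illustration of that arithmetic; the helper name and the exact validation sglang performs internally are assumptions, not the PR's code.

```python
def check_parallel_degrees(
    num_gpus: int,
    tp_size: int,
    sp_degree: int,
    ulysses_degree: int,
    ring_degree: int,
) -> None:
    """Validate that the parallelism degrees multiply out to the GPU count.

    Mirrors the arithmetic implied by the 8-GPU example
    (tp=2, sp=4 with ulysses=2 * ring=2). Hypothetical helper, not sglang API.
    """
    if ulysses_degree * ring_degree != sp_degree:
        raise ValueError(
            f"sp_degree {sp_degree} != ulysses {ulysses_degree} * ring {ring_degree}"
        )
    if tp_size * sp_degree != num_gpus:
        raise ValueError(
            f"num_gpus {num_gpus} != tp_size {tp_size} * sp_degree {sp_degree}"
        )

# The configuration from the command above passes the check.
check_parallel_degrees(num_gpus=8, tp_size=2, sp_degree=4,
                       ulysses_degree=2, ring_degree=2)
```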

Accuracy Tests

Results align 1:1 with Diffusers.

Checklist

LTX-2

  • Accuracy alignment achieved
  • Support TI2V inference
  • Implement Gemma 3 support using SGLang (not Transformers)
  • Gemma 3: support TP (tensor parallelism)
  • Support RoPE / SP
  • Wire up / validate Cache-DiT
  • Support TP for the primary model
  • Verify correctness of generate and the server
  • Ensure structural and code-style consistency across modules
  • Polish code style; remove redundant comments
  • Clean up debug code
  • Fix lint and update to latest main
  • Follow up on PR reviews

LTX-2 Stack PR

  • Support one-stage generation
  • Support distilled generation
  • Support IC LoRA
  • Support keyframe_interpolation
  • Support two-stage generation

Review Process

  1. Ping Merge Oncalls to start the PR flow. See [PR Merge Process].
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • \tag-run-ci-label, \rerun-failed-ci, \tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@github-actions github-actions Bot added the diffusion SGLang Diffusion label Jan 21, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello @gmixiaojin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the diffusion capabilities by integrating the LTX-2 Video & Audio Joint model. It introduces a new pipeline that orchestrates text and image inputs to generate high-quality video and audio outputs. The changes encompass new model architectures, optimized data handling for distributed environments, and detailed configuration for various multimodal components, laying the groundwork for advanced video and audio synthesis.

Highlights

  • LTX-2 Model Support: Introduced comprehensive support for the LTX-2 Video & Audio Joint model, enabling text-to-video (T2V) and text-image-to-video (TI2V) generation capabilities.
  • Gemma 3 Integration: Implemented and integrated the Gemma 3 text encoder, including its specific attention mechanisms and RMS normalization, for processing text prompts.
  • Distributed Inference Enhancements: Added support for Tensor Parallelism (TP) and Sequence Parallelism (SP) across various components, including the Gemma 3 text encoder and the primary LTX-2 transformer, to optimize performance on multi-GPU setups.
  • New Architecture Configurations: New configuration files and runtime models were added for LTX-2 specific adapters, DiT (Diffusion Transformer), audio/video VAEs, and vocoder components, ensuring modularity and extensibility.
  • Advanced Latent Handling: Developed sophisticated logic for packing, unpacking, and sharding video and audio latents, including specialized rotary positional embeddings (RoPE) for both modalities, and tiling/framewise processing for VAEs.


@gmixiaojin gmixiaojin marked this pull request as ready for review January 21, 2026 12:17
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces comprehensive support for the LTX-2 video and audio joint generation model. The changes are extensive, adding new configurations, model implementations for various components like DiT, VAEs, encoders, and a vocoder, as well as a new pipeline to orchestrate the generation process. The code is generally well-structured. My review focuses on improving code clarity and maintainability. I've identified a few areas where refactoring could enhance readability and consistency, such as replacing magic numbers with named constants and extracting complex logic into helper functions.

Comment on lines +68 to +69
(".gate_up_proj", ".gate_proj", "0"), # type: ignore
(".gate_up_proj", ".up_proj", "1"), # type: ignore
Contributor


medium

The type hint for stacked_params_mapping is list[tuple[str, str, str]], but you are using integers 0 and 1 for the shard IDs here, which is inconsistent. While the weight loading logic seems to handle this, it would be better for type safety and to avoid future confusion to use strings "0" and "1" to match the type hint.

            (".gate_up_proj", ".gate_proj", "0"),
            (".gate_up_proj", ".up_proj", "1"),

Comment on lines +515 to +548
audio_latents_mean = getattr(audio_vae, "latents_mean", None)
audio_latents_std = getattr(audio_vae, "latents_std", None)
if (
    isinstance(audio_latents_mean, torch.Tensor)
    and isinstance(audio_latents_std, torch.Tensor)
    and audio_latents_mean.numel() == audio_latents_std.numel()
):
    audio_latents_mean = audio_latents_mean.to(
        device=audio_latents.device, dtype=audio_latents.dtype
    )
    audio_latents_std = audio_latents_std.to(
        device=audio_latents.device, dtype=audio_latents.dtype
    )
    if audio_latents.ndim == 3:
        if audio_latents.shape[-1] != audio_latents_mean.numel():
            raise ValueError(
                f"audio_latents last dim {audio_latents.shape[-1]} "
                f"does not match audio_vae stats {audio_latents_mean.numel()}"
            )
        audio_latents = audio_latents * audio_latents_std.view(
            1, 1, -1
        ) + audio_latents_mean.view(1, 1, -1)
    elif audio_latents.ndim == 2:
        if audio_latents.shape[-1] != audio_latents_mean.numel():
            raise ValueError(
                f"audio_latents last dim {audio_latents.shape[-1]} "
                f"does not match audio_vae stats {audio_latents_mean.numel()}"
            )
        audio_latents = audio_latents * audio_latents_std.view(
            1, -1
        ) + audio_latents_mean.view(1, -1)
    else:
        audio_latents = audio_latents * audio_latents_std + audio_latents_mean

Contributor


medium

This block of code for denormalizing audio latents is quite complex and makes the _unpad_and_unpack_latents method very long and hard to read. Consider extracting this logic into a separate helper method to improve readability and modularity. Additionally, there's an unused static method _denormalize_audio_latents with a similar purpose. It would be good to either use it if it's correct or remove it to avoid confusion and code duplication.

Comment on lines +833 to +851
video_per_layer_ca_scale_shift = self.video_a2v_cross_attn_scale_shift_table[
    :4, :
]
video_per_layer_ca_gate = self.video_a2v_cross_attn_scale_shift_table[4:, :]

video_ca_scale_shift_table = (
    video_per_layer_ca_scale_shift[None, None, :, :].to(
        dtype=temb_ca_scale_shift.dtype, device=temb_ca_scale_shift.device
    )
    + temb_ca_scale_shift.reshape(
        batch_size, temb_ca_scale_shift.shape[1], 4, -1
    )
).unbind(dim=2)
video_ca_gate = (
    video_per_layer_ca_gate[None, None, :, :].to(
        dtype=temb_ca_gate.dtype, device=temb_ca_gate.device
    )
    + temb_ca_gate.reshape(batch_size, temb_ca_gate.shape[1], 1, -1)
).unbind(dim=2)
Contributor


medium

The magic number 4 is used multiple times in this block for slicing and reshaping tensors related to cross-attention scale and shift. This makes the code harder to understand and maintain. It would be better to define a constant for this value, for example, NUM_CA_SCALE_SHIFT_PARAMS = 4, and use it here. This would improve readability and make future modifications easier.
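
Applying the suggestion, the slice boundary could be named once at module level. A sketch, using the constant name from the reviewer's example; the split helper is illustrative and not code from the PR.

```python
import torch

# The first NUM_CA_SCALE_SHIFT_PARAMS rows of the cross-attention table are
# scale/shift parameters; the remaining row is the gate.
NUM_CA_SCALE_SHIFT_PARAMS = 4

def split_ca_table(table: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Split a per-layer cross-attention table into scale/shift and gate rows."""
    scale_shift = table[:NUM_CA_SCALE_SHIFT_PARAMS, :]
    gate = table[NUM_CA_SCALE_SHIFT_PARAMS:, :]
    return scale_shift, gate
```

The same constant would then replace the literal `4` in the `reshape` calls, so a change to the table layout only needs to be made in one place.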

@mickqian mickqian mentioned this pull request Jan 21, 2026
18 tasks
@yhyang201
Collaborator

Please fix lint.

pip3 install pre-commit
pre-commit install
pre-commit run --all-files

@gmixiaojin
Contributor Author

Please fix lint.

pip3 install pre-commit
pre-commit install
pre-commit run --all-files

fixed.

Add decoding_av.py, denoising_av.py, latent_preparation_av.py,
and text_connector.py to this PR, moved from PR1.
@mickqian
Collaborator

/tag-and-rerun-ci

@mickqian mickqian merged commit d0919be into sgl-project:main Jan 24, 2026
246 of 259 checks passed
@liz-badada
Collaborator

Hi @gmixiaojin, great work!

Curious whether the VAE supports SP?

Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026
Co-authored-by: Fan Yin <1106310035@qq.com>
Co-authored-by: Yuhao Yang <47235274+yhyang201@users.noreply.github.com>
@davidfrz

The current commit appears to have an issue with audio synthesis—the background noise is very loud, and the same prompt fails to generate normal video audio.

@gmixiaojin
Contributor Author

The current commit appears to have an issue with audio synthesis—the background noise is very loud, and the same prompt fails to generate normal video audio.

Thank you for your feedback.
We opened a third PR #19151 fixing both video and audio generation bugs.


Labels

diffusion SGLang Diffusion run-ci


6 participants