[Diffusion] LTX-2 Support PR2#17496
Conversation
Summary of Changes
Hello @gmixiaojin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly expands the diffusion capabilities by integrating the LTX-2 Video & Audio Joint model. It introduces a new pipeline that orchestrates text and image inputs to generate high-quality video and audio outputs. The changes encompass new model architectures, optimized data handling for distributed environments, and detailed configuration for various multimodal components, laying the groundwork for advanced video and audio synthesis.
Code Review
This pull request introduces comprehensive support for the LTX-2 video and audio joint generation model. The changes are extensive, adding new configurations, model implementations for various components like DiT, VAEs, encoders, and a vocoder, as well as a new pipeline to orchestrate the generation process. The code is generally well-structured. My review focuses on improving code clarity and maintainability. I've identified a few areas where refactoring could enhance readability and consistency, such as replacing magic numbers with named constants and extracting complex logic into helper functions.
```python
(".gate_up_proj", ".gate_proj", 0),  # type: ignore
(".gate_up_proj", ".up_proj", 1),  # type: ignore
```
The type hint for stacked_params_mapping is list[tuple[str, str, str]], but you are using integers 0 and 1 for the shard IDs here, which is inconsistent. While the weight loading logic seems to handle this, it would be better for type safety and to avoid future confusion to use strings "0" and "1" to match the type hint.
```python
(".gate_up_proj", ".gate_proj", "0"),
(".gate_up_proj", ".up_proj", "1"),
```

```python
audio_latents_mean = getattr(audio_vae, "latents_mean", None)
audio_latents_std = getattr(audio_vae, "latents_std", None)
if (
    isinstance(audio_latents_mean, torch.Tensor)
    and isinstance(audio_latents_std, torch.Tensor)
    and audio_latents_mean.numel() == audio_latents_std.numel()
):
    audio_latents_mean = audio_latents_mean.to(
        device=audio_latents.device, dtype=audio_latents.dtype
    )
    audio_latents_std = audio_latents_std.to(
        device=audio_latents.device, dtype=audio_latents.dtype
    )
    if audio_latents.ndim == 3:
        if audio_latents.shape[-1] != audio_latents_mean.numel():
            raise ValueError(
                f"audio_latents last dim {audio_latents.shape[-1]} "
                f"does not match audio_vae stats {audio_latents_mean.numel()}"
            )
        audio_latents = audio_latents * audio_latents_std.view(
            1, 1, -1
        ) + audio_latents_mean.view(1, 1, -1)
    elif audio_latents.ndim == 2:
        if audio_latents.shape[-1] != audio_latents_mean.numel():
            raise ValueError(
                f"audio_latents last dim {audio_latents.shape[-1]} "
                f"does not match audio_vae stats {audio_latents_mean.numel()}"
            )
        audio_latents = audio_latents * audio_latents_std.view(
            1, -1
        ) + audio_latents_mean.view(1, -1)
    else:
        audio_latents = audio_latents * audio_latents_std + audio_latents_mean
```
This block of code for denormalizing audio latents is quite complex and makes the _unpad_and_unpack_latents method very long and hard to read. Consider extracting this logic into a separate helper method to improve readability and modularity. Additionally, there's an unused static method _denormalize_audio_latents with a similar purpose. It would be good to either use it if it's correct or remove it to avoid confusion and code duplication.
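The suggested extraction could look roughly like the sketch below. This is a hypothetical helper, not the PR's code: the name `denormalize_audio_latents` and the standalone-function form (rather than a method) are assumptions, but the broadcasting behavior mirrors the three branches in the diff, collapsing the 2-D and 3-D cases into one view shape.

```python
import torch

# Hypothetical helper extracted from the denormalization block. The 2-D and
# 3-D branches differ only in how many leading singleton dims the stats need,
# so a single computed view shape covers both.
def denormalize_audio_latents(
    audio_latents: torch.Tensor,
    latents_mean: torch.Tensor,
    latents_std: torch.Tensor,
) -> torch.Tensor:
    latents_mean = latents_mean.to(
        device=audio_latents.device, dtype=audio_latents.dtype
    )
    latents_std = latents_std.to(
        device=audio_latents.device, dtype=audio_latents.dtype
    )
    if audio_latents.ndim in (2, 3):
        if audio_latents.shape[-1] != latents_mean.numel():
            raise ValueError(
                f"audio_latents last dim {audio_latents.shape[-1]} "
                f"does not match audio_vae stats {latents_mean.numel()}"
            )
        # e.g. (1, 1, -1) for ndim == 3, (1, -1) for ndim == 2.
        view_shape = (1,) * (audio_latents.ndim - 1) + (-1,)
        latents_mean = latents_mean.view(view_shape)
        latents_std = latents_std.view(view_shape)
    return audio_latents * latents_std + latents_mean
```

With this in place, `_unpad_and_unpack_latents` would shrink to a single call, and it could also replace (or be reconciled with) the unused `_denormalize_audio_latents` static method mentioned above.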
```python
video_per_layer_ca_scale_shift = self.video_a2v_cross_attn_scale_shift_table[
    :4, :
]
video_per_layer_ca_gate = self.video_a2v_cross_attn_scale_shift_table[4:, :]

video_ca_scale_shift_table = (
    video_per_layer_ca_scale_shift[None, None, :, :].to(
        dtype=temb_ca_scale_shift.dtype, device=temb_ca_scale_shift.device
    )
    + temb_ca_scale_shift.reshape(
        batch_size, temb_ca_scale_shift.shape[1], 4, -1
    )
).unbind(dim=2)
video_ca_gate = (
    video_per_layer_ca_gate[None, None, :, :].to(
        dtype=temb_ca_gate.dtype, device=temb_ca_gate.device
    )
    + temb_ca_gate.reshape(batch_size, temb_ca_gate.shape[1], 1, -1)
).unbind(dim=2)
```
The magic number 4 is used multiple times in this block for slicing and reshaping tensors related to cross-attention scale and shift. This makes the code harder to understand and maintain. It would be better to define a constant for this value, for example, NUM_CA_SCALE_SHIFT_PARAMS = 4, and use it here. This would improve readability and make future modifications easier.
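A sketch of that refactor, under stated assumptions: `scale_shift_table` and the shapes below are stand-ins for `video_a2v_cross_attn_scale_shift_table` and the tensors in the diff, and the constant name follows the suggestion above.

```python
import torch

# Hypothetical refactor: name the slice width instead of repeating the
# literal 4. The table packs the scale/shift rows first, gate rows after.
NUM_CA_SCALE_SHIFT_PARAMS = 4

hidden = 32  # stand-in for the model's inner dim
scale_shift_table = torch.randn(NUM_CA_SCALE_SHIFT_PARAMS + 1, hidden)
per_layer_ca_scale_shift = scale_shift_table[:NUM_CA_SCALE_SHIFT_PARAMS, :]
per_layer_ca_gate = scale_shift_table[NUM_CA_SCALE_SHIFT_PARAMS:, :]

# The same constant drives the reshape, so the slice and the reshape stay
# in sync if the number of packed parameters ever changes.
batch_size, seq_len = 2, 16
temb_ca_scale_shift = torch.randn(
    batch_size, seq_len, NUM_CA_SCALE_SHIFT_PARAMS * hidden
)
video_ca_scale_shift = (
    per_layer_ca_scale_shift[None, None, :, :]
    + temb_ca_scale_shift.reshape(
        batch_size, seq_len, NUM_CA_SCALE_SHIFT_PARAMS, -1
    )
).unbind(dim=2)
```

Unbinding along dim 2 then yields one `(batch, seq, hidden)` tensor per packed parameter.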
Please fix lint.
fixed.
Add decoding_av.py, denoising_av.py, latent_preparation_av.py, and text_connector.py to this PR, moved from PR1.
/tag-and-rerun-ci
Hi @gmixiaojin, great work! Curious whether the VAE supports SP?
Co-authored-by: Fan Yin <1106310035@qq.com> Co-authored-by: Yuhao Yang <47235274+yhyang201@users.noreply.github.com>
The current commit appears to have an issue with audio synthesis: the background noise is very loud, and the same prompt fails to generate normal audio for the video.
Thank you for your feedback.
Motivation
Support the LTX-2 Video & Audio Joint model.
This PR involves new config files and modeling files only.
How to use
generate
Accuracy Tests
1:1 alignment with Diffusers.
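The alignment claim could be checked with a script along these lines. This is a sketch only: in a real test, `ours` and `reference` would be decoded outputs from the two pipelines run with the same prompt, seed, and scheduler settings, so stand-in tensors are used here.

```python
import torch

# Hypothetical parity check against the Diffusers reference implementation.
# Stand-in tensors replace the actual pipeline outputs.
torch.manual_seed(0)
ours = torch.randn(1, 3, 8, 64, 64)  # (batch, channels, frames, height, width)
reference = ours.clone()

max_abs_diff = (ours - reference).abs().max().item()
matches = torch.allclose(ours, reference, atol=1e-3)
```

Reporting `max_abs_diff` alongside the boolean makes tolerance regressions easier to triage than a bare pass/fail.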
Checklist
- LTX-2 `generate` and the `server`
- LTX-2 Stack PR
Review Process
`/tag-run-ci-label`, `/rerun-failed-ci`, `/tag-and-rerun-ci`