[Diffusion] LTX-2 Support PR1 #17495
Conversation
Summary of Changes

Hello @gmixiaojin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request integrates full support for the LTX-2 Video & Audio Joint model, allowing the system to handle and generate both video and audio data seamlessly. The changes span core components, from model loading and positional embeddings to the entire diffusion pipeline, ensuring that multimodal generation requests are processed efficiently and produce synchronized outputs.
Code Review
This pull request adds support for the LTX-2 model, enabling joint video and audio generation. The changes are extensive, introducing new pipeline stages, component loaders, and updating data structures to handle audio. The implementation appears solid and well-structured. I've found a potential bug in the text encoding stage that could lead to missing attention masks, and a couple of areas for code improvement related to style and maintainability. Overall, this is a great contribution.
```python
if batch.prompt_attention_mask is None:
    batch.prompt_attention_mask = []
    for am in prompt_masks_list:
        batch.prompt_attention_mask.append(am)
```
The current logic for appending attention masks will fail if `batch.prompt_attention_mask` is already an empty list (`[]`) rather than `None`, since the `if` condition would be false and the loop would be skipped. A more robust approach is to ensure the list exists and then extend it:
```python
if batch.prompt_attention_mask is None:
    batch.prompt_attention_mask = []
batch.prompt_attention_mask.extend(prompt_masks_list)
```

```python
if batch.negative_attention_mask is None:
    batch.negative_attention_mask = []
    for nm in neg_masks_list:
        batch.negative_attention_mask.append(nm)
```
Similar to `prompt_attention_mask`, the logic for `negative_attention_mask` is not robust: if `batch.negative_attention_mask` is an empty list, the new masks won't be added. It's better to ensure the list exists and then extend it.
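The failure mode called out in both comments can be reproduced in isolation. In this sketch the helper names and string masks are stand-ins for illustration, not code from the PR:

```python
def append_masks_buggy(existing, new_masks):
    # Mirrors the reviewed code: the append loop is guarded by the None
    # check, so an already-initialized empty list never receives new masks.
    if existing is None:
        existing = []
        for m in new_masks:
            existing.append(m)
    return existing


def append_masks_robust(existing, new_masks):
    # Suggested shape: ensure the list exists, then always extend it.
    if existing is None:
        existing = []
    existing.extend(new_masks)
    return existing


masks = ["mask_a", "mask_b"]  # stand-ins for attention-mask tensors
print(append_masks_buggy(None, masks))   # ['mask_a', 'mask_b']
print(append_masks_buggy([], masks))     # [] -- the masks are silently dropped
print(append_masks_robust([], masks))    # ['mask_a', 'mask_b']
```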
```python
if batch.negative_attention_mask is None:
    batch.negative_attention_mask = []
batch.negative_attention_mask.extend(neg_masks_list)
```

```diff
 def post_process_sample(
-    sample: torch.Tensor,
+    sample: Any,
     data_type: DataType,
     fps: int,
     save_output: bool = True,
-    save_file_path: str = None,
+    save_file_path: Optional[str] = None,
+    audio_sample_rate: Optional[int] = None,
 ):
```
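One detail in this signature change is the move from `save_file_path: str = None` to `Optional[str] = None`: a `None` default is not a valid `str`, so type checkers such as mypy flag the former. A minimal, self-contained illustration (the function name and fallback value here are hypothetical):

```python
from typing import Optional


# `path: str = None` would be rejected by type checkers, since None is not
# a str. Optional[str] states the real contract: a string, or None to
# request a default path.
def resolve_save_path(path: Optional[str] = None) -> str:
    if path is None:
        path = "output.mp4"  # hypothetical fallback, for illustration only
    return path


print(resolve_save_path())            # output.mp4
print(resolve_save_path("clip.mp4"))  # clip.mp4
```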
The `post_process_sample` function contains several local imports (`numpy`, `shutil`, `subprocess`, `tempfile`, `scipy.io.wavfile`, `imageio_ffmpeg`). It's generally better practice to place all imports at the top of the file for clarity, consistency, and to avoid potential overhead from repeated imports if this function is called in a loop. Please consider moving these imports to the top of `python/sglang/multimodal_gen/runtime/entrypoints/utils.py`.
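As an aside on the overhead point: after the first execution, a repeated local import is re-resolved through `sys.modules` on every call (cheap, but nonzero), and it always yields the same module object as a top-level import. A small stand-alone sketch using `json` as a stand-in module:

```python
import timeit

import json as json_top  # bound once at module load


def with_local_import():
    # The import statement runs on every call; after the first call it is a
    # sys.modules lookup plus a name binding rather than a full module load,
    # but that still costs a little each time.
    import json
    return json


def with_top_level_import():
    return json_top


# Both paths yield the very same module object.
assert with_local_import() is with_top_level_import()

# Timing numbers vary by machine; the local-import variant is the slower one.
print(timeit.timeit(with_local_import, number=100_000))
print(timeit.timeit(with_top_level_import, number=100_000))
```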
```python
    return model


class AdapterLoader(ComponentLoader):
```
This file is too large; consider splitting it into a `component_loader` folder in a follow-up.
Sure, I think we will follow up after the PR.
Could you put the stages that are used only by LTX-2 here first? You could create an `ltx_2` folder and move them there.
As we discussed previously, we will refactor the code once the two PRs are merged.
Lint is also fixed @yhyang201 @mickqian
Remove `decoding_av.py`, `denoising_av.py`, `latent_preparation_av.py`, and `text_connector.py` from this PR, as they should be in PR2.
/tag-and-rerun-ci
/tag-and-rerun-ci |
Co-authored-by: FlamingoPg <1106310035@qq.com>
Motivation
Support LTX-2 Video & Audio Joint model.
Contribution
Review Process
`/tag-run-ci-label`, `/rerun-failed-ci`, `/tag-and-rerun-ci`