Fix QwenImage txt_seq_lens handling #12702
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
just a few comments, not a full review:
```python
        raise ValueError(f"`txt_seq_lens` must have length {batch_size}, but got {len(txt_seq_lens)} instead.")
    text_seq_len = max(text_seq_len, max(txt_seq_lens))
elif encoder_hidden_states_mask is not None:
    text_seq_len = max(text_seq_len, int(encoder_hidden_states_mask.sum(dim=1).max().item()))
```
This only works if the attention mask has the form [True, True, True, ..., False, False, False]. While that is the most common case for text attention masks, it does not have to hold.
If the mask is [True, False, True, False, True, False], `self.pos_embed` receives an incorrect sequence length.
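A minimal standalone sketch of the issue (hypothetical tensors, not the actual diffusers code path): summing the mask undercounts the length that the positional embedding has to cover when the True entries are not contiguous.

```python
import torch

mask = torch.tensor([[True, False, True, False, True, False]])

# What the snippet above derives: number of valid tokens per sample.
num_valid_tokens = int(mask.sum(dim=1).max().item())            # -> 3

# What the positional embedding actually needs: frequencies up to the last attended position.
last_attended_pos = int(mask.nonzero()[:, 1].max().item()) + 1  # -> 5

print(num_valid_tokens, last_attended_pos)  # 3 5: pos_embed would be built too short
```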
I think this is resolved, but I'll let you close it if you agree.
Thanks @dxqb, the idea was to remove `txt_seq_lens`.
yiyixuxu left a comment:
thanks a ton for the PR! @kashif
I left one question, let me know!
- Remove seq_lens parameter from dispatch_attention_fn
- Update varlen backends to extract seqlens from masks
- Update QwenImage to pass 2D joint_attention_mask
- Fix native backend to handle 2D boolean masks
- Fix sage_varlen seqlens_q to match seqlens_k for self-attention

Note: sage_varlen still producing black images, needs further investigation
…to txt_seq_lens
some benchmarks with various backends: code:
I'll send the small changes in your PR that I needed (like support for text sequence length per sample), but keep the PR overall separate, since it's a separate feature.
… attention update merge with attention selection, add FLASH_SPLIT
yiyixuxu left a comment:
thanks so much, I went over it again and left one piece of feedback
I will also ask the Qwen team to do a final review now :)
```python
encoder_hidden_states_mask=encoder_hidden_states_mask,
timestep=timestep,
img_shapes=img_shapes,
txt_seq_lens=txt_seq_lens,
```
should we also build joint mask for controlnet?
not merge-blocking though, I don't think it breaks anything
Good catch, I think we should add it.
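A rough sketch of what that could look like (the helper name and the text-first ordering follow the main transformer path and are assumptions here, not the actual ControlNet code): concatenate the text mask with an all-True image mask so padded text tokens stay masked in the joint attention.

```python
import torch

def build_joint_attention_mask(encoder_hidden_states_mask: torch.Tensor, image_seq_len: int) -> torch.Tensor:
    """Build a 2D (batch, text_seq + image_seq) boolean mask for joint attention."""
    batch_size = encoder_hidden_states_mask.shape[0]
    # All image tokens attend; only padded text tokens are masked out.
    image_mask = torch.ones(
        batch_size,
        image_seq_len,
        dtype=torch.bool,
        device=encoder_hidden_states_mask.device,
    )
    return torch.cat([encoder_hidden_states_mask.bool(), image_mask], dim=1)
```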
Co-authored-by: YiYi Xu <yixu310@gmail.com>
Quick update: this PR passed our internal regression tests. Since this is a large PR with valuable fixes, I suggest we merge it now and handle batch>1 support separately. What are your thoughts?

Below is a minimal reproduction script for the EditPlus pipeline error when batch_size > 1:

```python
import os

import torch
from diffusers import QwenImageEditPlusPipeline
from PIL import Image

torch.manual_seed(0)
pipeline = QwenImageEditPlusPipeline.from_pretrained("Qwen/Qwen-Image-Edit-2511")
print("pipeline loaded")
pipeline.to(torch.bfloat16)
pipeline.to('cuda')
pipeline.set_progress_bar_config(disable=None)

image0 = Image.open("input1.png").convert("RGB")
image1 = Image.open("input2.png").convert("RGB")
prompt = "merge picture 1 with picture 2"
inputs = {
    "image": [[image0, image1], [image0, image1]],
    "prompt": [prompt, prompt],
    "height": 1024,
    "width": 1024,
    "generator": torch.manual_seed(0),
    "true_cfg_scale": 4.0,
    "negative_prompt": " ",
    "num_inference_steps": 50,
    "guidance_scale": 1.0,
    "num_images_per_prompt": 1,
}

with torch.inference_mode():
    output = pipeline(**inputs)
    output_image = output.images[0]
    output_image.save("output_image_edit_plus.png")
    print("image saved at", os.path.abspath("output_image_edit_plus.png"))
```
Thanks a lot @naykun! WDYT about raising an error from the EditPlus pipeline when batch_size > 1? Would that work for you?
Sounds good to me! 😊
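For reference, a minimal sketch of such a guard (its placement inside the pipeline's input checks and the exact message are assumptions, not the eventual implementation):

```python
# Hypothetical check inside the EditPlus pipeline's input validation.
if batch_size > 1:
    raise ValueError(
        f"QwenImageEditPlusPipeline currently supports only batch_size=1, but got batch_size={batch_size}. "
        "Please run prompts one at a time until batched inputs are supported."
    )
```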
@sayakpaul @yiyixuxu Have you tested Context Parallel? It seems that this PR has caused an error in Context Parallel.

```
[rank3]: output = function_reference.forward(*args, **kwargs)
[rank3]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/workspace/dev/vipshop/diffusers/src/diffusers/models/transformers/transformer_qwenimage.py", line 959, in forward
[rank3]:   encoder_hidden_states, hidden_states = block(
[rank3]:                                          ^^^^^^
[rank3]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
[rank3]:   return self._call_impl(*args, **kwargs)
[rank3]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
[rank3]:   return forward_call(*args, **kwargs)
[rank3]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/workspace/dev/vipshop/diffusers/src/diffusers/models/transformers/transformer_qwenimage.py", line 696, in forward
[rank3]:   attn_output = self.attn(
[rank3]:                 ^^^^^^^^^^
[rank3]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
[rank3]:   return self._call_impl(*args, **kwargs)
[rank3]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
[rank3]:   return forward_call(*args, **kwargs)
[rank3]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/workspace/dev/vipshop/diffusers/src/diffusers/models/attention_processor.py", line 605, in forward
[rank3]:   return self.processor(
[rank3]:          ^^^^^^^^^^^^^^^
[rank3]: File "/workspace/dev/vipshop/diffusers/src/diffusers/models/transformers/transformer_qwenimage.py", line 535, in __call__
[rank3]:   txt_query = apply_rotary_emb_qwen(txt_query, txt_freqs, use_real=False)
[rank3]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/workspace/dev/vipshop/diffusers/src/diffusers/models/transformers/transformer_qwenimage.py", line 140, in apply_rotary_emb_qwen
[rank3]:   x_out = torch.view_as_real(x_rotated * freqs_cis).flatten(3)
[rank3]:           ~~~~~~~~~~^~~~~~~~~~~
[rank3]: RuntimeError: The size of tensor a (109) must match the size of tensor b (27) at non-singleton dimension 1
```

I have checked out b86bd99 (the last commit before this PR) and CP works there.
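For context, here is a self-contained sketch of the failure mode (the tensor sizes come from the traceback; everything else is illustrative and not the diffusers code): the rotary frequencies end up with a shorter sequence length than the text query, so the complex multiply cannot broadcast.

```python
import torch

full_txt_len, sharded_len, heads, head_dim = 109, 27, 1, 8

# Text query kept at the full (unsharded) length, viewed as complex pairs.
txt_query = torch.randn(1, full_txt_len, heads, head_dim)
x_rotated = torch.view_as_complex(txt_query.reshape(1, full_txt_len, heads, -1, 2))

# Rotary frequencies built from the sharded length (unit magnitude, zero phase here).
freqs_cis = torch.polar(
    torch.ones(sharded_len, head_dim // 2),
    torch.zeros(sharded_len, head_dim // 2),
).unsqueeze(1)

try:
    torch.view_as_real(x_rotated * freqs_cis).flatten(3)
except RuntimeError as e:
    print(e)  # size of tensor a (109) must match the size of tensor b (27) ...
```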
Was it working before in the CP setup? Cc: @kashif
@sayakpaul @kashif The txt_seq_len is wrong when context parallel is enabled; can we fix it? (See diffusers/src/diffusers/models/transformers/transformer_qwenimage.py, lines 145 to 152 at 9d68742.)
Let's please not sound condescending here. I am politely asking if it was working as expected for Qwen before this PR. Could you confirm that? If you have a fix, then we appreciate any PR; otherwise, I kindly ask you to have patience. It will surely be fixed.
@sayakpaul Hi~
@sayakpaul Changing the CP plan from transformer-level to block-level seems to avoid the sequence length mismatch in RoPE. I will submit a PR to fix this issue after I have conducted thorough tests.

```python
_cp_plan = {
    "transformer_blocks.0": {
        "hidden_states": ContextParallelInput(split_dim=1, expected_dims=3, split_output=False),
        "encoder_hidden_states": ContextParallelInput(split_dim=1, expected_dims=3, split_output=False),
    },
    "pos_embed": {
        0: ContextParallelInput(split_dim=0, expected_dims=2, split_output=True),
        1: ContextParallelInput(split_dim=0, expected_dims=2, split_output=True),
    },
    "proj_out": ContextParallelOutput(gather_dim=1, expected_dims=3),
}
```
@sayakpaul I have made a draft PR at #12970. I'm sorry again for my blunt comments; I really didn't mean to offend. I've just been paying close attention to diffusers' CP capabilities. As you can see, I have also put a lot of effort into fixing CP-related bugs. I truly hope diffusers keeps getting better and better, because I am a loyal user and have benefited a lot from its code.
Have any of you run into this issue: when running qwen-image-edit-2511 at 1024×1024 resolution on an H20, per-step inference time increased from 5.7s in version 0.36.0 to 7.8s? @kashif


What does this PR do?
- Removes `txt_seq_lens` plumbing from all QwenImage pipelines and modular steps; the transformer now infers text length from encoder inputs/masks and validates optional overrides (see the sketch after this list).
- Uses `encoder_hidden_states_mask` directly inside the double-stream attention, avoiding full `seq_len²` masks while keeping padding tokens masked.
- Ensures that user-supplied `txt_seq_lens` values and encoder masks are handled safely.

Fixes #12344
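A minimal sketch of the inference/validation behaviour described above (the helper name and its placement are assumptions; the checks mirror the diff discussed earlier in the review):

```python
from typing import List, Optional

import torch

def infer_text_seq_len(
    encoder_hidden_states: torch.Tensor,                         # (batch, text_seq, dim)
    encoder_hidden_states_mask: Optional[torch.Tensor] = None,   # (batch, text_seq), bool
    txt_seq_lens: Optional[List[int]] = None,                    # optional per-sample override
) -> int:
    batch_size, text_seq_len = encoder_hidden_states.shape[:2]
    if txt_seq_lens is not None:
        if len(txt_seq_lens) != batch_size:
            raise ValueError(
                f"`txt_seq_lens` must have length {batch_size}, but got {len(txt_seq_lens)} instead."
            )
        text_seq_len = max(text_seq_len, max(txt_seq_lens))
    elif encoder_hidden_states_mask is not None:
        # As discussed in the review, this assumes a contiguous mask
        # ([True, ..., True, False, ..., False]).
        text_seq_len = max(text_seq_len, int(encoder_hidden_states_mask.sum(dim=1).max().item()))
    return text_seq_len
```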
Before submitting
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.