
Conversation

@kashif kashif commented Nov 23, 2025

What does this PR do?

  • Removes the redundant txt_seq_lens plumbing from all QwenImage pipelines and modular steps; the transformer now infers the text length from the encoder inputs/masks and validates optional overrides.
  • Builds a lightweight, broadcastable attention mask from encoder_hidden_states_mask inside the double-stream attention, avoiding full seq_len² masks while keeping padding tokens masked (see the sketch after this list for the general idea).
  • Adjusts the QwenImage Transformer/ControlNet RoPE to take a single text length and documents the fallback behavior.
  • Adds regression tests to ensure short txt_seq_lens values and encoder masks are handled safely.
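
For illustration, a minimal sketch of the broadcastable-mask idea. The helper name and exact shapes are illustrative assumptions, not the PR's code; it assumes the joint sequence is text tokens followed by image tokens and that image tokens are never padded:

import torch

def build_joint_attention_mask(encoder_hidden_states_mask: torch.Tensor, image_seq_len: int) -> torch.Tensor:
    # encoder_hidden_states_mask: (batch, text_seq_len); 1/True for real tokens, 0/False for padding.
    batch_size = encoder_hidden_states_mask.shape[0]
    text_mask = encoder_hidden_states_mask.bool()
    image_mask = text_mask.new_ones(batch_size, image_seq_len)        # image tokens are never padded
    joint_mask = torch.cat([text_mask, image_mask], dim=1)            # (batch, text_len + image_len)
    # A (batch, 1, 1, L) key-padding mask broadcasts against the (batch, heads, L, L) attention
    # scores, so a full L x L mask never has to be materialized.
    return joint_mask[:, None, None, :]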

Fixes #12344

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@kashif kashif requested a review from sayakpaul November 23, 2025 18:03
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@sayakpaul sayakpaul requested a review from yiyixuxu November 24, 2025 01:57

dxqb commented Nov 29, 2025

Just a few comments, not a full review:

  • there is some overlap with Fix qwen encoder hidden states mask #12655
  • this code has the same issue mentioned in Fix qwen encoder hidden states mask #12655 (it expects boolean semantics in a FloatTensor, but float attention masks are interpreted differently; see the illustration below)
  • Could you clarify what the purpose of this PR is?
    If the purpose is to remove the txt_seq_lens parameters and infer the sequence lengths from the attention mask: why is it still a parameter of the transformer model?
    If the purpose is to pass sequence lengths to the attention dispatch (see Qwen Image: txt_seq_lens is redundant and not used #12344 (comment)), then the sequence lengths for each batch sample must be inferred from the mask and passed to the transformer blocks, not only the max sequence length across all batch samples for RoPE.
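
For reference, a small standalone illustration of the boolean-vs-float mask semantics mentioned above (PyTorch SDPA convention, not code from this PR):

import torch
import torch.nn.functional as F

q = k = v = torch.randn(1, 1, 4, 8)
keep = torch.tensor([True, True, False, False]).view(1, 1, 1, 4)

# Boolean mask: True means "attend to this key".
out_bool = F.scaled_dot_product_attention(q, k, v, attn_mask=keep)

# Float mask: values are added to the attention logits, so a 0/1 float tensor does not
# mask anything; padding positions have to become -inf (or a large negative value).
float_mask = torch.zeros(1, 1, 1, 4).masked_fill(~keep, float("-inf"))
out_float = F.scaled_dot_product_attention(q, k, v, attn_mask=float_mask)

assert torch.allclose(out_bool, out_float, atol=1e-6)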

        raise ValueError(f"`txt_seq_lens` must have length {batch_size}, but got {len(txt_seq_lens)} instead.")
    text_seq_len = max(text_seq_len, max(txt_seq_lens))
elif encoder_hidden_states_mask is not None:
    text_seq_len = max(text_seq_len, int(encoder_hidden_states_mask.sum(dim=1).max().item()))

This only works if the attention mask has the form [True, True, True, ..., False, False, False]. That is the most common pattern for text attention masks, but it is not guaranteed.

If the mask is [True, False, True, False, True, False], self.pos_embed receives an incorrect sequence length.
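
A tiny illustration of the difference (hypothetical mask, not from the PR):

import torch

mask = torch.tensor([[True, False, True, False, True, False]])
num_real_tokens = int(mask.sum(dim=1).max())                 # 3: what the sum-based code computes
rope_len_needed = int(mask.any(dim=0).nonzero().max()) + 1   # 5: last kept position + 1, what RoPE needs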


I think this is resolved, but I'll let you close it if you agree.

kashif commented Nov 29, 2025

Thanks @dxqb, the idea was to remove txt_seq_lens altogether and work with any mask pattern.

@yiyixuxu yiyixuxu left a comment

thanks a ton for the PR! @kashif
I left one question, let me know!

- Remove seq_lens parameter from dispatch_attention_fn
- Update varlen backends to extract seqlens from masks
- Update QwenImage to pass 2D joint_attention_mask
- Fix native backend to handle 2D boolean masks
- Fix sage_varlen seqlens_q to match seqlens_k for self-attention

Note: sage_varlen still producing black images, needs further investigation
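
For context on the "extract seqlens from masks" item above, a minimal sketch of the usual varlen convention (an assumption about the general approach, not the exact helper used here):

import torch
import torch.nn.functional as F

def cu_seqlens_from_mask(mask: torch.Tensor) -> tuple[torch.Tensor, int]:
    # mask: (batch, seq_len) boolean. Flash-attention style varlen kernels take cumulative
    # sequence lengths plus the maximum length instead of a padded 2D mask.
    seqlens = mask.sum(dim=1, dtype=torch.int32)                          # (batch,)
    cu_seqlens = F.pad(torch.cumsum(seqlens, dim=0, dtype=torch.int32), (1, 0))
    return cu_seqlens, int(seqlens.max())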
@sayakpaul

#12655 provides some speed benchmarks as well. Would it be possible to provide them here too? @kashif

kashif commented Dec 8, 2025

some benchmarks with various backends:

code:
benchmark_backends_qwen.py

[image: backend_benchmark_complete (benchmark results chart)]

dxqb added a commit to dxqb/OneTrainer that referenced this pull request Dec 21, 2025

dxqb commented Dec 21, 2025

@dxqb you can also send a PR to this branch or fix this PR directly

I'll send the small changes that I needed to your PR (like support for a per-sample text sequence length), but keep my PR separate overall; it's a separate feature.

dxqb added a commit to dxqb/OneTrainer that referenced this pull request Dec 26, 2025
… attention

update

merge with attention selection, add FLASH_SPLIT

yiyixuxu commented Jan 6, 2026

@kashif @dxqb
is this ready for a final round of review?

@yiyixuxu yiyixuxu left a comment

Thanks so much, I went over it again and left one comment.
I will also ask the Qwen team to do a final review now :)

encoder_hidden_states_mask=encoder_hidden_states_mask,
timestep=timestep,
img_shapes=img_shapes,
txt_seq_lens=txt_seq_lens,

Should we also build the joint mask for the ControlNet?
Not merge-blocking though; I don't think it breaks anything.


Good catch, I think we should add it.

naykun commented Jan 12, 2026

Hi @yiyixuxu and @kashif,

Quick update: this PR passed our internal regression tests with batch_size=1. Heads up, though: the EditPlus pipeline does not support batch_size > 1 yet.

Since this is a large PR with valuable fixes, I suggest we merge it now and handle batch>1 support separately. What are your thoughts?

Below is a minimal reproduction script for the EditPlus pipeline error when batch_size > 1:

import os
import torch
from diffusers import QwenImageEditPlusPipeline
from PIL import Image
torch.manual_seed(0)

pipeline = QwenImageEditPlusPipeline.from_pretrained("Qwen/Qwen-Image-Edit-2511")
print("pipeline loaded")
pipeline.to(torch.bfloat16)
pipeline.to('cuda')
pipeline.set_progress_bar_config(disable=None)

image0 = Image.open("input1.png").convert("RGB")
image1 = Image.open("input2.png").convert("RGB")
prompt = "merge picture 1 with picture 2"
inputs = {
    "image": [[image0, image1], [image0, image1]],
    "prompt": [prompt, prompt],
    "height": 1024,
    "width": 1024,
    "generator": torch.manual_seed(0),
    "true_cfg_scale": 4.0,
    "negative_prompt": " ",
    "num_inference_steps": 50,
    "guidance_scale": 1.0,
    "num_images_per_prompt": 1,
}
with torch.inference_mode():
    output = pipeline(**inputs)
    output_image = output.images[0]
    output_image.save("output_image_edit_plus.png")
    print("image saved at", os.path.abspath("output_image_edit_plus.png"))

@sayakpaul

Thanks a lot @naykun! WDYT about raising an error from the EditPlus pipeline when batch_size > 1? Would that work for you?
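
A hypothetical sketch of the kind of guard being discussed (the name and placement are illustrative, not the check that was actually merged):

def check_editplus_batch_size(prompt) -> None:
    # Fail fast until batch_size > 1 is supported by the EditPlus pipeline.
    if isinstance(prompt, (list, tuple)) and len(prompt) > 1:
        raise ValueError(
            f"QwenImageEditPlusPipeline currently supports only batch_size=1, got {len(prompt)} prompts."
        )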

naykun commented Jan 12, 2026

Thanks a lot @naykun! WDYT about raising an error from the EditPlus pipeline when batch_size > 1? Would that work for you?

Sounds good to me! 😊

@sayakpaul sayakpaul merged commit dad5cb5 into huggingface:main Jan 12, 2026
29 of 31 checks passed
@sayakpaul

Uff, feels like a big burden off everyone's shoulders! Thanks a lot @cdutr @kashif for jamming. @cdutr please obtain your MVP certificate from here. Let us also know your HF username so that we can grant you some credits :)

DefTruth commented Jan 13, 2026

@sayakpaul @yiyixuxu Have you tested Context Parallel? It seems this PR causes an error with Context Parallel.

[rank3]:     output = function_reference.forward(*args, **kwargs)
[rank3]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/workspace/dev/vipshop/diffusers/src/diffusers/models/transformers/transformer_qwenimage.py", line 959, in forward
[rank3]:     encoder_hidden_states, hidden_states = block(
[rank3]:                                            ^^^^^^
[rank3]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
[rank3]:     return forward_call(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/workspace/dev/vipshop/diffusers/src/diffusers/models/transformers/transformer_qwenimage.py", line 696, in forward
[rank3]:     attn_output = self.attn(
[rank3]:                   ^^^^^^^^^^
[rank3]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
[rank3]:     return forward_call(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/workspace/dev/vipshop/diffusers/src/diffusers/models/attention_processor.py", line 605, in forward
[rank3]:     return self.processor(
[rank3]:            ^^^^^^^^^^^^^^^
[rank3]:   File "/workspace/dev/vipshop/diffusers/src/diffusers/models/transformers/transformer_qwenimage.py", line 535, in __call__
[rank3]:     txt_query = apply_rotary_emb_qwen(txt_query, txt_freqs, use_real=False)
[rank3]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/workspace/dev/vipshop/diffusers/src/diffusers/models/transformers/transformer_qwenimage.py", line 140, in apply_rotary_emb_qwen
[rank3]:     x_out = torch.view_as_real(x_rotated * freqs_cis).flatten(3)
[rank3]:                                ~~~~~~~~~~^~~~~~~~~~~
[rank3]: RuntimeError: The size of tensor a (109) must match the size of tensor b (27) at non-singleton dimension 1

I have checked out b86bd99 (the last commit before this PR) and CP works there.

@sayakpaul

Was it working before in the CP setup? Cc: @kashif

DefTruth commented Jan 13, 2026

@sayakpaul @kashif The computed txt_seq_len is wrong when context parallel is enabled; can we fix it?

def compute_text_seq_len_from_mask(
    encoder_hidden_states: torch.Tensor, encoder_hidden_states_mask: Optional[torch.Tensor]
) -> Tuple[int, Optional[torch.Tensor], Optional[torch.Tensor]]:
    """
    Compute text sequence length without assuming contiguous masks. Returns length for RoPE and a normalized bool mask.
    """
    batch_size, text_seq_len = encoder_hidden_states.shape[:2]
    if encoder_hidden_states_mask is None:

@sayakpaul

Let's please not sound condescending here. I am politely asking if it was working as expected for Qwen before this PR. Could you confirm that?

If you have a fix, then we appreciate any PR; otherwise, I kindly ask you to have patience. It will surely be fixed.

@DefTruth

@sayakpaul Hi~

  1. I'm very sorry, I didn't mean to offend; I was just describing the problem I found. I'm trying to solve it, and if my fix works I will submit a PR. I really like your work, my friends~

  2. I have checked out b86bd99 (the last commit before this PR) and CP works there.

@DefTruth

@sayakpaul Changing the CP plan from transformer-level to block-level seems to avoid the sequence length mismatch in RoPE. I will submit a PR to fix this issue after I have conducted thorough tests.

        _cp_plan = {
            "transformer_blocks.0": {
                "hidden_states": ContextParallelInput(
                    split_dim=1, expected_dims=3, split_output=False
                ),
                "encoder_hidden_states": ContextParallelInput(
                    split_dim=1, expected_dims=3, split_output=False
                ),
            },
            "pos_embed": {
                0: ContextParallelInput(split_dim=0, expected_dims=2, split_output=True),
                1: ContextParallelInput(split_dim=0, expected_dims=2, split_output=True),
            },
            "proj_out": ContextParallelOutput(gather_dim=1, expected_dims=3),
        }

DefTruth commented Jan 13, 2026

@sayakpaul I have made a draft PR at #12970. I'm sorry again for my direct comments; I really didn't mean to offend. I've just been paying close attention to diffusers' CP capabilities, and as you can see I have also put a lot of effort into fixing CP-related bugs. I truly hope diffusers keeps getting better and better, because I am a loyal user and have benefited a lot from its code.

GengarWu commented Jan 13, 2026

some benchmarks with various backends:

code: benchmark_backends_qwen.py

[image: backend_benchmark_complete (benchmark results chart)]

Have any of you run into this issue: when running qwen-image-edit-2511 at 1024×1024 resolution on an H20, the per-step inference time increased from 5.7s in version 0.36.0 to 7.8s? @kashif

import os
import torch
from PIL import Image
from diffusers import QwenImageEditPlusPipeline

pipeline = QwenImageEditPlusPipeline.from_pretrained("/data00/models/Qwen-Image-Edit-2511", torch_dtype=torch.bfloat16)
print("pipeline loaded")

pipeline.to('cuda')
pipeline.set_progress_bar_config(disable=None)
image1 = Image.open("/data00/image_examples/qwen-image-edit-input1.jpg")
image2 = Image.open("/data00/image_examples/qwen-image-edit-input2.jpg")
prompt = "The magician bear is on the left, the alchemist bear is on the right, facing each other in the central park square."
inputs = {
    "image": [image1, image2],
    "prompt": prompt,
    "generator": torch.Generator(device="cuda").manual_seed(0),
    "true_cfg_scale": 4.0,
    "negative_prompt": " ",
    "num_inference_steps": 40,
    "guidance_scale": 1.0,
    "num_images_per_prompt": 1,
    "height": 1024,
    "width": 1024,
}
with torch.inference_mode():
    output = pipeline(**inputs)
    output_image = output.images[0]
    output_image.save("output_image_edit_2511.png")
    print("image saved at", os.path.abspath("output_image_edit_2511.png"))

@kashif kashif deleted the txt_seq_lens branch January 16, 2026 10:48

kashif commented Jan 16, 2026

@GengarWu would you be able to test with PR #12987? Thanks!
