fix wan 2.1 i2v cp and t5/umt5 te-p #639

Merged
DefTruth merged 4 commits into main from dev
Jan 4, 2026

DefTruth commented Jan 4, 2026

Fixes #622

  1. We need to disable splitting of encoder_hidden_states under context parallelism (CP). The image_encoder always produces 257 tokens for image_embed, so after concatenation with the 512 text tokens, encoder_hidden_states always has 769 (512 + 257) tokens. Since 769 is prime, this token count cannot be divided evenly across the devices in the CP group.

  2. Since the key/value in cross-attention depends solely on encoder_hidden_states (text or img), which every device holds in full once splitting is disabled, the (q_chunk * k) * v computation can run independently on each device. Thus, there is no need to pass the parallel_config for cross-attention. This change removes redundant all-to-all communications, specifically (3 + 1) × 2 = 8 for the two cross-attention operations (text and img), thereby improving Wan's performance under context parallelism.
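
The two points above can be checked with a small, framework-free sketch. It is not the code from this PR: `attention`, `rand_matrix`, `world_size`, and the tensor sizes are hypothetical names and values chosen for illustration, and the text and img cross-attention passes are collapsed into a single pass over all 769 tokens. It shows that 769 key/value tokens leave a remainder when sharded, and that attending each query chunk against the full, unsplit key/value gives the same result as attending the full query sequence, with no communication between chunks.

```python
import math
import random


def attention(q, k, v):
    """Plain softmax attention; q is [Lq][d], k and v are [Lkv][d]."""
    d = len(q[0])
    out = []
    for q_row in q:
        scores = [
            sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d)
            for k_row in k
        ]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v_row[j] for w, v_row in zip(weights, v)) / total
            for j in range(d)
        ])
    return out


def rand_matrix(rows, cols, rng):
    """Hypothetical helper: a rows x cols matrix of uniform random floats."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
world_size = 4                     # assumed number of CP devices
q_len, kv_len, dim = 8, 769, 4     # kv_len mirrors 512 text + 257 image tokens

# Point 1: 769 tokens cannot be sharded evenly across the devices.
kv_remainder = kv_len % world_size

q = rand_matrix(q_len, dim, rng)
k = rand_matrix(kv_len, dim, rng)
v = rand_matrix(kv_len, dim, rng)

# Reference: attention over the full query sequence.
full = attention(q, k, v)

# Point 2: each device holds one query chunk plus the full k/v and
# computes its slice of the output without talking to other devices.
chunk = q_len // world_size
chunked = []
for rank in range(world_size):
    q_chunk = q[rank * chunk:(rank + 1) * chunk]
    chunked.extend(attention(q_chunk, k, v))

max_diff = max(
    abs(a - b)
    for full_row, chunk_row in zip(full, chunked)
    for a, b in zip(full_row, chunk_row)
)
print("kv remainder:", kv_remainder, "max abs diff:", max_diff)
```

Because each output row depends only on its own query row and the shared key/value, concatenating the per-chunk outputs reproduces the unsplit result, which is why the cross-attention call needs no parallel config in this setting.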

# solely on encoder_hidden_states (text), the (q_chunk * k) * v
# computation can be parallelized independently. Thus, there is
# no need to pass the config here.
parallel_config=(self._parallel_config if encoder_hidden_states is None else None),

Should this parallel_config be None?

Successfully merging this pull request may close these issues.

Low quality output (snow-like noise) when using Cache-DiT with Wan2.1-I2V
