bugfix: fix chrono-edit context parallel #12660
Conversation
|
@sayakpaul @yiyixuxu @DN6 Hi~ can you take a look at this PR? |
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>
Removed unnecessary comments regarding parallelization in cross-attention.
done |
|
@bot /style |
|
Style fix is beginning... View the workflow run here. |
|
@DefTruth could you run |
|
done |
|
Ah. The issue is with the 'Copied from' comment in the attention processor. @DefTruth would you mind also applying the change to the Wan attention processor? It should also be valid, since it would experience the same issue with cross-attention. |
|
@DN6 I haven't fully tested the Wan model yet. I'll hold off on submitting the PR until testing is done, so we can make sure we don't break the existing functionality. |
|
@DefTruth Could you then remove the 'Copied from' comment? It's the reason why the QC checks aren't passing.
Done, rewrote 'Copied from' -> 'modified from'
|
Thank you @DefTruth 🙏🏽 |
I will also test the Wan I2V model. If it has the same problem, I will submit a PR to fix it.
|
diffusers/src/diffusers/pipelines/wan/pipeline_wan_i2v.py Lines 677 to 700 in dde8754
Since only the Wan 2.1 I2V transformer accepts image_embeds (ChronoEdit always accepts image_embeds), I did not come across the same crash while using Wan 2.2 I2V.
Fixes #12661, the crash of ChronoEdit with context parallelism.
We need to disable the splitting of encoder_hidden_states because the image_encoder always generates 257 tokens for image_embed. After concatenation, encoder_hidden_states therefore always has 769 tokens (512 text + 257 image), which is not divisible by the number of devices in the context-parallel (CP) group, hence the crash.
Since the key/value in cross-attention depends solely on encoder_hidden_states (text or image), the (q_chunk * k) * v computation can be parallelized independently per query chunk. Thus, there is no need to pass the parallel_config for cross-attention at all. This change also removes redundant all-to-all communications, specifically (3+1)×2=8 for the two cross-attention operations (text and image), thereby improving ChronoEdit's performance under context parallelism. With this optimization alone, I achieved a nearly 1.85× speedup on L20x2 (two L20 GPUs), without relying on other optimizations such as torch.compile.
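For intuition, here is a minimal PyTorch sketch (not the diffusers attention processor itself; tensor names and shapes are illustrative): because the softmax in cross-attention runs over the key/value tokens only, each rank can attend its local query chunk against the full, replicated encoder_hidden_states, and the concatenated per-rank outputs match the unsharded result with no all-to-all needed.

```python
import torch
import torch.nn.functional as F

# Toy shapes: batch 1, 8 heads, head_dim 64.
# Queries come from the (CP-sharded) video latents; keys/values come
# from encoder_hidden_states, whose 769 tokens (512 text + 257 image)
# are not divisible by 2 ranks -- so we keep k/v replicated instead.
q = torch.randn(1, 8, 1024, 64)  # full query sequence
k = torch.randn(1, 8, 769, 64)   # from encoder_hidden_states (replicated)
v = torch.randn(1, 8, 769, 64)

full = F.scaled_dot_product_attention(q, k, v)

# Each CP rank holds one chunk of q but the *full* k/v. Softmax is over
# the k/v dimension only, so per-chunk results are exact and no
# communication is required (unlike self-attention, where sharding the
# sequence also shards k/v and forces all-to-all exchanges).
world_size = 2
chunks = [
    F.scaled_dot_product_attention(q_chunk, k, v)
    for q_chunk in q.chunk(world_size, dim=2)
]
assert torch.allclose(torch.cat(chunks, dim=2), full, atol=1e-5)
```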
@sayakpaul @yiyixuxu @DN6
Reproduce
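The reproduction script is not inlined here; below is a hedged two-GPU sketch, assuming the ContextParallelConfig / enable_parallelism context-parallel API from recent diffusers releases, a ChronoEditPipeline class, an assumed checkpoint id, and a call signature following the Wan I2V pipelines. Adjust these to the actual script in #12661.

```python
# Launch with: torchrun --nproc_per_node=2 repro.py
# Sketch only: pipeline class, checkpoint id, and call signature are assumptions.
import torch
import torch.distributed as dist
from diffusers import ChronoEditPipeline, ContextParallelConfig
from diffusers.utils import load_image

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

pipe = ChronoEditPipeline.from_pretrained(
    "nvidia/ChronoEdit-14B-Diffusers",  # assumed checkpoint id
    torch_dtype=torch.bfloat16,
).to(f"cuda:{rank}")

# Shard self-attention over the 2 ranks. Before this PR, the 769-token
# encoder_hidden_states (512 text + 257 image) was also sharded and
# crashed because 769 is not divisible by 2.
pipe.transformer.enable_parallelism(
    config=ContextParallelConfig(ring_degree=2)
)

image = load_image("input.png")  # placeholder input image
frames = pipe(image=image, prompt="edit instruction here").frames[0]

if rank == 0:
    print(f"generated {len(frames)} frames")
dist.destroy_process_group()
```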