
Remove compile bottlenecks from ZImage pipeline#13461

Merged
sayakpaul merged 6 commits intohuggingface:mainfrom
hitchhiker3010:main
Apr 15, 2026

Conversation

@hitchhiker3010
Contributor

What does this PR do?

Fixes performance issues identified by profiling ZImagePipeline with torch.profiler as part of #13401.

Profiled ZImagePipeline (using Tongyi-MAI/Z-Image-Turbo) in both eager and torch.compile modes following the profiling guide. The Chrome traces revealed two device-to-host (DtoH) synchronization points that break asynchronous GPU execution and prevent torch.compile from yielding its full speedup.

Pipeline denoising loop: t_norm = timestep[0].item() DtoH sync

  1. Inside the denoising loop, timestep[0].item() triggers a GPU→CPU sync every step to read t_norm for CFG truncation logic. Since the full timestep schedule is known before the loop begins, we precompute all t_norm values into a plain Python list before entering the loop and index into it with i.
  2. This also lets us call scheduler.set_begin_index(0) upfront to avoid the DtoH sync in _init_step_index (same pattern as #11696, "Avoid DtoH sync from access of nonzero() item in scheduler").
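The idea behind both points above can be sketched as follows. This is an illustrative sketch only; the variable names and the loop body are assumptions, not the exact diffusers code:

```python
import torch

# Illustrative sketch of the fix; names are placeholders, not the actual
# pipeline code. `timesteps` stands in for the full schedule, which is
# known before the denoising loop begins.
timesteps = torch.linspace(1000, 0, steps=4)

# Before: reading timestep[0].item() inside the loop forces a GPU->CPU
# sync on every step. After: precompute all normalized values up front
# into a plain Python list, so the loop only does cheap list indexing.
# (In the real pipeline, scheduler.set_begin_index(0) is also set here,
# so _init_step_index never has to inspect device tensors.)
t_norms = (timesteps / 1000.0).tolist()

for i, t in enumerate(timesteps):
    t_norm = t_norms[i]  # no .item(), no device-to-host sync
    # ... CFG truncation logic would use t_norm here ...
```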

Profiling ZImagePipeline
GPU: L4
num_inference_steps: 4
guidance_scale: 0.0 (guidance should be 0 for the Turbo models)

Before
[trace screenshot]
The first scheduler_step took 657.8 µs.
Number of cudaStreamSynchronize blocks: 19

After
[trace screenshot]
The first scheduler_step took 15.49 µs after this fix.
Number of cudaStreamSynchronize blocks: 13
Part of #13401 .

Before submitting

Who can review?

@sayakpaul @dg845

@github-actions github-actions Bot added pipelines size/S PR with diff < 50 LOC labels Apr 13, 2026
@sayakpaul sayakpaul added the performance Anything related to performance improvements, profiling and benchmarking label Apr 14, 2026
@sayakpaul
Member

Thanks for your PR! Can we eliminate all the cudaStreamSynchronize calls?

[core] Replace boolean mask indexing with torch.where in ZImage transformer

Boolean mask indexing (tensor[mask] = val) implicitly calls nonzero(),
which triggers a DtoH sync that stalls the CPU while the GPU queue drains.
Replacing it with torch.where eliminates these syncs from the transformer's
pad-token assignment.

Profiling (4-step turbo, fix_2 vs fix_1):
- Eager: nonzero CPU time drops from ~2091 ms to <1 ms; index_put eliminated
- Compile: nonzero CPU time drops from ~3057 ms to <1 ms; index_put eliminated
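The two forms are behaviorally equivalent for this pattern. A standalone sketch, with made-up shapes and pad value purely for illustration:

```python
import torch

# Illustrative shapes and pad value; not the actual transformer code.
hidden = torch.randn(2, 8, 4)                                # (batch, seq, dim)
pad_mask = torch.tensor([[False] * 6 + [True] * 2] * 2)      # (batch, seq)
pad_value = 0.0

# Before: boolean mask indexing. tensor[mask] = val implicitly calls
# nonzero() on the mask, which forces a device-to-host sync.
eager = hidden.clone()
eager[pad_mask] = pad_value

# After: torch.where stays fully on-device and is compile-friendly.
fused = torch.where(
    pad_mask.unsqueeze(-1),            # broadcast (batch, seq, 1) over dim
    torch.full_like(hidden, pad_value),
    hidden,
)

assert torch.equal(eager, fused)
```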
@github-actions github-actions Bot added models size/S PR with diff < 50 LOC and removed size/S PR with diff < 50 LOC labels Apr 14, 2026
@hitchhiker3010
Contributor Author

hitchhiker3010 commented Apr 14, 2026

Here are some comparison stats between commit_1 and commit_2

| Metric | commit_1 eager | commit_2 eager | commit_1 compile | commit_2 compile |
|---|---|---|---|---|
| nonzero calls | 28 | 4 | 28 | 4 |
| nonzero CPU time | 2091 ms | 0.72 ms | 3057 ms | 0.49 ms |
| index_put calls | 20 | 0 | 36 | 0 |
| index_put total | 4183 ms | 0 ms | 9172 ms | 0 ms |
| cudaStreamSynchronize calls | 13 | 5 | 13 | 5 |
| cudaStreamSynchronize total | 2089 ms | 0.47 ms | 3055 ms | 0.32 ms |

@hitchhiker3010
Contributor Author

All the trace files can be accessed here.

The cudaStreamSynchronize traces from the denoising phase are eliminated now; the remaining 5 cudaStreamSynchronize calls seem to come from the text encoding phase. Should we fix those too?

cc: @sayakpaul

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Collaborator

@dg845 dg845 left a comment


Thanks for the PR!

@github-actions github-actions Bot added size/S PR with diff < 50 LOC and removed size/S PR with diff < 50 LOC labels Apr 15, 2026
@sayakpaul
Member

Merging as the outputs with and without this PR are the same:

"""
Minimal script to verify PR #13461 does not change ZImagePipeline outputs.

Compares latent outputs between the current branch (hitchhiker3010-main, with PR changes)
and the main branch (without PR changes) using a fixed seed.

The PR makes two changes:
  1. Precompute cfg_truncation t_norms outside the loop (avoids DtoH sync)
  2. Use torch.where instead of boolean mask indexing in the transformer

Both are pure optimizations — outputs should be identical.

Usage:
    # On the current branch (with PR changes):
    python test_zimage_pr13461.py --save current_branch.pt

    # On the main branch (without PR changes):
    git checkout main
    python test_zimage_pr13461.py --save main_branch.pt

    # Compare:
    python test_zimage_pr13461.py --compare current_branch.pt main_branch.pt
"""

import argparse
import torch


def run_pipeline():
    from diffusers import ZImagePipeline

    pipe = ZImagePipeline.from_pretrained(
        "Tongyi-MAI/Z-Image-Turbo",
        torch_dtype=torch.bfloat16,
    )
    pipe.to("cuda")

    generator = torch.Generator(device="cuda").manual_seed(0)

    # Use guidance_scale > 1 and cfg_truncation to exercise both changed code paths.
    # Small resolution + few steps + latent output for speed.
    output = pipe(
        prompt="a cat",
        height=256,
        width=256,
        num_inference_steps=2,
        guidance_scale=3.5,
        cfg_truncation=0.5,
        output_type="latent",
        generator=generator,
    )
    return output.images


def save(latents, path):
    torch.save(latents.cpu(), path)
    print(f"Saved latents with shape {latents.shape} and dtype {latents.dtype} to {path}")


def compare(path_a, path_b):
    a = torch.load(path_a, weights_only=True)
    b = torch.load(path_b, weights_only=True)

    print(f"Tensor A: shape={a.shape}, dtype={a.dtype}")
    print(f"Tensor B: shape={b.shape}, dtype={b.dtype}")

    if a.shape != b.shape:
        print("FAIL: shapes differ")
        return

    exact_match = torch.equal(a, b)
    max_diff = (a.float() - b.float()).abs().max().item()
    print(f"Exact match: {exact_match}")
    print(f"Max absolute difference: {max_diff}")

    if exact_match:
        print("PASS: outputs are identical")
    elif max_diff < 1e-3:
        print(f"PASS: outputs differ by at most {max_diff} (within tolerance)")
    else:
        print(f"FAIL: outputs differ by {max_diff}")


def main():
    parser = argparse.ArgumentParser(description="ZImage PR #13461 output comparison")
    parser.add_argument("--save", type=str, help="Run pipeline and save latents to this path")
    parser.add_argument("--compare", nargs=2, metavar=("A", "B"), help="Compare two saved latent files")
    args = parser.parse_args()

    if args.save:
        latents = run_pipeline()
        save(latents, args.save)
    elif args.compare:
        compare(args.compare[0], args.compare[1])
    else:
        parser.print_help()


if __name__ == "__main__":
    main()

@sayakpaul sayakpaul merged commit 71a6fd9 into huggingface:main Apr 15, 2026
13 of 14 checks passed
@hitchhiker3010
Contributor Author

Hey @sayakpaul @dg845

Thanks for the opportunity to contribute. There must be similar issues in the other Z-Image pipelines (controlnet, controlnet_inpaint, img2img, inpaint, and omni), along with some pipeline-specific ones. Can I look into those, or do you suggest other pipelines that might be higher priority?

terarachang pushed a commit to terarachang/diffusers that referenced this pull request Apr 30, 2026
* [core] Remove DtoH syncs from ZImage pipeline denoising loop

* [core] Replace boolean mask indexing with torch.where in ZImage transformer

Boolean mask indexing (tensor[mask] = val) implicitly calls nonzero(),
which triggers a DtoH sync that stalls the CPU while the GPU queue drains.
Replacing it with torch.where eliminates these syncs from the transformer's
pad-token assignment.

Profiling (4-step turbo, fix_2 vs fix_1):
- Eager: nonzero CPU time drops from ~2091 ms to <1 ms; index_put eliminated
- Compile: nonzero CPU time drops from ~3057 ms to <1 ms; index_put eliminated

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

Labels

models performance Anything related to performance improvements, profiling and benchmarking pipelines size/S PR with diff < 50 LOC


4 participants