
Remove compile bottlenecks from ZImage pipeline#13461

Merged
sayakpaul merged 6 commits intohuggingface:mainfrom
hitchhiker3010:main
Apr 15, 2026

Conversation

@hitchhiker3010
Contributor

What does this PR do?

Fixes performance issues identified by profiling ZImagePipeline with torch.profiler as part of #13401.

Profiled ZImagePipeline (using Tongyi-MAI/Z-Image-Turbo) in both eager and torch.compile modes following the profiling guide. The Chrome traces revealed two device-to-host (DtoH) synchronization points that break asynchronous GPU execution and prevent torch.compile from yielding its full speedup.

Pipeline denoising loop: t_norm = timestep[0].item() DtoH sync

  1. Inside the denoising loop, timestep[0].item() triggers a GPU→CPU sync every step to read t_norm for CFG truncation logic. Since the full timestep schedule is known before the loop begins, we precompute all t_norm values into a plain Python list before entering the loop and index into it with i.
  2. This also lets us call scheduler.set_begin_index(0) upfront to avoid the DtoH sync in _init_step_index (same pattern as #11696, "Avoid DtoH sync from access of nonzero() item in scheduler").
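The idea behind both points above can be sketched as follows. This is an illustrative sketch only; the variable names and the loop body are assumptions, not the exact diffusers code:

```python
import torch

# Illustrative sketch of the fix; names are placeholders, not the actual
# pipeline code. `timesteps` stands in for the full schedule, which is
# known before the denoising loop begins.
timesteps = torch.linspace(1000, 0, steps=4)

# Before: reading timestep[0].item() inside the loop forces a GPU->CPU
# sync on every step. After: precompute all normalized values up front
# into a plain Python list, so the loop only does cheap list indexing.
# (In the real pipeline, scheduler.set_begin_index(0) is also set here,
# so _init_step_index never has to inspect device tensors.)
t_norms = (timesteps / 1000.0).tolist()

for i, t in enumerate(timesteps):
    t_norm = t_norms[i]  # no .item(), no device-to-host sync
    # ... CFG truncation logic would use t_norm here ...
```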

Profiling ZImagePipeline
GPU: L4
num_inference_steps: 4
guidance_scale: 0.0 (guidance should be 0 for the Turbo models)

Before
[trace screenshot]
The first scheduler_step took 657.8 µs.
Number of cudaStreamSynchronize blocks: 19

After
[trace screenshot]
The first scheduler_step took 15.49 µs after this fix.
Number of cudaStreamSynchronize blocks: 13
Part of #13401 .

Before submitting

Who can review?

@sayakpaul @dg845

@github-actions github-actions Bot added pipelines size/S PR with diff < 50 LOC labels Apr 13, 2026
@sayakpaul sayakpaul added the performance Anything related to performance improvements, profiling and benchmarking label Apr 14, 2026
@sayakpaul
Member

Thanks for your PR! Can we eliminate all the cudaStreamSynchronize calls?

[core] Replace boolean mask indexing with torch.where in ZImage transformer

Boolean mask indexing (tensor[mask] = val) implicitly calls nonzero(),
which triggers a DtoH sync that stalls the CPU while the GPU queue drains.
Replacing it with torch.where eliminates these syncs from the transformer's
pad-token assignment.

Profiling (4-step turbo, fix_2 vs fix_1):
- Eager: nonzero CPU time drops from ~2091 ms to <1 ms; index_put eliminated
- Compile: nonzero CPU time drops from ~3057 ms to <1 ms; index_put eliminated
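The two forms are behaviorally equivalent for this pattern. A standalone sketch, with made-up shapes and pad value purely for illustration:

```python
import torch

# Illustrative shapes and pad value; not the actual transformer code.
hidden = torch.randn(2, 8, 4)                                # (batch, seq, dim)
pad_mask = torch.tensor([[False] * 6 + [True] * 2] * 2)      # (batch, seq)
pad_value = 0.0

# Before: boolean mask indexing. tensor[mask] = val implicitly calls
# nonzero() on the mask, which forces a device-to-host sync.
eager = hidden.clone()
eager[pad_mask] = pad_value

# After: torch.where stays fully on-device and is compile-friendly.
fused = torch.where(
    pad_mask.unsqueeze(-1),            # broadcast (batch, seq, 1) over dim
    torch.full_like(hidden, pad_value),
    hidden,
)

assert torch.equal(eager, fused)
```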
@github-actions github-actions Bot added models size/S PR with diff < 50 LOC and removed size/S PR with diff < 50 LOC labels Apr 14, 2026
@hitchhiker3010
Contributor Author

hitchhiker3010 commented Apr 14, 2026

Here are some comparison stats between commit_1 and commit_2

| Metric | commit_1 eager | commit_2 eager | commit_1 compile | commit_2 compile |
|---|---|---|---|---|
| nonzero calls | 28 | 4 | 28 | 4 |
| nonzero CPU time | 2091 ms | 0.72 ms | 3057 ms | 0.49 ms |
| index_put calls | 20 | 0 | 36 | 0 |
| index_put total | 4183 ms | 0 ms | 9172 ms | 0 ms |
| cudaStreamSynchronize calls | 13 | 5 | 13 | 5 |
| cudaStreamSynchronize total | 2089 ms | 0.47 ms | 3055 ms | 0.32 ms |

@hitchhiker3010
Contributor Author

All the trace files can be accessed here.

The cudaStreamSynchronize traces from the denoising phase are eliminated now; the remaining 5 cudaStreamSynchronize calls seem to come from the text encoding phase. Should we fix those too?

cc: @sayakpaul

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Collaborator

@dg845 dg845 left a comment


Thanks for the PR!

@github-actions github-actions Bot added size/S PR with diff < 50 LOC and removed size/S PR with diff < 50 LOC labels Apr 15, 2026
@sayakpaul
Member

Merging as the outputs with and without this PR are the same:

"""
Minimal script to verify PR #13461 does not change ZImagePipeline outputs.

Compares latent outputs between the current branch (hitchhiker3010-main, with PR changes)
and the main branch (without PR changes) using a fixed seed.

The PR makes two changes:
  1. Precompute cfg_truncation t_norms outside the loop (avoids DtoH sync)
  2. Use torch.where instead of boolean mask indexing in the transformer

Both are pure optimizations — outputs should be identical.

Usage:
    # On the current branch (with PR changes):
    python test_zimage_pr13461.py --save current_branch.pt

    # On the main branch (without PR changes):
    git checkout main
    python test_zimage_pr13461.py --save main_branch.pt

    # Compare:
    python test_zimage_pr13461.py --compare current_branch.pt main_branch.pt
"""

import argparse
import torch


def run_pipeline():
    from diffusers import ZImagePipeline

    pipe = ZImagePipeline.from_pretrained(
        "Tongyi-MAI/Z-Image-Turbo",
        torch_dtype=torch.bfloat16,
    )
    pipe.to("cuda")

    generator = torch.Generator(device="cuda").manual_seed(0)

    # Use guidance_scale > 1 and cfg_truncation to exercise both changed code paths.
    # Small resolution + few steps + latent output for speed.
    output = pipe(
        prompt="a cat",
        height=256,
        width=256,
        num_inference_steps=2,
        guidance_scale=3.5,
        cfg_truncation=0.5,
        output_type="latent",
        generator=generator,
    )
    return output.images


def save(latents, path):
    torch.save(latents.cpu(), path)
    print(f"Saved latents with shape {latents.shape} and dtype {latents.dtype} to {path}")


def compare(path_a, path_b):
    a = torch.load(path_a, weights_only=True)
    b = torch.load(path_b, weights_only=True)

    print(f"Tensor A: shape={a.shape}, dtype={a.dtype}")
    print(f"Tensor B: shape={b.shape}, dtype={b.dtype}")

    if a.shape != b.shape:
        print("FAIL: shapes differ")
        return

    exact_match = torch.equal(a, b)
    max_diff = (a.float() - b.float()).abs().max().item()
    print(f"Exact match: {exact_match}")
    print(f"Max absolute difference: {max_diff}")

    if exact_match:
        print("PASS: outputs are identical")
    elif max_diff < 1e-3:
        print(f"PASS: outputs differ by at most {max_diff} (within tolerance)")
    else:
        print(f"FAIL: outputs differ by {max_diff}")


def main():
    parser = argparse.ArgumentParser(description="ZImage PR #13461 output comparison")
    parser.add_argument("--save", type=str, help="Run pipeline and save latents to this path")
    parser.add_argument("--compare", nargs=2, metavar=("A", "B"), help="Compare two saved latent files")
    args = parser.parse_args()

    if args.save:
        latents = run_pipeline()
        save(latents, args.save)
    elif args.compare:
        compare(args.compare[0], args.compare[1])
    else:
        parser.print_help()


if __name__ == "__main__":
    main()

@sayakpaul sayakpaul merged commit 71a6fd9 into huggingface:main Apr 15, 2026
13 of 14 checks passed
@hitchhiker3010
Contributor Author

Hey @sayakpaul @dg845

Thanks for the opportunity to contribute. There must be similar issues in the other Z-Image pipelines (controlnet, controlnet_inpaint, img2img, inpaint, and omni), along with some pipeline-specific ones. Can I look into those, or do you suggest other pipelines that might be higher priority?

terarachang pushed a commit to terarachang/diffusers that referenced this pull request Apr 30, 2026
* [core] Remove DtoH syncs from ZImage pipeline denoising loop

* [core] Replace boolean mask indexing with torch.where in ZImage transformer

Boolean mask indexing (tensor[mask] = val) implicitly calls nonzero(),
which triggers a DtoH sync that stalls the CPU while the GPU queue drains.
Replacing it with torch.where eliminates these syncs from the transformer's
pad-token assignment.

Profiling (4-step turbo, fix_2 vs fix_1):
- Eager: nonzero CPU time drops from ~2091 ms to <1 ms; index_put eliminated
- Compile: nonzero CPU time drops from ~3057 ms to <1 ms; index_put eliminated

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

Labels

models performance Anything related to performance improvements, profiling and benchmarking pipelines size/S PR with diff < 50 LOC


4 participants