
[Diffusion] Bump up cache-dit & support quant for diffusers backend #20361

Merged
mickqian merged 9 commits into sgl-project:main from xlite-dev:diffusers-be-quant on Mar 17, 2026

Conversation

@DefTruth (Contributor) commented on Mar 11, 2026

Bump up cache-dit and support quantization for the diffusers backend (~48% speedup for FLUX.1-dev on L20). Depends on #20338.

Cache-DiT v1.3.0 is a major release following v1.2.0; the major changes include:

  • Optimized VAE Parallel communication using batched isend/irecv
  • 2D/3D Parallelism: Hybrid CP (USP) + TP, e.g., SP2 + TP2
  • Support for USP (hybrid Ulysses and ring attention)
  • New model support: GLM-Image, FLUX.2-Klein, Helios, FireRed-Image-Edit, and more
  • Support for passing a quantize_config to the enable_cache API and loading it from a config YAML (see the sketch after this list)
  • FP8 blockwise dynamic quantization support
  • AMD GPU support
  • ...
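
For orientation, here is a minimal, hypothetical sketch of the new quantize_config path through the enable_cache API. The API name comes from the changelog above; the pipeline setup and the dict-shaped config argument are assumptions, and the real signature may differ (see the cache-dit docs):

import torch
import cache_dit
from diffusers import FluxPipeline

# Load the pipeline as usual; enable_cache is cache-dit's main entry point.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# New in v1.3.0: quantization can be requested in the same call that enables
# caching. The dict shape mirrors the quantize.yaml shown later in this PR;
# the exact parameter type is an assumption.
cache_dit.enable_cache(
    pipe,
    quantize_config={
        "quant_type": "float8",
        "exclude_layers": ["embedder", "embed"],
    },
)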

The change most relevant to SGLang Diffusion is that Cache-DiT v1.3.0 has improved its config-loading functions, enabling nearly all optimizations to be configured from a YAML file. These optimizations include:

  • hybrid cache acceleration (DBCache, TaylorSeer, SCM, etc.);
  • comprehensive parallelism optimizations, including Context Parallelism, Tensor Parallelism, hybrid 2D or 3D parallelism, and dedicated extra parallelism support for Text Encoder, VAE, and ControlNet;
  • Ulysses Anything Attention, Async Ulysses CP, Ulysses FP8 Comm;
  • quantization (float8 (DQ), float8_weight_only, float8_blockwise, int8 (DQ), int8_weight_only, etc.).

Please refer to LOAD_CONFIGS for more details. This makes all cache-dit optimizations available in SGLang Diffusion, substantially improving diffusers backend inference performance.
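
As an illustration, caching and quantization can live in one YAML file. The sketch below is hypothetical: the quantize_config section mirrors the quantize.yaml shown later in this description, while the cache_config field names follow cache-dit's documented DBCache/TaylorSeer options and should be verified against LOAD_CONFIGS:

# Hypothetical combined config; verify field names against LOAD_CONFIGS.
cache_config:
  cache_type: DBCache
  Fn_compute_blocks: 8          # leading transformer blocks always computed
  Bn_compute_blocks: 0          # trailing transformer blocks always computed
  residual_diff_threshold: 0.12 # reuse cached residuals when step-to-step change is small
  max_warmup_steps: 8           # never cache during the first steps
  enable_taylorseer: true       # predict residuals with TaylorSeer
quantize_config:
  quant_type: "float8"
  exclude_layers:
    - "embedder"
    - "embed"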

Benchmarking and Profiling

NVIDIA L20 x 1, FLUX.1-dev, 28 steps, 1024 x 1024

diffusers backend, baseline              diffusers backend + fp8 quantize
20.46s                                   13.81s (~48% speedup)
(image: flux_diffusers_torch_compile)    (image: flux_diffusers_torch_compile_float8)

NVIDIA H200 x 1, FLUX.1-dev, 28 steps, 1024 x 1024

diffusers backend, baseline                 diffusers backend + fp8 quantize
3.73s                                       2.77s (~35% speedup)
(image: flux_diffusers_torch_compile_h200)  (image: flux_diffusers_torch_compile_fp8_h200)
  • upgrade torchao

pip install -U torchao # >= 0.16.0

  • test configs: https://github.com/vipshop/cache-dit/tree/main/examples/configs

git clone https://github.com/vipshop/cache-dit && cd cache-dit/examples/configs
  • quantize.yaml
quantize_config: # quantization configuration for transformer modules
  quant_type: "float8" # float8 (DQ), float8_weight_only, float8_blockwise, int8 (DQ), int8_weight_only, etc.
  exclude_layers:  # layers to exclude from quantization (transformer)
    - "embedder"
    - "embed"
  verbose: false # whether to print verbose logs during quantization
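
Under the hood, cache-dit's quantization builds on torchao (hence the >= 0.16.0 requirement above). For readers who want the standalone equivalent, a minimal sketch of the underlying torchao call might look like the following; the filter function and the use of Float8DynamicActivationFloat8WeightConfig are assumptions about how the exclude_layers list is applied:

import torch
from torchao.quantization import quantize_, Float8DynamicActivationFloat8WeightConfig

def not_excluded(module: torch.nn.Module, fqn: str) -> bool:
    # Quantize only nn.Linear layers, skipping any whose qualified name
    # matches the exclude_layers patterns from the YAML above.
    return isinstance(module, torch.nn.Linear) and not any(
        key in fqn for key in ("embedder", "embed")
    )

# pipe.transformer is the FluxTransformer2DModel loaded by the diffusers pipeline.
quantize_(pipe.transformer, Float8DynamicActivationFloat8WeightConfig(), filter_fn=not_excluded)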
  • NVIDIA L20 w/o FP8: 20.46s (Diffusers backend, baseline)
# NVIDIA L20 (Ada)
sglang generate \
  --model-path=$FLUX_DIR \
  --backend diffusers \
  --log-level=info \
  --prompt='A fantasy landscape with mountains and a river, detailed, vibrant colors' \
  --width=1024 \
  --height=1024 \
  --num-inference-steps=28 \
  --warmup \
  --warmup-steps 28 \
  --dit-cpu-offload false \
  --text-encoder-cpu-offload false \
  --enable-torch-compile \
  --save-output --output-path outputs --output-file-name flux_diffusers_torch_compile.png

# Output logs:
[03-11 12:37:45] Scheduler bind at endpoint: tcp://127.0.0.1:5599
[03-11 12:37:45] Initializing distributed environment with world_size=1, device=cuda:0, timeout=3600
[03-11 12:37:45] Setting distributed timeout to 3600 seconds
[03-11 12:37:46] No pipeline_class_name specified, using model_index.json
[03-11 12:37:46] Using diffusers backend for model '/workspace/dev/vipdev/hf_models/FLUX.1-dev' (explicitly requested)
[03-11 12:37:46] Using pipeline from model_index.json: DiffusersPipeline
[03-11 12:37:46] Loading diffusers pipeline from /workspace/dev/vipdev/hf_models/FLUX.1-dev
[03-11 12:37:46] Model already exists locally and is complete
[03-11 12:37:46] Loading diffusers pipeline with dtype=torch.bfloat16
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 39.88it/s]
Loading weights: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 219/219 [00:00<00:00, 9109.64it/s]
Loading weights: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 196/196 [00:00<00:00, 9801.88it/s]
Loading pipeline components...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00,  7.90it/s]
[03-11 12:37:56] Applied torch.compile to transformer component of the pipeline
[03-11 12:37:56] Loaded diffusers pipeline: FluxPipeline
[03-11 12:37:56] Pipeline instantiated
[03-11 12:37:56] Worker 0: Initialized device, model, and distributed environment.
[03-11 12:37:56] Worker 0: Scheduler loop started.
[03-11 12:37:56] Processing prompt 1/1: A fantasy landscape with mountains and a river, detailed, vibrant colors
[03-11 12:37:56] Processing warmup req... (1/1)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:36<00:00,  1.32s/it]
[03-11 12:38:35] Warmup req (1/1) processed in 38.30 seconds
[03-11 12:38:35] [DiffusersExecutionStage] started...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:19<00:00,  1.44it/s]
[03-11 12:38:55] [DiffusersExecutionStage] finished in 20.4576 seconds
[03-11 12:38:55] Peak GPU memory: 35.83 GB, Peak allocated: 33.83 GB, Memory pool overhead: 2.01 GB (5.6%), Remaining GPU memory at peak: 9.15 GB. Components that could stay resident (based on the last request workload): []. Related offload server args to disable: None
[03-11 12:38:56] Output saved to outputs/flux_diffusers_torch_compile.png
[03-11 12:38:56] Pixel data generated successfully in 59.35 seconds
[03-11 12:38:56] Completed batch processing. Generated 1 outputs in 59.35 seconds
[03-11 12:38:56] Warmed-up request processed in 20.46 seconds (with warmup excluded)
[03-11 12:38:56] Memory usage - Max peak: 36694.00 MB, Avg peak: 36694.00 MB

Warmed-up request processed in 20.46 seconds

  • NVIDIA L20 w/ FP8: 13.81s vs 20.46s, ~48% speedup (Diffusers backend + quantize)
sglang generate \
  --model-path=$FLUX_DIR \
  --backend diffusers \
  --log-level=info \
  --prompt='A fantasy landscape with mountains and a river, detailed, vibrant colors' \
  --width=1024 \
  --height=1024 \
  --num-inference-steps=28 \
  --warmup \
  --warmup-steps 28 \
  --dit-cpu-offload false \
  --text-encoder-cpu-offload false \
  --cache-dit-config ./quantize.yaml \
  --enable-torch-compile \
  --save-output --output-path outputs --output-file-name flux_diffusers_torch_compile_float8.png

# Output logs:
[03-11 09:24:39] Scheduler bind at endpoint: tcp://127.0.0.1:5580
[03-11 09:24:39] Initializing distributed environment with world_size=1, device=cuda:0, timeout=3600
[03-11 09:24:39] Setting distributed timeout to 3600 seconds
[03-11 09:24:40] No pipeline_class_name specified, using model_index.json
[03-11 09:24:40] Using diffusers backend for model '/workspace/dev/vipdev/hf_models/FLUX.1-dev' (explicitly requested)
[03-11 09:24:40] Using pipeline from model_index.json: DiffusersPipeline
[03-11 09:24:40] Loading diffusers pipeline from /workspace/dev/vipdev/hf_models/FLUX.1-dev
[03-11 09:24:40] Model already exists locally and is complete
[03-11 09:24:40] Loading diffusers pipeline with dtype=torch.bfloat16
Loading weights: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 196/196 [00:00<00:00, 10009.54it/s]
Loading weights: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 219/219 [00:00<00:00, 9843.88it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 39.15it/s]
Loading pipeline components...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00,  7.76it/s]
[03-11 09:24:50] [Cache-DiT] cache_config is None, skip cache acceleration for FluxPipeline.
[03-11 09:24:53] [Cache-DiT] Quantized        Module: FluxTransformer2DModel
[03-11 09:24:53] [Cache-DiT] Quantized        Method: float8
[03-11 09:24:53] [Cache-DiT] Quantized Linear Layers:   496
[03-11 09:24:53] [Cache-DiT] Skipped   Linear Layers:     8
[03-11 09:24:53] [Cache-DiT] Total     Linear Layers:   504
[03-11 09:24:53] [Cache-DiT] Total     (all)  Layers:  1279
[03-11 09:24:53] Enabled cache-dit for diffusers pipeline
[03-11 09:24:53] Applied torch.compile to transformer component of the pipeline
[03-11 09:24:53] Loaded diffusers pipeline: FluxPipeline
[03-11 09:24:53] Pipeline instantiated
[03-11 09:24:53] Worker 0: Initialized device, model, and distributed environment.
[03-11 09:24:53] Worker 0: Scheduler loop started.
[03-11 09:24:53] Processing prompt 1/1: A fantasy landscape with mountains and a river, detailed, vibrant colors
[03-11 09:24:53] Processing warmup req... (1/1)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [01:24<00:00,  3.03s/it]
[03-11 09:26:19] Warmup req (1/1) processed in 85.98 seconds
[03-11 09:26:19] [DiffusersExecutionStage] started...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:13<00:00,  2.13it/s]
[03-11 09:26:33] [DiffusersExecutionStage] finished in 13.8018 seconds
[03-11 09:26:33] Peak GPU memory: 24.87 GB, Peak allocated: 22.80 GB, Memory pool overhead: 2.07 GB (8.3%), Remaining GPU memory at peak: 20.12 GB. Components that could stay resident (based on the last request workload): []. Related offload server args to disable: None
[03-11 09:26:34] Output saved to outputs/flux_diffusers_torch_compile_float8.png
[03-11 09:26:34] Pixel data generated successfully in 100.43 seconds
[03-11 09:26:34] Completed batch processing. Generated 1 outputs in 100.43 seconds
[03-11 09:26:34] Warmed-up request processed in 13.81 seconds (with warmup excluded)
[03-11 09:26:34] Memory usage - Max peak: 25468.00 MB, Avg peak: 25468.00 MB

Warmed-up request processed in 13.81 seconds

For NVIDIA H200 (Hopper): 3.73s -> 2.77s (~35% speedup)

# NVIDIA H200 (Hopper, Diffusers backend baseline, flash_3)
sglang generate \
  --model-path=$FLUX_DIR \
  --backend diffusers \
  --log-level=info \
  --prompt='A fantasy landscape with mountains and a river, detailed, vibrant colors' \
  --width=1024 \
  --height=1024 \
  --num-inference-steps=28 \
  --warmup \
  --warmup-steps 28 \
  --dit-cpu-offload false \
  --text-encoder-cpu-offload false \
  --enable-torch-compile \
  --cache-dit-config ./hopper/flash_3.yaml \
  --save-output --output-path outputs --output-file-name flux_diffusers_torch_compile_h200.png

# NVIDIA H200 (Hopper, Diffusers backend, flash_3 + quantize)
sglang generate \
  --model-path=$FLUX_DIR \
  --backend diffusers \
  --log-level=info \
  --prompt='A fantasy landscape with mountains and a river, detailed, vibrant colors' \
  --width=1024 \
  --height=1024 \
  --num-inference-steps=28 \
  --warmup \
  --warmup-steps 28 \
  --dit-cpu-offload false \
  --text-encoder-cpu-offload false \
  --enable-torch-compile \
  --cache-dit-config ./hopper/quantize.yaml \
  --save-output --output-path outputs --output-file-name flux_diffusers_torch_compile_fp8_h200.png
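
The two Hopper YAML files referenced above ship with the cache-dit repo under examples/configs; their contents are not reproduced in this PR. A plausible sketch of ./hopper/quantize.yaml, assuming the attention_backend key mentioned in the cache_dit.md changes and the quantize_config schema shown earlier, would be:

# Hypothetical sketch; see examples/configs/hopper in the cache-dit repo
# for the real files. Key names and values are assumptions.
attention_backend: flash_3  # FlashAttention-3, available on Hopper
quantize_config:
  quant_type: "float8"
  exclude_layers:
    - "embedder"
    - "embed"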

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

cc @mickqian @RubiaCx

@github-actions (Bot) added the documentation and dependencies labels on Mar 11, 2026
@RubiaCx added the diffusion label on Mar 11, 2026
@gemini-code-assist (Bot) commented:

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on significantly improving the performance of the diffusers backend by integrating advanced optimization techniques. It updates the cache-dit library to its latest version and introduces comprehensive support for model quantization, allowing users to achieve substantial speedups, as demonstrated by nearly a 50% reduction in inference time for a specific model. The changes also provide more granular control over caching and parallelism strategies through updated configuration options, enhancing the overall efficiency and flexibility of the diffusion pipeline.

Highlights

  • Performance Improvement: Achieved approximately 48% speedup for FLUX.1-dev on L20 by enabling cache-dit quantization for the diffusers backend.
  • Dependency Update: The cache-dit library dependency was bumped to version 1.3.0.
  • Quantization Support: Introduced support for specifying quantization configurations (e.g., float8) via YAML files for transformer modules, with options to exclude specific layers.
  • Enhanced Caching and Parallelism Documentation: Expanded documentation for cache-dit to include new configurations such as Step Computation Mask (SCM), Cache CFG, Ulysses Anything Attention, Ulysses FP8 Communication, Async Ulysses CP, and Text Encoder/VAE Parallelism.


Changelog
  • docs/diffusion/performance/cache/cache_dit.md
    • Added new cache configurations including DBCache + TaylorSeer + SCM and DBCache + TaylorSeer + SCM + Cache CFG.
    • Updated parallelism configuration examples to directly include attention_backend and extra_parallel_modules.
    • Introduced documentation for Ulysses Anything Attention, Ulysses FP8 Communication, Async Ulysses CP, and TE-P/VAE-P parallelism options.
    • Added a new section detailing how to specify attention backend and quantization configurations via YAML files.
    • Included an example of combining cache, parallelism, and quantization configurations in a single YAML file.
  • python/pyproject.toml
    • Updated the cache-dit dependency from version 1.2.3 to 1.3.0.
Activity
  • Benchmarking results were provided, showcasing a 48% speedup for FLUX.1-dev on L20 with cache-dit quantization.
  • Detailed command-line examples and output logs for both baseline and FP8 quantized runs were included.
  • The checklist items for code formatting, documentation, and benchmarking have been marked as completed by the author.

@gemini-code-assist (Bot) left a comment:


Code Review

This pull request bumps up the cache-dit version and adds support for quantization for the diffusers backend, resulting in a reported speedup. The changes primarily involve updating the pyproject.toml file and modifying the cache_dit.md documentation to include examples of quantization configurations. I have added a review comment to address a potential issue.

Note: Security Review has been skipped due to the limited scope of the PR.

Comment threads: docs/diffusion/performance/cache/cache_dit.md, python/pyproject.toml
@mickqian (Collaborator) commented:

/tag-and-rerun-ci

@yhyang201 (Collaborator) commented:

/tag-and-rerun-ci

@mickqian (Collaborator) commented:

diffusion affected only, lint passed, bypassing

@mickqian mickqian merged commit 025691c into sgl-project:main Mar 17, 2026
121 of 133 checks passed
@DefTruth DefTruth deleted the diffusers-be-quant branch March 17, 2026 11:57
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
0-693 pushed a commit to 0-693/sglang that referenced this pull request Mar 25, 2026
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026

Labels

  • dependencies (Pull requests that update a dependency file)
  • diffusion (SGLang Diffusion)
  • documentation (Improvements or additions to documentation)
  • run-ci
