
[Diffusion] Bump up cache-dit & support quant for diffusers backend #20361

Merged
mickqian merged 9 commits into sgl-project:main from xlite-dev:diffusers-be-quant on Mar 17, 2026

Conversation

@DefTruth (Contributor) commented on Mar 11, 2026

Bump up cache-dit and support quantization for the diffusers backend (~48% speedup for FLUX.1-dev on L20). Depends on #20338.

Cache-DiT v1.3.0 is a major release following v1.2.0; the major changes include:

  • Optimized VAE Parallel communication using batched isend/irecv
  • 2D/3D Parallelism: Hybrid CP (USP) + TP, e.g., SP2 + TP2
  • Support for USP (hybrid Ulysses and ring attention)
  • New model support: GLM-Image, FLUX.2-Klein, Helios, FireRed-Image-Edit, and more
  • Support for passing a quantize_config to the enable_cache API and loading it from a config YAML (see the sketch after this list)
  • FP8 blockwise dynamic quantization support
  • AMD GPU support
  • ...
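
For orientation, here is a minimal, hypothetical sketch of the new quantize_config path through the enable_cache API. The API name comes from the changelog above; the pipeline setup and the dict-shaped config argument are assumptions, and the real signature may differ (see the cache-dit docs):

import torch
import cache_dit
from diffusers import FluxPipeline

# Load the pipeline as usual; enable_cache is cache-dit's main entry point.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# New in v1.3.0: quantization can be requested in the same call that enables
# caching. The dict shape mirrors the quantize.yaml shown later in this PR;
# the exact parameter type is an assumption.
cache_dit.enable_cache(
    pipe,
    quantize_config={
        "quant_type": "float8",
        "exclude_layers": ["embedder", "embed"],
    },
)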

The change most relevant to SGLang Diffusion is that Cache-DiT v1.3.0 has improved its config-loading functions, enabling nearly all optimizations to be configured from a YAML file. These optimizations include:

  • hybrid cache acceleration (DBCache, TaylorSeer, SCM, etc.);
  • comprehensive parallelism optimizations, including Context Parallelism, Tensor Parallelism, hybrid 2D or 3D parallelism, and dedicated extra parallelism support for Text Encoder, VAE, and ControlNet;
  • Ulysses Anything Attention, Async Ulysses CP, Ulysses FP8 Comm;
  • quantization (float8 (DQ), float8_weight_only, float8_blockwise, int8 (DQ), int8_weight_only, etc.).

Please refer to LOAD_CONFIGS for more details. This makes all cache-dit optimizations available in SGLang Diffusion, substantially improving diffusers backend inference performance.
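
As an illustration, caching and quantization can live in one YAML file. The sketch below is hypothetical: the quantize_config section mirrors the quantize.yaml shown later in this description, while the cache_config field names follow cache-dit's documented DBCache/TaylorSeer options and should be verified against LOAD_CONFIGS:

# Hypothetical combined config; verify field names against LOAD_CONFIGS.
cache_config:
  cache_type: DBCache
  Fn_compute_blocks: 8          # leading transformer blocks always computed
  Bn_compute_blocks: 0          # trailing transformer blocks always computed
  residual_diff_threshold: 0.12 # reuse cached residuals when step-to-step change is small
  max_warmup_steps: 8           # never cache during the first steps
  enable_taylorseer: true       # predict residuals with TaylorSeer
quantize_config:
  quant_type: "float8"
  exclude_layers:
    - "embedder"
    - "embed"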

Benchmarking and Profiling

NVIDIA L20 x 1, FLUX.1-dev, 28 steps, 1024 x 1024

diffusers backend, baseline              diffusers backend + fp8 quantize
20.46s                                   13.81s (~48% speedup)
(image: flux_diffusers_torch_compile)    (image: flux_diffusers_torch_compile_float8)

NVIDIA H200 x 1, FLUX.1-dev, 28 steps, 1024 x 1024

diffusers backend, baseline                 diffusers backend + fp8 quantize
3.73s                                       2.77s (~35% speedup)
(image: flux_diffusers_torch_compile_h200)  (image: flux_diffusers_torch_compile_fp8_h200)
  • upgrade torchao

pip install -U torchao # >= 0.16.0

  • test configs: https://github.com/vipshop/cache-dit/tree/main/examples/configs

git clone https://github.com/vipshop/cache-dit && cd cache-dit/examples/configs
  • quantize.yaml
quantize_config: # quantization configuration for transformer modules
  quant_type: "float8" # float8 (DQ), float8_weight_only, float8_blockwise, int8 (DQ), int8_weight_only, etc.
  exclude_layers:  # layers to exclude from quantization (transformer)
    - "embedder"
    - "embed"
  verbose: false # whether to print verbose logs during quantization
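
Under the hood, cache-dit's quantization builds on torchao (hence the >= 0.16.0 requirement above). For readers who want the standalone equivalent, a minimal sketch of the underlying torchao call might look like the following; the filter function and the use of Float8DynamicActivationFloat8WeightConfig are assumptions about how the exclude_layers list is applied:

import torch
from torchao.quantization import quantize_, Float8DynamicActivationFloat8WeightConfig

def not_excluded(module: torch.nn.Module, fqn: str) -> bool:
    # Quantize only nn.Linear layers, skipping any whose qualified name
    # matches the exclude_layers patterns from the YAML above.
    return isinstance(module, torch.nn.Linear) and not any(
        key in fqn for key in ("embedder", "embed")
    )

# pipe.transformer is the FluxTransformer2DModel loaded by the diffusers pipeline.
quantize_(pipe.transformer, Float8DynamicActivationFloat8WeightConfig(), filter_fn=not_excluded)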
  • NVIDIA L20 w/o FP8: 20.46s (Diffusers backend, baseline)
# NVIDIA L20 (Ada)
sglang generate \
  --model-path=$FLUX_DIR \
  --backend diffusers \
  --log-level=info \
  --prompt='A fantasy landscape with mountains and a river, detailed, vibrant colors' \
  --width=1024 \
  --height=1024 \
  --num-inference-steps=28 \
  --warmup \
  --warmup-steps 28 \
  --dit-cpu-offload false \
  --text-encoder-cpu-offload false \
  --enable-torch-compile \
  --save-output --output-path outputs --output-file-name flux_diffusers_torch_compile.png

# Output logs:
[03-11 12:37:45] Scheduler bind at endpoint: tcp://127.0.0.1:5599
[03-11 12:37:45] Initializing distributed environment with world_size=1, device=cuda:0, timeout=3600
[03-11 12:37:45] Setting distributed timeout to 3600 seconds
[03-11 12:37:46] No pipeline_class_name specified, using model_index.json
[03-11 12:37:46] Using diffusers backend for model '/workspace/dev/vipdev/hf_models/FLUX.1-dev' (explicitly requested)
[03-11 12:37:46] Using pipeline from model_index.json: DiffusersPipeline
[03-11 12:37:46] Loading diffusers pipeline from /workspace/dev/vipdev/hf_models/FLUX.1-dev
[03-11 12:37:46] Model already exists locally and is complete
[03-11 12:37:46] Loading diffusers pipeline with dtype=torch.bfloat16
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 39.88it/s]
Loading weights: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 219/219 [00:00<00:00, 9109.64it/s]
Loading weights: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 196/196 [00:00<00:00, 9801.88it/s]
Loading pipeline components...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00,  7.90it/s]
[03-11 12:37:56] Applied torch.compile to transformer component of the pipeline
[03-11 12:37:56] Loaded diffusers pipeline: FluxPipeline
[03-11 12:37:56] Pipeline instantiated
[03-11 12:37:56] Worker 0: Initialized device, model, and distributed environment.
[03-11 12:37:56] Worker 0: Scheduler loop started.
[03-11 12:37:56] Processing prompt 1/1: A fantasy landscape with mountains and a river, detailed, vibrant colors
[03-11 12:37:56] Processing warmup req... (1/1)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:36<00:00,  1.32s/it]
[03-11 12:38:35] Warmup req (1/1) processed in 38.30 seconds
[03-11 12:38:35] [DiffusersExecutionStage] started...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:19<00:00,  1.44it/s]
[03-11 12:38:55] [DiffusersExecutionStage] finished in 20.4576 seconds
[03-11 12:38:55] Peak GPU memory: 35.83 GB, Peak allocated: 33.83 GB, Memory pool overhead: 2.01 GB (5.6%), Remaining GPU memory at peak: 9.15 GB. Components that could stay resident (based on the last request workload): []. Related offload server args to disable: None
[03-11 12:38:56] Output saved to outputs/flux_diffusers_torch_compile.png
[03-11 12:38:56] Pixel data generated successfully in 59.35 seconds
[03-11 12:38:56] Completed batch processing. Generated 1 outputs in 59.35 seconds
[03-11 12:38:56] Warmed-up request processed in 20.46 seconds (with warmup excluded)
[03-11 12:38:56] Memory usage - Max peak: 36694.00 MB, Avg peak: 36694.00 MB

Warmed-up request processed in 20.46 seconds

  • NVIDIA L20 w/ FP8: 13.81s vs 20.46s, ~48% speedup (Diffusers backend + quantize)
sglang generate \
  --model-path=$FLUX_DIR \
  --backend diffusers \
  --log-level=info \
  --prompt='A fantasy landscape with mountains and a river, detailed, vibrant colors' \
  --width=1024 \
  --height=1024 \
  --num-inference-steps=28 \
  --warmup \
  --warmup-steps 28 \
  --dit-cpu-offload false \
  --text-encoder-cpu-offload false \
  --cache-dit-config ./quantize.yaml \
  --enable-torch-compile \
  --save-output --output-path outputs --output-file-name flux_diffusers_torch_compile_float8.png

# Output logs:
[03-11 09:24:39] Scheduler bind at endpoint: tcp://127.0.0.1:5580
[03-11 09:24:39] Initializing distributed environment with world_size=1, device=cuda:0, timeout=3600
[03-11 09:24:39] Setting distributed timeout to 3600 seconds
[03-11 09:24:40] No pipeline_class_name specified, using model_index.json
[03-11 09:24:40] Using diffusers backend for model '/workspace/dev/vipdev/hf_models/FLUX.1-dev' (explicitly requested)
[03-11 09:24:40] Using pipeline from model_index.json: DiffusersPipeline
[03-11 09:24:40] Loading diffusers pipeline from /workspace/dev/vipdev/hf_models/FLUX.1-dev
[03-11 09:24:40] Model already exists locally and is complete
[03-11 09:24:40] Loading diffusers pipeline with dtype=torch.bfloat16
Loading weights: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 196/196 [00:00<00:00, 10009.54it/s]
Loading weights: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 219/219 [00:00<00:00, 9843.88it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 39.15it/s]
Loading pipeline components...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00,  7.76it/s]
[03-11 09:24:50] [Cache-DiT] cache_config is None, skip cache acceleration for FluxPipeline.
[03-11 09:24:53] [Cache-DiT] Quantized        Module: FluxTransformer2DModel
[03-11 09:24:53] [Cache-DiT] Quantized        Method: float8
[03-11 09:24:53] [Cache-DiT] Quantized Linear Layers:   496
[03-11 09:24:53] [Cache-DiT] Skipped   Linear Layers:     8
[03-11 09:24:53] [Cache-DiT] Total     Linear Layers:   504
[03-11 09:24:53] [Cache-DiT] Total     (all)  Layers:  1279
[03-11 09:24:53] Enabled cache-dit for diffusers pipeline
[03-11 09:24:53] Applied torch.compile to transformer component of the pipeline
[03-11 09:24:53] Loaded diffusers pipeline: FluxPipeline
[03-11 09:24:53] Pipeline instantiated
[03-11 09:24:53] Worker 0: Initialized device, model, and distributed environment.
[03-11 09:24:53] Worker 0: Scheduler loop started.
[03-11 09:24:53] Processing prompt 1/1: A fantasy landscape with mountains and a river, detailed, vibrant colors
[03-11 09:24:53] Processing warmup req... (1/1)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [01:24<00:00,  3.03s/it]
[03-11 09:26:19] Warmup req (1/1) processed in 85.98 seconds
[03-11 09:26:19] [DiffusersExecutionStage] started...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:13<00:00,  2.13it/s]
[03-11 09:26:33] [DiffusersExecutionStage] finished in 13.8018 seconds
[03-11 09:26:33] Peak GPU memory: 24.87 GB, Peak allocated: 22.80 GB, Memory pool overhead: 2.07 GB (8.3%), Remaining GPU memory at peak: 20.12 GB. Components that could stay resident (based on the last request workload): []. Related offload server args to disable: None
[03-11 09:26:34] Output saved to outputs/flux_diffusers_torch_compile_float8.png
[03-11 09:26:34] Pixel data generated successfully in 100.43 seconds
[03-11 09:26:34] Completed batch processing. Generated 1 outputs in 100.43 seconds
[03-11 09:26:34] Warmed-up request processed in 13.81 seconds (with warmup excluded)
[03-11 09:26:34] Memory usage - Max peak: 25468.00 MB, Avg peak: 25468.00 MB

Warmed-up request processed in 13.81 seconds

For NVIDIA H200 (Hopper): 3.73s -> 2.77s (~35% speedup)

# NVIDIA H200 (Hopper, Diffusers backend baseline, flash_3)
sglang generate \
  --model-path=$FLUX_DIR \
  --backend diffusers \
  --log-level=info \
  --prompt='A fantasy landscape with mountains and a river, detailed, vibrant colors' \
  --width=1024 \
  --height=1024 \
  --num-inference-steps=28 \
  --warmup \
  --warmup-steps 28 \
  --dit-cpu-offload false \
  --text-encoder-cpu-offload false \
  --enable-torch-compile \
  --cache-dit-config ./hopper/flash_3.yaml \
  --save-output --output-path outputs --output-file-name flux_diffusers_torch_compile_h200.png

# NVIDIA H200 (Hopper, Diffusers backend, flash_3 + quantize)
sglang generate \
  --model-path=$FLUX_DIR \
  --backend diffusers \
  --log-level=info \
  --prompt='A fantasy landscape with mountains and a river, detailed, vibrant colors' \
  --width=1024 \
  --height=1024 \
  --num-inference-steps=28 \
  --warmup \
  --warmup-steps 28 \
  --dit-cpu-offload false \
  --text-encoder-cpu-offload false \
  --enable-torch-compile \
  --cache-dit-config ./hopper/quantize.yaml \
  --save-output --output-path outputs --output-file-name flux_diffusers_torch_compile_fp8_h200.png
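
The two Hopper YAML files referenced above ship with the cache-dit repo under examples/configs; their contents are not reproduced in this PR. A plausible sketch of ./hopper/quantize.yaml, assuming the attention_backend key mentioned in the cache_dit.md changes and the quantize_config schema shown earlier, would be:

# Hypothetical sketch; see examples/configs/hopper in the cache-dit repo
# for the real files. Key names and values are assumptions.
attention_backend: flash_3  # FlashAttention-3, available on Hopper
quantize_config:
  quant_type: "float8"
  exclude_layers:
    - "embedder"
    - "embed"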

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

cc @mickqian @RubiaCx

@github-actions (Bot) added the documentation and dependencies labels on Mar 11, 2026
@RubiaCx added the diffusion label on Mar 11, 2026
@gemini-code-assist (Bot) commented:

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on significantly improving the performance of the diffusers backend by integrating advanced optimization techniques. It updates the cache-dit library to its latest version and introduces comprehensive support for model quantization, allowing users to achieve substantial speedups, as demonstrated by nearly a 50% reduction in inference time for a specific model. The changes also provide more granular control over caching and parallelism strategies through updated configuration options, enhancing the overall efficiency and flexibility of the diffusion pipeline.

Highlights

  • Performance Improvement: Achieved approximately 48% speedup for FLUX.1-dev on L20 by enabling cache-dit quantization for the diffusers backend.
  • Dependency Update: The cache-dit library dependency was bumped to version 1.3.0.
  • Quantization Support: Introduced support for specifying quantization configurations (e.g., float8) via YAML files for transformer modules, with options to exclude specific layers.
  • Enhanced Caching and Parallelism Documentation: Expanded documentation for cache-dit to include new configurations such as Step Computation Mask (SCM), Cache CFG, Ulysses Anything Attention, Ulysses FP8 Communication, Async Ulysses CP, and Text Encoder/VAE Parallelism.


Changelog
  • docs/diffusion/performance/cache/cache_dit.md
    • Added new cache configurations including DBCache + TaylorSeer + SCM and DBCache + TaylorSeer + SCM + Cache CFG.
    • Updated parallelism configuration examples to directly include attention_backend and extra_parallel_modules.
    • Introduced documentation for Ulysses Anything Attention, Ulysses FP8 Communication, Async Ulysses CP, and TE-P/VAE-P parallelism options.
    • Added a new section detailing how to specify attention backend and quantization configurations via YAML files.
    • Included an example of combining cache, parallelism, and quantization configurations in a single YAML file.
  • python/pyproject.toml
    • Updated the cache-dit dependency from version 1.2.3 to 1.3.0.
Activity
  • Benchmarking results were provided, showcasing a 48% speedup for FLUX.1-dev on L20 with cache-dit quantization.
  • Detailed command-line examples and output logs for both baseline and FP8 quantized runs were included.
  • The checklist items for code formatting, documentation, and benchmarking have been marked as completed by the author.

@gemini-code-assist (Bot) left a comment:


Code Review

This pull request bumps up the cache-dit version and adds support for quantization for the diffusers backend, resulting in a reported speedup. The changes primarily involve updating the pyproject.toml file and modifying the cache_dit.md documentation to include examples of quantization configurations. I have added a review comment to address a potential issue.

Note: Security Review has been skipped due to the limited scope of the PR.

Comment threads: docs/diffusion/performance/cache/cache_dit.md, python/pyproject.toml
@mickqian (Collaborator) commented:

/tag-and-rerun-ci

@yhyang201 (Collaborator) commented:

/tag-and-rerun-ci

@mickqian (Collaborator) commented:

diffusion affected only, lint passed, bypassing

@mickqian mickqian merged commit 025691c into sgl-project:main Mar 17, 2026
121 of 133 checks passed
@DefTruth DefTruth deleted the diffusers-be-quant branch March 17, 2026 11:57
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
0-693 pushed a commit to 0-693/sglang that referenced this pull request Mar 25, 2026
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026

Labels

  • dependencies (Pull requests that update a dependency file)
  • diffusion (SGLang Diffusion)
  • documentation (Improvements or additions to documentation)
  • run-ci
