[Diffusion] Add Qwen Image ModelOpt FP8 support#23155

Merged
BBuf merged 11 commits into sgl-project:main from BBuf:codex/qwen-image-modelopt-fp8
May 3, 2026
Conversation


@BBuf BBuf commented Apr 19, 2026

Summary

Add ModelOpt FP8 support for Qwen Image diffusion transformers in the SGLang runtime and FP8 converter.

  • make Qwen Image attention, MLP, and top-level projections quant-aware with full checkpoint prefixes
  • add a Qwen Image / Qwen Image Edit BF16 fallback profile, including transformer_blocks.*.img_mlp.net.2 after image-quality ablation
  • fix the FP8 converter ordering so explicit BF16 fallback tensors are written out before the ModelOpt ignore-preservation step skips the source tensor
  • publish clean SGLang-native ModelOpt FP8 transformer overrides under the lmsys Hugging Face org
  • document the validated Qwen Image and Qwen Image Edit ModelOpt FP8 checkpoint flow in docs/diffusion/quantization.md
  • update the diffusion ModelOpt quant skill with the Qwen Image FP8 fallback and converter-ordering notes
  • add Qwen Image and Qwen Image Edit ModelOpt FP8 cases to the B200 diffusion CI set
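The converter-ordering fix in the list above can be sketched roughly as follows. This is an illustrative sketch only, not the real `build_modelopt_fp8_transformer.py` code: the fallback pattern `transformer_blocks.*.img_mlp.net.2` comes from this PR, while `route_tensor` and `MODELOPT_IGNORE_PATTERNS` are made-up names standing in for the converter's real logic.

```python
from fnmatch import fnmatch

# Pattern taken from the PR description; kept in BF16 after the
# image-quality ablation.
BF16_FALLBACK_PATTERNS = ["transformer_blocks.*.img_mlp.net.2.*"]
# Placeholder for whatever tensors ModelOpt's ignore-preservation skips.
MODELOPT_IGNORE_PATTERNS = ["*.norm.*"]

def route_tensor(name: str) -> str:
    """Decide how the converter handles one checkpoint tensor name."""
    # 1) Explicit BF16 fallback wins: the tensor is written out first,
    #    before any skip logic can drop it (the bug being fixed).
    if any(fnmatch(name, p) for p in BF16_FALLBACK_PATTERNS):
        return "bf16_fallback"
    # 2) Only afterwards does the generic ignore-preservation skip apply.
    if any(fnmatch(name, p) for p in MODELOPT_IGNORE_PATTERNS):
        return "skipped"
    # 3) Everything else is quantized to FP8 with scale tensors.
    return "fp8_quantized"

print(route_tensor("transformer_blocks.3.img_mlp.net.2.weight"))  # bf16_fallback
print(route_tensor("transformer_blocks.3.norm.weight"))           # skipped
print(route_tensor("transformer_blocks.3.attn.to_q.weight"))      # fp8_quantized
```

The key point is step ordering: checking the explicit fallback list before the ignore list is what guarantees the BF16 tensors land in the output shards.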

Published FP8 weights

Both repos are intentionally clean transformer override repos: README.md, config.json, and .safetensors shards only.

Validation

Validated on H100 rank0 (CUDA_VISIBLE_DEVICES=0) for the generated artifacts and benchmarks.

  • SGLang main base used during validation: 6ecd6f84d
  • ModelOpt main used for PTQ export: 26ae8da51
  • python -m compileall -q python/sglang/multimodal_gen/runtime/models/dits/qwen_image.py python/sglang/multimodal_gen/tools/build_modelopt_fp8_transformer.py -> passed
  • isort, black, and ruff check --select=F401,F821 --fix on changed Python files -> passed, no further changes
  • git diff --check -> passed
  • python3 -m py_compile python/sglang/multimodal_gen/test/server/testcase_configs.py python/sglang/multimodal_gen/test/server/gpu_cases.py -> passed after adding B200 cases
  • python3 -m black --check python/sglang/multimodal_gen/test/server/testcase_configs.py python/sglang/multimodal_gen/test/server/gpu_cases.py -> passed
  • python3 -m ruff check --select=F401,F821 python/sglang/multimodal_gen/test/server/testcase_configs.py python/sglang/multimodal_gen/test/server/gpu_cases.py -> passed

Qwen Image 1024x1024, 50 steps

Prompt: A futuristic cyberpunk city at night, neon lights reflecting on wet streets

  • BF16 image: visually normal
  • Old FP8 default: severe dark/blurred quality regression
  • Fixed FP8 (img_mlp.net.2 BF16 fallback): visually normal
  • converter stats: 660 quantized weights, 1320 scale tensors, 186 BF16 fallback weights, 3 output shards
  • benchmark with native sglang generate --backend=sglang --warmup:
    • E2E: 13589.20 ms -> 12159.39 ms, 10.5% faster
    • Denoising: 12928.76 ms -> 11437.40 ms, 11.5% faster

Qwen Image Edit 512x512, 8 steps

Prompt: A clean product photo of a small ceramic teapot on a wooden table, soft daylight, sharp details.

  • BF16 and fixed FP8 images are visually aligned for the smoke edit workload
  • converter stats: 660 quantized weights, 1320 scale tensors, 186 BF16 fallback weights, 3 output shards
  • benchmark with native sglang generate --backend=sglang --warmup:
    • E2E: 6791.97 ms -> 6085.03 ms, 10.4% faster
    • Denoising: 5204.32 ms -> 4524.01 ms, 13.1% faster

B200 CI

Added to ONE_GPU_MODELOPT_CASES for multimodal-gen-test-1-b200:

  • qwen_image_modelopt_fp8_t2i
  • qwen_image_edit_modelopt_fp8_ti2i
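The CI wiring amounts to registering the two case names above. The real structure of `testcase_configs.py` / `gpu_cases.py` is not shown in this PR, so the shape below is an assumption; only the two case identifiers are taken from the PR.

```python
# Hypothetical sketch of the B200 ModelOpt case list; the actual
# registration format in gpu_cases.py may differ.
ONE_GPU_MODELOPT_CASES = [
    # ... existing ModelOpt diffusion cases ...
    "qwen_image_modelopt_fp8_t2i",        # Qwen Image, text-to-image FP8
    "qwen_image_edit_modelopt_fp8_ti2i",  # Qwen Image Edit, text+image-to-image FP8
]

print(len([c for c in ONE_GPU_MODELOPT_CASES if c.startswith("qwen_image")]))
```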

Notes

  • The ModelOpt Qwen model registration used for PTQ export was patched only in the H100 validation checkout and is not included here.
  • The fixed profiler trace was captured for Qwen Image 1024x1024, 8 steps, --profile --num-profiled-timesteps=2.

@github-actions github-actions Bot added quant LLM Quantization diffusion SGLang Diffusion labels Apr 19, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces quantization support for the Qwen Image model. Key changes include replacing standard linear layers with ReplicatedLinear to support quantization configurations and prefixes, and implementing custom QwenImageGELU and QwenImageFeedForward modules to maintain compatibility with the model's expected state dict structure. Additionally, the PR adds FP8 fallback patterns for Qwen Image and includes comprehensive unit tests to verify prefixing and quantization method assignments. I have no feedback to provide.
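The pattern the review describes can be illustrated with a minimal stand-in: each linear layer receives its full checkpoint prefix, and the quant config uses that prefix to decide per-layer whether to quantize or fall back to BF16. `ReplicatedLinear`'s real signature lives in SGLang; the classes below are simplified assumptions that only mimic the idea.

```python
class FakeQuantConfig:
    """Stand-in for a ModelOpt quant config with BF16 fallback patterns."""
    def __init__(self, bf16_fallback_substrings):
        self.bf16_fallback_substrings = bf16_fallback_substrings

    def get_quant_method(self, prefix: str) -> str:
        # Layers whose prefix matches a fallback entry stay unquantized.
        if any(s in prefix for s in self.bf16_fallback_substrings):
            return "unquantized"
        return "fp8"

class QuantAwareLinear:
    """Minimal stand-in for ReplicatedLinear(..., quant_config, prefix=...)."""
    def __init__(self, in_features, out_features, quant_config, prefix):
        self.prefix = prefix
        # The full checkpoint prefix is what lets per-layer decisions work.
        self.quant_method = quant_config.get_quant_method(prefix)

cfg = FakeQuantConfig(["img_mlp.net.2"])
mlp_out = QuantAwareLinear(4096, 3072, cfg, "transformer_blocks.0.img_mlp.net.2")
attn_q = QuantAwareLinear(3072, 3072, cfg, "transformer_blocks.0.attn.to_q")
print(mlp_out.quant_method)  # unquantized (BF16 fallback)
print(attn_q.quant_method)   # fp8
```

This is why the PR threads full checkpoint prefixes through attention, MLP, and top-level projections: without the prefix, the quant config cannot tell `img_mlp.net.2` apart from any other linear.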

@BBuf BBuf force-pushed the codex/qwen-image-modelopt-fp8 branch 3 times, most recently from 30fad97 to aca4193 on April 19, 2026 13:46

BBuf commented Apr 19, 2026

Qwen Image / Image Edit FP8 validation update

The clean SGLang-native ModelOpt FP8 transformer checkpoints now live under the lmsys Hugging Face org:

These repos intentionally contain only the model card, config.json, and .safetensors shards. Validation images, benchmark JSON, command logs, and profiler traces are intentionally not stored in the clean model repos.

Benchmark summary

Native SGLang backend, H100 rank0, CUDA_VISIBLE_DEVICES=0, sglang generate --backend=sglang --warmup. FP8 uses the fixed checkpoint with transformer_blocks.*.img_mlp.net.2 kept as BF16 fallback.

| Workload | BF16 E2E | FP8 E2E | E2E speedup | BF16 denoising | FP8 denoising | Denoising speedup |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen Image 1024x1024, 50 steps | 13589.20 ms | 12159.39 ms | 10.5% | 12928.76 ms | 11437.40 ms | 11.5% |
| Qwen Image Edit 512x512, 8 steps | 6791.97 ms | 6085.03 ms | 10.4% | 5204.32 ms | 4524.01 ms | 13.1% |
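The speedup percentages follow from the raw timings as `(bf16 - fp8) / bf16`; a quick sanity check:

```python
def speedup_pct(bf16_ms: float, fp8_ms: float) -> float:
    """Relative speedup of FP8 over BF16, as a percentage of BF16 time."""
    return round((bf16_ms - fp8_ms) / bf16_ms * 100, 1)

print(speedup_pct(13589.20, 12159.39))  # 10.5  (Qwen Image E2E)
print(speedup_pct(12928.76, 11437.40))  # 11.5  (Qwen Image denoising)
print(speedup_pct(6791.97, 6085.03))    # 10.4  (Qwen Image Edit E2E)
print(speedup_pct(5204.32, 4524.01))    # 13.1  (Qwen Image Edit denoising)
```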

Quality / profiler notes

  • Qwen Image BF16 and fixed FP8 outputs were visually aligned for the 1024x1024 50-step prompt; the old default FP8 checkpoint had the severe dark/blurred regression.
  • Qwen Image Edit BF16 and fixed FP8 outputs were visually aligned for the 512x512 8-step smoke edit workload.
  • Converter stats for both Qwen Image and Qwen Image Edit: 660 quantized weights, 1320 scale tensors, 186 BF16 fallback weights, 3 output shards.
  • Qwen Image 1024x1024, 8-step profiler capture: BF16 802.00 ms total CUDA kernel time vs fixed FP8 581.71 ms in the profiled region; FP8 CUTLASS GEMMs replace/reduce the dominant BF16 GEMM bucket while _static_quant_fp8 accounts for about 4.2% of captured CUDA kernel time.

@BBuf BBuf marked this pull request as ready for review April 19, 2026 14:10
@BBuf BBuf force-pushed the codex/qwen-image-modelopt-fp8 branch from aca4193 to 017dfc3 on April 19, 2026 22:58
@BBuf BBuf force-pushed the codex/qwen-image-modelopt-fp8 branch from 017dfc3 to f36faeb on April 19, 2026 23:04
@github-actions github-actions Bot added the documentation Improvements or additions to documentation label Apr 19, 2026
@BBuf BBuf requested a review from wisclmy0611 as a code owner April 25, 2026 01:00

BBuf commented Apr 25, 2026

/tag-and-rerun-ci

1 similar comment

BBuf commented Apr 25, 2026

/tag-and-rerun-ci

@BBuf BBuf requested a review from JustinTong0323 as a code owner April 28, 2026 08:14

BBuf commented Apr 28, 2026

Updated this PR to use the new clean lmsys ModelOpt diffusion repos.


BBuf commented Apr 28, 2026

Pushed one follow-up lint fix (cd1ab4de3) for the import ordering reported by CI. The latest lint check is now green; multimodal-gen-component-accuracy and multimodal-gen-test-1-b200 are still queued.


BBuf commented May 2, 2026

/tag-and-rerun-ci

BBuf added 2 commits May 2, 2026 21:12
# Conflicts:
#	docs/diffusion/quantization.md
#	docs_new/docs/sglang-diffusion/quantization.mdx
#	python/sglang/multimodal_gen/test/server/testcase_configs.py

BBuf commented May 3, 2026

/tag-and-rerun-ci


BBuf commented May 3, 2026

@BBuf BBuf merged commit f2d1390 into sgl-project:main May 3, 2026
71 of 79 checks passed

Labels

diffusion SGLang Diffusion documentation Improvements or additions to documentation high priority quant LLM Quantization run-ci

2 participants