
[Diffusion] Add Wan2.2 ModelOpt NVFP4 support #22681

Merged
mickqian merged 2 commits into codex/flux1-modelopt-nvfp4-resubmit from codex/wan22-modelopt-nvfp4-from-22672 on Apr 13, 2026

Conversation

@BBuf (Collaborator) commented on Apr 13, 2026

Summary

  • add Wan2.2 ModelOpt NVFP4 support on top of #22672
  • keep a global --transformer-weights-path override scoped to the primary transformer, so transformer_2 stays on the base BF16 checkpoint unless explicitly overridden (see the sketches after this list)
  • make scheduler loading tolerate newer config fields such as shift_terminal when the resolved SGLang scheduler class does not accept them (see the sketches after this list)
  • honor SGLANG_DIFFUSION_FLASHINFER_FP4_GEMM_BACKEND in diffusion ModelOpt FP4 GEMM selection so Blackwell bring-up can force the validated FlashInfer path
  • document the validated Wan2.2 NVFP4 launch recipe
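
A minimal sketch of the --transformer-weights-path scoping described above; the function and argument names are illustrative, not the actual SGLang implementation:

```python
# Hypothetical helper (illustrative names, not the real SGLang code): apply a
# global --transformer-weights-path override only to the primary "transformer"
# component so "transformer_2" keeps loading its base BF16 checkpoint.
def resolve_weights_path(component_name: str,
                         base_path: str,
                         override_path: str | None) -> str:
    if override_path is not None and component_name == "transformer":
        return override_path
    return base_path

base = "Wan-AI/Wan2.2-T2V-A14B-Diffusers"
nvfp4 = "/path/to/wan22-nvfp4-export"  # assumed local ModelOpt export path

print(resolve_weights_path("transformer", base, nvfp4))    # NVFP4 export
print(resolve_weights_path("transformer_2", base, nvfp4))  # base BF16 weights
```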

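And a self-contained sketch of the scheduler kwarg filtering, assuming a generic filter built on inspect.signature (the real SGLang utility's name may differ):

```python
import inspect

# Illustrative version of the filtering idea: drop config fields (e.g. a newer
# shift_terminal) that the resolved scheduler class's __init__ does not accept,
# instead of raising a TypeError at construction time.
def filter_scheduler_kwargs(scheduler_cls, config: dict) -> dict:
    params = inspect.signature(scheduler_cls.__init__).parameters
    # If __init__ takes **kwargs, every field is accepted as-is.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(config)
    return {k: v for k, v in config.items() if k in params}

class LegacyScheduler:  # stand-in for a scheduler class without shift_terminal
    def __init__(self, num_train_timesteps: int = 1000, shift: float = 3.0):
        self.num_train_timesteps = num_train_timesteps
        self.shift = shift

cfg = {"num_train_timesteps": 1000, "shift": 5.0, "shift_terminal": 0.1}
scheduler = LegacyScheduler(**filter_scheduler_kwargs(LegacyScheduler, cfg))
```
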
Validation

  • official ModelOpt FP4 export for Wan-AI/Wan2.2-T2V-A14B-Diffusers, with only the primary transformer quantized and transformer_2 kept BF16
  • B200 no-compile generation with base BF16 model + --transformer-weights-path override succeeded after forcing SGLANG_DIFFUSION_FLASHINFER_FP4_GEMM_BACKEND=cudnn
  • fixed-config B200 comparison on 832x480 / 17 frames / 2 steps:
    • main BF16 no-compile: E2E 55.89s, DenoisingStage 53.51s
    • this branch NVFP4 no-compile: E2E 25.72s, DenoisingStage 23.46s
    • delta: E2E -54.0%, DenoisingStage -56.2%
  • warmup compile check on the same PR branch/config:
    • warmup eager: E2E 17.87s, DenoisingStage 16.20s
    • warmup compile: E2E 22.64s, DenoisingStage 20.96s
    • compile was slower on this setup (+26.7% E2E, +29.4% denoise)

Notes

  • this PR is intentionally stacked on #22672
  • the validated Blackwell bring-up path currently uses FlashInfer FP4 GEMM via SGLANG_DIFFUSION_FLASHINFER_FP4_GEMM_BACKEND=cudnn (a minimal sketch follows this list)
  • local B200 artifacts (videos, perf dumps, traces, summaries) were collected separately outside the repo
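
A rough illustration of the env-var override mentioned above (only the variable name comes from this PR; the surrounding selection logic is assumed):

```python
import os

# Sketch: honor SGLANG_DIFFUSION_FLASHINFER_FP4_GEMM_BACKEND when set, so a
# Blackwell bring-up can pin the validated backend instead of auto-detection.
def select_fp4_gemm_backend(default_backend: str = "auto") -> str:
    forced = os.environ.get("SGLANG_DIFFUSION_FLASHINFER_FP4_GEMM_BACKEND")
    if forced:
        return forced  # e.g. "cudnn", as used in the validation runs
    return default_backend
```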

@BBuf (Collaborator, Author) commented on Apr 13, 2026

/tag-and-rerun-ci

@github-actions bot added the documentation (Improvements or additions to documentation), quant (LLM Quantization), diffusion (SGLang Diffusion), and run-ci labels on Apr 13, 2026
@gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces support for Wan2.2-T2V-A14B-Diffusers quantization and improves the robustness of component loading. Key changes include documentation for dual-transformer FP4 exports, a new utility to filter unsupported scheduler initialization arguments, and logic to mask global quantization overrides for secondary transformer components. Additionally, the CUDA platform now supports an environment variable to prefer FlashInfer for FP4 GEMM operations. Unit tests were added to verify the new filtering and masking behaviors. I have no feedback to provide.

@mickqian merged commit 85863f4 into codex/flux1-modelopt-nvfp4-resubmit on Apr 13, 2026 (2 checks passed)
@mickqian deleted the codex/wan22-modelopt-nvfp4-from-22672 branch on Apr 13, 2026 at 10:12
