
[Diffusion] modelopt diffusion fp8 support for flux1/flux2 and wan2.2#22365

Merged
BBuf merged 17 commits into main from bbuf/modelopt-diffusion-fp8 on Apr 10, 2026
Conversation

@BBuf
Collaborator

@BBuf BBuf commented Apr 8, 2026

Summary

This PR adds a diffusion-side ModelOpt FP8 loading path for SGLang and a reusable workflow for converting ModelOpt diffusers exports into SGLang-loadable checkpoints.

The main goal is to make ModelOpt FP8 practical for SGLang diffusion models without requiring users to manually reconstruct FP8 checkpoints from backbone.pt every time.

What changed

Runtime support

  • add a dedicated modelopt_fp8 quantization path for diffusion models
  • resolve quant_method=modelopt + quant_algo=FP8 into the SGLang diffusion FP8 runtime path (see the sketch after this list)
  • allow quant config detection from an overridden transformer checkpoint instead of relying only on the base model config
  • force-disable dit_cpu_offload and dit_layerwise_offload for ModelOpt FP8 checkpoints
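
To make the resolution step concrete, here is a minimal sketch of the detection logic; the function and key names are illustrative assumptions, not the PR's actual code:

def resolve_diffusion_quant_method(quant_config: dict) -> str | None:
    # Hypothetical helper: map a ModelOpt-style quant config dict onto
    # the dedicated diffusion FP8 runtime path.
    method = quant_config.get("quant_method")
    algo = quant_config.get("quant_algo") or quant_config.get(
        "quantization", {}
    ).get("quant_algo")
    if method == "modelopt" and algo == "FP8":
        return "modelopt_fp8"
    return None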

Why offload is disabled:

  • the current diffusion FP8 linear path depends on a CUTLASS-compatible FP8 weight layout
  • the DiT offload/restore path does not preserve that layout
  • enabling those offload modes can break FP8 GEMM expectations at runtime (see the matmul sketch below)
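
For context, the FP8 GEMM the runtime path builds on looks roughly like the per-tensor torch._scaled_mm call below. This is a conceptual sketch, not the actual CUTLASS kernel invocation, and the exact _scaled_mm signature varies across PyTorch versions; it assumes the weight is already stored in the transposed layout _scaled_mm expects:

import torch

def fp8_linear(x_bf16, weight_fp8_t, weight_scale, input_scale):
    # Quantize activations per-tensor to float8_e4m3fn (max representable ~448).
    x_fp8 = (x_bf16 / input_scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn)
    # The FP8 weight must keep its column-major (pre-transposed) layout;
    # the DiT offload/restore round-trip does not preserve it.
    return torch._scaled_mm(
        x_fp8,
        weight_fp8_t,
        scale_a=input_scale,
        scale_b=weight_scale,
        out_dtype=torch.bfloat16,
    )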

FP8 checkpoint conversion

  • add python -m sglang.multimodal_gen.tools.convert_modelopt_fp8_checkpoint
  • this tool reads a ModelOpt diffusers FP8 export plus backbone.pt
  • it reconstructs weight_scale / input_scale
  • it materializes SGLang-native float8_e4m3fn weights
  • it preserves ModelOpt ignore layers in their original dtype

The converter's core flow is model-agnostic; the only model-family-specific part is an optional BF16 fallback profile, and the only validated built-in profile today targets FLUX.2.
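
The scale-reconstruction core amounts to a few lines. The sketch below assumes backbone.pt carries ModelOpt per-tensor amax calibration values; the function and argument names are hypothetical:

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

def materialize_fp8(weight_bf16, weight_amax, input_amax):
    # Per-tensor scales derived from calibration amax values.
    weight_scale = (weight_amax / FP8_MAX).float()
    input_scale = (input_amax / FP8_MAX).float()
    # Materialize SGLang-native float8_e4m3fn weights.
    weight_fp8 = (
        (weight_bf16 / weight_scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    )
    return weight_fp8, weight_scale, input_scale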

Validation helper

  • add python -m sglang.multimodal_gen.tools.compare_diffusion_trajectory_similarity
  • this compares BF16 and quantized runs with the same prompt/seed and reports (see the metric sketch after this list):
    • latent trajectory cosine similarity
    • MAE / RMSE / max-abs
    • final image or video frame PSNR / MAE
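
The metrics themselves are standard; here is a minimal sketch of what the tool computes (function names are illustrative, not the tool's API):

import torch
import torch.nn.functional as F

def step_metrics(lat_ref: torch.Tensor, lat_quant: torch.Tensor) -> dict:
    # Latent trajectory agreement between the BF16 and quantized runs.
    cos = F.cosine_similarity(lat_ref.flatten(), lat_quant.flatten(), dim=0)
    diff = (lat_ref - lat_quant).float()
    return {
        "cosine": cos.item(),
        "mae": diff.abs().mean().item(),
        "rmse": diff.pow(2).mean().sqrt().item(),
        "max_abs": diff.abs().max().item(),
    }

def psnr(img_ref: torch.Tensor, img_quant: torch.Tensor, data_range: float = 255.0) -> float:
    # Final image / video frame PSNR.
    mse = (img_ref.float() - img_quant.float()).pow(2).mean()
    return (10 * torch.log10(data_range**2 / mse)).item()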

Skill

  • add a reusable skill for future diffusion ModelOpt work:
    • python/sglang/multimodal_gen/.claude/skills/sglang-diffusion-modelopt-quant/SKILL.md
  • the skill focuses on a reusable workflow first, with FLUX.2 and Wan2.2 as reference examples

Notes on ModelOpt formats

FP8 currently needs an extra SGLang-side conversion step.

Why:

  • the current diffusion FP8 runtime expects explicit weight_scale and input_scale tensors
  • the validated ModelOpt diffusers FP8 export still needs those tensors to be materialized from backbone.pt
  • SGLang then consumes the materialized float8_e4m3fn weights from the converted checkpoint

NVFP4 is different:

  • the official ModelOpt diffusers export already contains the packed FP4 weights, scale tensors, and enough metadata for SGLang to rebuild the runtime quant config
  • SGLang mainly needs checkpoint-family detection plus runtime layout adaptation (see the detection sketch below)
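
A detection sketch for illustration; it assumes the export ships a hf_quant_config.json next to the weights (the exact file name for diffusers exports is an assumption here):

import json
import os

def detect_modelopt_family(ckpt_dir: str) -> str | None:
    cfg_path = os.path.join(ckpt_dir, "hf_quant_config.json")
    if not os.path.isfile(cfg_path):
        return None
    with open(cfg_path) as f:
        algo = json.load(f).get("quantization", {}).get("quant_algo")
    # Map the declared algorithm onto a runtime quant path.
    return {"FP8": "modelopt_fp8", "NVFP4": "modelopt_fp4"}.get(algo)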

Published checkpoints

The following converted checkpoints are already published so users do not need to run ModelOpt export + SGLang conversion themselves.

  • FLUX.2 FP8 transformer:
    • BBuf/flux2-dev-modelopt-fp8-sglang-transformer
  • Wan2.2 FP8 transformer:
    • BBuf/wan22-t2v-a14b-modelopt-fp8-sglang-transformer

Example usage:

python -m sglang.multimodal_gen.runtime.entrypoints.cli.main generate \
  --model-path black-forest-labs/FLUX.2-dev \
  --transformer-path BBuf/flux2-dev-modelopt-fp8-sglang-transformer \
  ...
python -m sglang.multimodal_gen.runtime.entrypoints.cli.main generate \
  --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
  --transformer-path BBuf/wan22-t2v-a14b-modelopt-fp8-sglang-transformer \
  ...

Note:

  • the published Wan2.2 checkpoint is the primary-transformer FP8 override validated in our H100 runs
  • transformer_2 is still loaded from the base model in BF16 in that published recipe

Validation

FLUX.2

Validation was run on H100 with nightly-aligned settings and BF16/FP8 output comparisons.

Observed latency:

  • BF16: 24.47 s total, 23.21 s denoising
  • FP8: 17.13 s total, 16.21 s denoising
  • improvement: about 30.0% total and 30.1% denoising

Reduced deterministic validation also showed high latent trajectory agreement:

  • last-step latent cosine similarity: about 0.9971
[Images: flux2_bf16 vs flux2_fp8 outputs]

Profiling the last layer of step 5 shows its time dropping from 5.7 ms to 2.5 ms.

Wan2.2

Validation was run on H100 with nightly-aligned settings using the validated primary-transformer FP8 override.

Observed latency:

  • BF16: 212.19 s total, 204.09 s denoising
  • FP8: 204.38 s total, 196.28 s denoising
  • improvement: about 3.68% total and 3.83% denoising

Reduced deterministic validation also showed stable trajectory agreement:

  • last-step latent cosine similarity: about 0.9755
wan22_bf16_nocompile.mp4
wan22_fp8_nocompile.mp4

Artifacts

For both FLUX.2 and Wan2.2, I collected:

  • BF16 and FP8 generated outputs
  • torch profiler traces
  • perf json dumps
  • trajectory similarity results

These artifacts were used during local validation and can be attached in review if needed.

Scope

This PR focuses on:

  • diffusion ModelOpt FP8 loading
  • ModelOpt FP8 checkpoint conversion
  • reusable validation guidance and workflow

It does not add ModelOpt mixed precision support.


@github-actions github-actions Bot added documentation Improvements or additions to documentation quant LLM Quantization diffusion SGLang Diffusion labels Apr 8, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request implements support for NVIDIA ModelOpt FP8 and NVFP4 quantization in SGLang Diffusion, introducing new runtime layers, loading adapters, and tools for checkpoint conversion and accuracy validation. Feedback focuses on generalizing the layer exclusion logic to avoid LLM-specific assumptions and ensuring that parameter metadata is preserved during weight processing by using appropriate utility functions.

Comment on lines +109 to +124
import regex as re

fused_patterns = ["q_a_proj", "q_b_proj", "kv_a_proj_with_mqa", "kv_b_proj"]
prefix_split = prefix.split(".")
for pattern in self.exclude_modules:
    regex_str = pattern.replace(".", r"\.").replace("*", r".*")
    pattern_split = pattern.split(".")
    if re.fullmatch(regex_str, prefix):
        return True
    if (
        pattern_split[-1] in fused_patterns
        and pattern_split[-1] in prefix_split[-1]
    ):
        assert len(prefix_split) == 5 and len(pattern_split) == 5
        return True
return False

medium

The is_layer_excluded method contains logic and assertions that are specific to LLM layer structures in sglang.srt (e.g., fused_patterns like q_a_proj and assert len(prefix_split) == 5). These are likely not applicable to diffusion models and could cause runtime errors or incorrect exclusion behavior. Additionally, it's recommended to use the standard re library instead of regex for these simple patterns.

        import re

        for pattern in self.exclude_modules:
            regex_str = pattern.replace(".", r"\.").replace("*", r".*")
            if re.fullmatch(regex_str, prefix):
                return True
        return False

Comment on lines +356 to +360
layer.weight = Parameter(quantized_weight.t(), requires_grad=False)
if self.cutlass_fp8_supported:
    max_w_scale = convert_to_channelwise(max_w_scale, layer.logical_widths)
layer.weight_scale = Parameter(max_w_scale, requires_grad=False)
layer.input_scale = Parameter(layer.input_scale.max(), requires_grad=False)

medium

In process_weights_after_loading, replacing layer.weight, layer.weight_scale, and layer.input_scale with plain Parameter objects removes the custom metadata and attributes (like weight_loader, input_dim, etc.) associated with ModelWeightParameter and PerTensorScaleParameter. It is safer to use copy_or_rebind_param to update the data while preserving the parameter types and their metadata.

Suggested change

layer.weight = Parameter(quantized_weight.t(), requires_grad=False)
if self.cutlass_fp8_supported:
    max_w_scale = convert_to_channelwise(max_w_scale, layer.logical_widths)
layer.weight_scale = Parameter(max_w_scale, requires_grad=False)
layer.input_scale = Parameter(layer.input_scale.max(), requires_grad=False)

becomes

copy_or_rebind_param(layer, "weight", quantized_weight.t())
if self.cutlass_fp8_supported:
    max_w_scale = convert_to_channelwise(max_w_scale, layer.logical_widths)
copy_or_rebind_param(layer, "weight_scale", max_w_scale)
copy_or_rebind_param(layer, "input_scale", layer.input_scale.max())

@BBuf
Collaborator Author

BBuf commented Apr 9, 2026

/tag-and-rerun-ci

@github-actions github-actions Bot added the run-ci label Apr 9, 2026
Collaborator

@mickqian mickqian left a comment


some TODOs:

  1. adapt quantization doc if necessary
  2. add at least one testcase for modelopt fp8

"""
quant_config = get_quant_config(hf_config, component_model_path)
if quant_config is None and server_args.transformer_weights_path:
override_quantized_path = maybe_download_model(

maybe extract to a dedicated function here to better illustrate the quant load logic

Collaborator Author


done
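
The extracted helper ended up as _resolve_quant_config_from_transformer_override (the name appears in the follow-up commit referenced below). A sketch of its shape, with the signature as an assumption:

def _resolve_quant_config_from_transformer_override(server_args, hf_config):
    # Fall back to the overridden transformer checkpoint when the base
    # model config carries no quantization information.
    if not server_args.transformer_weights_path:
        return None
    override_path = maybe_download_model(server_args.transformer_weights_path)
    return get_quant_config(hf_config, override_path)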

Collaborator Author

BBuf commented Apr 10, 2026

Split the ModelOpt FP8 skill and helper tooling out into stacked PR #22492 so this PR stays focused on the runtime / loader / test changes.

This PR now only keeps the runtime-side code, docs, and the diffusion FP8 correctness test.

@BBuf
Collaborator Author

BBuf commented Apr 10, 2026

/tag-and-rerun-ci

@BBuf BBuf changed the title from "[Diffusion] modelopt diffusion fp8 support for flux2 and wan2.2" to "[Diffusion] modelopt diffusion fp8 support for flux1/flux2 and wan2.2" on Apr 10, 2026
@BBuf
Collaborator Author

BBuf commented Apr 10, 2026

FLUX1

main

{
  "timestamp": "2026-04-10T08:54:23.530155+00:00",
  "request_id": "19b63cc2-4529-4b9a-9aa7-33a0783dcda2",
  "commit_hash": "N/A",
  "tag": "cli_generate",
  "total_duration_ms": 6695.81201497931,
  "steps": [
    {
      "name": "InputValidationStage",
      "duration_ms": 0.025160028599202633
    },
    {
      "name": "TextEncodingStage",
      "duration_ms": 30.52324301097542
    },
    {
      "name": "TimestepPreparationStage",
      "duration_ms": 0.36307109985500574
    },
    {
      "name": "LatentPreparationStage",
      "duration_ms": 0.11596397962421179
    },
    {
      "name": "DenoisingStage",
      "duration_ms": 6460.792711004615
    },
    {
      "name": "DecodingStage",
      "duration_ms": 27.556644985452294
    }
  ],
  "denoise_steps_ms": [
    {
      "step": 0,
      "duration_ms": 45.21148093044758
    },
    {
      "step": 1,
      "duration_ms": 107.14216693304479
    },
    {
      "step": 2,
      "duration_ms": 127.85932107362896
    },
    {
      "step": 3,
      "duration_ms": 127.8339650016278
    },
    {
      "step": 4,
      "duration_ms": 128.16816894337535
    },
    {
      "step": 5,
      "duration_ms": 128.27936396934092
    },
    {
      "step": 6,
      "duration_ms": 128.3823469420895
    },
    {
      "step": 7,
      "duration_ms": 135.9735649311915
    },
    {
      "step": 8,
      "duration_ms": 139.40520700998604
    },
    {
      "step": 9,
      "duration_ms": 133.79655499011278
    },
    {
      "step": 10,
      "duration_ms": 129.9675980117172
    },
    {
      "step": 11,
      "duration_ms": 128.2759360037744
    },
    {
      "step": 12,
      "duration_ms": 128.12580401077867
    },
    {
      "step": 13,
      "duration_ms": 128.40036593843251
    },
    {
      "step": 14,
      "duration_ms": 131.5733449300751
    },
    {
      "step": 15,
      "duration_ms": 138.8318210374564
    },
    {
      "step": 16,
      "duration_ms": 136.0765720019117
    },
    {
      "step": 17,
      "duration_ms": 131.5552389714867
    },
    {
      "step": 18,
      "duration_ms": 128.84778704028577
    },
    {
      "step": 19,
      "duration_ms": 128.27524892054498
    },
    {
      "step": 20,
      "duration_ms": 128.18402098491788
    },
    {
      "step": 21,
      "duration_ms": 129.26305492874235
    },
    {
      "step": 22,
      "duration_ms": 135.51268901210278
    },
    {
      "step": 23,
      "duration_ms": 133.56514391489327
    },
    {
      "step": 24,
      "duration_ms": 134.86903789453208
    },
    {
      "step": 25,
      "duration_ms": 130.30908501241356
    },
    {
      "step": 26,
      "duration_ms": 128.80951107945293
    },
    {
      "step": 27,
      "duration_ms": 128.9640519535169
    },
    {
      "step": 28,
      "duration_ms": 129.3845809996128
    },
    {
      "step": 29,
      "duration_ms": 132.02287105377764
    },
    {
      "step": 30,
      "duration_ms": 135.59336494654417
    },
    {
      "step": 31,
      "duration_ms": 134.8614349262789
    },
    {
      "step": 32,
      "duration_ms": 131.2766190385446
    },
    {
      "step": 33,
      "duration_ms": 131.51185400784016
    },
    {
      "step": 34,
      "duration_ms": 128.7684499984607
    },
    {
      "step": 35,
      "duration_ms": 128.77396901603788
    },
    {
      "step": 36,
      "duration_ms": 130.59861096553504
    },
    {
      "step": 37,
      "duration_ms": 132.69286998547614
    },
    {
      "step": 38,
      "duration_ms": 134.68070805538446
    },
    {
      "step": 39,
      "duration_ms": 132.85610906314105
    },
    {
      "step": 40,
      "duration_ms": 130.88433700613678
    },
    {
      "step": 41,
      "duration_ms": 129.696708987467
    },
    {
      "step": 42,
      "duration_ms": 130.65053895115852
    },
    {
      "step": 43,
      "duration_ms": 130.38856105413288
    },
    {
      "step": 44,
      "duration_ms": 131.33260502945632
    },
    {
      "step": 45,
      "duration_ms": 133.01292096730322
    },
    {
      "step": 46,
      "duration_ms": 133.72311799321324
    },
    {
      "step": 47,
      "duration_ms": 132.46590201742947
    },
    {
      "step": 48,
      "duration_ms": 129.67829301487654
    },
    {
      "step": 49,
      "duration_ms": 130.17743604723364
    }
  ],
  "memory_checkpoints": {
    "before_forward": {
      "allocated_mb": 23069.31,
      "reserved_mb": 31882.0,
      "peak_allocated_mb": 23069.31,
      "peak_reserved_mb": 31882.0
    },
    "after_forward": {
      "allocated_mb": 23085.82,
      "reserved_mb": 31882.0,
      "peak_allocated_mb": 27956.82,
      "peak_reserved_mb": 31882.0
    }
  },
  "meta": {
    "prompt": [
      "A futuristic cyberpunk city at night, neon lights reflecting on wet streets"
    ],
    "model": "/tmp/flux1_fp8_run/base_model"
  }
}
main_bf16

pr (fp8)

{
  "timestamp": "2026-04-10T09:19:39.207633+00:00",
  "request_id": "91de44b2-a4f2-4002-8d82-b082c7b3bcaa",
  "commit_hash": "N/A",
  "tag": "cli_generate",
  "total_duration_ms": 5582.085501984693,
  "steps": [
    {
      "name": "InputValidationStage",
      "duration_ms": 0.030776020139455795
    },
    {
      "name": "TextEncodingStage",
      "duration_ms": 35.968265030533075
    },
    {
      "name": "TimestepPreparationStage",
      "duration_ms": 0.4988630535081029
    },
    {
      "name": "LatentPreparationStage",
      "duration_ms": 0.13579893857240677
    },
    {
      "name": "DenoisingStage",
      "duration_ms": 5350.660264957696
    },
    {
      "name": "DecodingStage",
      "duration_ms": 22.37797703128308
    }
  ],
  "denoise_steps_ms": [
    {
      "step": 0,
      "duration_ms": 39.828247972764075
    },
    {
      "step": 1,
      "duration_ms": 92.50018000602722
    },
    {
      "step": 2,
      "duration_ms": 107.16919403057545
    },
    {
      "step": 3,
      "duration_ms": 107.11465799249709
    },
    {
      "step": 4,
      "duration_ms": 106.54556192457676
    },
    {
      "step": 5,
      "duration_ms": 107.54498501773924
    },
    {
      "step": 6,
      "duration_ms": 107.66280791722238
    },
    {
      "step": 7,
      "duration_ms": 107.3214530479163
    },
    {
      "step": 8,
      "duration_ms": 110.10823596734554
    },
    {
      "step": 9,
      "duration_ms": 116.6555939707905
    },
    {
      "step": 10,
      "duration_ms": 109.03783701360226
    },
    {
      "step": 11,
      "duration_ms": 107.415645965375
    },
    {
      "step": 12,
      "duration_ms": 107.34590794891119
    },
    {
      "step": 13,
      "duration_ms": 107.16972593218088
    },
    {
      "step": 14,
      "duration_ms": 106.64337896741927
    },
    {
      "step": 15,
      "duration_ms": 106.93808004725724
    },
    {
      "step": 16,
      "duration_ms": 107.53305698744953
    },
    {
      "step": 17,
      "duration_ms": 109.33092003688216
    },
    {
      "step": 18,
      "duration_ms": 111.23268003575504
    },
    {
      "step": 19,
      "duration_ms": 110.03234900999814
    },
    {
      "step": 20,
      "duration_ms": 108.85851900093257
    },
    {
      "step": 21,
      "duration_ms": 108.3691141102463
    },
    {
      "step": 22,
      "duration_ms": 108.2050099503249
    },
    {
      "step": 23,
      "duration_ms": 107.47028910554945
    },
    {
      "step": 24,
      "duration_ms": 107.59423393756151
    },
    {
      "step": 25,
      "duration_ms": 107.26817406248301
    },
    {
      "step": 26,
      "duration_ms": 108.95517095923424
    },
    {
      "step": 27,
      "duration_ms": 109.6436909865588
    },
    {
      "step": 28,
      "duration_ms": 110.43768108356744
    },
    {
      "step": 29,
      "duration_ms": 109.63370196986943
    },
    {
      "step": 30,
      "duration_ms": 108.53177506942302
    },
    {
      "step": 31,
      "duration_ms": 108.52476407308131
    },
    {
      "step": 32,
      "duration_ms": 108.16039005294442
    },
    {
      "step": 33,
      "duration_ms": 108.08389307931066
    },
    {
      "step": 34,
      "duration_ms": 108.0660269362852
    },
    {
      "step": 35,
      "duration_ms": 109.21005508862436
    },
    {
      "step": 36,
      "duration_ms": 109.02151791378856
    },
    {
      "step": 37,
      "duration_ms": 109.09041599370539
    },
    {
      "step": 38,
      "duration_ms": 109.37492293305695
    },
    {
      "step": 39,
      "duration_ms": 109.04035402927548
    },
    {
      "step": 40,
      "duration_ms": 108.39987103827298
    },
    {
      "step": 41,
      "duration_ms": 108.93821506761014
    },
    {
      "step": 42,
      "duration_ms": 108.05032507050782
    },
    {
      "step": 43,
      "duration_ms": 108.09110407717526
    },
    {
      "step": 44,
      "duration_ms": 108.69649704545736
    },
    {
      "step": 45,
      "duration_ms": 109.48362399358302
    },
    {
      "step": 46,
      "duration_ms": 108.94432198256254
    },
    {
      "step": 47,
      "duration_ms": 109.13362202700227
    },
    {
      "step": 48,
      "duration_ms": 109.00457098614424
    },
    {
      "step": 49,
      "duration_ms": 109.0579240117222
    }
  ],
  "memory_checkpoints": {
    "before_forward": {
      "allocated_mb": 17663.82,
      "reserved_mb": 26480.0,
      "peak_allocated_mb": 17663.82,
      "peak_reserved_mb": 26480.0
    },
    "after_forward": {
      "allocated_mb": 17680.33,
      "reserved_mb": 26480.0,
      "peak_allocated_mb": 22551.33,
      "peak_reserved_mb": 26480.0
    }
  },
  "meta": {
    "prompt": [
      "A futuristic cyberpunk city at night, neon lights reflecting on wet streets"
    ],
    "model": "/tmp/flux1_fp8_run/base_model"
  }
}
pr_fp8


@BBuf BBuf merged commit 1ff5155 into main Apr 10, 2026
42 of 76 checks passed
@BBuf BBuf deleted the bbuf/modelopt-diffusion-fp8 branch April 10, 2026 12:56

alisonshao pushed a commit that referenced this pull request Apr 10, 2026
Mock maybe_download_model in test_resolve_transformer_quant_load_spec_keeps_nunchaku_hook
to prevent it from trying to download a fake local path as an HF repo.

#22365 added _resolve_quant_config_from_transformer_override which calls
maybe_download_model on the transformer_weights_path, but the test uses
a non-existent /tmp path that fails HF Hub validation.
hnyls2002 pushed a commit that referenced this pull request Apr 11, 2026
Co-authored-by: Alison Shao <alison.shao@MacBook-Pro-D2W773R9CD.local>
pyc96 pushed a commit to pyc96/sglang that referenced this pull request Apr 14, 2026
…gl-project#22560)

Co-authored-by: Alison Shao <alison.shao@MacBook-Pro-D2W773R9CD.local>
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
…gl-project#22560)

Co-authored-by: Alison Shao <alison.shao@MacBook-Pro-D2W773R9CD.local>
