[fusion] add composable fusion pass framework #10549
DevashishLal-CB wants to merge 21 commits into sgl-project:main
Conversation
Things Pending as of now
Can we add a sgl-kernel fused kernel pass example? Such as …
@BBuf Added the example for topk_softmax fusion, and also added the rmsnorm_quant fusion pass with tests. This MR is ready for review; I'll look into CUDA graph support and do it as a separate MR. Will collaborate with @yuan-luo.
Cool, we'll review ASAP.
```python
from sglang.srt.server_args import ServerArgs


class FusionManager(CustomGraphPass):
```
Instead of a FusionManager, we would prefer to add an abstraction and form a PassManager, in which fusion is just one of several pass types, similar to the LLVM pass concept. Other pass types could include AsyncTPPass, AllReduceFusionPass, RMSNormQuantFusionPass, etc.
Refer to https://github.com/sgl-project/sglang/pull/10987/files#diff-61475915ef47a86d47da62c647cd346f64c4b702c94728ab84172aed428e4fc0
for more details.
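For illustration, a minimal sketch of that shape might look like the following; the class names and interface here are assumptions for discussion, not the linked PR's exact API.

```python
# Minimal sketch of a PassManager abstraction (illustrative, not the linked
# PR's exact API): every pass rewrites an FX graph, and fusion passes are
# just one category of pass registered with the manager.
from abc import ABC, abstractmethod
from typing import List

import torch.fx as fx


class InductorPass(ABC):
    @abstractmethod
    def __call__(self, graph: fx.Graph) -> None:
        """Rewrite the graph in place."""


class FusionPass(InductorPass):
    """Base class for pattern-match-and-replace fusion passes."""


class RMSNormQuantFusionPass(FusionPass):
    def __call__(self, graph: fx.Graph) -> None:
        ...  # match rmsnorm + quant subgraphs and swap in the fused op


class PassManager:
    """Runs a configurable sequence of passes over the compiled graph."""

    def __init__(self) -> None:
        self._passes: List[InductorPass] = []

    def add_pass(self, p: InductorPass) -> None:
        self._passes.append(p)

    def __call__(self, graph: fx.Graph) -> None:
        for p in self._passes:
            p(graph)
```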
```python
from sglang.srt.server_args import ServerArgs

try:
    from vllm import _custom_ops  # noqa: F401
```
I'll port over the kernel
```diff
@@ -147,14 +156,21 @@ def patch_model(
         tp_group.ca_comm = backup_ca_comm


-def set_torch_compile_config():
+def set_torch_compile_config(server_args, model_config):
```
Parameters in the def should have type annotations.
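For example, something along these lines; the `ModelConfig` import path is an assumption:

```python
# Possible typed signature; ModelConfig's import path is an assumption here.
from sglang.srt.configs.model_config import ModelConfig
from sglang.srt.server_args import ServerArgs


def set_torch_compile_config(server_args: ServerArgs, model_config: ModelConfig) -> None:
    ...
```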
```diff
@@ -1788,6 +1788,8 @@ def init_device_graphs(self):
             return

+        if self.device != "cpu" and self.server_args.disable_cuda_graph:
+            if self.server_args.enable_torch_compile:
```
Do we need to run torch_compile in the disable_cuda_graph case?
I haven't looked into it much, but the two passes I added weren't working with CUDA graphs enabled. I'm also not sure whether all other hardware platforms support CUDA graphs.
```python
# limitations under the License.
# ==============================================================================

import logging
```
We'd better put this configuration file in the python/sglang/srt/configs/ directory.
```python
return torch.compile(
    torch.no_grad()(forward),
    mode=os.environ.get("SGLANG_TORCH_COMPILE_MODE", "max-autotune-no-cudagraphs"),
    dynamic=False,
)
```
You have to use fullgraph=True. That's a merge blocker, isn't it?
Currently Dynamo encounters graph breaks on attention; a unified attention op would solve this, as done in #10062.
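Once the graph breaks are resolved, the compile call could presumably be tightened to something like the sketch below (keeping the existing env-var knob; this is not the PR's current code, and the helper name is illustrative):

```python
# Sketch only: compile the whole forward with fullgraph=True so any remaining
# graph break fails loudly instead of silently splitting the graph.
import os

import torch


def _compile_forward(forward):
    return torch.compile(
        torch.no_grad()(forward),
        mode=os.environ.get("SGLANG_TORCH_COMPILE_MODE", "max-autotune-no-cudagraphs"),
        fullgraph=True,
        dynamic=False,
    )
```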
```diff
@@ -114,6 +114,21 @@ def _to_torch(model: torch.nn.Module, reverse: bool, num_tokens: int):
         _to_torch(sub, reverse, num_tokens)


+def _torch_compile_wrapper(forward):
```
No more design patterns in 2025 except Wrapper and Manager, right? [sarcasm]
Your function is a decorator, not a wrapper.
Yeah, this entry point is supposed to be a placeholder. Once we have a custom backend (which will be required for piecewise CUDA graphs), that backend would manage this invocation; I didn't want to make a big diff here.
```python
from sglang.srt.compilation.fusion.fusion_pass import FusionPass


class RMSNormQuantPass(FusionPass):
```
It's not clear from the name and namespace what type of quantization is supported: fp8, int8, int4, or binary?
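Per the PR description, the target is FP8 (e4m3); one way to make that explicit is to encode the dtype in the pass name, e.g. (illustrative naming only, not the PR's actual class):

```python
# Illustrative naming only -- not the PR's actual class. Encoding the target
# dtype in the name answers the "which quantization?" question at a glance.
from sglang.srt.compilation.fusion.fusion_pass import FusionPass


class RMSNormFP8QuantPass(FusionPass):
    """Fuses RMSNorm + FP8 (e4m3) quantization into a single kernel call."""
```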
Signed-off-by: Devashish Lal <devcode@fb.com>
…#2243)

## 📌 Description

FP8 model inference requires multiple intermediate quantization kernels, which can be avoided by fusing the norm and quantization kernels. Consumers like sglang and vllm can lower to these norm + quant fusion kernels using custom torch compile passes.

## 🚀 Pull Request Checklist

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

### Reference

I have been working on adding custom fusion passes to sglang as part of the following [RFC](sgl-project/sglang#10118) and would like to use flashinfer's norm kernels for the norm + quant fusions instead of migrating vllm kernels to sglang as part of the following [MR](sgl-project/sglang#10549).

### Implementation

I realise that the existing kernels (at least for rmsnorm) could be modified to take the scale as an optional parameter, thereby avoiding most code duplication. However, as an initial implementation, I have opted for a separate implementation route. This can be refactored if required.

For fused_add_rmsnorm_quant, I don't think an in-place update is possible since the dtypes of the input and output differ.

Currently, the FP8 E4M3 numeric limit (448) is hard-coded, as I am not aware of a way to get this value at compile time without including c10 headers from torch, and I am not sure whether that is acceptable post tvm-ffi migration. The following is a snippet from vLLM; I have seen similar code elsewhere for getting the FP8 numeric limits:

```cpp
#include <c10/util/Float8_e4m3fn.h>

template <typename T,
          typename = std::enable_if_t<std::is_same_v<T, c10::Float8_e4m3fn> ||
                                      std::is_same_v<T, c10::Float8_e4m3fnuz> ||
                                      std::is_same_v<T, int8_t>>>
struct quant_type_max {
  static constexpr T val() { return std::numeric_limits<T>::max(); }
};
```

The best option in my mind is to introduce `include/flashinfer/fp8.h` containing something similar to the above snippet, and to also support e5m2.

### Tests

atol and rtol for the fp8 assertions had to be high due to the low-precision nature of the data, but with tolerances of 1e-2 only a few tests fail, each with a single-element mismatch.

Signed-off-by: Devashish Lal <laldevashish@gmail.com>
Signed-off-by: Devashish Lal <devcode@fb.com>
These kernels are faster across all benchmarks when compared against AOT sglang, fused flashinfer (CuTe DSL), and the unfused implementation.
Signed-off-by: Devashish Lal <devcode@fb.com>
Signed-off-by: Devashish Lal <devcode@fb.com>
Signed-off-by: Devashish Lal <devcode@fb.com>
Signed-off-by: Devashish Lal <devcode@fb.com>
Motivation
Initial implementation of the changes proposed in #10118
Modifications
This PR adds the fusion passes and integration tests for them.
Passes added:
For the fusion passes to work with the cuda graph runner I had to get rid of the model patching (alternatively, I could rewrite the passes with pattern functions that match pure PyTorch code; a rough sketch of that approach is below). We should avoid this model patching in any case, as it interferes with the compilation process.
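For reference, a rough sketch of what a pure-PyTorch pattern function for the rmsnorm + FP8 quant case could look like, using torch._inductor's pattern matcher. This is not the PR's code: exact pattern-matcher APIs vary across PyTorch versions, and `torch.ops.sgl_kernel.rmsnorm_quant_fp8` is a hypothetical placeholder, not a real registered op.

```python
# Rough sketch (not the PR's code): express the unfused pattern in plain
# PyTorch ops and register a replacement that calls a fused kernel.
# `torch.ops.sgl_kernel.rmsnorm_quant_fp8` is a hypothetical placeholder.
import torch
from torch._inductor.pattern_matcher import (
    PatternMatcherPass,
    fwd_only,
    register_replacement,
)

rmsnorm_quant_patterns = PatternMatcherPass()


def rmsnorm_quant_pattern(x, weight, scale):
    # Unfused reference: RMSNorm followed by per-tensor FP8 quantization.
    var = x.pow(2).mean(dim=-1, keepdim=True)
    normed = x * torch.rsqrt(var + 1e-6) * weight
    return (normed / scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn)


def rmsnorm_quant_replacement(x, weight, scale):
    return torch.ops.sgl_kernel.rmsnorm_quant_fp8(x, weight, scale)  # placeholder


def register_rmsnorm_quant_fusion():
    example_inputs = [
        torch.randn(4, 128, device="cuda"),
        torch.randn(128, device="cuda"),
        torch.tensor(1.0, device="cuda"),
    ]
    register_replacement(
        rmsnorm_quant_pattern,
        rmsnorm_quant_replacement,
        example_inputs,
        fwd_only,
        rmsnorm_quant_patterns,
    )
```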
I have also added model_bench.py. The idea is to provide a stripped-down sglang runtime where each layer can be instantiated in isolation, helping write integration and accuracy tests for fusion passes and fused kernels (see the test sketch below).
Accuracy Tests
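As an illustration of the kind of accuracy test this enables (the helper names below are hypothetical, not model_bench.py's real API):

```python
# Hypothetical shape of an isolated-layer accuracy test built on
# model_bench.py; build_layer/compile_with_fusion_passes are assumed helpers.
import torch


def test_rmsnorm_quant_fusion_accuracy(build_layer, compile_with_fusion_passes):
    layer = build_layer("rmsnorm")  # instantiate a single layer in isolation
    x = torch.randn(8, 4096, dtype=torch.bfloat16, device="cuda")

    ref = layer(x)  # eager, unfused reference
    out = compile_with_fusion_passes(layer)(x)  # torch.compile + fusion passes

    torch.testing.assert_close(out.float(), ref.float(), rtol=1e-2, atol=1e-2)
```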
Benchmarking and Profiling
For Llama 3.1 8B FP8, BS 1, ISL 1024, OSL 1024: 6.2% gains.
Logs
MM + SiLU-and-Mul fusion
MM + SiLU-and-Mul + Quant fusion (I have a small diff to use sgl_per_tensor_quant_fp8 for the quant instead of the Triton quant kernel; I will add support for the default quant kernel before merge)
Checklist