DeepGemm integrate to sgl-kernel by laixinn · Pull Request #4165 · sgl-project/sglang

laixinn · 2025-03-07T07:04:33Z

Motivation

Integrate DeepGemm into the sgl-kernel build (setup.py).
Usage in linear layers: #4199.

Modifications

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.

zhyncs · 2025-03-07T07:18:51Z

Please fix build error https://github.com/sgl-project/sglang/actions/runs/13715681407/job/38360041673?pr=4165

HandH1998 · 2025-03-07T09:05:18Z

> Please fix build error https://github.com/sgl-project/sglang/actions/runs/13715681407/job/38360041673?pr=4165

It seems that the setuptools version in CI is too old. We built it successfully with setuptools >= 75.0.0 locally, so updating setuptools in CI may solve this issue.
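
A minimal sketch of how a build environment could check the installed setuptools against the 75.0.0 threshold mentioned above. The helper names `parse_release` and `meets_minimum` are hypothetical and not part of this PR; the parsing is deliberately simple and only compares the leading numeric components.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(version_string, width=3):
    """Return the leading numeric components of a version, padded to `width`."""
    parts = []
    for piece in version_string.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. "post1"
        parts.append(int(digits))
    while len(parts) < width:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum="75.0.0"):
    """True if `installed` is at least `minimum` (numeric components only)."""
    return parse_release(installed) >= parse_release(minimum)


if __name__ == "__main__":
    try:
        installed = version("setuptools")
    except PackageNotFoundError:
        print("setuptools is not installed")
    else:
        status = "ok" if meets_minimum(installed) else "too old"
        print(f"setuptools {installed}: {status}")
```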

FlamingoPg · 2025-03-08T12:24:37Z

Test

We fixed the DeepGEMM build with the JIT module.

Install command:

cd sgl-kernel
make build

The test script is copied from deepgemm/tests/test_core.py.

DeepGEMM test results:

Library path:
 > ['/usr/local/lib/python3.10/dist-packages/deep_gemm']

Testing GEMM:
 > Performance (m=   64, n= 2112, k= 7168):   10 us | throughput:  193 TFLOPS, 1583 GB/s
 > Performance (m=   64, n=24576, k= 1536):   13 us | throughput:  363 TFLOPS, 3083 GB/s
 > Performance (m=   64, n=32768, k=  512):   10 us | throughput:  226 TFLOPS, 2210 GB/s
 > Performance (m=   64, n= 7168, k=16384):   36 us | throughput:  423 TFLOPS, 3363 GB/s
 > Performance (m=   64, n= 4096, k= 7168):   12 us | throughput:  313 TFLOPS, 2531 GB/s
 > Performance (m=   64, n= 7168, k= 2048):    7 us | throughput:  281 TFLOPS, 2349 GB/s
 > Performance (m=  128, n= 2112, k= 7168):   11 us | throughput:  347 TFLOPS, 1488 GB/s
 > Performance (m=  128, n=24576, k= 1536):   15 us | throughput:  648 TFLOPS, 2964 GB/s
 > Performance (m=  128, n=32768, k=  512):   11 us | throughput:  383 TFLOPS, 2247 GB/s
 > Performance (m=  128, n= 7168, k=16384):   38 us | throughput:  789 TFLOPS, 3184 GB/s
 > Performance (m=  128, n= 4096, k= 7168):   13 us | throughput:  566 TFLOPS, 2361 GB/s
 > Performance (m=  128, n= 7168, k= 2048):    8 us | throughput:  481 TFLOPS, 2147 GB/s
 > Performance (m= 4096, n= 2112, k= 7168):  118 us | throughput: 1053 TFLOPS,  525 GB/s
 > Performance (m= 4096, n=24576, k= 1536):  315 us | throughput:  980 TFLOPS,  778 GB/s
 > Performance (m= 4096, n=32768, k=  512):  231 us | throughput:  595 TFLOPS, 1245 GB/s
 > Performance (m= 4096, n= 7168, k=16384):  691 us | throughput: 1392 TFLOPS,  352 GB/s
 > Performance (m= 4096, n= 4096, k= 7168):  179 us | throughput: 1343 TFLOPS,  515 GB/s
 > Performance (m= 4096, n= 7168, k= 2048):  118 us | throughput: 1016 TFLOPS,  691 GB/s

Testing grouped contiguous GEMM:
 > Performance (num_groups=4, m_per_group=8192, n=4096, k=7168): 1418 us | throughput: 1357 TFLOPS,  438 GB/s
 > Performance (num_groups=4, m_per_group=8192, n=7168, k=2048):  883 us | throughput: 1089 TFLOPS,  674 GB/s
 > Performance (num_groups=8, m_per_group=4096, n=4096, k=7168): 1427 us | throughput: 1348 TFLOPS,  517 GB/s
 > Performance (num_groups=8, m_per_group=4096, n=7168, k=2048):  884 us | throughput: 1089 TFLOPS,  740 GB/s

Testing grouped masked GEMM:
 > Performance (num_groups=1, m_per_group=1024, n=4096, k=7168):   48 us | throughput: 1261 TFLOPS,  945 GB/s
 > Performance (num_groups=1, m_per_group=1024, n=7168, k=2048):   32 us | throughput:  925 TFLOPS,  968 GB/s
 > Performance (num_groups=2, m_per_group= 512, n=4096, k=7168):   49 us | throughput: 1216 TFLOPS, 1505 GB/s
 > Performance (num_groups=2, m_per_group= 512, n=7168, k=2048):   32 us | throughput:  931 TFLOPS, 1429 GB/s
 > Performance (num_groups=4, m_per_group= 256, n=4096, k=7168):   54 us | throughput: 1105 TFLOPS, 2448 GB/s
 > Performance (num_groups=4, m_per_group= 256, n=7168, k=2048):   34 us | throughput:  878 TFLOPS, 2205 GB/s
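
For reading the numbers above, here is a rough sketch of how the throughput columns for the plain GEMM cases can be derived from the shape and the measured time. It assumes FP8 inputs (1 byte per element) and a BF16 output (2 bytes per element), which is consistent with the printed figures; the function name `gemm_throughput` is hypothetical. Because the log prints the time rounded to whole microseconds, recomputed values differ slightly from the logged ones.

```python
def gemm_throughput(m, n, k, seconds):
    """Return (TFLOPS, GB/s) for an (m x k) @ (k x n) GEMM.

    Assumption: A and B are FP8 (1 byte each), the output is BF16 (2 bytes).
    """
    flops = 2 * m * n * k                    # one multiply and one add per MAC
    bytes_moved = m * k + k * n + m * n * 2  # read A, read B, write output
    tflops = flops / seconds / 1e12
    gb_per_s = bytes_moved / seconds / 1e9
    return tflops, gb_per_s


if __name__ == "__main__":
    # First row of the GEMM table: m=64, n=2112, k=7168, about 10 us.
    tflops, gb_per_s = gemm_throughput(64, 2112, 7168, 10e-6)
    print(f"{tflops:.0f} TFLOPS, {gb_per_s:.0f} GB/s")
```

The small-m rows are memory-bound (high GB/s, modest TFLOPS), while the m=4096 rows are compute-bound, which is why the two columns move in opposite directions as m grows.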

How we build

We use setup.py to customize the DeepGEMM installation process.

Since DeepGEMM uses JIT compilation, we've integrated it as a third-party library. During the setup.py build process for our wheel package, we first build the AOT sgl-kernel, and then simply copy the DeepGEMM files into the Python package.
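
The copy step described above could look roughly like the following. This is an illustrative sketch only, not the code in sgl-kernel/setup.py: the function name `copy_jit_package` and the directory layout in the demo are assumptions, and the real build also has to handle the AOT extension and the headers the JIT needs.

```python
import shutil
import tempfile
from pathlib import Path


def copy_jit_package(source_dir, package_root, name="deep_gemm"):
    """Copy a pure-Python JIT library into the package tree.

    `source_dir` is the third-party checkout (hypothetical path) and
    `package_root` is where packages for the wheel are collected.
    """
    source = Path(source_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"third-party source not found: {source}")
    target = Path(package_root) / name
    # dirs_exist_ok lets repeated builds overwrite an earlier copy.
    shutil.copytree(source, target, dirs_exist_ok=True)
    return target


if __name__ == "__main__":
    # Demo with throwaway directories instead of a real checkout.
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "3rdparty" / "deep_gemm"
        src.mkdir(parents=True)
        (src / "__init__.py").write_text("# placeholder\n")
        out = copy_jit_package(src, Path(tmp) / "build")
        print(sorted(p.name for p in out.iterdir()))
```

In a real setup.py a function like this would typically be called from a custom build command after the compiled extension has been built.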

How to use

import deep_gemm
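
Since the import only succeeds when the wheel actually bundles DeepGEMM, calling code may want to probe for the module first. A minimal sketch, assuming nothing beyond the standard library; the helper name `has_module` is hypothetical and not something this PR adds.

```python
import importlib.util


def has_module(name="deep_gemm"):
    """Return True if a top-level module called `name` can be imported."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    if has_module():
        import deep_gemm  # noqa: F401  (available only if the wheel bundles it)
        print("deep_gemm is available")
    else:
        print("deep_gemm is not installed; use another GEMM path")
```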

FlamingoPg · 2025-03-08T12:26:24Z

cc: @sleepcoo @HandH1998 @laixinn
The PR is ready for review. cc: @zhyncs

zhyncs · 2025-03-08T20:47:06Z

We should also test the make build command and use the wheel directly.

zhyncs · 2025-03-09T07:07:21Z

Please rebase onto the latest main.

laixinn · 2025-03-09T09:41:50Z

@zhyncs Symlinks are necessary for the header files used by the JIT. The DeepGemm tests are forked into the sgl-kernel tests.

zhyncs · 2025-03-10T07:31:44Z

Thank you all! The code is functional but messy. I will work on improving it later, but for now I will merge it.

inkhare · 2025-03-10T13:04:26Z

Has anyone compared the performance improvement of Deepseek r1 after using DeepGemm?

sleepcoo · 2025-03-10T13:19:08Z

> Has anyone compared the performance improvement of Deepseek r1 after using DeepGemm?

Here #4199

lishicheng1996 · 2025-03-11T03:11:51Z

> SGL_ENABLE_JIT_DEEPGEMM

@laixinn Hi, I don't see the env variable SGL_ENABLE_JIT_DEEPGEMM in the code. Where can I find it?

laixinn · 2025-03-12T02:10:29Z

@lishicheng1996 This env variable is deprecated. We have updated the description; please refer to #4199 for the actual kernel integration.

CUHKSZzxy · 2025-03-26T10:19:44Z

> @lishicheng1996 This env variable is deprecated. We have updated the description; please refer to #4199 for the actual kernel integration.

To use DeepGEMM, do we still need SGL_ENABLE_JIT_DEEPGEMM, or is this enabled by default?

tbzhang · 2025-03-26T10:59:53Z

> > @lishicheng1996 This env variable is deprecated. We have updated the description; please refer to #4199 for the actual kernel integration.
>
> To use DeepGEMM, do we still need SGL_ENABLE_JIT_DEEPGEMM, or is this enabled by default?

DeepGEMM will be used on the Hopper architecture; see this pull request: #4613

laixinn force-pushed the jit-deep-gemm branch from e2ade20 to cef44b5 Compare March 7, 2025 07:14

zhyncs reviewed Mar 7, 2025

Comment thread python/sglang/srt/layers/quantization/fp8_kernel.py Outdated

HandH1998 force-pushed the jit-deep-gemm branch from 8086c1d to 057167d Compare March 7, 2025 08:48

shuaills force-pushed the jit-deep-gemm branch 3 times, most recently from 0db1e43 to 269b6c0 Compare March 8, 2025 04:37

sleepcoo mentioned this pull request Mar 8, 2025

linear support deepgemm #4199

Merged

laixinn marked this pull request as ready for review March 8, 2025 12:30

laixinn requested review from BBuf, HandH1998, ispobock, merrymercy and yizhang2077 as code owners March 8, 2025 12:30

laixinn changed the title ~~DeepGemm gemm_fp8_fp8_bf16_nt in JIT~~ DeepGemm integrate to sgl-kernel Mar 8, 2025

zhyncs suggested changes Mar 8, 2025

Comment thread scripts/ci_install_dependency.sh Outdated

Comment thread sgl-kernel/build.sh Outdated

Comment thread sgl-kernel/build.sh Outdated

Comment thread sgl-kernel/setup.py Outdated

zhyncs self-assigned this Mar 8, 2025

zhyncs added the high priority label Mar 8, 2025

laixinn force-pushed the jit-deep-gemm branch from dca4d9e to 951e414 Compare March 9, 2025 07:20

laixinn and others added 5 commits March 10, 2025 09:57

add unittest to block fp8 & add gemm_fp8_fp8_bf16_nt into fp8_kernel …

42eac6c

…& build DeepGemm in setup.py Co-authored-by: sleepcoo <sleepcoo@gmail.com>

code formating

d547089

update setuptools >= 75.0

40a6463

update sgl-kernel to 0.0.3.post7

1e5904a

update CI dependency

3ee72e0

sleepcoo and others added 11 commits March 10, 2025 09:58

Update pyproject.toml

18ce138

fix setup.py to hack deepgemm

7b4da22

fix deepgemm code format

c35f065

Update utils.h

dff2f68

fix typo

d056093

streamline setup.py

0003313

remove ci build debugging print

b5ac5e5

formating

9d65bf7

fork deepgemm test_jit and test_core into unit test

b311804

fix cute.h/cutlass.h missing for JIT

72a2a4b

code formating

bb58944

laixinn force-pushed the jit-deep-gemm branch from 2889657 to bb58944 Compare March 10, 2025 01:59

Merge branch 'main' into jit-deep-gemm

7c7bc28

zhyncs reviewed Mar 10, 2025

Comment thread .gitmodules Outdated

Comment thread .gitmodules Outdated

Comment thread .gitmodules Outdated

Comment thread sgl-kernel/setup.py Outdated

zhyncs added 5 commits March 10, 2025 00:23

upd

1bef425

upd

72e73b0

upd

1383750

upd

25170d0

upd

5edfa24

zhyncs approved these changes Mar 10, 2025

zhyncs merged commit c553e16 into sgl-project:main Mar 10, 2025

laixinn mentioned this pull request Mar 12, 2025

Integrate DeepGemm contiguous group gemm into Fused MoE #4343

Closed

6 tasks

sleepcoo deleted the jit-deep-gemm branch March 26, 2025 10:21

10 participants
