DeepGemm integrate to sgl-kernel #4165

Merged
zhyncs merged 30 commits into sgl-project:main from laixinn:jit-deep-gemm
Mar 10, 2025

@laixinn
Contributor

@laixinn laixinn commented Mar 7, 2025

Motivation

Integrate DeepGemm in setup.
Linear usage: #4199.

Modifications

Checklist

Comment thread python/sglang/srt/layers/quantization/fp8_kernel.py Outdated
@zhyncs
Collaborator

zhyncs commented Mar 7, 2025

@HandH1998
Collaborator

Please fix build error https://github.com/sgl-project/sglang/actions/runs/13715681407/job/38360041673?pr=4165

It seems that the version of setuptools in CI is old. We succeeded in building it locally with setuptools >= 75.0.0. Updating setuptools in CI may solve this issue.
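
For anyone who wants to confirm which setuptools a build environment has before retrying, here is a minimal sketch. The helper names `release_tuple` and `meets_minimum` are made up for illustration and are not part of this PR; the 75.0.0 threshold is only the version reported as working locally above, not a documented hard requirement.

```python
import re
from importlib import metadata


def release_tuple(version: str) -> tuple:
    """Extract the leading dotted-integer part, e.g. '75.0.0.post1' -> (75, 0, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed: str, minimum: str = "75.0.0") -> bool:
    """Compare release segments only; pre/post/dev suffixes are ignored."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '75' and '75.0.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


if __name__ == "__main__":
    try:
        installed = metadata.version("setuptools")
    except metadata.PackageNotFoundError:
        print("setuptools is not installed in this environment")
    else:
        print(f"setuptools {installed}: new enough = {meets_minimum(installed)}")
```

Running this inside the CI container would show whether the environment is older than the version that built successfully locally.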

@shuaills shuaills force-pushed the jit-deep-gemm branch 3 times, most recently from 0db1e43 to 269b6c0 Compare March 8, 2025 04:37
@sleepcoo sleepcoo mentioned this pull request Mar 8, 2025
@FlamingoPg
Collaborator

FlamingoPg commented Mar 8, 2025

Test

We fixed the DeepGEMM build with the JIT module.

  1. Install command:
     cd sgl-kernel
     make build
  2. The test script is copied from deepgemm/tests/test_core.py

Deepgemm test result below:

Library path:
 > ['/usr/local/lib/python3.10/dist-packages/deep_gemm']

Testing GEMM:
 > Performance (m=   64, n= 2112, k= 7168):   10 us | throughput:  193 TFLOPS, 1583 GB/s
 > Performance (m=   64, n=24576, k= 1536):   13 us | throughput:  363 TFLOPS, 3083 GB/s
 > Performance (m=   64, n=32768, k=  512):   10 us | throughput:  226 TFLOPS, 2210 GB/s
 > Performance (m=   64, n= 7168, k=16384):   36 us | throughput:  423 TFLOPS, 3363 GB/s
 > Performance (m=   64, n= 4096, k= 7168):   12 us | throughput:  313 TFLOPS, 2531 GB/s
 > Performance (m=   64, n= 7168, k= 2048):    7 us | throughput:  281 TFLOPS, 2349 GB/s
 > Performance (m=  128, n= 2112, k= 7168):   11 us | throughput:  347 TFLOPS, 1488 GB/s
 > Performance (m=  128, n=24576, k= 1536):   15 us | throughput:  648 TFLOPS, 2964 GB/s
 > Performance (m=  128, n=32768, k=  512):   11 us | throughput:  383 TFLOPS, 2247 GB/s
 > Performance (m=  128, n= 7168, k=16384):   38 us | throughput:  789 TFLOPS, 3184 GB/s
 > Performance (m=  128, n= 4096, k= 7168):   13 us | throughput:  566 TFLOPS, 2361 GB/s
 > Performance (m=  128, n= 7168, k= 2048):    8 us | throughput:  481 TFLOPS, 2147 GB/s
 > Performance (m= 4096, n= 2112, k= 7168):  118 us | throughput: 1053 TFLOPS,  525 GB/s
 > Performance (m= 4096, n=24576, k= 1536):  315 us | throughput:  980 TFLOPS,  778 GB/s
 > Performance (m= 4096, n=32768, k=  512):  231 us | throughput:  595 TFLOPS, 1245 GB/s
 > Performance (m= 4096, n= 7168, k=16384):  691 us | throughput: 1392 TFLOPS,  352 GB/s
 > Performance (m= 4096, n= 4096, k= 7168):  179 us | throughput: 1343 TFLOPS,  515 GB/s
 > Performance (m= 4096, n= 7168, k= 2048):  118 us | throughput: 1016 TFLOPS,  691 GB/s

Testing grouped contiguous GEMM:
 > Performance (num_groups=4, m_per_group=8192, n=4096, k=7168): 1418 us | throughput: 1357 TFLOPS,  438 GB/s
 > Performance (num_groups=4, m_per_group=8192, n=7168, k=2048):  883 us | throughput: 1089 TFLOPS,  674 GB/s
 > Performance (num_groups=8, m_per_group=4096, n=4096, k=7168): 1427 us | throughput: 1348 TFLOPS,  517 GB/s
 > Performance (num_groups=8, m_per_group=4096, n=7168, k=2048):  884 us | throughput: 1089 TFLOPS,  740 GB/s

Testing grouped masked GEMM:
 > Performance (num_groups=1, m_per_group=1024, n=4096, k=7168):   48 us | throughput: 1261 TFLOPS,  945 GB/s
 > Performance (num_groups=1, m_per_group=1024, n=7168, k=2048):   32 us | throughput:  925 TFLOPS,  968 GB/s
 > Performance (num_groups=2, m_per_group= 512, n=4096, k=7168):   49 us | throughput: 1216 TFLOPS, 1505 GB/s
 > Performance (num_groups=2, m_per_group= 512, n=7168, k=2048):   32 us | throughput:  931 TFLOPS, 1429 GB/s
 > Performance (num_groups=4, m_per_group= 256, n=4096, k=7168):   54 us | throughput: 1105 TFLOPS, 2448 GB/s
 > Performance (num_groups=4, m_per_group= 256, n=7168, k=2048):   34 us | throughput:  878 TFLOPS, 2205 GB/s
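
As a sanity check on how to read the dense GEMM rows above, the sketch below recomputes TFLOPS and GB/s from a shape and a reported time. It assumes the usual accounting of 2*m*n*k floating-point operations, one byte per FP8 input element, and two bytes per BF16 output element, with scaling factors ignored; these assumptions appear consistent with the numbers printed above but are not taken from this PR's code. The function name `gemm_throughput` is hypothetical.

```python
def gemm_throughput(m: int, n: int, k: int, seconds: float) -> tuple:
    """Return (TFLOPS, GB/s) for an (m, k) x (k, n) GEMM with FP8 inputs and BF16 output."""
    flops = 2 * m * n * k                     # one multiply and one add per inner-product term
    bytes_moved = m * k + k * n + 2 * m * n   # FP8 A + FP8 B + BF16 output
    return flops / seconds / 1e12, bytes_moved / seconds / 1e9


if __name__ == "__main__":
    # Shape and time taken from the row m=4096, n=7168, k=16384 (691 us).
    tflops, gbps = gemm_throughput(4096, 7168, 16384, 691e-6)
    print(f"{tflops:.0f} TFLOPS, {gbps:.0f} GB/s")
```

Because the reported times are rounded to whole microseconds, rows with very short runtimes (around 10 us) will only match to within a few percent.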

How we build

We use setup.py to customize the DeepGEMM installation process.

Since DeepGEMM uses JIT compilation, we've integrated it as a third-party library. During the setup.py build process for our wheel package, we first build the AOT sgl-kernel, and then simply copy the DeepGEMM files into the Python package.
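
The copy step described above can be sketched roughly as follows. This is not the code in this PR's setup.py: the function name `copy_jit_sources` and the directory layout used in the demo are assumptions made for illustration. In a real build, a helper like this would typically be called from a custom setuptools build command after the AOT extension has been built.

```python
import shutil
import tempfile
from pathlib import Path


def copy_jit_sources(src_root: Path, package_root: Path, name: str = "deep_gemm") -> Path:
    """Copy a JIT-compiled library's source tree into the built package directory."""
    src = Path(src_root) / name
    dst = Path(package_root) / name
    if not src.is_dir():
        raise FileNotFoundError(f"third-party source not found: {src}")
    # dirs_exist_ok (Python >= 3.8) lets repeated builds overwrite an earlier copy.
    shutil.copytree(
        src,
        dst,
        dirs_exist_ok=True,
        ignore=shutil.ignore_patterns("__pycache__", "*.pyc"),
    )
    return dst


if __name__ == "__main__":
    # Demo with throwaway directories standing in for the submodule and the build output.
    with tempfile.TemporaryDirectory() as tmp:
        third_party = Path(tmp) / "3rdparty"
        (third_party / "deep_gemm" / "include").mkdir(parents=True)
        (third_party / "deep_gemm" / "__init__.py").write_text("")
        (third_party / "deep_gemm" / "include" / "kernel.cuh").write_text("// header")
        out = copy_jit_sources(third_party, Path(tmp) / "build" / "lib")
        print(sorted(p.relative_to(out).as_posix() for p in out.rglob("*")))
```

Nothing is compiled at this stage; the headers and Python files only need to be present in the installed package so the JIT can find them at runtime.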

How to use

import deep_gemm
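
Since the package is copied in at build time rather than installed from its own distribution, it may be worth checking that an installed wheel actually ships it before relying on the import. A minimal sketch using only the standard library follows; the helper name `has_module` is hypothetical, and the check does not import the module or require a GPU.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    if has_module("deep_gemm"):
        print("deep_gemm is available in this environment")
    else:
        print("deep_gemm was not found; check that the wheel was built with it included")
```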

@FlamingoPg
Collaborator

cc: @sleepcoo @HandH1998 @laixinn
The PR is ready for review. cc: @zhyncs

@laixinn laixinn marked this pull request as ready for review March 8, 2025 12:30
@laixinn laixinn changed the title from "DeepGemm gemm_fp8_fp8_bf16_nt in JIT" to "DeepGemm integrate to sgl-kernel" Mar 8, 2025
Comment thread scripts/ci_install_dependency.sh Outdated
Comment thread sgl-kernel/build.sh Outdated
Comment thread sgl-kernel/build.sh Outdated
Comment thread sgl-kernel/setup.py Outdated
@zhyncs
Collaborator

zhyncs commented Mar 8, 2025

We should also test the make build command and use the wheel directly.

@zhyncs zhyncs self-assigned this Mar 8, 2025
@zhyncs
Collaborator

zhyncs commented Mar 9, 2025

Please help rebase onto the latest main.

@laixinn
Contributor Author

laixinn commented Mar 9, 2025

@zhyncs Symlinks are necessary for the header files used by the JIT. The DeepGemm tests are forked into the sgl-kernel tests.

Comment thread .gitmodules Outdated
Comment thread .gitmodules Outdated
Comment thread .gitmodules Outdated
Comment thread sgl-kernel/setup.py Outdated
@zhyncs
Collaborator

zhyncs commented Mar 10, 2025

Thank you all! The code is functional but messy. I will work on improving it later, but for now I will merge it.

@zhyncs zhyncs merged commit c553e16 into sgl-project:main Mar 10, 2025
@inkhare

inkhare commented Mar 10, 2025

Has anyone compared the performance improvement of DeepSeek R1 after using DeepGemm?

@sleepcoo
Collaborator

Has anyone compared the performance improvement of DeepSeek R1 after using DeepGemm?

Here #4199

@lishicheng1996

lishicheng1996 commented Mar 11, 2025

SGL_ENABLE_JIT_DEEPGEMM

@laixinn Hi, I don't see the env variable SGL_ENABLE_JIT_DEEPGEMM in the code. May I ask where to find it?

@laixinn
Contributor Author

laixinn commented Mar 12, 2025

@lishicheng1996 This env variable is deprecated. We have updated the description; please refer to #4199 for the actual kernel integration.

@CUHKSZzxy
Contributor

@lishicheng1996 This env variable is deprecated. We have updated the description; please refer to #4199 for the actual kernel integration.

To use DeepGEMM, do we still need SGL_ENABLE_JIT_DEEPGEMM, or is this enabled by default?

@sleepcoo sleepcoo deleted the jit-deep-gemm branch March 26, 2025 10:21
@tbzhang
Contributor

tbzhang commented Mar 26, 2025

@lishicheng1996 This env variable is deprecated. We have updated the description; please refer to #4199 for the actual kernel integration.

To use DeepGEMM, do we still need SGL_ENABLE_JIT_DEEPGEMM, or is this enabled by default?

DeepGEMM will be used on the Hopper architecture; you can check out this pull request: #4613
