DeepGemm integrate to sgl-kernel #4165
Please fix build error https://github.com/sgl-project/sglang/actions/runs/13715681407/job/38360041673?pr=4165
It seems that the version of
Test
We fixed the DeepGEMM build with the JIT module.
cd sgl-kernel
make build
DeepGEMM test results are below:
Library path:
> ['/usr/local/lib/python3.10/dist-packages/deep_gemm']
Testing GEMM:
> Performance (m= 64, n= 2112, k= 7168): 10 us | throughput: 193 TFLOPS, 1583 GB/s
> Performance (m= 64, n=24576, k= 1536): 13 us | throughput: 363 TFLOPS, 3083 GB/s
> Performance (m= 64, n=32768, k= 512): 10 us | throughput: 226 TFLOPS, 2210 GB/s
> Performance (m= 64, n= 7168, k=16384): 36 us | throughput: 423 TFLOPS, 3363 GB/s
> Performance (m= 64, n= 4096, k= 7168): 12 us | throughput: 313 TFLOPS, 2531 GB/s
> Performance (m= 64, n= 7168, k= 2048): 7 us | throughput: 281 TFLOPS, 2349 GB/s
> Performance (m= 128, n= 2112, k= 7168): 11 us | throughput: 347 TFLOPS, 1488 GB/s
> Performance (m= 128, n=24576, k= 1536): 15 us | throughput: 648 TFLOPS, 2964 GB/s
> Performance (m= 128, n=32768, k= 512): 11 us | throughput: 383 TFLOPS, 2247 GB/s
> Performance (m= 128, n= 7168, k=16384): 38 us | throughput: 789 TFLOPS, 3184 GB/s
> Performance (m= 128, n= 4096, k= 7168): 13 us | throughput: 566 TFLOPS, 2361 GB/s
> Performance (m= 128, n= 7168, k= 2048): 8 us | throughput: 481 TFLOPS, 2147 GB/s
> Performance (m= 4096, n= 2112, k= 7168): 118 us | throughput: 1053 TFLOPS, 525 GB/s
> Performance (m= 4096, n=24576, k= 1536): 315 us | throughput: 980 TFLOPS, 778 GB/s
> Performance (m= 4096, n=32768, k= 512): 231 us | throughput: 595 TFLOPS, 1245 GB/s
> Performance (m= 4096, n= 7168, k=16384): 691 us | throughput: 1392 TFLOPS, 352 GB/s
> Performance (m= 4096, n= 4096, k= 7168): 179 us | throughput: 1343 TFLOPS, 515 GB/s
> Performance (m= 4096, n= 7168, k= 2048): 118 us | throughput: 1016 TFLOPS, 691 GB/s
Testing grouped contiguous GEMM:
> Performance (num_groups=4, m_per_group=8192, n=4096, k=7168): 1418 us | throughput: 1357 TFLOPS, 438 GB/s
> Performance (num_groups=4, m_per_group=8192, n=7168, k=2048): 883 us | throughput: 1089 TFLOPS, 674 GB/s
> Performance (num_groups=8, m_per_group=4096, n=4096, k=7168): 1427 us | throughput: 1348 TFLOPS, 517 GB/s
> Performance (num_groups=8, m_per_group=4096, n=7168, k=2048): 884 us | throughput: 1089 TFLOPS, 740 GB/s
Testing grouped masked GEMM:
> Performance (num_groups=1, m_per_group=1024, n=4096, k=7168): 48 us | throughput: 1261 TFLOPS, 945 GB/s
> Performance (num_groups=1, m_per_group=1024, n=7168, k=2048): 32 us | throughput: 925 TFLOPS, 968 GB/s
> Performance (num_groups=2, m_per_group= 512, n=4096, k=7168): 49 us | throughput: 1216 TFLOPS, 1505 GB/s
> Performance (num_groups=2, m_per_group= 512, n=7168, k=2048): 32 us | throughput: 931 TFLOPS, 1429 GB/s
> Performance (num_groups=4, m_per_group= 256, n=4096, k=7168): 54 us | throughput: 1105 TFLOPS, 2448 GB/s
> Performance (num_groups=4, m_per_group= 256, n=7168, k=2048): 34 us | throughput: 878 TFLOPS, 2205 GB/s
How we build
We use setup.py to customize the DeepGEMM installation process. Because DeepGEMM uses JIT compilation, we integrate it as a third-party library. When setup.py builds our wheel package, we first build the AOT sgl-kernel and then simply copy the DeepGEMM files into the Python package.
How to use
import deep_gemm
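As a rough illustration of the copy step described under "How we build", here is a minimal Python sketch. It is not the actual setup.py from this PR: the helper name copy_python_tree and the directory layout in the demo are hypothetical, and it only assumes that the DeepGEMM sources can be copied into the build output as plain files.

```python
import shutil
import tempfile
from pathlib import Path


def copy_python_tree(src_pkg: Path, build_lib: Path) -> Path:
    """Copy a package directory (hypothetical helper) into a build output dir."""
    dest = build_lib / src_pkg.name
    shutil.copytree(
        src_pkg,
        dest,
        # Allow repeated builds to overwrite an earlier copy (Python 3.8+).
        dirs_exist_ok=True,
        # Skip bytecode caches; symlinks are followed by default, so the
        # files they point to are what ends up in the destination.
        ignore=shutil.ignore_patterns("__pycache__", "*.pyc"),
    )
    return dest


def demo() -> list[str]:
    """Build a throwaway source tree, copy it, and list the copied files."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Hypothetical layout standing in for a third-party DeepGEMM checkout.
        src = root / "third_party" / "deep_gemm"
        (src / "jit").mkdir(parents=True)
        (src / "__init__.py").write_text("# placeholder\n")
        (src / "jit" / "compiler.py").write_text("# placeholder\n")

        build_lib = root / "build" / "lib"
        build_lib.mkdir(parents=True)

        dest = copy_python_tree(src, build_lib)
        return sorted(
            p.relative_to(build_lib).as_posix()
            for p in dest.rglob("*")
            if p.is_file()
        )


if __name__ == "__main__":
    print(demo())
```

In a real setup.py, a helper like this could be called from a custom build step, for example from an overridden run() of the setuptools build_py command with self.build_lib as the destination, once the AOT extension has been built. That wiring is an assumption here and is left out so the sketch stays runnable on its own.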
cc: @sleepcoo @HandH1998 @laixinn
We should also test the
Please help rebase onto the latest main
@zhyncs Symlinks are necessary for the header files used by JIT. DeepGemm tests are forked into sgl-kernel tests.
…& build DeepGemm in setup.py Co-authored-by: sleepcoo <sleepcoo@gmail.com>
Thank you all! The code is functional but messy. I will work on improving it later, but for now I will merge it.
Has anyone compared the performance improvement of DeepSeek R1 after using DeepGemm?
Here #4199
@laixinn Hi, I don't see this env variable
@lishicheng1996 This env variable is deprecated. We have updated the description; please refer to #4199 for the actual kernel integration.
To use DeepGEMM, do we still need
DeepGEMM will be used on the Hopper architecture; you can check out this pull request: #4613
Motivation
Integrate DeepGemm in setup.
Linear usage: #4199.
Modifications
Checklist