[MUSA][2/N] sgl-kernel build by yeahdongcn · Pull Request #17053 · sgl-project/sglang

yeahdongcn · 2026-01-14T02:39:40Z

Motivation

This PR is the second in a series of pull requests (tracked in #16565) to add full support for Moore Threads GPUs, leveraging MUSA (Meta-computing Unified System Architecture) to accelerate LLM inference.

Modifications

Following the AMD approach, we add a small set of MUSA-specific files:

- pyproject_musa.toml: used later during the Docker build.
- setup_musa.py: builds the MUSA extension.
- common_extension_musa.cc: provides Python bindings for the C++ sources.
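
For readers unfamiliar with how a build script of this kind typically assembles its compiler options, here is a minimal sketch. It is not the contents of setup_musa.py: the helper name `mcc_compile_flags` and its `enable_fp8` switch are hypothetical, and the flag strings are simply copied from the `mcc` command lines in the build log under Testing Done.

```python
from typing import List


def mcc_compile_flags(arch: str = "mp_31", enable_fp8: bool = True) -> List[str]:
    """Hypothetical helper: collect mcc flags like those seen in the build log."""
    flags = [
        # General optimization and language settings.
        "-O3", "-fPIC", "-std=c++17",
        # Target architecture (mp_31 in the log).
        f"--offload-arch={arch}",
        # MUSA-specific compiler options, copied verbatim from the log.
        "-x", "musa", "-mtgpu", "-Od3",
        "-ffast-math", "-fmusa-flush-denormals-to-zero", "-fno-strict-aliasing",
        # Preprocessor defines that select the MUSA code paths.
        "-DUSE_MUSA", "-DENABLE_BF16",
        "-DFLASHINFER_ENABLE_F16", "-DFLASHINFER_ENABLE_BF16",
    ]
    if enable_fp8:
        # FP8 defines are grouped behind a switch purely for illustration.
        flags += [
            "-DENABLE_FP8", "-DFLASHINFER_ENABLE_FP8",
            "-DFLASHINFER_ENABLE_FP8_E4M3", "-DFLASHINFER_ENABLE_FP8_E5M2",
        ]
    return flags


print(" ".join(mcc_compile_flags()))
```

Keeping the flags in one function makes it easy to compare what the script intends to pass with what actually appears on the `mcc` command line.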

Testing Done

Tested in a clean torch_musa container:

root@worker3218:/ws/sgl-kernel# python setup_musa.py install
2026-01-14 10:33:33 | dist | 140527655466112 | INFO : running install
2026-01-14 10:33:33 | warnings | 140527655466112 | WARNING : /usr/local/lib/python3.10/dist-packages/setuptools/_distutils/cmd.py:66: SetuptoolsDeprecationWarning: setup.py install is deprecated.
!!

        ********************************************************************************
        Please avoid running ``setup.py`` directly.
        Instead, use pypa/build, pypa/installer or other
        standards-based tools.

        See https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html for details.
        ********************************************************************************

!!
  self.initialize_options()

2026-01-14 10:33:33 | warnings | 140527655466112 | WARNING : /usr/local/lib/python3.10/dist-packages/setuptools/_distutils/cmd.py:66: EasyInstallDeprecationWarning: easy_install command is deprecated.
!!

        ********************************************************************************
        Please avoid running ``setup.py`` and ``easy_install``.
        Instead, use pypa/build, pypa/installer or other
        standards-based tools.

        See https://github.com/pypa/setuptools/issues/917 for details.
        ********************************************************************************

!!
  self.initialize_options()

2026-01-14 10:33:33 | dist | 140527655466112 | INFO : running bdist_egg
2026-01-14 10:33:33 | dist | 140527655466112 | INFO : running egg_info
2026-01-14 10:33:33 | egg_info | 140527655466112 | INFO : writing python/sgl_kernel.egg-info/PKG-INFO
2026-01-14 10:33:33 | egg_info | 140527655466112 | INFO : writing dependency_links to python/sgl_kernel.egg-info/dependency_links.txt
2026-01-14 10:33:33 | egg_info | 140527655466112 | INFO : writing top-level names to python/sgl_kernel.egg-info/top_level.txt
2026-01-14 10:33:33 | egg_info | 140527655466112 | INFO : adding license file 'LICENSE'
2026-01-14 10:33:33 | util | 140527655466112 | INFO : writing manifest file 'python/sgl_kernel.egg-info/SOURCES.txt'
2026-01-14 10:33:33 | bdist_egg | 140527655466112 | INFO : installing library code to build/bdist.linux-x86_64/egg
2026-01-14 10:33:33 | dist | 140527655466112 | INFO : running install_lib
2026-01-14 10:33:33 | dist | 140527655466112 | INFO : running build_py
2026-01-14 10:33:33 | file_util | 140527655466112 | INFO : copying python/sgl_kernel/attention.py -> build/lib.linux-x86_64-cpython-310/sgl_kernel
2026-01-14 10:33:33 | file_util | 140527655466112 | INFO : copying python/sgl_kernel/spatial.py -> build/lib.linux-x86_64-cpython-310/sgl_kernel
2026-01-14 10:33:33 | file_util | 140527655466112 | INFO : copying python/sgl_kernel/gemm.py -> build/lib.linux-x86_64-cpython-310/sgl_kernel
2026-01-14 10:33:33 | file_util | 140527655466112 | INFO : copying python/sgl_kernel/__init__.py -> build/lib.linux-x86_64-cpython-310/sgl_kernel
2026-01-14 10:33:33 | file_util | 140527655466112 | INFO : copying python/sgl_kernel/sampling.py -> build/lib.linux-x86_64-cpython-310/sgl_kernel
2026-01-14 10:33:33 | file_util | 140527655466112 | INFO : copying python/sgl_kernel/sparse_flash_attn.py -> build/lib.linux-x86_64-cpython-310/sgl_kernel
2026-01-14 10:33:33 | file_util | 140527655466112 | INFO : copying python/sgl_kernel/cutlass_moe.py -> build/lib.linux-x86_64-cpython-310/sgl_kernel
2026-01-14 10:33:33 | file_util | 140527655466112 | INFO : copying python/sgl_kernel/test_utils.py -> build/lib.linux-x86_64-cpython-310/sgl_kernel
2026-01-14 10:33:33 | file_util | 140527655466112 | INFO : copying python/sgl_kernel/memory.py -> build/lib.linux-x86_64-cpython-310/sgl_kernel
2026-01-14 10:33:33 | file_util | 140527655466112 | INFO : copying python/sgl_kernel/fused_moe.py -> build/lib.linux-x86_64-cpython-310/sgl_kernel
2026-01-14 10:33:33 | file_util | 140527655466112 | INFO : copying python/sgl_kernel/elementwise.py -> build/lib.linux-x86_64-cpython-310/sgl_kernel
2026-01-14 10:33:33 | file_util | 140527655466112 | INFO : copying python/sgl_kernel/flash_mla.py -> build/lib.linux-x86_64-cpython-310/sgl_kernel
2026-01-14 10:33:33 | file_util | 140527655466112 | INFO : copying python/sgl_kernel/_fa4_interface.py -> build/lib.linux-x86_64-cpython-310/sgl_kernel
2026-01-14 10:33:33 | file_util | 140527655466112 | INFO : copying python/sgl_kernel/expert_specialization.py -> build/lib.linux-x86_64-cpython-310/sgl_kernel
2026-01-14 10:33:33 | file_util | 140527655466112 | INFO : copying python/sgl_kernel/marlin.py -> build/lib.linux-x86_64-cpython-310/sgl_kernel
2026-01-14 10:33:33 | file_util | 140527655466112 | INFO : copying python/sgl_kernel/flash_attn.py -> build/lib.linux-x86_64-cpython-310/sgl_kernel
2026-01-14 10:33:33 | file_util | 140527655466112 | INFO : copying python/sgl_kernel/utils.py -> build/lib.linux-x86_64-cpython-310/sgl_kernel
2026-01-14 10:33:33 | file_util | 140527655466112 | INFO : copying python/sgl_kernel/kvcacheio.py -> build/lib.linux-x86_64-cpython-310/sgl_kernel
2026-01-14 10:33:33 | file_util | 140527655466112 | INFO : copying python/sgl_kernel/hadamard.py -> build/lib.linux-x86_64-cpython-310/sgl_kernel
2026-01-14 10:33:33 | file_util | 140527655466112 | INFO : copying python/sgl_kernel/version.py -> build/lib.linux-x86_64-cpython-310/sgl_kernel
2026-01-14 10:33:33 | file_util | 140527655466112 | INFO : copying python/sgl_kernel/mamba.py -> build/lib.linux-x86_64-cpython-310/sgl_kernel
2026-01-14 10:33:33 | file_util | 140527655466112 | INFO : copying python/sgl_kernel/top_k.py -> build/lib.linux-x86_64-cpython-310/sgl_kernel
2026-01-14 10:33:33 | file_util | 140527655466112 | INFO : copying python/sgl_kernel/moe.py -> build/lib.linux-x86_64-cpython-310/sgl_kernel
2026-01-14 10:33:33 | file_util | 140527655466112 | INFO : copying python/sgl_kernel/allreduce.py -> build/lib.linux-x86_64-cpython-310/sgl_kernel
2026-01-14 10:33:33 | file_util | 140527655466112 | INFO : copying python/sgl_kernel/load_utils.py -> build/lib.linux-x86_64-cpython-310/sgl_kernel
2026-01-14 10:33:33 | file_util | 140527655466112 | INFO : copying python/sgl_kernel/grammar.py -> build/lib.linux-x86_64-cpython-310/sgl_kernel
2026-01-14 10:33:33 | file_util | 140527655466112 | INFO : copying python/sgl_kernel/speculative.py -> build/lib.linux-x86_64-cpython-310/sgl_kernel
2026-01-14 10:33:33 | file_util | 140527655466112 | INFO : copying python/sgl_kernel/scalar_type.py -> build/lib.linux-x86_64-cpython-310/sgl_kernel
2026-01-14 10:33:33 | file_util | 140527655466112 | INFO : copying python/sgl_kernel/testing/rotary_embedding.py -> build/lib.linux-x86_64-cpython-310/sgl_kernel/testing
2026-01-14 10:33:33 | file_util | 140527655466112 | INFO : copying python/sgl_kernel/testing/__init__.py -> build/lib.linux-x86_64-cpython-310/sgl_kernel/testing
2026-01-14 10:33:33 | file_util | 140527655466112 | INFO : copying python/sgl_kernel/quantization/__init__.py -> build/lib.linux-x86_64-cpython-310/sgl_kernel/quantization
2026-01-14 10:33:33 | file_util | 140527655466112 | INFO : copying python/sgl_kernel/quantization/gguf.py -> build/lib.linux-x86_64-cpython-310/sgl_kernel/quantization
2026-01-14 10:33:33 | dist | 140527655466112 | INFO : running build_ext
Cloning third-party repositories...
Fetching origin
HEAD is now at 3abd6a72 update minimum compiler version
Fetching origin
HEAD is now at bc29697b ci: collect module status and update flashinfer-cli (#1676)
Third-party repositories ready.
Emitting ninja build file /ws/sgl-kernel/build/temp.linux-x86_64-cpython-310/build.ninja...
Compiling objects...
Using envvar MAX_JOBS (128) as the number of workers...
[1/4] /usr/local/musa/bin/mcc -MD -MF /ws/sgl-kernel/build/temp.linux-x86_64-cpython-310/ws/sgl-kernel/third_party/flashinfer/csrc_musa/norm.o.d -I/ws/sgl-kernel/include_musa -I/ws/sgl-kernel/include -I/ws/sgl-kernel/include/impl -I/ws/sgl-kernel/csrc_musa -I/ws/sgl-kernel/csrc -I/ws/sgl-kernel/third_party/flashinfer/include_musa -I/ws/sgl-kernel/third_party/flashinfer/include -I/ws/sgl-kernel/third_party/flashinfer/csrc_musa -I/ws/sgl-kernel/third_party/flashinfer/csrc -I/ws/sgl-kernel/third_party/mutlass/include_musa -I/ws/sgl-kernel/third_party/mutlass/include -I/usr/local/musa/include -I/usr/local/lib/python3.10/dist-packages/torch_musa/share/generated_cuda_compatible/aten/src -I/usr/local/lib/python3.10/dist-packages/torch_musa/share/generated_cuda_compatible/include -I/usr/local/lib/python3.10/dist-packages/torch_musa/share/generated_cuda_compatible/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch_musa/share/torch_musa_codegen -I/usr/local/lib/python3.10/dist-packages -I/usr/local/musa/include -I/usr/include/python3.10 -c -c /ws/sgl-kernel/third_party/flashinfer/csrc_musa/norm.mu -o /ws/sgl-kernel/build/temp.linux-x86_64-cpython-310/ws/sgl-kernel/third_party/flashinfer/csrc_musa/norm.o -fPIC -DNDEBUG -DOPERATOR_NAMESPACE=sgl_kernel -O3 -fPIC -std=c++17 --cuda-gpu-arch=mp_31 -x musa -mtgpu -Od3 -ffast-math -fmusa-flush-denormals-to-zero -fno-strict-aliasing -DUSE_MUSA -DENABLE_BF16 -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DENABLE_FP8 -DFLASHINFER_ENABLE_FP8 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 --offload-arch=mp_31 -march=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=common_ops -D_GLIBCXX_USE_CXX11_ABI=1
[2/4] /usr/local/musa/bin/mcc -MD -MF /ws/sgl-kernel/build/temp.linux-x86_64-cpython-310/ws/sgl-kernel/third_party/flashinfer/csrc_musa/renorm.o.d -I/ws/sgl-kernel/include_musa -I/ws/sgl-kernel/include -I/ws/sgl-kernel/include/impl -I/ws/sgl-kernel/csrc_musa -I/ws/sgl-kernel/csrc -I/ws/sgl-kernel/third_party/flashinfer/include_musa -I/ws/sgl-kernel/third_party/flashinfer/include -I/ws/sgl-kernel/third_party/flashinfer/csrc_musa -I/ws/sgl-kernel/third_party/flashinfer/csrc -I/ws/sgl-kernel/third_party/mutlass/include_musa -I/ws/sgl-kernel/third_party/mutlass/include -I/usr/local/musa/include -I/usr/local/lib/python3.10/dist-packages/torch_musa/share/generated_cuda_compatible/aten/src -I/usr/local/lib/python3.10/dist-packages/torch_musa/share/generated_cuda_compatible/include -I/usr/local/lib/python3.10/dist-packages/torch_musa/share/generated_cuda_compatible/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch_musa/share/torch_musa_codegen -I/usr/local/lib/python3.10/dist-packages -I/usr/local/musa/include -I/usr/include/python3.10 -c -c /ws/sgl-kernel/third_party/flashinfer/csrc_musa/renorm.mu -o /ws/sgl-kernel/build/temp.linux-x86_64-cpython-310/ws/sgl-kernel/third_party/flashinfer/csrc_musa/renorm.o -fPIC -DNDEBUG -DOPERATOR_NAMESPACE=sgl_kernel -O3 -fPIC -std=c++17 --cuda-gpu-arch=mp_31 -x musa -mtgpu -Od3 -ffast-math -fmusa-flush-denormals-to-zero -fno-strict-aliasing -DUSE_MUSA -DENABLE_BF16 -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DENABLE_FP8 -DFLASHINFER_ENABLE_FP8 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 --offload-arch=mp_31 -march=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=common_ops -D_GLIBCXX_USE_CXX11_ABI=1
[3/4] /usr/local/musa/bin/mcc -x musa -MMD -MF /ws/sgl-kernel/build/temp.linux-x86_64-cpython-310/ws/sgl-kernel/csrc_musa/common_extension_musa.o.d -I/ws/sgl-kernel/include_musa -I/ws/sgl-kernel/include -I/ws/sgl-kernel/include/impl -I/ws/sgl-kernel/csrc_musa -I/ws/sgl-kernel/csrc -I/ws/sgl-kernel/third_party/flashinfer/include_musa -I/ws/sgl-kernel/third_party/flashinfer/include -I/ws/sgl-kernel/third_party/flashinfer/csrc_musa -I/ws/sgl-kernel/third_party/flashinfer/csrc -I/ws/sgl-kernel/third_party/mutlass/include_musa -I/ws/sgl-kernel/third_party/mutlass/include -I/usr/local/musa/include -I/usr/local/lib/python3.10/dist-packages/torch_musa/share/generated_cuda_compatible/aten/src -I/usr/local/lib/python3.10/dist-packages/torch_musa/share/generated_cuda_compatible/include -I/usr/local/lib/python3.10/dist-packages/torch_musa/share/generated_cuda_compatible/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch_musa/share/torch_musa_codegen -I/usr/local/lib/python3.10/dist-packages -I/usr/local/musa/include -I/usr/include/python3.10 -c -c /ws/sgl-kernel/csrc_musa/common_extension_musa.cc -o /ws/sgl-kernel/build/temp.linux-x86_64-cpython-310/ws/sgl-kernel/csrc_musa/common_extension_musa.o -fPIC -DNDEBUG -DOPERATOR_NAMESPACE=sgl_kernel -O3 -fPIC -std=c++17 --cuda-gpu-arch=mp_31 -x musa -mtgpu -Od3 -ffast-math -fmusa-flush-denormals-to-zero -fno-strict-aliasing -DUSE_MUSA -DENABLE_BF16 -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DENABLE_FP8 -DFLASHINFER_ENABLE_FP8 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 --offload-arch=mp_31 -march=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=common_ops -D_GLIBCXX_USE_CXX11_ABI=1
[4/4] /usr/local/musa/bin/mcc -MD -MF /ws/sgl-kernel/build/temp.linux-x86_64-cpython-310/ws/sgl-kernel/third_party/flashinfer/csrc_musa/sampling.o.d -I/ws/sgl-kernel/include_musa -I/ws/sgl-kernel/include -I/ws/sgl-kernel/include/impl -I/ws/sgl-kernel/csrc_musa -I/ws/sgl-kernel/csrc -I/ws/sgl-kernel/third_party/flashinfer/include_musa -I/ws/sgl-kernel/third_party/flashinfer/include -I/ws/sgl-kernel/third_party/flashinfer/csrc_musa -I/ws/sgl-kernel/third_party/flashinfer/csrc -I/ws/sgl-kernel/third_party/mutlass/include_musa -I/ws/sgl-kernel/third_party/mutlass/include -I/usr/local/musa/include -I/usr/local/lib/python3.10/dist-packages/torch_musa/share/generated_cuda_compatible/aten/src -I/usr/local/lib/python3.10/dist-packages/torch_musa/share/generated_cuda_compatible/include -I/usr/local/lib/python3.10/dist-packages/torch_musa/share/generated_cuda_compatible/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch_musa/share/torch_musa_codegen -I/usr/local/lib/python3.10/dist-packages -I/usr/local/musa/include -I/usr/include/python3.10 -c -c /ws/sgl-kernel/third_party/flashinfer/csrc_musa/sampling.mu -o /ws/sgl-kernel/build/temp.linux-x86_64-cpython-310/ws/sgl-kernel/third_party/flashinfer/csrc_musa/sampling.o -fPIC -DNDEBUG -DOPERATOR_NAMESPACE=sgl_kernel -O3 -fPIC -std=c++17 --cuda-gpu-arch=mp_31 -x musa -mtgpu -Od3 -ffast-math -fmusa-flush-denormals-to-zero -fno-strict-aliasing -DUSE_MUSA -DENABLE_BF16 -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DENABLE_FP8 -DFLASHINFER_ENABLE_FP8 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 --offload-arch=mp_31 -march=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=common_ops -D_GLIBCXX_USE_CXX11_ABI=1
2026-01-14 10:35:12 | easy_install | 140527655466112 | INFO : Adding sgl-kernel 0.3.20 to easy-install.pth file
2026-01-14 10:35:12 | easy_install | 140527655466112 | INFO : 
Installed /usr/local/lib/python3.10/dist-packages/sgl_kernel-0.3.20-py3.10-linux-x86_64.egg
2026-01-14 10:35:12 | easy_install | 140527655466112 | INFO : Processing dependencies for sgl-kernel==0.3.20
2026-01-14 10:35:12 | easy_install | 140527655466112 | INFO : Finished processing dependencies for sgl-kernel==0.3.20
root@worker3218:/ws/sgl-kernel#
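
The log above ends with the egg being installed, but it does not show the package being used afterwards. A quick follow-up check might look like the sketch below. The helper name `is_installed` is hypothetical; it only confirms that Python can locate the `sgl_kernel` package and does not load the compiled extension or touch the GPU.

```python
import importlib.util


def is_installed(module_name: str = "sgl_kernel") -> bool:
    """Return True if a top-level module can be located on the current Python path."""
    return importlib.util.find_spec(module_name) is not None


# Reports whether the freshly installed package is discoverable.
print(f"sgl_kernel importable: {is_installed()}")
```

Because `find_spec` only locates the package, a passing check should still be followed by an actual `import sgl_kernel` inside the torch_musa container to confirm the extension loads.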

Accuracy Tests

Benchmarking and Profiling

Checklist

Format your code according to the "Format code with pre-commit" guide.
Add unit tests according to the "Run and add unit tests" guide.
Update documentation according to the "Write documentations" guide.
Provide accuracy and speed benchmark results according to the "Test the accuracy" and "Benchmark the speed" guides.
Follow the SGLang code style guidance.

Review Process

Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
After green CI and required approvals, ask Merge Oncalls to merge.

gemini-code-assist · 2026-01-14T02:39:44Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

sglang-bot · 2026-01-15T08:31:15Z

/tag-and-rerun-ci

yeahdongcn requested review from BBuf, FlamingoPg, HaiShaw, ispobock, merrymercy, yizhang2077 and zhyncs as code owners January 14, 2026 02:39

github-actions Bot added the documentation, dependencies, and sgl-kernel labels Jan 14, 2026

yeahdongcn added the mthreads label Jan 14, 2026

yeahdongcn mentioned this pull request Jan 14, 2026

[Roadmap][Feature] Support Moore Threads (MUSA) GPU #16565

Open

2 tasks

[MUSA][2/N] sgl-kernel build

5267e6a

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

yeahdongcn force-pushed the xd/musa_sgl_kernel branch from 56a7eb1 to 5267e6a January 14, 2026 02:43

sglang-bot approved these changes Jan 15, 2026

github-actions Bot added the run-ci label Jan 15, 2026

Kangyan-Zhou merged commit 628ab5d into sgl-project:main Jan 23, 2026
133 of 145 checks passed

hnyls2002 reviewed Jan 31, 2026

Comment thread .gitignore

Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026

[MUSA][2/N] sgl-kernel build (sgl-project#17053)

77fae0d

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

Kangyan-Zhou merged 1 commit into sgl-project:main from yeahdongcn:xd/musa_sgl_kernel
