[Kernel Slimming] Migrate AWQ marlin repack kernel to JIT by celve · Pull Request #18949 · sgl-project/sglang

celve · 2026-02-18T01:34:09Z

Motivation

See #17865

Modifications

New files:

python/sglang/jit_kernel/csrc/gemm/marlin/awq_marlin_repack.cuh — JIT-compiled CUDA kernel ported from sgl-kernel/csrc/gemm/marlin/awq_marlin_repack.cu
python/sglang/jit_kernel/awq_marlin_repack.py — Python wrapper with JIT loading and output tensor allocation
python/sglang/jit_kernel/tests/test_awq_marlin_repack.py
python/sglang/jit_kernel/benchmark/bench_awq_marlin_repack.py

Modified files:

python/sglang/srt/layers/quantization/awq.py — Switch awq_marlin_repack import from sgl_kernel to sglang.jit_kernel

Accuracy Tests

Pass all tests defined in python/sglang/jit_kernel/tests/test_awq_marlin_repack.py

Benchmarking and Profiling

awq-marlin-repack-performance:
   size_k  JIT Kernel  AOT Kernel
0   512.0   12.126000   12.121699
1  1024.0   12.187091   15.707735
2  2048.0   23.288894   30.660015
3  4096.0   51.975607   58.915189
4  8192.0  103.436129  103.417934
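
The figures above can be summarized as a per-size ratio. The following is a minimal sketch that copies the numbers from the table. The PR does not state the unit, so the code only assumes that both columns share the same unit and that a smaller figure is better.

```python
# Benchmark figures copied from the table above: size_k -> (JIT, AOT).
# Assumption: both columns use the same (unstated) unit, lower is better.
results = {
    512: (12.126000, 12.121699),
    1024: (12.187091, 15.707735),
    2048: (23.288894, 30.660015),
    4096: (51.975607, 58.915189),
    8192: (103.436129, 103.417934),
}

# A ratio above 1 means the JIT figure is smaller than the AOT figure.
ratios = {size_k: aot / jit for size_k, (jit, aot) in results.items()}

for size_k, ratio in ratios.items():
    print(f"size_k={size_k:>5}  AOT/JIT = {ratio:.3f}")
```

A ratio close to 1 indicates that the two builds are effectively tied for that size.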

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review Process

Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
After green CI and required approvals, ask Merge Oncalls to merge.

gemini-code-assist · 2026-02-18T01:34:26Z

Summary of Changes

Hello @celve, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the awq_marlin_repack kernel by moving its implementation to a Just-In-Time (JIT) compilation system. This change aims to improve the adaptability and efficiency of the kernel by allowing it to be compiled dynamically. The migration includes the creation of necessary infrastructure for JIT compilation, along with comprehensive testing and performance evaluation to validate the new approach.

Highlights

JIT Kernel Migration: The awq_marlin_repack CUDA kernel has been migrated to a Just-In-Time (JIT) compilation framework, enhancing flexibility and potential performance.
New JIT Kernel Components: New Python wrapper, CUDA kernel implementation, unit tests, and benchmarking scripts have been introduced specifically for the JIT-compiled awq_marlin_repack.
Integration Update: The awq.py module has been updated to import and utilize the newly JIT-compiled awq_marlin_repack kernel, ensuring seamless integration.
Performance Benchmarking: In the posted benchmark, the JIT and AOT kernels report near-identical figures at size_k = 512 and 8192, and the JIT kernel reports lower figures at size_k = 1024, 2048, and 4096.

Changelog

python/sglang/jit_kernel/awq_marlin_repack.py
- Added a Python wrapper for the JIT-compiled awq_marlin_repack CUDA kernel.
python/sglang/jit_kernel/benchmark/bench_awq_marlin_repack.py
- Added a benchmarking script to compare the performance of the JIT-compiled kernel against the AOT-compiled version.
python/sglang/jit_kernel/csrc/gemm/marlin/awq_marlin_repack.cuh
- Added the JIT-compiled CUDA kernel for AWQ Marlin repack operations.
python/sglang/jit_kernel/tests/test_awq_marlin_repack.py
- Added unit tests to verify the correctness of the JIT-compiled awq_marlin_repack kernel.
python/sglang/srt/layers/quantization/awq.py
- Updated the import statement to use the JIT-compiled awq_marlin_repack from sglang.jit_kernel.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

celve · 2026-02-18T01:34:40Z

JIT

100%|██████████████████████████████████████| 14042/14042 [04:56<00:00, 47.41it/s]
subject: abstract_algebra, #q:100, acc: 0.560
subject: anatomy, #q:135, acc: 0.696
subject: astronomy, #q:152, acc: 0.855
subject: business_ethics, #q:100, acc: 0.780
subject: clinical_knowledge, #q:265, acc: 0.777
subject: college_biology, #q:144, acc: 0.847
subject: college_chemistry, #q:100, acc: 0.530
subject: college_computer_science, #q:100, acc: 0.660
subject: college_mathematics, #q:100, acc: 0.510
subject: college_medicine, #q:173, acc: 0.728
subject: college_physics, #q:102, acc: 0.510
subject: computer_security, #q:100, acc: 0.800
subject: conceptual_physics, #q:235, acc: 0.732
subject: econometrics, #q:114, acc: 0.658
subject: electrical_engineering, #q:145, acc: 0.745
subject: elementary_mathematics, #q:378, acc: 0.714
subject: formal_logic, #q:126, acc: 0.556
subject: global_facts, #q:100, acc: 0.430
subject: high_school_biology, #q:310, acc: 0.861
subject: high_school_chemistry, #q:203, acc: 0.660
subject: high_school_computer_science, #q:100, acc: 0.860
subject: high_school_european_history, #q:165, acc: 0.842
subject: high_school_geography, #q:198, acc: 0.909
subject: high_school_government_and_politics, #q:193, acc: 0.917
subject: high_school_macroeconomics, #q:390, acc: 0.779
subject: high_school_mathematics, #q:270, acc: 0.559
subject: high_school_microeconomics, #q:238, acc: 0.887
subject: high_school_physics, #q:151, acc: 0.556
subject: high_school_psychology, #q:545, acc: 0.903
subject: high_school_statistics, #q:216, acc: 0.727
subject: high_school_us_history, #q:204, acc: 0.873
subject: high_school_world_history, #q:237, acc: 0.861
subject: human_aging, #q:223, acc: 0.735
subject: human_sexuality, #q:131, acc: 0.794
subject: international_law, #q:121, acc: 0.843
subject: jurisprudence, #q:108, acc: 0.815
subject: logical_fallacies, #q:163, acc: 0.791
subject: machine_learning, #q:112, acc: 0.536
subject: management, #q:103, acc: 0.854
subject: marketing, #q:234, acc: 0.927
subject: medical_genetics, #q:100, acc: 0.830
subject: miscellaneous, #q:783, acc: 0.847
subject: moral_disputes, #q:346, acc: 0.786
subject: moral_scenarios, #q:895, acc: 0.611
subject: nutrition, #q:306, acc: 0.804
subject: philosophy, #q:311, acc: 0.778
subject: prehistory, #q:324, acc: 0.846
subject: professional_accounting, #q:282, acc: 0.578
subject: professional_law, #q:1534, acc: 0.503
subject: professional_medicine, #q:272, acc: 0.772
subject: professional_psychology, #q:612, acc: 0.771
subject: public_relations, #q:110, acc: 0.736
subject: security_studies, #q:245, acc: 0.804
subject: sociology, #q:201, acc: 0.876
subject: us_foreign_policy, #q:100, acc: 0.920
subject: virology, #q:166, acc: 0.542
subject: world_religions, #q:171, acc: 0.883
Total latency: 295.768
Average accuracy: 0.733

AOT

100%|██████████████████████████████████████| 14042/14042 [04:54<00:00, 47.67it/s]
subject: abstract_algebra, #q:100, acc: 0.560
subject: anatomy, #q:135, acc: 0.704
subject: astronomy, #q:152, acc: 0.855
subject: business_ethics, #q:100, acc: 0.790
subject: clinical_knowledge, #q:265, acc: 0.777
subject: college_biology, #q:144, acc: 0.847
subject: college_chemistry, #q:100, acc: 0.530
subject: college_computer_science, #q:100, acc: 0.650
subject: college_mathematics, #q:100, acc: 0.510
subject: college_medicine, #q:173, acc: 0.728
subject: college_physics, #q:102, acc: 0.510
subject: computer_security, #q:100, acc: 0.800
subject: conceptual_physics, #q:235, acc: 0.732
subject: econometrics, #q:114, acc: 0.649
subject: electrical_engineering, #q:145, acc: 0.745
subject: elementary_mathematics, #q:378, acc: 0.714
subject: formal_logic, #q:126, acc: 0.548
subject: global_facts, #q:100, acc: 0.440
subject: high_school_biology, #q:310, acc: 0.861
subject: high_school_chemistry, #q:203, acc: 0.655
subject: high_school_computer_science, #q:100, acc: 0.860
subject: high_school_european_history, #q:165, acc: 0.842
subject: high_school_geography, #q:198, acc: 0.914
subject: high_school_government_and_politics, #q:193, acc: 0.917
subject: high_school_macroeconomics, #q:390, acc: 0.782
subject: high_school_mathematics, #q:270, acc: 0.563
subject: high_school_microeconomics, #q:238, acc: 0.887
subject: high_school_physics, #q:151, acc: 0.556
subject: high_school_psychology, #q:545, acc: 0.903
subject: high_school_statistics, #q:216, acc: 0.727
subject: high_school_us_history, #q:204, acc: 0.873
subject: high_school_world_history, #q:237, acc: 0.861
subject: human_aging, #q:223, acc: 0.731
subject: human_sexuality, #q:131, acc: 0.794
subject: international_law, #q:121, acc: 0.843
subject: jurisprudence, #q:108, acc: 0.815
subject: logical_fallacies, #q:163, acc: 0.791
subject: machine_learning, #q:112, acc: 0.545
subject: management, #q:103, acc: 0.854
subject: marketing, #q:234, acc: 0.927
subject: medical_genetics, #q:100, acc: 0.830
subject: miscellaneous, #q:783, acc: 0.847
subject: moral_disputes, #q:346, acc: 0.783
subject: moral_scenarios, #q:895, acc: 0.610
subject: nutrition, #q:306, acc: 0.807
subject: philosophy, #q:311, acc: 0.778
subject: prehistory, #q:324, acc: 0.846
subject: professional_accounting, #q:282, acc: 0.582
subject: professional_law, #q:1534, acc: 0.502
subject: professional_medicine, #q:272, acc: 0.768
subject: professional_psychology, #q:612, acc: 0.771
subject: public_relations, #q:110, acc: 0.736
subject: security_studies, #q:245, acc: 0.804
subject: sociology, #q:201, acc: 0.876
subject: us_foreign_policy, #q:100, acc: 0.920
subject: virology, #q:166, acc: 0.542
subject: world_religions, #q:171, acc: 0.883
Total latency: 294.586
Average accuracy: 0.733
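
To compare two runs like the JIT and AOT logs above subject by subject, the log lines can be parsed and diffed. This is a sketch only: `parse_mmlu_log` and `largest_gap` are hypothetical helper names, not part of the benchmark script, and the sample input is a three-line excerpt of each log.

```python
import re

# Matches lines such as "subject: anatomy, #q:135, acc: 0.696".
LINE_RE = re.compile(r"subject: (\w+), #q:(\d+), acc: ([0-9.]+)")


def parse_mmlu_log(text):
    """Return {subject: (num_questions, accuracy)} for every matching line."""
    parsed = {}
    for match in LINE_RE.finditer(text):
        subject, num_q, acc = match.groups()
        parsed[subject] = (int(num_q), float(acc))
    return parsed


def largest_gap(run_a, run_b):
    """Return (subject, absolute accuracy difference) for the largest gap."""
    shared = run_a.keys() & run_b.keys()
    return max(
        ((s, abs(run_a[s][1] - run_b[s][1])) for s in shared),
        key=lambda item: item[1],
    )


# Three-line excerpts of the JIT and AOT logs above.
jit_log = """
subject: anatomy, #q:135, acc: 0.696
subject: business_ethics, #q:100, acc: 0.780
subject: astronomy, #q:152, acc: 0.855
"""
aot_log = """
subject: anatomy, #q:135, acc: 0.704
subject: business_ethics, #q:100, acc: 0.790
subject: astronomy, #q:152, acc: 0.855
"""

jit, aot = parse_mmlu_log(jit_log), parse_mmlu_log(aot_log)
subject, gap = largest_gap(jit, aot)
print(f"largest per-subject gap in this excerpt: {subject} ({gap:.3f})")
```

Feeding the full logs into the same functions gives the per-subject spread behind the identical 0.733 averages.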

gemini-code-assist

Code Review

This pull request successfully migrates the AWQ marlin repack kernel to a JIT-compiled version, which is a great step for kernel slimming. The implementation is solid, including the new Python wrapper, CUDA kernel, tests, and benchmarks. I've identified a few minor areas for improvement regarding code duplication in test/benchmark files, a magic number that could be a constant, and some unreachable code in the CUDA host wrapper. Addressing these points will enhance the code's maintainability and clarity. Overall, this is a well-executed migration.

gemini-code-assist · 2026-02-18T01:35:44Z

+    size_n: int,
+    num_bits: int,
+) -> torch.Tensor:
+    tile_size = 16


The tile_size is hardcoded as a magic number. It would be better to define it as a module-level constant (e.g., MARLIN_TILE_SIZE = 16) to improve readability and maintainability, as this value is tied to the kernel's implementation details.
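
A minimal sketch of the suggestion follows. `MARLIN_TILE_SIZE` is the name proposed in the comment above; `repack_output_shape` is a hypothetical helper, and the output layout it computes (32-bit words, `size_k / tile` rows by `size_n * tile / pack_factor` columns) is an assumption about the wrapper's allocation, not something shown in this diff hunk.

```python
# Module-level constant replacing the inline `tile_size = 16`.
MARLIN_TILE_SIZE = 16


def repack_output_shape(size_k, size_n, num_bits):
    """Hypothetical helper: shape of the repacked tensor.

    Assumes quantized values are packed into 32-bit words and that the
    kernel emits one row per tile of MARLIN_TILE_SIZE input rows.
    """
    if num_bits not in (4, 8):
        raise ValueError(f"num_bits must be 4 or 8, got {num_bits}")
    if size_k % MARLIN_TILE_SIZE != 0:
        raise ValueError(
            f"size_k = {size_k} is not divisible by {MARLIN_TILE_SIZE}"
        )
    pack_factor = 32 // num_bits  # values stored per 32-bit word
    return (
        size_k // MARLIN_TILE_SIZE,
        size_n * MARLIN_TILE_SIZE // pack_factor,
    )


print(repack_output_shape(512, 256, 4))
```

Keeping the constant at module level lets the shape computation and any divisibility checks refer to one name instead of repeating the literal.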

gemini-code-assist · 2026-02-18T01:35:44Z

+def awq_pack(q_w, num_bits, size_k, size_n):
+    if num_bits == 4:
+        interleave = np.array([0, 2, 4, 6, 1, 3, 5, 7])
+    elif num_bits == 8:
+        interleave = np.array([0, 2, 1, 3])
+    else:
+        raise Exception("num_bits must be 4 or 8, got {}".format(num_bits))
+
+    q_w = q_w.reshape((-1, len(interleave)))[:, interleave].ravel()
+    q_w = q_w.reshape((-1, size_n)).contiguous()
+    return pack_cols(q_w, num_bits, size_k, size_n)


This awq_pack function is duplicated in python/sglang/jit_kernel/tests/test_awq_marlin_repack.py. To avoid code duplication and improve maintainability, consider moving this function to a shared test utility module.

Additionally, it's better to raise a more specific exception like ValueError instead of a generic Exception.

Suggested change

 def awq_pack(q_w, num_bits, size_k, size_n):
     if num_bits == 4:
         interleave = np.array([0, 2, 4, 6, 1, 3, 5, 7])
     elif num_bits == 8:
         interleave = np.array([0, 2, 1, 3])
     else:
-        raise Exception("num_bits must be 4 or 8, got {}".format(num_bits))
+        raise ValueError(f"num_bits must be 4 or 8, got {num_bits}")

     q_w = q_w.reshape((-1, len(interleave)))[:, interleave].ravel()
     q_w = q_w.reshape((-1, size_n)).contiguous()
     return pack_cols(q_w, num_bits, size_k, size_n)
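
One way the shared helper could look is sketched below, using NumPy only so that it runs without a GPU. The `pack_cols` shown here is an assumed stand-in (it packs `32 // num_bits` adjacent columns into one 32-bit word, lowest bits first); the real `pack_cols` used by the test and benchmark files is not shown in this PR excerpt and may differ.

```python
import numpy as np


def pack_cols(q_w, num_bits, size_k, size_n):
    """Assumed stand-in: pack adjacent columns into uint32 words."""
    assert q_w.shape == (size_k, size_n)
    pack_factor = 32 // num_bits
    assert size_n % pack_factor == 0
    q_w = q_w.astype(np.uint32)
    packed = np.zeros((size_k, size_n // pack_factor), dtype=np.uint32)
    for i in range(pack_factor):
        # Column i of every group lands in bits [num_bits*i, num_bits*(i+1)).
        packed |= q_w[:, i::pack_factor] << np.uint32(num_bits * i)
    return packed


def awq_pack(q_w, num_bits, size_k, size_n):
    """Apply the AWQ interleave, then pack columns (NumPy variant)."""
    assert q_w.shape == (size_k, size_n)
    if num_bits == 4:
        interleave = np.array([0, 2, 4, 6, 1, 3, 5, 7])
    elif num_bits == 8:
        interleave = np.array([0, 2, 1, 3])
    else:
        raise ValueError(f"num_bits must be 4 or 8, got {num_bits}")

    q_w = q_w.reshape((-1, len(interleave)))[:, interleave].ravel()
    q_w = q_w.reshape((-1, size_n))
    return pack_cols(q_w, num_bits, size_k, size_n)


# One row of eight 4-bit values packs into a single 32-bit word.
packed = awq_pack(np.arange(8).reshape(1, 8), num_bits=4, size_k=1, size_n=8)
print(hex(int(packed[0, 0])))
```

With a single definition in a shared module, both the test file and the benchmark file would import it instead of carrying their own copies.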

gemini-code-assist · 2026-02-18T01:35:44Z

+  } else {
+    RuntimeCheck(false, "Unsupported repack config: num_bits = ", num_bits);
+  }


The RuntimeCheck at line 198 already ensures that num_bits is either 4 or 8. Therefore, this else block is unreachable and can be removed for code clarity.

gemini-code-assist · 2026-02-18T01:35:44Z

+def awq_pack(
+    q_w: torch.Tensor,
+    num_bits: int,
+    size_k: int,
+    size_n: int,
+):
+    assert q_w.shape == (size_k, size_n)
+
+    if num_bits == 4:
+        interleave = np.array([0, 2, 4, 6, 1, 3, 5, 7])
+    elif num_bits == 8:
+        interleave = np.array([0, 2, 1, 3])
+    else:
+        raise Exception("num_bits must be 4 or 8, got {}".format(num_bits))
+
+    q_w = q_w.reshape((-1, len(interleave)))[:, interleave].ravel()
+    q_w = q_w.reshape((-1, size_n)).contiguous()
+
+    return pack_cols(q_w, num_bits, size_k, size_n)


This awq_pack function is duplicated in python/sglang/jit_kernel/benchmark/bench_awq_marlin_repack.py. To improve maintainability, it would be best to extract it into a shared test utility file.

Also, it's good practice to raise a more specific ValueError instead of a generic Exception.

Suggested change

 def awq_pack(
     q_w: torch.Tensor,
     num_bits: int,
     size_k: int,
     size_n: int,
 ):
     assert q_w.shape == (size_k, size_n)

     if num_bits == 4:
         interleave = np.array([0, 2, 4, 6, 1, 3, 5, 7])
     elif num_bits == 8:
         interleave = np.array([0, 2, 1, 3])
     else:
-        raise Exception("num_bits must be 4 or 8, got {}".format(num_bits))
+        raise ValueError(f"num_bits must be 4 or 8, got {num_bits}")

     q_w = q_w.reshape((-1, len(interleave)))[:, interleave].ravel()
     q_w = q_w.reshape((-1, size_n)).contiguous()

     return pack_cols(q_w, num_bits, size_k, size_n)

Copilot

Pull request overview

This PR migrates the AWQ marlin repack kernel from ahead-of-time (AOT) compilation in sgl-kernel to just-in-time (JIT) compilation, as part of the kernel slimming initiative to reduce the sgl-kernel wheel size by approximately 97.5MB.

Changes:

Moved awq_marlin_repack kernel implementation from sgl-kernel AOT to JIT compilation
Added comprehensive tests comparing JIT vs AOT implementations for correctness
Added benchmarking to compare JIT vs AOT performance

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

File	Description
python/sglang/srt/layers/quantization/awq.py	Updated import to use JIT-compiled awq_marlin_repack instead of sgl_kernel AOT version
python/sglang/jit_kernel/awq_marlin_repack.py	Python wrapper for JIT kernel with output tensor allocation
python/sglang/jit_kernel/csrc/gemm/marlin/awq_marlin_repack.cuh	JIT-compiled CUDA kernel ported from AOT implementation
python/sglang/jit_kernel/tests/test_awq_marlin_repack.py	Comprehensive tests for correctness (JIT vs AOT and expected behavior)
python/sglang/jit_kernel/benchmark/bench_awq_marlin_repack.py	Performance benchmarking comparing JIT vs AOT implementations

BBuf · 2026-02-18T06:55:24Z

@@ -60,7 +60,9 @@
    import torch_npu

 if _is_cuda:
-    from sgl_kernel import awq_dequantize, awq_marlin_moe_repack, awq_marlin_repack
+    from sgl_kernel import awq_dequantize, awq_marlin_moe_repack


TODO: we should also move awq_dequantize and awq_marlin_moe_repack to jit_kernel.

BBuf · 2026-02-18T06:55:52Z

/tag-and-rerun-ci

BBuf · 2026-02-18T09:16:38Z

/rerun-failed-ci

celve · 2026-02-19T08:12:59Z

awq marlin moe repack benchmark:

awq-marlin-moe-repack-performance:
   num_experts  JIT Kernel  AOT Kernel
0          2.0  101.580372  101.588083
1          4.0  204.759229  204.780869
2          8.0  409.524876  409.537730
3         16.0  819.356504  819.270963

Serve TheBloke/dolphin-2.7-mixtral-8x7b-AWQ with JIT:

(sglang) python3 benchmark/mmlu/bench_sglang.py
100%|██████████████████████████████████████████| 14042/14042 [10:58<00:00, 21.31it/s]
subject: abstract_algebra, #q:100, acc: 0.340
subject: anatomy, #q:135, acc: 0.652
subject: astronomy, #q:152, acc: 0.770
subject: business_ethics, #q:100, acc: 0.700
subject: clinical_knowledge, #q:265, acc: 0.800
subject: college_biology, #q:144, acc: 0.799
subject: college_chemistry, #q:100, acc: 0.540
subject: college_computer_science, #q:100, acc: 0.620
subject: college_mathematics, #q:100, acc: 0.390
subject: college_medicine, #q:173, acc: 0.705
subject: college_physics, #q:102, acc: 0.412
subject: computer_security, #q:100, acc: 0.770
subject: conceptual_physics, #q:235, acc: 0.630
subject: econometrics, #q:114, acc: 0.623
subject: electrical_engineering, #q:145, acc: 0.634
subject: elementary_mathematics, #q:378, acc: 0.463
subject: formal_logic, #q:126, acc: 0.524
subject: global_facts, #q:100, acc: 0.440
subject: high_school_biology, #q:310, acc: 0.819
subject: high_school_chemistry, #q:203, acc: 0.586
subject: high_school_computer_science, #q:100, acc: 0.740
subject: high_school_european_history, #q:165, acc: 0.776
subject: high_school_geography, #q:198, acc: 0.848
subject: high_school_government_and_politics, #q:193, acc: 0.933
subject: high_school_macroeconomics, #q:390, acc: 0.705
subject: high_school_mathematics, #q:270, acc: 0.389
subject: high_school_microeconomics, #q:238, acc: 0.756
subject: high_school_physics, #q:151, acc: 0.391
subject: high_school_psychology, #q:545, acc: 0.861
subject: high_school_statistics, #q:216, acc: 0.556
subject: high_school_us_history, #q:204, acc: 0.858
subject: high_school_world_history, #q:237, acc: 0.865
subject: human_aging, #q:223, acc: 0.709
subject: human_sexuality, #q:131, acc: 0.771
subject: international_law, #q:121, acc: 0.876
subject: jurisprudence, #q:108, acc: 0.815
subject: logical_fallacies, #q:163, acc: 0.773
subject: machine_learning, #q:112, acc: 0.482
subject: management, #q:103, acc: 0.864
subject: marketing, #q:234, acc: 0.906
subject: medical_genetics, #q:100, acc: 0.760
subject: miscellaneous, #q:783, acc: 0.870
subject: moral_disputes, #q:346, acc: 0.775
subject: moral_scenarios, #q:895, acc: 0.446
subject: nutrition, #q:306, acc: 0.778
subject: philosophy, #q:311, acc: 0.752
subject: prehistory, #q:324, acc: 0.793
subject: professional_accounting, #q:282, acc: 0.496
subject: professional_law, #q:1534, acc: 0.509
subject: professional_medicine, #q:272, acc: 0.735
subject: professional_psychology, #q:612, acc: 0.748
subject: public_relations, #q:110, acc: 0.673
subject: security_studies, #q:245, acc: 0.743
subject: sociology, #q:201, acc: 0.891
subject: us_foreign_policy, #q:100, acc: 0.870
subject: virology, #q:166, acc: 0.518
subject: world_religions, #q:171, acc: 0.865
Total latency: 659.126
Average accuracy: 0.681

With AOT:

100%|██████████████████████████████████████████| 14042/14042 [11:02<00:00, 21.19it/s]
subject: abstract_algebra, #q:100, acc: 0.340
subject: anatomy, #q:135, acc: 0.644
subject: astronomy, #q:152, acc: 0.770
subject: business_ethics, #q:100, acc: 0.680
subject: clinical_knowledge, #q:265, acc: 0.796
subject: college_biology, #q:144, acc: 0.785
subject: college_chemistry, #q:100, acc: 0.540
subject: college_computer_science, #q:100, acc: 0.630
subject: college_mathematics, #q:100, acc: 0.410
subject: college_medicine, #q:173, acc: 0.699
subject: college_physics, #q:102, acc: 0.402
subject: computer_security, #q:100, acc: 0.770
subject: conceptual_physics, #q:235, acc: 0.630
subject: econometrics, #q:114, acc: 0.614
subject: electrical_engineering, #q:145, acc: 0.641
subject: elementary_mathematics, #q:378, acc: 0.466
subject: formal_logic, #q:126, acc: 0.524
subject: global_facts, #q:100, acc: 0.450
subject: high_school_biology, #q:310, acc: 0.816
subject: high_school_chemistry, #q:203, acc: 0.581
subject: high_school_computer_science, #q:100, acc: 0.730
subject: high_school_european_history, #q:165, acc: 0.764
subject: high_school_geography, #q:198, acc: 0.854
subject: high_school_government_and_politics, #q:193, acc: 0.927
subject: high_school_macroeconomics, #q:390, acc: 0.705
subject: high_school_mathematics, #q:270, acc: 0.389
subject: high_school_microeconomics, #q:238, acc: 0.748
subject: high_school_physics, #q:151, acc: 0.384
subject: high_school_psychology, #q:545, acc: 0.861
subject: high_school_statistics, #q:216, acc: 0.565
subject: high_school_us_history, #q:204, acc: 0.863
subject: high_school_world_history, #q:237, acc: 0.873
subject: human_aging, #q:223, acc: 0.709
subject: human_sexuality, #q:131, acc: 0.786
subject: international_law, #q:121, acc: 0.876
subject: jurisprudence, #q:108, acc: 0.806
subject: logical_fallacies, #q:163, acc: 0.767
subject: machine_learning, #q:112, acc: 0.482
subject: management, #q:103, acc: 0.874
subject: marketing, #q:234, acc: 0.906
subject: medical_genetics, #q:100, acc: 0.770
subject: miscellaneous, #q:783, acc: 0.874
subject: moral_disputes, #q:346, acc: 0.775
subject: moral_scenarios, #q:895, acc: 0.444
subject: nutrition, #q:306, acc: 0.778
subject: philosophy, #q:311, acc: 0.759
subject: prehistory, #q:324, acc: 0.793
subject: professional_accounting, #q:282, acc: 0.496
subject: professional_law, #q:1534, acc: 0.512
subject: professional_medicine, #q:272, acc: 0.739
subject: professional_psychology, #q:612, acc: 0.748
subject: public_relations, #q:110, acc: 0.682
subject: security_studies, #q:245, acc: 0.743
subject: sociology, #q:201, acc: 0.886
subject: us_foreign_policy, #q:100, acc: 0.870
subject: virology, #q:166, acc: 0.518
subject: world_religions, #q:171, acc: 0.865
Total latency: 662.616
Average accuracy: 0.681

celve · 2026-02-19T08:13:50Z

awq dequant benchmark:

awq-dequantize-jit-vs-aot:
    qweight_row  qweight_col  JIT Kernel  AOT Kernel
0         128.0         16.0    1.253346    1.253989
1         128.0         32.0    1.284276    1.284160
2         128.0         64.0    1.299761    1.300043
3         128.0        128.0    1.344773    1.344287
4         128.0        448.0    1.596273    1.595940
5         256.0         16.0    1.288422    1.288046
6         256.0         32.0    1.295849    1.294851
7         256.0         64.0    1.363338    1.361475
8         256.0        128.0    1.429022    1.429168
9         256.0        448.0    1.998848    1.998996
10        512.0         16.0    1.299584    1.299666
11        512.0         32.0    1.349975    1.348758
12        512.0         64.0    1.433945    1.434366
13        512.0        128.0    1.709850    1.710682
14        512.0        448.0    2.612053    2.615450
15       1024.0         16.0    1.345730    1.346437
16       1024.0         32.0    1.428128    1.426818
17       1024.0         64.0    1.669251    1.669805
18       1024.0        128.0    1.982234    1.982000
19       1024.0        448.0    3.915917    3.915874
20       3584.0         16.0    1.524787    1.525134
21       3584.0         32.0    1.958996    1.945239
22       3584.0         64.0    2.632860    2.632395
23       3584.0        128.0    3.902435    3.903013
24       3584.0        448.0   10.662116   10.705651

Serve Qwen/Qwen2.5-7B-Instruct-AWQ with quantization enforced:

100%|██████████████████████████████████████| 14042/14042 [03:13<00:00, 72.75it/s]
subject: abstract_algebra, #q:100, acc: 0.560
subject: anatomy, #q:135, acc: 0.704
subject: astronomy, #q:152, acc: 0.855
subject: business_ethics, #q:100, acc: 0.780
subject: clinical_knowledge, #q:265, acc: 0.777
subject: college_biology, #q:144, acc: 0.847
subject: college_chemistry, #q:100, acc: 0.530
subject: college_computer_science, #q:100, acc: 0.650
subject: college_mathematics, #q:100, acc: 0.510
subject: college_medicine, #q:173, acc: 0.728
subject: college_physics, #q:102, acc: 0.510
subject: computer_security, #q:100, acc: 0.810
subject: conceptual_physics, #q:235, acc: 0.732
subject: econometrics, #q:114, acc: 0.649
subject: electrical_engineering, #q:145, acc: 0.745
subject: elementary_mathematics, #q:378, acc: 0.714
subject: formal_logic, #q:126, acc: 0.556
subject: global_facts, #q:100, acc: 0.430
subject: high_school_biology, #q:310, acc: 0.861
subject: high_school_chemistry, #q:203, acc: 0.660
subject: high_school_computer_science, #q:100, acc: 0.860
subject: high_school_european_history, #q:165, acc: 0.842
subject: high_school_geography, #q:198, acc: 0.914
subject: high_school_government_and_politics, #q:193, acc: 0.917
subject: high_school_macroeconomics, #q:390, acc: 0.782
subject: high_school_mathematics, #q:270, acc: 0.559
subject: high_school_microeconomics, #q:238, acc: 0.887
subject: high_school_physics, #q:151, acc: 0.556
subject: high_school_psychology, #q:545, acc: 0.903
subject: high_school_statistics, #q:216, acc: 0.727
subject: high_school_us_history, #q:204, acc: 0.873
subject: high_school_world_history, #q:237, acc: 0.861
subject: human_aging, #q:223, acc: 0.731
subject: human_sexuality, #q:131, acc: 0.794
subject: international_law, #q:121, acc: 0.843
subject: jurisprudence, #q:108, acc: 0.815
subject: logical_fallacies, #q:163, acc: 0.791
subject: machine_learning, #q:112, acc: 0.536
subject: management, #q:103, acc: 0.854
subject: marketing, #q:234, acc: 0.927
subject: medical_genetics, #q:100, acc: 0.830
subject: miscellaneous, #q:783, acc: 0.845
subject: moral_disputes, #q:346, acc: 0.783
subject: moral_scenarios, #q:895, acc: 0.611
subject: nutrition, #q:306, acc: 0.804
subject: philosophy, #q:311, acc: 0.778
subject: prehistory, #q:324, acc: 0.846
subject: professional_accounting, #q:282, acc: 0.578
subject: professional_law, #q:1534, acc: 0.503
subject: professional_medicine, #q:272, acc: 0.772
subject: professional_psychology, #q:612, acc: 0.773
subject: public_relations, #q:110, acc: 0.736
subject: security_studies, #q:245, acc: 0.804
subject: sociology, #q:201, acc: 0.876
subject: us_foreign_policy, #q:100, acc: 0.920
subject: virology, #q:166, acc: 0.542
subject: world_religions, #q:171, acc: 0.883
Total latency: 193.072
Average accuracy: 0.733

BBuf · 2026-02-20T15:18:29Z

/rerun-failed-ci

celve added 3 commits February 17, 2026 16:47

wip: awq marlin repack

4fada85

wip: data tpr

3b3a829

wip: lint

e0d04a5

Copilot AI review requested due to automatic review settings February 18, 2026 01:34

celve requested review from AniZpZ, BBuf, DarkSharpness, Edwardf0t1, FlamingoPg and ch-wan as code owners February 18, 2026 01:34

Copilot started reviewing on behalf of celve February 18, 2026 01:34

gemini-code-assist Bot reviewed Feb 18, 2026

Copilot AI reviewed Feb 18, 2026

BBuf mentioned this pull request Feb 18, 2026

[Feature] sgl-kernel wheel slimming plan tracking #17865

Closed

74 tasks

BBuf reviewed Feb 18, 2026

BBuf approved these changes Feb 18, 2026

github-actions Bot added the run-ci label Feb 18, 2026

wip: more kernels

265406e

github-actions Bot added the quant LLM Quantization label Feb 19, 2026

wip: lint

38cb5b9

Merge branch 'main' into jit-awq-marlin-repack

6fbd92d

BBuf merged commit 2cdde5d into sgl-project:main Feb 23, 2026
234 of 252 checks passed

BBuf mentioned this pull request Feb 24, 2026

[Kernel Slimming] Remove sgl-kernel AOT marlin kernels #19241

Merged

5 tasks

celve deleted the jit-awq-marlin-repack branch February 25, 2026 07:57

magicYang1573 pushed a commit to magicYang1573/sglang that referenced this pull request Mar 9, 2026

[Kernel Slimming] Migrate AWQ marlin repack kernel to JIT (sgl-projec…

161c08d

…t#18949) Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>

Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026

[Kernel Slimming] Migrate AWQ marlin repack kernel to JIT (sgl-projec…

81312cc

…t#18949) Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>

JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026

[Kernel Slimming] Migrate AWQ marlin repack kernel to JIT (sgl-projec…

c36cfff

…t#18949) Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
