
[Kernel Slimming] Migrate AWQ marlin repack kernel to JIT #18949

Merged: BBuf merged 6 commits into sgl-project:main from celve:jit-awq-marlin-repack on Feb 23, 2026

Conversation

@celve (Collaborator) commented on Feb 18, 2026

Motivation

See #17865

Modifications

New files:

  • python/sglang/jit_kernel/csrc/gemm/marlin/awq_marlin_repack.cuh — JIT-compiled CUDA kernel ported from sgl-kernel/csrc/gemm/marlin/awq_marlin_repack.cu
  • python/sglang/jit_kernel/awq_marlin_repack.py — Python wrapper with JIT loading and output tensor allocation
  • python/sglang/jit_kernel/tests/test_awq_marlin_repack.py
  • python/sglang/jit_kernel/benchmark/bench_awq_marlin_repack.py

Modified files:

  • python/sglang/srt/layers/quantization/awq.py — Switch awq_marlin_repack import from sgl_kernel to sglang.jit_kernel

Accuracy Tests

All tests defined in python/sglang/jit_kernel/tests/test_awq_marlin_repack.py pass.

Benchmarking and Profiling

awq-marlin-repack-performance:
   size_k  JIT Kernel  AOT Kernel
0   512.0   12.126000   12.121699
1  1024.0   12.187091   15.707735
2  2048.0   23.288894   30.660015
3  4096.0   51.975607   58.915189
4  8192.0  103.436129  103.417934
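The table above compares median kernel latency (presumably in microseconds) at several size_k values. The comparison pattern can be sketched generically; the harness below is an illustrative CPU-timing stand-in, not the actual bench_awq_marlin_repack.py, which would time on-GPU (e.g. with CUDA events or a Triton benchmarking helper):

```python
import time


def bench_median_us(fn, warmup=10, iters=100):
    """Median wall-clock latency of fn() in microseconds (CPU stand-in;
    a real GPU benchmark would use CUDA-event timing instead)."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e6)
    samples.sort()
    return samples[len(samples) // 2]


def compare_kernels(jit_fn, aot_fn, size_ks, make_args):
    """One (size_k, jit_us, aot_us) row per problem size, mirroring the
    table layout above. make_args builds the kernel inputs for a size."""
    rows = []
    for size_k in size_ks:
        args = make_args(size_k)
        rows.append((size_k,
                     bench_median_us(lambda: jit_fn(*args)),
                     bench_median_us(lambda: aot_fn(*args))))
    return rows
```

The function and parameter names here are hypothetical; only the JIT-vs-AOT, per-size_k comparison shape is taken from the table above.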

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

Copilot AI review requested due to automatic review settings February 18, 2026 01:34
@gemini-code-assist (Contributor)

Summary of Changes

Hello @celve, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the awq_marlin_repack kernel by moving its implementation to a Just-In-Time (JIT) compilation system. This change aims to improve the adaptability and efficiency of the kernel by allowing it to be compiled dynamically. The migration includes the creation of necessary infrastructure for JIT compilation, along with comprehensive testing and performance evaluation to validate the new approach.

Highlights

  • JIT Kernel Migration: The awq_marlin_repack CUDA kernel has been migrated to a Just-In-Time (JIT) compilation framework, enhancing flexibility and potential performance.
  • New JIT Kernel Components: New Python wrapper, CUDA kernel implementation, unit tests, and benchmarking scripts have been introduced specifically for the JIT-compiled awq_marlin_repack.
  • Integration Update: The awq.py module has been updated to import and utilize the newly JIT-compiled awq_marlin_repack kernel, ensuring seamless integration.
  • Performance Benchmarking: Benchmarking results indicate that the JIT kernel generally performs comparably to or better than the AOT kernel across various size_k configurations.


Changelog
  • python/sglang/jit_kernel/awq_marlin_repack.py
    • Added a Python wrapper for the JIT-compiled awq_marlin_repack CUDA kernel.
  • python/sglang/jit_kernel/benchmark/bench_awq_marlin_repack.py
    • Added a benchmarking script to compare the performance of the JIT-compiled kernel against the AOT-compiled version.
  • python/sglang/jit_kernel/csrc/gemm/marlin/awq_marlin_repack.cuh
    • Added the JIT-compiled CUDA kernel for AWQ Marlin repack operations.
  • python/sglang/jit_kernel/tests/test_awq_marlin_repack.py
    • Added unit tests to verify the correctness of the JIT-compiled awq_marlin_repack kernel.
  • python/sglang/srt/layers/quantization/awq.py
    • Updated the import statement to use the JIT-compiled awq_marlin_repack from sglang.jit_kernel.

@celve (Collaborator, Author) commented on Feb 18, 2026

  • JIT

    100%|██████████████████████████████████████| 14042/14042 [04:56<00:00, 47.41it/s]
    subject: abstract_algebra, #q:100, acc: 0.560
    subject: anatomy, #q:135, acc: 0.696
    subject: astronomy, #q:152, acc: 0.855
    subject: business_ethics, #q:100, acc: 0.780
    subject: clinical_knowledge, #q:265, acc: 0.777
    subject: college_biology, #q:144, acc: 0.847
    subject: college_chemistry, #q:100, acc: 0.530
    subject: college_computer_science, #q:100, acc: 0.660
    subject: college_mathematics, #q:100, acc: 0.510
    subject: college_medicine, #q:173, acc: 0.728
    subject: college_physics, #q:102, acc: 0.510
    subject: computer_security, #q:100, acc: 0.800
    subject: conceptual_physics, #q:235, acc: 0.732
    subject: econometrics, #q:114, acc: 0.658
    subject: electrical_engineering, #q:145, acc: 0.745
    subject: elementary_mathematics, #q:378, acc: 0.714
    subject: formal_logic, #q:126, acc: 0.556
    subject: global_facts, #q:100, acc: 0.430
    subject: high_school_biology, #q:310, acc: 0.861
    subject: high_school_chemistry, #q:203, acc: 0.660
    subject: high_school_computer_science, #q:100, acc: 0.860
    subject: high_school_european_history, #q:165, acc: 0.842
    subject: high_school_geography, #q:198, acc: 0.909
    subject: high_school_government_and_politics, #q:193, acc: 0.917
    subject: high_school_macroeconomics, #q:390, acc: 0.779
    subject: high_school_mathematics, #q:270, acc: 0.559
    subject: high_school_microeconomics, #q:238, acc: 0.887
    subject: high_school_physics, #q:151, acc: 0.556
    subject: high_school_psychology, #q:545, acc: 0.903
    subject: high_school_statistics, #q:216, acc: 0.727
    subject: high_school_us_history, #q:204, acc: 0.873
    subject: high_school_world_history, #q:237, acc: 0.861
    subject: human_aging, #q:223, acc: 0.735
    subject: human_sexuality, #q:131, acc: 0.794
    subject: international_law, #q:121, acc: 0.843
    subject: jurisprudence, #q:108, acc: 0.815
    subject: logical_fallacies, #q:163, acc: 0.791
    subject: machine_learning, #q:112, acc: 0.536
    subject: management, #q:103, acc: 0.854
    subject: marketing, #q:234, acc: 0.927
    subject: medical_genetics, #q:100, acc: 0.830
    subject: miscellaneous, #q:783, acc: 0.847
    subject: moral_disputes, #q:346, acc: 0.786
    subject: moral_scenarios, #q:895, acc: 0.611
    subject: nutrition, #q:306, acc: 0.804
    subject: philosophy, #q:311, acc: 0.778
    subject: prehistory, #q:324, acc: 0.846
    subject: professional_accounting, #q:282, acc: 0.578
    subject: professional_law, #q:1534, acc: 0.503
    subject: professional_medicine, #q:272, acc: 0.772
    subject: professional_psychology, #q:612, acc: 0.771
    subject: public_relations, #q:110, acc: 0.736
    subject: security_studies, #q:245, acc: 0.804
    subject: sociology, #q:201, acc: 0.876
    subject: us_foreign_policy, #q:100, acc: 0.920
    subject: virology, #q:166, acc: 0.542
    subject: world_religions, #q:171, acc: 0.883
    Total latency: 295.768
    Average accuracy: 0.733
  • AOT

    100%|██████████████████████████████████████| 14042/14042 [04:54<00:00, 47.67it/s]
    subject: abstract_algebra, #q:100, acc: 0.560
    subject: anatomy, #q:135, acc: 0.704
    subject: astronomy, #q:152, acc: 0.855
    subject: business_ethics, #q:100, acc: 0.790
    subject: clinical_knowledge, #q:265, acc: 0.777
    subject: college_biology, #q:144, acc: 0.847
    subject: college_chemistry, #q:100, acc: 0.530
    subject: college_computer_science, #q:100, acc: 0.650
    subject: college_mathematics, #q:100, acc: 0.510
    subject: college_medicine, #q:173, acc: 0.728
    subject: college_physics, #q:102, acc: 0.510
    subject: computer_security, #q:100, acc: 0.800
    subject: conceptual_physics, #q:235, acc: 0.732
    subject: econometrics, #q:114, acc: 0.649
    subject: electrical_engineering, #q:145, acc: 0.745
    subject: elementary_mathematics, #q:378, acc: 0.714
    subject: formal_logic, #q:126, acc: 0.548
    subject: global_facts, #q:100, acc: 0.440
    subject: high_school_biology, #q:310, acc: 0.861
    subject: high_school_chemistry, #q:203, acc: 0.655
    subject: high_school_computer_science, #q:100, acc: 0.860
    subject: high_school_european_history, #q:165, acc: 0.842
    subject: high_school_geography, #q:198, acc: 0.914
    subject: high_school_government_and_politics, #q:193, acc: 0.917
    subject: high_school_macroeconomics, #q:390, acc: 0.782
    subject: high_school_mathematics, #q:270, acc: 0.563
    subject: high_school_microeconomics, #q:238, acc: 0.887
    subject: high_school_physics, #q:151, acc: 0.556
    subject: high_school_psychology, #q:545, acc: 0.903
    subject: high_school_statistics, #q:216, acc: 0.727
    subject: high_school_us_history, #q:204, acc: 0.873
    subject: high_school_world_history, #q:237, acc: 0.861
    subject: human_aging, #q:223, acc: 0.731
    subject: human_sexuality, #q:131, acc: 0.794
    subject: international_law, #q:121, acc: 0.843
    subject: jurisprudence, #q:108, acc: 0.815
    subject: logical_fallacies, #q:163, acc: 0.791
    subject: machine_learning, #q:112, acc: 0.545
    subject: management, #q:103, acc: 0.854
    subject: marketing, #q:234, acc: 0.927
    subject: medical_genetics, #q:100, acc: 0.830
    subject: miscellaneous, #q:783, acc: 0.847
    subject: moral_disputes, #q:346, acc: 0.783
    subject: moral_scenarios, #q:895, acc: 0.610
    subject: nutrition, #q:306, acc: 0.807
    subject: philosophy, #q:311, acc: 0.778
    subject: prehistory, #q:324, acc: 0.846
    subject: professional_accounting, #q:282, acc: 0.582
    subject: professional_law, #q:1534, acc: 0.502
    subject: professional_medicine, #q:272, acc: 0.768
    subject: professional_psychology, #q:612, acc: 0.771
    subject: public_relations, #q:110, acc: 0.736
    subject: security_studies, #q:245, acc: 0.804
    subject: sociology, #q:201, acc: 0.876
    subject: us_foreign_policy, #q:100, acc: 0.920
    subject: virology, #q:166, acc: 0.542
    subject: world_religions, #q:171, acc: 0.883
    Total latency: 294.586
    Average accuracy: 0.733

@gemini-code-assist (Bot) left a comment


Code Review

This pull request successfully migrates the AWQ marlin repack kernel to a JIT-compiled version, which is a great step for kernel slimming. The implementation is solid, including the new Python wrapper, CUDA kernel, tests, and benchmarks. I've identified a few minor areas for improvement regarding code duplication in test/benchmark files, a magic number that could be a constant, and some unreachable code in the CUDA host wrapper. Addressing these points will enhance the code's maintainability and clarity. Overall, this is a well-executed migration.

    size_n: int,
    num_bits: int,
) -> torch.Tensor:
    tile_size = 16
Severity: medium

The tile_size is hardcoded as a magic number. It would be better to define it as a module-level constant (e.g., MARLIN_TILE_SIZE = 16) to improve readability and maintainability, as this value is tied to the kernel's implementation details.
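The suggested refactor is small. A sketch of the named constant and the output-shape computation it would feed — the shape formula follows the common Marlin repack layout (rows collapsed by the tile size, each int32 column holding several packed values) and is inferred, not copied from this PR:

```python
# Marlin operates on 16x16 tiles; naming the tile size once at module
# level replaces the magic number flagged in the review.
MARLIN_TILE_SIZE = 16


def repacked_shape(size_k, size_n, num_bits):
    """Shape of the int32 output buffer for a Marlin-style repack.

    Assumes the usual layout: size_k is collapsed by the tile size and
    each int32 word holds 32 // num_bits packed quantized values.
    """
    if num_bits not in (4, 8):
        raise ValueError(f"num_bits must be 4 or 8, got {num_bits}")
    pack_factor = 32 // num_bits  # 8 four-bit or 4 eight-bit values per word
    return (size_k // MARLIN_TILE_SIZE, size_n * MARLIN_TILE_SIZE // pack_factor)
```

The helper name and exact formula are illustrative; the point is only that the 16 should appear once, under a name tied to the kernel's tile geometry.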

Comment on lines +33 to +43

def awq_pack(q_w, num_bits, size_k, size_n):
    if num_bits == 4:
        interleave = np.array([0, 2, 4, 6, 1, 3, 5, 7])
    elif num_bits == 8:
        interleave = np.array([0, 2, 1, 3])
    else:
        raise Exception("num_bits must be 4 or 8, got {}".format(num_bits))

    q_w = q_w.reshape((-1, len(interleave)))[:, interleave].ravel()
    q_w = q_w.reshape((-1, size_n)).contiguous()
    return pack_cols(q_w, num_bits, size_k, size_n)
Severity: medium

This awq_pack function is duplicated in python/sglang/jit_kernel/tests/test_awq_marlin_repack.py. To avoid code duplication and improve maintainability, consider moving this function to a shared test utility module.

Additionally, it's better to raise a more specific exception like ValueError instead of a generic Exception.

Suggested change:

Before:

def awq_pack(q_w, num_bits, size_k, size_n):
    if num_bits == 4:
        interleave = np.array([0, 2, 4, 6, 1, 3, 5, 7])
    elif num_bits == 8:
        interleave = np.array([0, 2, 1, 3])
    else:
        raise Exception("num_bits must be 4 or 8, got {}".format(num_bits))
    q_w = q_w.reshape((-1, len(interleave)))[:, interleave].ravel()
    q_w = q_w.reshape((-1, size_n)).contiguous()
    return pack_cols(q_w, num_bits, size_k, size_n)

After:

def awq_pack(q_w, num_bits, size_k, size_n):
    if num_bits == 4:
        interleave = np.array([0, 2, 4, 6, 1, 3, 5, 7])
    elif num_bits == 8:
        interleave = np.array([0, 2, 1, 3])
    else:
        raise ValueError(f"num_bits must be 4 or 8, got {num_bits}")
    q_w = q_w.reshape((-1, len(interleave)))[:, interleave].ravel()
    q_w = q_w.reshape((-1, size_n)).contiguous()
    return pack_cols(q_w, num_bits, size_k, size_n)
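To make the interleave-then-pack step concrete: for 4-bit weights, each group of eight values is permuted by [0, 2, 4, 6, 1, 3, 5, 7] before column packing. A small self-contained sketch — pack_cols itself is not shown in this PR, so the lowest-bits-first packing below is an illustrative reimplementation, not the repo's helper:

```python
import numpy as np

# AWQ's interleave order within each group of eight 4-bit values
# (taken from the awq_pack function quoted above).
AWQ_INTERLEAVE_4BIT = np.array([0, 2, 4, 6, 1, 3, 5, 7])


def pack_word(vals, num_bits):
    """Pack 32 // num_bits small ints into one int32 word, lowest bits
    first (illustrative; the repo's pack_cols may order differently)."""
    assert len(vals) == 32 // num_bits
    word = 0
    for i, v in enumerate(vals):
        word |= int(v) << (num_bits * i)
    return word


row = np.arange(8)                      # [0, 1, 2, 3, 4, 5, 6, 7]
interleaved = row[AWQ_INTERLEAVE_4BIT]  # picks indices 0,2,4,6 then 1,3,5,7
word = pack_word(interleaved, num_bits=4)
```

Running this shows how even-index values land in the low nibbles and odd-index values in the high nibbles of each packed word, which is what the interleave step sets up for the kernel.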

Comment on lines +248 to +250
} else {
  RuntimeCheck(false, "Unsupported repack config: num_bits = ", num_bits);
}
Severity: medium

The RuntimeCheck at line 198 already ensures that num_bits is either 4 or 8. Therefore, this else block is unreachable and can be removed for code clarity.

Comment on lines +20 to +38
def awq_pack(
    q_w: torch.Tensor,
    num_bits: int,
    size_k: int,
    size_n: int,
):
    assert q_w.shape == (size_k, size_n)

    if num_bits == 4:
        interleave = np.array([0, 2, 4, 6, 1, 3, 5, 7])
    elif num_bits == 8:
        interleave = np.array([0, 2, 1, 3])
    else:
        raise Exception("num_bits must be 4 or 8, got {}".format(num_bits))

    q_w = q_w.reshape((-1, len(interleave)))[:, interleave].ravel()
    q_w = q_w.reshape((-1, size_n)).contiguous()

    return pack_cols(q_w, num_bits, size_k, size_n)
Severity: medium

This awq_pack function is duplicated in python/sglang/jit_kernel/benchmark/bench_awq_marlin_repack.py. To improve maintainability, it would be best to extract it into a shared test utility file.

Also, it's good practice to raise a more specific ValueError instead of a generic Exception.

Suggested change:

Before:

def awq_pack(
    q_w: torch.Tensor,
    num_bits: int,
    size_k: int,
    size_n: int,
):
    assert q_w.shape == (size_k, size_n)
    if num_bits == 4:
        interleave = np.array([0, 2, 4, 6, 1, 3, 5, 7])
    elif num_bits == 8:
        interleave = np.array([0, 2, 1, 3])
    else:
        raise Exception("num_bits must be 4 or 8, got {}".format(num_bits))
    q_w = q_w.reshape((-1, len(interleave)))[:, interleave].ravel()
    q_w = q_w.reshape((-1, size_n)).contiguous()
    return pack_cols(q_w, num_bits, size_k, size_n)

After:

def awq_pack(
    q_w: torch.Tensor,
    num_bits: int,
    size_k: int,
    size_n: int,
):
    assert q_w.shape == (size_k, size_n)
    if num_bits == 4:
        interleave = np.array([0, 2, 4, 6, 1, 3, 5, 7])
    elif num_bits == 8:
        interleave = np.array([0, 2, 1, 3])
    else:
        raise ValueError(f"num_bits must be 4 or 8, got {num_bits}")
    q_w = q_w.reshape((-1, len(interleave)))[:, interleave].ravel()
    q_w = q_w.reshape((-1, size_n)).contiguous()
    return pack_cols(q_w, num_bits, size_k, size_n)

Copilot AI (Contributor) left a comment


Pull request overview

This PR migrates the AWQ marlin repack kernel from ahead-of-time (AOT) compilation in sgl-kernel to just-in-time (JIT) compilation, as part of the kernel slimming initiative to reduce the sgl-kernel wheel size by approximately 97.5MB.

Changes:

  • Moved awq_marlin_repack kernel implementation from sgl-kernel AOT to JIT compilation
  • Added comprehensive tests comparing JIT vs AOT implementations for correctness
  • Added benchmarking to compare JIT vs AOT performance

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Summary per file:
  • python/sglang/srt/layers/quantization/awq.py — Updated import to use the JIT-compiled awq_marlin_repack instead of the sgl_kernel AOT version
  • python/sglang/jit_kernel/awq_marlin_repack.py — Python wrapper for the JIT kernel with output tensor allocation
  • python/sglang/jit_kernel/csrc/gemm/marlin/awq_marlin_repack.cuh — JIT-compiled CUDA kernel ported from the AOT implementation
  • python/sglang/jit_kernel/tests/test_awq_marlin_repack.py — Comprehensive correctness tests (JIT vs AOT and expected behavior)
  • python/sglang/jit_kernel/benchmark/bench_awq_marlin_repack.py — Performance benchmark comparing JIT vs AOT implementations


@@ -60,7 +60,9 @@
     import torch_npu

 if _is_cuda:
-    from sgl_kernel import awq_dequantize, awq_marlin_moe_repack, awq_marlin_repack
+    from sgl_kernel import awq_dequantize, awq_marlin_moe_repack
A collaborator commented:

todo: we should also migrate awq_dequantize and awq_marlin_moe_repack to jit_kernel

@BBuf (Collaborator) commented on Feb 18, 2026

/tag-and-rerun-ci

@BBuf (Collaborator) commented on Feb 18, 2026

/rerun-failed-ci

github-actions (Bot) added the quant LLM Quantization label on Feb 19, 2026
@celve (Collaborator, Author) commented on Feb 19, 2026

AWQ marlin MoE repack benchmark:

awq-marlin-moe-repack-performance:
   num_experts  JIT Kernel  AOT Kernel
0          2.0  101.580372  101.588083
1          4.0  204.759229  204.780869
2          8.0  409.524876  409.537730
3         16.0  819.356504  819.270963

Serve TheBloke/dolphin-2.7-mixtral-8x7b-AWQ with JIT:

(sglang) python3 benchmark/mmlu/bench_sglang.py                                      
100%|██████████████████████████████████████████| 14042/14042 [10:58<00:00, 21.31it/s]
subject: abstract_algebra, #q:100, acc: 0.340                                        
subject: anatomy, #q:135, acc: 0.652                                                 
subject: astronomy, #q:152, acc: 0.770                                               
subject: business_ethics, #q:100, acc: 0.700                                         
subject: clinical_knowledge, #q:265, acc: 0.800                                      
subject: college_biology, #q:144, acc: 0.799                                         
subject: college_chemistry, #q:100, acc: 0.540                                       
subject: college_computer_science, #q:100, acc: 0.620                                
subject: college_mathematics, #q:100, acc: 0.390                                     
subject: college_medicine, #q:173, acc: 0.705                                        
subject: college_physics, #q:102, acc: 0.412                                         
subject: computer_security, #q:100, acc: 0.770                                       
subject: conceptual_physics, #q:235, acc: 0.630                                      
subject: econometrics, #q:114, acc: 0.623
subject: electrical_engineering, #q:145, acc: 0.634
subject: elementary_mathematics, #q:378, acc: 0.463
subject: formal_logic, #q:126, acc: 0.524
subject: global_facts, #q:100, acc: 0.440
subject: high_school_biology, #q:310, acc: 0.819
subject: high_school_chemistry, #q:203, acc: 0.586
subject: high_school_computer_science, #q:100, acc: 0.740
subject: high_school_european_history, #q:165, acc: 0.776
subject: high_school_geography, #q:198, acc: 0.848
subject: high_school_government_and_politics, #q:193, acc: 0.933
subject: high_school_macroeconomics, #q:390, acc: 0.705
subject: high_school_mathematics, #q:270, acc: 0.389
subject: high_school_microeconomics, #q:238, acc: 0.756
subject: high_school_physics, #q:151, acc: 0.391
subject: high_school_psychology, #q:545, acc: 0.861
subject: high_school_statistics, #q:216, acc: 0.556
subject: high_school_us_history, #q:204, acc: 0.858
subject: high_school_world_history, #q:237, acc: 0.865
subject: human_aging, #q:223, acc: 0.709
subject: human_sexuality, #q:131, acc: 0.771
subject: international_law, #q:121, acc: 0.876
subject: jurisprudence, #q:108, acc: 0.815 
subject: logical_fallacies, #q:163, acc: 0.773
subject: machine_learning, #q:112, acc: 0.482
subject: management, #q:103, acc: 0.864
subject: marketing, #q:234, acc: 0.906
subject: medical_genetics, #q:100, acc: 0.760
subject: miscellaneous, #q:783, acc: 0.870
subject: moral_disputes, #q:346, acc: 0.775
subject: moral_scenarios, #q:895, acc: 0.446
subject: nutrition, #q:306, acc: 0.778
subject: philosophy, #q:311, acc: 0.752
subject: prehistory, #q:324, acc: 0.793
subject: professional_accounting, #q:282, acc: 0.496
subject: professional_law, #q:1534, acc: 0.509
subject: professional_medicine, #q:272, acc: 0.735
subject: professional_psychology, #q:612, acc: 0.748
subject: public_relations, #q:110, acc: 0.673
subject: security_studies, #q:245, acc: 0.743
subject: sociology, #q:201, acc: 0.891
subject: us_foreign_policy, #q:100, acc: 0.870
subject: virology, #q:166, acc: 0.518
subject: world_religions, #q:171, acc: 0.865
Total latency: 659.126
Average accuracy: 0.681

With AOT:

100%|██████████████████████████████████████████| 14042/14042 [11:02<00:00, 21.19it/s]
subject: abstract_algebra, #q:100, acc: 0.340
subject: anatomy, #q:135, acc: 0.644
subject: astronomy, #q:152, acc: 0.770
subject: business_ethics, #q:100, acc: 0.680
subject: clinical_knowledge, #q:265, acc: 0.796
subject: college_biology, #q:144, acc: 0.785
subject: college_chemistry, #q:100, acc: 0.540
subject: college_computer_science, #q:100, acc: 0.630
subject: college_mathematics, #q:100, acc: 0.410
subject: college_medicine, #q:173, acc: 0.699
subject: college_physics, #q:102, acc: 0.402
subject: computer_security, #q:100, acc: 0.770
subject: conceptual_physics, #q:235, acc: 0.630
subject: econometrics, #q:114, acc: 0.614
subject: electrical_engineering, #q:145, acc: 0.641
subject: elementary_mathematics, #q:378, acc: 0.466
subject: formal_logic, #q:126, acc: 0.524
subject: global_facts, #q:100, acc: 0.450
subject: high_school_biology, #q:310, acc: 0.816
subject: high_school_chemistry, #q:203, acc: 0.581
subject: high_school_computer_science, #q:100, acc: 0.730
subject: high_school_european_history, #q:165, acc: 0.764
subject: high_school_geography, #q:198, acc: 0.854
subject: high_school_government_and_politics, #q:193, acc: 0.927
subject: high_school_macroeconomics, #q:390, acc: 0.705
subject: high_school_mathematics, #q:270, acc: 0.389
subject: high_school_microeconomics, #q:238, acc: 0.748
subject: high_school_physics, #q:151, acc: 0.384
subject: high_school_psychology, #q:545, acc: 0.861
subject: high_school_statistics, #q:216, acc: 0.565
subject: high_school_us_history, #q:204, acc: 0.863
subject: high_school_world_history, #q:237, acc: 0.873
subject: human_aging, #q:223, acc: 0.709
subject: human_sexuality, #q:131, acc: 0.786
subject: international_law, #q:121, acc: 0.876
subject: jurisprudence, #q:108, acc: 0.806
subject: logical_fallacies, #q:163, acc: 0.767
subject: machine_learning, #q:112, acc: 0.482
subject: management, #q:103, acc: 0.874
subject: marketing, #q:234, acc: 0.906
subject: medical_genetics, #q:100, acc: 0.770
subject: miscellaneous, #q:783, acc: 0.874
subject: moral_disputes, #q:346, acc: 0.775
subject: moral_scenarios, #q:895, acc: 0.444
subject: nutrition, #q:306, acc: 0.778
subject: philosophy, #q:311, acc: 0.759
subject: prehistory, #q:324, acc: 0.793
subject: professional_accounting, #q:282, acc: 0.496
subject: professional_law, #q:1534, acc: 0.512
subject: professional_medicine, #q:272, acc: 0.739
subject: professional_psychology, #q:612, acc: 0.748
subject: public_relations, #q:110, acc: 0.682
subject: security_studies, #q:245, acc: 0.743
subject: sociology, #q:201, acc: 0.886
subject: us_foreign_policy, #q:100, acc: 0.870
subject: virology, #q:166, acc: 0.518
subject: world_religions, #q:171, acc: 0.865
Total latency: 662.616
Average accuracy: 0.681

@celve (Collaborator, Author) commented on Feb 19, 2026

AWQ dequantize benchmark:

awq-dequantize-jit-vs-aot:
    qweight_row  qweight_col  JIT Kernel  AOT Kernel
0         128.0         16.0    1.253346    1.253989
1         128.0         32.0    1.284276    1.284160
2         128.0         64.0    1.299761    1.300043
3         128.0        128.0    1.344773    1.344287
4         128.0        448.0    1.596273    1.595940
5         256.0         16.0    1.288422    1.288046
6         256.0         32.0    1.295849    1.294851
7         256.0         64.0    1.363338    1.361475
8         256.0        128.0    1.429022    1.429168
9         256.0        448.0    1.998848    1.998996
10        512.0         16.0    1.299584    1.299666
11        512.0         32.0    1.349975    1.348758
12        512.0         64.0    1.433945    1.434366
13        512.0        128.0    1.709850    1.710682
14        512.0        448.0    2.612053    2.615450
15       1024.0         16.0    1.345730    1.346437
16       1024.0         32.0    1.428128    1.426818
17       1024.0         64.0    1.669251    1.669805
18       1024.0        128.0    1.982234    1.982000
19       1024.0        448.0    3.915917    3.915874
20       3584.0         16.0    1.524787    1.525134
21       3584.0         32.0    1.958996    1.945239
22       3584.0         64.0    2.632860    2.632395
23       3584.0        128.0    3.902435    3.903013
24       3584.0        448.0   10.662116   10.705651

Serve Qwen/Qwen2.5-7B-Instruct-AWQ with quantization enforced:

100%|██████████████████████████████████████| 14042/14042 [03:13<00:00, 72.75it/s]
subject: abstract_algebra, #q:100, acc: 0.560
subject: anatomy, #q:135, acc: 0.704
subject: astronomy, #q:152, acc: 0.855
subject: business_ethics, #q:100, acc: 0.780
subject: clinical_knowledge, #q:265, acc: 0.777
subject: college_biology, #q:144, acc: 0.847
subject: college_chemistry, #q:100, acc: 0.530
subject: college_computer_science, #q:100, acc: 0.650
subject: college_mathematics, #q:100, acc: 0.510
subject: college_medicine, #q:173, acc: 0.728
subject: college_physics, #q:102, acc: 0.510
subject: computer_security, #q:100, acc: 0.810
subject: conceptual_physics, #q:235, acc: 0.732
subject: econometrics, #q:114, acc: 0.649
subject: electrical_engineering, #q:145, acc: 0.745
subject: elementary_mathematics, #q:378, acc: 0.714
subject: formal_logic, #q:126, acc: 0.556
subject: global_facts, #q:100, acc: 0.430
subject: high_school_biology, #q:310, acc: 0.861
subject: high_school_chemistry, #q:203, acc: 0.660
subject: high_school_computer_science, #q:100, acc: 0.860
subject: high_school_european_history, #q:165, acc: 0.842
subject: high_school_geography, #q:198, acc: 0.914
subject: high_school_government_and_politics, #q:193, acc: 0.917
subject: high_school_macroeconomics, #q:390, acc: 0.782
subject: high_school_mathematics, #q:270, acc: 0.559
subject: high_school_microeconomics, #q:238, acc: 0.887
subject: high_school_physics, #q:151, acc: 0.556
subject: high_school_psychology, #q:545, acc: 0.903
subject: high_school_statistics, #q:216, acc: 0.727
subject: high_school_us_history, #q:204, acc: 0.873
subject: high_school_world_history, #q:237, acc: 0.861
subject: human_aging, #q:223, acc: 0.731
subject: human_sexuality, #q:131, acc: 0.794
subject: international_law, #q:121, acc: 0.843
subject: jurisprudence, #q:108, acc: 0.815
subject: logical_fallacies, #q:163, acc: 0.791
subject: machine_learning, #q:112, acc: 0.536
subject: management, #q:103, acc: 0.854
subject: marketing, #q:234, acc: 0.927
subject: medical_genetics, #q:100, acc: 0.830
subject: miscellaneous, #q:783, acc: 0.845
subject: moral_disputes, #q:346, acc: 0.783
subject: moral_scenarios, #q:895, acc: 0.611
subject: nutrition, #q:306, acc: 0.804
subject: philosophy, #q:311, acc: 0.778
subject: prehistory, #q:324, acc: 0.846
subject: professional_accounting, #q:282, acc: 0.578
subject: professional_law, #q:1534, acc: 0.503
subject: professional_medicine, #q:272, acc: 0.772
subject: professional_psychology, #q:612, acc: 0.773
subject: public_relations, #q:110, acc: 0.736
subject: security_studies, #q:245, acc: 0.804
subject: sociology, #q:201, acc: 0.876
subject: us_foreign_policy, #q:100, acc: 0.920
subject: virology, #q:166, acc: 0.542
subject: world_religions, #q:171, acc: 0.883
Total latency: 193.072
Average accuracy: 0.733

@BBuf (Collaborator) commented on Feb 20, 2026

/rerun-failed-ci

@BBuf BBuf merged commit 2cdde5d into sgl-project:main Feb 23, 2026
234 of 252 checks passed
xiaobaicxy added a commit to xiaobaicxy/sglang that referenced this pull request Feb 24, 2026
@celve celve deleted the jit-awq-marlin-repack branch February 25, 2026 07:57
magicYang1573 pushed a commit to magicYang1573/sglang that referenced this pull request Mar 9, 2026
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026

Labels

quant LLM Quantization run-ci
