
[FEAT] Support GGUF format #2215

Merged
merrymercy merged 14 commits into sgl-project:main from zhengy001:zyang_dev on Nov 30, 2024

Conversation

@zhengy001
Contributor

@zhengy001 zhengy001 commented Nov 27, 2024

Motivation

#1616

Modifications

Support GGUF format
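A minimal sketch of how a loader can recognize a GGUF checkpoint, either by the .gguf file extension or by the 4-byte magic "GGUF" that the GGUF spec places at the start of the file. This is illustrative only, not sglang's actual detection code.

```python
# Hedged sketch, not sglang's real loader: detect a GGUF checkpoint by
# extension or by the leading 4-byte magic defined in the GGUF spec.
GGUF_MAGIC = b"GGUF"


def looks_like_gguf(path: str, header: bytes = b"") -> bool:
    # Check the extension first; fall back to sniffing the file header.
    return path.endswith(".gguf") or header[:4] == GGUF_MAGIC
```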

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@zhengy001
Contributor Author

lm_head.weight is used directly in many places; however, vLLM changes it to qweight for GGUF. This would be an issue.

@merrymercy
Contributor

Thanks for the contributions. Can you fix the CI errors?

@zhengy001 zhengy001 force-pushed the zyang_dev branch 2 times, most recently from 5c616a5 to 2cffa70 Compare November 27, 2024 13:10
@zhengy001
Contributor Author

> Thanks for the contributions. Can you fix the CI errors?

How to trigger the CI?

@zhengy001
Contributor Author

> lm_head.weight is directly used in many places, however, vllm changes it to be qweight for gguf. This would be an issue.

Pass lm_head to LogitsProcessor and check the weight inside it.
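A minimal sketch of the approach described above: instead of reading lm_head.weight at many call sites, pass the whole lm_head module to the logits processor and inspect it there. GGUF-quantized layers store packed data under `qweight` rather than `weight`. The class names here are illustrative placeholders, not sglang's real classes.

```python
# Hedged sketch, assuming illustrative stand-in classes (not sglang's).
class PlainHead:
    def __init__(self):
        self.weight = [[0.0, 0.0], [0.0, 0.0]]  # ordinary fp weight


class GGUFQuantizedHead:
    def __init__(self):
        self.qweight = [[0, 0], [0, 0]]  # packed quantized data, no .weight


def head_is_gguf_quantized(head) -> bool:
    # The logits processor inspects the module rather than assuming
    # .weight exists, which is what breaks when vLLM swaps in qweight.
    return hasattr(head, "qweight")
```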

@merrymercy
Contributor

merrymercy commented Nov 27, 2024

@zhengy001 CI won't be triggered for you automatically because you are a first-time contributor. You can send a random typo fix PR and I can merge that for you so your future commits can trigger CI automatically.

@merrymercy
Contributor

@zhengy001 Can you fix the CI errors?

@zhengy001
Contributor Author

> @zhengy001 Can you fix the CI errors?

@merrymercy Sure, working on it.

Comment thread test/srt/run_suite.py Outdated
Comment thread test/srt/test_gguf.py Outdated
Comment thread python/sglang/srt/server_args.py Outdated
Comment thread python/sglang/srt/layers/vocab_parallel_embedding.py Outdated
@merrymercy
Contributor

#2269 adds you as a new contributor so your future commits will trigger CI automatically

@zhengy001
Contributor Author

> #2269 adds you as a new contributor so your future commits will trigger CI automatically

@merrymercy :)

Comment thread python/sglang/srt/models/olmo.py Outdated
Contributor Author

There won't be an "lm_head.weight" entry if self.config.tie_word_embeddings is True.
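A minimal sketch of the guard this implies for weight loading: with tied word embeddings, the checkpoint has no separate "lm_head.weight" entry, so the loader must skip it and reuse the input embedding instead. The function name is a placeholder, not the actual olmo.py code.

```python
# Hedged sketch, assuming a hypothetical helper name.
def should_skip_lm_head(weight_name: str, tie_word_embeddings: bool) -> bool:
    # With tied embeddings the output projection reuses
    # embed_tokens.weight, so no lm_head.weight arrives from the file.
    return tie_word_embeddings and weight_name == "lm_head.weight"
```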

Comment thread test/srt/test_gguf.py Outdated
Contributor Author

Compared the result with vLLM's. Please suggest if there is a better way.
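The comparison strategy above can be sketched as: run the same prompts through both engines and check that the generated texts agree. This is illustrative pseudologic, not the actual test_gguf.py code.

```python
# Hedged sketch of an output-equality check between two engines.
def outputs_match(sglang_outputs: list, vllm_outputs: list) -> bool:
    if len(sglang_outputs) != len(vllm_outputs):
        return False
    # Exact string match per prompt; looser metrics (e.g. logprob
    # tolerance) would also be reasonable here.
    return all(a == b for a, b in zip(sglang_outputs, vllm_outputs))
```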

@merrymercy merrymercy enabled auto-merge (squash) November 30, 2024 07:47
Comment thread python/sglang/srt/layers/logits_processor.py Outdated
@merrymercy merrymercy disabled auto-merge November 30, 2024 08:44
@merrymercy merrymercy merged commit 883c955 into sgl-project:main Nov 30, 2024
merrymercy added a commit that referenced this pull request Dec 1, 2024
timethink pushed a commit to timethink/sglang that referenced this pull request Mar 9, 2025
Co-authored-by: Yang Zheng(SW)(Alex) <you@example.com>
hammersam added a commit to hammersam/sglang that referenced this pull request Mar 8, 2026
Port the multi-CTA radix-based top-k kernel from flashinfer PR
flashinfer-ai/flashinfer#2215 into sglang as a JIT-compiled kernel.
This replaces the existing AOT single-CTA top-k implementation for NSA
attention, providing better performance on long sequences (32K+) where
the multi-CTA path activates.

Key changes:

- Add `python/sglang/jit_kernel/topk.py`: Python API exposing three
  JIT top-k variants (basic, page-table transform, ragged transform)
  with workspace management and lazy compilation via `cache_once`.

- Add `python/sglang/jit_kernel/csrc/elementwise/topk.cuh`: CUDA wrapper
  providing TVM FFI entry points that dispatch to the flashinfer adaptive
  top-k kernels (TopKDispatch, TopKPageTableTransformDispatch,
  TopKRaggedTransformDispatch).

- Add `python/sglang/jit_kernel/include/sgl_kernel/topk_fi.cuh`: Core
  CUDA implementation adapted from flashinfer, featuring:
  - 8-bit radix selection algorithm with multi-CTA support for large
    sequences (threshold configurable, default 32K)
  - Support for float32, float16, and bfloat16 input types
  - row_starts parameter for ragged input score layouts (sglang-specific)
  - Three output modes: indices-only, page-table lookup, and ragged
    offset addition

- Update `python/sglang/srt/layers/attention/nsa_backend.py`: Switch
  NSA indexer to import from JIT kernel instead of AOT sgl_kernel.

- Update `sgl-kernel/python/sgl_kernel/top_k.py`: Add JIT fallback path
  controlled by SGLANG_USE_JIT_TOPK env var (default enabled). When JIT
  is available, fast_topk_v2 / fast_topk_transform_fused /
  fast_topk_transform_ragged_fused transparently delegate to JIT kernels.

- Add `sgl-kernel/tests/test_topk_jit.py`: Correctness tests covering
  basic, page-table, ragged, and trivial (length <= topk) cases across
  various batch sizes and sequence lengths up to 131K.

- Add `sgl-kernel/benchmarks/bench_topk_jit.py`: Latency benchmark
  comparing JIT multi-CTA vs AOT single-CTA kernels.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>