
[FEAT] Support GGUF format #2215

Merged
merrymercy merged 14 commits into sgl-project:main from zhengy001:zyang_dev on Nov 30, 2024

Conversation

@zhengy001
Contributor

@zhengy001 zhengy001 commented Nov 27, 2024

Motivation

#1616

Modifications

Support GGUF format
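A minimal sketch of how a loader can recognize a GGUF checkpoint, either by the .gguf file extension or by the 4-byte magic "GGUF" that the GGUF spec places at the start of the file. This is illustrative only, not sglang's actual detection code.

```python
# Hedged sketch, not sglang's real loader: detect a GGUF checkpoint by
# extension or by the leading 4-byte magic defined in the GGUF spec.
GGUF_MAGIC = b"GGUF"


def looks_like_gguf(path: str, header: bytes = b"") -> bool:
    # Check the extension first; fall back to sniffing the file header.
    return path.endswith(".gguf") or header[:4] == GGUF_MAGIC
```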

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@zhengy001
Contributor Author

lm_head.weight is used directly in many places; however, vLLM changes it to qweight for GGUF. This would be an issue.

@merrymercy
Contributor

Thanks for the contributions. Can you fix the CI errors?

@zhengy001 zhengy001 force-pushed the zyang_dev branch 2 times, most recently from 5c616a5 to 2cffa70 Compare November 27, 2024 13:10
@zhengy001
Contributor Author

> Thanks for the contributions. Can you fix the CI errors?

How to trigger the CI?

@zhengy001
Contributor Author

> lm_head.weight is directly used in many places, however, vllm changes it to be qweight for gguf. This would be an issue.

Pass lm_head to LogitsProcessor and check the weight inside it.
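A minimal sketch of the approach described above: instead of reading lm_head.weight at many call sites, pass the whole lm_head module to the logits processor and inspect it there. GGUF-quantized layers store packed data under `qweight` rather than `weight`. The class names here are illustrative placeholders, not sglang's real classes.

```python
# Hedged sketch, assuming illustrative stand-in classes (not sglang's).
class PlainHead:
    def __init__(self):
        self.weight = [[0.0, 0.0], [0.0, 0.0]]  # ordinary fp weight


class GGUFQuantizedHead:
    def __init__(self):
        self.qweight = [[0, 0], [0, 0]]  # packed quantized data, no .weight


def head_is_gguf_quantized(head) -> bool:
    # The logits processor inspects the module rather than assuming
    # .weight exists, which is what breaks when vLLM swaps in qweight.
    return hasattr(head, "qweight")
```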

@merrymercy
Contributor

merrymercy commented Nov 27, 2024

@zhengy001 CI won't be triggered for you automatically because you are a first-time contributor. You can send a random typo fix PR and I can merge that for you so your future commits can trigger CI automatically.

@merrymercy
Contributor

@zhengy001 Can you fix the CI errors?

@zhengy001
Contributor Author

> @zhengy001 Can you fix the CI errors?

@merrymercy Sure, working on it.

Comment thread test/srt/run_suite.py Outdated
Comment thread test/srt/test_gguf.py Outdated
Comment thread python/sglang/srt/server_args.py Outdated
Comment thread python/sglang/srt/layers/vocab_parallel_embedding.py Outdated
@merrymercy
Contributor

#2269 adds you as a new contributor so your future commits will trigger CI automatically

@zhengy001
Contributor Author

> #2269 adds you as a new contributor so your future commits will trigger CI automatically

@merrymercy :)

Comment thread python/sglang/srt/models/olmo.py Outdated
Contributor Author

There won't be an "lm_head.weight" entry if self.config.tie_word_embeddings is True.
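A minimal sketch of the guard this implies for weight loading: with tied word embeddings, the checkpoint has no separate "lm_head.weight" entry, so the loader must skip it and reuse the input embedding instead. The function name is a placeholder, not the actual olmo.py code.

```python
# Hedged sketch, assuming a hypothetical helper name.
def should_skip_lm_head(weight_name: str, tie_word_embeddings: bool) -> bool:
    # With tied embeddings the output projection reuses
    # embed_tokens.weight, so no lm_head.weight arrives from the file.
    return tie_word_embeddings and weight_name == "lm_head.weight"
```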

Comment thread test/srt/test_gguf.py Outdated
Contributor Author

Compared the result with vLLM's. Please suggest if there is a better way.
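The comparison strategy above can be sketched as: run the same prompts through both engines and check that the generated texts agree. This is illustrative pseudologic, not the actual test_gguf.py code.

```python
# Hedged sketch of an output-equality check between two engines.
def outputs_match(sglang_outputs: list, vllm_outputs: list) -> bool:
    if len(sglang_outputs) != len(vllm_outputs):
        return False
    # Exact string match per prompt; looser metrics (e.g. logprob
    # tolerance) would also be reasonable here.
    return all(a == b for a, b in zip(sglang_outputs, vllm_outputs))
```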

@merrymercy merrymercy enabled auto-merge (squash) November 30, 2024 07:47
Comment thread python/sglang/srt/layers/logits_processor.py Outdated
@merrymercy merrymercy disabled auto-merge November 30, 2024 08:44
@merrymercy merrymercy merged commit 883c955 into sgl-project:main Nov 30, 2024
merrymercy added a commit that referenced this pull request Dec 1, 2024
timethink pushed a commit to timethink/sglang that referenced this pull request Mar 9, 2025
Co-authored-by: Yang Zheng(SW)(Alex) <you@example.com>
hammersam added a commit to hammersam/sglang that referenced this pull request Mar 8, 2026
Port the multi-CTA radix-based top-k kernel from flashinfer PR
flashinfer-ai/flashinfer#2215 into sglang as a JIT-compiled kernel.
This replaces the existing AOT single-CTA top-k implementation for NSA
attention, providing better performance on long sequences (32K+) where
the multi-CTA path activates.

Key changes:

- Add `python/sglang/jit_kernel/topk.py`: Python API exposing three
  JIT top-k variants (basic, page-table transform, ragged transform)
  with workspace management and lazy compilation via `cache_once`.

- Add `python/sglang/jit_kernel/csrc/elementwise/topk.cuh`: CUDA wrapper
  providing TVM FFI entry points that dispatch to the flashinfer adaptive
  top-k kernels (TopKDispatch, TopKPageTableTransformDispatch,
  TopKRaggedTransformDispatch).

- Add `python/sglang/jit_kernel/include/sgl_kernel/topk_fi.cuh`: Core
  CUDA implementation adapted from flashinfer, featuring:
  - 8-bit radix selection algorithm with multi-CTA support for large
    sequences (threshold configurable, default 32K)
  - Support for float32, float16, and bfloat16 input types
  - row_starts parameter for ragged input score layouts (sglang-specific)
  - Three output modes: indices-only, page-table lookup, and ragged
    offset addition

- Update `python/sglang/srt/layers/attention/nsa_backend.py`: Switch
  NSA indexer to import from JIT kernel instead of AOT sgl_kernel.

- Update `sgl-kernel/python/sgl_kernel/top_k.py`: Add JIT fallback path
  controlled by SGLANG_USE_JIT_TOPK env var (default enabled). When JIT
  is available, fast_topk_v2 / fast_topk_transform_fused /
  fast_topk_transform_ragged_fused transparently delegate to JIT kernels.

- Add `sgl-kernel/tests/test_topk_jit.py`: Correctness tests covering
  basic, page-table, ragged, and trivial (length <= topk) cases across
  various batch sizes and sequence lengths up to 131K.

- Add `sgl-kernel/benchmarks/bench_topk_jit.py`: Latency benchmark
  comparing JIT multi-CTA vs AOT single-CTA kernels.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>