
use flashinfer.sampling#18696

Merged
BBuf merged 5 commits into sgl-project:main from pansicheng:flashinfer-sampling on Feb 26, 2026

Conversation

@pansicheng
Collaborator

Motivation

#17865 moves (external) flashinfer/csrc/sampling.cu.
This PR calls flashinfer directly from Python, instead of compiling the operators into sgl_kernel.
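For readers unfamiliar with the operation these kernels implement, below is a minimal pure-Python sketch of joint top-k/top-p (nucleus) sampling. It is illustrative only: the function name and structure are mine, not flashinfer's; the real CUDA kernels operate on batched GPU tensors.

```python
import random

def top_k_top_p_sample(probs, k, p, rng=None):
    """Illustrative pure-Python reference for joint top-k/top-p sampling.

    Not the flashinfer CUDA implementation; a sketch of the same operation
    on a single probability distribution given as a Python list.
    """
    rng = rng or random.Random(0)
    # Keep the k highest-probability token ids.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = order[:k]
    # Within the top-k set, keep the smallest prefix whose mass reaches p.
    mass, nucleus = 0.0, []
    for i in kept:
        nucleus.append(i)
        mass += probs[i]
        if mass >= p:
            break
    # Renormalize over the nucleus and sample one token id.
    total = sum(probs[i] for i in nucleus)
    r, acc = rng.random() * total, 0.0
    for i in nucleus:
        acc += probs[i]
        if r <= acc:
            return i
    return nucleus[-1]
```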

Modifications

Accuracy Tests

Unit test: python -m pytest sgl-kernel/tests/test_sampling.py -s

gsm8k

python -m sglang.launch_server --model /data/Qwen3-8B/
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 128
100%|███████████████████████████████████████████████████████████████| 1319/1319 [01:08<00:00, 19.13it/s]
Accuracy: 0.904
Invalid: 0.000
Latency: 68.988 s
Output throughput: 2507.349 token/s

python -m sglang.launch_server --model /data/Qwen3-8B/ --disable-radix-cache
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 128
100%|███████████████████████████████████████████████████████████████| 1319/1319 [06:12<00:00,  3.54it/s]
Accuracy: 0.901
Invalid: 0.000
Latency: 372.805 s
Output throughput: 460.144 token/s

Benchmarking and Profiling

python benchmark/bench_top_k_top_p_sampling.py

this patch
============================================================
Starting performance benchmark...
top-k-top-p-joint-sampling-performance:
    batch_size  vocab_size    p  Torch Reference  SGL Kernel
0         16.0       111.0  0.1      3517.440081   34.816001
1         16.0       111.0  0.5      3519.488096   29.696001
2         16.0     32000.0  0.1      4187.648058  191.487998
3         16.0     32000.0  0.5      4232.192039  163.839996
4         64.0       111.0  0.1     14012.415886   38.911998
5         64.0       111.0  0.5     13886.464119   34.816001
6         64.0     32000.0  0.1     16578.559875  243.711993
7         64.0     32000.0  0.5     16712.703705  214.528002
8        128.0       111.0  0.1     27687.936783   47.104001
9        128.0       111.0  0.5     27639.808655   39.935999
10       128.0     32000.0  0.1     33045.503616  319.487989
11       128.0     32000.0  0.5     33159.679413  273.407996

main
============================================================
Starting performance benchmark...
top-k-top-p-joint-sampling-performance:
    batch_size  vocab_size    p  Torch Reference  SGL Kernel
0         16.0       111.0  0.1      3526.655912   33.792000
1         16.0       111.0  0.5      3526.143909   28.672000
2         16.0     32000.0  0.1      4183.552027  226.303995
3         16.0     32000.0  0.5      4174.335957  193.024002
4         64.0       111.0  0.1     13879.296303   37.888002
5         64.0       111.0  0.5     13943.807602   33.792000
6         64.0     32000.0  0.1     16728.063583  281.087995
7         64.0     32000.0  0.5     16711.679459  246.784002
8        128.0       111.0  0.1     27573.247910   45.056000
9        128.0       111.0  0.5     27688.959122   39.935999
10       128.0     32000.0  0.1     33324.544907  370.687991
11       128.0     32000.0  0.5     33060.352325  314.368010
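As a quick sanity check on the two tables above, the speedup of this patch over main in the SGL Kernel column (vocab_size=32000 rows) works out to roughly 1.15-1.18x:

```python
# SGL Kernel timings for the vocab_size=32000 rows, keyed by
# (batch_size, p), copied from the two benchmark tables above.
patch = {(16, 0.1): 191.487998, (16, 0.5): 163.839996,
         (64, 0.1): 243.711993, (64, 0.5): 214.528002,
         (128, 0.1): 319.487989, (128, 0.5): 273.407996}
main = {(16, 0.1): 226.303995, (16, 0.5): 193.024002,
        (64, 0.1): 281.087995, (64, 0.5): 246.784002,
        (128, 0.1): 370.687991, (128, 0.5): 314.368010}
# Speedup of this patch over main (higher is better).
speedup = {k: round(main[k] / patch[k], 3) for k in patch}
```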

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.


@BBuf
Collaborator

BBuf commented Feb 14, 2026

It's cool to see some performance improvement.

Collaborator

@BBuf BBuf left a comment


Good job.

@BBuf
Collaborator

BBuf commented Feb 14, 2026

/tag-and-rerun-ci

Reviewed diff context:

```python
if torch.any(torch.isnan(probs)):
    raise ValueError("Input probs contains NaN.")
return _top_p_sampling_from_probs_internal(
return get_sampling_module().top_p_sampling_from_probs(
```
Collaborator


I do not recommend directly getting the module from flashinfer (it's not a public API). You may take a look at my implementation in mini-sglang as a reference:
https://github.com/sgl-project/mini-sglang/blob/82722ad6dc85df766278c48061d768b4117a3bd4/python/minisgl/engine/sample.py#L24-L45
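A hedged sketch of the pattern the review suggests: import the public flashinfer.sampling entry point rather than fetching flashinfer's internal module, with a fallback when flashinfer is not installed. The fallback body below is my simplified pure-Python stand-in for illustration, not sgl-kernel's actual code; the real flashinfer function operates on CUDA tensors.

```python
import random

try:
    # Public API, as recommended in the review.
    from flashinfer.sampling import top_p_sampling_from_probs
except ImportError:
    # Illustrative pure-Python fallback (my stand-in, not sgl-kernel's code):
    # nucleus sampling over a single distribution given as a Python list.
    def top_p_sampling_from_probs(probs, top_p, rng=None):
        rng = rng or random.Random(0)
        order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        mass, nucleus = 0.0, []
        for i in order:
            nucleus.append(i)
            mass += probs[i]
            if mass >= top_p:
                break
        total = sum(probs[i] for i in nucleus)
        r, acc = rng.random() * total, 0.0
        for i in nucleus:
            acc += probs[i]
            if r <= acc:
                return i
        return nucleus[-1]
```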


Collaborator Author


Fixed, PTAL

@BBuf
Collaborator

BBuf commented Feb 24, 2026

@DarkSharpness Any other advice?

@DarkSharpness
Collaborator

Why don't we apply the flashinfer kernel directly at the use site? (Right now we are modifying sgl-kernel in place.)

@BBuf
Collaborator

BBuf commented Feb 25, 2026

Why don't we apply the flashinfer kernel directly at the use site? (Right now we are modifying sgl-kernel in place.)

I also think that's more appropriate. @pansicheng Can you make this change? Thanks!

@pansicheng
Collaborator Author

Why don't we apply the flashinfer kernel directly at the use site? (Right now we are modifying sgl-kernel in place.)

I also think that's more appropriate. @pansicheng Can you make this change? Thanks!

Fixed, PTAL

Collaborator

@BBuf BBuf left a comment


LGTM now. Waiting for CI.

@BBuf
Collaborator

BBuf commented Feb 25, 2026

/tag-and-rerun-ci

@BBuf
Collaborator

BBuf commented Feb 26, 2026

@BBuf BBuf merged commit 2ad475b into sgl-project:main Feb 26, 2026
240 of 267 checks passed
klhhhhh pushed a commit to klhhhhh/sglang that referenced this pull request Feb 26, 2026
magicYang1573 pushed a commit to magicYang1573/sglang that referenced this pull request Mar 9, 2026
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
