fix Sliding Window and Sink Token Support in Unified Kernel #11634

Merged
hebiao064 merged 5 commits into sgl-project:bhe/1_stage_triton_kernel from zminglei:triton-sliding-window on Oct 14, 2025

@zminglei
Collaborator

Motivation

The unified kernel currently handles models that use sliding window attention and sink tokens incorrectly.
This PR fixes the kernel for such models, for example gpt-oss-20b.

python3 -m sglang.launch_server --model-path /shared/public/elr-models/openai/gpt-oss-20b/6cd4d0ffba39483fe4fb0f5637831f717dafca35/ --attention-backend triton --enable-deterministic-inference
Before:

lm_eval --model local-chat-completions --model_args model=gpt-oss,base_url=http://127.0.0.1:30000/v1/chat/completions,num_concurrent=128,timeout=999999,max_gen_toks=2048 --tasks gsm8k --batch_size 1024 --apply_chat_template --num_fewshot 1 --limit 200

INFO:lm_eval.loggers.evaluation_tracker:Output path not provided, skipping saving results aggregated
local-chat-completions (model=gpt-oss,base_url=http://127.0.0.1:30000/v1/chat/completions,num_concurrent=128,timeout=999999,max_gen_toks=2048), gen_kwargs: (None), limit: 200.0, num_fewshot: 1, batch_size: 1024
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     1|exact_match|↑  |    0|±  |     0|
|     |       |strict-match    |     1|exact_match|↑  |    0|±  |     0|

After:

lm_eval --model local-chat-completions --model_args model=gpt-oss,base_url=http://127.0.0.1:30000/v1/chat/completions,num_concurrent=128,timeout=999999,max_gen_toks=2048 --tasks gsm8k --batch_size 1024 --apply_chat_template --num_fewshot 1 --limit 200

INFO:lm_eval.loggers.evaluation_tracker:Output path not provided, skipping saving results aggregated
local-chat-completions (model=gpt-oss,base_url=http://127.0.0.1:30000/v1/chat/completions,num_concurrent=128,timeout=999999,max_gen_toks=2048), gen_kwargs: (None), limit: 200.0, num_fewshot: 1, batch_size: 1024
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     1|exact_match|↑  | 0.84|±  |0.0260|
|     |       |strict-match    |     1|exact_match|↑  | 0.03|±  |0.0121|

Modifications

Sink Token Fix
The unified deterministic kernel (_fwd_kernel_unified) defined the HAS_SINK parameter but never used it, so the softmax was computed incorrectly whenever sink tokens were present.
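
To illustrate why ignoring the sink changes the result, here is a minimal pure-Python sketch. It is not the Triton kernel code. It assumes the sink is a single extra logit per head that joins the softmax normalization but contributes no value vector; the function name `softmax_weights` and the numbers are hypothetical.

```python
import math

def softmax_weights(scores, sink=None):
    """Attention weights over `scores`.

    Assumption: an optional sink logit only enlarges the normalizer;
    it has no value vector, so it gets no entry in the returned weights.
    """
    # Subtract the running maximum (including the sink) for numerical stability.
    m = max(scores) if sink is None else max(max(scores), sink)
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps)
    if sink is not None:
        denom += math.exp(sink - m)  # the term a kernel that ignores HAS_SINK would miss
    return [e / denom for e in exps]

scores = [1.0, 2.0, 3.0]          # hypothetical q.k logits for one query
without_sink = softmax_weights(scores)
with_sink = softmax_weights(scores, sink=2.5)

print(sum(without_sink))  # weights sum to 1 when there is no sink
print(sum(with_sink))     # weights sum to less than 1: the sink absorbs the rest
```

Under this assumption the relative ordering of the weights is unchanged, but every weight shrinks, so skipping the sink term rescales the attention output for every query.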

Sliding Window Fix
The sliding window attention mask compared unified array indices against absolute sequence positions, so the window was applied to the wrong keys.
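
The index mismatch can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the kernel's actual layout: `kv_start` (the absolute position of the first key in the unified array), the helper `in_window`, and the window convention `q_pos - k_pos <= window` are all hypothetical.

```python
def in_window(q_pos, k_pos, window):
    """True if the key is causal and within `window` tokens of the query.

    Both arguments must be absolute sequence positions.
    """
    return k_pos <= q_pos and q_pos - k_pos <= window

kv_start = 100   # hypothetical: unified slot 0 holds absolute position 100
num_keys = 8     # unified slots 0..7 hold absolute positions 100..107
q_pos = 107      # absolute position of the query token
window = 4

# Mismatched comparison: the unified slot index is treated as an absolute position.
buggy = [in_window(q_pos, i, window) for i in range(num_keys)]

# Consistent comparison: map each slot to its absolute position first.
fixed = [in_window(q_pos, kv_start + i, window) for i in range(num_keys)]

print(buggy)
print(fixed)
```

In this toy setup the mismatched comparison masks out every key, while the consistent one keeps the last five slots (absolute positions 103 to 107).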

Accuracy Tests

Benchmarking and Profiling

Checklist

@zminglei zminglei marked this pull request as ready for review October 14, 2025 21:50
@hebiao064 hebiao064 merged commit ec2a21c into sgl-project:bhe/1_stage_triton_kernel Oct 14, 2025
2 checks passed