[POC] Support deterministic inference #10417
Fridge003 wants to merge 16 commits into sgl-project:main
Conversation
encoder_lens=encoder_lens,
spec_info=spec_info,
fixed_split_size=-1,
disable_split_kv=True,
Should we also consider the number of heads?
Related comment: Dao-AILab/flash-attention#609 (comment)
Yes, this is more of an optional heuristic.
The FA3 backend is batch-invariant as well after this PR.
Single Mode: Total samples: 50, Unique samples: 1
Mixed Mode
Prefix Mode
And after I disabled enable_batch_invariant_mode, single mode fails:
Single Mode: Total samples: 50, Unique samples: 5
The Triton backend still needs some work.
Co-authors: @hebiao064, @Qiaolin-Yu
Motivation
#10278
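As background for why batch-invariant kernels are needed at all: floating-point addition is non-associative, so any attention or reduction kernel whose split/accumulation order depends on batch size can produce bitwise-different outputs for the same prompt. A minimal self-contained illustration (not from the PR):

```python
# Floating-point addition is non-associative, so changing the order in
# which a kernel accumulates partial sums (e.g. different split-KV
# schedules for different batch sizes) can change the result bitwise.
left = (1e16 + 1.0) + 1.0   # each 1.0 is absorbed: stays 1e16
right = 1e16 + (1.0 + 1.0)  # 2.0 is representable at this magnitude
print(left == right)        # prints False
```

This is why the PR pins kernel behavior (e.g. `fixed_split_size`, `disable_split_kv`) rather than letting the scheduler pick batch-size-dependent splits.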
Modifications
Reproduction
Environment: H200, CUDA 12.6, sglang 0.5.2, torch 2.8.0, Python 3.12.11
Launch qwen3-8b:
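A launch command along these lines (a sketch: `--model-path` and `--attention-backend` are standard `sglang.launch_server` options, but the determinism flag name for this POC is an assumption, not confirmed by the PR):

```shell
# Hypothetical launch sketch; the --enable-deterministic-inference flag
# name is assumed for this POC branch.
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3-8B \
  --attention-backend flashinfer \
  --enable-deterministic-inference
```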
Test determinism with single prompt and different batch sizes:
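The single-prompt test can presumably be invoked like the mixed-mode command later in this section (the `single` mode name mirrors the output labels above and is an assumption):

```shell
# Assumed to follow the same --test-mode convention as the mixed-mode test.
python3 -m sglang.test.test_deterministic --test-mode single
```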
Test determinism with mixture of short prompts and long prompts in each batch:
# Requires running multiple times
python3 -m sglang.test.test_deterministic --test-mode mixed

Prompt 1: total samples: 644, Unique samples: 1
Prompt 2: total samples: 423, Unique samples: 1
Long prompt: total samples: 208, Unique samples: 1

Test determinism with multiple prompts with different lengths of common prefix:
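The prefix test presumably follows the same command pattern (the `prefix` mode name is an assumption based on the other modes):

```shell
# Assumed to follow the same --test-mode convention as the other modes.
python3 -m sglang.test.test_deterministic --test-mode prefix
```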
Accuracy Tests
GSM8K
GPQA
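The accuracy runs can be reproduced against the launched server with sglang's bundled GSM8K harness (the module exists in sglang; the flag values here are illustrative, since the PR does not show the exact command):

```shell
# Illustrative flag values; run against an already-launched server.
python3 -m sglang.test.few_shot_gsm8k --num-questions 200 --parallel 64
```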
Temperature > 0
For launching, we need to specify the PyTorch sampling backend temporarily.
For testing, we set the temperature to a value greater than 0.
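A sketch of both steps: `--sampling-backend pytorch` is a real `sglang.launch_server` flag, and the `/generate` endpoint accepts `sampling_params`; the model, port, and temperature value are illustrative:

```shell
# Launch with the PyTorch sampling backend (determinism flag assumed as above).
python3 -m sglang.launch_server --model-path Qwen/Qwen3-8B \
  --sampling-backend pytorch --enable-deterministic-inference

# Then sample with temperature > 0 via the /generate endpoint.
curl http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "The capital of France is", "sampling_params": {"temperature": 0.7, "max_new_tokens": 32}}'
```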
Benchmarking and Profiling
Checklist