
[Feature] [Clone from PR#10645] Support deterministic inference with triton backend#10674

Closed
yushengsu-thu wants to merge 32 commits into sgl-project:main from yushengsu-thu:yusheng-thu_triton_det

Conversation

@yushengsu-thu (Collaborator) commented Sep 19, 2025

Motivation

Part of #10278
Clone from PR#10645
Reference: Defeating Non-Determinism in LLM Inference

Thanks to the earlier work from @Fridge003, @Edenzzzz, @hebiao064, and @Qiaolin-Yu in the following PRs:

FlashInfer: flashinfer-ai/flashinfer#1675
SGLang FlashInfer: #10645
SGLang DET POC: #10417
SGLang Support deterministic inference with FA3 backend: #10651

Context

Modify python/sglang/srt/layers/attention/triton_backend.py so the Triton attention backend produces deterministic, batch-invariant inference results (a minimal sketch of the idea follows the Modifications list).

Modifications

  1. python/sglang/srt/layers/attention/triton_backend.py
  2. python/sglang/srt/server_args.py
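
To make the triton_backend.py change concrete, here is a minimal sketch, under stated assumptions: the helper name num_kv_splits_deterministic and the tile-size value are placeholders for illustration, while SGLANG_ENABLE_DETERMINISTIC_INFERENCE is the environment variable named elsewhere in this PR. This is not the actual diff.

    import os

    # Deterministic mode: derive the number of KV-cache splits from a fixed
    # tile size, so the floating-point reduction order depends only on the
    # sequence length, never on how requests are batched together.
    DETERMINISTIC = os.environ.get("SGLANG_ENABLE_DETERMINISTIC_INFERENCE", "0") == "1"
    FIXED_SPLIT_TILE_SIZE = 256  # placeholder value for illustration

    def num_kv_splits_deterministic(seq_len: int, default_splits: int) -> int:
        if not DETERMINISTIC:
            # Normal path: the split count is tuned for throughput and may
            # vary with batch composition.
            return default_splits
        # Deterministic path: ceil-divide by a fixed tile size, which is
        # batch-invariant by construction.
        return (seq_len + FIXED_SPLIT_TILE_SIZE - 1) // FIXED_SPLIT_TILE_SIZE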

Accuracy Tests

Launch the SGLang server:

python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --attention-backend triton \
    --mem-fraction-static 0.7 \
    --host 0.0.0.0 --port 30000 \
    --enable-deterministic-inference

Single mode:

python -m sglang.test.test_deterministic --test-mode single --profile --n-trials 50
# --disable-radix-cache

Total samples: 50, Unique samples: 1

Prefix mode:

python3 -m sglang.test.test_deterministic --test-mode prefix --profile --n-trials 50
# --disable-radix-cache

Prompt 1: total samples: 557, Unique samples: 1
Prompt 2: total samples: 518, Unique samples: 1
Long prompt: total samples: 200, Unique samples: 1

Mixed mode:

python3 -m sglang.test.test_deterministic --test-mode mixed --profile --n-trials 50
# --disable-radix-cache

Prompt 1: total samples: 582, Unique samples: 1
Prompt 2: total samples: 460, Unique samples: 1
Long prompt: total samples: 233, Unique samples: 1

Benchmarking and Profiling

Checklist

@gemini-code-assist (Contributor) commented:

Summary of Changes

Hello @yushengsu-thu, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the SGLang framework by introducing comprehensive support for deterministic inference, particularly for the Triton attention backend. The core objective is to ensure that given the same input and configuration, the model consistently produces identical outputs, regardless of batching strategies or other non-deterministic factors. This is achieved through careful modifications to attention backends, scheduler logic, and the introduction of new configuration options, all validated by a dedicated test suite.

Highlights

  • Deterministic Inference Support: Introduced the capability for deterministic inference in SGLang, specifically extending it to the Triton attention backend.
  • Configuration via Environment Variables: Added new environment variables (SGLANG_ENABLE_DETERMINISTIC_INFERENCE, SGLANG_FLASHINFER_PREFILL_SPLIT_TILE_SIZE, SGLANG_FLASHINFER_DECODE_SPLIT_TILE_SIZE) to control deterministic behavior and attention split tile sizes.
  • FlashInfer Backend Enhancements: Modified the FlashInfer backend to utilize tensor cores for decoding and disable KV splitting for CUDA graphs when deterministic inference is active, ensuring consistent behavior.
  • Triton Backend Adaptations: Updated the Triton backend to enforce a fixed split tile size for KV cache operations in deterministic mode, promoting batch invariance.
  • Scheduler Alignment for Prefill: Implemented a truncation_align_size in the scheduler to ensure that prefill prefix lengths are always multiples of the attention split size, which is critical for deterministic results (see the sketch after this list).
  • New CLI Argument: Added --enable-deterministic-inference to server_args.py for easy activation of this feature.
  • Dedicated Deterministic Tests: Included a new test file (test_deterministic.py) to rigorously verify the deterministic output across various batching and prompt scenarios.
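
To make the prefill-alignment point concrete, here is a minimal sketch, assuming the scheduler rounds a matched prefix down to a multiple of truncation_align_size; the function name and example numbers are illustrative, not the actual scheduler code.

    def truncate_prefix_to_alignment(prefix_len: int, truncation_align_size: int) -> int:
        """Round a matched prefix length down to a multiple of the attention
        split size, so prefill split boundaries are identical regardless of
        how requests are batched or which prefix was cached."""
        return (prefix_len // truncation_align_size) * truncation_align_size

    # Example: with a 2048-token split size, a 5000-token cached prefix is
    # reused only up to 4096 tokens; the remaining tokens are recomputed.
    assert truncate_prefix_to_alignment(5000, 2048) == 4096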
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>.

Customization

To customize Gemini Code Assist for GitHub, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

@yushengsu-thu yushengsu-thu changed the title [Feature] Support deterministic inference with triton backend [Feature] [Clone from PR#10645] Support deterministic inference with triton backend Sep 19, 2025
@gemini-code-assist (Bot, Contributor) left a review comment:


Code Review

This pull request adds support for deterministic inference with the Triton backend, which is a great feature for reproducibility. The changes are extensive, touching several files to plumb through the deterministic inference configuration and adapt the attention and normalization layers. The addition of a new test file for deterministic inference is also a valuable contribution.

My review focuses on improving code clarity and maintainability. I've identified several areas in triton_backend.py with redundant logic, commented-out code, and development artifacts that should be cleaned up. I also suggest a small refactoring in layernorm.py to avoid direct environment variable access, improving modularity.

Six review comment threads were opened (all now outdated): five on python/sglang/srt/layers/attention/triton_backend.py and one on python/sglang/srt/layers/layernorm.py.

One thread quoted this hunk from the deterministic-inference setup in the FlashInfer backend:

    "SGLANG_FLASHINFER_DECODE_SPLIT_TILE_SIZE", 2048
    )
    self.disable_cuda_graph_kv_split = True
    global_config.flashinfer_workspace_size = 2048 * 1024 * 1024
A reviewer (Contributor) commented on this hunk:

also use env var to optionally change this?
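
As an illustration of this suggestion, the hard-coded workspace size could fall back to an environment-variable override. The variable name SGLANG_FLASHINFER_WORKSPACE_SIZE below is hypothetical, chosen only for this sketch:

    import os

    # Hypothetical env var (illustration only): allow overriding the
    # FlashInfer workspace size instead of hard-coding 2 GiB.
    DEFAULT_WORKSPACE_SIZE = 2048 * 1024 * 1024

    def get_flashinfer_workspace_size() -> int:
        return int(
            os.environ.get("SGLANG_FLASHINFER_WORKSPACE_SIZE", DEFAULT_WORKSPACE_SIZE)
        )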

@yushengsu-thu (Collaborator, Author) commented:

@Fridge003 @ispobock @Edenzzzz
Please review and comment on the new PR: #10694.
Sorry, I broke the code in this PR; I will revise it and push updates to the new PR above.

@yushengsu-thu (Collaborator, Author) commented:

@Edenzzzz, please review and comment on the new PR: #10694.
Sorry, I broke the code in this PR; I will revise it and push updates to the new PR above.
