
[Feature] [Clone from PR#10645] Support deterministic inference with triton backend#10674

Closed
yushengsu-thu wants to merge 32 commits into sgl-project:main from yushengsu-thu:yusheng-thu_triton_det

Conversation

@yushengsu-thu (Collaborator) commented Sep 19, 2025

Motivation

Part of #10278
Clone from PR#10645
Reference: Defeating Non-Determinism in LLM Inference

Thanks to the earlier work from @Fridge003, @Edenzzzz, @hebiao064, and @Qiaolin-Yu in the following PRs:

FlashInfer: flashinfer-ai/flashinfer#1675
SGLang FlashInfer: #10645
SGLang DET POC: #10417
SGLang Support deterministic inference with FA3 backend: #10651

Context

Modify python/sglang/srt/layers/attention/triton_backend.py so the Triton attention backend produces deterministic, batch-invariant inference results (a minimal sketch of the idea follows the Modifications list).

Modifications

  1. python/sglang/srt/layers/attention/triton_backend.py
  2. python/sglang/srt/server_args.py
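
To make the triton_backend.py change concrete, here is a minimal sketch, under stated assumptions: the helper name num_kv_splits_deterministic and the tile-size value are placeholders for illustration, while SGLANG_ENABLE_DETERMINISTIC_INFERENCE is the environment variable named elsewhere in this PR. This is not the actual diff.

    import os

    # Deterministic mode: derive the number of KV-cache splits from a fixed
    # tile size, so the floating-point reduction order depends only on the
    # sequence length, never on how requests are batched together.
    DETERMINISTIC = os.environ.get("SGLANG_ENABLE_DETERMINISTIC_INFERENCE", "0") == "1"
    FIXED_SPLIT_TILE_SIZE = 256  # placeholder value for illustration

    def num_kv_splits_deterministic(seq_len: int, default_splits: int) -> int:
        if not DETERMINISTIC:
            # Normal path: the split count is tuned for throughput and may
            # vary with batch composition.
            return default_splits
        # Deterministic path: ceil-divide by a fixed tile size, which is
        # batch-invariant by construction.
        return (seq_len + FIXED_SPLIT_TILE_SIZE - 1) // FIXED_SPLIT_TILE_SIZE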

Accuracy Tests

Launch the SGLang server:

python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --attention-backend triton \
    --mem-fraction-static 0.7 \
    --host 0.0.0.0 --port 30000 \
    --enable-deterministic-inference

Single mode:

python -m sglang.test.test_deterministic --test-mode single --profile --n-trials 50
# --disable-radix-cache

Total samples: 50, Unique samples: 1

Prefix mode:

python3 -m sglang.test.test_deterministic --test-mode prefix --profile --n-trials 50
# --disable-radix-cache

Prompt 1: total samples: 557, Unique samples: 1
Prompt 2: total samples: 518, Unique samples: 1
Long prompt: total samples: 200, Unique samples: 1

Mixed mode:

python3 -m sglang.test.test_deterministic --test-mode mixed --profile --n-trials 50
# --disable-radix-cache

Prompt 1: total samples: 582, Unique samples: 1
Prompt 2: total samples: 460, Unique samples: 1
Long prompt: total samples: 233, Unique samples: 1

Benchmarking and Profiling

Checklist

@gemini-code-assist (Contributor) commented:

Summary of Changes

Hello @yushengsu-thu, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the SGLang framework by introducing comprehensive support for deterministic inference, particularly for the Triton attention backend. The core objective is to ensure that given the same input and configuration, the model consistently produces identical outputs, regardless of batching strategies or other non-deterministic factors. This is achieved through careful modifications to attention backends, scheduler logic, and the introduction of new configuration options, all validated by a dedicated test suite.

Highlights

  • Deterministic Inference Support: Introduced the capability for deterministic inference in SGLang, specifically extending it to the Triton attention backend.
  • Configuration via Environment Variables: Added new environment variables (SGLANG_ENABLE_DETERMINISTIC_INFERENCE, SGLANG_FLASHINFER_PREFILL_SPLIT_TILE_SIZE, SGLANG_FLASHINFER_DECODE_SPLIT_TILE_SIZE) to control deterministic behavior and attention split tile sizes.
  • FlashInfer Backend Enhancements: Modified the FlashInfer backend to utilize tensor cores for decoding and disable KV splitting for CUDA graphs when deterministic inference is active, ensuring consistent behavior.
  • Triton Backend Adaptations: Updated the Triton backend to enforce a fixed split tile size for KV cache operations in deterministic mode, promoting batch invariance.
  • Scheduler Alignment for Prefill: Implemented a truncation_align_size in the scheduler to ensure that prefill prefix lengths are always multiples of the attention split size, which is critical for deterministic results (see the sketch after this list).
  • New CLI Argument: Added --enable-deterministic-inference to server_args.py for easy activation of this feature.
  • Dedicated Deterministic Tests: Included a new test file (test_deterministic.py) to rigorously verify the deterministic output across various batching and prompt scenarios.
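
To make the prefill-alignment point concrete, here is a minimal sketch, assuming the scheduler rounds a matched prefix down to a multiple of truncation_align_size; the function name and example numbers are illustrative, not the actual scheduler code.

    def truncate_prefix_to_alignment(prefix_len: int, truncation_align_size: int) -> int:
        """Round a matched prefix length down to a multiple of the attention
        split size, so prefill split boundaries are identical regardless of
        how requests are batched or which prefix was cached."""
        return (prefix_len // truncation_align_size) * truncation_align_size

    # Example: with a 2048-token split size, a 5000-token cached prefix is
    # reused only up to 4096 tokens; the remaining tokens are recomputed.
    assert truncate_prefix_to_alignment(5000, 2048) == 4096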
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>.

Customization

To customize Gemini Code Assist for GitHub, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

@yushengsu-thu yushengsu-thu changed the title [Feature] Support deterministic inference with triton backend [Feature] [Clone from PR#10645] Support deterministic inference with triton backend Sep 19, 2025
@gemini-code-assist (Bot, Contributor) left a review comment:


Code Review

This pull request adds support for deterministic inference with the Triton backend, which is a great feature for reproducibility. The changes are extensive, touching several files to plumb through the deterministic inference configuration and adapt the attention and normalization layers. The addition of a new test file for deterministic inference is also a valuable contribution.

My review focuses on improving code clarity and maintainability. I've identified several areas in triton_backend.py with redundant logic, commented-out code, and development artifacts that should be cleaned up. I also suggest a small refactoring in layernorm.py to avoid direct environment variable access, improving modularity.

Six review comment threads were opened (all now outdated): five on python/sglang/srt/layers/attention/triton_backend.py and one on python/sglang/srt/layers/layernorm.py.

One thread quoted this hunk from the deterministic-inference setup in the FlashInfer backend:

    "SGLANG_FLASHINFER_DECODE_SPLIT_TILE_SIZE", 2048
    )
    self.disable_cuda_graph_kv_split = True
    global_config.flashinfer_workspace_size = 2048 * 1024 * 1024
A reviewer (Contributor) commented on this hunk:

also use env var to optionally change this?
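
As an illustration of this suggestion, the hard-coded workspace size could fall back to an environment-variable override. The variable name SGLANG_FLASHINFER_WORKSPACE_SIZE below is hypothetical, chosen only for this sketch:

    import os

    # Hypothetical env var (illustration only): allow overriding the
    # FlashInfer workspace size instead of hard-coding 2 GiB.
    DEFAULT_WORKSPACE_SIZE = 2048 * 1024 * 1024

    def get_flashinfer_workspace_size() -> int:
        return int(
            os.environ.get("SGLANG_FLASHINFER_WORKSPACE_SIZE", DEFAULT_WORKSPACE_SIZE)
        )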

@yushengsu-thu (Collaborator, Author) commented:

@Fridge003 @ispobock @Edenzzzz
Please review and comment on the new PR: #10694.
Sorry, I broke the code in this PR; I will revise it and push updates to the new PR above.

@yushengsu-thu (Collaborator, Author) commented:

@Edenzzzz, please review and comment on the new PR: #10694.
Sorry, I broke the code in this PR; I will revise it and push updates to the new PR above.
