[Feature] Support deterministic inference with Batch Invariant Ops #10278

### Backbone: Attention Backend, batch_invariant library integration

- [x] Support cuda graph #10645 @Fridge003 
- [x] Support temperature > 0 #10678 @Qiaolin-Yu @Fridge003 
- [x] Attention Backend: FlashInfer: https://github.com/flashinfer-ai/flashinfer/pull/1675 #10645 @Edenzzzz @Fridge003
- [x] Attention Backend: Triton #10425 #10694 @yushengsu-thu @ispobock 
- [x] Attention Backend: FA3 #10651 @hebiao064
- [x] Move batch invariant triton kernels to sglang repo #10695 @hebiao064 
- [x] Add unit tests #11095 #10994 #11368 @skyzh
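
For readers new to the term, here is a toy sketch of what "batch invariant" means for a reduction. It is plain Python, not the kernels from the PRs above; `split_sum` and the batch-size heuristic are hypothetical names invented for illustration. The point is only that floating-point addition is not associative, so a kernel that picks its reduction split from the batch size can return different bits for the same row.

```python
def split_sum(row, num_splits):
    """Sum `row` as `num_splits` contiguous partial sums combined left to right."""
    step = -(-len(row) // num_splits)  # ceiling division
    total = 0.0
    for start in range(0, len(row), step):
        partial = 0.0
        for x in row[start:start + step]:
            partial += x
        total += partial
    return total


def batch_dependent_sums(batch):
    # Hypothetical heuristic: split the reduction only when the batch is small.
    num_splits = 2 if len(batch) == 1 else 1
    return [split_sum(row, num_splits) for row in batch]


def batch_invariant_sums(batch):
    # Same split count for every batch size, so each row is reduced the same way.
    return [split_sum(row, 2) for row in batch]


row = [1e16, 1.0, 1.0, 1.0]
other = [0.5, 0.25, 0.125, 0.0625]

alone_dep = batch_dependent_sums([row])[0]
batched_dep = batch_dependent_sums([row, other])[0]
alone_inv = batch_invariant_sums([row])[0]
batched_inv = batch_invariant_sums([row, other])[0]

print(alone_dep == batched_dep)  # False: result depends on batch size
print(alone_inv == batched_inv)  # True: result is independent of batch size
```

The fixed split count may leave performance on the table for small batches, which is presumably why the Perf section tracks kernel acceleration separately.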


### Communication (NCCL)
- [x] Support tensor parallelism with deterministic all-reduce kernels https://github.com/sgl-project/sglang/issues/10785 #10930 @JustinTong0323 @yuan-luo
- [x] Support tensor parallelism with deterministic all-reduce kernels (AMD) https://github.com/sgl-project/sglang/pull/15340 @sunxxuns
- [ ] Fix nondeterminism on Blackwell with TP4 #11513
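
As a rough illustration of why the all-reduce needs a deterministic kernel, the sketch below simulates per-rank contributions arriving in arbitrary order. It is a pure-Python toy, not NCCL and not the kernels from the linked PRs; the function names are hypothetical. Reducing in arrival order can give different results across runs, while reducing in a fixed rank order always gives the same bits.

```python
from itertools import permutations


def reduce_in_arrival_order(contributions):
    """Add values in whatever order they arrived."""
    total = 0.0
    for _rank, value in contributions:
        total += value
    return total


def reduce_in_rank_order(contributions):
    """Add values in ascending rank order, ignoring arrival order."""
    total = 0.0
    for _rank, value in sorted(contributions, key=lambda rv: rv[0]):
        total += value
    return total


# (rank, partial result) pairs, as if produced by three TP ranks.
per_rank = [(0, 0.1), (1, 0.2), (2, 0.3)]

arrival_results = {reduce_in_arrival_order(p) for p in permutations(per_rank)}
rank_results = {reduce_in_rank_order(p) for p in permutations(per_rank)}

print(len(arrival_results) > 1)  # True: arrival order changes the sum
print(len(rank_results) == 1)    # True: fixed order gives one result
```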



### Radix Cache Support
- [x] FA3: supported out of the box, since it already uses 1-stage prefill
- [x] Triton: supported by https://github.com/sgl-project/sglang/pull/11147 @hebiao064 @zminglei @byjiang1996
- [ ] FlashInfer support
- [ ] Make prefill with radix cache produce the same output as prefill without radix cache @hebiao064 @hanming-lu


### Model Support
- [x] Qwen3 Dense
- [x] Qwen3 MoE
- [x] DeepSeek V3 deterministic support @zminglei @b8zhong #12095
- [ ] Qwen3 Next (Linear Attention) @zminglei @vedantjh2 #12845

### Quantization
- [ ] Blockwise FP8 kernel https://github.com/sgl-project/sglang/pull/11491 @b8zhong
- [ ] Per-channel FP8 GEMM
- [ ] FP8 fused MoE
- [ ] NVFP4 GEMM/MoE

### Parallelism 

- [ ] Support DP Attention https://github.com/sgl-project/sglang/pull/11023
- [ ] Support EP

### Spec Decoding
- [ ] Speculative decoding drafters #8391 #9877

### Perf
- [x] Accelerate batch invariant triton kernels #12142 #12144


### Issues

### Usability & Documentation
- [x] Set default arguments (e.g., attention backend): https://github.com/sgl-project/sglang/pull/11801 @zminglei
- [x] Write documentation for https://docs.sglang.ai/ @zminglei #11956

### Related resources

https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/
https://github.com/thinking-machines-lab/batch_invariant_ops

