
set default attention backend for deterministic inference#11801

Merged
zhyncs merged 5 commits into sgl-project:main from zminglei:fix-server-arg
Oct 18, 2025
Conversation

zminglei (Collaborator) commented Oct 18, 2025

Motivation

Set a default deterministic-compatible attention backend when deterministic inference is enabled and no attention backend is specified.
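The fallback this PR adds can be sketched as follows. This is an illustrative reconstruction, not the actual `_handle_deterministic_inference` body in `server_args.py`; the function name `resolve_attention_backend` is hypothetical, and only the backend list and the `"fa3"` default come from the PR itself:

```python
# Illustrative sketch (not sglang's exact code) of the fallback added here:
# pick a deterministic-compatible attention backend only when the user has
# not chosen one, and keep rejecting explicit unsupported choices.
DETERMINISTIC_ATTENTION_BACKEND_CHOICES = ["flashinfer", "fa3", "triton"]


def resolve_attention_backend(attention_backend, deterministic_enabled):
    """Return the attention backend to use under deterministic inference."""
    if not deterministic_enabled:
        return attention_backend
    if attention_backend is None:
        # Before this PR an unset backend raised ValueError; now it falls
        # back to a supported default.
        return "fa3"
    if attention_backend not in DETERMINISTIC_ATTENTION_BACKEND_CHOICES:
        raise ValueError(
            f"Currently only {DETERMINISTIC_ATTENTION_BACKEND_CHOICES} attention "
            "backends are supported for deterministic inference."
        )
    return attention_backend
```

With this change the "Before" traceback below only occurs for an explicitly chosen unsupported backend, never for the unset default.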

Tested on a single H100
Before:

python3 -m sglang.launch_server --model-path /shared/public/elr-models/Qwen/Qwen3-8B/2069b3fae1114555f3c020c81410e51fa0f656f2 --enable-deterministic-inference
/home/jobuser/zminglei/sglang/venv/lib/python3.10/site-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
`torch_dtype` is deprecated! Use `dtype` instead!
WARNING:sglang.srt.server_args:Sampling backend is set to pytorch for deterministic inference.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/jobuser/zminglei/sglang/python/sglang/launch_server.py", line 11, in <module>
    server_args = prepare_server_args(sys.argv[1:])
  File "/home/jobuser/zminglei/sglang/python/sglang/srt/server_args.py", line 3489, in prepare_server_args
    return ServerArgs.from_cli_args(raw_args)
  File "/home/jobuser/zminglei/sglang/python/sglang/srt/server_args.py", line 3138, in from_cli_args
    return cls(**{attr: getattr(args, attr) for attr in attrs})
  File "<string>", line 258, in __init__
  File "/home/jobuser/zminglei/sglang/python/sglang/srt/server_args.py", line 580, in __post_init__
    self._handle_deterministic_inference()
  File "/home/jobuser/zminglei/sglang/python/sglang/srt/server_args.py", line 1415, in _handle_deterministic_inference
    raise ValueError(
ValueError: Currently only ['flashinfer', 'fa3', 'triton'] attention backends are supported for deterministic inference.

After:

python3 -m sglang.launch_server --model-path /shared/public/elr-models/Qwen/Qwen3-8B/2069b3fae1114555f3c020c81410e51fa0f656f2 --enable-deterministic-inference

WARNING:sglang.srt.server_args:Attention backend not specified. Falling back to 'fa3' for deterministic inference. You can explicitly set --attention-backend to one of ['flashinfer', 'fa3', 'triton'].
[2025-10-17 21:28:32] INFO:     Started server process [14033]
[2025-10-17 21:28:32] INFO:     Waiting for application startup.
[2025-10-17 21:28:32] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
[2025-10-17 21:28:32] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
[2025-10-17 21:28:32] INFO:     Application startup complete.
[2025-10-17 21:28:32] INFO:     Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2025-10-17 21:28:33] INFO:     127.0.0.1:54594 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-10-17 21:28:33] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-10-17 21:28:35] INFO:     127.0.0.1:54600 - "POST /generate HTTP/1.1" 200 OK
[2025-10-17 21:28:35] The server is fired up and ready to roll!
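The warning in the log also points out that the default can be overridden. A hypothetical invocation pinning one of the supported backends explicitly (model path shortened for illustration) might look like:

```shell
# Explicitly pin a supported deterministic backend instead of relying on
# the fallback default. Model path is illustrative, not the one tested above.
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3-8B \
  --enable-deterministic-inference \
  --attention-backend triton
```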

Modifications

Accuracy Tests

python benchmark/gsm8k/bench_sglang.py --data-path /shared/public/data/gsm8k/test.jsonl
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:24<00:00,  8.13it/s]
Accuracy: 0.955
Invalid: 0.000
Latency: 24.666 s
Output throughput: 964.265 token/s

Benchmarking and Profiling

Checklist

@zminglei zminglei marked this pull request as ready for review October 18, 2025 04:29
@hebiao064 hebiao064 added the deterministic Issues on deterministic inference/kernels label Oct 18, 2025
Comment thread on python/sglang/srt/server_args.py (outdated)
@zminglei zminglei changed the title add default attention backend as fa3 for deterministic inference add default attention backend as flashinfer for deterministic inference Oct 18, 2025
@zminglei zminglei changed the title add default attention backend as flashinfer for deterministic inference set default attention backend as flashinfer for deterministic inference Oct 18, 2025
@zminglei zminglei changed the title set default attention backend as flashinfer for deterministic inference set default attention backend for deterministic inference Oct 18, 2025
f"You can explicitly set --attention-backend to one of {DETERMINISTIC_ATTENTION_BACKEND_CHOICES}."
)
else:
self.attention_backend = "fa3"
Collaborator

I think that for most use cases this should be good, but what about sm120 (RTX 50xx)?

Collaborator Author

Ah, for that I just updated it to set flashinfer for sm120 as well, since it is also the Blackwell architecture. Wdyt?
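The capability-aware default discussed in this thread can be sketched like this. The helper name and the exact threshold are illustrative assumptions, not sglang's code; only the sm120-gets-flashinfer behavior comes from the discussion above:

```python
# Hypothetical sketch of the capability-aware default discussed here:
# Blackwell-generation GPUs (sm100/sm120) default to "flashinfer", while
# e.g. Hopper (sm90, H100) defaults to "fa3". Threshold is an assumption.
def default_deterministic_backend(major, minor):
    """Pick a default attention backend from the CUDA compute capability."""
    if (major, minor) >= (10, 0):  # sm100 (B200) and sm120 (RTX 50xx)
        return "flashinfer"
    return "fa3"  # e.g. sm90 (H100), as tested in this PR
```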

Collaborator

@yzh119 can you help check? Thanks.

@zhyncs zhyncs merged commit f4488e9 into sgl-project:main Oct 18, 2025
60 of 70 checks passed

Labels

deterministic (Issues on deterministic inference/kernels), run-ci