
set default attention backend for deterministic inference#11801

Merged
zhyncs merged 5 commits into sgl-project:main from zminglei:fix-server-arg
Oct 18, 2025
Conversation

zminglei (Collaborator) commented Oct 18, 2025

Motivation

Set a default deterministic-compatible attention backend when deterministic inference is enabled and no attention backend is specified.
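The fallback this PR adds can be sketched as follows. This is an illustrative reconstruction, not the actual `_handle_deterministic_inference` body in `server_args.py`; the function name `resolve_attention_backend` is hypothetical, and only the backend list and the `"fa3"` default come from the PR itself:

```python
# Illustrative sketch (not sglang's exact code) of the fallback added here:
# pick a deterministic-compatible attention backend only when the user has
# not chosen one, and keep rejecting explicit unsupported choices.
DETERMINISTIC_ATTENTION_BACKEND_CHOICES = ["flashinfer", "fa3", "triton"]


def resolve_attention_backend(attention_backend, deterministic_enabled):
    """Return the attention backend to use under deterministic inference."""
    if not deterministic_enabled:
        return attention_backend
    if attention_backend is None:
        # Before this PR an unset backend raised ValueError; now it falls
        # back to a supported default.
        return "fa3"
    if attention_backend not in DETERMINISTIC_ATTENTION_BACKEND_CHOICES:
        raise ValueError(
            f"Currently only {DETERMINISTIC_ATTENTION_BACKEND_CHOICES} attention "
            "backends are supported for deterministic inference."
        )
    return attention_backend
```

With this change the "Before" traceback below only occurs for an explicitly chosen unsupported backend, never for the unset default.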

Tested on a single H100
Before:

python3 -m sglang.launch_server --model-path /shared/public/elr-models/Qwen/Qwen3-8B/2069b3fae1114555f3c020c81410e51fa0f656f2 --enable-deterministic-inference
/home/jobuser/zminglei/sglang/venv/lib/python3.10/site-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
`torch_dtype` is deprecated! Use `dtype` instead!
WARNING:sglang.srt.server_args:Sampling backend is set to pytorch for deterministic inference.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/jobuser/zminglei/sglang/python/sglang/launch_server.py", line 11, in <module>
    server_args = prepare_server_args(sys.argv[1:])
  File "/home/jobuser/zminglei/sglang/python/sglang/srt/server_args.py", line 3489, in prepare_server_args
    return ServerArgs.from_cli_args(raw_args)
  File "/home/jobuser/zminglei/sglang/python/sglang/srt/server_args.py", line 3138, in from_cli_args
    return cls(**{attr: getattr(args, attr) for attr in attrs})
  File "<string>", line 258, in __init__
  File "/home/jobuser/zminglei/sglang/python/sglang/srt/server_args.py", line 580, in __post_init__
    self._handle_deterministic_inference()
  File "/home/jobuser/zminglei/sglang/python/sglang/srt/server_args.py", line 1415, in _handle_deterministic_inference
    raise ValueError(
ValueError: Currently only ['flashinfer', 'fa3', 'triton'] attention backends are supported for deterministic inference.

After:

python3 -m sglang.launch_server --model-path /shared/public/elr-models/Qwen/Qwen3-8B/2069b3fae1114555f3c020c81410e51fa0f656f2 --enable-deterministic-inference

WARNING:sglang.srt.server_args:Attention backend not specified. Falling back to 'fa3' for deterministic inference. You can explicitly set --attention-backend to one of ['flashinfer', 'fa3', 'triton'].
[2025-10-17 21:28:32] INFO:     Started server process [14033]
[2025-10-17 21:28:32] INFO:     Waiting for application startup.
[2025-10-17 21:28:32] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
[2025-10-17 21:28:32] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
[2025-10-17 21:28:32] INFO:     Application startup complete.
[2025-10-17 21:28:32] INFO:     Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2025-10-17 21:28:33] INFO:     127.0.0.1:54594 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-10-17 21:28:33] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-10-17 21:28:35] INFO:     127.0.0.1:54600 - "POST /generate HTTP/1.1" 200 OK
[2025-10-17 21:28:35] The server is fired up and ready to roll!
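The warning in the log also points out that the default can be overridden. A hypothetical invocation pinning one of the supported backends explicitly (model path shortened for illustration) might look like:

```shell
# Explicitly pin a supported deterministic backend instead of relying on
# the fallback default. Model path is illustrative, not the one tested above.
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3-8B \
  --enable-deterministic-inference \
  --attention-backend triton
```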

Modifications

Accuracy Tests

python benchmark/gsm8k/bench_sglang.py --data-path /shared/public/data/gsm8k/test.jsonl
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:24<00:00,  8.13it/s]
Accuracy: 0.955
Invalid: 0.000
Latency: 24.666 s
Output throughput: 964.265 token/s

Benchmarking and Profiling

Checklist

@zminglei zminglei marked this pull request as ready for review October 18, 2025 04:29
@hebiao064 hebiao064 added the deterministic Issues on deterministic inference/kernels label Oct 18, 2025
Comment thread on python/sglang/srt/server_args.py (outdated)
@zminglei zminglei changed the title add default attention backend as fa3 for deterministic inference add default attention backend as flashinfer for deterministic inference Oct 18, 2025
@zminglei zminglei changed the title add default attention backend as flashinfer for deterministic inference set default attention backend as flashinfer for deterministic inference Oct 18, 2025
@zminglei zminglei changed the title set default attention backend as flashinfer for deterministic inference set default attention backend for deterministic inference Oct 18, 2025
f"You can explicitly set --attention-backend to one of {DETERMINISTIC_ATTENTION_BACKEND_CHOICES}."
)
else:
self.attention_backend = "fa3"
Collaborator

I think that for most use cases this should be good, but what about sm120 (RTX 50xx)?

Collaborator Author

Ah, for that I just updated it to set flashinfer for sm120 as well, since it is also the Blackwell architecture. Wdyt?
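The capability-aware default discussed in this thread can be sketched like this. The helper name and the exact threshold are illustrative assumptions, not sglang's code; only the sm120-gets-flashinfer behavior comes from the discussion above:

```python
# Hypothetical sketch of the capability-aware default discussed here:
# Blackwell-generation GPUs (sm100/sm120) default to "flashinfer", while
# e.g. Hopper (sm90, H100) defaults to "fa3". Threshold is an assumption.
def default_deterministic_backend(major, minor):
    """Pick a default attention backend from the CUDA compute capability."""
    if (major, minor) >= (10, 0):  # sm100 (B200) and sm120 (RTX 50xx)
        return "flashinfer"
    return "fa3"  # e.g. sm90 (H100), as tested in this PR
```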

Collaborator

@yzh119 can you help check? Thanks.

@zhyncs zhyncs merged commit f4488e9 into sgl-project:main Oct 18, 2025
60 of 70 checks passed

Labels

deterministic (Issues on deterministic inference/kernels), run-ci