Checklist
Describe the bug
I'm running qwen3-vl on an RTX 5090. If I don't explicitly specify the attention_backend parameter, SGLang automatically selects 'trtllm_mha'.
--- python/sglang/srt/server_args.py
@@ def _handle_attention_backend_compatibility(self):
        if not use_mla_backend:
            # MHA architecture
            if (
                is_hopper_with_cuda_12_3()
                and is_no_spec_infer_or_topk_one(self)
                and is_fa3_default_architecture(self.model_config.hf_config)
            ):
                self.attention_backend = "fa3"
            elif is_blackwell() and is_no_spec_infer_or_topk_one(self):
                self.attention_backend = "trtllm_mha"  # !!! auto-selects trtllm_mha
However, later during the SM version check, a ValueError is raised.
--- python/sglang/srt/server_args.py
@@ def _handle_attention_backend_compatibility(self):
        if (
            self.attention_backend == "trtllm_mha"
            or self.decode_attention_backend == "trtllm_mha"
            or self.prefill_attention_backend == "trtllm_mha"
        ):
            if not is_sm100_supported():
                raise ValueError(
                    "TRTLLM MHA backend is only supported on Blackwell GPUs (SM100). Please use a different backend."
                )
I manually checked and confirmed that the GPU's compute capability is sm120. Does sm120 not support trtllm_mha? If so, shouldn't the auto-selection logic apply the same SM version check, so that trtllm_mha is never chosen automatically on unsupported GPUs?
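To illustrate the mismatch, here is a minimal standalone sketch (not SGLang's actual code; the capability logic and the fallback backend name are my assumptions): the auto-selection branch treats all Blackwell-generation GPUs alike, while the later check requires exactly SM 10.0, so sm120 passes the first predicate but fails the second.

```python
def is_blackwell(major: int, minor: int) -> bool:
    # Assumed: any Blackwell-generation GPU, e.g. sm100 (B200) or sm120 (RTX 5090)
    return major in (10, 12)

def is_sm100_supported(major: int, minor: int) -> bool:
    # Assumed: the stricter predicate that the trtllm_mha check actually enforces
    return (major, minor) == (10, 0)

def select_backend(major: int, minor: int) -> str:
    # Guarding the auto-selection with the same predicate as the later check
    # avoids picking a backend that is then rejected with a ValueError.
    if is_blackwell(major, minor) and is_sm100_supported(major, minor):
        return "trtllm_mha"
    return "triton"  # hypothetical fallback backend

print(select_backend(12, 0))  # sm120 (RTX 5090) -> triton
print(select_backend(10, 0))  # sm100 (B200)     -> trtllm_mha
```

Under these assumptions, adding the `is_sm100_supported()` guard to the auto-selection branch would make an RTX 5090 fall through to a fallback backend instead of crashing at startup.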
Reproduction
python3 -m sglang.launch_server --model-path Qwen/Qwen3-VL-4B-Instruct-FP8 --enable-multimodal --cuda-graph-max-bs 128 --context-length 2560 --page-size 16 --stream-interval 300 --mem-fraction-static 0.7 --port 30260 --base-gpu-id 2 --kv-cache-dtype fp8_e4m3 --fp8-gemm-backend=cutlass
[2025-12-10 17:42:05] WARNING server_args.py:1391: Attention backend not explicitly specified. Use trtllm_mha backend by default.
Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in _run_code
File "/mnt/sufeng/sglang/python/sglang/launch_server.py", line 25, in
server_args = prepare_server_args(sys.argv[1:])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/sufeng/sglang/python/sglang/srt/server_args.py", line 4463, in prepare_server_args
return ServerArgs.from_cli_args(raw_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/sufeng/sglang/python/sglang/srt/server_args.py", line 4012, in from_cli_args
return cls(**{attr: getattr(args, attr) for attr in attrs})
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "", line 293, in init
File "/mnt/sufeng/sglang/python/sglang/srt/server_args.py", line 648, in post_init
self._handle_attention_backend_compatibility()
File "/mnt/sufeng/sglang/python/sglang/srt/server_args.py", line 1456, in _handle_attention_backend_compatibility
raise ValueError(
ValueError: TRTLLM MHA backend is only supported on Blackwell GPUs (SM100). Please use a different backend.
Environment
(root) root@iZbp1egv8tehlc78k9u6y7Z:/mnt/sufeng/sglang# nvidia-smi
Wed Dec 10 17:42:56 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.04 Driver Version: 570.124.04 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 5090 On | 00000000:08:00.0 Off | N/A |
| 0% 29C P8 7W / 575W | 1MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 5090 On | 00000000:0C:00.0 Off | N/A |
| 0% 28C P8 21W / 575W | 1MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA GeForce RTX 5090 On | 00000000:7E:00.0 Off | N/A |
| 0% 28C P8 17W / 575W | 1MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA GeForce RTX 5090 On | 00000000:7F:00.0 Off | N/A |
| 0% 28C P8 11W / 575W | 1MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA GeForce RTX 5090 On | 00000001:08:00.0 Off | N/A |
| 0% 27C P8 27W / 575W | 1MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA GeForce RTX 5090 On | 00000001:0C:00.0 Off | N/A |
| 0% 27C P8 19W / 575W | 1MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA GeForce RTX 5090 On | 00000001:81:00.0 Off | N/A |
| 0% 28C P8 13W / 575W | 1MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA GeForce RTX 5090 On | 00000001:82:00.0 Off | N/A |
| 0% 27C P8 21W / 575W | 1MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+