Checklist
Describe the bug
[2025-04-23 16:14:34 TP0] Attention backend not set. Use triton backend by default.
[2025-04-23 16:14:34 TP0] Init torch distributed begin.
[W423 16:14:34.122823038 HIPAllocatorConfig.h:29] Warning: expandable_segments not supported on this platform (function operator())
[2025-04-23 16:14:35 TP0] Init torch distributed ends. mem usage=0.00 GB
[2025-04-23 16:14:35 TP0] Load weight begin. avail mem=17.88 GB
[2025-04-23 16:14:35 TP0] sgl-kernel is not available on Non-NV platforms. Fallback to other kernel libraries.
[2025-04-23 16:14:35 TP0] sgl-kernel is not available on Non-NV platforms. Fallback to other kernel libraries.
[2025-04-23 16:14:35 TP0] The following error message 'operation scheduled before its operands' can be ignored.
/root/miniconda3/envs/xinf/lib/python3.10/site-packages/torch/utils/_device.py:104: UserWarning: expandable_segments not supported on this platform (Triggered internally at /pytorch/c10/hip/HIPAllocatorConfig.h:29.)
return func(*args, **kwargs)
Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 33% Completed | 1/3 [00:04<00:09, 4.56s/it]
Loading safetensors checkpoint shards: 67% Completed | 2/3 [00:09<00:04, 4.66s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:11<00:00, 3.46s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:11<00:00, 3.78s/it]
[2025-04-23 16:14:47 TP0] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=7.96 GB, mem usage=9.92 GB.
[2025-04-23 16:14:47 TP0] KV Cache is allocated. #tokens: 13200, K size: 1.21 GB, V size: 1.21 GB
[2025-04-23 16:14:47 TP0] Memory pool end. avail mem=4.70 GB
[2025-04-23 16:14:47 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=4.70 GB
Capturing batches (avail_mem=4.70 GB): 0%| | 0/4 [00:00<?, ?it/s]
[2025-04-23 16:14:48 TP0] Scheduler hit an exception: Traceback (most recent call last):
File "/usr/local/sglang/python/sglang/srt/managers/scheduler.py", line 2001, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/usr/local/sglang/python/sglang/srt/managers/scheduler.py", line 261, in __init__
self.tp_worker = TpWorkerClass(
File "/usr/local/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in __init__
self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
File "/usr/local/sglang/python/sglang/srt/managers/tp_worker.py", line 75, in __init__
self.model_runner = ModelRunner(
File "/usr/local/sglang/python/sglang/srt/model_executor/model_runner.py", line 181, in __init__
self.initialize(min_per_gpu_memory)
File "/usr/local/sglang/python/sglang/srt/model_executor/model_runner.py", line 219, in initialize
self.init_cuda_graphs()
File "/usr/local/sglang/python/sglang/srt/model_executor/model_runner.py", line 980, in init_cuda_graphs
self.cuda_graph_runner = CudaGraphRunner(self)
File "/usr/local/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 276, in __init__
self.capture()
File "/usr/local/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 360, in capture
) = self.capture_one_batch_size(bs, forward)
File "/usr/local/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 452, in capture_one_batch_size
run_once()
File "/usr/local/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 445, in run_once
logits_output = forward(input_ids, forward_batch.positions, forward_batch)
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/usr/local/sglang/python/sglang/srt/models/qwen2.py", line 383, in forward
hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/sglang/python/sglang/srt/models/qwen2.py", line 291, in forward
hidden_states, residual = layer(
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/sglang/python/sglang/srt/models/qwen2.py", line 224, in forward
hidden_states = self.self_attn(
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/sglang/python/sglang/srt/models/qwen2.py", line 167, in forward
qkv, _ = self.qkv_proj(hidden_states)
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/sglang/python/sglang/srt/layers/linear.py", line 445, in forward
output_parallel = self.quant_method.apply(self, input, bias)
File "/usr/local/sglang/python/sglang/srt/layers/quantization/awq.py", line 195, in apply
out = awq_dequantize(qweight, scales, qzeros)
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/sgl_kernel-0.0.9.post2-py3.10-linux-x86_64.egg/sgl_kernel/gemm.py", line 10, in awq_dequantize
return torch.ops.sgl_kernel.awq_dequantize.default(qweight, scales, qzeros)
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/torch/_ops.py", line 1232, in __getattr__
raise AttributeError(
AttributeError: '_OpNamespace' 'sgl_kernel' object has no attribute 'awq_dequantize'
Reproduction
Serve Qwen2.5-Instruct 14B with int4 AWQ quantization on ROCm (AMD Radeon RX 7900 XT). Weight loading succeeds, but the server crashes during CUDA graph capture with the AttributeError above.
Environment
Python: 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0]
ROCM available: True
GPU 0: AMD Radeon RX 7900 XT
GPU 0 Compute Capability: 11.0
ROCM_HOME: /opt/rocm
HIPCC: HIP version: 6.4.43482-0f2d60242
ROCM Driver Version:
PyTorch: 2.6.0+rocm6.4.0.git2fb0ac2b
sglang: 0.4.5.post3
sgl_kernel: 0.0.9.post2
flashinfer: Module Not Found
triton: 3.2.0
transformers: 4.51.1
torchao: 0.11.0.dev20250418+rocm6.3
numpy: 1.26.4
aiohttp: 3.11.14
fastapi: 0.115.12
hf_transfer: 0.1.9
huggingface_hub: 0.30.2
interegular: 0.3.3
modelscope: 1.24.0
orjson: 3.10.16
outlines: 0.1.11
packaging: 24.2
psutil: 7.0.0
pydantic: 2.10.6
multipart: 1.2.1
zmq: Module Not Found
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.8.5.dev134+gd6da9322c.d20250422.rocm640
xgrammar: 0.1.18
openai: 1.68.2
tiktoken: 0.9.0
anthropic: 0.49.0
litellm: 1.63.14
decord: 0.6.0
AMD Topology:
Hypervisor vendor: Microsoft
ulimit soft: 1024