
[Bug] Qwen3 LoRA doesn't work #7271

@logachevpa

Description


Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

I trained LoRA adapters for Qwen2.5 and Qwen3. Inference with sglang works fine for the first, but the second throws an error at the init stage, even though the training script is identical; I only changed the model. I get the following traceback:

[2025-06-17 13:14:27] Scheduler hit an exception: Traceback (most recent call last):
  File "/home/logachevpa/miniconda3/envs/py312/lib/python3.12/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 295, in __init__
    self.capture()
  File "/home/logachevpa/miniconda3/envs/py312/lib/python3.12/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 379, in capture
    ) = self.capture_one_batch_size(bs, forward)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/logachevpa/miniconda3/envs/py312/lib/python3.12/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 463, in capture_one_batch_size
    self.model_runner.lora_manager.prepare_lora_batch(forward_batch)
  File "/home/logachevpa/miniconda3/envs/py312/lib/python3.12/site-packages/sglang/srt/lora/lora_manager.py", line 154, in prepare_lora_batch
    self.memory_pool.prepare_lora_batch(cur_uids, self.loras)
  File "/home/logachevpa/miniconda3/envs/py312/lib/python3.12/site-packages/sglang/srt/lora/mem_pool.py", line 153, in prepare_lora_batch
    self.load_lora_weight_to_buffer(
  File "/home/logachevpa/miniconda3/envs/py312/lib/python3.12/site-packages/sglang/srt/lora/mem_pool.py", line 213, in load_lora_weight_to_buffer
    self.A_buffer[name][layer_id][buffer_id][: lora_rank * c, :].copy_(
RuntimeError: The size of tensor a (5120) must match the size of tensor b (8192) at non-singleton dimension 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/logachevpa/miniconda3/envs/py312/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 2255, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, pp_rank, dp_rank)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/logachevpa/miniconda3/envs/py312/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 273, in __init__
    self.tp_worker = TpWorkerClass(
                     ^^^^^^^^^^^^^^
  File "/home/logachevpa/miniconda3/envs/py312/lib/python3.12/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 64, in __init__
    self.worker = TpModelWorker(
                  ^^^^^^^^^^^^^^
  File "/home/logachevpa/miniconda3/envs/py312/lib/python3.12/site-packages/sglang/srt/managers/tp_worker.py", line 78, in __init__
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/home/logachevpa/miniconda3/envs/py312/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 192, in __init__
    self.initialize(min_per_gpu_memory)
  File "/home/logachevpa/miniconda3/envs/py312/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 241, in initialize
    self.init_cuda_graphs()
  File "/home/logachevpa/miniconda3/envs/py312/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 1028, in init_cuda_graphs
    self.cuda_graph_runner = CudaGraphRunner(self)
                             ^^^^^^^^^^^^^^^^^^^^^
  File "/home/logachevpa/miniconda3/envs/py312/lib/python3.12/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 297, in __init__
    raise Exception(
Exception: Capture CUDA graph failed: The size of tensor a (5120) must match the size of tensor b (8192) at non-singleton dimension 1

Possible solutions
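
The two sizes in the error look meaningful: 5120 is the hidden_size of both Qwen2.5-32B and Qwen3-32B, while 8192 = 64 heads * head_dim 128 is Qwen3-32B's q_proj output width. Unlike Qwen2.5-32B (40 heads * 128 = 5120), Qwen3's attention width no longer equals hidden_size, so any LoRA buffer sized from hidden_size alone will under-allocate. A minimal sketch to confirm the geometry from the HF configs (assuming Hub access; the expected-output comments hold only if the published config values match):

from transformers import AutoConfig

# Compare the q_proj widths of the two base models.
for model in ("Qwen/Qwen2.5-32B", "Qwen/Qwen3-32B"):
    cfg = AutoConfig.from_pretrained(model)
    # Qwen3 sets head_dim explicitly; otherwise fall back to the usual derivation.
    head_dim = getattr(cfg, "head_dim", None) or cfg.hidden_size // cfg.num_attention_heads
    q_out = cfg.num_attention_heads * head_dim
    print(f"{model}: hidden_size={cfg.hidden_size}, q_proj out_features={q_out}")
    # Expected: Qwen2.5-32B -> 5120/5120, Qwen3-32B -> 5120/8192 (the two sizes in the error)

If that checks out, the fix likely belongs in the shape derivation in sglang/srt/lora/mem_pool.py rather than in the adapter itself.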

Reproduction

PATH=/nvidia-tools-dir/bin:$PATH CUDA_VISIBLE_DEVICES=3 python3 -m sglang.launch_server --model-path Qwen/Qwen3-32B --lora-paths lora0=./lora_qwen3_original --max-loras-per-batch 1 --lora-backend triton --disable-radix-cache
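
As a quicker sanity check on the adapter itself, the LoRA tensor shapes can be printed directly; a minimal sketch, assuming the adapter was saved by PEFT under the default name adapter_model.safetensors:

from safetensors.torch import load_file

# Print the shapes of the layer-0 LoRA tensors in the trained adapter.
state = load_file("./lora_qwen3_original/adapter_model.safetensors")
for name, tensor in state.items():
    if ".layers.0." in name:  # one layer is enough to see the shapes
        print(name, tuple(tensor.shape))

If q_proj's lora_B already has out_features 8192, the adapter matches Qwen3-32B and the mismatch is on the sglang buffer side.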

Environment

(py312) logachevpa@cml-gpu:~/arcadia/junk/logachevpa/train_llm$ python3 -m sglang.check_env
Python: 3.12.11 | packaged by Anaconda, Inc. | (main, Jun 5 2025, 13:09:17) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA A100-SXM4-80GB
GPU 0,1,2,3,4,5,6,7 Compute Capability: 8.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.9, V12.9.86
CUDA Driver Version: 575.57.08
PyTorch: 2.6.0+cu124
sglang: 0.4.6.post4
sgl_kernel: 0.1.2.post1
flashinfer_python: 0.2.5
triton: 3.2.0
transformers: 4.52.4
torchao: 0.11.0
numpy: 2.2.6
aiohttp: 3.12.9
fastapi: 0.115.12
hf_transfer: 0.1.9
huggingface_hub: 0.32.4
interegular: 0.3.3
modelscope: 1.26.0
orjson: 3.10.18
outlines: 0.1.11
packaging: 25.0
psutil: 7.0.0
pydantic: 2.11.5
python-multipart: 0.0.20
pyzmq: 26.4.0
uvicorn: 0.34.3
uvloop: 0.21.0
vllm: 0.8.5
xgrammar: 0.1.18
openai: 1.84.0
tiktoken: 0.9.0
anthropic: 0.52.2
litellm: 1.72.1
decord: 0.6.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV12 NV12 NV12 NV12 NV12 NV12 NV12 0-25 0 N/A
GPU1 NV12 X NV12 NV12 NV12 NV12 NV12 NV12 0-25 0 N/A
GPU2 NV12 NV12 X NV12 NV12 NV12 NV12 NV12 26-51 1 N/A
GPU3 NV12 NV12 NV12 X NV12 NV12 NV12 NV12 26-51 1 N/A
GPU4 NV12 NV12 NV12 NV12 X NV12 NV12 NV12 52-77 2 N/A
GPU5 NV12 NV12 NV12 NV12 NV12 X NV12 NV12 52-77 2 N/A
GPU6 NV12 NV12 NV12 NV12 NV12 NV12 X NV12 78-103 3 N/A
GPU7 NV12 NV12 NV12 NV12 NV12 NV12 NV12 X 78-103 3 N/A

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

ulimit soft: 1024
