[Bug] Crash when deploying "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B" with SGLang on two different GPUs #3842

@wwk-code

Description

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

My deployment environment is my personal host machine, which has two different GPUs: a GeForce RTX 3090 and a GeForce RTX 4060 Ti. When I run a simple SGLang-based script, I hit the bug. The report output is below:
#############################################################

(HupuKiller) root@DESKTOP-9RQB5NI:/data/workspace# python /data/workspace/projects/HupuKiller/src/py/sglang/test/sglang_server.py
/data/workspace/anaconda/envs/HupuKiller/lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:590: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use slow_image_processor_class, or fast_image_processor_class instead
warnings.warn(
INFO 02-25 16:41:10 init.py:190] Automatically detected platform cuda.
WARNING 02-25 16:41:10 cuda.py:336] Detected different devices in the system:
WARNING 02-25 16:41:10 cuda.py:336] NVIDIA GeForce RTX 3090
WARNING 02-25 16:41:10 cuda.py:336] NVIDIA GeForce RTX 4060 Ti
WARNING 02-25 16:41:10 cuda.py:336] Please make sure to set CUDA_DEVICE_ORDER=PCI_BUS_ID to avoid unexpected behavior.
2025-02-25 16:41:11,589 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[2025-02-25 16:41:13] server_args=ServerArgs(model_path='google-bert/bert-base-chinese', tokenizer_path='google-bert/bert-base-chinese', tokenizer_mode='auto', load_format='auto', trust_remote_code=False, dtype='bfloat16', kv_cache_dtype='auto', quantization_param_path=None, quantization=None, context_length=256, device='cuda', served_model_name='google-bert/bert-base-chinese', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, host='0.0.0.0', port=30023, mem_fraction_static=0.87, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=2048, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=2, stream_interval=1, stream_output=False, random_seed=669493103, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='sglang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', speculative_draft_model_path=None, speculative_algorithm=None, speculative_num_steps=5, speculative_num_draft_tokens=64, speculative_eagle_topk=8, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_nccl_nvls=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, 
enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=8, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=True, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, return_hidden_states=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, enable_flashinfer_mla=False)
[2025-02-25 16:41:14] Downcasting torch.float32 to torch.bfloat16.
/data/workspace/anaconda/envs/HupuKiller/lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:590: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use slow_image_processor_class, or fast_image_processor_class instead
warnings.warn(
/data/workspace/anaconda/envs/HupuKiller/lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:590: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use slow_image_processor_class, or fast_image_processor_class instead
warnings.warn(
/data/workspace/anaconda/envs/HupuKiller/lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:590: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use slow_image_processor_class, or fast_image_processor_class instead
warnings.warn(
INFO 02-25 16:41:16 init.py:190] Automatically detected platform cuda.
WARNING 02-25 16:41:16 cuda.py:336] Detected different devices in the system:
WARNING 02-25 16:41:16 cuda.py:336] NVIDIA GeForce RTX 3090
WARNING 02-25 16:41:16 cuda.py:336] NVIDIA GeForce RTX 4060 Ti
WARNING 02-25 16:41:16 cuda.py:336] Please make sure to set CUDA_DEVICE_ORDER=PCI_BUS_ID to avoid unexpected behavior.
INFO 02-25 16:41:16 init.py:190] Automatically detected platform cuda.
INFO 02-25 16:41:16 init.py:190] Automatically detected platform cuda.
WARNING 02-25 16:41:16 cuda.py:336] Detected different devices in the system:
WARNING 02-25 16:41:16 cuda.py:336] NVIDIA GeForce RTX 3090
WARNING 02-25 16:41:16 cuda.py:336] NVIDIA GeForce RTX 4060 Ti
WARNING 02-25 16:41:16 cuda.py:336] Please make sure to set CUDA_DEVICE_ORDER=PCI_BUS_ID to avoid unexpected behavior.
WARNING 02-25 16:41:16 cuda.py:336] Detected different devices in the system:
WARNING 02-25 16:41:16 cuda.py:336] NVIDIA GeForce RTX 3090
WARNING 02-25 16:41:16 cuda.py:336] NVIDIA GeForce RTX 4060 Ti
WARNING 02-25 16:41:16 cuda.py:336] Please make sure to set CUDA_DEVICE_ORDER=PCI_BUS_ID to avoid unexpected behavior.
2025-02-25 16:41:17,454 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
2025-02-25 16:41:17,488 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
2025-02-25 16:41:17,489 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[2025-02-25 16:41:20 TP1] Downcasting torch.float32 to torch.bfloat16.
[2025-02-25 16:41:20 TP0] Downcasting torch.float32 to torch.bfloat16.
[2025-02-25 16:41:20 TP1] Downcasting torch.float32 to torch.bfloat16.
[2025-02-25 16:41:20 TP1] Init torch distributed begin.
[2025-02-25 16:41:21 TP0] Downcasting torch.float32 to torch.bfloat16.
[2025-02-25 16:41:21 TP0] Init torch distributed begin.
[2025-02-25 16:41:21 TP1] sglang is using nccl==2.21.5
[2025-02-25 16:41:21 TP0] sglang is using nccl==2.21.5
[2025-02-25 16:41:21 TP1] reading GPU P2P access cache from /root/.cache/sglang/gpu_p2p_access_cache_for_0,1.json
[2025-02-25 16:41:21 TP0] reading GPU P2P access cache from /root/.cache/sglang/gpu_p2p_access_cache_for_0,1.json
[2025-02-25 16:41:21 TP0] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
[2025-02-25 16:41:21 TP1] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
[2025-02-25 16:41:21 TP0] Scheduler hit an exception: Traceback (most recent call last):
File "/data/workspace/anaconda/envs/HupuKiller/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1816, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/data/workspace/anaconda/envs/HupuKiller/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 240, in init
self.tp_worker = TpWorkerClass(
File "/data/workspace/anaconda/envs/HupuKiller/lib/python3.10/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in init
self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
File "/data/workspace/anaconda/envs/HupuKiller/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 68, in init
self.model_runner = ModelRunner(
File "/data/workspace/anaconda/envs/HupuKiller/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 187, in init
min_per_gpu_memory = self.init_torch_distributed()
File "/data/workspace/anaconda/envs/HupuKiller/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 280, in init_torch_distributed
raise ValueError(
ValueError: The memory capacity is unbalanced. Some GPUs may be occupied by other processes.

[2025-02-25 16:41:21] Received sigquit from a child proces. It usually means the child failed.
[2025-02-25 16:41:21 TP1] Load weight begin. avail mem=14.76 GB

#############################################################
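For reference, the ValueError is raised by SGLang's per-rank memory check in `init_torch_distributed`: each TP rank reports its available GPU memory, and the server refuses to start when the reported capacities diverge, since the 3090 (24 GB) and 4060 Ti (16 GB) cannot hold equal KV-cache shards. A minimal sketch of that kind of check (the tolerance here is illustrative, not SGLang's exact threshold):

```python
# Sketch of a per-rank memory balance check (illustrative tolerance,
# not SGLang's actual implementation).
def memory_is_balanced(per_gpu_free_gb, tolerance=0.1):
    """Return True when the smallest and largest reported free
    memory differ by at most `tolerance` of the largest."""
    low, high = min(per_gpu_free_gb), max(per_gpu_free_gb)
    return (high - low) / high <= tolerance

# RTX 3090 (24 GB) vs RTX 4060 Ti (16 GB): clearly unbalanced.
print(memory_is_balanced([24.0, 16.0]))  # False
print(memory_is_balanced([24.0, 23.5]))  # True
```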

Reproduction

from sglang.test.test_utils import is_in_ci

if is_in_ci():
    from patch import launch_server_cmd
else:
    from sglang.utils import launch_server_cmd

from sglang.utils import wait_for_server, print_highlight, terminate_process

server_process, port = launch_server_cmd(
    # "python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --host 0.0.0.0 --port 30000 --enable-p2p-check --dtype bfloat16 --tp 2 --context-length 4096"
    "python -m sglang.launch_server --model-path google-bert/bert-base-chinese --host 0.0.0.0 --port 30000 --enable-p2p-check --dtype bfloat16 --tp 2 --context-length 256"
)

wait_for_server(f"http://localhost:{port}")
print(f"Server started on http://localhost:{port}")
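As a possible workaround (an assumption, not a confirmed fix), the repeated warnings in the log suggest pinning the device order as requested and restricting the run to a single GPU, so tensor parallelism never spans the two mismatched cards:

```shell
# Hypothetical workaround: pin PCI bus ordering as the warning suggests,
# then expose only GPU 0 (the RTX 3090) and launch with --tp 1.
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0
python -m sglang.launch_server \
    --model-path google-bert/bert-base-chinese \
    --host 0.0.0.0 --port 30000 \
    --dtype bfloat16 --tp 1 --context-length 256
```

This trades the second GPU's memory for a configuration the balance check accepts; whether `--tp 2` can be made to work across unequal GPUs is the open question of this issue.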

Environment

(HupuKiller) root@DESKTOP-9RQB5NI:/data/workspace# python3 -m sglang.check_env
2025-02-25 16:47:37,120 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
INFO 02-25 16:47:41 init.py:190] Automatically detected platform cuda.
WARNING 02-25 16:47:41 cuda.py:336] Detected different devices in the system:
WARNING 02-25 16:47:41 cuda.py:336] NVIDIA GeForce RTX 3090
WARNING 02-25 16:47:41 cuda.py:336] NVIDIA GeForce RTX 4060 Ti
WARNING 02-25 16:47:41 cuda.py:336] Please make sure to set CUDA_DEVICE_ORDER=PCI_BUS_ID to avoid unexpected behavior.
Python: 3.10.16 (main, Dec 11 2024, 16:24:50) [GCC 11.2.0]
CUDA available: True
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 4060 Ti
GPU 0 Compute Capability: 8.6
GPU 1 Compute Capability: 8.9
CUDA_HOME: /usr/local/cuda-12.6/
NVCC: Cuda compilation tools, release 12.6, V12.6.85
CUDA Driver Version: 566.36
PyTorch: 2.5.1+cu124
sglang: 0.4.3.post2
sgl_kernel: 0.0.3.post6
flashinfer: 0.2.1.post2
triton: 3.1.0
transformers: 4.48.3
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.11
fastapi: 0.115.8
hf_transfer: 0.1.9
huggingface_hub: 0.27.0
interegular: 0.3.3
modelscope: 1.23.1
orjson: 3.10.13
packaging: 24.2
psutil: 6.1.1
pydantic: 2.10.4
multipart: 0.0.20
zmq: 26.2.0
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.7.2
openai: 1.63.2
tiktoken: 0.8.0
anthropic: 0.46.0
decord: 0.6.0
Hypervisor vendor: Microsoft
ulimit soft: 1048576
