[Bug] Crash when deploying "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B" with SGLang on two different GPUs #3842

@wwk-code

Description

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

My deployment environment is my personal host machine, which has two different GPUs: a GeForce RTX 3090 and a GeForce RTX 4060 Ti. When I run a simple SGLang-based script, I hit the bug. The report output is below:
#############################################################

(HupuKiller) root@DESKTOP-9RQB5NI:/data/workspace# python /data/workspace/projects/HupuKiller/src/py/sglang/test/sglang_server.py
/data/workspace/anaconda/envs/HupuKiller/lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:590: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use slow_image_processor_class, or fast_image_processor_class instead
warnings.warn(
INFO 02-25 16:41:10 init.py:190] Automatically detected platform cuda.
WARNING 02-25 16:41:10 cuda.py:336] Detected different devices in the system:
WARNING 02-25 16:41:10 cuda.py:336] NVIDIA GeForce RTX 3090
WARNING 02-25 16:41:10 cuda.py:336] NVIDIA GeForce RTX 4060 Ti
WARNING 02-25 16:41:10 cuda.py:336] Please make sure to set CUDA_DEVICE_ORDER=PCI_BUS_ID to avoid unexpected behavior.
2025-02-25 16:41:11,589 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[2025-02-25 16:41:13] server_args=ServerArgs(model_path='google-bert/bert-base-chinese', tokenizer_path='google-bert/bert-base-chinese', tokenizer_mode='auto', load_format='auto', trust_remote_code=False, dtype='bfloat16', kv_cache_dtype='auto', quantization_param_path=None, quantization=None, context_length=256, device='cuda', served_model_name='google-bert/bert-base-chinese', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, host='0.0.0.0', port=30023, mem_fraction_static=0.87, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=2048, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=2, stream_interval=1, stream_output=False, random_seed=669493103, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='sglang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', speculative_draft_model_path=None, speculative_algorithm=None, speculative_num_steps=5, speculative_num_draft_tokens=64, speculative_eagle_topk=8, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_nccl_nvls=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, 
enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=8, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=True, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, return_hidden_states=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, enable_flashinfer_mla=False)
[2025-02-25 16:41:14] Downcasting torch.float32 to torch.bfloat16.
/data/workspace/anaconda/envs/HupuKiller/lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:590: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use slow_image_processor_class, or fast_image_processor_class instead
warnings.warn(
/data/workspace/anaconda/envs/HupuKiller/lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:590: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use slow_image_processor_class, or fast_image_processor_class instead
warnings.warn(
/data/workspace/anaconda/envs/HupuKiller/lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:590: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use slow_image_processor_class, or fast_image_processor_class instead
warnings.warn(
INFO 02-25 16:41:16 init.py:190] Automatically detected platform cuda.
WARNING 02-25 16:41:16 cuda.py:336] Detected different devices in the system:
WARNING 02-25 16:41:16 cuda.py:336] NVIDIA GeForce RTX 3090
WARNING 02-25 16:41:16 cuda.py:336] NVIDIA GeForce RTX 4060 Ti
WARNING 02-25 16:41:16 cuda.py:336] Please make sure to set CUDA_DEVICE_ORDER=PCI_BUS_ID to avoid unexpected behavior.
INFO 02-25 16:41:16 init.py:190] Automatically detected platform cuda.
INFO 02-25 16:41:16 init.py:190] Automatically detected platform cuda.
WARNING 02-25 16:41:16 cuda.py:336] Detected different devices in the system:
WARNING 02-25 16:41:16 cuda.py:336] NVIDIA GeForce RTX 3090
WARNING 02-25 16:41:16 cuda.py:336] NVIDIA GeForce RTX 4060 Ti
WARNING 02-25 16:41:16 cuda.py:336] Please make sure to set CUDA_DEVICE_ORDER=PCI_BUS_ID to avoid unexpected behavior.
WARNING 02-25 16:41:16 cuda.py:336] Detected different devices in the system:
WARNING 02-25 16:41:16 cuda.py:336] NVIDIA GeForce RTX 3090
WARNING 02-25 16:41:16 cuda.py:336] NVIDIA GeForce RTX 4060 Ti
WARNING 02-25 16:41:16 cuda.py:336] Please make sure to set CUDA_DEVICE_ORDER=PCI_BUS_ID to avoid unexpected behavior.
2025-02-25 16:41:17,454 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
2025-02-25 16:41:17,488 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
2025-02-25 16:41:17,489 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[2025-02-25 16:41:20 TP1] Downcasting torch.float32 to torch.bfloat16.
[2025-02-25 16:41:20 TP0] Downcasting torch.float32 to torch.bfloat16.
[2025-02-25 16:41:20 TP1] Downcasting torch.float32 to torch.bfloat16.
[2025-02-25 16:41:20 TP1] Init torch distributed begin.
[2025-02-25 16:41:21 TP0] Downcasting torch.float32 to torch.bfloat16.
[2025-02-25 16:41:21 TP0] Init torch distributed begin.
[2025-02-25 16:41:21 TP1] sglang is using nccl==2.21.5
[2025-02-25 16:41:21 TP0] sglang is using nccl==2.21.5
[2025-02-25 16:41:21 TP1] reading GPU P2P access cache from /root/.cache/sglang/gpu_p2p_access_cache_for_0,1.json
[2025-02-25 16:41:21 TP0] reading GPU P2P access cache from /root/.cache/sglang/gpu_p2p_access_cache_for_0,1.json
[2025-02-25 16:41:21 TP0] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
[2025-02-25 16:41:21 TP1] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
[2025-02-25 16:41:21 TP0] Scheduler hit an exception: Traceback (most recent call last):
File "/data/workspace/anaconda/envs/HupuKiller/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1816, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/data/workspace/anaconda/envs/HupuKiller/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 240, in init
self.tp_worker = TpWorkerClass(
File "/data/workspace/anaconda/envs/HupuKiller/lib/python3.10/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in init
self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
File "/data/workspace/anaconda/envs/HupuKiller/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 68, in init
self.model_runner = ModelRunner(
File "/data/workspace/anaconda/envs/HupuKiller/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 187, in init
min_per_gpu_memory = self.init_torch_distributed()
File "/data/workspace/anaconda/envs/HupuKiller/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 280, in init_torch_distributed
raise ValueError(
ValueError: The memory capacity is unbalanced. Some GPUs may be occupied by other processes.

[2025-02-25 16:41:21] Received sigquit from a child proces. It usually means the child failed.
[2025-02-25 16:41:21 TP1] Load weight begin. avail mem=14.76 GB

#############################################################
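For reference, the ValueError is raised by SGLang's per-rank memory check in `init_torch_distributed`: each TP rank reports its available GPU memory, and the server refuses to start when the reported capacities diverge, since the 3090 (24 GB) and 4060 Ti (16 GB) cannot hold equal KV-cache shards. A minimal sketch of that kind of check (the tolerance here is illustrative, not SGLang's exact threshold):

```python
# Sketch of a per-rank memory balance check (illustrative tolerance,
# not SGLang's actual implementation).
def memory_is_balanced(per_gpu_free_gb, tolerance=0.1):
    """Return True when the smallest and largest reported free
    memory differ by at most `tolerance` of the largest."""
    low, high = min(per_gpu_free_gb), max(per_gpu_free_gb)
    return (high - low) / high <= tolerance

# RTX 3090 (24 GB) vs RTX 4060 Ti (16 GB): clearly unbalanced.
print(memory_is_balanced([24.0, 16.0]))  # False
print(memory_is_balanced([24.0, 23.5]))  # True
```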

Reproduction

from sglang.test.test_utils import is_in_ci

if is_in_ci():
    from patch import launch_server_cmd
else:
    from sglang.utils import launch_server_cmd

from sglang.utils import wait_for_server, print_highlight, terminate_process

server_process, port = launch_server_cmd(
    # "python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --host 0.0.0.0 --port 30000 --enable-p2p-check --dtype bfloat16 --tp 2 --context-length 4096"
    "python -m sglang.launch_server --model-path google-bert/bert-base-chinese --host 0.0.0.0 --port 30000 --enable-p2p-check --dtype bfloat16 --tp 2 --context-length 256"
)

wait_for_server(f"http://localhost:{port}")
print(f"Server started on http://localhost:{port}")
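As a possible workaround (an assumption, not a confirmed fix), the repeated warnings in the log suggest pinning the device order as requested and restricting the run to a single GPU, so tensor parallelism never spans the two mismatched cards:

```shell
# Hypothetical workaround: pin PCI bus ordering as the warning suggests,
# then expose only GPU 0 (the RTX 3090) and launch with --tp 1.
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0
python -m sglang.launch_server \
    --model-path google-bert/bert-base-chinese \
    --host 0.0.0.0 --port 30000 \
    --dtype bfloat16 --tp 1 --context-length 256
```

This trades the second GPU's memory for a configuration the balance check accepts; whether `--tp 2` can be made to work across unequal GPUs is the open question of this issue.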

Environment

(HupuKiller) root@DESKTOP-9RQB5NI:/data/workspace# python3 -m sglang.check_env
2025-02-25 16:47:37,120 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
INFO 02-25 16:47:41 init.py:190] Automatically detected platform cuda.
WARNING 02-25 16:47:41 cuda.py:336] Detected different devices in the system:
WARNING 02-25 16:47:41 cuda.py:336] NVIDIA GeForce RTX 3090
WARNING 02-25 16:47:41 cuda.py:336] NVIDIA GeForce RTX 4060 Ti
WARNING 02-25 16:47:41 cuda.py:336] Please make sure to set CUDA_DEVICE_ORDER=PCI_BUS_ID to avoid unexpected behavior.
Python: 3.10.16 (main, Dec 11 2024, 16:24:50) [GCC 11.2.0]
CUDA available: True
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 4060 Ti
GPU 0 Compute Capability: 8.6
GPU 1 Compute Capability: 8.9
CUDA_HOME: /usr/local/cuda-12.6/
NVCC: Cuda compilation tools, release 12.6, V12.6.85
CUDA Driver Version: 566.36
PyTorch: 2.5.1+cu124
sglang: 0.4.3.post2
sgl_kernel: 0.0.3.post6
flashinfer: 0.2.1.post2
triton: 3.1.0
transformers: 4.48.3
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.11
fastapi: 0.115.8
hf_transfer: 0.1.9
huggingface_hub: 0.27.0
interegular: 0.3.3
modelscope: 1.23.1
orjson: 3.10.13
packaging: 24.2
psutil: 6.1.1
pydantic: 2.10.4
multipart: 0.0.20
zmq: 26.2.0
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.7.2
openai: 1.63.2
tiktoken: 0.8.0
anthropic: 0.46.0
decord: 0.6.0
Hypervisor vendor: Microsoft
ulimit soft: 1048576
