Checklist
Describe the bug
This performance issue was identified in PR #8048.
sglang phi4-mm behaves similarly to vllm phi4-mm regardless of whether LoRA is enabled. However, sglang phi4-mm without LoRA behaves differently from, and worse than, Hugging Face phi4-mm without LoRA.
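For reference, the Hugging Face (no-LoRA) baseline can be run roughly as follows. This is a minimal sketch following the audio example on the Phi-4-multimodal-instruct model card (the `<|user|><|audio_1|>...<|end|><|assistant|>` prompt format and the `audios=[(audio, sample_rate)]` processor argument come from that card), reusing the example wav shipped with the snapshot:

# Minimal sketch of the Hugging Face (no-LoRA) baseline, assuming the
# prompt format and processor API shown on the Phi-4-multimodal-instruct
# model card.
import soundfile as sf
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "microsoft/Phi-4-multimodal-instruct"
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, torch_dtype="auto", device_map="cuda"
)

wav_path = (
    "/home/jobuser/.cache/huggingface/hub/"
    "models--microsoft--Phi-4-multimodal-instruct/snapshots/"
    "33e62acdd07cd7d6635badd529aa0a3467bb9c6a/examples/"
    "what_is_the_traffic_sign_in_the_image.wav"
)
audio, sample_rate = sf.read(wav_path)
prompt = (
    "<|user|><|audio_1|>Based on the attached audio, generate a comprehensive "
    "text transcription of the spoken content.<|end|><|assistant|>"
)
inputs = processor(
    text=prompt, audios=[(audio, sample_rate)], return_tensors="pt"
).to("cuda")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=1000)
# Strip the prompt tokens before decoding the transcription.
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])

This path transcribes the audio correctly, which is the behavior being compared against below.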
Reproduction
Setup
huggingface-cli download microsoft/Phi-4-multimodal-instruct
Server
python3 -m sglang.launch_server --trust-remote-code --disable-radix-cache --model-path /home/jobuser/.cache/huggingface/hub/models--microsoft--Phi-4-multimodal-instruct/snapshots/33e62acdd07cd7d6635badd529aa0a3467bb9c6a/
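Before sending requests, it helps to wait for the server to come up. A small readiness poll, sketched below, assumes the default port 30000 and sglang's /health route:

import time
import requests

# Poll the sglang server until it responds (assumes the default port 30000
# and the /health endpoint exposed by sglang's HTTP server).
for _ in range(120):
    try:
        if requests.get("http://localhost:30000/health", timeout=2).status_code == 200:
            print("server is ready")
            break
    except requests.exceptions.ConnectionError:
        pass
    time.sleep(1)
else:
    raise RuntimeError("server did not become ready in time")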
Client
import subprocess

port = 30000
# Send an audio_url + text chat request to sglang's OpenAI-compatible endpoint.
curl_command = f"""
curl -s http://localhost:{port}/v1/chat/completions \\
  -H "Content-Type: application/json" \\
  -d '{{
    "model": "default",
    "messages": [
      {{
        "role": "user",
        "content": [
          {{
            "type": "audio_url",
            "audio_url": {{
              "url": "/home/jobuser/.cache/huggingface/hub/models--microsoft--Phi-4-multimodal-instruct/snapshots/33e62acdd07cd7d6635badd529aa0a3467bb9c6a/examples/what_is_the_traffic_sign_in_the_image.wav"
            }}
          }},
          {{
            "type": "text",
            "text": "Based on the attached audio, generate a comprehensive text transcription of the spoken content."
          }}
        ]
      }}
    ],
    "temperature": 0,
    "max_tokens": 1000
  }}'
"""
response = subprocess.check_output(curl_command, shell=True).decode()
print(response)
# Actual output: `I'm sorry, but I can't view or transcribe audio content. However, if you provide the text or a description of the audio, I'd be happy to help you with a transcription or any other information you need`
# Expected output: a meaningful transcription such as `What is the traffic sign in the image?`
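The same request can also be issued with the openai Python client against sglang's OpenAI-compatible endpoint. A sketch follows, assuming the server above is listening on port 30000; note that "audio_url" is an sglang extension to the OpenAI content schema, and the api_key value is a placeholder since the local server does not check it:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
wav_path = (
    "/home/jobuser/.cache/huggingface/hub/"
    "models--microsoft--Phi-4-multimodal-instruct/snapshots/"
    "33e62acdd07cd7d6635badd529aa0a3467bb9c6a/examples/"
    "what_is_the_traffic_sign_in_the_image.wav"
)
response = client.chat.completions.create(
    model="default",
    messages=[
        {
            "role": "user",
            "content": [
                # "audio_url" content parts are passed through to sglang.
                {"type": "audio_url", "audio_url": {"url": wav_path}},
                {
                    "type": "text",
                    "text": "Based on the attached audio, generate a comprehensive "
                    "text transcription of the spoken content.",
                },
            ],
        }
    ],
    temperature=0,
    max_tokens=1000,
)
print(response.choices[0].message.content)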
Environment
(venv) jobuser [ ~/sglang ]$ python3 -m sglang.check_env
Python: 3.10.14 (main, Jul 14 2024, 22:24:12) [GCC 11.2.0]
CUDA available: True
GPU 0,1: NVIDIA H100 80GB HBM3
GPU 0,1 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.6, V12.6.77
CUDA Driver Version: 550.163.01
PyTorch: 2.7.1+cu126
sglang: 0.4.9.post2
sgl_kernel: 0.2.5
flashinfer_python: 0.2.7.post1
triton: 3.3.1
transformers: 4.53.0
torchao: 0.9.0
numpy: 2.2.6
aiohttp: 3.12.13
fastapi: 0.116.0
hf_transfer: 0.1.9
huggingface_hub: 0.33.2
interegular: 0.3.3
modelscope: 1.27.1
orjson: 3.10.18
outlines: 0.1.11
packaging: 25.0
psutil: 7.0.0
pydantic: 2.11.7
python-multipart: 0.0.20
pyzmq: 27.0.0
uvicorn: 0.35.0
uvloop: 0.21.0
vllm: Module Not Found
xgrammar: 0.1.21
openai: 1.93.1
tiktoken: 0.9.0
anthropic: 0.57.1
litellm: 1.74.0.post1
decord: 0.6.0
NVIDIA Topology:
GPU0 GPU1 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 SYS SYS SYS SYS NODE NODE 64-127,192-255 1 N/A
GPU1 NV18 X SYS SYS SYS SYS NODE NODE 64-127,192-255 1 N/A
NIC0 SYS SYS X NODE NODE NODE SYS SYS
NIC1 SYS SYS NODE X PIX NODE SYS SYS
NIC2 SYS SYS NODE PIX X NODE SYS SYS
NIC3 SYS SYS NODE NODE NODE X SYS SYS
NIC4 NODE NODE SYS SYS SYS SYS X NODE
NIC5 NODE NODE SYS SYS SYS SYS NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
ulimit soft: 10000000