Hi there, amazing work on RadixAttention and JSON constrained decoding! I am running into an unexpected performance issue when comparing sglang and vLLM. I am using the latest pip release of vLLM and a git clone of sglang from today.
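For reference, the installation steps were roughly the following (the sglang source install command is taken from its README and may differ slightly from what I actually ran):

pip install vllm                                      # latest release from PyPI
git clone https://github.com/sgl-project/sglang.git   # today's main branch
cd sglang
pip install -e "python[all]"                          # source install; exact command/extras are an assumption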
Here is the command I use to launch sglang:
python -m sglang.launch_server --model-path NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --port 30000 --tp 8
Here is the command I use to launch vLLM:
python -m vllm.entrypoints.openai.api_server --model NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --tensor-parallel-size 8
Both run in the same Conda environment with CUDA 12.1, on 8x A10G GPUs on AWS.
Here is the OpenAI-compatible curl request:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
    "messages": [
      {"role": "system", "content": "You are a helpful AI assistant"},
      {"role": "user", "content": "You are a helpful AI assistant. List 3 countries and their capitals."}
    ]
  }'
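For the sglang server, the same request is sent to port 30000, matching the --port flag in the launch command above.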
The sglang server gives me about 10 seconds of latency, while vLLM gives about 0.45 seconds. The numbers are reported after the first run to avoid any cold-start effects.
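In case the measurement method matters, this is roughly how I collect the numbers above using curl's built-in timer (a sketch; swap the port to 30000 for the sglang server, and any wall-clock timing of the request should show the same gap):

curl -s -o /dev/null -w 'total: %{time_total}s\n' http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
    "messages": [
      {"role": "system", "content": "You are a helpful AI assistant"},
      {"role": "user", "content": "You are a helpful AI assistant. List 3 countries and their capitals."}
    ]
  }'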