Hi there, amazing work on RadixAttention and JSON constrained decoding! I am running into an unexpected performance issue when comparing sglang and vLLM. I am using the latest pip release of vLLM and a git clone of sglang from today.
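For reference, the installation steps were roughly the following (the sglang source install command is taken from its README and may differ slightly from what I actually ran):

pip install vllm                                      # latest release from PyPI
git clone https://github.com/sgl-project/sglang.git   # today's main branch
cd sglang
pip install -e "python[all]"                          # source install; exact command/extras are an assumption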
Here is the command I use to launch sglang:
python -m sglang.launch_server --model-path NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --port 30000 --tp 8
Here is the command I use to launch vLLM:
python -m vllm.entrypoints.openai.api_server --model NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --tensor-parallel-size 8
Both run in the same Conda environment with CUDA 12.1, on 8x A10G GPUs on AWS.
Here is the OpenAI-compatible curl request:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
    "messages": [
      {"role": "system", "content": "You are a helpful AI assistant"},
      {"role": "user", "content": "You are a helpful AI assistant. List 3 countries and their capitals."}
    ]
  }'
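For the sglang server, the same request is sent to port 30000, matching the --port flag in the launch command above.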
The sglang server gives me about 10 seconds of latency, while vLLM gives about 0.45 seconds. The numbers are reported after the first run to avoid any cold-start effects.
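In case the measurement method matters, this is roughly how I collect the numbers above using curl's built-in timer (a sketch; swap the port to 30000 for the sglang server, and any wall-clock timing of the request should show the same gap):

curl -s -o /dev/null -w 'total: %{time_total}s\n' http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
    "messages": [
      {"role": "system", "content": "You are a helpful AI assistant"},
      {"role": "user", "content": "You are a helpful AI assistant. List 3 countries and their capitals."}
    ]
  }'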