Description
When running the QwQ-32B model on an H20 GPU, decode performance with the fa3 attention backend is significantly lower than with flashinfer.
Performance metrics with input_len = 12000 and batch_size = 64 (per-step latency derived below):
- fa3 bf16 attention: 908.34 tok/s
- flashinfer bf16 attention: 2014.20 tok/s
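At batch size 64, these throughputs correspond to a per-decode-step latency of roughly 64 / 908.34 ≈ 70 ms for fa3 versus 64 / 2014.20 ≈ 32 ms for flashinfer, i.e. fa3 decode is about 2.2x slower per step (assuming the measured throughput is dominated by the decode phase).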
Environment
- GPU: NVIDIA H20
- SGLang commit: 4418f59
- CUDA version: 12.4
- Model: QwQ-32B
Reproduction Steps
- Start the server using the following launch script (the flashinfer baseline command used for comparison is shown after it):
python3 -m sglang.launch_server \
--host ${SERVER_IP} \
--port ${SERVER_PORT} \
--model-path ${MODEL_PATH} \
--tp 4 \
--attention-backend fa3 \
--chunked-prefill-size -1 \
--mem-fraction-static 0.9 \
--max-prefill-tokens 40960 \
--context-length 40960 \
--max-running-requests 128 \
--enable-metrics \
--trust-remote-code
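For the flashinfer comparison numbers, presumably the same launch command is used with only the backend flag swapped (a sketch; all other flags are assumed identical to the fa3 launch above):
# Baseline launch for comparison; only --attention-backend differs
python3 -m sglang.launch_server \
--host ${SERVER_IP} \
--port ${SERVER_PORT} \
--model-path ${MODEL_PATH} \
--tp 4 \
--attention-backend flashinfer \
--chunked-prefill-size -1 \
--mem-fraction-static 0.9 \
--max-prefill-tokens 40960 \
--context-length 40960 \
--max-running-requests 128 \
--enable-metrics \
--trust-remote-code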
- Run the performance benchmark using the following script:
python3 -m sglang.bench_serving \
--host 127.0.0.1 \
--port 30066 \
--backend sglang-oai \
--dataset-name random \
--request-rate 64 \
--random-input 12000 \
--random-output 2000 \
--random-range-ratio 1 \
--num-prompts 128 \
--max-concurrency 64 \
--flush-cache
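Since the server is started with --enable-metrics, server-side throughput can also be spot-checked while the benchmark runs by scraping the Prometheus endpoint (a sketch; the exact metric names depend on the SGLang version, so grep broadly rather than for a specific name):
# Spot-check server-side throughput counters during the run
curl -s http://${SERVER_IP}:${SERVER_PORT}/metrics | grep -i throughput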