fa3 decode performance significantly lower than flashinfer on H20 #5630

@cscyuge

Description

When running the QwQ-32B model on an H20 GPU, decode throughput with the fa3 attention backend is significantly lower than with flashinfer.

Decode throughput with input_len = 12000, batch_size = 64:

  • fa3 bf16 attention: 908.34 tok/s
  • flashinfer bf16 attention: 2014.20 tok/s
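
Under this workload, flashinfer's decode throughput is roughly 2.2× that of fa3.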

Environment

  • GPU: NVIDIA H20
  • SGLang commit: 4418f59
  • CUDA version: 12.4
  • Model: QwQ-32B

Reproduction Steps

  1. Start the service using the following launch script (a flashinfer comparison launch is sketched after this list):
python3 -m sglang.launch_server \
        --host ${SERVER_IP} \
        --port ${SERVER_PORT} \
        --model-path ${MODEL_PATH} \
        --tp 4 \
        --attention-backend fa3 \
        --chunked-prefill-size -1 \
        --mem-fraction-static 0.9 \
        --max-prefill-tokens 40960 \
        --context-length 40960 \
        --max-running-requests 128 \
        --enable-metrics \
        --trust-remote-code
  2. Run the performance benchmark using the following script:
python3 -m sglang.bench_serving \
    --host 127.0.0.1 \
    --port 30066 \
    --backend sglang-oai \
    --dataset-name random \
    --request-rate 64 \
    --random-input 12000 \
    --random-output 2000 \
    --random-range-ratio 1 \
    --num-prompts 128 \
    --max-concurrency 64 \
    --flush-cache
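
For the flashinfer comparison, the launch is assumed identical to step 1 with only the attention backend flag swapped (minimal sketch; all other flags unchanged):

python3 -m sglang.launch_server \
        --host ${SERVER_IP} \
        --port ${SERVER_PORT} \
        --model-path ${MODEL_PATH} \
        --tp 4 \
        --attention-backend flashinfer \
        --chunked-prefill-size -1 \
        --mem-fraction-static 0.9 \
        --max-prefill-tokens 40960 \
        --context-length 40960 \
        --max-running-requests 128 \
        --enable-metrics \
        --trust-remote-code

Note that with --num-prompts 128 and --max-concurrency 64, at most 64 requests run concurrently, which matches the batch_size = 64 used for the throughput figures above.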
