Description
When running the QwQ-32B model on an H20 GPU, decode performance with the fa3 attention backend is significantly lower than with flashinfer.
Performance metrics with input_len = 12000 and batch_size = 64 (per-step latency derived below):
- fa3 bf16 attention: 908.34 tok/s
- flashinfer bf16 attention: 2014.20 tok/s
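At batch size 64, these throughputs correspond to a per-decode-step latency of roughly 64 / 908.34 ≈ 70 ms for fa3 versus 64 / 2014.20 ≈ 32 ms for flashinfer, i.e. fa3 decode is about 2.2x slower per step (assuming the measured throughput is dominated by the decode phase).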
Environment
- GPU: NVIDIA H20
- SGLang commit: 4418f59
- CUDA version: 12.4
- Model: QwQ-32B
Reproduction Steps
- Start the server using the following launch script (the flashinfer baseline command used for comparison is shown after it):
python3 -m sglang.launch_server \
--host ${SERVER_IP} \
--port ${SERVER_PORT} \
--model-path ${MODEL_PATH} \
--tp 4 \
--attention-backend fa3 \
--chunked-prefill-size -1 \
--mem-fraction-static 0.9 \
--max-prefill-tokens 40960 \
--context-length 40960 \
--max-running-requests 128 \
--enable-metrics \
--trust-remote-code
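For the flashinfer comparison numbers, presumably the same launch command is used with only the backend flag swapped (a sketch; all other flags are assumed identical to the fa3 launch above):
# Baseline launch for comparison; only --attention-backend differs
python3 -m sglang.launch_server \
--host ${SERVER_IP} \
--port ${SERVER_PORT} \
--model-path ${MODEL_PATH} \
--tp 4 \
--attention-backend flashinfer \
--chunked-prefill-size -1 \
--mem-fraction-static 0.9 \
--max-prefill-tokens 40960 \
--context-length 40960 \
--max-running-requests 128 \
--enable-metrics \
--trust-remote-code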
- Run the performance benchmark using the following script:
python3 -m sglang.bench_serving \
--host 127.0.0.1 \
--port 30066 \
--backend sglang-oai \
--dataset-name random \
--request-rate 64 \
--random-input 12000 \
--random-output 2000 \
--random-range-ratio 1 \
--num-prompts 128 \
--max-concurrency 64 \
--flush-cache
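Since the server is started with --enable-metrics, server-side throughput can also be spot-checked while the benchmark runs by scraping the Prometheus endpoint (a sketch; the exact metric names depend on the SGLang version, so grep broadly rather than for a specific name):
# Spot-check server-side throughput counters during the run
curl -s http://${SERVER_IP}:${SERVER_PORT}/metrics | grep -i throughput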