
Fix regression caused by fa3 block_table #15009

Merged
Fridge003 merged 1 commit into sgl-project:main from wenscarl:fix_fa3_metadata_regression on Dec 13, 2025

Conversation

@wenscarl (Collaborator) commented Dec 12, 2025

Motivation

Fixes: #14665 [Bug] Llama4 on DGXH100 has regressed by 14.23%

Modifications

Accuracy Tests

Benchmarking and Profiling

Benchmark script:

python3 -m sglang.bench_one_batch \
    --model-path meta-llama/Llama-4-Scout-17B-16E \
    --disable-radix-cache \
    --context-length 2048 \
    --attention-backend fa3 \
    --load-format dummy \
    --batch-size 128 \
    --tp-size 8 \
    --input-len 1000 \
    --output-len 1000

Before:

Prefill. latency: 2.12386 s, throughput:  60267.49 token/s
Decode 0. Batch size: 128, latency: 0.17131 s, throughput:    747.20 token/s
Decode 1. Batch size: 128, latency: 0.11283 s, throughput:   1134.45 token/s
Decode 2. Batch size: 128, latency: 0.11220 s, throughput:   1140.83 token/s
Decode 3. Batch size: 128, latency: 0.11324 s, throughput:   1130.34 token/s
Decode 4. Batch size: 128, latency: 0.11100 s, throughput:   1153.15 token/s
Decode.  median latency: 0.11281 s, median throughput:   1134.62 token/s
Total. latency:  5.681 s, throughput:  23253.82 token/s
Benchmark ...
Prefill. latency: 1.81066 s, throughput:  70692.62 token/s
Decode 0. Batch size: 128, latency: 0.11238 s, throughput:   1139.00 token/s
Decode 1. Batch size: 128, latency: 0.11217 s, throughput:   1141.11 token/s
Decode 2. Batch size: 128, latency: 0.11157 s, throughput:   1147.31 token/s
Decode 3. Batch size: 128, latency: 0.11164 s, throughput:   1146.54 token/s
Decode 4. Batch size: 128, latency: 0.11063 s, throughput:   1157.03 token/s
Decode.  median latency: 0.15820 s, median throughput:    809.09 token/s
Total. latency: 160.106 s, throughput:   1598.95 token/s

After:

Prefill. latency: 2.03622 s, throughput:  62861.66 token/s
Decode 0. Batch size: 128, latency: 0.08541 s, throughput:   1498.60 token/s
Decode 1. Batch size: 128, latency: 0.01811 s, throughput:   7067.50 token/s
Decode 2. Batch size: 128, latency: 0.01751 s, throughput:   7309.17 token/s
Decode 3. Batch size: 128, latency: 0.01738 s, throughput:   7365.30 token/s
Decode 4. Batch size: 128, latency: 0.01736 s, throughput:   7371.69 token/s
Decode.  median latency: 0.01744 s, median throughput:   7338.32 token/s
Total. latency:  2.646 s, throughput:  49923.53 token/s
Benchmark ...
Prefill. latency: 1.72161 s, throughput:  74349.17 token/s
Decode 0. Batch size: 128, latency: 0.01989 s, throughput:   6435.87 token/s
Decode 1. Batch size: 128, latency: 0.01811 s, throughput:   7067.99 token/s
Decode 2. Batch size: 128, latency: 0.01759 s, throughput:   7276.41 token/s
Decode 3. Batch size: 128, latency: 0.01768 s, throughput:   7241.46 token/s
Decode 4. Batch size: 128, latency: 0.01753 s, throughput:   7303.33 token/s
Decode.  median latency: 0.01828 s, median throughput:   7003.46 token/s
Total. latency: 20.002 s, throughput:  12798.53 token/s
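As a quick sanity check on the numbers above, the decode speedup implied by the reported median latencies can be recomputed directly (per-step decode throughput is batch size divided by step latency; the figures here are the medians from the first run before and after the fix, so small rounding differences versus the logged throughput are expected):

```python
# Sanity check of the decode speedup implied by the reported medians.
# Latencies are taken verbatim from the first benchmark run above.
batch_size = 128
before_median_latency_s = 0.11281  # decode median, before the fix
after_median_latency_s = 0.01744   # decode median, after the fix

# Per-step decode throughput: one token per sequence per step.
before_tput = batch_size / before_median_latency_s  # ~1134 token/s
after_tput = batch_size / after_median_latency_s    # ~7339 token/s

speedup = before_median_latency_s / after_median_latency_s
print(f"decode speedup: {speedup:.2f}x")  # → decode speedup: 6.47x
```

So the fix recovers roughly a 6.5x improvement in median decode throughput in this configuration, well beyond the 14.23% regression reported in #14665 (the regression measurement in the issue used a different workload).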

Checklist


@Kangyan-Zhou
Collaborator

/tag-and-rerun-ci


@Fridge003 Fridge003 left a comment


Nice catch

@Fridge003
Collaborator

/rerun-failed-ci

@Fridge003
Collaborator

Failed B200 tests are unrelated to the fa3 backend.

@Fridge003 Fridge003 merged commit 665cb02 into sgl-project:main Dec 13, 2025
210 of 233 checks passed
Liwansi added a commit to iforgetmyname/sglang that referenced this pull request Dec 13, 2025
…n_eagle3_npu

* 'main' of https://github.com/sgl-project/sglang: (25 commits)
  [NPU] perf update with kvcache nz & w4a8 quant (sgl-project#14423)
  [PP Prefill][NIXL] Fix PP mode transfer completion tracking to wait for all ranks (sgl-project#15027)
  Fix GLM-4.6 tool calls don't support streaming output for arguments i… (sgl-project#13989)
  feature: adding nightly wheel workflow and indexer (sgl-project#14924)
  [diffusion] feat: Improve LoRA compatibility by adding unified format detection and diffusers-based normalization (sgl-project#14659)
  [Fix] Disable trtllm moe backend for draft model for a qucik fix (sgl-project#15002)
  [diffusion] fix: use NDRotaryEmbedding in flux_2   (sgl-project#15034)
  Mistral Large 3 NVFP4 support (sgl-project#14485)
  call check_quantized_moe_compatibility after initialize (sgl-project#13876)
  Add sgl_router_attempt_http_responses_total for single attempt information (sgl-project#15037)
  Add error code in prometheus metrics and add X-SMG-Error-Code header (sgl-project#15036)
  Provide more fine grained error reason for reqwest error (sgl-project#15032)
  Tiny change http router response format to unify (sgl-project#15031)
  Tiny unify grpc existing error responses into new format (sgl-project#15030)
  Add `code` field and unify error responses for router (sgl-project#15028)
  Super tiny remove unused log_request (sgl-project#15035)
  Fix decode OOM caused by retraction (sgl-project#14939)
  [CI]Add gb200 runner back (sgl-project#15024)
  Add a special label for b200 CI runner that can run kernel tests (sgl-project#15033)
  Fix regression caused by fa3 block_table (sgl-project#15009)
  ...

# Conflicts:
#	python/sglang/srt/hardware_backend/npu/attention/ascend_backend.py
Prozac614 pushed a commit to Prozac614/sglang that referenced this pull request Dec 17, 2025
YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026

Successfully merging this pull request may close these issues.

[Bug] Llama4 on DGXH100 has regressed by 14.23%
