
add the fa4 mm backend and varlen func #13539

Merged
Fridge003 merged 6 commits into sgl-project:main from bzhng-development:vz/fa4-multimodal-backend on Jan 23, 2026

Conversation

@vincentzed (Contributor) commented Nov 18, 2025

Accuracy is the same, but there is no major speedup; the MMMU benchmark is not a reliable measure of how long it takes to finish.
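For context, the "varlen func" in the title refers to variable-length attention over packed multimodal sequences: image token sequences of different lengths are concatenated and delimited by cumulative sequence lengths instead of being padded to a common length. Below is a minimal illustrative sketch of how such an interface is typically driven; this is not the PR's actual code, the names follow the FlashAttention-style varlen API, and the exact FA4 entry point may differ.

import torch
import torch.nn.functional as F

# Tokens produced by the vision encoder for three images of different sizes.
seq_lens = torch.tensor([576, 1024, 256], dtype=torch.int32)

# Cumulative sequence lengths delimit each image inside the packed tensor.
cu_seqlens = F.pad(seq_lens.cumsum(0, dtype=torch.int32), (1, 0))  # [0, 576, 1600, 1856]
max_seqlen = int(seq_lens.max())

# Packed q/k/v of shape (total_tokens, num_heads, head_dim), no padding between images.
total_tokens, num_heads, head_dim = int(seq_lens.sum()), 16, 80
q = torch.randn(total_tokens, num_heads, head_dim, dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)

# A FlashAttention-style varlen call would then look roughly like:
# out = flash_attn_varlen_func(q, k, v, cu_seqlens, cu_seqlens, max_seqlen, max_seqlen)

Packing this way lets a single kernel launch handle a batch of mixed-resolution images without padding to the longest one.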

Before

python3 bench_sglang.py --concurrency 32
Preparing samples...
Loading datasets for 30 subjects...
Loading datasets:
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [00:14<00:00,  2.10it/s]
Saving images to: /root/.cache/mmmu/images
Processing samples...
Processing samples:
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 900/900 [00:00<00:00, 143684.29it/s]
Skipping 0 samples with large images, 0.0% of dataset
Samples have been prepared

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 900/900 [02:18<00:00,  6.48it/s]
Benchmark time: 138.87946422677487
answers saved to: ./answer_sglang.json
Evaluating...
answers saved to: ./answer_sglang.json
{'Accounting': {'acc': 0.333, 'num': 30},
 'Agriculture': {'acc': 0.533, 'num': 30},
 'Architecture_and_Engineering': {'acc': 0.267, 'num': 30},
 'Art': {'acc': 0.667, 'num': 30},
 'Art_Theory': {'acc': 0.867, 'num': 30},
 'Basic_Medical_Science': {'acc': 0.667, 'num': 30},
 'Biology': {'acc': 0.533, 'num': 30},
 'Chemistry': {'acc': 0.233, 'num': 30},
 'Clinical_Medicine': {'acc': 0.6, 'num': 30},
 'Computer_Science': {'acc': 0.433, 'num': 30},
 'Design': {'acc': 0.767, 'num': 30},
 'Diagnostics_and_Laboratory_Medicine': {'acc': 0.367, 'num': 30},
 'Economics': {'acc': 0.567, 'num': 30},
 'Electronics': {'acc': 0.167, 'num': 30},
 'Energy_and_Power': {'acc': 0.233, 'num': 30},
 'Finance': {'acc': 0.233, 'num': 30},
 'Geography': {'acc': 0.333, 'num': 30},
 'History': {'acc': 0.6, 'num': 30},
 'Literature': {'acc': 0.833, 'num': 30},
 'Manage': {'acc': 0.4, 'num': 30},
 'Marketing': {'acc': 0.333, 'num': 30},
 'Materials': {'acc': 0.3, 'num': 30},
 'Math': {'acc': 0.433, 'num': 30},
 'Mechanical_Engineering': {'acc': 0.267, 'num': 30},
 'Music': {'acc': 0.367, 'num': 30},
 'Overall': {'acc': 0.461, 'num': 900},
 'Overall-Art and Design': {'acc': 0.667, 'num': 120},
 'Overall-Business': {'acc': 0.373, 'num': 150},
 'Overall-Health and Medicine': {'acc': 0.507, 'num': 150},
 'Overall-Humanities and Social Science': {'acc': 0.683, 'num': 120},
 'Overall-Science': {'acc': 0.367, 'num': 150},
 'Overall-Tech and Engineering': {'acc': 0.314, 'num': 210},
 'Pharmacy': {'acc': 0.5, 'num': 30},
 'Physics': {'acc': 0.3, 'num': 30},
 'Psychology': {'acc': 0.667, 'num': 30},
 'Public_Health': {'acc': 0.4, 'num': 30},
 'Sociology': {'acc': 0.633, 'num': 30}}
eval out saved to ./val_sglang.json
Overall accuracy: 0.461

After

python3 bench_sglang.py --concurrency 32
Preparing samples...
Loading datasets for 30 subjects...
Loading datasets:
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [00:14<00:00,  2.07it/s]
Saving images to: /root/.cache/mmmu/images
Processing samples...
Processing samples:
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 900/900 [00:00<00:00, 1966.71it/s]
Skipping 0 samples with large images, 0.0% of dataset
Samples have been prepared
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 900/900 [01:32<00:00,  9.74it/s]
Benchmark time: 92.43894441286102
answers saved to: ./answer_sglang.json
Evaluating...
answers saved to: ./answer_sglang.json
{'Accounting': {'acc': 0.333, 'num': 30},
 'Agriculture': {'acc': 0.533, 'num': 30},
 'Architecture_and_Engineering': {'acc': 0.333, 'num': 30},
 'Art': {'acc': 0.667, 'num': 30},
 'Art_Theory': {'acc': 0.867, 'num': 30},
 'Basic_Medical_Science': {'acc': 0.667, 'num': 30},
 'Biology': {'acc': 0.567, 'num': 30},
 'Chemistry': {'acc': 0.3, 'num': 30},
 'Clinical_Medicine': {'acc': 0.6, 'num': 30},
 'Computer_Science': {'acc': 0.467, 'num': 30},
 'Design': {'acc': 0.767, 'num': 30},
 'Diagnostics_and_Laboratory_Medicine': {'acc': 0.367, 'num': 30},
 'Economics': {'acc': 0.633, 'num': 30},
 'Electronics': {'acc': 0.133, 'num': 30},
 'Energy_and_Power': {'acc': 0.267, 'num': 30},
 'Finance': {'acc': 0.4, 'num': 30},
 'Geography': {'acc': 0.333, 'num': 30},
 'History': {'acc': 0.6, 'num': 30},
 'Literature': {'acc': 0.833, 'num': 30},
 'Manage': {'acc': 0.433, 'num': 30},
 'Marketing': {'acc': 0.367, 'num': 30},
 'Materials': {'acc': 0.167, 'num': 30},
 'Math': {'acc': 0.333, 'num': 30},
 'Mechanical_Engineering': {'acc': 0.267, 'num': 30},
 'Music': {'acc': 0.367, 'num': 30},
 'Overall': {'acc': 0.474, 'num': 900},
 'Overall-Art and Design': {'acc': 0.667, 'num': 120},
 'Overall-Business': {'acc': 0.433, 'num': 150},
 'Overall-Health and Medicine': {'acc': 0.533, 'num': 150},
 'Overall-Humanities and Social Science': {'acc': 0.667, 'num': 120},
 'Overall-Science': {'acc': 0.38, 'num': 150},
 'Overall-Tech and Engineering': {'acc': 0.31, 'num': 210},
 'Pharmacy': {'acc': 0.567, 'num': 30},
 'Physics': {'acc': 0.367, 'num': 30},
 'Psychology': {'acc': 0.633, 'num': 30},
 'Public_Health': {'acc': 0.467, 'num': 30},
 'Sociology': {'acc': 0.6, 'num': 30}}
eval out saved to ./val_sglang.json
Overall accuracy: 0.474

Benchmarks

Launch LLM:
python -m sglang.launch_server --model-path Qwen/Qwen3-VL-8B-Instruct --port 30000 --mm-attention-backend fa4 --trust-remote-code
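Once the server is up, a request along these lines exercises the multimodal path. This is a minimal sketch, assuming the OpenAI-compatible /v1/chat/completions endpoint exposed on port 30000; the image URL is a placeholder.

import requests

payload = {
    "model": "Qwen/Qwen3-VL-8B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                # Image part first, then the text question, following the OpenAI chat format.
                {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
    "max_tokens": 128,
}

resp = requests.post("http://localhost:30000/v1/chat/completions", json=payload, timeout=60)
print(resp.json()["choices"][0]["message"]["content"])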

Before

python3 -m sglang.bench_serving --backend sglang --dataset-name image --num-prompts 250 --random-input 1024 --random-output 256 --image-count 2 --image-resolution 720p --image-format jpeg --image-content random --disable-stream --apply-chat-template --warmup-requests 100 benchmark_args=Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=None, dataset_name='image', dataset_path='', model=None, served_model_name=None, tokenizer=None, num_prompts=250, sharegpt_output_len=None, sharegpt_context_len=None, random_input_len=1024, random_output_len=256, random_range_ratio=0.0, image_count=2, image_resolution='720p', image_format='jpeg', image_content='random', request_rate=inf, use_trace_timestamps=False, max_concurrency=None, output_file=None, output_details=False, disable_tqdm=False, disable_stream=True, return_logprob=False, seed=1, disable_ignore_eos=False, extra_request_body=None, apply_chat_template=True, profile=False, profile_activities=['CPU', 'GPU'], lora_name=None, lora_request_distribution='uniform', lora_zipf_alpha=1.5, prompt_suffix='', pd_separated=False, profile_prefill_url=None, profile_decode_url=None, flush_cache=False, warmup_requests=100, tokenize_prompt=False, gsp_num_groups=64, gsp_prompts_per_group=16, gsp_system_prompt_len=2048, gsp_question_len=128, gsp_output_len=256, mooncake_slowdown_factor=1.0, mooncake_num_rounds=1, mooncake_workload='conversation', tag=None) Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=30000, dataset_name='image', dataset_path='',
model='Qwen/Qwen3-VL-8B-Instruct', served_model_name=None, tokenizer=None, num_prompts=250, sharegpt_output_len=None, sharegpt_context_len=None, random_input_len=1024, random_output_len=256, random_range_ratio=0.0, image_count=2, image_resolution='720p', image_format='jpeg', image_content='random', request_rate=inf, use_trace_timestamps=False, max_concurrency=None, output_file=None, output_details=False, disable_tqdm=False, disable_stream=True, return_logprob=False, seed=1, disable_ignore_eos=False, extra_request_body=None, apply_chat_template=True, profile=False, profile_activities=['CPU', 'GPU'], lora_name=None, lora_request_distribution='uniform', lora_zipf_alpha=1.5, prompt_suffix='', pd_separated=False, profile_prefill_url=None, profile_decode_url=None, flush_cache=False, warmup_requests=100, tokenize_prompt=False, gsp_num_groups=64, gsp_prompts_per_group=16, gsp_system_prompt_len=2048, gsp_question_len=128, gsp_output_len=256, mooncake_slowdown_factor=1.0, mooncake_num_rounds=1, mooncake_workload='conversation', tag=None)

#Input tokens: 578485
#Output tokens: 33035

Created 250 random jpeg images with average 1855859 bytes per request
Starting warmup with 100 sequences...
Warmup completed with 100 sequences. Starting main benchmark run...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 250/250 [00:09<00:00, 26.16it/s]

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     250
Benchmark duration (s):                  9.56
Total input tokens:                      578485
Total input text tokens:                 137485
Total input vision tokens:               441000
Total generated tokens:                  33035
Total generated tokens (retokenized):    26537
Request throughput (req/s):              26.14
Input token throughput (tok/s):          60490.66
Output token throughput (tok/s):         3454.38
Total token throughput (tok/s):          63945.05
Concurrency:                             200.88
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   7684.24
Median E2E Latency (ms):                 7688.81
---------------Time to First Token----------------
Mean TTFT (ms):                          7658.62
Median TTFT (ms):                        7688.85
P99 TTFT (ms):                           9153.62
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          -0.00
Median TPOT (ms):                        -0.00
P99 TPOT (ms):                           -0.00
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00
Median ITL (ms):                         0.00
P95 ITL (ms):                            0.00
P99 ITL (ms):                            0.00
Max ITL (ms):                            0.00
==================================================

After

python3 -m sglang.bench_serving --backend sglang --dataset-name image --num-prompts 250 --random-input 1024 --random-output 256 --image-count 2 --image-resolution 720p --image-format jpeg --image-content random --disable-stream --apply-chat-template --warmup-requests 100 benchmark_args=Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=None, dataset_name='image', dataset_path='', model=None, served_model_name=None, tokenizer=None, num_prompts=250, sharegpt_output_len=None, sharegpt_context_len=None, random_input_len=1024, random_output_len=256, random_range_ratio=0.0, image_count=2, image_resolution='720p', image_format='jpeg', image_content='random', request_rate=inf, use_trace_timestamps=False, max_concurrency=None, output_file=None, output_details=False, disable_tqdm=False, disable_stream=True, return_logprob=False, seed=1, disable_ignore_eos=False, extra_request_body=None, apply_chat_template=True, profile=False, profile_activities=['CPU', 'GPU'], lora_name=None, lora_request_distribution='uniform', lora_zipf_alpha=1.5, prompt_suffix='', pd_separated=False, profile_prefill_url=None, profile_decode_url=None, flush_cache=False, warmup_requests=100, tokenize_prompt=False, gsp_num_groups=64, gsp_prompts_per_group=16, gsp_system_prompt_len=2048, gsp_question_len=128, gsp_output_len=256, mooncake_slowdown_factor=1.0, mooncake_num_rounds=1, mooncake_workload='conversation', tag=None) Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=30000, dataset_name='image', dataset_path='',
model='Qwen/Qwen3-VL-8B-Instruct', served_model_name=None, tokenizer=None, num_prompts=250, sharegpt_output_len=None, sharegpt_context_len=None, random_input_len=1024, random_output_len=256, random_range_ratio=0.0, image_count=2, image_resolution='720p', image_format='jpeg', image_content='random', request_rate=inf, use_trace_timestamps=False, max_concurrency=None, output_file=None, output_details=False, disable_tqdm=False, disable_stream=True, return_logprob=False, seed=1, disable_ignore_eos=False, extra_request_body=None, apply_chat_template=True, profile=False, profile_activities=['CPU', 'GPU'], lora_name=None, lora_request_distribution='uniform', lora_zipf_alpha=1.5, prompt_suffix='', pd_separated=False, profile_prefill_url=None, profile_decode_url=None, flush_cache=False, warmup_requests=100, tokenize_prompt=False, gsp_num_groups=64, gsp_prompts_per_group=16, gsp_system_prompt_len=2048, gsp_question_len=128, gsp_output_len=256, mooncake_slowdown_factor=1.0, mooncake_num_rounds=1, mooncake_workload='conversation', tag=None)

#Input tokens: 578367
#Output tokens: 33035

Created 250 random jpeg images with average 1855859 bytes per request
Starting warmup with 100 sequences...
Warmup completed with 100 sequences. Starting main benchmark run...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 250/250 [00:09<00:00, 26.74it/s]

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     250
Benchmark duration (s):                  9.36
Total input tokens:                      578367
Total input text tokens:                 137367
Total input vision tokens:               441000
Total generated tokens:                  33035
Total generated tokens (retokenized):    27435
Request throughput (req/s):              26.72
Input token throughput (tok/s):          61815.98
Output token throughput (tok/s):         3530.79
Total token throughput (tok/s):          65346.77
Concurrency:                             201.59
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   7544.50
Median E2E Latency (ms):                 7626.54
---------------Time to First Token----------------
Mean TTFT (ms):                          7544.54
Median TTFT (ms):                        7626.59
P99 TTFT (ms):                           9019.96
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          -0.00
Median TPOT (ms):                        -0.00
P99 TPOT (ms):                           -0.00
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00
Median ITL (ms):                         0.00
P95 ITL (ms):                            0.00
P99 ITL (ms):                            0.00
Max ITL (ms):                            0.00
==================================================


github-actions bot added the documentation (Improvements or additions to documentation) and Multi-modal (multi-modal language model) labels on Nov 18, 2025
vincentzed marked this pull request as ready for review on November 19, 2025 01:44
b8zhong added the run-ci and format (Auto Format Code) labels on Nov 19, 2025
@yuan-luo (Collaborator) commented Dec 1, 2025

Please format the comments in the description. I'll follow up.

b8zhong force-pushed the vz/fa4-multimodal-backend branch from f646aee to efa95f3 on December 2, 2025 06:57
@b8zhong (Collaborator) commented Dec 2, 2025

On B200:

cd test/nightly
NIGHTLY_VLM_MODELS="Qwen/Qwen2.5-VL-7B-Instruct" python3 -m unittest test_vlms_perf.TestNightlyVLMModelsPerformance.test_bench_one_batch
[screenshot: test_bench_one_batch results on B200]

@b8zhong (Collaborator) commented Dec 2, 2025

/tag-and-rerun-ci again

Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com>
Co-authored-by: vincentzed <207368749+vincentzed@users.noreply.github.com>
@JustinTong0323 (Collaborator) commented:

/tag-and-rerun-ci

b8zhong enabled auto-merge (squash) on January 14, 2026 21:58
@Fridge003 (Collaborator) commented:

Hi @vincentzed, please resolve the conflicts.

Fridge003 disabled auto-merge on January 17, 2026 01:13
@Fridge003 (Collaborator) commented Jan 18, 2026

@b8zhong (Collaborator) commented Jan 18, 2026

@Fridge003 Thanks, yes, it's from the SM120 device not supporting FA4 yet. Should be fixed in c9591ae.
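For illustration, the fallback can be thought of as a compute-capability guard along these lines. This is a minimal sketch under assumptions, not the actual change in c9591ae; the supported-capability set and the Triton fallback are assumptions.

import torch

def pick_mm_attention_backend(requested: str) -> str:
    # Assumed FA4-capable architectures; SM120, i.e. (12, 0), is not among them yet.
    fa4_supported = {(9, 0), (10, 0)}
    if requested == "fa4" and torch.cuda.is_available():
        if torch.cuda.get_device_capability() not in fa4_supported:
            return "triton"  # assumed fallback backend for unsupported devices
    return requested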

@vincentzed (Contributor, Author) commented:

FA4

#Input tokens: 578402
#Output tokens: 32935
#Total images: 500
#Images per request: 2 (fixed)

Created 250 random jpeg images with average 1855818 bytes per request
Starting warmup with 100 sequences...
Warmup completed with 100 sequences. Starting main benchmark run...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 250/250 [00:05<00:00, 46.20it/s]

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     248       
Benchmark duration (s):                  5.42      
Total input tokens:                      573021    
Total input text tokens:                 135549    
Total input vision tokens:               437472    
Total generated tokens:                  32677     
Total generated tokens (retokenized):    26713     
Request throughput (req/s):              45.74     
Input token throughput (tok/s):          105675.61 
Output token throughput (tok/s):         6026.24   
Peak output token throughput (tok/s):    142.00    
Peak concurrent requests:                248       
Total token throughput (tok/s):          111701.85 
Concurrency:                             199.81    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   4368.69   
Median E2E Latency (ms):                 4417.58   
P90 E2E Latency (ms):                    5020.73   
P99 E2E Latency (ms):                    5191.82   
---------------Time to First Token----------------
Mean TTFT (ms):                          4354.54   
Median TTFT (ms):                        4417.59   
P99 TTFT (ms):                           5191.83   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          -0.00     
Median TPOT (ms):                        -0.00     
P99 TPOT (ms):                           -0.00     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================

Triton

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     249       
Benchmark duration (s):                  5.90      
Total input tokens:                      575996    
Total input text tokens:                 136760    
Total input vision tokens:               439236    
Total generated tokens:                  32799     
Total generated tokens (retokenized):    27011     
Request throughput (req/s):              42.23     
Input token throughput (tok/s):          97682.89  
Output token throughput (tok/s):         5562.37   
Peak output token throughput (tok/s):    123.00    
Peak concurrent requests:                249       
Total token throughput (tok/s):          103245.26 
Concurrency:                             195.36    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   4626.31   
Median E2E Latency (ms):                 4621.84   
P90 E2E Latency (ms):                    5401.82   
P99 E2E Latency (ms):                    5634.61   
---------------Time to First Token----------------
Mean TTFT (ms):                          4626.32   
Median TTFT (ms):                        4621.85   
P99 TTFT (ms):                           5634.62   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          -0.00     
Median TPOT (ms):                        -0.00     
P99 TPOT (ms):                           -0.00     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================

They are similar at low batch size.

Two review threads on python/sglang/srt/layers/attention/vision.py (outdated).
Fridge003 merged commit 08fcda2 into sgl-project:main on Jan 23, 2026
141 of 157 checks passed
b8zhong deleted the vz/fa4-multimodal-backend branch on February 4, 2026 00:05
Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026
Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com>
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>

Labels

documentation (Improvements or additions to documentation), format (Auto Format Code), high priority, Multi-modal (multi-modal language model), run-ci
