
benchmark: enhance configurable multimodal benchmarking in bench_serving#9812

Merged
zhyncs merged 33 commits into sgl-project:main from AlienKevin:vlm-bench-serving
Oct 8, 2025
Conversation

Contributor

@AlienKevin AlienKevin commented Aug 30, 2025

Motivation

#9583 introduced configurable multimodal benchmarking to bench_serving. This PR refines it in three ways, positioning bench_serving as a reliable way to benchmark and compare VLM performance within and across inference frameworks.

  • Count vision tokens in throughput metrics, with a clear per-modality breakdown: Vision-prefill benchmarks often have ~0 input text tokens, ~1,000 image tokens, and ~0 output text tokens. If we count only text tokens, key metrics misleadingly read as zero (e.g., total input tokens, input token throughput, total token throughput). Following discussions with @mickqian, all metrics now include vision tokens, and we report separate counts for text vs. vision tokens for users who need the details (see the accounting sketch after this list).

  • Broaden image options with PNG support and a blank image mode: We add PNG alongside JPEG and introduce a blank image option to simulate different preprocessing overheads. Random PNGs are much larger than JPEGs, stress-testing preprocessing; blank PNGs are tiny and ideal for isolating prefill/decode performance with minimal I/O overhead. Blank images are also useful for verifying that the multimodal cache is functioning properly (see the image-size sketch after this list).

  • Replace the deprecated max_tokens with max_completion_tokens: max_tokens can be confusing, especially when random-output-len is set below prompt_len, since users often assume the output length excludes the input. Switching to max_completion_tokens removes that ambiguity.
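
To make the new accounting concrete (first bullet above), here is a minimal sketch of the per-modality bookkeeping; the dataclass and field names are illustrative, not the exact ones used in bench_serving:

from dataclasses import dataclass

# Hypothetical per-request record; field names are illustrative only.
@dataclass
class RequestTokens:
    input_text: int
    input_vision: int
    output_text: int

def summarize(requests: list[RequestTokens], duration_s: float) -> dict:
    input_text = sum(r.input_text for r in requests)
    input_vision = sum(r.input_vision for r in requests)
    output_text = sum(r.output_text for r in requests)
    # Vision tokens count toward the headline metrics, so vision-prefill
    # runs no longer report ~0 input tokens; the text/vision split is
    # still reported separately for users who need the breakdown.
    total_input = input_text + input_vision
    return {
        "total_input_tokens": total_input,
        "total_input_text_tokens": input_text,
        "total_input_vision_tokens": input_vision,
        "input_token_throughput": total_input / duration_s,
        "total_token_throughput": (total_input + output_text) / duration_s,
    }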

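For the image options (second bullet), a small sketch assuming Pillow and NumPy shows why the random-vs-blank distinction matters for I/O: random pixels defeat PNG's lossless compression, so a random PNG at the 1120x700 resolution used below is orders of magnitude larger than a blank one:

import io

import numpy as np
from PIL import Image

def encoded_bytes(img: Image.Image, fmt: str) -> int:
    buf = io.BytesIO()
    img.save(buf, format=fmt)
    return buf.tell()

w, h = 1120, 700
random_img = Image.fromarray(np.random.randint(0, 256, (h, w, 3), dtype=np.uint8))
blank_img = Image.new("RGB", (w, h))  # solid black; compresses to almost nothing

for name, img in [("random", random_img), ("blank", blank_img)]:
    print(f"{name}: png={encoded_bytes(img, 'PNG')} bytes, "
          f"jpeg={encoded_bytes(img, 'JPEG')} bytes")
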
Tests

We tested Qwen2.5-VL-7B-Instruct with this PR on SGLang and vLLM. Since this PR also touches the MMMU dataset in bench_serving, we evaluate both MMMU and a fixed ISL/OSL setup. Each benchmark consists of three consecutive runs, with the server kept alive between runs for consistency. The results match previous benchmarks and show that SGL is currently slower than vLLM (~13% slower on MMMU and ~46% slower on ISL1000/OSL1). Much of the slowdown comes from the scheduler/preprocessing, as the GPU is idle for much longer in SGL.

SGL

Benchmark       Total tokens/s (per run)    Average total tokens/s / Std
MMMU (Math)     9,154 • 9,396 • 10,242      9,597 / 571
ISL1000/OSL1    9,259 • 9,781 • 9,838       9,626 / 319

vLLM

Benchmark       Total tokens/s (per run)    Average total tokens/s / Std
MMMU (Math)     10,960 • 11,095 • 10,992    11,016 / 71
ISL1000/OSL1    17,934 • 17,799 • 17,697    17,810 / 119
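
For reference, the per-benchmark averages and standard deviations above are the sample mean and sample standard deviation of the three runs, e.g.:

import statistics

runs = [9154, 9396, 10242]  # SGL MMMU (Math), total tokens/s per run
print(round(statistics.mean(runs)), round(statistics.stdev(runs)))  # 9597 571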

Details for SGL

The SGL server command follows common benchmarking practice: it sets the chunked prefill size to 8192 and disables the multimodal cache (SGLANG_VLM_CACHE_SIZE_MB=0) and prefix caching (--disable-radix-cache).

SGLANG_VLM_CACHE_SIZE_MB=0 python -m sglang.launch_server \
    --model-path Qwen/Qwen2.5-VL-7B-Instruct \
    --mem-fraction-static 0.8 \
    --chat-template 'qwen2-vl' \
    --tp 1 \
    --disable-radix-cache \
    --cuda-graph-bs 256 \
    --cuda-graph-max-bs 256 \
    --chunked-prefill-size 8192 \
    --max-prefill-tokens 8192 \
    --max-running-requests 256 \
    --enable-multimodal

MMMU:

python3 -m sglang.bench_serving \
    --backend sglang-oai-chat \
    --dataset-name mmmu \
    --num-prompts 1000 \
    --apply-chat-template

Run 1:

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     505       
Benchmark duration (s):                  79.91     
Total input tokens:                      214424    
Total input text tokens:                 44483     
Total input vision tokens:               169941    
Total generated tokens:                  517120    
Total generated tokens (retokenized):    348374    
Request throughput (req/s):              6.32      
Input token throughput (tok/s):          2683.25   
Output token throughput (tok/s):         6471.11   
Total token throughput (tok/s):          9154.36   
Concurrency:                             379.43    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   60041.25  
Median E2E Latency (ms):                 41262.11  
---------------Time to First Token----------------
Mean TTFT (ms):                          29677.97  
Median TTFT (ms):                        17249.72  
P99 TTFT (ms):                           50016.25  
---------------Inter-Token Latency----------------
Mean ITL (ms):                           38.01     
Median ITL (ms):                         27.78     
P95 ITL (ms):                            63.60     
P99 ITL (ms):                            66.07     
Max ITL (ms):                            8671.55   
==================================================
Run 2 and 3

Run 2:

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     505       
Benchmark duration (s):                  77.86     
Total input tokens:                      214424    
Total input text tokens:                 44483     
Total input vision tokens:               169941    
Total generated tokens:                  517120    
Total generated tokens (retokenized):    348422    
Request throughput (req/s):              6.49      
Input token throughput (tok/s):          2753.97   
Output token throughput (tok/s):         6641.66   
Total token throughput (tok/s):          9395.63   
Concurrency:                             388.16    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   59846.31  
Median E2E Latency (ms):                 42907.69  
---------------Time to First Token----------------
Mean TTFT (ms):                          31200.22  
Median TTFT (ms):                        16632.26  
P99 TTFT (ms):                           53494.78  
---------------Inter-Token Latency----------------
Mean ITL (ms):                           35.86     
Median ITL (ms):                         24.72     
P95 ITL (ms):                            56.25     
P99 ITL (ms):                            64.21     
Max ITL (ms):                            10912.03  
==================================================

Run 3:

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     505       
Benchmark duration (s):                  71.43     
Total input tokens:                      214424    
Total input text tokens:                 44483     
Total input vision tokens:               169941    
Total generated tokens:                  517120    
Total generated tokens (retokenized):    349392    
Request throughput (req/s):              7.07      
Input token throughput (tok/s):          3002.07   
Output token throughput (tok/s):         7240.00   
Total token throughput (tok/s):          10242.06  
Concurrency:                             384.92    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   54442.27  
Median E2E Latency (ms):                 38471.66  
---------------Time to First Token----------------
Mean TTFT (ms):                          27549.69  
Median TTFT (ms):                        15780.27  
P99 TTFT (ms):                           46930.68  
---------------Inter-Token Latency----------------
Mean ITL (ms):                           33.63     
Median ITL (ms):                         23.31     
P95 ITL (ms):                            49.35     
P99 ITL (ms):                            51.27     
Max ITL (ms):                            8770.60   
==================================================

ISL1000/OSL1

We want to fix the input length at roughly 1,000 tokens to test the standard ISL1000 setting for vision prefill. We choose an input image size of 1120x700 because Qwen breaks the image into 14x14 patches followed by a 2x2 pixel-unshuffle, yielding 1,000 visual tokens at the output of the vision encoder. The average input token count is 22 tokens above 1,000 because of the two additional image start and end tokens and boilerplate from Qwen's chat template.
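
As a sanity check, the patch arithmetic works out as follows (a back-of-the-envelope sketch; the actual Qwen2.5-VL processor may round dimensions slightly differently):

# Qwen2.5-VL: 14x14 patches, then a 2x2 pixel-unshuffle (merge) in the encoder.
width, height, patch, merge = 1120, 700, 14, 2
patches = (width // patch) * (height // patch)  # 80 * 50 = 4000
vision_tokens = patches // (merge * merge)      # 4000 // 4 = 1000
print(vision_tokens)  # -> 1000

This is consistent with the reported totals: 769,536 vision tokens / 768 requests = 1,002 per request, i.e., 1,000 encoder tokens plus the two image start/end tokens.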

python3 -m sglang.bench_serving \
    --backend sglang-oai-chat \
    --dataset-name image \
    --num-prompts 768 \
    --apply-chat-template \
    --random-output-len 1 \
    --random-input-len 1 \
    --image-resolution 1120x700 \
    --image-format jpeg \
    --image-count 1 \
    --image-content random \
    --random-range-ratio 1

Run 1:

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     768       
Benchmark duration (s):                  84.86     
Total input tokens:                      784915    
Total input text tokens:                 15379     
Total input vision tokens:               769536    
Total generated tokens:                  768       
Total generated tokens (retokenized):    768       
Request throughput (req/s):              9.05      
Input token throughput (tok/s):          9249.90   
Output token throughput (tok/s):         9.05      
Total token throughput (tok/s):          9258.95   
Concurrency:                             524.26    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   57925.98  
Median E2E Latency (ms):                 58398.53  
---------------Time to First Token----------------
Mean TTFT (ms):                          57925.95  
Median TTFT (ms):                        58398.49  
P99 TTFT (ms):                           83143.86  
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================
Run 2 and 3

Run 2:

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     768       
Benchmark duration (s):                  80.33     
Total input tokens:                      784909    
Total input text tokens:                 15373     
Total input vision tokens:               769536    
Total generated tokens:                  768       
Total generated tokens (retokenized):    768       
Request throughput (req/s):              9.56      
Input token throughput (tok/s):          9771.56   
Output token throughput (tok/s):         9.56      
Total token throughput (tok/s):          9781.12   
Concurrency:                             507.40    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   53069.90  
Median E2E Latency (ms):                 53704.35  
---------------Time to First Token----------------
Mean TTFT (ms):                          53069.87  
Median TTFT (ms):                        53704.31  
P99 TTFT (ms):                           78548.58  
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================

Run 3:

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     768       
Benchmark duration (s):                  79.86     
Total input tokens:                      784912    
Total input text tokens:                 15376     
Total input vision tokens:               769536    
Total generated tokens:                  768       
Total generated tokens (retokenized):    768       
Request throughput (req/s):              9.62      
Input token throughput (tok/s):          9828.56   
Output token throughput (tok/s):         9.62      
Total token throughput (tok/s):          9838.18   
Concurrency:                             506.75    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   52694.52  
Median E2E Latency (ms):                 52875.04  
---------------Time to First Token----------------
Mean TTFT (ms):                          52694.50  
Median TTFT (ms):                        52875.02  
P99 TTFT (ms):                           78287.03  
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================
python3 -m sglang.check_env
Python: 3.12.11 (main, Jun  4 2025, 08:56:18) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H200
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.8, V12.8.93
CUDA Driver Version: 575.57.08
PyTorch: 2.8.0+cu128
sglang: 0.5.1.post3
sgl_kernel: 0.3.7
flashinfer_python: 0.2.14.post1
triton: 3.4.0
transformers: 4.55.2
torchao: 0.9.0
numpy: 2.3.2
aiohttp: 3.12.15
fastapi: 0.116.1
hf_transfer: 0.1.9
huggingface_hub: 0.34.4
interegular: 0.3.3
modelscope: 1.29.0
orjson: 3.11.2
outlines: 0.1.11
packaging: 25.0
psutil: 7.0.0
pydantic: 2.11.7
python-multipart: 0.0.20
pyzmq: 27.0.2
uvicorn: 0.35.0
uvloop: 0.21.0
vllm: Module Not Found
xgrammar: 0.1.23
openai: 1.99.1
tiktoken: 0.11.0
anthropic: 0.64.0
litellm: Module Not Found
decord: 0.6.0
NVIDIA Topology: 
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5  NIC6     NIC7    NIC8    NIC9    NIC10   NIC11   CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    PXB     NODE    NODE    NODE    NODE    NODE  SYS      SYS     SYS     SYS     SYS     SYS     0-55,112-167    0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    NODE    NODE    NODE    PXB     NODE    NODE  SYS      SYS     SYS     SYS     SYS     SYS     0-55,112-167    0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    NODE    NODE    NODE    NODE    PXB     NODE  SYS      SYS     SYS     SYS     SYS     SYS     0-55,112-167    0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    NODE    NODE    NODE    NODE    NODE    PXB   SYS      SYS     SYS     SYS     SYS     SYS     0-55,112-167    0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS   PXB      NODE    NODE    NODE    NODE    NODE    56-111,168-223  1               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS   NODE     NODE    NODE    PXB     NODE    NODE    56-111,168-223  1               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    SYS     SYS     SYS     SYS     SYS     SYS   NODE     NODE    NODE    NODE    PXB     NODE    56-111,168-223  1               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      SYS     SYS     SYS     SYS     SYS     SYS   NODE     NODE    NODE    NODE    NODE    PXB     56-111,168-223  1               N/A
NIC0    PXB     NODE    NODE    NODE    SYS     SYS     SYS     SYS      X      NODE    NODE    NODE    NODE    NODE  SYS      SYS     SYS     SYS     SYS     SYS
NIC1    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE     X      PIX     NODE    NODE    NODE  SYS      SYS     SYS     SYS     SYS     SYS
NIC2    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    PIX      X      NODE    NODE    NODE  SYS      SYS     SYS     SYS     SYS     SYS
NIC3    NODE    PXB     NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    NODE     X      NODE    NODE  SYS      SYS     SYS     SYS     SYS     SYS
NIC4    NODE    NODE    PXB     NODE    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE     X      NODE  SYS      SYS     SYS     SYS     SYS     SYS
NIC5    NODE    NODE    NODE    PXB     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    NODE     X    SYS      SYS     SYS     SYS     SYS     SYS
NIC6    SYS     SYS     SYS     SYS     PXB     NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS    X       NODE    NODE    NODE    NODE    NODE
NIC7    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS   NODE      X      PIX     NODE    NODE    NODE
NIC8    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS   NODE     PIX      X      NODE    NODE    NODE
NIC9    SYS     SYS     SYS     SYS     NODE    PXB     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS   NODE     NODE    NODE     X      NODE    NODE
NIC10   SYS     SYS     SYS     SYS     NODE    NODE    PXB     NODE    SYS     SYS     SYS     SYS     SYS     SYS   NODE     NODE    NODE    NODE     X      NODE
NIC11   SYS     SYS     SYS     SYS     NODE    NODE    NODE    PXB     SYS     SYS     SYS     SYS     SYS     SYS   NODE     NODE    NODE    NODE    NODE     X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8
  NIC9: mlx5_9
  NIC10: mlx5_10
  NIC11: mlx5_11


ulimit soft: 1048576

Details for vLLM

The vLLM server command sets the chunked prefill size to 8192 and disables the multimodal preprocessor cache and prefix caching, closely matching the SGLang server config:

vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
    --gpu-memory-utilization 0.8 \
    --tensor-parallel-size 1 \
    --no-enable-prefix-caching \
    --max-num-batched-tokens 8192 \
    --max-num-seqs 256 \
    -O0 \
    --disable-mm-preprocessor-cache \
    --port 8000

MMMU

python3 -m sglang.bench_serving \
    --backend vllm-chat \
    --dataset-name mmmu \
    --num-prompts 1000 \
    --apply-chat-template \
    --port 8000

Run 1:

============ Serving Benchmark Result ============
Backend:                                 vllm-chat 
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     505       
Benchmark duration (s):                  66.75     
Total input tokens:                      214424    
Total input text tokens:                 44483     
Total input vision tokens:               169941    
Total generated tokens:                  517120    
Total generated tokens (retokenized):    342271    
Request throughput (req/s):              7.57      
Input token throughput (tok/s):          3212.52   
Output token throughput (tok/s):         7747.53   
Total token throughput (tok/s):          10960.05  
Concurrency:                             384.78    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   50856.14  
Median E2E Latency (ms):                 37700.28  
---------------Time to First Token----------------
Mean TTFT (ms):                          19935.86  
Median TTFT (ms):                        5733.54   
P99 TTFT (ms):                           38785.94  
---------------Inter-Token Latency----------------
Mean ITL (ms):                           38.72     
Median ITL (ms):                         27.20     
P95 ITL (ms):                            58.61     
P99 ITL (ms):                            314.43    
Max ITL (ms):                            738.38    
==================================================
Run 2 and 3

Run 2:

============ Serving Benchmark Result ============
Backend:                                 vllm-chat 
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     505       
Benchmark duration (s):                  65.93     
Total input tokens:                      214424    
Total input text tokens:                 44483     
Total input vision tokens:               169941    
Total generated tokens:                  517120    
Total generated tokens (retokenized):    343723    
Request throughput (req/s):              7.66      
Input token throughput (tok/s):          3252.12   
Output token throughput (tok/s):         7843.04   
Total token throughput (tok/s):          11095.15  
Concurrency:                             383.48    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   50068.19  
Median E2E Latency (ms):                 36977.60  
---------------Time to First Token----------------
Mean TTFT (ms):                          19500.35  
Median TTFT (ms):                        5722.59   
P99 TTFT (ms):                           38048.93  
---------------Inter-Token Latency----------------
Mean ITL (ms):                           38.27     
Median ITL (ms):                         26.89     
P95 ITL (ms):                            58.13     
P99 ITL (ms):                            261.80    
Max ITL (ms):                            836.76    
==================================================

Run 3:

============ Serving Benchmark Result ============
Backend:                                 vllm-chat 
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     505       
Benchmark duration (s):                  66.55     
Total input tokens:                      214424    
Total input text tokens:                 44483     
Total input vision tokens:               169941    
Total generated tokens:                  517120    
Total generated tokens (retokenized):    343419    
Request throughput (req/s):              7.59      
Input token throughput (tok/s):          3221.77   
Output token throughput (tok/s):         7769.85   
Total token throughput (tok/s):          10991.62  
Concurrency:                             383.69    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   50567.22  
Median E2E Latency (ms):                 37317.17  
---------------Time to First Token----------------
Mean TTFT (ms):                          19684.25  
Median TTFT (ms):                        5715.40   
P99 TTFT (ms):                           38408.50  
---------------Inter-Token Latency----------------
Mean ITL (ms):                           38.66     
Median ITL (ms):                         27.21     
P95 ITL (ms):                            59.18     
P99 ITL (ms):                            246.85    
Max ITL (ms):                            863.41    
==================================================

ISL1000/OSL1

python3 -m sglang.bench_serving \
    --backend vllm-chat \
    --dataset-name image \
    --num-prompts 768 \
    --apply-chat-template \
    --random-output-len 1 \
    --random-input-len 1 \
    --image-resolution 1120x700 \
    --image-format jpeg \
    --image-count 1 \
    --image-content random \
    --random-range-ratio 1 \
    --port 8000

Run 1:

============ Serving Benchmark Result ============
Backend:                                 vllm-chat 
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     768       
Benchmark duration (s):                  43.81     
Total input tokens:                      784912    
Total input text tokens:                 15376     
Total input vision tokens:               769536    
Total generated tokens:                  768       
Total generated tokens (retokenized):    768       
Request throughput (req/s):              17.53     
Input token throughput (tok/s):          17916.56  
Output token throughput (tok/s):         17.53     
Total token throughput (tok/s):          17934.09  
Concurrency:                             451.36    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   25747.35  
Median E2E Latency (ms):                 25224.20  
---------------Time to First Token----------------
Mean TTFT (ms):                          25747.35  
Median TTFT (ms):                        25224.20  
P99 TTFT (ms):                           42164.54  
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================
Run 2 and 3

Run 2:

============ Serving Benchmark Result ============
Backend:                                 vllm-chat 
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     768       
Benchmark duration (s):                  44.14     
Total input tokens:                      784906    
Total input text tokens:                 15370     
Total input vision tokens:               769536    
Total generated tokens:                  768       
Total generated tokens (retokenized):    768       
Request throughput (req/s):              17.40     
Input token throughput (tok/s):          17781.32  
Output token throughput (tok/s):         17.40     
Total token throughput (tok/s):          17798.72  
Concurrency:                             454.81    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   26141.01  
Median E2E Latency (ms):                 26727.64  
---------------Time to First Token----------------
Mean TTFT (ms):                          26141.01  
Median TTFT (ms):                        26727.64  
P99 TTFT (ms):                           42443.92  
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================

Run 3:

============ Serving Benchmark Result ============
Backend:                                 vllm-chat 
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     768       
Benchmark duration (s):                  44.40     
Total input tokens:                      784908    
Total input text tokens:                 15372     
Total input vision tokens:               769536    
Total generated tokens:                  768       
Total generated tokens (retokenized):    768       
Request throughput (req/s):              17.30     
Input token throughput (tok/s):          17679.89  
Output token throughput (tok/s):         17.30     
Total token throughput (tok/s):          17697.19  
Concurrency:                             455.47    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   26329.09  
Median E2E Latency (ms):                 26054.65  
---------------Time to First Token----------------
Mean TTFT (ms):                          26329.09  
Median TTFT (ms):                        26054.64  
P99 TTFT (ms):                           42496.25  
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================
vllm collect-env
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version                : Could not collect
CMake version                : version 4.1.0
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.7.1+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.11 (main, Jun  4 2025, 08:56:18) [GCC 11.4.0] (64-bit runtime)
Python platform              : Linux-5.15.0-1083-nvidia-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.8.93
CUDA_MODULE_LOADING set to   : LAZY
GPU models and configuration : 
GPU 0: NVIDIA H200
GPU 1: NVIDIA H200
GPU 2: NVIDIA H200
GPU 3: NVIDIA H200
GPU 4: NVIDIA H200
GPU 5: NVIDIA H200
GPU 6: NVIDIA H200
GPU 7: NVIDIA H200

Nvidia driver version        : 575.57.08
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           52 bits physical, 57 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  224
On-line CPU(s) list:                     0-223
Vendor ID:                               GenuineIntel
Model name:                              Intel(R) Xeon(R) Platinum 8480CL
CPU family:                              6
Model:                                   143
Thread(s) per core:                      2
Core(s) per socket:                      56
Socket(s):                               2
Stepping:                                7
CPU max MHz:                             3800.0000
CPU min MHz:                             800.0000
BogoMIPS:                                4000.00
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
L1d cache:                               5.3 MiB (112 instances)
L1i cache:                               3.5 MiB (112 instances)
L2 cache:                                224 MiB (112 instances)
L3 cache:                                210 MiB (2 instances)
NUMA node(s):                            2
NUMA node0 CPU(s):                       0-55,112-167
NUMA node1 CPU(s):                       56-111,168-223
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Vulnerable; IBPB: disabled; STIBP: disabled; PBRSB-eIBRS: Vulnerable; BHI: Vulnerable
Vulnerability Srbds:                     Not affected
Vulnerability Tsx async abort:           Not affected

==============================
Versions of relevant libraries
==============================
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.3.14
[pip3] nvidia-cuda-cupti-cu12==12.8.57
[pip3] nvidia-cuda-nvrtc-cu12==12.8.61
[pip3] nvidia-cuda-runtime-cu12==12.8.57
[pip3] nvidia-cudnn-cu12==9.7.1.26
[pip3] nvidia-cufft-cu12==11.3.3.41
[pip3] nvidia-cufile-cu12==1.13.0.11
[pip3] nvidia-curand-cu12==10.3.9.55
[pip3] nvidia-cusolver-cu12==11.7.2.55
[pip3] nvidia-cusparse-cu12==12.5.7.53
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-ml-py==13.580.65
[pip3] nvidia-nccl-cu12==2.26.2
[pip3] nvidia-nvjitlink-cu12==12.8.61
[pip3] nvidia-nvtx-cu12==12.8.55
[pip3] pyzmq==27.0.1
[pip3] torch==2.7.1+cu128
[pip3] torchaudio==2.7.1+cu128
[pip3] torchvision==0.22.1+cu128
[pip3] transformers==4.55.2
[pip3] triton==3.3.1
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
Neuron SDK Version           : N/A
vLLM Version                 : 0.1.dev8339+g66422d382 (git sha: 66422d382)
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4 NIC5     NIC6    NIC7    NIC8    NIC9    NIC10   NIC11   CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    PXB     NODE    NODE    NODE    NODE NODE     SYS     SYS     SYS     SYS     SYS     SYS     0-55,112-167    0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    NODE    NODE    NODE    PXB     NODE NODE     SYS     SYS     SYS     SYS     SYS     SYS     0-55,112-167    0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    NODE    NODE    NODE    NODE    PXB  NODE     SYS     SYS     SYS     SYS     SYS     SYS     0-55,112-167    0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    NODE    NODE    NODE    NODE    NODE PXB      SYS     SYS     SYS     SYS     SYS     SYS     0-55,112-167    0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    SYS     SYS     SYS     SYS     SYS  SYS      PXB     NODE    NODE    NODE    NODE    NODE    56-111,168-223  1               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    SYS     SYS     SYS     SYS     SYS  SYS      NODE    NODE    NODE    PXB     NODE    NODE    56-111,168-223  1               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    SYS     SYS     SYS     SYS     SYS  SYS      NODE    NODE    NODE    NODE    PXB     NODE    56-111,168-223  1               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      SYS     SYS     SYS     SYS     SYS  SYS      NODE    NODE    NODE    NODE    NODE    PXB     56-111,168-223  1               N/A
NIC0    PXB     NODE    NODE    NODE    SYS     SYS     SYS     SYS      X      NODE    NODE    NODE    NODE NODE     SYS     SYS     SYS     SYS     SYS     SYS
NIC1    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE     X      PIX     NODE    NODE NODE     SYS     SYS     SYS     SYS     SYS     SYS
NIC2    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    PIX      X      NODE    NODE NODE     SYS     SYS     SYS     SYS     SYS     SYS
NIC3    NODE    PXB     NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    NODE     X      NODE NODE     SYS     SYS     SYS     SYS     SYS     SYS
NIC4    NODE    NODE    PXB     NODE    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE     X   NODE     SYS     SYS     SYS     SYS     SYS     SYS
NIC5    NODE    NODE    NODE    PXB     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    NODE  X       SYS     SYS     SYS     SYS     SYS     SYS
NIC6    SYS     SYS     SYS     SYS     PXB     NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS  SYS       X      NODE    NODE    NODE    NODE    NODE
NIC7    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS  SYS      NODE     X      PIX     NODE    NODE    NODE
NIC8    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS  SYS      NODE    PIX      X      NODE    NODE    NODE
NIC9    SYS     SYS     SYS     SYS     NODE    PXB     NODE    NODE    SYS     SYS     SYS     SYS     SYS  SYS      NODE    NODE    NODE     X      NODE    NODE
NIC10   SYS     SYS     SYS     SYS     NODE    NODE    PXB     NODE    SYS     SYS     SYS     SYS     SYS  SYS      NODE    NODE    NODE    NODE     X      NODE
NIC11   SYS     SYS     SYS     SYS     NODE    NODE    NODE    PXB     SYS     SYS     SYS     SYS     SYS  SYS      NODE    NODE    NODE    NODE    NODE     X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8
  NIC9: mlx5_9
  NIC10: mlx5_10
  NIC11: mlx5_11

==============================
     Environment Variables
==============================
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_REQUIRE_CUDA=cuda>=12.8 brand=unknown,driver>=470,driver<471 brand=grid,driver>=470,driver<471 brand=tesla,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=vapps,driver>=470,driver<471 brand=vpc,driver>=470,driver<471 brand=vcs,driver>=470,driver<471 brand=vws,driver>=470,driver<471 brand=cloudgaming,driver>=470,driver<471 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551 brand=unknown,driver>=560,driver<561 brand=grid,driver>=560,driver<561 brand=tesla,driver>=560,driver<561 brand=nvidia,driver>=560,driver<561 brand=quadro,driver>=560,driver<561 brand=quadrortx,driver>=560,driver<561 brand=nvidiartx,driver>=560,driver<561 brand=vapps,driver>=560,driver<561 brand=vpc,driver>=560,driver<561 brand=vcs,driver>=560,driver<561 brand=vws,driver>=560,driver<561 brand=cloudgaming,driver>=560,driver<561 brand=unknown,driver>=565,driver<566 brand=grid,driver>=565,driver<566 brand=tesla,driver>=565,driver<566 brand=nvidia,driver>=565,driver<566 brand=quadro,driver>=565,driver<566 brand=quadrortx,driver>=565,driver<566 brand=nvidiartx,driver>=565,driver<566 brand=vapps,driver>=565,driver<566 brand=vpc,driver>=565,driver<566 brand=vcs,driver>=565,driver<566 brand=vws,driver>=565,driver<566 brand=cloudgaming,driver>=565,driver<566
NCCL_VERSION=2.25.1-1
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NVIDIA_PRODUCT_NAME=CUDA
VLLM_USAGE_SOURCE=production-docker-image
CUDA_VERSION=12.8.1
LD_LIBRARY_PATH=/usr/local/cuda/lib64
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_user
VLLM_WORKER_MULTIPROC_METHOD=spawn
CUDA_MODULE_LOADING=LAZY

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Summary of Changes

Hello @AlienKevin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the bench_serving tool by improving its multimodal benchmarking capabilities. It introduces more accurate token counting for vision models, expands image format support, and clarifies token length definitions, making it a more robust tool for evaluating VLM performance.

Highlights

  • Enhanced Multimodal Token Counting: Throughput metrics now accurately include vision tokens, providing a comprehensive breakdown of text vs. vision token counts to prevent misleading zero values in vision-heavy benchmarks.
  • Expanded Image Options: Added support for PNG images alongside JPEG, and introduced a "blank image" mode. This allows for more diverse stress testing of preprocessing overheads and isolation of prefill/decode performance.
  • Clarified Token Length Definition: Replaced the ambiguous "max_tokens" parameter with "max_completion_tokens" to ensure clarity regarding the generated output length, excluding input tokens.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request significantly enhances the multimodal benchmarking capabilities in bench_serving.py. The key improvements include counting vision tokens separately for more accurate throughput metrics, adding support for PNG and blank images to test different preprocessing overheads, and replacing the deprecated max_tokens with max_completion_tokens for clarity. The code is well-refactored, especially with the introduction of the create_mm_data_row helper function. I've identified a couple of minor issues: a leftover debug print statement and a hardcoded image format in the data URI generation, which could affect functionality for non-JPEG images.

AlienKevin and others added 2 commits August 30, 2025 01:00
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@JustinTong0323
Collaborator

Cool!! Thanks Kevin!

@AlienKevin
Contributor Author

@zhyncs Saw you as one of the code owners, would you mind taking a look at this PR?

Collaborator

@zhyncs zhyncs left a comment


please fix the comments

@AlienKevin
Contributor Author

@zhyncs Thanks for the swift review! I've fixed the max_tokens for non-chat endpoints. As for ignore_eos, I think it's useful to keep this option for controllable ISL/OSL benchmarking.

@AlienKevin
Contributor Author

@zhyncs Hi, just following up. Let me know if your comments have been addressed.

@JustinTong0323 JustinTong0323 added the ready-to-merge The PR is ready to merge after the CI is green. label Sep 19, 2025
@mickqian mickqian changed the title from "Enhance configurable multimodal benchmarking in bench_serving" to "benchmark: enhance configurable multimodal benchmarking in bench_serving" Oct 1, 2025
Added optional model_id parameter to get_dataset function and included assertions to ensure it is not None when required.
Removed assertions for tokenize_prompt and apply_chat_template for the image and mmmu datasets. Added a check to set apply_chat_template to True for the image and mmmu datasets.
@JustinTong0323 JustinTong0323 requested a review from zhyncs October 8, 2025 03:14
@zhyncs zhyncs merged commit e3bb7f5 into sgl-project:main Oct 8, 2025
96 of 98 checks passed
ch-tiger1 pushed a commit to ch-tiger1/sglang that referenced this pull request Oct 9, 2025
…ing (sgl-project#9812)

Co-authored-by: Xiang (Kevin) Li <lik@nvidia.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
@ZailiWang ZailiWang mentioned this pull request Oct 17, 2025
BraveY pushed a commit to openanolis/sglang that referenced this pull request Oct 22, 2025
Merge branch sglang_public_tracker of git@code.alipay.com:Theta/SGLang.git into main
https://code.alipay.com/Theta/SGLang/pull_requests/342?tab=diff

Reviewed-by: 苏墨 <xuyongfei.xyf@antgroup.com>


* [router] minor code clean up in server startup (sgl-project#10470)
* [bugfix] fix typo (sgl-project#10471)
* [PD metrics] Add latency Histogram metrics of each stage for generate requests (sgl-project#8710)
* [CI] Fix runner for sgl-kernel (sgl-project#9887)
* fix(internvl): fix accuracy issue of normalization (sgl-project#10375)
* fix: gpt-oss streaming dropping normal content when tools are provided but not used (sgl-project#9657)
* model: support solar (sgl-project#8189)
* fix: resolve sgl-kernel ut (sgl-project#10476)
* [1/2] Speed up trtllm_mla attention backend (>10% e2e) (sgl-project#10473)
* Fix `--dataset-path` in `bench_one_batch_server` (sgl-project#10475)
* [Env] minimal version for organizing envs (sgl-project#10479)
* chore: bump v0.3.10 sgl-kernel (sgl-project#10478)
* [router] multi model registration fix (sgl-project#10481)
* [2/2] Introduce Chunked-SGMV kernels and corresponding LoRA backend for improved performance (sgl-project#10286)
* [Auto Sync] Update registry.py (20250915) (sgl-project#10484)
* [router] fix worker registration in multi model mode (sgl-project#10486)
* fix crash of DeepSeek-V3 update_weights_from_disk (sgl-project#8863)
* Temporay work-around for rocm 7.0.0 alpha with enabling data-parallel issue (sgl-project#10434)
* [Hicache] Evaluate Per-Round Metrics in Multiturn Bench (sgl-project#10203)
* [ModelOpt] Respect `kv_cache_quant_algo` in ModelOpt checkpoints (sgl-project#10336)
* Add Logprobs unit test with a loose threshold (sgl-project#10230)
* [router] add router db connector for responses api (sgl-project#10487)
* Remove wrong imports `from sglang.python` (sgl-project#10493)
* [router] fix router manager and router init in server (sgl-project#10499)
* Cache the result of `is_blackwell` platform check (sgl-project#10498)
* feat: update support for qwen3next model (sgl-project#10466)
* Minor fix lint introduced by sgl-project#10466 (sgl-project#10507)
* chore: upgrade sgl-kernel 0.3.10 (sgl-project#10500)
* Update CUTLASS. Refine KernelSchedule for fp8 (grouped) gemm. (sgl-project#10491)
* Fix CI when sgl-kernel is changed but srt is not changed (sgl-project#10515)
* Support sgl-router parallel_batch in bench_one_batch_server (sgl-project#10506)
* [CPU] fix CPU backend sel. issue for Llama4 (sgl-project#10511)
* adjust import setuptools_rust (sgl-project#10524)
* Fix formatting in long code blocks (sgl-project#10528)
* skip vision_model for lora (sgl-project#10530)
* [2/2] Speed up trtllm_mla attention backend (sgl-project#10474)
* support using fa4 on deepseek on blackwell (sgl-project#9928)
* [Auto Sync] Update scheduler_profiler_mixin.py, rpd_utils.p... (20250916) (sgl-project#10494)
* [Auto Sync] Update activation.py, chunk_cache.py, utils.py (20250917) (sgl-project#10538)
* feat: add priority based scheduling with priority based request acceptance and preemption (sgl-project#8746)
* Fix decord dependency for aarch64 docker build (sgl-project#10529)
* enable prefix cache with dp (sgl-project#10459)
* [bugfix]hicache bench_long_context.py run failed (sgl-project#10523)
* Remove duplicated code (sgl-project#10545)
* CUDA Arch Independent (sgl-project#8813)
* [bench] Fix random seed in `bench_one_batch_server` (sgl-project#10548)
* [HiCache] Add tests for hicache storage mooncake backend (sgl-project#10171)
* [BugFix] Fix incorrect hidden_states_tensor in pd disaggregation + eagle (sgl-project#9976)
* fix: update dsv3 fp4 ut (sgl-project#10584)
* vlm: remove redundant d2h movement of mm feature tensors (sgl-project#9987)
* Enable trtllm mla prefix extend (sgl-project#10526)
* [ROCm] Fix fp8 quantization accuracy issue. (sgl-project#10558)
* [HICache] introduce evict policy (sgl-project#10190)
* PullRequest: 303 Revert "PullRequest: 291 for fa3 kvcache: revert github "convert mla kvcache to bfloat16""
* aiter v0.1.5.post2 (sgl-project#10563)
* [PD] Improve disaggregation common backend and refactor mooncake backend (sgl-project#10273)
* chore: upgrade mooncake 0.3.6 (sgl-project#10596)
* [improvement] add average input/output token length for hicache benchmark stats output (sgl-project#10525)
* Scale kkt after reduction (sgl-project#10604)
* fix deepep assert when PD disaggregation == null (sgl-project#8274)
* [RL] Add destroy process group api (sgl-project#9979)
* Feat/add heartbeat mechanism for nixl conn (sgl-project#10222)
* update deepep version for qwen3-next deepep moe (sgl-project#10624)
* support qwen3-next-fp8 deepep (sgl-project#10622)
* Fix sgl_kernel import failure on devices other than CUDA (sgl-project#10610)
* [Performance] qwen3-next improve causal conv1d in prefill phase (sgl-project#10595)
* Fix bias handling in TritonMoeQuantInfo within quantization/mxfp4.py (sgl-project#10579)
* feat: Add FlexAttention Backend for Efficient Sparse Attention (sgl-project#9947)
* Garbage collector regression in the online server (sgl-project#10621)
* [router] refactor worker to builder pattern 1/n (sgl-project#10628)
* refactor: use registry for _get_attention_backend_from_str (sgl-project#10629)
* [Feature] Speculative decoding support lookahead (sgl-project#9873)
* [Performance] Qwen3-Next: replace arange to cached query_start_loc_li… (sgl-project#10553)
* [Performance] Qwen3-Next: speed up update_mamba_state_after_mtp_verify by 10x; e2e up to 3.54% faster (sgl-project#10586)
* model support: Sarashina2VisionForCausalLM (sgl-project#10632)
* feat: add fused moe config for Qwen3-Next-80B-A3B-Instruct on B200 (sgl-project#10631)
* chore: bump sgl-kernel 0.3.11 (sgl-project#10630)
* Hicache L3 backend mooncake optimization configuration reading method (sgl-project#10319)
* [router] refactor worker to builder pattern 2/n (sgl-project#10633)
* [Feature]feat(get_ip): unify get_ip_xxx (sgl-project#10081)
* [router] refactor worker to builder pattern 3/n (sgl-project#10647)
* [sgl-kernel] Support moe_sum_reduce cuda kernel (sgl-project#10321)
* [router] refactor worker to builder pattern 4/n (sgl-project#10650)
* Fix fast decode plan for flashinfer v0.4.0rc1 and upgrade sgl-kernel 0.3.11 (sgl-project#10634)
* [router] refactor worker to builder pattern 5/n (sgl-project#10653)
* [HiCacheStorage]support page_first_direct layout for generic set&get (sgl-project#10522)
* [router] preserve order of json params using preserve_order feature (sgl-project#10661)
* [router] refactor router and worker management 1/n (sgl-project#10664)
* fix: resolve sync issue (sgl-project#10668)
* [Auto Sync] Update .clang-format (20250919) (sgl-project#10670)
* [router] refactor router and worker management 2/n (sgl-project#10666)
* router-spec: Reorder `ChatCompletionRequest` and fix validation logic (sgl-project#10675)
* chore: cleanup docker image (sgl-project#10671)
* limit sgl-kernel causal conv1d to cuda only (sgl-project#10648)
* [Auto Sync] Update model_runner.py (20250920) (sgl-project#10679)
* [router] refactor router and worker management 2.5/n (sgl-project#10677)
* [1/2] Support deterministic inference with flashinfer attention backend (sgl-project#10645)
* [Auto Sync] Update deepseek_v2.py (20250920) (sgl-project#10683)
* chore: upgrade mooncake 0.3.6.post1 to fix gb200 dockerfile (sgl-project#10681)
* [Performance] Qwen3-Next: optimize causal_conv1d_fn triton kernel - up to 9% faster (sgl-project#10680)
* Replace os.environ in layernorm.py (sgl-project#10684)
* fix(disagg): fix sending KV cache in case of MLA for NIXL backend (sgl-project#10673)
* fix: update run_suite (sgl-project#10685)
* fix: remove awq_dequantize deps (sgl-project#10686)
* [Auto Sync] Update modelopt_quant.py (20250920) (sgl-project#10688)
* [Feature] Support deterministic inference with FA3 backend (sgl-project#10651)
* feat: update server args  (sgl-project#10696)
* Super tiny fix extra logs (sgl-project#10697)
* [3/4] Speed up CSGMV backend perf by 10% through dynamic chunking + kernel optimization  (sgl-project#10592)
* Update release-docs.yml (sgl-project#10706)
* Refactors radix cache for extra key support (sgl-project#10317)
* [Router]fix: fix get_load missing api_key (sgl-project#10385)
* fix: disable gpt-oss b200 ut (sgl-project#10716)
* Optimize cutlass int8 gemm kernel for large M on SM89 Ada GPU (sgl-project#10714)
* [Auto Sync] Update deepseek_v2.py (20250922) (sgl-project#10717)
* Support deterministic inference with triton backend (sgl-project#10694)
* [deterministic inference] Move batch invariant pkg to sglang (sgl-project#10695)
* [2/2] Support deterministic inference for temperature > 0 (sgl-project#10678)
* [Ascend] codeowner updates for ascend related files (sgl-project#10699)
* [theta] Support custom multimodal sampling for qwen-vl
* revert e61d08c [theta] Support qwen-vl multimodal...
* PullRequest: 306 [theta] Support custom multimodal sampling for qwen-vl
* [4/4] Introduce CachedKernel to reduce CSGMV kernel launch overheads by 60% (sgl-project#10709)
* Convert FLASHINFER_WORKSPACE_SIZE to integer (sgl-project#10731)
* EPLB: prefer to use physical experts in the same node (sgl-project#9849)
* fix capture_bs when speculative decoding enabled (sgl-project#10730)
* Fix flaky logprobs test (sgl-project#10728)
* Fix CI TestChunkedSGMV (sgl-project#10737)
* [Docs, minor] Fix LLM doc matrix (sgl-project#10753)
* Add warnings and remove dependency for deterministic inference (sgl-project#10724)
* bugfix: Fix `get_worker_urls_for_model` in http/router.rs (sgl-project#10754)
* [router] refactor router and worker management 3/n (sgl-project#10727)
* [router] update ci so only execute benchmarks when labels are added (sgl-project#10757)
* Fix MTP MoE weight loading with NVFP4 target model. (sgl-project#10758)
* chore: bump sgl-kernel v0.3.12 (sgl-project#10732)
* [Generative Score API] Added test_scores_api.py to github CICD to run per commit (sgl-project#10755)
* refactor zero copy (sgl-project#10300)
* Fix multimodal registry and code sync scripts (sgl-project#10759)
* Enables TRT-LLM backend to be used for target_verify (sgl-project#10281)
* fix: kv events with tp > 1 (sgl-project#10541)
* [Auto Sync] Update flashattention_backend.py (20250922) (sgl-project#10762)
* [Feature] Add MLAProcess for DeepSeek MLA on NPU (sgl-project#10130)
* [Ascend] optimize Qwen-vl on Ascend (sgl-project#10556)
* [Ascend]optimize Qwen3 on Ascend (sgl-project#10574)
* [Auto Sync] Update configurer.py (20250923) (sgl-project#10765)
* [router] refactor router and worker management 4/n (sgl-project#10756)
* PullRequest: 310 Add the BailingMoEV3 model and its MLA support
* [router] remove pd router draining channel (sgl-project#10767)
* [router] fix logger type mismatch (sgl-project#10774)
* Use simulate acc len from `sglang.environ` (sgl-project#10771)
* Fix trtllm_mla slow concat kernel in MTP (sgl-project#10777)
* Move cached kernel to srt.utils (sgl-project#10776)
* feat: unify dockerfiles (sgl-project#10705)
* Introduce `FutureMap` (sgl-project#10715)
* chore: upgrade sgl-kernel 0.3.12 (sgl-project#10782)
* followup: clean up dockerfiles and release yamls  (sgl-project#10783)
* Clean up server args (sgl-project#10770)
* move `environ` into `sglang.srt` to avoid break SRT auto sync. (sgl-project#10791)
* Fix hicache mooncake backend CI (sgl-project#10792)
* [router] fix cache aware routing strategy and lock contention (sgl-project#10773)
* [router] responses api POST and GET with local storage (sgl-project#10581)
* model: support qwen3-vl series (sgl-project#10323)
* [fix][pd-disag]no need set next batch sampling info done in prefill (sgl-project#10259)
* [ROCm] Update aiter to v0.1.5.post3 (sgl-project#10812)
* [router] use dashmap for radix tree instead of hash for multi model (sgl-project#10814)
* router(grpc): Implement route for chat_cmpl endpoint (sgl-project#10761)
* fix ceval (sgl-project#10504)
* Remove duplicate code in qwen2 model (sgl-project#10540)
* [router] fix axum default body limit (sgl-project#10818)
* Fix latest main ci (sgl-project#10799)
* add tuning files for QWEN-3-NEXT (sgl-project#10794)
* [Auto Sync] Update protocol.py (20250923) (sgl-project#10820)
* fix: draft model IMA by overriding max_positional_embeddings (sgl-project#10787)
* [Auto Sync] Update elementwise.py (20250923) (sgl-project#10823)
* [Auto Sync] Update simple_eval_common.py (20250923) (sgl-project#10824)
* [router] Support streaming for Openai Router Response api  (sgl-project#10822)
* [router] add auth middleware for api key auth (sgl-project#10826)
* [Auto Sync] Update load_config.py, model_config.py, configu... (20250923) (sgl-project#10825)
* Revert "[fix][pd-disag]no need set next batch sampling info done in prefill" (sgl-project#10828)
* Add CI timeout guidelines (sgl-project#10829)
* [theta] fix serving_tokenization.py
* feat: add cache_salt support to request (sgl-project#10718)
* fix bailing_moe with enable_dp_attention (sgl-project#10860)
* ci: free space on workers for build (sgl-project#10786)
* router-grpc: Support jinja chat template content format detection (sgl-project#10832)
* [router] select first healthy worker on proxied get requests (sgl-project#10827)
* chore: Initial support for input config files (sgl-project#10534)
* router-grpc: Add tools processing and other parameters for apply_chat_template (sgl-project#10877)
* [router] consolidate health endpoints and flush cache (sgl-project#10876)
* Restructure sgl-kernel benchmark (sgl-project#10861)
* [Bug] Fix Issue#10215 (sgl-project#10572)
* [router] consolidate worker get loads (sgl-project#10880)
* [router] Support Oracle DB(ATP) Data Connector (sgl-project#10845)
* [router] simplify tokenizer dev doc (sgl-project#10895)
* [Auto Sync] Update model_config.py (20250925) (sgl-project#10885)
* [ci feature] add ci monitor (sgl-project#10872)
* [HiCache] Cleaning the deprecated host memory state (sgl-project#10778)
* integrate AIBrix KVcache (sgl-project#10376)
* Add fuse_moe per-channel tune (sgl-project#10915)
* [router] consolidate worker load monitoring (sgl-project#10894)
* router: Fix constraint proto and `build_constraint` in grpc router (sgl-project#10881)
* Refactor kv_cache_scheme handling for quantization (sgl-project#10132)
* refactor: Move `grpc/client.rs` to `grpc_client/sglang_scheduler.rs` (sgl-project#10924)
* fix env flashinfer (sgl-project#10910)
* [minor] Remove deprecated function `get_ip` (sgl-project#10883)
* Rename customer label -> custom label (sgl-project#10899)
* [router] change log level to warning (sgl-project#10926)
* [router][refactor] Clean up protobuf fields (sgl-project#10923)
* Replace the Kimi-K2 generated tool call idx with history tool call count (sgl-project#10612)
* [ci] add ci-monitor workflow (sgl-project#10898)
* Remove pull_request trigger from CI monitor workflow (sgl-project#10932)
* router: Support parallel sampling num > 1 in grpc_server and non-stream handling (sgl-project#10929)
* Revert "Refactor kv_cache_scheme handling for quantization (sgl-project#10132)" (sgl-project#10935)
* Update CODEOWNERS to include JustinTong0323 in FC (sgl-project#10939)
* [PD-HiCache]: Support Async Offloading KVCache In Decode Side (sgl-project#10192)
* CI: Fix docker manifest build (sgl-project#10936)
* [router] update owners for router components (sgl-project#10927)
* Fuse write kv buffer into rope for qwen3 moe & bailing moe (sgl-project#10749)
* [router] add grpc client get and set (sgl-project#10955)
* [router]fix code owner syntax error (sgl-project#10956)
* [router] move grpc client from router to worker and builder (sgl-project#10958)
* [router] move grpc worker management from router to worker manager (sgl-project#10960)
* [router] grpc router regular mode import cleanup (sgl-project#10963)
* [router] remove old/outdated/useless comments (sgl-project#10967)
* [router] remove old/outdated/useless comments across code base (sgl-project#10968)
* ci: fix rate-limit of huggingface with hf auth login (sgl-project#10947)
* Update label field comment to indicate deprecation (sgl-project#10970)
* Restructure gpu_memory_settings into a unified function and relax max_cuda_graph_bs (sgl-project#10372)
* ci: refactor nightly test (sgl-project#10495)
* refactor loading weights from remote instance coding format (sgl-project#10941)
* [router][grpc] Add helper functions for decoder in router.rs and fix specs (sgl-project#10971)
* Add simple docker file for B300 (sgl-project#10944)
* Ci monitor support performance (sgl-project#10965)
* [HiCache]: Support dynamic loading backends for hicache (sgl-project#10551)
* [Bugfix][Minor][Benchmark] Fix some bugs due to PR sgl-project#10495 (sgl-project#10982)
* [router][grpc] Support E2E non-stream chat completions (sgl-project#10980)
* fix: fp8 quantization failure of qwen 2.5 VL 7B model (sgl-project#10112)
* [Fix] RuntimeError: get_cfg Unsupported input_type:Float4_e2m1fn_x2 when using aiter-mxfp4-moe (sgl-project#10981)
* fix: make inference deterministic for large TP (sgl-project#10930)
* Add auth to get server info (sgl-project#10751)
* PullRequest: 315 bailingMoE: Fix deepep_mode KeyError
* Add support for topk metadata transferring for PD (sgl-project#10616)
* [PD] Extract the PP transfer layer calculate logic from Mooncake to Common backend (sgl-project#10565)
* Use jsonschema to constrain required or specific tool choice (sgl-project#10550)
* Fix profiler (sgl-project#10997)
* [router][tool parser] Modify tool parser to return both normal text and tool calls (non-stream) (sgl-project#10995)
* [router] basic mcp support for openai router response api (sgl-project#10978)
* [router] fix chat template loading and tokenizer path (sgl-project#10999)
* Fix CI failure of TypeError: RotaryEmbedding.forward_cpu() got an unexpected keyword argument 'fused_set_kv_buffer_arg' (sgl-project#11009)
* [bugfix]Add empty_context import to two_batch_overlap.py (sgl-project#10964)
* prepare for sglang+verl (sgl-project#10555)
* [sgl-kernel] Optimize concat_mla_k kernel (sgl-project#10543)
* [HiCache] bug: fix mooncake store batch set v1 (sgl-project#11013)
* Fix FusedSetKVBufferArg  in RotaryEmbedding (sgl-project#11003)
* Update GLM-4.5 Model Doc (sgl-project#11017)
* [router] migrate to rust python module for pythonic parser (sgl-project#11033)
* fix: show failed models in nightly ci (sgl-project#10986)
* [router][tool call] Support normal content extraction before tool call (streaming) (sgl-project#11038)
* [router] add harmony tool parser base structure and interface (sgl-project#11036)
* Unify SGL Kernel Releases (sgl-project#10701)
* [1/2] Support FA4 for MHA Prefill in sgl-kernel (sgl-project#10940)
* fix: check if weights are already local before downloading (sgl-project#11015)
* [HiCacheStorage] mooncake store support page_first_direct layout (sgl-project#10591)
* [speculative decoding] rename lookahead to ngram (sgl-project#11010)
* Fix gemma 3 launch with `transformers:` failing with `AttributeError: 'TransformersForCausalLM' object has no attribute 'tp_size'` (sgl-project#9614)
* Fix sgl-kernel benchmark dead code  (sgl-project#11022)
* [router][tool call] Improve normal content extraction and error handling (non-stream) (sgl-project#11050)
* chore: upgrade cutedsl 4.2.1 (sgl-project#11054)
* [Ci Monitor] Auto uploaded performance data to sglang_ci_data repo (sgl-project#10976)
* chore: upgrade sgl-kernel 0.3.13 (sgl-project#11056)
* [router] add n to generate sampling params (sgl-project#11069)
* Use more general heuristics to set the default value of --mem-fraction-static (sgl-project#10975)
* [router][tool call] Separate `JsonParser` and `LlamaParser` (sgl-project#11073)
* Fix mem fraction static for nightly tests (sgl-project#11076)
* fix: fp8 mllama4 without vision modules being quantized (sgl-project#10611)
* [router] Use `get_pooled` in `process_single_choice` (sgl-project#11079)
* [router][grpc] Add logprobs support to router (sgl-project#11082)
* feat(reasoning): improve enable thinking from request (sgl-project#10875)
* [Profile] dump memory trace when cuda graph profile is enabled (sgl-project#11083)
* Remove hybrid_linear_attn attention backend and refactor attention registry (sgl-project#10816)
* [model] added support for w8a8int8 used by neuralmagic/Qwen2-0.5B-Ins… (sgl-project#9642)
* Enable optional FP32 compute for LM Head (sgl-project#10729)
* Update CODEOWNERS for attention/ascend_backend.py (sgl-project#11092)
* [router] grpc router generate endpoint support (sgl-project#11070)
* [router][tool call] Full support for ToolChoice (sgl-project#11085)
* Fix spec filter batch when target extend  (sgl-project#10991)
* [Fix] Resolve performance drop in speculative decoding aiter backend (sgl-project#11087)
* [Auto Sync] Update fused_moe_triton_config.py (20250930) (sgl-project#11099)
* chore: bump sgl-kernel v0.3.14 (sgl-project#11067)
* [router][grpc-server] Fix gRPC server shutdown (sgl-project#11094)
* Fix eagle radix cache (sgl-project#10846)
* [Eval] Add `--repeat` in `run_eval`  (sgl-project#11101)
* [CPU] Adding Memory Capacity Acquisition Functionality (sgl-project#11102)
* Fix DSR1 accuracy for flashinfer_trtllm MoE with FP8 quantization (sgl-project#11081)
* Support Dots.ocr model (sgl-project#11071)
* [router][bugfix] Fix input_logprobs handling with None value and `logprob_start_len = -1` (sgl-project#11113)
* Feature/make PEFT adapter module format compatible (sgl-project#11080)
* fix: KimiK2Detector Improve tool call ID parsing with regex (sgl-project#10972)
* [router] add mcp list and mcp call in output array (sgl-project#11112)
* Organize spec-related data structures (sgl-project#10735)
* [AMD] Add Tilelang and Fast Hadamard Transform builds to Dockerfile.rocm (sgl-project#11114)
* [Auto Sync] Update base_grammar_backend.py, xgrammar_backen... (20250930) (sgl-project#11115)
* [Doc] Update multimodal language models documentation (sgl-project#11111)
* Quick Fix: fix Qwen3-VL launch failure caused by MRotaryEmbedding arg (sgl-project#10985)
* docker: x86 dev builds for hopper and blackwell (sgl-project#11075)
* Refactor AMD CI. (sgl-project#11128)
* feat: add fast_decode_plan from flashinfer, flashinfer to 0.4.0rc3 (sgl-project#10760)
* [HiCache]bug fix: fixed blank item in host_mem_release_queue (sgl-project#11005)
* [Feature] Add EIC as sglang HiCache Storage backend (sgl-project#10271)
* [HiCache] Configurable and Dynamic Prefetch Timeout (sgl-project#10512)
* [router] add pd service in grpc router for pd (sgl-project#11120)
* [router] Add multi-turn tool calling loop support for MCP integration (sgl-project#11143)
* Fix metrics and request tracing (TimeStats) (sgl-project#11123)
* Remove debug print statement from scheduler output (sgl-project#11145)
* Introduce cpu tensor as metadata to avoid blocking gpu kernel launch (sgl-project#10720)
* Fix ngram spec with page size > 1 (sgl-project#11135)
* [ROCm] Reduce compilation time when using torch compile. (sgl-project#10559)
* Fix DeepSeek chunked prefill memory issue (sgl-project#11149)
* Clean up parallel_state.py (sgl-project#11148)
* Tiny improve dumper (sgl-project#11132)
* Tiny fix missing alt stream in nextn layer (sgl-project#10768)
* Fuse quantize and rope in trtllm_mla MTP (sgl-project#10779)
* Tiny detect slow ranks (sgl-project#10508)
* Remove unused pack `.item()` in paged allocator. (sgl-project#11156)
* Support dispatch low latency (sgl-project#10263)
* Support single batch overlap (sgl-project#10422)
* [router][grpc] Support tool call parser in streaming (sgl-project#11160)
* [model] Add mamba2 and Falcon-H1 support. (sgl-project#10988)
* Clean up ascend allocator (sgl-project#11152)
* fix cpp JIT compilation issue of ngram speculative decoding (sgl-project#10837)
* Tiny cleanup deepseek_v2.py (sgl-project#11163)
* Tiny fix ep_gather behavior different in CI (sgl-project#11130)
* Tiny remove duplicated code (sgl-project#11164)
* [proto] Add script to compile python protos (sgl-project#11171)
* Unify forward output datastructure (sgl-project#11124)
* [grpc] style fix for grpc compilation. (sgl-project#11175)
* Remove dp balance metadata and minimal token balance. (sgl-project#11170)
* Minor fixes for server_args, parallel_state, and test_deterministic.py (sgl-project#11159)
* fix: shouldn't include CUDA_ARCH 100 and 120 for cuda12.6.1 (sgl-project#11176)
* [router][grpc] Support streaming for v1/chat/completions (sgl-project#11179)
* Allow use of TRTLLM_MHA backend for hybrid attention on Blackwell (sgl-project#11138)
* Introduce naming convention in `io_struct` and base sglang io classes. (sgl-project#10133)
* [Generative Scores API] add performance tests to CICD  (sgl-project#10830)
* [1/n] Enable DCA CUDA graph capture (sgl-project#9537)
* [Fix] Update to v0.1.5.post4 and refine HIP attention backend selection (sgl-project#11161)
* [CI] Tee server logs to both file and stdout/stderr using PIPE (sgl-project#11185)
* fix: radix cache memory accounting (sgl-project#10637)
* Tiny add PD disaggregation + DP attention test (sgl-project#11167)
* [router] Streaming support for MCP Tool Calls in OpenAI Router (sgl-project#11173)
* [Feature] Option to save model weights to CPU when memory saver mode is enabled (sgl-project#10873)
* Add --thinking-mode to run_eval (sgl-project#11189)
* [hot-fix] Fix CI break caused by adding `thinking_mode` in eval (sgl-project#11192)
* Tiny move files to utils folder (sgl-project#11166)
* Fix CUDA illegal memory access issues in speculative decoding (sgl-project#10892)
* Fix [test]: Env:SGLANG_TORCH_PROFILER_DIR for pytest. (sgl-project#10780)
* Optimize debug log position of PD abort request (sgl-project#11090)
* fix 3fs indices (sgl-project#10855)
* model: support starcoder2 (sgl-project#10609)
* [Test] Initialize mem_fraction_static in setUpClass to fix pytest VLM test crashes. (sgl-project#10859)
* fix xeon ci check (sgl-project#10838)
* fix qwen2 eagle3 runtime error (sgl-project#10517)
* [minor] fix the lint (sgl-project#11198)
* [Fix] Fix the bug of the calculation of base_gpu_id (dp offset) in data_parallel_controller.py (sgl-project#10741)
* [fix]missing prefix_lens_cpu init when p/d disaggregation (sgl-project#11196)
* fix self.enable_kv_cache_events (sgl-project#11178)
* [HICache]: Refactor HiCache CI (sgl-project#11011)
* fix sampling_seed handling when deterministic is enabled (sgl-project#11096)
* [fix]enable flashmla when using draft model P/D attention select (sgl-project#11012)
* [router] fix get load response parsing (sgl-project#11213)
* [router] add grpc router pd mode for chat and generate (sgl-project#11140)
* EAGLE cache fix for HiCache (sgl-project#11215)
* Add --max-new-tokens CLI flag for MMMU evaluation (sgl-project#11217)
* Add DeepSeek-V3.2 Tool Call Template (sgl-project#11063)
* Tiny `skip_sample` adjust (sgl-project#11225)
* [Feature] Add a fast-topk to sgl-kernel for DeepSeek v3.2 (sgl-project#11194)
* Update `v1/responses` to be more OpenAI-compatible. (sgl-project#9624)
* chore: bump sgl-kernel v0.3.14.post1 (sgl-project#11137)
* Update DeepGEMM repository tag to specific commit (sgl-project#11229)
* [Feat] Support Torch Symm Mem AllReduce (sgl-project#10571)
* Refactor and optimize mooncake CI (sgl-project#11162)
* [Fix AMD CI] VRAM cleanup  (sgl-project#11174)
* Update transformers package version to 4.57.0 (sgl-project#11222)
* Remove gdrcopy check in ci_install_deepep.sh (sgl-project#11237)
* Rename runner labels (sgl-project#11228)
* [Auto Sync] Update io_struct.py (20251004) (sgl-project#11206)
* Create two new GH workflows to automatically bump SGLang and Kernel version (sgl-project#10996)
* Fix spec_utils.py (sgl-project#11247)
* ci: make find_local_hf_snapshot_dir more robust (sgl-project#11248)
* [quantization] Fix scale remapping for mllama4 (sgl-project#10042)
* [quantization] Enable aiter mxfp4 fused_moe for Quark (sgl-project#10048)
* Use cu128 for torch audio to fix some CI tests (sgl-project#11251)
* Bump torch_memory_saver 0.0.9rc2 (sgl-project#11252)
* update sgl kernel version to 0.3.14.post1 (sgl-project#11242)
* Update condition for sgl-kernel-benchmark-test (sgl-project#11254)
* feat: add shortcut detection for multimodal templates in Jinja format (sgl-project#11209)
* Improve bot release workflow (sgl-project#11240)
* Add flashmla and fast hadamard transform to Dockerfile (sgl-project#11235)
* Support DeepSeek V3.2 Exp (sgl-project#11061)
* chore: bump SGLang version to 0.5.3rc2 (sgl-project#11259)
* chore: bump SGLang version to 0.5.3 (sgl-project#11263)
* [theta] fix bailing v3
* [router] add ipv6 support across all components (sgl-project#11219)
* Remove env var warnings for release (sgl-project#11262)
* Enable native ModelOpt quantization support (1/3)  (sgl-project#7149)
* [router][tool call] Clean up redundant `detect_format` and `has_tool_markers` (sgl-project#11270)
* disable sm100 for FlashMLA and fast-hadamard-transform in cuda12.6.1 (sgl-project#11274)
* docker: add manifest to versioned docker releases (sgl-project#11268)
* [Bug] Fix incorrect assertion in FA4 and add UT. (sgl-project#11182)
* [router][grpc] Refine streaming processes (sgl-project#11277)
* Fix code sync scripts (sgl-project#11276)
* [Auto Sync] Update test_utils.py (20251006) (sgl-project#11280)
* Rename max_micro_batch_size -> pp_max_micro_batch_size (sgl-project#11279)
* revert the amd ci test back to 1200s and split the 8-gpu deepseek job into two. (sgl-project#11238)
* Fix LoRA support for multimodal models (VLMs) by implementing a consistent pattern for skipping vision components (sgl-project#11261)
* fix: correct scale parameter remapping logic in Llama4ForConditionalGeneration (sgl-project#11282)
* docs: update sgl-kernel README (sgl-project#11286)
* chore: bump sgl-kernel version to 0.3.15 (sgl-project#11281)
* [router][grpc] Fix proto3 default value mismatches and cleanup unused fields (sgl-project#11283)
* convert test_deterministic into unit tests (sgl-project#11095)
* Feature/longbench v2 evaluation utils (sgl-project#10949)
* [ci] fix pp test (sgl-project#11294)
* EAGLE cache fix for SWARadixCache (sgl-project#11231)
* Remove overlap thread (sgl-project#11210)
* [router] add reasoning and tool parser argument in router (sgl-project#11290)
* Remove sampling info events and overlap thread file (sgl-project#11300)
* Introduce future indices (sgl-project#11301)
* [sgl-kernel] Support float64 moe_sum_reduce cuda kernel (sgl-project#11068)
* [Docs] [Router] Update Observability and Common Issues Section (sgl-project#11302)
* [router] add get server info and get model info in grpc server (sgl-project#11303)
* [router][grpc] Refactor chat template content format detection (sgl-project#11288)
* [Doc] HiCache Design Documents (sgl-project#11027)
* [Doc]: Best Practice for HICache (sgl-project#11001)
* [router] fix grpc connection conversion and add optimization (sgl-project#11305)
* [router][grpc] Fix sampling_params.stop_strs is None (sgl-project#11306)
* Update tool parser and related documentation (sgl-project#11223)
* [router][grpc] Fix error message format in grpc chat handler (sgl-project#11307)
* [quantization] Properly ignore quantization for layers excluded in quant_config (sgl-project#11205)
* [router] support Openai router conversation API CRUD (sgl-project#11297)
* [router][grpc] Fix request_id extraction when n > 1 (sgl-project#11311)
* [router] cleanup worker health check to return early (sgl-project#11310)
* [oai serving chat] Add argument `--sampling-defaults` and fix `ChatCompletionRequest` defaults (sgl-project#11304)
* Clean match_prefix and prepare_for_extend for mem cache V2 (sgl-project#11200)
* ci: unify the model launch method of nightly ci (sgl-project#11230)
* [Chore] Update xgrammar 0.1.24 -> 0.1.25 (sgl-project#10710)
* update sampling_params documentation with defaults (sgl-project#11315)
* Optimize copy_kv_cache for spec decoding (sgl-project#11126)
* Rename `ngram_utils` -> `ngram_info` (sgl-project#11316)
* [router][grpc] Refactor chat handler in grpc/ to use centralized orchestrator (sgl-project#11314)
* [Feature] Add /tokenize and /detokenize OpenAI compatible endpoints (sgl-project#9545)
* [8/N] MoE Refactor: deprecate `EPMoE` (sgl-project#11211)
* Skip weight loading in deepgemm compilation (sgl-project#11312)
* [2/2] Support MHA prefill with FlashAttention 4. (sgl-project#10937)
* [Doc] Update mooncake nvlink transport doc for PD disaggregation (sgl-project#11321)
* fix(decode): adjust ServerArgs import to explicit module path (sgl-project#11007)
* Support LoRA in bench_serving oai interface (sgl-project#11318)
* benchmark: enhance configurable multimodal benchmarking in bench_serving (sgl-project#9812)
* [CI] improve disaggregation CI. (sgl-project#11264)
* [theta] fix tokenization
* model: Support Hybrid Mamba2 NemotronHForCausalLM (nvidia/NVIDIA-Nemotron-Nano-9B-v2) (sgl-project#10909)
* [router] refactor generate to use new pipeline arch (sgl-project#11323)
* [router] improve reasoning parser lock and reduce req cloning (sgl-project#11336)
* [router][grpc] Cleanup debug logs in grpc_server and grpc_router (sgl-project#11340)
* [router] Fix all unused_qualifications (sgl-project#11341)
* [router] Support history management using conversation (sgl-project#11339)
* [router][grpc] Add dependencies in Cargo.toml to support chat template rendering (sgl-project#11342)
* fix: fix revision for sgl-flash-attn in sgl-kernel (sgl-project#11327)
* [Auto Sync] Update scheduler.py (20251009) (sgl-project#11350)
* [Generative Score API] Multi-Item scoring with custom attention mask. (sgl-project#10979)
* [router][grpc] disable health check generation and increase timeout (sgl-project#11353)
* [router] Refactor OpenAI router: split monolithic file and move location (sgl-project#11359)
* [router][lint] Add unused_qualifications to cargo lint warnings (sgl-project#11366)
* [DeepSeek-V3.2] Include indexer kv cache when estimating kv cache size (sgl-project#11309)
* PullRequest: 323 [theta] Standardize error codes: 1) unify pre-processing errors for chat and completions requests as HTTP 400; 2) return standard HTTP error codes for multimodal load data requests
* [router][grpc] Fix streaming bugs: empty tool names, state pollution, and panics (sgl-project#11373)
* add code pp support for nixl (sgl-project#11375)
* fix bench_serving mishandling of internal states (sgl-project#11376)
* PullRequest: 322 Support MTP and subclass BailingMoEV3AttentionMLA from DeepseekV2AttentionMLA
* [router][grpc] Replace fake health check with correct ones (sgl-project#11387)
* [router] change grpc client from mutable to clone (sgl-project#11394)
* chore: upgrade flashinfer 0.4.0 (sgl-project#11364)
* [router] conversation item API: create, retrieve and delete (sgl-project#11369)
* chore: bump SGLang version to 0.5.3.post1 (sgl-project#11324)
* move more files under srt/utils (sgl-project#11285)
* [grammar] Avoid server crash when grammar backend is None (sgl-project#11401)
* fix: fix gpu-proc affinity set incorrectly when pp_size > 1 (sgl-project#11389)
* [Bug Fix] prevent lora adapter from being loaded into LoRAManager if it is already loaded (sgl-project#11365)
* [CI] Refactor PD disaggregation test suite (sgl-project#11363)
* Replace pad with cat for better performance (sgl-project#11388)
* fix: reinstall torch in deps install (sgl-project#11414)
* feat(hicache): Support passing prefix keys for l3 store. (sgl-project#9045)
* fix file and object naming scheme in HiCacheNixl to avoid data corruption (sgl-project#10969)
* Dedicated toml files for CPU/XPU (sgl-project#10734)
* Add metrics for speculative decoding (acceptance rate, average acceptance length) (sgl-project#11144)
* chore: update pyproject (sgl-project#11420)
* PullRequest: 330 [theta] qwen-vl supports passing video image frames as base64, e.g. data:video/jpeg;base64,frame1_base64,frame2_base64,...,frameN_base64
* fix: fix video input for qwen3-vl (sgl-project#11361)
* perf: optimize qwen-vl with symm mem allreduce (sgl-project#11381)
* [HiCache] feat: add multi tenant with prefix tag (sgl-project#9256)
* [CI] Merge build-dev into workflow matrix (sgl-project#11345)
* Revert "perf: optimize qwen-vl with symm mem allreduce" (sgl-project#11436)
* Revert "fix: fix video input for qwen3-vl" (sgl-project#11437)
* Revert "Add metrics for speculative decoding (acceptance rate, average acceptance length)" (sgl-project#11433)
* [router] Fix ci nvcc not found error (sgl-project#11411)
* feat(mooncake): support GB suffix for global_segment_size  (sgl-project#10745)
* Separate allocation logic from scheduler (sgl-project#11313)
* [router] disable rate limiter by default (sgl-project#11435)
* [router] leverage RAII to actively cancel request during client disconnect (sgl-project#11399)
* [router][grpc] Consolidate parser checks for chat completions (sgl-project#11439)
* Reorder PD disagg CI tests (#11438)
* fix: Change dsv32 hack temporary path to use system temp directory (#11445)
* Fix batch invariant ops (#11368)
* [BugFix] test_mla_fp8.py fails on Cublas 12.9 (#11360)
* [DPSKv3.2] Rewrite nsa tilelang act_quant kernel to triton (#11450)
* Remove tilelang dependency in Dockerfile (#11455)
* Enable native ModelOpt quantization support (2/3) (#9991)
* Reland [1/2] Optimizations and refactors about quant kernel (#10312)
* Super tiny delete unused openai router in sgl-router (#11448)
* Adjust logits metadata init for target verify (#11467)
* [Documentation][Configuration] Server args and documentation of PD-Multiplexing. (#11427)
* Fix enable_v2 in int8 quant (#11470)
* [Fix] Fix split prefill with fa3. (#11428)
* fix stop when streaming (#11462)
* Add option to disable `any_whitespace` for `xgrammar` and `llguidance` backends. (#8919)
* PullRequest: 334 [theta] Fix various qwen3-vl bugs
* [7/n] decouple quantization impl from vllm dependency - gguf kernel (#11019)
* fix Xeon CI (#11454)
* [CI] Add nightly builds to dockerhub (#9804)
* [Feature] support regex strings as a stopping condition (#10635)
* Beta spec-overlap for EAGLE (#11398)
* Piecewise CUDA Graph Support & Torch Compile Backend (#10062)
* [Router]: Small Typo in a comment within tree.rs (#11489)
* chore: bump sgl-kernel version to 0.3.16 (#11476)
* [smol] [perf] Qwen3-VL in place op. (#11481)
* [chore][1/N] Avoid using default mutable parameters (#11478)
* [bugfix]: use correct causality condition for flashattention, flashinfer, and triton backends (#10172)
* [perf] Replace json -> orjson in hot path (#11221)
* [chore][2/N] Avoid using default mutable parameters (#11479)
* Fix the GPT function calling regex to allow dash in the name (#10577)
* bailingMoE: Fix KeyError of deepep_mode (#11465)
* Fix CI break by express-laned PRs. (#11499)
* Move args from `global_config` to `environ` (#11332)
* move fla env check position (#11500)
* Temporarily remove b200 tests (#11501)
* Fix port conflicts in CI (#11497)
* temporarily remove b200 tests (#11502)
* Fix unit tests (#11503)
* Bugfix: Fix Type consistency for KV indices in SWARadixCache (#11452)
* doc: add doc for adding new models into nightly-ci (#11443)
* [CI] fix lint (#11509)
* Deprecate `global_server_args_dict` (#11331)
* chore: remove flashinfer cleanup cache (#11514)
* fix: revert temporarily remove b200 tests (#11515)
* [Fix] Improve longbench prompt and other logics (#11474)
* Sync changes on io_struct.py and deterministic ops (#11498)
* [lint] Fix the lint issue (#11516)
* Revert "Deprecate `global_server_args_dict`" (#11520)
* Improve dp attention port assignment scheme (#5889)
* [theta] rebase public/main 1013-2
* [router] openai router: support grok model (#11511)
* docs(router): add token-bucket rate limiting to the docs (#11485)
* [sgl-kernel][1/N]Support Expert Specialization Grouped GEMM (#11432)
* Update DeepSeek-R1-FP4 default config on blackwell (#11512)
* [Fix]: add missing device attribute to ChunkCache (#11493)
* [Feature] Support mamba radix cache v0 (#11214)
* ci: improve nightly-ci (#11385)
* [CI monitor] Improve CI analyzer: fix job failure tracking and add CUDA-focused filtering (#11505)
* [HICache]: Support 3FS-Store with page_first_direct layout (#11460)
* Tiny fix test run estimated time (#11544)
* [Reland] perf: optimize qwen-vl with symm mem allreduce (#11457)
* [theta] rebase public/main 1013-5
* Deprecate `global_server_args_dict` (#11528)
* [theta] rebase public/main 1013-6
* [Fix] Add per_channel_quant parameter to MoE config functions (#11201)
* [router][ci] Add Nightly Release Workflow for SGLang Router (#11527)
* [router] allow tokenizer path to be dir (#11530)
* Remove `tp_worker.worker` (#11548)
* fix: fix video input for qwen3-vl (#11442)
* [NVIDIA] BUMP FA3 (#11444)
* [router][Fix] Include grpc reflection runtime dependency (#11419)
* Adjust overlap event loop (#11507)
* Move deep gemm related arguments to `sglang.srt.environ` (#11547)
* [router][grpc] Further delegate non-stream processing to `processing.rs`  (#11553)
* [router] allow user to specify chat template path (#11549)
* Minor: improve sampler & remove unused fields from model_config.py (#11531)
* [router] Add Rust CLI flags for queue size, timeout, and rate limit for token bucket rate limiter (#11483)
* Add metrics for speculative decoding (acceptance rate, average acceptance length) (#11441)
* Fix DeepSeek-v3.2 default config (ValueError: not enough values to unpack (expected 4, got 3)) (#11557)
* [CI] Add Basic Test for DeepSeek V3.2 (#11308)
* [router][grpc] Add error handling to `generate_tool_constraints` (#11562)
* [NVIDIA] update pyproject.toml to support cu130 option (#11521)
* [CI Monitor] Ci monitor only deal with main branch in default (#11538)
* Tiny cleanup fp4 gemm calls (#11537)
* [router][grpc] Add `serve_grpc` to `launch_server` and log id for HealthCheck (#11564)
* [router] Add BRANCH_TYPE=local support to Dockerfile.router for local builds (#11571)
* [sgl-kernel][2/N]Support Expert Specialization Grouped GEMM (#11534)
* chore: bump sgl-kernel version to 0.3.16.post1 (#11573)
* Fix accept rate in speculative decoding metrics (#11572)
* Compilation Folder Reset (#11539)
* [FEATURE] Add Profile Trace Merger for Distributed Traces (#11413)
* [DSv32] Use torch.compile for _get_logits_head_gate (#11565)
* Make DeepEP combine recv do not overlap (#11535)
* bench_serving support PD Disaggregation (#11542)
* Implement LRU eviction policy for LoRA adapters (#11041)
* PullRequest: 337 Support multimodal requests via the completions protocol
* Revert "[NVIDIA] BUMP FA3 (#11444)" (#11582)
* chore: bump sgl-kernel version to 0.3.16.post2 (#11583)
* [Auto Sync] Update model_config.py (20251014) (#11580)
* Add fused_moe_triton config: triton_3_4_0/E=256,N=256,device_name=NVIDIA_B200.json (#11587)
* [router][protocols] Add Axum validate extractor and use it for `/v1/chat/completions` endpoint (#11588)
* [router] update generate spec to align with sgl io struct (#11591)
* [router] change worker api to async instead of sync (#11566)
* Update news section in README.md (#11598)
* [router] delete useless table content comment in spec (#11597)
* [router] allow router launch server to use grpc mode (#11600)
* [Docs] [Router]: Update sg-router doc on circuit breaker (#11449)
* [router] when given both local tokenizer and chat template, log all (#11601)
* [AMD CI] Add image and weights caching. (#11593)
* Update release-docker-dev.yml (#11603)
* Optimize Triton Draft Backend (#11556)
* Refactor spec decoding metrics calculation into separate `TokenizerManager` utility function (#11586)
* make radix cache deterministic (#10721)
* move eagle draft post process to cuda graph (#11434)
* Reduce one step decode for draft model. (#11561)
* [router] add py binding and readme for openai router and history backend (#11453)
* [theta] print load mm cost
* [theta] Support tp8 for the 4-head Bailing model
* [router] cleanup app context and move to startup (#11617)
* [router] add chang and keyang to sgl router authors (#11620)
* use non_blocking h2d in ForwardBatch.prepare_mlp_sync_batch. (#11605)
* [router] update router readme to latest features (#11619)
* Fix log for chunked prefix cache (#11624)
* [Auto Sync] Update scheduler.py, server_args.py (20251014) (#11623)
* [Auto Sync] Update collector.py (20251014) (#11625)
* [Minor] Update xgrammar dependency (#11622)
* Update install.md (#11631)
* fix: Update SGL_KERNEL_VERSION to 0.3.15 (#11633)
* [router][grpc] add warm up to grpc server (#11627)
* Refactor kv cache free (#11351)
* [router] update router doc to latest features (#11639)
* fix: upgrade transformers to 4.57.1 (#11628)
* [router] add worker self discovery for metadata (#11638)
* [router] upgrade to 0.2.0 (#11642)
* [theta] Print qwen vl time costs
* [1/N] Introduce Mooncake Backend and Mooncake EP to Support Elastic EP (#10423)
* [theta] Print qwen vl time costs
* [1/N]Support  DeepSeek-R1 w4a8 normal deepep (#8247)
* [Fix] Fix accuracy bug in CSGMV kernel caching key. (#11579)
* feat: add add_chunked_prefix_cache_attention_backend (#11636)
* Super tiny improve FA3 import error message (#11590)
* [BugFix][Qwen3-VL]: fix cu_seqlens in qwen3-vl  (#11458)
* [Doc] Update support matrix for attn and hybrid attn (#11293)
* Clean up some Qwen3-Next and deterministic code (#11585)
* docs: update sglang installation guide (#11659)
* [theta] Update the aci image and dependencies
* Tiny cleanup some eagle unused codes (#11660)
* Fix 1-step draft model forward (#11653)
* [tool call] Fix prev_tool_call_arr management in base_format_detector.py (#11367)
* [router] Fix response api related spec (#11621)
* Fix missing json imports in serving_responses.py (#11681)
* [sgl-kernel][3/N]Support Expert Specialization Grouped GEMM (#11674)
* [sgl-kernel] Optimize gguf test (#11667)
* [router][grpc] Simplify model_id determination (#11684)
* [router] Refactor StopSequenceDecoder to Use Sequence for Incremental Decoding (#11676)
* chore: bump SGLang version to 0.5.3.post2 (#11680)
* [CI][XPU]enable sglang CI on Intel XPU (#9493)
* enable rmsnorm on XPU (#10248)
* Sync code and test CI; rename some env vars (#11686)
* docs: Add Contributor Covenant Code of Conduct (#11689)
* [theta] Add a DeepGEMM compilation cache to the dockerfile (needs periodic updates 😂)
* [Mamba] Increase default mamba_full_memory_ratio to 0.9 (#11679)
* [PD] Add PD support for hybrid model (Qwen3-Next, DeepSeek V3.2 Exp) (#10912)
* [sgl-kernel] support hadamard (#11663)
* Fix missing a2a backend init of GLM4.5 MoE Block (#11692)
* Split test_intel_amx_attention_backend.py to pass CI of timeout (#11370)
lpc0220 pushed a commit to lpc0220/sglang that referenced this pull request Oct 29, 2025
…ing (sgl-project#9812)

Co-authored-by: Xiang (Kevin) Li <lik@nvidia.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Labels

high priority · ready-to-merge · run-ci
