
[diffusion] Add dynamic batching v0 #18764

Merged: mickqian merged 33 commits into sgl-project:main from qimcis:dynamic-batching on May 3, 2026

Conversation

@qimcis (Contributor) commented Feb 13, 2026

Motivation

#18594

Modifications

Added dynamic batching (with a max batch size and a batching delay) to the diffusion scheduler. Across the tested text-to-image models, dynamic batching gave up to 29.6% higher throughput, 22.4% lower mean latency, and 31.8% lower P99 latency.

Currently does not support i2i/edits/i2v, just prompt-only diffusion requests (t2i, t2v).

Benchmarking and Profiling

1x NVIDIA H100 80GB
Results (64 prompts):

| Model | Thr. Baseline (req/s) | Thr. Dynamic (req/s) | Δ | Mean Lat. Baseline (s) | Mean Lat. Dynamic (s) | Δ | P99 Baseline (s) | P99 Dynamic (s) | Δ | Duration Baseline (s) | Duration Dynamic (s) | Δ | Peak Mem Baseline (MB) | Peak Mem Dynamic (MB) | Δ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Tongyi-MAI/Z-Image-Turbo | 0.377 | 0.395 | +4.7% | 10.351 | 9.960 | -3.8% | 13.267 | 10.358 | -21.9% | 169.581 | 161.984 | -4.5% | 20388.42 | 22215.95 | +9.0% |
| black-forest-labs/FLUX.1-dev | 0.496 | 0.538 | +8.4% | 7.877 | 7.437 | -5.6% | 10.087 | 9.244 | -8.4% | 129.055 | 119.062 | -7.7% | 32839.30 | 34676.05 | +5.6% |
| Qwen/Qwen-Image-2512 | 0.488 | 0.633 | +29.6% | 8.002 | 6.212 | -22.4% | 10.190 | 6.954 | -31.8% | 131.120 | 101.147 | -22.9% | 57282.23 | 59880.09 | +4.5% |

For example, on Qwen Image:

Dynamic Batching:

sglang serve --model-path Qwen/Qwen-Image-2512 --backend diffusers --num-gpus 1 --host 127.0.0.1 --port 30200 --log-level info --dynamic-batch-max-size 4 --dynamic-batch-delay-ms 5 --dit-layerwise-offload true
python -m sglang.multimodal_gen.benchmarks.bench_serving --base-url http://127.0.0.1:30200 --model Qwen/Qwen-Image-2512 --dataset vbench --dataset-path /path/to/prompts --task text-to-image --width 512 --height 512 --num-prompts 64 --max-concurrency 4 --request-rate inf --disable-tqdm

Baseline:

sglang serve --model-path Qwen/Qwen-Image-2512 --backend diffusers --num-gpus 1 --host 127.0.0.1 --port 30300 --log-level info --dit-layerwise-offload true
python -m sglang.multimodal_gen.benchmarks.bench_serving --base-url http://127.0.0.1:30300 --model Qwen/Qwen-Image-2512 --dataset vbench --dataset-path /path/to/prompts --task text-to-image --width 512 --height 512 --num-prompts 64 --max-concurrency 4 --request-rate inf --disable-tqdm

Checklist


@JoeyTPChou

Thanks for the initiative! However, the throughput improvement looks pretty small; do you have any thoughts on that?

@qimcis (Contributor, Author) commented Feb 13, 2026

> Thanks for the initiative! However, the throughput improvement looks pretty small; do you have any thoughts on that?

I updated the scheduler to drain multiple pending messages per poll, which helps, though there are probably further improvements to be made.
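The drain-per-poll change might look something like this sketch (the `try_recv` interface is hypothetical; the real scheduler reads from a socket):

```python
import time
from collections import deque

def drain_pending(try_recv, waiting_queue: deque, max_drain: int = 64) -> int:
    """Pull every message already pending in one poll iteration, instead of
    one message per scheduler loop. `try_recv` is any non-blocking receive
    that returns a message, or None when nothing is pending."""
    drained = 0
    while drained < max_drain:
        msg = try_recv()
        if msg is None:
            break  # nothing left pending; go run the batch
        waiting_queue.append((msg, time.monotonic()))
        drained += 1
    return drained
```

Without this, at most one request could join the queue per scheduler iteration, which caps the effective batch size under bursty load.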

| Model | Throughput (req/s) | Δ Throughput | Mean (s) | Δ Mean | P99 (s) | Δ P99 | Peak Mem (MB) | Δ Mem |
|---|---|---|---|---|---|---|---|---|
| Qwen/Qwen-Image-2512 (baseline) | 0.35 | | 11.08 | | 11.64 | | 57,301 | |
| Qwen/Qwen-Image-2512 (best) | 0.68 | +94.3% | 5.89 | +46.8% | 6.54 | +43.8% | 59,880 | +4.5% |
| Tongyi-MAI/Z-Image-Turbo (baseline) | 0.36 | | 10.72 | | 11.18 | | 20,399 | |
| Tongyi-MAI/Z-Image-Turbo (best) | 0.41 | +13.9% | 9.70 | +9.6% | 9.99 | +10.7% | 22,216 | +8.9% |
| black-forest-labs/FLUX.1-dev (baseline) | 0.49 | | 7.90 | | 8.30 | | 32,839 | |
| black-forest-labs/FLUX.1-dev (best) | 0.54 | +10.2% | 7.32 | +7.3% | 9.42 | -13.5% | 34,676 | +5.6% |

Best results (see table) were with `--dynamic-batch-max-size 8 --dynamic-batch-delay-ms 5`, chosen after running a sweep over the settings.

@qimcis (Contributor, Author) commented Feb 22, 2026

The baseline improved since I last benchmarked, but the improvements still stand. Gains are very model-dependent: some models benefit clearly (Qwen Image), while for others the throughput-latency tradeoff is less favourable, though still an improvement.

| Model | Thr. Baseline (req/s) | Thr. Dynamic (req/s) | Δ | Mean Lat. Baseline (s) | Mean Lat. Dynamic (s) | Δ | P99 Baseline (s) | P99 Dynamic (s) | Δ | Duration Baseline (s) | Duration Dynamic (s) | Δ | Peak Mem Baseline (MB) | Peak Mem Dynamic (MB) | Δ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen/Qwen-Image-2512 | 0.483 | 0.725 | +50.22% | 22.669 | 15.458 | -31.81% | 25.080 | 21.314 | -15.02% | 132.532 | 88.227 | -33.43% | 58002 | 65904 | +13.62% |
| Tongyi-MAI/Z-Image-Turbo | 0.372 | 0.422 | +13.44% | 29.496 | 26.650 | -9.65% | 32.405 | 35.313 | +8.98% | 172.047 | 151.666 | -11.85% | 21572 | 33890 | +57.10% |
| black-forest-labs/FLUX.1-dev | 0.497 | 0.554 | +11.48% | 22.060 | 20.369 | -7.66% | 24.228 | 25.658 | +5.90% | 128.680 | 115.431 | -10.30% | 33362 | 41988 | +25.86% |

The batch delay and max batch size used here were the best-performing settings for each model, chosen after a sweep over all options.

Qwen/Qwen-Image-2512 -- Dynamic

# Server
sglang serve --model-path Qwen/Qwen-Image-2512 --backend diffusers --num-gpus 1 \
  --host 127.0.0.1 --port 31300 --log-level info --dit-layerwise-offload true \
  --dynamic-batch-max-size 8 --dynamic-batch-delay-ms 8

# Benchmark
python -m sglang.multimodal_gen.benchmarks.bench_serving \
  --base-url http://127.0.0.1:31300 --model Qwen/Qwen-Image-2512 \
  --dataset vbench --task text-to-image --width 512 --height 512 \
  --num-prompts 64 --max-concurrency 12 --request-rate inf --disable-tqdm

Qwen/Qwen-Image-2512 -- Baseline

# Server
sglang serve --model-path Qwen/Qwen-Image-2512 --backend diffusers --num-gpus 1 \
  --host 127.0.0.1 --port 31301 --log-level info --dit-layerwise-offload true

# Benchmark
python -m sglang.multimodal_gen.benchmarks.bench_serving \
  --base-url http://127.0.0.1:31301 --model Qwen/Qwen-Image-2512 \
  --dataset vbench --task text-to-image --width 512 --height 512 \
  --num-prompts 64 --max-concurrency 12 --request-rate inf --disable-tqdm

Tongyi-MAI/Z-Image-Turbo -- Dynamic

# Server
sglang serve --model-path Tongyi-MAI/Z-Image-Turbo --backend diffusers --num-gpus 1 \
  --host 127.0.0.1 --port 31302 --log-level info --dit-layerwise-offload true \
  --dynamic-batch-max-size 8 --dynamic-batch-delay-ms 10

# Benchmark
python -m sglang.multimodal_gen.benchmarks.bench_serving \
  --base-url http://127.0.0.1:31302 --model Tongyi-MAI/Z-Image-Turbo \
  --dataset vbench --task text-to-image --width 512 --height 512 \
  --num-prompts 64 --max-concurrency 12 --request-rate inf --disable-tqdm

Tongyi-MAI/Z-Image-Turbo -- Baseline

# Server
sglang serve --model-path Tongyi-MAI/Z-Image-Turbo --backend diffusers --num-gpus 1 \
  --host 127.0.0.1 --port 31303 --log-level info --dit-layerwise-offload true

# Benchmark
python -m sglang.multimodal_gen.benchmarks.bench_serving \
  --base-url http://127.0.0.1:31303 --model Tongyi-MAI/Z-Image-Turbo \
  --dataset vbench --task text-to-image --width 512 --height 512 \
  --num-prompts 64 --max-concurrency 12 --request-rate inf --disable-tqdm

black-forest-labs/FLUX.1-dev -- Dynamic

# Server
sglang serve --model-path black-forest-labs/FLUX.1-dev --backend diffusers --num-gpus 1 \
  --host 127.0.0.1 --port 31304 --log-level info --dit-layerwise-offload true \
  --dynamic-batch-max-size 8 --dynamic-batch-delay-ms 5

# Benchmark
python -m sglang.multimodal_gen.benchmarks.bench_serving \
  --base-url http://127.0.0.1:31304 --model black-forest-labs/FLUX.1-dev \
  --dataset vbench --task text-to-image --width 512 --height 512 \
  --num-prompts 64 --max-concurrency 12 --request-rate inf --disable-tqdm

black-forest-labs/FLUX.1-dev -- Baseline

# Server
sglang serve --model-path black-forest-labs/FLUX.1-dev --backend diffusers --num-gpus 1 \
  --host 127.0.0.1 --port 31305 --log-level info --dit-layerwise-offload true

# Benchmark
python -m sglang.multimodal_gen.benchmarks.bench_serving \
  --base-url http://127.0.0.1:31305 --model black-forest-labs/FLUX.1-dev \
  --dataset vbench --task text-to-image --width 512 --height 512 \
  --num-prompts 64 --max-concurrency 12 --request-rate inf --disable-tqdm

I'm opening the PR up for review; let me know your thoughts @mickqian

@qimcis qimcis marked this pull request as ready for review February 22, 2026 22:53

MINIMUM_PICTURE_BASE64_FOR_WARMUP = "data:image/jpg;base64,iVBORw0KGgoAAAANSUhEUgAAACAAAAAgCAYAAABzenr0AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAAbUlEQVRYhe3VsQ2AMAxE0Y/lIgNQULD/OqyCMgCihCKSG4yRuKuiNH6JLsoEbMACOGBcua9HOR7Y6w6swBwMy0qLTpkeI77qdEBpBFAHBBDAGH8WrwJKI4AAegUCfAKgEgpQDvh3CR3oQCuav58qlAw73kKCSgAAAABJRU5ErkJggg=="

_DYNAMIC_BATCH_SIGNATURE_EXCLUDED_FIELDS = {
Collaborator:

Should we make it an attribute on the fields of Req, instead of maintaining a list of hard-coded names here?
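One way to realize this suggestion, sketched with dataclass field metadata (the `Req` fields and helper names shown here are hypothetical stand-ins):

```python
from dataclasses import dataclass, field, fields

def batch_excluded(default):
    # Mark a field as not participating in the batch-compatibility signature.
    return field(default=default, metadata={"batch_signature_exclude": True})

@dataclass
class Req:  # simplified stand-in for the real Req
    prompt: str = ""
    seed: int = batch_excluded(0)          # per-request value, still batchable
    request_id: str = batch_excluded("")

def signature_fields(req) -> dict:
    """Collect only the fields that define batch compatibility."""
    return {
        f.name: getattr(req, f.name)
        for f in fields(req)
        if not f.metadata.get("batch_signature_exclude", False)
    }
```

This keeps the exclusion decision next to each field definition instead of in a separately maintained name list.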

Contributor (Author):

Yes, I agree this would be better.

f"({self.ring_degree} * {self.ulysses_degree} = {self.ring_degree * self.ulysses_degree})"
)

if self.dynamic_batch_max_size < 1:
Collaborator:

This doesn't fall in the category of parallelism, does it?

self.waiting_queue: deque[tuple[bytes | None, Any, float]] = deque()
self._dynamic_batch_max_size = max(1, server_args.dynamic_batch_max_size)
self._dynamic_batch_delay_s = max(
    0.0, server_args.dynamic_batch_delay_ms / 1000.0
)
Collaborator:

dynamic_batch_delay_ms is validated to be > 0, so the max is needless.

self.waiting_queue: deque[tuple[bytes, Req]] = deque()
# FIFO queue entries: (identity, request, enqueue_ts_s)
self.waiting_queue: deque[tuple[bytes | None, Any, float]] = deque()
self._dynamic_batch_max_size = max(1, server_args.dynamic_batch_max_size)
Collaborator:

ditto

# 2: execute, make sure a reply is always sent
items = self.get_next_batch_to_run()
if not items:
if self.waiting_queue and self._dynamic_batch_delay_s > 0:
Collaborator:

condition could be simplified

@qimcis (Contributor, Author) commented Feb 23, 2026

Addressed the suggestions @mickqian

@SYChen123 (Contributor) commented:

Thanks for your great work! Are there any reasons or concerns behind not supporting i2i/edit requests?
I recently need to deploy Qwen-Image-Edit in my production environment, so I'd be happy to collaborate and contribute to batch inference for image-edit requests.

@qimcis (Contributor, Author) commented Mar 5, 2026

> Thanks for your great work! Are there any reasons or concerns behind not supporting i2i/edit requests? I recently need to deploy Qwen-Image-Edit in my production environment, so I'd be happy to collaborate and contribute to batch inference for image-edit requests.

Valid question! I initially opened this PR as a first implementation of inference batching for sgl diffusion, so I wanted minimal complexity (hence the focus on only the most common case, t2i, where requests differ only by prompt and seed). I figured that image requests would be more complex to batch. I'd be happy to try to implement this tomorrow though!

The main challenge I can think of off the top of my head for i2i/edit batching is that different models handle image conditioning differently (some concatenate spatial conditioning (image/mask latents) with the noisy latent along the channel dimension, others inject conditioning through cross-attention, etc.), so the batching logic would likely need to be model-specific rather than one general solution. My intuition could be wrong though; let me know what you think @SYChen123

@SYChen123 (Contributor) commented:

> Valid question! I initially opened this PR as a first implementation of inference batching for sgl diffusion, so I wanted minimal complexity (hence the focus on only the most common case, t2i, where requests differ only by prompt and seed). I figured that image requests would be more complex to batch. I'd be happy to try to implement this tomorrow though!
>
> The main challenge I can think of off the top of my head for i2i/edit batching is that different models handle image conditioning differently (some concatenate spatial conditioning (image/mask latents) with the noisy latent along the channel dimension, others inject conditioning through cross-attention, etc.), so the batching logic would likely need to be model-specific rather than one general solution. My intuition could be wrong though; let me know what you think @SYChen123

@qimcis Hi qimcis. I think that, from a design perspective, how the models handle image conditioning should not affect the scheduling logic. For any type of model, we just batch requests based on, say, the number of image and text tokens and pass the batch to the executor. However, I am new to diffusion, so I am not sure if there are details I've missed; if so, feel free to correct me!

Another doubt I have: will a model execute different types of requests simultaneously? For example, a model might be able to execute t2i and i2i requests at the same time; I'm not sure whether that case happens. If it does, shall we batch t2i and i2i requests separately? An i2i request usually takes longer to generate and would slow down t2i requests if placed in the same batch.
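Keeping request types from mixing could be done by batching only same-type peers of the oldest waiting request, e.g. (a sketch; the `task` attribute and the function name are assumptions):

```python
from collections import deque

def pop_batch_by_task(waiting_queue: deque, max_size: int) -> list:
    """Take the task type of the oldest request (e.g. "t2i" vs "i2i") and
    batch only requests of that same type, leaving the rest queued."""
    if not waiting_queue:
        return []
    head_task = waiting_queue[0].task
    batch, leftover = [], deque()
    while waiting_queue and len(batch) < max_size:
        req = waiting_queue.popleft()
        (batch if req.task == head_task else leftover).append(req)
    # Return incompatible requests to the front, preserving arrival order.
    waiting_queue.extendleft(reversed(leftover))
    return batch
```

This keeps slow i2i work out of t2i batches at the cost of some head-of-line blocking for the minority type.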

@yyy1000 (Contributor) commented Mar 12, 2026

Hi @yhyang201 and @mickqian,

We refined the implementation and ran benchmarks with two models. We found that the performance improvement is significant and closely related to batch size—using a batch size of 8 can reduce latency by up to 8×. Could you take a look when you're available?

1. Qwen/Qwen-Image-2512 (`--dataset vbench --task text-to-image --width 512 --height 512 --num-prompts 8 --max-concurrency 8 --request-rate inf --disable-tqdm`)

Batch:
================= Serving Benchmark Result =================
Task:                                         text-to-image                 
Model:                                        Qwen/Qwen-Image-2512          
Dataset:                                      vbench                        
--------------------------------------------------
Benchmark duration (s):                       81.94                         
Request rate:                                 inf                           
Max request concurrency:                      8                             
Successful requests:                          8/8                           
--------------------------------------------------
Request throughput (req/s):                   0.10                          
Latency Mean (s):                             81.94                         
Latency Median (s):                           81.94                         
Latency P99 (s):                              81.94                         
--------------------------------------------------
Peak Memory Max (MB):                         22926.00                      
Peak Memory Mean (MB):                        22926.00                      
Peak Memory Median (MB):                      22926.00                      

Non-batch:

================= Serving Benchmark Result =================
Task:                                         text-to-image                 
Model:                                        Qwen/Qwen-Image-2512          
Dataset:                                      vbench                        
--------------------------------------------------
Benchmark duration (s):                       586.73                        
Request rate:                                 inf                           
Request timeout (s):                          disabled                      
Max request concurrency:                      8                             
Successful requests:                          8/8                           
--------------------------------------------------
Request throughput (req/s):                   0.01                          
Latency Mean (s):                             330.95                        
Latency Median (s):                           330.95                        
Latency P99 (s):                              581.62                        
--------------------------------------------------
Peak Memory Max (MB):                         6646.00                       
Peak Memory Mean (MB):                        6646.00                       
Peak Memory Median (MB):                      6646.00                       
------------------------------------------------------------

2. black-forest-labs/FLUX.1-dev (`--dataset vbench --task text-to-image --width 512 --height 512 --num-prompts 8 --max-concurrency 8`)

Batch:

================= Serving Benchmark Result =================
Task:                                         text-to-image                 
Model:                                        black-forest-labs/FLUX.1-dev  
Dataset:                                      vbench                        
--------------------------------------------------
Benchmark duration (s):                       34.49                         
Request rate:                                 inf                           
Request timeout (s):                          300.00                        
Max request concurrency:                      8                             
Successful requests:                          8/8                           
--------------------------------------------------
Request throughput (req/s):                   0.23                          
Latency Mean (s):                             34.49                         
Latency Median (s):                           34.49                         
Latency P99 (s):                              34.49                         
--------------------------------------------------
Peak Memory Max (MB):                         22124.00                      
Peak Memory Mean (MB):                        22124.00                      
Peak Memory Median (MB):                      22124.00                   

Non-batch:

================= Serving Benchmark Result =================
Task:                                         text-to-image                 
Model:                                        black-forest-labs/FLUX.1-dev  
Dataset:                                      vbench                        
--------------------------------------------------
Benchmark duration (s):                       167.57                        
Request rate:                                 inf                           
Request timeout (s):                          disabled                      
Max request concurrency:                      8                             
Successful requests:                          8/8                           
--------------------------------------------------
Request throughput (req/s):                   0.05                          
Latency Mean (s):                             94.61                         
Latency Median (s):                           94.62                         
Latency P99 (s):                              166.10                        
--------------------------------------------------
Peak Memory Max (MB):                         5342.00                       
Peak Memory Mean (MB):                        5341.75                       
Peak Memory Median (MB):                      5342.00                       
------------------------------------------------------------

For Tongyi-MAI/Z-Image-Turbo, batched prompts aren't supported in SGLang yet; I plan to add support in a separate PR.

@qimcis (Contributor, Author) commented Mar 12, 2026

@mickqian I will create a roadmap over this weekend to track everything that needs to be done for inference batching support after this PR. @SYChen123, I plan on opening a follow-up PR after this one to address i2i/edit batching, for separation of concerns; expect it shortly after this one merges.

@JoeyTPChou

@qimcis thanks for the great work! Can you help me understand how the latency gets calculated? My intuition is that if the batch size gets larger for a single inference request, latency should increase due to higher compute and memory footprint. However, I see improvement in both latency and throughput.

@qimcis (Contributor, Author) commented Apr 7, 2026

> @qimcis thanks for the great work! Can you help me understand how the latency gets calculated? My intuition is that if the batch size gets larger for a single inference request, latency should increase due to higher compute and memory footprint. However, I see improvement in both latency and throughput.

My implementation does dynamic batching across multiple concurrent requests, not larger static batches for a single isolated request, so the benchmark latency is end-to-end per-request latency under concurrent load and includes queueing time. In this setting, I believe batching can reduce queueing enough that both throughput and latency improve. For a single isolated request like you mentioned, I'd agree that batching usually would not make it faster; I could be misunderstanding though.
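A toy back-of-the-envelope model makes the queueing effect concrete (illustrative numbers only, not taken from the benchmarks):

```python
def mean_latency_sequential(n: int, per_req_s: float) -> float:
    # Request i waits for i-1 predecessors, then runs: latency = i * per_req_s.
    return sum(i * per_req_s for i in range(1, n + 1)) / n

def mean_latency_batched(n: int, batch_time_s: float) -> float:
    # All n requests enter one batch and finish together.
    return batch_time_s

# 8 simultaneous requests: 10 s each served one-by-one, vs one batch of 8
# that takes 20 s total. Mean latency: 45.0 s sequential vs 20.0 s batched.
print(mean_latency_sequential(8, 10.0), mean_latency_batched(8, 20.0))
```

Even though the batch itself is slower than a single request, every request skips most of the queue, so both mean latency and throughput can improve under load.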

@qimcis qimcis force-pushed the dynamic-batching branch from e3a9641 to 0bb5de2 Compare April 7, 2026 03:08
@JoeyTPChou

> > @qimcis thanks for the great work! Can you help me understand how the latency gets calculated? My intuition is that if the batch size gets larger for a single inference request, latency should increase due to higher compute and memory footprint. However, I see improvement in both latency and throughput.
>
> My implementation does dynamic batching across multiple concurrent requests, not larger static batches for a single isolated request, so the benchmark latency is end-to-end per-request latency under concurrent load and includes queueing time. In this setting, I believe batching can reduce queueing enough that both throughput and latency improve. For a single isolated request like you mentioned, I'd agree that batching usually would not make it faster; I could be misunderstanding though.

Hi @qimcis, thanks for the clarification; I now understand the latency here. Just to clarify, the latency I was curious about is on the load-testing side, where we check how many concurrent requests a server can handle under different parallelisms, numbers of GPUs, dynamic batch sizes, etc., looking at both throughput and single-request latencies (since single-request latency is what a user sees). There will be a tipping point where the server is overloaded by too many concurrent requests and we see a significant latency increase.

@niehen6174 (Contributor) commented:

Thanks for the great work on adding dynamic batching support for diffusion models.

I noticed that the current evaluation is primarily based on 512×512 resolution. At this scale, many diffusion workloads tend to be more memory-bound, where batching optimizations can show clearer benefits.

I’m wondering if it might also be helpful to include benchmarks at higher resolutions, such as 1024×1024, which are more commonly used in practice. At these larger resolutions, workloads often become more compute-bound, and the impact of dynamic batching may differ.

From some quick experiments on my side, I observed that for Qwen-Image at 1024×1024, the workload still appears to be largely memory-bound, and the benefit from batching is relatively limited. I’ve shared some test results here for reference:
#22183 (comment)

It might be useful to run similar tests to better understand how the optimization behaves under this setting.

@qimcis qimcis force-pushed the dynamic-batching branch from 6943550 to 03b3b33 Compare May 3, 2026 14:26
@mickqian mickqian merged commit 62265ca into sgl-project:main May 3, 2026
71 of 78 checks passed

Labels: diffusion, documentation, npu, run-ci
