[diffusion] Add dynamic batching v0#18764
Conversation
Thanks for the initiative!! However, the throughput improvement looks pretty small; do you have any thoughts about that?
I updated the scheduler to drain multiple pending messages per poll, which helps, though there are probably more improvements to be made. Best results (see table) were with the per-model settings below.
The baseline improved since the last time I benchmarked, but the gains still stand. Gains are very model-dependent: some models benefit clearly (Qwen Image), while for others the throughput-latency tradeoff is less favourable, though still an improvement.
The specific batch delay and batch max size values used here were the best-performing settings for each model, found after a sweep through all options.

**Qwen/Qwen-Image-2512 -- Dynamic**

```bash
# Server
sglang serve --model-path Qwen/Qwen-Image-2512 --backend diffusers --num-gpus 1 \
--host 127.0.0.1 --port 31300 --log-level info --dit-layerwise-offload true \
--dynamic-batch-max-size 8 --dynamic-batch-delay-ms 8
# Benchmark
python -m sglang.multimodal_gen.benchmarks.bench_serving \
--base-url http://127.0.0.1:31300 --model Qwen/Qwen-Image-2512 \
--dataset vbench --task text-to-image --width 512 --height 512 \
--num-prompts 64 --max-concurrency 12 --request-rate inf --disable-tqdm
```

**Qwen/Qwen-Image-2512 -- Baseline**

```bash
# Server
sglang serve --model-path Qwen/Qwen-Image-2512 --backend diffusers --num-gpus 1 \
--host 127.0.0.1 --port 31301 --log-level info --dit-layerwise-offload true
# Benchmark
python -m sglang.multimodal_gen.benchmarks.bench_serving \
--base-url http://127.0.0.1:31301 --model Qwen/Qwen-Image-2512 \
--dataset vbench --task text-to-image --width 512 --height 512 \
--num-prompts 64 --max-concurrency 12 --request-rate inf --disable-tqdm
```

**Tongyi-MAI/Z-Image-Turbo -- Dynamic**

```bash
# Server
sglang serve --model-path Tongyi-MAI/Z-Image-Turbo --backend diffusers --num-gpus 1 \
--host 127.0.0.1 --port 31302 --log-level info --dit-layerwise-offload true \
--dynamic-batch-max-size 8 --dynamic-batch-delay-ms 10
# Benchmark
python -m sglang.multimodal_gen.benchmarks.bench_serving \
--base-url http://127.0.0.1:31302 --model Tongyi-MAI/Z-Image-Turbo \
--dataset vbench --task text-to-image --width 512 --height 512 \
--num-prompts 64 --max-concurrency 12 --request-rate inf --disable-tqdm
```

**Tongyi-MAI/Z-Image-Turbo -- Baseline**

```bash
# Server
sglang serve --model-path Tongyi-MAI/Z-Image-Turbo --backend diffusers --num-gpus 1 \
--host 127.0.0.1 --port 31303 --log-level info --dit-layerwise-offload true
# Benchmark
python -m sglang.multimodal_gen.benchmarks.bench_serving \
--base-url http://127.0.0.1:31303 --model Tongyi-MAI/Z-Image-Turbo \
--dataset vbench --task text-to-image --width 512 --height 512 \
--num-prompts 64 --max-concurrency 12 --request-rate inf --disable-tqdm
```

**black-forest-labs/FLUX.1-dev -- Dynamic**

```bash
# Server
sglang serve --model-path black-forest-labs/FLUX.1-dev --backend diffusers --num-gpus 1 \
--host 127.0.0.1 --port 31304 --log-level info --dit-layerwise-offload true \
--dynamic-batch-max-size 8 --dynamic-batch-delay-ms 5
# Benchmark
python -m sglang.multimodal_gen.benchmarks.bench_serving \
--base-url http://127.0.0.1:31304 --model black-forest-labs/FLUX.1-dev \
--dataset vbench --task text-to-image --width 512 --height 512 \
--num-prompts 64 --max-concurrency 12 --request-rate inf --disable-tqdm
```

**black-forest-labs/FLUX.1-dev -- Baseline**

```bash
# Server
sglang serve --model-path black-forest-labs/FLUX.1-dev --backend diffusers --num-gpus 1 \
--host 127.0.0.1 --port 31305 --log-level info --dit-layerwise-offload true
# Benchmark
python -m sglang.multimodal_gen.benchmarks.bench_serving \
--base-url http://127.0.0.1:31305 --model black-forest-labs/FLUX.1-dev \
--dataset vbench --task text-to-image --width 512 --height 512 \
--num-prompts 64 --max-concurrency 12 --request-rate inf --disable-tqdm
```

I'm opening the PR up for review, lmk your thoughts @mickqian
```python
MINIMUM_PICTURE_BASE64_FOR_WARMUP = "data:image/jpg;base64,iVBORw0KGgoAAAANSUhEUgAAACAAAAAgCAYAAABzenr0AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAAbUlEQVRYhe3VsQ2AMAxE0Y/lIgNQULD/OqyCMgCihCKSG4yRuKuiNH6JLsoEbMACOGBcua9HOR7Y6w6swBwMy0qLTpkeI77qdEBpBFAHBBDAGH8WrwJKI4AAegUCfAKgEgpQDvh3CR3oQCuav58qlAw73kKCSgAAAABJRU5ErkJggg=="
```
```python
_DYNAMIC_BATCH_SIGNATURE_EXCLUDED_FIELDS = {
```
Should we make it an attribute on the fields of Req, instead of maintaining a list of hard-coded names here?
Yes, I agree this would be better.
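For illustration, here is a minimal sketch of what that could look like, assuming a dataclass-style `Req`; the field names and the `batch_signature` metadata key are hypothetical, not the actual sglang definitions:

```python
from dataclasses import dataclass, field, fields

@dataclass
class Req:
    prompt: str
    seed: int
    # Hypothetical fields: anything that should not affect batch grouping
    # opts out via metadata instead of a hard-coded exclusion list.
    request_id: str = field(default="", metadata={"batch_signature": False})
    arrival_ts: float = field(default=0.0, metadata={"batch_signature": False})

def batch_signature(req: Req) -> tuple:
    """Collect only the fields that participate in the dynamic-batch signature."""
    return tuple(
        getattr(req, f.name)
        for f in fields(req)
        if f.metadata.get("batch_signature", True)
    )
```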
| f"({self.ring_degree} * {self.ulysses_degree} = {self.ring_degree * self.ulysses_degree})" | ||
| ) | ||
|
|
||
| if self.dynamic_batch_max_size < 1: |
This doesn't fall in the category of parallelism, does it?
```python
self.waiting_queue: deque[tuple[bytes | None, Any, float]] = deque()
self._dynamic_batch_max_size = max(1, server_args.dynamic_batch_max_size)
self._dynamic_batch_delay_s = max(
    0.0, server_args.dynamic_batch_delay_ms / 1000.0
```
dynamic_batch_delay_ms is already validated to be > 0, so the max is needless.
```diff
-self.waiting_queue: deque[tuple[bytes, Req]] = deque()
+# FIFO queue entries: (identity, request, enqueue_ts_s)
+self.waiting_queue: deque[tuple[bytes | None, Any, float]] = deque()
+self._dynamic_batch_max_size = max(1, server_args.dynamic_batch_max_size)
```
```python
# 2: execute, make sure a reply is always sent
items = self.get_next_batch_to_run()
if not items:
    if self.waiting_queue and self._dynamic_batch_delay_s > 0:
```
The condition could be simplified.
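For context, here is a minimal sketch of the delay-bounded drain logic under discussion, assuming the queue-entry layout `(identity, request, enqueue_ts_s)` shown above and timestamps taken from `time.monotonic()`; the names and structure are illustrative, not the actual scheduler code:

```python
import time
from collections import deque
from typing import Any

def get_next_batch_to_run(
    waiting_queue: deque,  # entries: (identity, request, enqueue_ts_s)
    max_size: int,
    delay_s: float,
) -> list:
    """Drain up to max_size entries once the oldest entry has waited
    delay_s, or immediately if the queue already fills a batch."""
    if not waiting_queue:
        return []
    oldest_enqueue_ts = waiting_queue[0][2]
    # Hold the batch open so more requests can coalesce, unless it is full.
    if time.monotonic() - oldest_enqueue_ts < delay_s and len(waiting_queue) < max_size:
        return []
    return [waiting_queue.popleft() for _ in range(min(max_size, len(waiting_queue)))]
```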
Addressed suggestions @mickqian
Thanks for your great work! Is there any reason or concern about not supporting i2i/edit requests?
Valid question! I initially opened this PR as a first implementation of inference batching for sgl diffusion, so I wanted minimal complexity (hence the focus on the most common case, t2i, where requests only differ by prompt and seed). I figured image requests would be more complex to batch, but I'd be happy to try to implement this tomorrow! The main challenge I can think of off the top of my head for i2i/edit batching is that different models handle image conditioning differently (some concatenate spatial conditioning (image/mask latents) with the noisy latent along the channel dimension, others inject conditioning through cross-attention, etc.), so the batching logic would likely need to be model-specific rather than one general solution. My intuition could be wrong though, lmk what you think @SYChen123
@qimcis Hi qimcis. I think from a design perspective, how the models handle image conditioning should not affect the scheduling logic. For any type of model, we just batch the requests based on, say, the number of image and text tokens, and pass the batch to the executor. However, I am new to diffusion, so I am not sure if there are any details I missed; if so, feel free to correct me! Another doubt I have: will a model execute different types of requests simultaneously? For example, can a model execute t2i and i2i requests at the same time? I'm not sure whether such a case happens. If it does, should we batch t2i and i2i requests separately? i2i requests usually take longer to generate and would slow down t2i requests if put in one batch.
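One possible shape for that separation, as a rough sketch (the `image_inputs` attribute is hypothetical, not an actual sglang field): partition the waiting queue by task type before forming batches, so t2i and i2i requests never share one.

```python
from collections import defaultdict
from typing import Any

def group_by_task(waiting: list) -> dict:
    """Partition pending requests so each batch holds a single task type."""
    groups: dict[str, list[Any]] = defaultdict(list)
    for req in waiting:
        # Hypothetical attribute: treat any request carrying image inputs
        # as i2i/edit, everything else as plain t2i.
        task = "i2i" if getattr(req, "image_inputs", None) else "t2i"
        groups[task].append(req)
    return dict(groups)
```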
Hi @yhyang201 and @mickqian, we refined the implementation and ran benchmarks with two models. We found that the performance improvement is significant and closely related to batch size: using a batch size of 8 can reduce latency by up to 8x. Could you take a look when you're available?
Batch vs. non-batch results for black-forest-labs/FLUX.1-dev (benchmark screenshots omitted). For Tongyi-MAI/Z-Image-Turbo, batch prompts aren't supported in SGLang yet; I plan to add support in a separate PR.
@mickqian I will create a roadmap over this weekend to keep track of everything that needs to be done following this PR for inference batching support. And @SYChen123, I plan on opening a follow-up PR after this one to address i2i/edit batching, just for separation of concerns! Expect a separate PR shortly after this one merges.
@qimcis thanks for the great work! Can you help me understand how the latency gets calculated? My intuition is that if the batch size gets larger for a single inference request, latency should increase due to the higher compute and memory footprint. However, I see improvements in both latency and throughput.
My implementation applies dynamic batching across multiple concurrent requests, not larger static batches for a single isolated request, so the benchmark latency is end-to-end per-request latency under concurrent load and includes queueing time. In this setting, I believe batching can reduce queueing enough that both throughput and latency improve. For a single isolated request like you mentioned, I'd agree that batching usually would not make it faster; I could be misunderstanding though.
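A toy calculation with made-up numbers (not taken from the benchmark) showing why mean per-request latency can drop even though a batch takes longer than a single image:

```python
# 12 concurrent requests; assume one image takes 2.0 s alone and a
# batch of 8 takes 6.0 s (illustrative numbers only).
single_s, batch8_s, n = 2.0, 6.0, 12

# Sequential serving: request i also waits for the i earlier requests.
sequential_mean = sum(single_s * (i + 1) for i in range(n)) / n  # 13.0 s

# Dynamic batching: first 8 finish together at 6.0 s, the last 4 at 12.0 s.
batched_mean = (8 * batch8_s + 4 * (2 * batch8_s)) / n           # 8.0 s

print(sequential_mean, batched_mean)  # queueing shrinks, so mean latency drops
```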
Hi @qimcis, thanks for the clarification; I now understand the latency here. Just to clarify, the latency I was curious about is on the load-testing side, where we check how many concurrent requests a server can handle under different parallelisms, numbers of GPUs, dynamic batch sizes, etc., and measure throughput and single-request latencies (since single-request latency is what a user actually sees). There will be a tipping point where the server is overloaded by too many concurrent requests and we see a significant latency increase.
Thanks for the great work on adding dynamic batching support for diffusion models. I noticed that the current evaluation is primarily based on 512x512 resolution. At this scale, many diffusion workloads tend to be more memory-bound, where batching optimizations can show clearer benefits. I'm wondering if it might also be helpful to include benchmarks at higher resolutions, such as 1024x1024, which are more commonly used in practice. At these larger resolutions, workloads often become more compute-bound, and the impact of dynamic batching may differ.

From some quick experiments on my side, I observed that for Qwen-Image at 1024x1024 the workload still appears to be largely memory-bound, and the benefit from batching is relatively limited. I've shared some test results here for reference. It might be useful to run similar tests to better understand how the optimization behaves under this setting.
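As a rough back-of-the-envelope for the resolution point, assuming an 8x VAE downsample and a 2x2 patchify (common for DiT-style models, but not verified for any specific model here):

```python
def dit_tokens(height: int, width: int, vae_stride: int = 8, patch: int = 2) -> int:
    """Approximate DiT sequence length for a given image resolution."""
    return (height // vae_stride // patch) * (width // vae_stride // patch)

t512, t1024 = dit_tokens(512, 512), dit_tokens(1024, 1024)  # 1024 vs 4096 tokens
# Self-attention cost grows roughly with the square of sequence length,
# so 1024x1024 is ~16x the attention FLOPs of 512x512 per image.
print(t512, t1024, (t1024 / t512) ** 2)
```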
Motivation
#18594
Modifications
Added dynamic batching (with max batch size + delay) to the diffusion scheduler. Across the tested text-to-image models, dynamic batching gave up to 29.6% higher throughput, 22.4% lower mean latency, and 31.8% lower P99 latency.
Currently does not support i2i/edits/i2v, just prompt-only diffusion requests (t2i, t2v).
Benchmarking and Profiling
1x NVIDIA H100 80GB
Results (64 prompts): for example, on Qwen Image, Dynamic Batching vs. Baseline (benchmark tables omitted).
Checklist