
[Performance] Dynamic Batch Tokenizer#9382

Merged
hnyls2002 merged 7 commits into sgl-project:main from sundar24295s:suramach/batchtokenizer
Sep 13, 2025

Conversation

@sundar24295s
Collaborator

@sundar24295s sundar24295s commented Aug 20, 2025

Motivation

  • This PR introduces an AsyncDynamicBatchTokenizer that enables batching of tokenization requests to improve throughput and reduce latency for SGLang's tokenizer manager.

Performance Impact:

  • For Qwen3-Embedding-0.6B with 500 input tokens and 1 prompt per request at RPS 500, P99 latency improved from 4583 ms to 464 ms (~10× faster) with this PR.
  • In production systems, P99 latency is a critical metric as it represents the worst-case experience for users. This improvement allows systems to extract maximum throughput while maintaining acceptable P99 latency thresholds for SLA compliance.

Context

This PR builds upon the tokenization batching infrastructure introduced in PR #5141, which added enable_tokenizer_batch_encode for batching multiple texts within a single request.

Tokenization Batching Options

| Feature | Use Case | When to Use |
|---------|----------|-------------|
| `enable_tokenizer_batch_encode` (PR #5141) | Client sends batched inputs in a single request | When your client batches multiple texts into one API call |
| `enable_dynamic_batch_tokenizer` (this PR) | Client sends single prompts across multiple requests | When your client application sends individual requests |

Example scenarios:

  • Use enable_tokenizer_batch_encode: Client sends {"input": ["text1", "text2", "text3"]} in one request
  • Use enable_dynamic_batch_tokenizer: Client sends multiple concurrent requests: {"input": "text1"}, {"input": "text2"}, {"input": "text3"}
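The two request shapes above can be sketched as payload builders; note that `batch_encode_payload` and `single_payloads` are hypothetical helper names used only for illustration, not SGLang APIs:

```python
import json

def batch_encode_payload(texts):
    # One request carrying several texts: pairs with --enable-tokenizer-batch-encode.
    return json.dumps({"input": texts})

def single_payloads(texts):
    # Many concurrent single-text requests: pairs with --enable-dynamic-batch-tokenizer,
    # which merges them server-side into one tokenizer call.
    return [json.dumps({"input": t}) for t in texts]
```

The point of the distinction: the first mode relies on the client to form the batch, while the second lets the server recover batching efficiency from clients that cannot.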

Modifications

  • 🚀 Dynamic Batching

    • Automatically batches multiple concurrent tokenization requests for efficiency
    • Processes single requests immediately when no other requests are pending
    • Collects additional requests up to max_batch_size or until batch_wait_timeout_s elapses when the queue has pending items
  • ⚙️ Server Args

    • max_batch_size (default: 32): Maximum number of requests to batch together
    • batch_wait_timeout_s (default: 0.002s): Maximum time to wait for additional requests
    • enable_dynamic_batch_tokenizer: Feature flag to enable/disable the functionality
    • Usage
      --enable-dynamic-batch-tokenizer \
      --dynamic-batch-tokenizer-batch-size 32 \
      --dynamic-batch-tokenizer-batch-timeout 0.002
      
  • 🔄 Async Processing

    • Non-blocking tokenization using asyncio and ThreadPoolExecutor
    • Maintains event loop responsiveness while handling blocking tokenizer calls
    • Scales efficiently with concurrent requests
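The batching loop described above can be sketched as follows. This is a simplified, hypothetical reimplementation for illustration (the real SGLang `AsyncDynamicBatchTokenizer` differs in details): requests enter an `asyncio.Queue`, a lone request is tokenized immediately, and when more requests are pending the loop collects up to `max_batch_size` of them within `batch_wait_timeout_s` before dispatching one blocking tokenizer call to a `ThreadPoolExecutor`.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

class AsyncDynamicBatchTokenizer:
    """Illustrative sketch of the dynamic batching loop, not the actual SGLang class."""

    def __init__(self, tokenize_fn, max_batch_size=32, batch_wait_timeout_s=0.002):
        self._tokenize_fn = tokenize_fn  # blocking callable: list[str] -> list[list[int]]
        self._max_batch_size = max_batch_size
        self._batch_wait_timeout_s = batch_wait_timeout_s
        self._queue: asyncio.Queue = asyncio.Queue()
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._worker = None

    async def encode(self, text: str) -> list:
        # Lazily start the background batching loop on first use.
        if self._worker is None:
            self._worker = asyncio.create_task(self._batch_loop())
        fut = asyncio.get_running_loop().create_future()
        await self._queue.put((text, fut))
        return await fut

    async def _batch_loop(self):
        loop = asyncio.get_running_loop()
        while True:
            # Block until at least one request arrives.
            batch = [await self._queue.get()]
            # If nothing else is pending, tokenize the single request immediately.
            # Otherwise, collect more requests up to max_batch_size, waiting at
            # most batch_wait_timeout_s for stragglers.
            if not self._queue.empty():
                deadline = loop.time() + self._batch_wait_timeout_s
                while len(batch) < self._max_batch_size:
                    timeout = deadline - loop.time()
                    if timeout <= 0:
                        break
                    try:
                        batch.append(await asyncio.wait_for(self._queue.get(), timeout))
                    except asyncio.TimeoutError:
                        break
            texts = [t for t, _ in batch]
            # Run the blocking tokenizer off the event loop to keep it responsive.
            results = await loop.run_in_executor(self._executor, self._tokenize_fn, texts)
            for (_, fut), token_ids in zip(batch, results):
                fut.set_result(token_ids)
```

The single-worker executor serializes tokenizer calls, which is what makes batching pay off: concurrent awaiters share one underlying `tokenize_fn` invocation instead of contending for it.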

Benchmarking and Profiling

  • Model Qwen3-Embedding-0.6B
  • Input Token Length = 500
  • GPU Type = H100
  • Traffic distribution = Poisson

Baseline Results

(li-sglang) jobuser [ /shared/user/repos/li-sglang ]$ python -m sglang.launch_server --model-path /shared/public/sharing/suramach/Qwen3-Embedding-0.6B --port 30000 --host 0.0.0.0 --enable-metrics --disable-radix-cache --disable-cuda-graph  --is-embedding

| test_duration_secs | minute_interval | target_rps | item_count | server_type | distribution | unique_requests | total_requests | successful_requests | failed_requests | send_duration_secs | total_duration_secs | avg_response_time_ms | p50_response_time_ms | p90_response_time_ms | p99_response_time_ms |
|--------------------|-----------------|------------|------------|-------------|--------------|-----------------|----------------|---------------------|-----------------|--------------------|---------------------|----------------------|----------------------|----------------------|----------------------|
| 60                 | 1               | 300        | 1          | HTTP        | POISSON      | 100             | 15145          | 15145               | 0               | 71.19              | 71.44               | 29.73                | 28.18                | 35.59                | 52.94                |
| 60                 | 1               | 400        | 1          | HTTP        | POISSON      | 100             | 19019          | 19019               | 0               | 75.75              | 76.04               | 53.17                | 34.21                | 58.68                | 436.09               |
| 60                 | 1               | 500        | 1          | HTTP        | POISSON      | 100             | 20350          | 20350               | 0               | 81.05              | 185.33              | 2306.36              | 2345.45              | 4142.11              | 4583.08              |
| 60                 | 1               | 600        | 1          | HTTP        | POISSON      | 100             | 20216          | 20216               | 0               | 86.23              | 107.08              | 5939.38              | 5780.45              | 10926.77             | 11914.25             |

Batch Tokenizer Results

(li-sglang) jobuser [ /shared/user/repos/li-sglang ]$ python -m sglang.launch_server --model-path /shared/public/sharing/suramach/Qwen3-Embedding-0.6B --port 30000 --host 0.0.0.0 --enable-metrics --disable-radix-cache --disable-cuda-graph  --is-embedding --enable-dynamic-batch-tokenizer


| test_duration_secs | minute_interval | target_rps | item_count | server_type | distribution | unique_requests | total_requests | successful_requests | failed_requests | send_duration_secs | total_duration_secs | avg_response_time_ms | p50_response_time_ms | p90_response_time_ms | p99_response_time_ms |
|--------------------|-----------------|------------|------------|-------------|--------------|-----------------|----------------|---------------------|-----------------|--------------------|---------------------|----------------------|----------------------|----------------------|----------------------|
| 60                 | 1               | 300        | 1          | HTTP        | POISSON      | 100             | 15079          | 15079               | 0               | 71.68              | 71.88               | 31.40                | 28.92                | 44.66                | 68.06                |
| 60                 | 1               | 400        | 1          | HTTP        | POISSON      | 100             | 18965          | 18965               | 0               | 75.95              | 76.20               | 70.45                | 52.30                | 98.21                | 424.97               |
| 60                 | 1               | 500        | 1          | HTTP        | POISSON      | 100             | 21972          | 21972               | 0               | 81.29              | 81.75               | 125.62               | 89.51                | 287.98               | 464.68               |
| 60                 | 1               | 600        | 1          | HTTP        | POISSON      | 100             | 24053          | 24053               | 0               | 88.15              | 118.13              | 560.86               | 560.43               | 751.65               | 908.02               |

Checklist

@sundar24295s sundar24295s marked this pull request as ready for review August 20, 2025 07:44
@hebiao064
Collaborator

Triggered CI, will review it today or tomorrow.

@sundar24295s
Collaborator Author

@zhyncs / @hnyls2002 The unit test failures seem unrelated to this PR.

@hnyls2002
Collaborator

Could you please split your benchmark scripts out first as a separate PR? Also, do not use `api` or `score`; use more specific names.

@sundar24295s
Collaborator Author

@hnyls2002 Moved the benchmark scripts to another PR and rebased.

@hnyls2002 hnyls2002 merged commit 94d0f65 into sgl-project:main Sep 13, 2025
70 of 77 checks passed
```python
    - "cross_encoder_pairs": Cross-encoder pairs like [["query", "document"]]
    """
    if isinstance(texts, str):
        return "single_string"
```
Contributor
please use ENUM instead of string
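The reviewer's suggestion could look like the following sketch, replacing the bare string returns with an `Enum`; the member names and the pair-detection heuristic here are assumptions, not the PR's actual identifiers:

```python
from enum import Enum

class TokenizeInputFormat(Enum):
    # Hypothetical member names mirroring the string constants in the snippet above.
    SINGLE_STRING = "single_string"
    LIST_OF_STRINGS = "list_of_strings"
    CROSS_ENCODER_PAIRS = "cross_encoder_pairs"

def classify_input(texts) -> TokenizeInputFormat:
    """Classify the tokenizer input shape, returning an enum member
    instead of a bare string so typos fail loudly at comparison sites."""
    if isinstance(texts, str):
        return TokenizeInputFormat.SINGLE_STRING
    if texts and isinstance(texts[0], (list, tuple)):
        # Cross-encoder pairs like [["query", "document"]]
        return TokenizeInputFormat.CROSS_ENCODER_PAIRS
    return TokenizeInputFormat.LIST_OF_STRINGS
```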
