
[Performance] Dynamic Batch Tokenizer#9382

Merged
hnyls2002 merged 7 commits into sgl-project:main from sundar24295s:suramach/batchtokenizer
Sep 13, 2025

Conversation

@sundar24295s
Collaborator

@sundar24295s sundar24295s commented Aug 20, 2025

Motivation

  • This PR introduces an AsyncDynamicBatchTokenizer that enables batching of tokenization requests to improve throughput and reduce latency for SGLang's tokenizer manager.

Performance Impact:

  • For Qwen3-Embedding-0.6B with 500 input tokens and 1 prompt per request at RPS 500, P99 latency improved from 4583 ms to 464 ms (~10× faster) with this PR.
  • In production systems, P99 latency is a critical metric as it represents the worst-case experience for users. This improvement allows systems to extract maximum throughput while maintaining acceptable P99 latency thresholds for SLA compliance.

Context

This PR builds upon the tokenization batching infrastructure introduced in PR #5141, which added enable_tokenizer_batch_encode for batching multiple texts within a single request.

Tokenization Batching Options

| Feature | Use Case | When to Use |
|---------|----------|-------------|
| `enable_tokenizer_batch_encode` (PR #5141) | Client sends batched inputs in a single request | When your client batches multiple texts into one API call |
| `enable_dynamic_batch_tokenizer` (this PR) | Client sends single prompts across multiple requests | When your client application sends individual requests |

Example scenarios:

  • Use enable_tokenizer_batch_encode: Client sends {"input": ["text1", "text2", "text3"]} in one request
  • Use enable_dynamic_batch_tokenizer: Client sends multiple concurrent requests: {"input": "text1"}, {"input": "text2"}, {"input": "text3"}
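The two request shapes above can be sketched as payload builders; note that `batch_encode_payload` and `single_payloads` are hypothetical helper names used only for illustration, not SGLang APIs:

```python
import json

def batch_encode_payload(texts):
    # One request carrying several texts: pairs with --enable-tokenizer-batch-encode.
    return json.dumps({"input": texts})

def single_payloads(texts):
    # Many concurrent single-text requests: pairs with --enable-dynamic-batch-tokenizer,
    # which merges them server-side into one tokenizer call.
    return [json.dumps({"input": t}) for t in texts]
```

The point of the distinction: the first mode relies on the client to form the batch, while the second lets the server recover batching efficiency from clients that cannot.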

Modifications

  • 🚀 Dynamic Batching

    • Automatically batches multiple concurrent tokenization requests for efficiency
    • Processes single requests immediately when no other requests are pending
    • Collects additional requests up to max_batch_size or until batch_wait_timeout_s elapses when the queue has pending items
  • ⚙️ Server Args

    • max_batch_size (default: 32): Maximum number of requests to batch together
    • batch_wait_timeout_s (default: 0.002s): Maximum time to wait for additional requests
    • enable_dynamic_batch_tokenizer: Feature flag to enable/disable the functionality
    • Usage
      --enable-dynamic-batch-tokenizer \
      --dynamic-batch-tokenizer-batch-size 32 \
      --dynamic-batch-tokenizer-batch-timeout 0.002
      
  • 🔄 Async Processing

    • Non-blocking tokenization using asyncio and ThreadPoolExecutor
    • Maintains event loop responsiveness while handling blocking tokenizer calls
    • Scales efficiently with concurrent requests
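The batching loop described above can be sketched as follows. This is a simplified, hypothetical reimplementation for illustration (the real SGLang `AsyncDynamicBatchTokenizer` differs in details): requests enter an `asyncio.Queue`, a lone request is tokenized immediately, and when more requests are pending the loop collects up to `max_batch_size` of them within `batch_wait_timeout_s` before dispatching one blocking tokenizer call to a `ThreadPoolExecutor`.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

class AsyncDynamicBatchTokenizer:
    """Illustrative sketch of the dynamic batching loop, not the actual SGLang class."""

    def __init__(self, tokenize_fn, max_batch_size=32, batch_wait_timeout_s=0.002):
        self._tokenize_fn = tokenize_fn  # blocking callable: list[str] -> list[list[int]]
        self._max_batch_size = max_batch_size
        self._batch_wait_timeout_s = batch_wait_timeout_s
        self._queue: asyncio.Queue = asyncio.Queue()
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._worker = None

    async def encode(self, text: str) -> list:
        # Lazily start the background batching loop on first use.
        if self._worker is None:
            self._worker = asyncio.create_task(self._batch_loop())
        fut = asyncio.get_running_loop().create_future()
        await self._queue.put((text, fut))
        return await fut

    async def _batch_loop(self):
        loop = asyncio.get_running_loop()
        while True:
            # Block until at least one request arrives.
            batch = [await self._queue.get()]
            # If nothing else is pending, tokenize the single request immediately.
            # Otherwise, collect more requests up to max_batch_size, waiting at
            # most batch_wait_timeout_s for stragglers.
            if not self._queue.empty():
                deadline = loop.time() + self._batch_wait_timeout_s
                while len(batch) < self._max_batch_size:
                    timeout = deadline - loop.time()
                    if timeout <= 0:
                        break
                    try:
                        batch.append(await asyncio.wait_for(self._queue.get(), timeout))
                    except asyncio.TimeoutError:
                        break
            texts = [t for t, _ in batch]
            # Run the blocking tokenizer off the event loop to keep it responsive.
            results = await loop.run_in_executor(self._executor, self._tokenize_fn, texts)
            for (_, fut), token_ids in zip(batch, results):
                fut.set_result(token_ids)
```

The single-worker executor serializes tokenizer calls, which is what makes batching pay off: concurrent awaiters share one underlying `tokenize_fn` invocation instead of contending for it.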

Benchmarking and Profiling

  • Model Qwen3-Embedding-0.6B
  • Input Token Length = 500
  • GPU Type = H100
  • Traffic distribution = Poisson

Baseline Results

(li-sglang) jobuser [ /shared/user/repos/li-sglang ]$ python -m sglang.launch_server --model-path /shared/public/sharing/suramach/Qwen3-Embedding-0.6B --port 30000 --host 0.0.0.0 --enable-metrics --disable-radix-cache --disable-cuda-graph  --is-embedding

| test_duration_secs | minute_interval | target_rps | item_count | server_type | distribution | unique_requests | total_requests | successful_requests | failed_requests | send_duration_secs | total_duration_secs | avg_response_time_ms | p50_response_time_ms | p90_response_time_ms | p99_response_time_ms |
|--------------------|-----------------|------------|------------|-------------|--------------|-----------------|----------------|---------------------|-----------------|--------------------|---------------------|----------------------|----------------------|----------------------|----------------------|
| 60                 | 1               | 300        | 1          | HTTP        | POISSON      | 100             | 15145          | 15145               | 0               | 71.19              | 71.44               | 29.73                | 28.18                | 35.59                | 52.94                |
| 60                 | 1               | 400        | 1          | HTTP        | POISSON      | 100             | 19019          | 19019               | 0               | 75.75              | 76.04               | 53.17                | 34.21                | 58.68                | 436.09               |
| 60                 | 1               | 500        | 1          | HTTP        | POISSON      | 100             | 20350          | 20350               | 0               | 81.05              | 185.33              | 2306.36              | 2345.45              | 4142.11              | 4583.08              |
| 60                 | 1               | 600        | 1          | HTTP        | POISSON      | 100             | 20216          | 20216               | 0               | 86.23              | 107.08              | 5939.38              | 5780.45              | 10926.77             | 11914.25             |

Batch Tokenizer Results

(li-sglang) jobuser [ /shared/user/repos/li-sglang ]$ python -m sglang.launch_server --model-path /shared/public/sharing/suramach/Qwen3-Embedding-0.6B --port 30000 --host 0.0.0.0 --enable-metrics --disable-radix-cache --disable-cuda-graph  --is-embedding --enable-dynamic-batch-tokenizer


| test_duration_secs | minute_interval | target_rps | item_count | server_type | distribution | unique_requests | total_requests | successful_requests | failed_requests | send_duration_secs | total_duration_secs | avg_response_time_ms | p50_response_time_ms | p90_response_time_ms | p99_response_time_ms |
|--------------------|-----------------|------------|------------|-------------|--------------|-----------------|----------------|---------------------|-----------------|--------------------|---------------------|----------------------|----------------------|----------------------|----------------------|
| 60                 | 1               | 300        | 1          | HTTP        | POISSON      | 100             | 15079          | 15079               | 0               | 71.68              | 71.88               | 31.40                | 28.92                | 44.66                | 68.06                |
| 60                 | 1               | 400        | 1          | HTTP        | POISSON      | 100             | 18965          | 18965               | 0               | 75.95              | 76.20               | 70.45                | 52.30                | 98.21                | 424.97               |
| 60                 | 1               | 500        | 1          | HTTP        | POISSON      | 100             | 21972          | 21972               | 0               | 81.29              | 81.75               | 125.62               | 89.51                | 287.98               | 464.68               |
| 60                 | 1               | 600        | 1          | HTTP        | POISSON      | 100             | 24053          | 24053               | 0               | 88.15              | 118.13              | 560.86               | 560.43               | 751.65               | 908.02               |

Checklist

@sundar24295s sundar24295s marked this pull request as ready for review August 20, 2025 07:44
@hebiao064
Collaborator

Triggered CI, will review it today or tomorrow.

@sundar24295s
Collaborator Author

@zhyncs / @hnyls2002 The unit test failures seem unrelated to this PR.

@hnyls2002
Collaborator

Could you please split your benchmark scripts out first as a separate PR? Also, do not use `api` or `score`; use more specific names.

@sundar24295s
Collaborator Author

@hnyls2002 Moved the benchmark scripts to another PR and rebased.

@hnyls2002 hnyls2002 merged commit 94d0f65 into sgl-project:main Sep 13, 2025
70 of 77 checks passed
```python
    - "cross_encoder_pairs": Cross-encoder pairs like [["query", "document"]]
    """
    if isinstance(texts, str):
        return "single_string"
```
Contributor
please use ENUM instead of string
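The reviewer's suggestion could look like the following sketch, replacing the bare string returns with an `Enum`; the member names and the pair-detection heuristic here are assumptions, not the PR's actual identifiers:

```python
from enum import Enum

class TokenizeInputFormat(Enum):
    # Hypothetical member names mirroring the string constants in the snippet above.
    SINGLE_STRING = "single_string"
    LIST_OF_STRINGS = "list_of_strings"
    CROSS_ENCODER_PAIRS = "cross_encoder_pairs"

def classify_input(texts) -> TokenizeInputFormat:
    """Classify the tokenizer input shape, returning an enum member
    instead of a bare string so typos fail loudly at comparison sites."""
    if isinstance(texts, str):
        return TokenizeInputFormat.SINGLE_STRING
    if texts and isinstance(texts[0], (list, tuple)):
        # Cross-encoder pairs like [["query", "document"]]
        return TokenizeInputFormat.CROSS_ENCODER_PAIRS
    return TokenizeInputFormat.LIST_OF_STRINGS
```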
