
Support Multi Detokenizer based on Multi Tokenizer #9970

Open
LLLL114 wants to merge 50 commits into sgl-project:main from LLLL114:multi_detokenizer_manager

Conversation

Contributor

@LLLL114 LLLL114 commented Sep 3, 2025

Motivation

Enable multi detokenizer based on Multi Tokenizer.

Modifications

  1. Add --detokenizer-worker-num to set the number of detokenizer workers.
  2. To minimize code changes and save sockets, reuse most of the structure of MultiTokenizerMixin; the tokenizer worker count must be divisible by the detokenizer worker count.
  3. Add MultiDetokenizerRouter to route requests from the scheduler to the multiple detokenizers.
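The routing in (3) stays deterministic by hashing the originating worker's IPC name rather than round-robining. A minimal sketch (the function name is illustrative, not the exact code in this PR; `zlib.crc32` is the hash mentioned in the backport commit message):

```python
import zlib

def route_to_detokenizer(http_worker_ipc: str, num_detokenizers: int) -> int:
    """Pick a detokenizer index for a scheduler output batch.

    Hashing the HTTP/tokenizer worker's IPC name pins all outputs of one
    worker to one detokenizer, so per-request stream order is preserved
    and routing is stable across runs.
    """
    return zlib.crc32(http_worker_ipc.encode()) % num_detokenizers
```

With this scheme, each detokenizer serves a fixed subset of tokenizer workers, which is why the tokenizer count must divide evenly by the detokenizer count.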

Benchmarking and Profiling

Set detokenizer-worker-num to 1, 4, and 8.
Command:

SGLANG_USE_MODELSCOPE=true \
python -m sglang.launch_server \
    --model-path /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-0.5B --disaggregation-mode null \
    --port $PORT --base-gpu-id $BASEGPUID \
    --trust-remote-code --tp-size 1 --dp-size 8 --tokenizer-worker-num 8 --detokenizer-worker-num 1 \
    --disable-radix-cache
detokenizer-worker-num = 1
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 1500      
Successful requests:                     10000     
Benchmark duration (s):                  133.01    
Total input tokens:                      20707676  
Total generated tokens:                  5122821   
Total generated tokens (retokenized):    5120982   
Request throughput (req/s):              75.18     
Input token throughput (tok/s):          155689.50 
Output token throughput (tok/s):         38515.64  
Total token throughput (tok/s):          194205.15 
Concurrency:                             1412.65   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   18789.19  
Median E2E Latency (ms):                 17832.40  
---------------Time to First Token----------------
Mean TTFT (ms):                          4460.66   
Median TTFT (ms):                        4030.85   
P99 TTFT (ms):                           13525.59  
---------------Inter-Token Latency----------------
Mean ITL (ms):                           28.02     
Median ITL (ms):                         0.01      
P95 ITL (ms):                            102.88    
P99 ITL (ms):                            181.98    
Max ITL (ms):                            7381.73   
==================================================
detokenizer-worker-num = 4
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 1500      
Successful requests:                     10000     
Benchmark duration (s):                  129.37    
Total input tokens:                      20707676  
Total generated tokens:                  5122821   
Total generated tokens (retokenized):    5121155   
Request throughput (req/s):              77.30     
Input token throughput (tok/s):          160069.65 
Output token throughput (tok/s):         39599.24  
Total token throughput (tok/s):          199668.89 
Concurrency:                             1326.15   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   17155.94  
Median E2E Latency (ms):                 15163.17  
---------------Time to First Token----------------
Mean TTFT (ms):                          2551.11   
Median TTFT (ms):                        1784.67   
P99 TTFT (ms):                           17076.68  
---------------Inter-Token Latency----------------
Mean ITL (ms):                           28.57     
Median ITL (ms):                         0.01      
P95 ITL (ms):                            146.90    
P99 ITL (ms):                            481.66    
Max ITL (ms):                            6877.82   
==================================================
detokenizer-worker-num = 8
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 1500      
Successful requests:                     10000     
Benchmark duration (s):                  128.83    
Total input tokens:                      20707676  
Total generated tokens:                  5122821   
Total generated tokens (retokenized):    5121141   
Request throughput (req/s):              77.62     
Input token throughput (tok/s):          160741.49 
Output token throughput (tok/s):         39765.44  
Total token throughput (tok/s):          200506.93 
Concurrency:                             1337.05   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   17224.71  
Median E2E Latency (ms):                 15310.77  
---------------Time to First Token----------------
Mean TTFT (ms):                          2456.57   
Median TTFT (ms):                        1451.91   
P99 TTFT (ms):                           24556.52  
---------------Inter-Token Latency----------------
Mean ITL (ms):                           28.88     
Median ITL (ms):                         0.01      
P95 ITL (ms):                            167.74    
P99 ITL (ms):                            456.69    
Max ITL (ms):                            6719.15   
==================================================

Checklist

Summary by CodeRabbit

  • New Features
    • Supports multiple detokenizer workers coordinated via a router for parallel detokenization.
    • Adds CLI flags to configure tokenizer and detokenizer worker counts, with validation for compatible ratios.
  • Performance
    • Increased throughput and scalability for tokenization/detokenization in multi-worker configurations.
  • Tests
    • Test suite updated to run with multiple detokenizer workers to reflect new parallel setup.

Signed-off-by: huanglong <huanglong@linux.alibaba.com>
Signed-off-by: huanglong <huanglong@linux.alibaba.com>
@whybeyoung
Collaborator

Thanks for your contribution~ LGTM

Comment thread python/sglang/srt/entrypoints/engine.py
LLLL114 and others added 4 commits September 5, 2025 10:31
Signed-off-by: huanglong <huanglong@linux.alibaba.com>
Signed-off-by: huanglong <huanglong@linux.alibaba.com>
@miter6
Contributor

miter6 commented Sep 7, 2025

@LLLL114 Hi,
Does Multi Tokenizer and Detokenizer support PD?

@LLLL114
Contributor Author

LLLL114 commented Sep 7, 2025

@LLLL114 Hi,

Does Multi Tokenizer and Detokenizer support PD?

Sure, you can try it with --tokenizer-worker-num and --detokenizer-worker-num.

Comment thread python/sglang/srt/managers/multi_tokenizer_mixin.py Outdated
Comment thread python/sglang/srt/server_args.py
LLLL114 and others added 3 commits September 9, 2025 16:24
Signed-off-by: huanglong <huanglong@linux.alibaba.com>
Signed-off-by: huanglong <huanglong@linux.alibaba.com>
@whybeyoung
Collaborator

@hnyls2002 can you review it?

Comment thread python/sglang/srt/managers/multi_tokenizer_mixin.py Outdated
Comment thread python/sglang/srt/managers/multi_tokenizer_mixin.py Outdated
whybeyoung and others added 3 commits September 28, 2025 18:30
Signed-off-by: ybyang <ybyang7@iflytek.com>
Signed-off-by: ybyang <ybyang7@iflytek.com>
@whybeyoung whybeyoung force-pushed the multi_detokenizer_manager branch from 6f774ce to 6023602 Compare September 28, 2025 12:40
raise ValueError(f"Unknown req type: {type(req)}")


class MultiDetokenizerRouter:
Contributor

The router might become a bottleneck.

Collaborator

For us, we haven't encountered the bottleneck yet. What's your case? (Maybe our hardware can't reach such high concurrency.)

@merrymercy merrymercy requested a review from zhyncs as a code owner November 29, 2025 07:06
whybeyoung added a commit to whybeyoung/sglang that referenced this pull request Apr 28, 2026
Backport sgl-project#9970 with adaptations for the V4 PD branch.

Add a new --detokenizer-worker-num CLI flag that scales the detokenizer
out of the single-process bottleneck. When > 1, N DetokenizerManager
processes each listen on a private IPC socket and a new
MultiDetokenizerRouter process owns the public detokenizer IPC and
fans out scheduler outputs by hashing http_worker_ipc (zlib.crc32, so
routing is deterministic across runs). Stream-order is preserved
because all outputs of the same HTTP/tokenizer worker pin to the same
detokenizer.

* server_args: new field + CLI arg + divisibility check
  (tokenizer_worker_num must be a multiple of detokenizer_worker_num);
  skip_tokenizer_init forces it back to 1.
* multi_tokenizer_mixin:
  - SocketMapping.send_output gains an optional is_tokenizer flag (only
    affects log labelling).
  - multi_http_worker_event_loop now also handles BaseReq, since the
    detok router may forward single requests downstream.
  - new MultiDetokenizerRouter class + run_multi_detokenizer_router_process.
* detokenizer_manager: skip the unused send_to_tokenizer socket in
  multi-tokenizer mode (results go through SocketMapping instead).
* engine: factor detok launch into _launch_detokenizer_subprocesses,
  spawning N detok workers + 1 router when detokenizer_worker_num > 1.
* test_multi_tokenizer: exercise --detokenizer-worker-num 4.
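The server_args divisibility check described above can be sketched as a standalone function (hypothetical name and signature; the real check lives inside sglang's ServerArgs validation):

```python
def resolve_detokenizer_worker_num(tokenizer_worker_num: int,
                                   detokenizer_worker_num: int,
                                   skip_tokenizer_init: bool = False) -> int:
    """Validate and normalize the detokenizer worker count.

    Mirrors the constraints above: tokenizer_worker_num must be a multiple
    of detokenizer_worker_num, and skip_tokenizer_init forces the count
    back to 1.
    """
    if skip_tokenizer_init:
        return 1
    if detokenizer_worker_num < 1:
        raise ValueError("detokenizer-worker-num must be >= 1")
    if tokenizer_worker_num % detokenizer_worker_num != 0:
        raise ValueError(
            f"tokenizer-worker-num ({tokenizer_worker_num}) must be a "
            f"multiple of detokenizer-worker-num ({detokenizer_worker_num})"
        )
    return detokenizer_worker_num
```

Under this rule, the benchmark configurations above (tokenizer-worker-num 8 with detokenizer-worker-num 1, 4, or 8) are all valid, while a count of 3 would be rejected.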
whybeyoung added a commit to whybeyoung/sglang that referenced this pull request Apr 29, 2026