Describe the bug

`_wait_one_response` in `tokenizer_manager.py` consumes only the last element of `state.out_list` on each wakeup: sglang/python/sglang/srt/managers/tokenizer_manager.py, lines 1149 to 1151 in 203cd8e.
When `handle_loop` processes multiple scheduler batches before `_wait_one_response` is scheduled (i.e. several ZMQ messages are buffered), intermediate streaming outputs accumulate in `out_list`. Only the last one is kept and the rest are silently discarded.
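A minimal, paraphrased reproduction of that consume-last pattern (names mirror the description above; this is not the actual SGLang source):

```python
import asyncio

async def producer(state, batches):
    # handle_loop stand-in: several ZMQ messages can be buffered and
    # appended back-to-back before the consumer task gets scheduled.
    for batch in batches:
        for chunk in batch:
            state["out_list"].append(chunk)
        state["event"].set()
        await asyncio.sleep(0)  # yield to the event loop

async def consumer(state, received):
    # _wait_one_response stand-in: reads ONLY the last buffered entry.
    while True:
        await state["event"].wait()
        state["event"].clear()
        out = state["out_list"][-1]   # intermediate entries ...
        state["out_list"] = []        # ... are silently discarded
        received.append(out)
        if out.get("finished"):
            return

async def main():
    state = {"out_list": [], "event": asyncio.Event()}
    # Two chunks arrive in one burst, then the final chunk arrives alone.
    batches = [
        [{"output_ids": [1, 2]}, {"output_ids": [3]}],
        [{"output_ids": [4], "finished": True}],
    ]
    received = []
    await asyncio.gather(producer(state, batches), consumer(state, received))
    print(received)  # first wakeup delivers only {'output_ids': [3]}: ids 1-2 are lost

asyncio.run(main())
```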
The race in `_wait_one_response` exists unconditionally, but its severity depends on which output the consumer reads:
- **Cumulative text consumers:** each output dict carries cumulative decoded text (`output_str[sent_offset:]`). If intermediate chunks are dropped, `out_list[-1]` still contains all text up to that point. The client sees a larger-than-usual chunk but no data loss.
- **Token ID consumers:** `output_ids` in each output dict are disjoint deltas - they contain only the new tokens since the last chunk. Dropping a chunk permanently loses those token IDs with no way to recover them. This is silent, unrecoverable data loss, regardless of whether `--skip-tokenizer-init` is set. (See the contrast sketched below.)
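A toy illustration of why the two consumer types differ (hypothetical chunk dicts and token IDs; field names taken from the description above):

```python
# Dropping intermediate chunks is harmless for cumulative text but
# destroys data for disjoint output_ids deltas.
chunks = [
    {"output_str": "Hello",        "output_ids": [15496]},
    {"output_str": "Hello, wor",   "output_ids": [11, 467]},
    {"output_str": "Hello, world", "output_ids": [335]},
]

# Cumulative text: the last chunk alone reconstructs everything.
assert chunks[-1]["output_str"] == "Hello, world"

# Disjoint token deltas: the full sequence needs every chunk ...
all_ids = [tid for c in chunks for tid in c["output_ids"]]
assert all_ids == [15496, 11, 467, 335]

# ... so reading only the last chunk loses [15496, 11, 467] forever.
assert chunks[-1]["output_ids"] == [335]
```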
In other words, `_wait_one_response` was never designed to handle disjoint streaming outputs safely. The cumulative text path happened to mask the race, but any consumer reading `output_ids` from the stream has no safety.
Additionally, `--skip-tokenizer-init` removes the detokenizer process, which normally acts as a natural throttle (`tokenizer.batch_decode()` is CPU-bound) and makes ZMQ buffer accumulation rare. Without it, messages arrive at wire speed and accumulation under load becomes routine, making the race significantly easier to trigger.
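Independent of configuration, the direct mitigation is for the consumer to drain every element of `out_list` on each wakeup rather than reading only the last one. A minimal sketch of that loop (paraphrased names matching the reproduction above, not a patch against the actual source):

```python
async def wait_responses(state):
    # Fix sketch: forward EVERY buffered chunk, not just out_list[-1],
    # so disjoint output_ids deltas survive bursty wakeups.
    while True:
        await state["event"].wait()
        state["event"].clear()
        outs, state["out_list"] = state["out_list"], []  # take the whole backlog
        for out in outs:
            yield out                                    # deliver in arrival order
        if outs and outs[-1].get("finished"):
            return
```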
Symptoms
Example of corrupted streaming output from DeepSeek-V3.2 (tool-calling response):
```
The file prerequests.sh exists in the repository root. Now let's **creatory (create a directory for)** modules.

**<|DSML|function_c>**
<|DSML|invoke name="add_entry">
...
</|DSML|invoke>
</|DSML|function_calls>
```
- `completion_tokens` reported by SGLang is higher than the number of tokens the client actually received
- P99 inter-token latency spikes on affected chunks
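A rough way to check for the first symptom (a hedged sketch: it assumes SGLang's native `/generate` streaming format of `data: <json>` lines, and that chunks carry `output_ids` deltas and a `meta_info.completion_tokens` field as described in this report - adjust to your deployment):

```python
import json
import requests

resp = requests.post(
    "http://localhost:30000/generate",  # assumed server address
    json={
        "text": "Write a haiku about race conditions.",
        "sampling_params": {"max_new_tokens": 128},
        "stream": True,
    },
    stream=True,
)

received_ids, reported = [], None
for line in resp.iter_lines():
    if not line or not line.startswith(b"data:"):
        continue
    payload = line[len(b"data:"):].strip()
    if payload == b"[DONE]":
        break
    chunk = json.loads(payload)
    received_ids.extend(chunk.get("output_ids", []))  # disjoint deltas per chunk
    reported = chunk.get("meta_info", {}).get("completion_tokens", reported)

# On an affected request, reported > len(received_ids): tokens were counted
# server-side but never delivered to the client.
print(f"received={len(received_ids)} reported={reported}")
```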
Why Dynamo's default configuration amplifies the problem
Dynamo forces `--skip-tokenizer-init` by default, which has a compounding effect on this bug.
Beyond amplifying this race, `--skip-tokenizer-init` has additional costs in Dynamo:
- **No structured output enforcement.** SGLang's grammar-guided decoding and JSON mode require the tokenizer to be initialized. With `--skip-tokenizer-init`, the schema is never forwarded to SGLang, so constrained decoding is silently disabled. (cc @ishandhanani - there is a PR from my co-worker in Dynamo: feat: sglang guided decoding support, ai-dynamo/dynamo#6620)
- **Single tokenizer worker bottleneck.** `--skip-tokenizer-init` also prevents using `--tokenizer-worker-num > 1`. In our production data, the single tokenizer worker relaying requests over ZMQ becomes a throughput bottleneck under high concurrency. Re-enabling tokenizer init would unlock multi-worker scaling.
A possible middle ground: a new SGLang flag (e.g. `--skip-detokenization`) that initializes the tokenizer (enabling structured output and multi-worker scaling) but skips the detokenization pass for output chunks, leaving that to the external consumer.
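Under that proposal the consumer detokenizes the `output_ids` deltas itself. A minimal client-side sketch (assuming a HuggingFace tokenizer; the model name and token IDs are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3")

def decode_stream(delta_chunks):
    """Incrementally decode disjoint output_ids deltas into text chunks."""
    all_ids, sent = [], ""
    for delta in delta_chunks:          # each delta: only the NEW token ids
        all_ids.extend(delta)
        # Re-decoding the full prefix is O(n^2) but correct for a sketch;
        # it handles tokens that only decode cleanly in context.
        text = tokenizer.decode(all_ids, skip_special_tokens=True)
        yield text[len(sent):]          # emit only the newly decoded suffix
        sent = text

# Illustrative use with made-up deltas:
for piece in decode_stream([[9707], [11, 1879], [0]]):
    print(piece, end="")
```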
Reproduction
This race is sensitive to system-level timing. In our experience, it reproduces reliably in a k8s production environment (+Dynamo) under sustained load, but we were unable to trigger it in a local setup with the same SGLang + Dynamo frontend configuration in a single container - likely because differences in network stack and container overhead affect the exact back-pressure dynamics that cause ZMQ message accumulation.
Observed in production (Dynamo + SGLang, detokenizer enabled for structured-output compatibility):
- 466 affected requests in 20 minutes under max_concurrency=5 (DeepSeek-V3.2)
- Backlogs of up to 14 queued chunks per wakeup
- A few percent (~1-3%) of requests showed a token count mismatch
cc @ishandhanani
Environment
Dynamo + SGLang