Describe the bug

`_wait_one_response` in `tokenizer_manager.py` consumes only the last element of `state.out_list` on each wakeup: sglang/python/sglang/srt/managers/tokenizer_manager.py, lines 1149 to 1151 in 203cd8e.
When `handle_loop` processes multiple scheduler batches before `_wait_one_response` is scheduled (i.e. several ZMQ messages are buffered), intermediate streaming outputs accumulate in `out_list`. Only the last one is kept and the rest are silently discarded.
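A minimal, paraphrased reproduction of that consume-last pattern (names mirror the description above; this is not the actual SGLang source):

```python
import asyncio

async def producer(state, batches):
    # handle_loop stand-in: several ZMQ messages can be buffered and
    # appended back-to-back before the consumer task gets scheduled.
    for batch in batches:
        for chunk in batch:
            state["out_list"].append(chunk)
        state["event"].set()
        await asyncio.sleep(0)  # yield to the event loop

async def consumer(state, received):
    # _wait_one_response stand-in: reads ONLY the last buffered entry.
    while True:
        await state["event"].wait()
        state["event"].clear()
        out = state["out_list"][-1]   # intermediate entries ...
        state["out_list"] = []        # ... are silently discarded
        received.append(out)
        if out.get("finished"):
            return

async def main():
    state = {"out_list": [], "event": asyncio.Event()}
    # Two chunks arrive in one burst, then the final chunk arrives alone.
    batches = [
        [{"output_ids": [1, 2]}, {"output_ids": [3]}],
        [{"output_ids": [4], "finished": True}],
    ]
    received = []
    await asyncio.gather(producer(state, batches), consumer(state, received))
    print(received)  # first wakeup delivers only {'output_ids': [3]}: ids 1-2 are lost

asyncio.run(main())
```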
The race in `_wait_one_response` exists unconditionally, but its severity depends on which output the consumer reads:
- **Cumulative text consumers:** each output dict carries cumulative decoded text (`output_str[sent_offset:]`). If intermediate chunks are dropped, `out_list[-1]` still contains all text up to that point. The client sees a larger-than-usual chunk but no data loss.
- **Token ID consumers:** `output_ids` in each output dict are disjoint deltas - they contain only the new tokens since the last chunk. Dropping a chunk permanently loses those token IDs with no way to recover them. This is silent, unrecoverable data loss, regardless of whether `--skip-tokenizer-init` is set. (See the contrast sketched below.)
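A toy illustration of why the two consumer types differ (hypothetical chunk dicts and token IDs; field names taken from the description above):

```python
# Dropping intermediate chunks is harmless for cumulative text but
# destroys data for disjoint output_ids deltas.
chunks = [
    {"output_str": "Hello",        "output_ids": [15496]},
    {"output_str": "Hello, wor",   "output_ids": [11, 467]},
    {"output_str": "Hello, world", "output_ids": [335]},
]

# Cumulative text: the last chunk alone reconstructs everything.
assert chunks[-1]["output_str"] == "Hello, world"

# Disjoint token deltas: the full sequence needs every chunk ...
all_ids = [tid for c in chunks for tid in c["output_ids"]]
assert all_ids == [15496, 11, 467, 335]

# ... so reading only the last chunk loses [15496, 11, 467] forever.
assert chunks[-1]["output_ids"] == [335]
```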
In other words, `_wait_one_response` was never designed to handle disjoint streaming outputs safely. The cumulative text path happened to mask the race, but any consumer reading `output_ids` from the stream has no safety.
Additionally, `--skip-tokenizer-init` removes the detokenizer process, which normally acts as a natural throttle (`tokenizer.batch_decode()` is CPU-bound) and makes ZMQ buffer accumulation rare. Without it, messages arrive at wire speed and accumulation under load becomes routine, making the race significantly easier to trigger.
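Independent of configuration, the direct mitigation is for the consumer to drain every element of `out_list` on each wakeup rather than reading only the last one. A minimal sketch of that loop (paraphrased names matching the reproduction above, not a patch against the actual source):

```python
async def wait_responses(state):
    # Fix sketch: forward EVERY buffered chunk, not just out_list[-1],
    # so disjoint output_ids deltas survive bursty wakeups.
    while True:
        await state["event"].wait()
        state["event"].clear()
        outs, state["out_list"] = state["out_list"], []  # take the whole backlog
        for out in outs:
            yield out                                    # deliver in arrival order
        if outs and outs[-1].get("finished"):
            return
```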
Symptoms
Example of corrupted streaming output from DeepSeek-V3.2 (tool-calling response):
```
The file prerequests.sh exists in the repository root. Now let's **creatory (create a directory for)** modules.

**<|DSML|function_c>**
<|DSML|invoke name="add_entry">
...
</|DSML|invoke>
</|DSML|function_calls>
```
- `completion_tokens` reported by SGLang is higher than the number of tokens the client actually received
- P99 inter-token latency spikes on affected chunks
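A rough way to check for the first symptom (a hedged sketch: it assumes SGLang's native `/generate` streaming format of `data: <json>` lines, and that chunks carry `output_ids` deltas and a `meta_info.completion_tokens` field as described in this report - adjust to your deployment):

```python
import json
import requests

resp = requests.post(
    "http://localhost:30000/generate",  # assumed server address
    json={
        "text": "Write a haiku about race conditions.",
        "sampling_params": {"max_new_tokens": 128},
        "stream": True,
    },
    stream=True,
)

received_ids, reported = [], None
for line in resp.iter_lines():
    if not line or not line.startswith(b"data:"):
        continue
    payload = line[len(b"data:"):].strip()
    if payload == b"[DONE]":
        break
    chunk = json.loads(payload)
    received_ids.extend(chunk.get("output_ids", []))  # disjoint deltas per chunk
    reported = chunk.get("meta_info", {}).get("completion_tokens", reported)

# On an affected request, reported > len(received_ids): tokens were counted
# server-side but never delivered to the client.
print(f"received={len(received_ids)} reported={reported}")
```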
Why Dynamo's default configuration amplifies the problem
Dynamo forces `--skip-tokenizer-init` by default, which has a compounding effect on this bug.
Beyond amplifying this race, `--skip-tokenizer-init` has additional costs in Dynamo:
- **No structured output enforcement.** SGLang's grammar-guided decoding and JSON mode require the tokenizer to be initialized. With `--skip-tokenizer-init`, the schema is never forwarded to SGLang, so constrained decoding is silently disabled. (cc @ishandhanani - there is a PR from my co-worker in Dynamo: feat: sglang guided decoding support, ai-dynamo/dynamo#6620)
- **Single tokenizer worker bottleneck.** `--skip-tokenizer-init` also prevents using `--tokenizer-worker-num > 1`. In our production data, the single tokenizer worker relaying requests over ZMQ becomes a throughput bottleneck under high concurrency. Re-enabling tokenizer init would unlock multi-worker scaling.
A possible middle ground: a new SGLang flag (e.g. `--skip-detokenization`) that initializes the tokenizer (enabling structured output and multi-worker scaling) but skips the detokenization pass for output chunks, leaving that to the external consumer.
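Under that proposal the consumer detokenizes the `output_ids` deltas itself. A minimal client-side sketch (assuming a HuggingFace tokenizer; the model name and token IDs are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3")

def decode_stream(delta_chunks):
    """Incrementally decode disjoint output_ids deltas into text chunks."""
    all_ids, sent = [], ""
    for delta in delta_chunks:          # each delta: only the NEW token ids
        all_ids.extend(delta)
        # Re-decoding the full prefix is O(n^2) but correct for a sketch;
        # it handles tokens that only decode cleanly in context.
        text = tokenizer.decode(all_ids, skip_special_tokens=True)
        yield text[len(sent):]          # emit only the newly decoded suffix
        sent = text

# Illustrative use with made-up deltas:
for piece in decode_stream([[9707], [11, 1879], [0]]):
    print(piece, end="")
```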
Reproduction
This race is sensitive to system-level timing. In our experience, it reproduces reliably in a k8s production environment (+Dynamo) under sustained load, but we were unable to trigger it in a local setup with the same SGLang + Dynamo frontend configuration in a single container - likely because differences in network stack and container overhead affect the exact back-pressure dynamics that cause ZMQ message accumulation.
Observed in production (Dynamo + SGLang, detokenizer enabled for structured-output compatibility):
- 466 affected requests in 20 minutes under max_concurrency=5 (DeepSeek-V3.2)
- Backlogs of up to 14 queued chunks per wakeup
- A few percent (~1-3%) of requests showed a token count mismatch
cc @ishandhanani
Environment
Dynamo + SGLang