
[Bug] Streaming token ids data loss under load (affects Nvidia Dynamo) #19976

@vladnosiv

Description

Describe the bug

`_wait_one_response` in `tokenizer_manager.py` consumes only the last element of `state.out_list` on each wakeup:

```python
out = state.out_list[-1]
state.out_list = []
```

When `handle_loop` processes multiple scheduler batches before `_wait_one_response` is scheduled (i.e. several ZMQ messages are buffered), intermediate streaming outputs accumulate in `out_list`. Only the last one is kept; the rest are silently discarded.
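A minimal, self-contained simulation of the loss pattern (illustrative only, not SGLang's actual code; the field names follow the output-dict shapes described in this issue):

```python
# Several chunks sit buffered before the consumer wakes up. Taking
# out_list[-1] is lossless for cumulative text but drops token-id deltas.
out_list = [
    {"text": "The cat", "output_ids": [791, 8415]},  # text is cumulative
    {"text": "The cat sat", "output_ids": [7731]},   # output_ids is a delta
]

out = out_list[-1]       # what _wait_one_response does today
text = out["text"]       # "The cat sat": complete, so the race is masked
ids = out["output_ids"]  # [7731]: tokens 791 and 8415 are gone for good
print(text, ids)
```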

The race in `_wait_one_response` exists unconditionally, but its severity depends on which output the consumer reads.

  • Cumulative-text consumers: each output dict carries the cumulative decoded text (`output_str[sent_offset:]`). If intermediate chunks are dropped, `out_list[-1]` still contains all text up to that point. The client sees a larger-than-usual chunk, but no data is lost.
  • Token-ID consumers: the `output_ids` in each output dict are disjoint deltas: they contain only the new tokens since the last chunk. Dropping a chunk permanently loses those token IDs with no way to recover them. This is silent, unrecoverable data loss, regardless of whether `--skip-tokenizer-init` is set.

In other words, `_wait_one_response` was never designed to handle disjoint streaming outputs safely. The cumulative-text path happened to mask the race, but any consumer reading `output_ids` from the stream has no such safety net.

Additionally, `--skip-tokenizer-init` removes the detokenizer process, which normally acts as a natural throttle (`tokenizer.batch_decode()` is CPU-bound) and makes ZMQ buffer accumulation rare. Without it, messages arrive at wire speed and accumulation under load becomes routine, making the race significantly easier to trigger.
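One way the consumer could be made safe is sketched below. This is a hypothetical helper (`merge_outputs` does not exist in SGLang): drain every buffered output instead of keeping only `out_list[-1]`, concatenating the disjoint `output_ids` deltas while taking the cumulative fields from the last chunk.

```python
# Hypothetical sketch, not SGLang's actual code: merge all buffered
# outputs so no token-id delta is dropped.
def merge_outputs(out_list):
    merged = dict(out_list[-1])  # latest cumulative fields (text, meta)
    merged["output_ids"] = [
        tok for out in out_list for tok in out.get("output_ids", [])
    ]  # concatenate the per-chunk deltas
    return merged

chunks = [
    {"text": "He", "output_ids": [1]},
    {"text": "Hello", "output_ids": [2, 3]},
    {"text": "Hello world", "output_ids": [4]},
]
merged = merge_outputs(chunks)
# merged["text"] == "Hello world"; merged["output_ids"] == [1, 2, 3, 4]
```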

Symptoms

  • Example of corrupted streaming output from DeepSeek-V3.2 (tool-calling response):

    ```
    The file prerequests.sh exists in the repository root. Now let's **creatory (create a directory for)** modules.

    **<|DSML|function_c>**
    <|DSML|invoke name="add_entry">
    ...
    </|DSML|invoke>
    </|DSML|function_calls>
    ```
  • completion_tokens reported by SGLang is higher than the number of tokens the client actually received
  • P99 inter-token latency spikes on affected chunks
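The token-count mismatch symptom can be checked client-side. A hedged sketch (`detect_token_loss` is a hypothetical helper; adapt the field names to whatever streaming schema your client uses):

```python
# Count the token ids actually received across all stream chunks and
# compare with the completion_tokens the server reports; a positive
# difference means tokens were silently dropped somewhere in the stream.
def detect_token_loss(chunks, reported_completion_tokens):
    received = sum(len(c.get("output_ids", [])) for c in chunks)
    return reported_completion_tokens - received

missing = detect_token_loss(
    [{"output_ids": [1, 2]}, {"output_ids": [3, 4]}],
    reported_completion_tokens=5,
)
# missing == 1 -> one token never reached the client
```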

Why Dynamo's default configuration amplifies the problem

Dynamo forces `--skip-tokenizer-init` by default, which has a compounding effect on this bug.

Beyond amplifying this race, `--skip-tokenizer-init` has additional costs in Dynamo:

  • No structured output enforcement. SGLang's grammar-guided decoding and JSON mode require the tokenizer to be initialized. With `--skip-tokenizer-init`, the schema is never forwarded to SGLang, so constrained decoding is silently disabled. (cc @ishandhanani there is a PR from my co-worker in Dynamo: feat: sglang guided decoding support ai-dynamo/dynamo#6620)
  • Single-tokenizer-worker bottleneck. `--skip-tokenizer-init` also prevents using `--tokenizer-worker-num > 1`. In our production data, the single tokenizer worker relaying requests over ZMQ becomes a throughput bottleneck under high concurrency. Re-enabling tokenizer init would unlock multi-worker scaling.

A possible middle ground: a new SGLang flag (e.g. `--skip-detokenization`) that initializes the tokenizer (enabling structured output and multi-worker scaling) but skips the detokenization pass for output chunks, leaving detokenization to the external consumer.

Reproduction

This race is sensitive to system-level timing. In our experience, it reproduces reliably in a Kubernetes production environment (with Dynamo) under sustained load, but we could not trigger it in a local single-container setup with the same SGLang + Dynamo frontend configuration, likely because differences in the network stack and container overhead change the exact back-pressure dynamics that cause ZMQ message accumulation.

Observed in production (Dynamo + SGLang, detokenizer enabled for structured-output compatibility):

  • 466 affected requests in 20 minutes under max_concurrency=5 (DeepSeek-V3.2)
  • Backlogs of up to 14 queued chunks per wakeup
  • A few percent (~1-3%) of requests showed a token count mismatch

cc @ishandhanani

Environment

Dynamo + SGLang
