fix(proxy): skip redundant tiktoken recount when provider supplies reasoning_tokens#26245
Conversation
Greptile SummaryThis PR fixes a GIL-blocking performance bug in Confidence Score: 5/5Safe to merge — targeted performance fix with no behavioral change in the common case and intact fallback for providers that omit usage. Logic is correct and validated against No files require special attention.
|
| Filename | Overview |
|---|---|
| litellm/litellm_core_utils/streaming_chunk_builder_utils.py | Adds chunks_have_reasoning_tokens() helper that inspects streaming chunks (both ModelResponseStream objects and plain dicts) for a provider-supplied completion_tokens_details.reasoning_tokens value; logic is correct and follows existing chunk-parsing patterns in _calculate_usage_per_chunk. |
| litellm/main.py | Call-site change gates count_reasoning_tokens behind the new helper; semantics unchanged since calculate_usage already discards the recomputed value when provider supplied it. |
| tests/test_litellm/litellm_core_utils/test_streaming_chunk_builder_utils.py | Four new mock-only tests covering the happy path, fallback path, dict-chunk path, and end-to-end skip verification; no real network calls, all assertions meaningful. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[stream_chunk_builder called with chunks] --> B[processor.chunks_have_reasoning_tokens chunks]
B --> C{Any chunk has\ncompletion_tokens_details\n.reasoning_tokens?}
C -- Yes --> D[reasoning_tokens = None\nskip tiktoken.encode GIL block]
C -- No --> E[reasoning_tokens = processor.count_reasoning_tokens response\ntiktoken.encode fallback path]
D --> F[processor.calculate_usage chunks model ...]
E --> F
F --> G{reasoning_tokens arg\nis not None?}
G -- No --> H[Use provider-supplied\ncompletion_tokens_details.reasoning_tokens]
G -- Yes --> I{returned_usage\n.completion_tokens_details\n.reasoning_tokens is None?}
I -- Yes --> J[Apply locally computed\nreasoning_tokens]
I -- No --> H
H --> K[Return final Usage object]
J --> K
Reviews (2): Last reviewed commit: "fix(proxy): skip redundant tiktoken reco..." | Re-trigger Greptile
…rriAI#25855) Bedrock enforces non-increasing TTL ordering across cache_control blocks (tools → system → messages). The tool cache_control TTL was being unconditionally dropped to the default 5m, while system blocks preserved the user-specified TTL for Claude 4.5+ models. This mismatch caused "a ttl='1h' block must not come after a ttl='5m' block" errors when users set ttl='1h' on both tools and system. Converse path: add_cache_point_tool_block() now accepts a model param and preserves TTL for Claude 4.5+, matching _get_cache_point_block(). Invoke path: _remove_ttl_from_cache_control() now also processes tools (was only processing system and messages). Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…asoning_tokens (BerriAI#26193) stream_chunk_builder unconditionally called ChunkProcessor.count_reasoning_tokens, which recomputes reasoning_tokens via tiktoken.encode() — a C extension that holds the GIL for tens of seconds on large reasoning responses (Claude extended thinking, OpenAI o1/o3, Gemini thinking). The computed value is already discarded by calculate_usage when the provider supplied reasoning_tokens in a streaming usage chunk (see lines applying reasoning_tokens only when completion_tokens_details.reasoning_tokens is None). All major reasoning providers do supply this value, so in the common case the expensive recount runs purely to be thrown away — while blocking the asyncio event loop long enough to fail K8s liveness probes and get pods killed. Adds ChunkProcessor.chunks_have_reasoning_tokens(chunks) which inspects the already-parsed streaming chunks for any completion_tokens_details.reasoning_tokens entry. Short-circuits the call site in stream_chunk_builder when one is found. The slow fallback path via count_reasoning_tokens is preserved for providers that emit reasoning_content without reporting usage. Tests cover: - chunks_have_reasoning_tokens returns True when provider reports the value - returns False when provider omits it (fallback path preserved) - handles plain-dict chunks as well as ModelResponseStream objects - end-to-end: stream_chunk_builder does not invoke count_reasoning_tokens when the provider already supplies reasoning_tokens Fixes BerriAI#26193
349cc9a to
4d1c7b8
Compare
Low: No security issuesThis PR adds a performance optimization that skips a redundant tiktoken tokenization when the upstream provider already supplies reasoning_tokens in streaming usage chunks. The new code is read-only inspection of in-memory chunk data with no changes to auth, input validation, data exposure, or external interactions. Status: 0 open Posted by Veria AI · 2026-04-24T08:26:46.159Z |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
cd88dde to
c014bfa
Compare
Relevant issues
Fixes #26193
Type
Bug Fix
Changes
Problem
stream_chunk_builderunconditionally callsChunkProcessor.count_reasoning_tokens, which recomputesreasoning_tokensviatiktoken.encode()— a C extension that holds the GIL for tens of seconds on large reasoning responses (Claude extended thinking, OpenAI o1/o3, Gemini thinking, 100k+ tokens).The computed value is already discarded by
calculate_usagewhen the provider suppliedreasoning_tokensin a streaming usage chunk. Specifically,calculate_usageonly applies the recomputed value whencompletion_tokens_details.reasoning_tokens is None:All major reasoning providers (Anthropic extended thinking, OpenAI o1/o3, Gemini thinking) supply
reasoning_tokensin streaming usage chunks, so in the common case the expensive recount runs purely to be thrown away.The bug report (#26193) includes py-spy thread dumps showing a single
tiktoken.encode()call monopolizing the event loop for the full liveness-probe window, resulting in pods being SIGKILLed by Kubernetes.Fix
Inspect the already-parsed streaming chunks for any
completion_tokens_details.reasoning_tokensentry. If one exists, skip the local tiktoken recount entirely —calculate_usagewill pick up the provider-supplied value as before. If none exists (rare), fall through tocount_reasoning_tokensso providers that emitreasoning_contentwithout reporting usage are unaffected.New helper
ChunkProcessor.chunks_have_reasoning_tokens(chunks)handles bothModelResponseStreamobjects and plain-dict chunks, matching the existing chunk-parsing patterns in_calculate_usage_per_chunk.Call-site change in
litellm/main.py:Tests
Four new tests in
tests/test_litellm/litellm_core_utils/test_streaming_chunk_builder_utils.py:test_chunks_have_reasoning_tokens_true_when_provider_supplies— regression test for the reported bug.test_chunks_have_reasoning_tokens_false_when_provider_omits— ensures the slow fallback is still exercised when providers omit usage.test_chunks_have_reasoning_tokens_handles_dict_chunks— dict-chunk path (proxy-style chunks) is covered.test_stream_chunk_builder_skips_count_reasoning_tokens_when_usage_present— end-to-end:stream_chunk_builderdoes not invokecount_reasoning_tokenswhen the provider already supplied the value, and the final response carries the provider-supplied number.All four tests fail on
litellm_oss_branchwithout the fix and pass with it.Scope
count_reasoning_tokensitself (it remains the correct fallback).reasoning_contentwithout any usage chunk. That path still blocks the event loop and could be fixed in a follow-up (e.g.asyncio.to_threadat the async caller), but is orthogonal to this issue's reported reproduction.Pre-Submission checklist
tests/test_litellm/test_streaming_chunk_builder_utils.py)