Skip to content

fix(proxy): skip redundant tiktoken recount when provider supplies reasoning_tokens#26245

Closed
dschulmeist wants to merge 2 commits into
BerriAI:litellm_oss_branchfrom
dschulmeist:fix/tiktoken-event-loop-block-26193
Closed

fix(proxy): skip redundant tiktoken recount when provider supplies reasoning_tokens#26245
dschulmeist wants to merge 2 commits into
BerriAI:litellm_oss_branchfrom
dschulmeist:fix/tiktoken-event-loop-block-26193

Conversation

@dschulmeist

Copy link
Copy Markdown

Relevant issues

Fixes #26193

Type

Bug Fix

Changes

Problem

stream_chunk_builder unconditionally calls ChunkProcessor.count_reasoning_tokens, which recomputes reasoning_tokens via tiktoken.encode() — a C extension that holds the GIL for tens of seconds on large reasoning responses (Claude extended thinking, OpenAI o1/o3, Gemini thinking, 100k+ tokens).

The computed value is already discarded by calculate_usage when the provider supplied reasoning_tokens in a streaming usage chunk. Specifically, calculate_usage only applies the recomputed value when completion_tokens_details.reasoning_tokens is None:

if reasoning_tokens is not None:
    if returned_usage.completion_tokens_details is None:
        returned_usage.completion_tokens_details = (
            CompletionTokensDetailsWrapper(reasoning_tokens=reasoning_tokens)
        )
    elif (
        returned_usage.completion_tokens_details is not None
        and returned_usage.completion_tokens_details.reasoning_tokens is None
    ):
        returned_usage.completion_tokens_details.reasoning_tokens = (
            reasoning_tokens
        )

All major reasoning providers (Anthropic extended thinking, OpenAI o1/o3, Gemini thinking) supply reasoning_tokens in streaming usage chunks, so in the common case the expensive recount runs purely to be thrown away.

The bug report (#26193) includes py-spy thread dumps showing a single tiktoken.encode() call monopolizing the event loop for the full liveness-probe window, resulting in pods being SIGKILLed by Kubernetes.

Fix

Inspect the already-parsed streaming chunks for any completion_tokens_details.reasoning_tokens entry. If one exists, skip the local tiktoken recount entirely — calculate_usage will pick up the provider-supplied value as before. If none exists (rare), fall through to count_reasoning_tokens so providers that emit reasoning_content without reporting usage are unaffected.

New helper ChunkProcessor.chunks_have_reasoning_tokens(chunks) handles both ModelResponseStream objects and plain-dict chunks, matching the existing chunk-parsing patterns in _calculate_usage_per_chunk.

Call-site change in litellm/main.py:

if processor.chunks_have_reasoning_tokens(chunks):
    reasoning_tokens = None
else:
    reasoning_tokens = processor.count_reasoning_tokens(response)

Tests

Four new tests in tests/test_litellm/litellm_core_utils/test_streaming_chunk_builder_utils.py:

  1. test_chunks_have_reasoning_tokens_true_when_provider_supplies — regression test for the reported bug.
  2. test_chunks_have_reasoning_tokens_false_when_provider_omits — ensures the slow fallback is still exercised when providers omit usage.
  3. test_chunks_have_reasoning_tokens_handles_dict_chunks — dict-chunk path (proxy-style chunks) is covered.
  4. test_stream_chunk_builder_skips_count_reasoning_tokens_when_usage_present — end-to-end: stream_chunk_builder does not invoke count_reasoning_tokens when the provider already supplied the value, and the final response carries the provider-supplied number.

All four tests fail on litellm_oss_branch without the fix and pass with it.

Scope

  • No API changes: existing public method signatures untouched.
  • No new config flags or defaults.
  • No changes to count_reasoning_tokens itself (it remains the correct fallback).
  • Does not address the separate case where a misbehaving provider emits reasoning_content without any usage chunk. That path still blocks the event loop and could be fixed in a follow-up (e.g. asyncio.to_thread at the async caller), but is orthogonal to this issue's reported reproduction.

Pre-Submission checklist

  • Added testing in tests/test_litellm/
  • PR passes the touched test file (13/13 in test_streaming_chunk_builder_utils.py)
  • Scope isolated to a single bug
  • Greptile review (auto-triggered)

@greptile-apps

greptile-apps Bot commented Apr 22, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes a GIL-blocking performance bug in stream_chunk_builder where ChunkProcessor.count_reasoning_tokens unconditionally called tiktoken.encode() on large reasoning responses, holding the GIL long enough to kill Kubernetes pods via liveness-probe timeouts. The fix adds chunks_have_reasoning_tokens() to detect when the provider already supplied reasoning_tokens in a streaming usage chunk, allowing the expensive recount to be skipped since calculate_usage would have discarded the locally-computed value anyway.

Confidence Score: 5/5

Safe to merge — targeted performance fix with no behavioral change in the common case and intact fallback for providers that omit usage.

Logic is correct and validated against calculate_usage's existing conditional that discards the locally-computed value when the provider already supplied it. The fallback path is fully preserved. Four new tests cover all branches including the end-to-end skip. No backwards-incompatible changes, no new config flags, and no security implications.

No files require special attention.

Important Files Changed

Filename Overview
litellm/litellm_core_utils/streaming_chunk_builder_utils.py Adds chunks_have_reasoning_tokens() helper that inspects streaming chunks (both ModelResponseStream objects and plain dicts) for a provider-supplied completion_tokens_details.reasoning_tokens value; logic is correct and follows existing chunk-parsing patterns in _calculate_usage_per_chunk.
litellm/main.py Call-site change gates count_reasoning_tokens behind the new helper; semantics unchanged since calculate_usage already discards the recomputed value when provider supplied it.
tests/test_litellm/litellm_core_utils/test_streaming_chunk_builder_utils.py Four new mock-only tests covering the happy path, fallback path, dict-chunk path, and end-to-end skip verification; no real network calls, all assertions meaningful.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[stream_chunk_builder called with chunks] --> B[processor.chunks_have_reasoning_tokens chunks]
    B --> C{Any chunk has\ncompletion_tokens_details\n.reasoning_tokens?}
    C -- Yes --> D[reasoning_tokens = None\nskip tiktoken.encode GIL block]
    C -- No --> E[reasoning_tokens = processor.count_reasoning_tokens response\ntiktoken.encode fallback path]
    D --> F[processor.calculate_usage chunks model ...]
    E --> F
    F --> G{reasoning_tokens arg\nis not None?}
    G -- No --> H[Use provider-supplied\ncompletion_tokens_details.reasoning_tokens]
    G -- Yes --> I{returned_usage\n.completion_tokens_details\n.reasoning_tokens is None?}
    I -- Yes --> J[Apply locally computed\nreasoning_tokens]
    I -- No --> H
    H --> K[Return final Usage object]
    J --> K
Loading

Reviews (2): Last reviewed commit: "fix(proxy): skip redundant tiktoken reco..." | Re-trigger Greptile

shubham-arora-clear and others added 2 commits April 23, 2026 21:54
…rriAI#25855)

Bedrock enforces non-increasing TTL ordering across cache_control blocks
(tools → system → messages). The tool cache_control TTL was being
unconditionally dropped to the default 5m, while system blocks preserved
the user-specified TTL for Claude 4.5+ models. This mismatch caused
"a ttl='1h' block must not come after a ttl='5m' block" errors when
users set ttl='1h' on both tools and system.

Converse path: add_cache_point_tool_block() now accepts a model param
and preserves TTL for Claude 4.5+, matching _get_cache_point_block().

Invoke path: _remove_ttl_from_cache_control() now also processes tools
(was only processing system and messages).

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…asoning_tokens (BerriAI#26193)

stream_chunk_builder unconditionally called ChunkProcessor.count_reasoning_tokens,
which recomputes reasoning_tokens via tiktoken.encode() — a C extension that holds
the GIL for tens of seconds on large reasoning responses (Claude extended thinking,
OpenAI o1/o3, Gemini thinking).

The computed value is already discarded by calculate_usage when the provider
supplied reasoning_tokens in a streaming usage chunk (see lines applying
reasoning_tokens only when completion_tokens_details.reasoning_tokens is None).
All major reasoning providers do supply this value, so in the common case the
expensive recount runs purely to be thrown away — while blocking the asyncio
event loop long enough to fail K8s liveness probes and get pods killed.

Adds ChunkProcessor.chunks_have_reasoning_tokens(chunks) which inspects the
already-parsed streaming chunks for any completion_tokens_details.reasoning_tokens
entry. Short-circuits the call site in stream_chunk_builder when one is found.
The slow fallback path via count_reasoning_tokens is preserved for providers
that emit reasoning_content without reporting usage.

Tests cover:
- chunks_have_reasoning_tokens returns True when provider reports the value
- returns False when provider omits it (fallback path preserved)
- handles plain-dict chunks as well as ModelResponseStream objects
- end-to-end: stream_chunk_builder does not invoke count_reasoning_tokens when
  the provider already supplies reasoning_tokens

Fixes BerriAI#26193
@dschulmeist dschulmeist force-pushed the fix/tiktoken-event-loop-block-26193 branch from 349cc9a to 4d1c7b8 Compare April 24, 2026 08:25
@veria-ai

veria-ai Bot commented Apr 24, 2026

Copy link
Copy Markdown
Contributor

Low: No security issues

This PR adds a performance optimization that skips a redundant tiktoken tokenization when the upstream provider already supplies reasoning_tokens in streaming usage chunks. The new code is read-only inspection of in-memory chunk data with no changes to auth, input validation, data exposure, or external interactions.


Status: 0 open
Risk: 1/10

Posted by Veria AI · 2026-04-24T08:26:46.159Z

@codecov

codecov Bot commented Apr 24, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants