fix(proxy): skip redundant tiktoken recount when provider supplies reasoning_tokens by dschulmeist · Pull Request #26245 · BerriAI/litellm

dschulmeist · 2026-04-22T12:42:55Z

Relevant issues

Type

Bug Fix

Changes

Problem

stream_chunk_builder unconditionally calls ChunkProcessor.count_reasoning_tokens, which recomputes reasoning_tokens via tiktoken.encode() — a C extension that holds the GIL for tens of seconds on large reasoning responses (Claude extended thinking, OpenAI o1/o3, Gemini thinking, 100k+ tokens).

The computed value is already discarded by calculate_usage when the provider supplied reasoning_tokens in a streaming usage chunk. Specifically, calculate_usage only applies the recomputed value when completion_tokens_details.reasoning_tokens is None:

if reasoning_tokens is not None:
    if returned_usage.completion_tokens_details is None:
        returned_usage.completion_tokens_details = (
            CompletionTokensDetailsWrapper(reasoning_tokens=reasoning_tokens)
        )
    elif (
        returned_usage.completion_tokens_details is not None
        and returned_usage.completion_tokens_details.reasoning_tokens is None
    ):
        returned_usage.completion_tokens_details.reasoning_tokens = (
            reasoning_tokens
        )

All major reasoning providers (Anthropic extended thinking, OpenAI o1/o3, Gemini thinking) supply reasoning_tokens in streaming usage chunks, so in the common case the expensive recount runs purely to be thrown away.

The bug report (#26193) includes py-spy thread dumps showing a single tiktoken.encode() call monopolizing the event loop for the full liveness-probe window, resulting in pods being SIGKILLed by Kubernetes.

Fix

Inspect the already-parsed streaming chunks for any completion_tokens_details.reasoning_tokens entry. If one exists, skip the local tiktoken recount entirely — calculate_usage will pick up the provider-supplied value as before. If none exists (rare), fall through to count_reasoning_tokens so providers that emit reasoning_content without reporting usage are unaffected.

New helper ChunkProcessor.chunks_have_reasoning_tokens(chunks) handles both ModelResponseStream objects and plain-dict chunks, matching the existing chunk-parsing patterns in _calculate_usage_per_chunk.

Call-site change in litellm/main.py:

if processor.chunks_have_reasoning_tokens(chunks):
    reasoning_tokens = None
else:
    reasoning_tokens = processor.count_reasoning_tokens(response)

Tests

Four new tests in tests/test_litellm/litellm_core_utils/test_streaming_chunk_builder_utils.py:

test_chunks_have_reasoning_tokens_true_when_provider_supplies — regression test for the reported bug.
test_chunks_have_reasoning_tokens_false_when_provider_omits — ensures the slow fallback is still exercised when providers omit usage.
test_chunks_have_reasoning_tokens_handles_dict_chunks — dict-chunk path (proxy-style chunks) is covered.
test_stream_chunk_builder_skips_count_reasoning_tokens_when_usage_present — end-to-end: stream_chunk_builder does not invoke count_reasoning_tokens when the provider already supplied the value, and the final response carries the provider-supplied number.

All four tests fail on litellm_oss_branch without the fix and pass with it.

Scope

No API changes: existing public method signatures untouched.
No new config flags or defaults.
No changes to count_reasoning_tokens itself (it remains the correct fallback).
Does not address the separate case where a misbehaving provider emits reasoning_content without any usage chunk. That path still blocks the event loop and could be fixed in a follow-up (e.g. asyncio.to_thread at the async caller), but is orthogonal to this issue's reported reproduction.

Pre-Submission checklist

Added testing in tests/test_litellm/
PR passes the touched test file (13/13 in test_streaming_chunk_builder_utils.py)
Scope isolated to a single bug
Greptile review (auto-triggered)

greptile-apps · 2026-04-22T12:45:05Z

Greptile Summary

This PR fixes a GIL-blocking performance bug in stream_chunk_builder where ChunkProcessor.count_reasoning_tokens unconditionally called tiktoken.encode() on large reasoning responses, holding the GIL long enough to kill Kubernetes pods via liveness-probe timeouts. The fix adds chunks_have_reasoning_tokens() to detect when the provider already supplied reasoning_tokens in a streaming usage chunk, allowing the expensive recount to be skipped since calculate_usage would have discarded the locally-computed value anyway.

Confidence Score: 5/5

Safe to merge — targeted performance fix with no behavioral change in the common case and intact fallback for providers that omit usage.

Logic is correct and validated against calculate_usage's existing conditional that discards the locally-computed value when the provider already supplied it. The fallback path is fully preserved. Four new tests cover all branches including the end-to-end skip. No backwards-incompatible changes, no new config flags, and no security implications.

No files require special attention.

Important Files Changed

Filename	Overview
litellm/litellm_core_utils/streaming_chunk_builder_utils.py	Adds `chunks_have_reasoning_tokens()` helper that inspects streaming chunks (both `ModelResponseStream` objects and plain dicts) for a provider-supplied `completion_tokens_details.reasoning_tokens` value; logic is correct and follows existing chunk-parsing patterns in `_calculate_usage_per_chunk`.
litellm/main.py	Call-site change gates `count_reasoning_tokens` behind the new helper; semantics unchanged since `calculate_usage` already discards the recomputed value when provider supplied it.
tests/test_litellm/litellm_core_utils/test_streaming_chunk_builder_utils.py	Four new mock-only tests covering the happy path, fallback path, dict-chunk path, and end-to-end skip verification; no real network calls, all assertions meaningful.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[stream_chunk_builder called with chunks] --> B[processor.chunks_have_reasoning_tokens chunks]
    B --> C{Any chunk has\ncompletion_tokens_details\n.reasoning_tokens?}
    C -- Yes --> D[reasoning_tokens = None\nskip tiktoken.encode GIL block]
    C -- No --> E[reasoning_tokens = processor.count_reasoning_tokens response\ntiktoken.encode fallback path]
    D --> F[processor.calculate_usage chunks model ...]
    E --> F
    F --> G{reasoning_tokens arg\nis not None?}
    G -- No --> H[Use provider-supplied\ncompletion_tokens_details.reasoning_tokens]
    G -- Yes --> I{returned_usage\n.completion_tokens_details\n.reasoning_tokens is None?}
    I -- Yes --> J[Apply locally computed\nreasoning_tokens]
    I -- No --> H
    H --> K[Return final Usage object]
    J --> K

_{Reviews (2): Last reviewed commit: "fix(proxy): skip redundant tiktoken reco..." | Re-trigger Greptile}

…rriAI#25855) Bedrock enforces non-increasing TTL ordering across cache_control blocks (tools → system → messages). The tool cache_control TTL was being unconditionally dropped to the default 5m, while system blocks preserved the user-specified TTL for Claude 4.5+ models. This mismatch caused "a ttl='1h' block must not come after a ttl='5m' block" errors when users set ttl='1h' on both tools and system. Converse path: add_cache_point_tool_block() now accepts a model param and preserves TTL for Claude 4.5+, matching _get_cache_point_block(). Invoke path: _remove_ttl_from_cache_control() now also processes tools (was only processing system and messages). Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…asoning_tokens (BerriAI#26193) stream_chunk_builder unconditionally called ChunkProcessor.count_reasoning_tokens, which recomputes reasoning_tokens via tiktoken.encode() — a C extension that holds the GIL for tens of seconds on large reasoning responses (Claude extended thinking, OpenAI o1/o3, Gemini thinking). The computed value is already discarded by calculate_usage when the provider supplied reasoning_tokens in a streaming usage chunk (see lines applying reasoning_tokens only when completion_tokens_details.reasoning_tokens is None). All major reasoning providers do supply this value, so in the common case the expensive recount runs purely to be thrown away — while blocking the asyncio event loop long enough to fail K8s liveness probes and get pods killed. Adds ChunkProcessor.chunks_have_reasoning_tokens(chunks) which inspects the already-parsed streaming chunks for any completion_tokens_details.reasoning_tokens entry. Short-circuits the call site in stream_chunk_builder when one is found. The slow fallback path via count_reasoning_tokens is preserved for providers that emit reasoning_content without reporting usage. Tests cover: - chunks_have_reasoning_tokens returns True when provider reports the value - returns False when provider omits it (fallback path preserved) - handles plain-dict chunks as well as ModelResponseStream objects - end-to-end: stream_chunk_builder does not invoke count_reasoning_tokens when the provider already supplies reasoning_tokens Fixes BerriAI#26193

veria-ai · 2026-04-24T08:26:46Z

Low: No security issues

This PR adds a performance optimization that skips a redundant tiktoken tokenization when the upstream provider already supplies reasoning_tokens in streaming usage chunks. The new code is read-only inspection of in-memory chunk data with no changes to auth, input validation, data exposure, or external interactions.

Status: 0 open
Risk: 1/10

Posted by Veria AI · 2026-04-24T08:26:46.159Z

codecov · 2026-04-24T08:29:03Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

shubham-arora-clear and others added 2 commits April 23, 2026 21:54

dschulmeist force-pushed the fix/tiktoken-event-loop-block-26193 branch from 349cc9a to 4d1c7b8 Compare April 24, 2026 08:25

Sameerlite force-pushed the litellm_oss_branch branch from cd88dde to c014bfa Compare April 27, 2026 03:29

Sameerlite deleted the branch BerriAI:litellm_oss_branch April 27, 2026 04:56

Sameerlite closed this Apr 27, 2026

dschulmeist mentioned this pull request Apr 27, 2026

fix(proxy): skip redundant tiktoken recount when provider supplies reasoning_tokens #26577

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(proxy): skip redundant tiktoken recount when provider supplies reasoning_tokens#26245

fix(proxy): skip redundant tiktoken recount when provider supplies reasoning_tokens#26245
dschulmeist wants to merge 2 commits into
BerriAI:litellm_oss_branchfrom
dschulmeist:fix/tiktoken-event-loop-block-26193

dschulmeist commented Apr 22, 2026

Uh oh!

greptile-apps Bot commented Apr 22, 2026 •

edited

Loading

Important Files Changed

Uh oh!

veria-ai Bot commented Apr 24, 2026

Uh oh!

codecov Bot commented Apr 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

dschulmeist commented Apr 22, 2026

Relevant issues

Type

Changes

Problem

Fix

Tests

Scope

Pre-Submission checklist

Uh oh!

greptile-apps Bot commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

veria-ai Bot commented Apr 24, 2026

Low: No security issues

Uh oh!

codecov Bot commented Apr 24, 2026

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

greptile-apps Bot commented Apr 22, 2026 •

edited

Loading