fix(proxy): skip redundant tiktoken recount when provider supplies reasoning_tokens by dschulmeist · Pull Request #26577 · BerriAI/litellm

dschulmeist · 2026-04-27T07:39:15Z

Re-opening previously closed PR #26245 against the new OSS staging branch (the original target `litellm_oss_branch` was deleted on 2026-04-27 as part of the maintainer branch rotation).

Relevant issues

Fixes #26193

Type

Bug Fix

Problem

`stream_chunk_builder` unconditionally calls `ChunkProcessor.count_reasoning_tokens`, which recomputes `reasoning_tokens` via `tiktoken.encode()` — a C extension that holds the GIL for tens of seconds on large reasoning responses (Claude extended thinking, OpenAI o1/o3, Gemini thinking, 100k+ tokens).

The computed value is already discarded by `calculate_usage` when the provider supplied `reasoning_tokens` in a streaming usage chunk. All major reasoning providers do supply this — so in the common case the expensive recount runs purely to be thrown away while blocking the event loop long enough for K8s to SIGKILL the pod.

Fix

Inspect chunks for `completion_tokens_details.reasoning_tokens` first. If present, skip the local tiktoken recount entirely. Slow-path fallback preserved for providers that emit `reasoning_content` without reporting usage.

New helper `ChunkProcessor.chunks_have_reasoning_tokens(chunks)` matches the existing chunk-parsing patterns in `_calculate_usage_per_chunk`. Call-site:
```python
if processor.chunks_have_reasoning_tokens(chunks):
reasoning_tokens = None
else:
reasoning_tokens = processor.count_reasoning_tokens(response)
```

Tests

Four new tests in `tests/test_litellm/litellm_core_utils/test_streaming_chunk_builder_utils.py`:

`test_chunks_have_reasoning_tokens_true_when_provider_supplies` — regression test
`test_chunks_have_reasoning_tokens_false_when_provider_omits` — fallback path preserved
`test_chunks_have_reasoning_tokens_handles_dict_chunks` — dict-chunk path covered
`test_stream_chunk_builder_skips_count_reasoning_tokens_when_usage_present` — end-to-end

Scope notes

No API changes. No new config flags. Doesn't address misbehaving providers that emit `reasoning_content` without any usage chunk — that path still hits the slow tiktoken call and could be addressed in a follow-up via `asyncio.to_thread` at the async caller.

Pre-Submission checklist

Added testing in `tests/test_litellm/`
PR passes touched-file tests (13/13)
Scope isolated to a single bug

…rriAI#25855) Bedrock enforces non-increasing TTL ordering across cache_control blocks (tools → system → messages). The tool cache_control TTL was being unconditionally dropped to the default 5m, while system blocks preserved the user-specified TTL for Claude 4.5+ models. This mismatch caused "a ttl='1h' block must not come after a ttl='5m' block" errors when users set ttl='1h' on both tools and system. Converse path: add_cache_point_tool_block() now accepts a model param and preserves TTL for Claude 4.5+, matching _get_cache_point_block(). Invoke path: _remove_ttl_from_cache_control() now also processes tools (was only processing system and messages). Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…asoning_tokens (BerriAI#26193) stream_chunk_builder unconditionally called ChunkProcessor.count_reasoning_tokens, which recomputes reasoning_tokens via tiktoken.encode() — a C extension that holds the GIL for tens of seconds on large reasoning responses (Claude extended thinking, OpenAI o1/o3, Gemini thinking). The computed value is already discarded by calculate_usage when the provider supplied reasoning_tokens in a streaming usage chunk (see lines applying reasoning_tokens only when completion_tokens_details.reasoning_tokens is None). All major reasoning providers do supply this value, so in the common case the expensive recount runs purely to be thrown away — while blocking the asyncio event loop long enough to fail K8s liveness probes and get pods killed. Adds ChunkProcessor.chunks_have_reasoning_tokens(chunks) which inspects the already-parsed streaming chunks for any completion_tokens_details.reasoning_tokens entry. Short-circuits the call site in stream_chunk_builder when one is found. The slow fallback path via count_reasoning_tokens is preserved for providers that emit reasoning_content without reporting usage. Tests cover: - chunks_have_reasoning_tokens returns True when provider reports the value - returns False when provider omits it (fallback path preserved) - handles plain-dict chunks as well as ModelResponseStream objects - end-to-end: stream_chunk_builder does not invoke count_reasoning_tokens when the provider already supplies reasoning_tokens Fixes BerriAI#26193

greptile-apps · 2026-04-27T07:42:51Z

Greptile Summary

This PR optimizes stream_chunk_builder by skipping the GIL-holding tiktoken recount when a provider already supplies reasoning_tokens in streaming usage chunks, and also bundles Bedrock-specific fixes that propagate the model parameter through _bedrock_tools_pt/add_cache_point_tool_block for Claude 4.5+ TTL support and extend _remove_ttl_from_cache_control to cover tools entries. The core optimization is correct and well-tested; the calculate_usage code confirms that provider-supplied values take precedence over the local recount, so the short-circuit changes no observable output.

Confidence Score: 4/5

Safe to merge; all findings are P2 style concerns with no correctness impact on the tiktoken optimization.

The short-circuit logic is correct — calculate_usage confirms provider values take precedence. Bedrock TTL changes are backwards-compatible. Only P2 findings remain.

litellm/litellm_core_utils/prompt_templates/factory.py — new Bedrock import outside the llms/ directory.

Important Files Changed

Filename	Overview
litellm/litellm_core_utils/streaming_chunk_builder_utils.py	Adds chunks_have_reasoning_tokens helper that inspects chunks for provider-supplied reasoning_tokens to skip the expensive tiktoken recount; logic is correct and consistent with _calculate_usage_per_chunk patterns
litellm/main.py	Call-site updated to short-circuit count_reasoning_tokens via the new helper; fallback preserved for providers that don't emit usage chunks
litellm/litellm_core_utils/prompt_templates/factory.py	Propagates model through add_cache_point_tool_block and _bedrock_tools_pt for Claude 4.5+ TTL support; introduces a Bedrock-specific import (is_claude_4_5_on_bedrock) outside the llms/ directory, violating the team's provider isolation rule
litellm/llms/bedrock/messages/invoke_transformations/anthropic_claude3_transformation.py	Extends _remove_ttl_from_cache_control to also sanitize tools entries, fixing a TTL ordering gap where system/messages were sanitized but tools were not
tests/test_litellm/litellm_core_utils/test_streaming_chunk_builder_utils.py	Four well-structured mock-only tests covering true/false paths, dict chunks, and end-to-end short-circuit via patch.object; no real network calls
litellm/llms/bedrock/chat/converse_transformation.py	Passes model parameter to _bedrock_tools_pt in both tool-processing branches so TTL is propagated correctly
tests/test_litellm/litellm_core_utils/prompt_templates/test_litellm_core_utils_prompt_templates_factory.py	New tests for add_cache_point_tool_block and _bedrock_tools_pt with TTL covering Claude 4.5, older models, missing model, and no-TTL cases
tests/test_litellm/llms/bedrock/messages/invoke_transformations/test_anthropic_claude3_transformation.py	Two new tests verifying tools cache_control TTL is stripped for older models and preserved for Claude 4.5+

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[stream_chunk_builder called] --> B[ChunkProcessor.chunks_have_reasoning_tokens]
    B -->|True: provider supplied reasoning_tokens| C[reasoning_tokens = None]
    B -->|False: no usage chunk with reasoning_tokens| D[count_reasoning_tokens via tiktoken]
    C --> E[calculate_usage]
    D --> E
    E --> F{completion_tokens_details from chunks?}
    F -->|Yes, reasoning_tokens not None| G[Use provider value directly]
    F -->|No details, reasoning_tokens param not None| H[Apply local recount fallback]
    G --> I[Final Usage object]
    H --> I

_{Reviews (1): Last reviewed commit: "fix(proxy): skip redundant tiktoken reco..." | Re-trigger Greptile}

greptile-apps · 2026-04-27T07:42:55Z

+def add_cache_point_tool_block(
+    tool: dict, model: Optional[str] = None
+) -> Optional[BedrockToolBlock]:
+    from litellm.llms.bedrock.common_utils import is_claude_4_5_on_bedrock


Provider-specific import outside llms/ directory

is_claude_4_5_on_bedrock is a Bedrock-specific utility being imported inside a function in litellm_core_utils/prompt_templates/factory.py. The team's rule is to keep provider-specific logic inside litellm/llms/. The function add_cache_point_tool_block already lives at the edge of this boundary (it returns a BedrockToolBlock), but adding a named Bedrock model check makes it deeper into provider territory. Consider passing a boolean supports_extended_ttl: bool = False from the caller (which already has the model string) instead of importing the Bedrock helper here.

Rule Used: What: Avoid writing provider-specific code outside... (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

- BerriAI/litellm#26577 fix(proxy): skip redundant tiktoken recount (merge-after-nits) - BerriAI/litellm#26576 fix(proxy): load REDIS_* env vars when cache_params has non-connection keys (merge-as-is) - QwenLM/qwen-code#3667 fix(cli): refresh static header on model switch (merge-after-nits) - google-gemini/gemini-cli#26005 Fix infinite dialog loop and add ESC support in /skills link (merge-after-nits)

shubham-arora-clear and others added 2 commits April 27, 2026 09:35

greptile-apps Bot reviewed Apr 27, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(proxy): skip redundant tiktoken recount when provider supplies reasoning_tokens#26577

fix(proxy): skip redundant tiktoken recount when provider supplies reasoning_tokens#26577
dschulmeist wants to merge 2 commits into
BerriAI:litellm-oss-staging-04-25-2026from
dschulmeist:fix/tiktoken-event-loop-block-26193

dschulmeist commented Apr 27, 2026

Uh oh!

greptile-apps Bot commented Apr 27, 2026

Important Files Changed

Uh oh!

greptile-apps Bot Apr 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

dschulmeist commented Apr 27, 2026

Relevant issues

Type

Problem

Fix

Tests

Scope notes

Pre-Submission checklist

Uh oh!

greptile-apps Bot commented Apr 27, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps Bot Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants