Skip to content

fix(proxy): skip redundant tiktoken recount when provider supplies reasoning_tokens#26577

Open
dschulmeist wants to merge 2 commits into
BerriAI:litellm-oss-staging-04-25-2026from
dschulmeist:fix/tiktoken-event-loop-block-26193
Open

fix(proxy): skip redundant tiktoken recount when provider supplies reasoning_tokens#26577
dschulmeist wants to merge 2 commits into
BerriAI:litellm-oss-staging-04-25-2026from
dschulmeist:fix/tiktoken-event-loop-block-26193

Conversation

@dschulmeist

Copy link
Copy Markdown

Re-opening previously closed PR #26245 against the new OSS staging branch (the original target `litellm_oss_branch` was deleted on 2026-04-27 as part of the maintainer branch rotation).

Relevant issues

Fixes #26193

Type

Bug Fix

Problem

`stream_chunk_builder` unconditionally calls `ChunkProcessor.count_reasoning_tokens`, which recomputes `reasoning_tokens` via `tiktoken.encode()` — a C extension that holds the GIL for tens of seconds on large reasoning responses (Claude extended thinking, OpenAI o1/o3, Gemini thinking, 100k+ tokens).

The computed value is already discarded by `calculate_usage` when the provider supplied `reasoning_tokens` in a streaming usage chunk. All major reasoning providers do supply this — so in the common case the expensive recount runs purely to be thrown away while blocking the event loop long enough for K8s to SIGKILL the pod.

Fix

Inspect chunks for `completion_tokens_details.reasoning_tokens` first. If present, skip the local tiktoken recount entirely. Slow-path fallback preserved for providers that emit `reasoning_content` without reporting usage.

New helper `ChunkProcessor.chunks_have_reasoning_tokens(chunks)` matches the existing chunk-parsing patterns in `_calculate_usage_per_chunk`. Call-site:
```python
if processor.chunks_have_reasoning_tokens(chunks):
reasoning_tokens = None
else:
reasoning_tokens = processor.count_reasoning_tokens(response)
```

Tests

Four new tests in `tests/test_litellm/litellm_core_utils/test_streaming_chunk_builder_utils.py`:

  1. `test_chunks_have_reasoning_tokens_true_when_provider_supplies` — regression test
  2. `test_chunks_have_reasoning_tokens_false_when_provider_omits` — fallback path preserved
  3. `test_chunks_have_reasoning_tokens_handles_dict_chunks` — dict-chunk path covered
  4. `test_stream_chunk_builder_skips_count_reasoning_tokens_when_usage_present` — end-to-end

Scope notes

No API changes. No new config flags. Doesn't address misbehaving providers that emit `reasoning_content` without any usage chunk — that path still hits the slow tiktoken call and could be addressed in a follow-up via `asyncio.to_thread` at the async caller.

Pre-Submission checklist

  • Added testing in `tests/test_litellm/`
  • PR passes touched-file tests (13/13)
  • Scope isolated to a single bug

shubham-arora-clear and others added 2 commits April 27, 2026 09:35
…rriAI#25855)

Bedrock enforces non-increasing TTL ordering across cache_control blocks
(tools → system → messages). The tool cache_control TTL was being
unconditionally dropped to the default 5m, while system blocks preserved
the user-specified TTL for Claude 4.5+ models. This mismatch caused
"a ttl='1h' block must not come after a ttl='5m' block" errors when
users set ttl='1h' on both tools and system.

Converse path: add_cache_point_tool_block() now accepts a model param
and preserves TTL for Claude 4.5+, matching _get_cache_point_block().

Invoke path: _remove_ttl_from_cache_control() now also processes tools
(was only processing system and messages).

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…asoning_tokens (BerriAI#26193)

stream_chunk_builder unconditionally called ChunkProcessor.count_reasoning_tokens,
which recomputes reasoning_tokens via tiktoken.encode() — a C extension that holds
the GIL for tens of seconds on large reasoning responses (Claude extended thinking,
OpenAI o1/o3, Gemini thinking).

The computed value is already discarded by calculate_usage when the provider
supplied reasoning_tokens in a streaming usage chunk (see lines applying
reasoning_tokens only when completion_tokens_details.reasoning_tokens is None).
All major reasoning providers do supply this value, so in the common case the
expensive recount runs purely to be thrown away — while blocking the asyncio
event loop long enough to fail K8s liveness probes and get pods killed.

Adds ChunkProcessor.chunks_have_reasoning_tokens(chunks) which inspects the
already-parsed streaming chunks for any completion_tokens_details.reasoning_tokens
entry. Short-circuits the call site in stream_chunk_builder when one is found.
The slow fallback path via count_reasoning_tokens is preserved for providers
that emit reasoning_content without reporting usage.

Tests cover:
- chunks_have_reasoning_tokens returns True when provider reports the value
- returns False when provider omits it (fallback path preserved)
- handles plain-dict chunks as well as ModelResponseStream objects
- end-to-end: stream_chunk_builder does not invoke count_reasoning_tokens when
  the provider already supplies reasoning_tokens

Fixes BerriAI#26193
@greptile-apps

greptile-apps Bot commented Apr 27, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR optimizes stream_chunk_builder by skipping the GIL-holding tiktoken recount when a provider already supplies reasoning_tokens in streaming usage chunks, and also bundles Bedrock-specific fixes that propagate the model parameter through _bedrock_tools_pt/add_cache_point_tool_block for Claude 4.5+ TTL support and extend _remove_ttl_from_cache_control to cover tools entries. The core optimization is correct and well-tested; the calculate_usage code confirms that provider-supplied values take precedence over the local recount, so the short-circuit changes no observable output.

Confidence Score: 4/5

Safe to merge; all findings are P2 style concerns with no correctness impact on the tiktoken optimization.

The short-circuit logic is correct — calculate_usage confirms provider values take precedence. Bedrock TTL changes are backwards-compatible. Only P2 findings remain.

litellm/litellm_core_utils/prompt_templates/factory.py — new Bedrock import outside the llms/ directory.

Important Files Changed

Filename Overview
litellm/litellm_core_utils/streaming_chunk_builder_utils.py Adds chunks_have_reasoning_tokens helper that inspects chunks for provider-supplied reasoning_tokens to skip the expensive tiktoken recount; logic is correct and consistent with _calculate_usage_per_chunk patterns
litellm/main.py Call-site updated to short-circuit count_reasoning_tokens via the new helper; fallback preserved for providers that don't emit usage chunks
litellm/litellm_core_utils/prompt_templates/factory.py Propagates model through add_cache_point_tool_block and _bedrock_tools_pt for Claude 4.5+ TTL support; introduces a Bedrock-specific import (is_claude_4_5_on_bedrock) outside the llms/ directory, violating the team's provider isolation rule
litellm/llms/bedrock/messages/invoke_transformations/anthropic_claude3_transformation.py Extends _remove_ttl_from_cache_control to also sanitize tools entries, fixing a TTL ordering gap where system/messages were sanitized but tools were not
tests/test_litellm/litellm_core_utils/test_streaming_chunk_builder_utils.py Four well-structured mock-only tests covering true/false paths, dict chunks, and end-to-end short-circuit via patch.object; no real network calls
litellm/llms/bedrock/chat/converse_transformation.py Passes model parameter to _bedrock_tools_pt in both tool-processing branches so TTL is propagated correctly
tests/test_litellm/litellm_core_utils/prompt_templates/test_litellm_core_utils_prompt_templates_factory.py New tests for add_cache_point_tool_block and _bedrock_tools_pt with TTL covering Claude 4.5, older models, missing model, and no-TTL cases
tests/test_litellm/llms/bedrock/messages/invoke_transformations/test_anthropic_claude3_transformation.py Two new tests verifying tools cache_control TTL is stripped for older models and preserved for Claude 4.5+

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[stream_chunk_builder called] --> B[ChunkProcessor.chunks_have_reasoning_tokens]
    B -->|True: provider supplied reasoning_tokens| C[reasoning_tokens = None]
    B -->|False: no usage chunk with reasoning_tokens| D[count_reasoning_tokens via tiktoken]
    C --> E[calculate_usage]
    D --> E
    E --> F{completion_tokens_details from chunks?}
    F -->|Yes, reasoning_tokens not None| G[Use provider value directly]
    F -->|No details, reasoning_tokens param not None| H[Apply local recount fallback]
    G --> I[Final Usage object]
    H --> I
Loading

Reviews (1): Last reviewed commit: "fix(proxy): skip redundant tiktoken reco..." | Re-trigger Greptile

def add_cache_point_tool_block(
tool: dict, model: Optional[str] = None
) -> Optional[BedrockToolBlock]:
from litellm.llms.bedrock.common_utils import is_claude_4_5_on_bedrock

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Provider-specific import outside llms/ directory

is_claude_4_5_on_bedrock is a Bedrock-specific utility being imported inside a function in litellm_core_utils/prompt_templates/factory.py. The team's rule is to keep provider-specific logic inside litellm/llms/. The function add_cache_point_tool_block already lives at the edge of this boundary (it returns a BedrockToolBlock), but adding a named Bedrock model check makes it deeper into provider territory. Consider passing a boolean supports_extended_ttl: bool = False from the caller (which already has the model string) instead of importing the Bedrock helper here.

Rule Used: What: Avoid writing provider-specific code outside... (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Bojun-Vvibe added a commit to Bojun-Vvibe/oss-contributions that referenced this pull request Apr 27, 2026
- BerriAI/litellm#26577 fix(proxy): skip redundant tiktoken recount (merge-after-nits)
- BerriAI/litellm#26576 fix(proxy): load REDIS_* env vars when cache_params has non-connection keys (merge-as-is)
- QwenLM/qwen-code#3667 fix(cli): refresh static header on model switch (merge-after-nits)
- google-gemini/gemini-cli#26005 Fix infinite dialog loop and add ESC support in /skills link (merge-after-nits)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants