fix(proxy): skip redundant tiktoken recount when provider supplies reasoning_tokens#26577
Conversation
…rriAI#25855) Bedrock enforces non-increasing TTL ordering across cache_control blocks (tools → system → messages). The tool cache_control TTL was being unconditionally dropped to the default 5m, while system blocks preserved the user-specified TTL for Claude 4.5+ models. This mismatch caused "a ttl='1h' block must not come after a ttl='5m' block" errors when users set ttl='1h' on both tools and system. Converse path: add_cache_point_tool_block() now accepts a model param and preserves TTL for Claude 4.5+, matching _get_cache_point_block(). Invoke path: _remove_ttl_from_cache_control() now also processes tools (was only processing system and messages). Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…asoning_tokens (BerriAI#26193) stream_chunk_builder unconditionally called ChunkProcessor.count_reasoning_tokens, which recomputes reasoning_tokens via tiktoken.encode() — a C extension that holds the GIL for tens of seconds on large reasoning responses (Claude extended thinking, OpenAI o1/o3, Gemini thinking). The computed value is already discarded by calculate_usage when the provider supplied reasoning_tokens in a streaming usage chunk (see lines applying reasoning_tokens only when completion_tokens_details.reasoning_tokens is None). All major reasoning providers do supply this value, so in the common case the expensive recount runs purely to be thrown away — while blocking the asyncio event loop long enough to fail K8s liveness probes and get pods killed. Adds ChunkProcessor.chunks_have_reasoning_tokens(chunks) which inspects the already-parsed streaming chunks for any completion_tokens_details.reasoning_tokens entry. Short-circuits the call site in stream_chunk_builder when one is found. The slow fallback path via count_reasoning_tokens is preserved for providers that emit reasoning_content without reporting usage. Tests cover: - chunks_have_reasoning_tokens returns True when provider reports the value - returns False when provider omits it (fallback path preserved) - handles plain-dict chunks as well as ModelResponseStream objects - end-to-end: stream_chunk_builder does not invoke count_reasoning_tokens when the provider already supplies reasoning_tokens Fixes BerriAI#26193
Greptile SummaryThis PR optimizes Confidence Score: 4/5Safe to merge; all findings are P2 style concerns with no correctness impact on the tiktoken optimization. The short-circuit logic is correct — calculate_usage confirms provider values take precedence. Bedrock TTL changes are backwards-compatible. Only P2 findings remain. litellm/litellm_core_utils/prompt_templates/factory.py — new Bedrock import outside the llms/ directory.
|
| Filename | Overview |
|---|---|
| litellm/litellm_core_utils/streaming_chunk_builder_utils.py | Adds chunks_have_reasoning_tokens helper that inspects chunks for provider-supplied reasoning_tokens to skip the expensive tiktoken recount; logic is correct and consistent with _calculate_usage_per_chunk patterns |
| litellm/main.py | Call-site updated to short-circuit count_reasoning_tokens via the new helper; fallback preserved for providers that don't emit usage chunks |
| litellm/litellm_core_utils/prompt_templates/factory.py | Propagates model through add_cache_point_tool_block and _bedrock_tools_pt for Claude 4.5+ TTL support; introduces a Bedrock-specific import (is_claude_4_5_on_bedrock) outside the llms/ directory, violating the team's provider isolation rule |
| litellm/llms/bedrock/messages/invoke_transformations/anthropic_claude3_transformation.py | Extends _remove_ttl_from_cache_control to also sanitize tools entries, fixing a TTL ordering gap where system/messages were sanitized but tools were not |
| tests/test_litellm/litellm_core_utils/test_streaming_chunk_builder_utils.py | Four well-structured mock-only tests covering true/false paths, dict chunks, and end-to-end short-circuit via patch.object; no real network calls |
| litellm/llms/bedrock/chat/converse_transformation.py | Passes model parameter to _bedrock_tools_pt in both tool-processing branches so TTL is propagated correctly |
| tests/test_litellm/litellm_core_utils/prompt_templates/test_litellm_core_utils_prompt_templates_factory.py | New tests for add_cache_point_tool_block and _bedrock_tools_pt with TTL covering Claude 4.5, older models, missing model, and no-TTL cases |
| tests/test_litellm/llms/bedrock/messages/invoke_transformations/test_anthropic_claude3_transformation.py | Two new tests verifying tools cache_control TTL is stripped for older models and preserved for Claude 4.5+ |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[stream_chunk_builder called] --> B[ChunkProcessor.chunks_have_reasoning_tokens]
B -->|True: provider supplied reasoning_tokens| C[reasoning_tokens = None]
B -->|False: no usage chunk with reasoning_tokens| D[count_reasoning_tokens via tiktoken]
C --> E[calculate_usage]
D --> E
E --> F{completion_tokens_details from chunks?}
F -->|Yes, reasoning_tokens not None| G[Use provider value directly]
F -->|No details, reasoning_tokens param not None| H[Apply local recount fallback]
G --> I[Final Usage object]
H --> I
Reviews (1): Last reviewed commit: "fix(proxy): skip redundant tiktoken reco..." | Re-trigger Greptile
| def add_cache_point_tool_block( | ||
| tool: dict, model: Optional[str] = None | ||
| ) -> Optional[BedrockToolBlock]: | ||
| from litellm.llms.bedrock.common_utils import is_claude_4_5_on_bedrock |
There was a problem hiding this comment.
Provider-specific import outside
llms/ directory
is_claude_4_5_on_bedrock is a Bedrock-specific utility being imported inside a function in litellm_core_utils/prompt_templates/factory.py. The team's rule is to keep provider-specific logic inside litellm/llms/. The function add_cache_point_tool_block already lives at the edge of this boundary (it returns a BedrockToolBlock), but adding a named Bedrock model check makes it deeper into provider territory. Consider passing a boolean supports_extended_ttl: bool = False from the caller (which already has the model string) instead of importing the Bedrock helper here.
Rule Used: What: Avoid writing provider-specific code outside... (source)
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
- BerriAI/litellm#26577 fix(proxy): skip redundant tiktoken recount (merge-after-nits) - BerriAI/litellm#26576 fix(proxy): load REDIS_* env vars when cache_params has non-connection keys (merge-as-is) - QwenLM/qwen-code#3667 fix(cli): refresh static header on model switch (merge-after-nits) - google-gemini/gemini-cli#26005 Fix infinite dialog loop and add ESC support in /skills link (merge-after-nits)
Re-opening previously closed PR #26245 against the new OSS staging branch (the original target `litellm_oss_branch` was deleted on 2026-04-27 as part of the maintainer branch rotation).
Relevant issues
Fixes #26193
Type
Bug Fix
Problem
`stream_chunk_builder` unconditionally calls `ChunkProcessor.count_reasoning_tokens`, which recomputes `reasoning_tokens` via `tiktoken.encode()` — a C extension that holds the GIL for tens of seconds on large reasoning responses (Claude extended thinking, OpenAI o1/o3, Gemini thinking, 100k+ tokens).
The computed value is already discarded by `calculate_usage` when the provider supplied `reasoning_tokens` in a streaming usage chunk. All major reasoning providers do supply this — so in the common case the expensive recount runs purely to be thrown away while blocking the event loop long enough for K8s to SIGKILL the pod.
Fix
Inspect chunks for `completion_tokens_details.reasoning_tokens` first. If present, skip the local tiktoken recount entirely. Slow-path fallback preserved for providers that emit `reasoning_content` without reporting usage.
New helper `ChunkProcessor.chunks_have_reasoning_tokens(chunks)` matches the existing chunk-parsing patterns in `_calculate_usage_per_chunk`. Call-site:
```python
if processor.chunks_have_reasoning_tokens(chunks):
reasoning_tokens = None
else:
reasoning_tokens = processor.count_reasoning_tokens(response)
```
Tests
Four new tests in `tests/test_litellm/litellm_core_utils/test_streaming_chunk_builder_utils.py`:
Scope notes
No API changes. No new config flags. Doesn't address misbehaving providers that emit `reasoning_content` without any usage chunk — that path still hits the slow tiktoken call and could be addressed in a follow-up via `asyncio.to_thread` at the async caller.
Pre-Submission checklist