Studio: surface prompt-cache token counts in /v1/chat/completions usage chunk #5670
Studio's Anthropic and OpenAI Responses proxies already capture
cache_creation_input_tokens, cache_read_input_tokens (Anthropic) and
input_tokens_details.cached_tokens (OpenAI), but they were only written
to the structlog stream. Browser and SDK clients had no way to compute
"how many tokens hit the prompt cache" without scraping the server log,
so the chat UI could not show users how much money the cache was
saving on each turn.
This change emits one extra OpenAI include_usage-style chunk
(choices: [] with a populated usage block) just before the existing
[DONE] for Anthropic and after the final finish_reason chunk for
OpenAI Responses (both response.completed and response.incomplete).
The chunk shape:
- usage.prompt_tokens_details.cached_tokens: normalised cache-read count, present for both providers.
- usage.cache_creation_input_tokens: Anthropic-only; tokens billed at the cache-write premium.
- usage.cache_read_input_tokens: Anthropic-only; same value as cached_tokens, kept for callers that already key off the native Anthropic name.
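On the client side, reading the chunk described above could look roughly like the sketch below. It assumes the stream arrives as plain SSE `data:` lines and that the usage chunk is the only one with an empty `choices` list; the function name `extract_usage` is hypothetical and not part of Studio.

```python
import json


def extract_usage(sse_lines):
    """Return the usage dict from the usage-only chunk, or None if none arrived.

    Hypothetical client-side helper: scans SSE lines for a chunk whose
    ``choices`` list is empty and whose ``usage`` field is a dict.
    """
    usage = None
    for raw in sse_lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip comments, blank keep-alives, other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        try:
            chunk = json.loads(payload)
        except json.JSONDecodeError:
            continue  # ignore frames that are not valid JSON
        if not isinstance(chunk, dict):
            continue
        if chunk.get("choices") == [] and isinstance(chunk.get("usage"), dict):
            usage = chunk["usage"]
    return usage
```

A caller would then read `(usage.get("prompt_tokens_details") or {}).get("cached_tokens", 0)` for either provider, and fall back to showing nothing when the helper returns `None`.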
Smoke verified end to end against a live Studio (claude-haiku-4-5
and gpt-4o-mini) plus 7 new unit tests on the helper and the two
streaming paths.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 03500d34be
    "prompt_tokens": prompt_tokens,
    "completion_tokens": completion_tokens,
    "total_tokens": prompt_tokens + completion_tokens,
Include Anthropic cache tokens in prompt/total usage counts
prompt_tokens is populated from usage.input_tokens alone, and total_tokens is derived from that value plus output_tokens; for Anthropic this undercounts cached turns because cache_creation_input_tokens and cache_read_input_tokens are separate input buckets that still belong to total prompt usage. In cache-heavy conversations this will report much smaller prompt/total numbers than actually used, so downstream context/cost displays fed by this chunk become inaccurate. Compute Anthropic prompt/total with all three input components (while still exposing the native cache fields).
Code Review
This pull request introduces a mechanism to surface prompt-cache accounting to clients by emitting OpenAI-style usage SSE chunks during streaming for both Anthropic and OpenAI providers. A new helper function, _build_usage_chunk, was implemented and integrated into the streaming response paths, supported by a comprehensive suite of unit and integration tests. Feedback was provided regarding logic duplication within the _build_usage_chunk function, suggesting a refactor to extract common token fields and improve maintainability.
    Returns ``None`` when there are no usage numbers to report (e.g. an
    upstream error before ``message_start`` / ``response.completed``).
    """
    if not isinstance(last_usage, dict):
        return None

    if provider == "anthropic":
        prompt_tokens = last_usage.get("input_tokens") or 0
        completion_tokens = last_usage.get("output_tokens") or 0
        cache_creation = last_usage.get("cache_creation_input_tokens") or 0
        cache_read = last_usage.get("cache_read_input_tokens") or 0
        if not (prompt_tokens or completion_tokens or cache_creation or cache_read):
            return None
        usage_block: dict[str, Any] = {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
            "prompt_tokens_details": {"cached_tokens": cache_read},
            "cache_creation_input_tokens": cache_creation,
            "cache_read_input_tokens": cache_read,
        }
    else:
        prompt_tokens = last_usage.get("input_tokens") or 0
        completion_tokens = last_usage.get("output_tokens") or 0
        cached = 0
        details = last_usage.get("input_tokens_details")
        if isinstance(details, dict):
            cached = details.get("cached_tokens") or 0
        if not (prompt_tokens or completion_tokens or cached):
            return None
The _build_usage_chunk function contains significant logic duplication between the anthropic and openai branches. Specifically, the extraction of prompt_tokens and completion_tokens, as well as the calculation of total_tokens, are identical. Refactoring this to extract common fields first would improve maintainability and reduce the risk of future inconsistencies.
    prompt_tokens = last_usage.get("input_tokens") or 0
    completion_tokens = last_usage.get("output_tokens") or 0
    usage_block: dict[str, Any] = {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }
    if provider == "anthropic":
        cache_creation = last_usage.get("cache_creation_input_tokens") or 0
        cache_read = last_usage.get("cache_read_input_tokens") or 0
        if not (prompt_tokens or completion_tokens or cache_creation or cache_read):
            return None
        usage_block.update({
            "prompt_tokens_details": {"cached_tokens": cache_read},
            "cache_creation_input_tokens": cache_creation,
            "cache_read_input_tokens": cache_read,
        })
    else:
        cached = 0
        details = last_usage.get("input_tokens_details")
        if isinstance(details, dict):
            cached = details.get("cached_tokens") or 0
        if not (prompt_tokens or completion_tokens or cached):
            return None
        usage_block["prompt_tokens_details"] = {"cached_tokens": cached}

References
- When a condition or calculated value is used across multiple conditional branches, compute it once and reuse the result to ensure consistency and improve maintainability.
Anthropic's `input_tokens` field excludes the cache buckets -- the real prompt size is `input_tokens + cache_creation_input_tokens + cache_read_input_tokens`. Previously the new usage chunk reported only `input_tokens` as `prompt_tokens`, which heavily undercounted cache-hit turns (e.g. an 18.9k-token cache_read turn looked like an 8-token prompt) and broke any downstream context / cost display fed by `prompt_tokens` or `total_tokens`. Fix `_build_usage_chunk` to sum all three input buckets for the Anthropic provider while keeping the OpenAI Responses path unchanged (OpenAI already folds cached tokens into `input_tokens`). The native `cache_creation_input_tokens` / `cache_read_input_tokens` keys and `prompt_tokens_details.cached_tokens` mirror are still emitted, so clients keep full visibility of the cache split. Tests updated to assert the summed shape.
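The summed accounting described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual `_build_usage_chunk`: the function name `anthropic_usage_block` is hypothetical, and only the Anthropic branch is shown.

```python
from typing import Any, Optional


def anthropic_usage_block(last_usage: Any) -> Optional[dict]:
    """Build an OpenAI-style usage block from Anthropic usage numbers.

    Hypothetical sketch: sums the three Anthropic input buckets into
    ``prompt_tokens`` and returns None when nothing was reported.
    """
    if not isinstance(last_usage, dict):
        return None

    fresh = last_usage.get("input_tokens") or 0
    cache_creation = last_usage.get("cache_creation_input_tokens") or 0
    cache_read = last_usage.get("cache_read_input_tokens") or 0
    completion_tokens = last_usage.get("output_tokens") or 0

    # Anthropic reports the cache buckets separately from input_tokens,
    # so the real prompt size is the sum of all three.
    prompt_tokens = fresh + cache_creation + cache_read
    if not (prompt_tokens or completion_tokens):
        return None

    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
        "prompt_tokens_details": {"cached_tokens": cache_read},
        "cache_creation_input_tokens": cache_creation,
        "cache_read_input_tokens": cache_read,
    }
```

For the turn mentioned above (`input_tokens=8`, `cache_read_input_tokens=18901`, and assuming no cache writes), this yields `prompt_tokens = 8 + 0 + 18901 = 18909` rather than 8.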
…unslothai#5690)

* Studio: per-session cost calculator + /api/providers/pricing endpoint

Neither the Anthropic Messages API nor the OpenAI Responses API reports a `cost` field on the response. Both expose detailed token counts (input, output, cache hits, server-tool invocations); pricing multipliers live in the provider docs. The frontend's "cost so far" display was impossible without scraping the server log.

Land the math + a snapshot endpoint so the cost calculator can run client-side from the existing usage chunk plumbing. The actual UI hookup belongs in a frontend follow-up (and is gated on PR unslothai#5670's usage-chunk emission landing so the frontend sees the usage block in the first place).

Changes:

- New `core/inference/pricing.py` with:
  - Per-MTok base pricing tables for every active Anthropic and gpt-5.x family member. Dated snapshots inherit the canonical-id price via prefix match so future snapshots cost the same as the canonical id until pricing changes.
  - Shared multipliers for Anthropic cache writes (5m: 1.25x, 1h: 2x) and reads (0.1x); OpenAI cache reads (0.1x); Anthropic server tool surcharges ($10 / 1k web_search, $0.05 / hour code_exec beyond the 50-hour daily free tier).
  - `calculate_cost(provider, model, usage)` returns a per-turn USD breakdown plus billable token counts, with priced=False for unknown models so the UI can still render token counts.
  - `pricing_snapshot()` returns the whole table for the frontend so it doesn't re-implement the multipliers.
- New `GET /api/providers/pricing` returning the snapshot, scoped behind the existing auth dependency.
- New `backend/tests/test_pricing.py` with 12 cases pinning the math against documented values: base input/output multiplication, 5m / 1h / read multipliers, default-to-5m fallback when the breakdown is absent, web_search per-1k pricing, code_execution per-hour pricing, dated-snapshot fallback, OpenAI cache-read discount accounting (cached tokens subtracted from full-price bucket and re-billed at 0.1x), unknown model graceful-degrade, and the snapshot endpoint shape.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Studio: verified OpenAI pricing + fix billable input double-count

Address the cost-calculator review:

- OpenAI prices were 2-6x under the actual published rates. Cross-checked the live developers.openai.com/api/docs/pricing page and replaced every entry. gpt-5.5 is 5/30, gpt-5.5-pro is 30/180, gpt-5.4 is 2.5/15, gpt-5.4-mini 0.75/4.5, gpt-5.4-nano 0.20/1.25, gpt-5.3-codex 1.75/14. Added chat-latest alias to the canonical chat-snapshot rate. Dropped o3 / o4 / gpt-4.5 rows that are no longer listed on the page; calculator returns priced=False instead of silently billing at zero.
- billable_input_tokens was double-counting cached tokens for OpenAI. Anthropic excludes cache_* buckets from input_tokens so we add them; OpenAI folds cache_read_input_tokens into input_tokens already, so the tooltip read 1.8M for a 1.0M bill. Branched the math by provider and added a regression test.

Sourcing notes in the module docstring updated.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Address review: canonical 4.5 ids, long-context tier, OpenAI tool fees

Three Codex P1 follow-ups on the cost calculator:

1. Canonical Anthropic 4.5 ids missing from ANTHROPIC_PRICING. claude-opus-4-5 / claude-sonnet-4-5 / claude-haiku-4-5 (no date suffix) are the ids used by backend defaults (PROVIDER_REGISTRY['anthropic'].default_models), but the table only had the dated forms. _lookup's prefix fallback doesn't help because the canonical id is SHORTER than the dated key, so str.startswith goes the wrong way and the calculator returned priced=False + zero cost. Added the canonical aliases for opus-4-5, sonnet-4-5, haiku-4-5, and opus-4-1.
2. OpenAI long-context tier. gpt-5.5 and gpt-5.4 cross over at 272k input tokens to a 2x input / 1.5x output rate (gpt-5.5: $5/$30 -> $10/$45; gpt-5.4: $2.50/$15 -> $5/$22.50). Turns past the threshold were systematically undercounted at headline rates. Added long_context_threshold / long_context_input_per_mtok / long_context_output_per_mtok columns and a tier-selection step in calculate_cost; model_priced gains a "(long-context >272000)" suffix when the higher tier applies so the tooltip can show which rate was used. gpt-5.5-pro / gpt-5.4-pro / mini / nano / codex have no published long-context tier today, so they keep a single rate.
3. OpenAI server-tool surcharges. web_search is $10/1000 calls and the hosted shell container is $0.03 per 20-minute session on the default 1g tier (~$0.09/hr). server_tools_usd was previously stuck at 0.0 for OpenAI even when web_search and shell tools fired, so sessions with tool use understated cost. Added OPENAI_WEB_SEARCH_USD_PER_1K and OPENAI_CONTAINER_USD_PER_HOUR constants plus a parallel of the Anthropic surcharge block that reads counts from usage["openai_tool_use"]. The SSE translator wires the counts in a follow-up commit; the calculator is now ready for them. pricing_snapshot also exposes both constants so the frontend tooltip can render the per-call rate.

Existing tests updated to stay in the short-context tier where they were testing base rates; new tests pin canonical 4.5 lookups, long-context crossover on gpt-5.5/gpt-5.4, the absence of crossover on mini/nano/codex, and OpenAI tool surcharges (web_search, container hours, combined total).

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
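The input-side cost math in the commit message above can be illustrated with a short sketch. It is not the contents of `core/inference/pricing.py`: the function names and signatures are hypothetical, cache writes are assumed to use the 5m rate, and applying the cache-read discount on top of the long-context rate is an assumption the commit message does not confirm.

```python
# Multipliers quoted in the commit message above.
ANTHROPIC_CACHE_WRITE_5M = 1.25
ANTHROPIC_CACHE_READ = 0.1
OPENAI_CACHE_READ = 0.1


def anthropic_input_cost_usd(usage: dict, input_per_mtok: float) -> float:
    """Input cost for one Anthropic turn (hypothetical helper).

    Anthropic excludes the cache buckets from input_tokens, so each
    bucket is priced separately and the results are added.
    """
    fresh = usage.get("input_tokens") or 0
    written = usage.get("cache_creation_input_tokens") or 0
    read = usage.get("cache_read_input_tokens") or 0
    weighted = (
        fresh
        + written * ANTHROPIC_CACHE_WRITE_5M
        + read * ANTHROPIC_CACHE_READ
    )
    return weighted * input_per_mtok / 1_000_000


def openai_input_cost_usd(
    usage: dict,
    input_per_mtok: float,
    long_context_threshold=None,
    long_context_input_per_mtok=None,
) -> float:
    """Input cost for one OpenAI Responses turn (hypothetical helper).

    OpenAI already folds cached tokens into input_tokens, so they are
    subtracted from the full-price bucket and re-billed at 0.1x.
    """
    total = usage.get("input_tokens") or 0
    details = usage.get("input_tokens_details") or {}
    cached = details.get("cached_tokens") or 0

    rate = input_per_mtok
    # Tier selection: switch to the long-context rate past the threshold.
    if long_context_threshold is not None and total > long_context_threshold:
        rate = long_context_input_per_mtok

    full_price = total - cached
    return (full_price + cached * OPENAI_CACHE_READ) * rate / 1_000_000
```

With the gpt-5.5 input rates quoted above ($5 per MTok, $10 past 272k), a 100,000-token prompt with 50,000 cached tokens costs (50,000 + 5,000) × 5 / 1,000,000 = $0.275 on the input side, while an uncached 300,000-token prompt is billed at the long-context rate: 300,000 × 10 / 1,000,000 = $3.00.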
Summary
`_stream_anthropic` and `_stream_openai_responses` already capture `cache_creation_input_tokens`, `cache_read_input_tokens` (Anthropic) and `input_tokens_details.cached_tokens` (OpenAI Responses) on `last_usage` but only write them to the structlog stream. Browser and SDK clients had no way to see how many tokens the prompt cache absorbed, so the chat UI cannot show users their per-turn cache savings without scraping the server log.

This change emits one extra OpenAI `include_usage`-style chunk (`choices: []` with a populated `usage` block) just before `[DONE]` for Anthropic and after the final `finish_reason` chunk for OpenAI Responses (both `response.completed` and `response.incomplete`).

Chunk shape
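As an illustration only, an Anthropic usage chunk of this shape might look like the payload below. The token values are made up for the example (apart from the 18901 cache-read count reported in this PR), they assume the summed Anthropic accounting from the follow-up commit, and other envelope fields that a real chunk carries are omitted.

```python
import json

# Illustrative payload, not captured output: choices is empty and the
# usage block carries both the normalised and the native cache keys.
usage_chunk = {
    "choices": [],
    "usage": {
        "prompt_tokens": 18909,        # 8 fresh + 0 written + 18901 read
        "completion_tokens": 120,      # assumed value for the example
        "total_tokens": 19029,
        "prompt_tokens_details": {"cached_tokens": 18901},
        "cache_creation_input_tokens": 0,
        "cache_read_input_tokens": 18901,
    },
}

# Serialised as one SSE frame, as it would appear before [DONE].
sse_frame = f"data: {json.dumps(usage_chunk)}\n\n"
```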
`prompt_tokens_details.cached_tokens` is the normalised cross-provider key so existing OpenAI SDK callers that already read `prompt_tokens_details` keep working. The Anthropic-only `cache_creation_input_tokens` / `cache_read_input_tokens` keys are kept on the same `usage` dict for callers that already key off the native Anthropic names.

The helper returns `None` (suppresses the chunk) when upstream errored before any usage event arrived, so failed turns do not show a misleading "0 tokens" line.

Verification
End-to-end against a live Studio routed to api.anthropic.com and api.openai.com:

A 3-turn Opus 4.7 (xhigh) run with `enabled_tools=["web_search","code_execution"]` returned `cache_read_input_tokens=18901` on turn 2, matching the value already logged by the existing structlog line ("Anthropic stream complete ... cache_read_input_tokens=18901").

Test plan
- `tests/test_external_provider_usage_chunk.py` covering the helper alone and the two streaming integrations (completed + incomplete for OpenAI).
- `test_anthropic_code_execution.py`, `test_openai_code_execution.py`, `test_openai_responses_translation.py`, `test_anthropic_messages.py`, `test_anthropic_thinking_translation.py`, `test_openai_tool_passthrough.py`, `test_responses_tool_passthrough.py`, `test_responses_api.py` still pass.