Studio: per-session cost calculator + /api/providers/pricing endpoint by danielhanchen · Pull Request #5690 · unslothai/unsloth

danielhanchen · 2026-05-22T10:05:01Z

Summary

Neither the Anthropic Messages API nor the OpenAI Responses API reports a cost field on the response. Both expose detailed token counts (input, output, cache hits, server-tool invocations); pricing multipliers live in the provider docs. The frontend's "cost so far" display was impossible without scraping the server log.

This PR lands the math + a snapshot endpoint so the cost calculator can run client-side from the existing usage chunk plumbing. The UI hookup belongs in a frontend follow-up (and is gated on PR #5670's usage-chunk emission landing so the frontend actually sees the usage block).

Changes

New core/inference/pricing.py:
- Per-MTok base pricing tables for every active Anthropic (claude-opus-4-7 down to deprecated 4.0 family) and gpt-5.x family member. Dated snapshots inherit the canonical-id price via prefix match so future snapshots cost the same until pricing officially changes.
- Shared multipliers: Anthropic cache writes (5m: 1.25x, 1h: 2x), reads (0.1x); OpenAI cache reads (0.1x); Anthropic server-tool surcharges ($10 / 1k web_search, $0.05 / hour code execution beyond the 50-hour daily free tier).
- calculate_cost(provider, model, usage) returns a per-turn USD breakdown:
```
{
  "input_usd": 0.0042, "output_usd": 0.012,
  "cache_write_usd": 0.0001, "cache_read_usd": 0.0008,
  "server_tools_usd": 0.01, "total_usd": 0.0271,
  "billable_input_tokens": 5023,
  "billable_output_tokens": 480,
  "model_priced": "claude-opus-4-7",
  "priced": true
}
```
  Unknown models -> priced: false and zeros, but token counts still report.
- pricing_snapshot() returns the whole table for the frontend.
New GET /api/providers/pricing returning the snapshot, scoped behind the existing auth dependency.
backend/tests/test_pricing.py with 12 cases pinning the math: base I/O multiplication, 5m / 1h / read multipliers, default-to-5m fallback when no breakdown, web_search per-1k, code_execution per-hour, dated-snapshot prefix fallback, OpenAI cache-read discount accounting (cached tokens get subtracted from the full-price bucket and re-billed at 0.1x), unknown-model graceful-degrade, snapshot endpoint shape.

Verification

All 12 tests pass.
Pricing values verified live against the publicly documented per-MTok rates on the Claude models overview and OpenAI pricing page.

Sequencing

This PR is independent and can land standalone (the endpoint is useful for any client that wants to compute cost from a usage payload).
The frontend "Cost: $0.024 this turn / $1.27 session" display is a follow-up gated on PR Studio: surface prompt-cache token counts in /v1/chat/completions usage chunk #5670 (Studio: surface prompt-cache token counts in /v1/chat/completions usage chunk).

Test plan

CI green on backend test suite.
Manual: curl /api/providers/pricing with a valid auth token returns the structured table.
Once Studio: surface prompt-cache token counts in /v1/chat/completions usage chunk #5670 lands, follow-up frontend PR consumes both the usage chunk and this pricing snapshot to render a cost pill in the chat footer.

Neither the Anthropic Messages API nor the OpenAI Responses API reports a `cost` field on the response. Both expose detailed token counts (input, output, cache hits, server-tool invocations); pricing multipliers live in the provider docs. The frontend's "cost so far" display was impossible without scraping the server log. Land the math + a snapshot endpoint so the cost calculator can run client-side from the existing usage chunk plumbing. The actual UI hookup belongs in a frontend follow-up (and is gated on PR #5670's usage-chunk emission landing so the frontend sees the usage block in the first place). Changes: - New `core/inference/pricing.py` with: - Per-MTok base pricing tables for every active Anthropic and gpt-5.x family member. Dated snapshots inherit the canonical-id price via prefix match so future snapshots cost the same as the canonical id until pricing changes. - Shared multipliers for Anthropic cache writes (5m: 1.25x, 1h: 2x) and reads (0.1x); OpenAI cache reads (0.1x); Anthropic server tool surcharges ($10 / 1k web_search, $0.05 / hour code_exec beyond the 50-hour daily free tier). - `calculate_cost(provider, model, usage)` returns a per-turn USD breakdown plus billable token counts, with priced=False for unknown models so the UI can still render token counts. - `pricing_snapshot()` returns the whole table for the frontend so it doesn't re-implement the multipliers. - New `GET /api/providers/pricing` returning the snapshot, scoped behind the existing auth dependency. - New `backend/tests/test_pricing.py` with 12 cases pinning the math against documented values: base input/output multiplication, 5m / 1h / read multipliers, default-to-5m fallback when the breakdown is absent, web_search per-1k pricing, code_execution per-hour pricing, dated-snapshot fallback, OpenAI cache-read discount accounting (cached tokens subtracted from full-price bucket and re-billed at 0.1x), unknown model graceful-degrade, and the snapshot endpoint shape.

for more information, see https://pre-commit.ci

gemini-code-assist

Code Review

This pull request introduces a pricing module and an API endpoint to calculate and expose per-token costs for Anthropic and OpenAI models. The implementation includes static pricing tables, a cost calculation helper that handles provider-specific logic such as prompt caching and tool surcharges, and comprehensive unit tests. Review feedback identifies several critical issues: the prefix matching logic for model lookups is sensitive to key order and should prioritize longer matches; the OpenAI usage extraction uses incorrect keys (input_tokens instead of prompt_tokens); and cached tokens are currently double-counted for OpenAI in the billable input total. Additionally, the return type hint for the cost calculation function needs to be updated to reflect the mixed types in the output dictionary.

gemini-code-assist · 2026-05-22T10:07:49Z

+    # Fall back to a prefix match so date-suffixed snapshots
+    # ("gpt-5.5-2026-04-23") inherit the canonical-id prices.
+    for key, val in table.items():


The prefix matching logic is sensitive to the order of keys in the pricing tables. Since gpt-5.4 appears before gpt-5.4-mini in OPENAI_PRICING, a model ID like gpt-5.4-mini-2026 will incorrectly match the gpt-5.4 entry instead of the more specific one. The matching should prioritize the longest prefix to ensure accuracy.

Suggested change

# Fall back to a prefix match so date-suffixed snapshots

# ("gpt-5.5-2026-04-23") inherit the canonical-id prices.

for key, val in table.items():

for key in sorted(table.keys(), key=len, reverse=True):

if model.startswith(key):

return table[key]

gemini-code-assist · 2026-05-22T10:07:49Z

+    }
+
+    input_tokens = int(usage.get("input_tokens") or 0)
+    output_tokens = int(usage.get("output_tokens") or 0)
+    cache_creation = int(usage.get("cache_creation_input_tokens") or 0)
+    cache_read = int(usage.get("cache_read_input_tokens") or 0)
+    # OpenAI Responses reports cached tokens under input_tokens_details:
+    if provider == "openai":
+        details = usage.get("input_tokens_details") or {}


The code uses input_tokens and input_tokens_details for OpenAI usage extraction. However, the standard OpenAI API response uses prompt_tokens and prompt_tokens_details. To ensure compatibility, use a helper like _is_openai_compatible() and route to provider-specific extraction logic, as these keys will otherwise fail to extract the correct token counts for OpenAI models.

References

When proxying chat completions to multiple LLM providers, always check for OpenAI compatibility and route to provider-specific implementations if the standard endpoint is not supported.

Avoid premature abstraction. If similar logic across different providers feeds off different data shapes, keep them separate if centralization would require complex normalization.

gemini-code-assist · 2026-05-22T10:07:49Z

+    if provider == "openai":
+        details = usage.get("input_tokens_details") or {}
+        if isinstance(details, dict):
+            cache_read = max(cache_read, int(details.get("cached_tokens") or 0))


For OpenAI, the input_tokens (prompt tokens) count already includes cached tokens. Adding cache_read again here results in double-counting for the billable_input_tokens field. This should be handled conditionally based on the provider, as Anthropic and OpenAI report these counts differently.

Suggested change

cache_read = max(cache_read, int(details.get("cached_tokens") or 0))

out["billable_input_tokens"] = input_tokens if provider == "openai" else (input_tokens + cache_creation + cache_read)

References

Avoid premature abstraction. If similar logic across different providers feeds off different data shapes, keep them separate if centralization would require complex normalization and reduce clarity.

gemini-code-assist · 2026-05-22T10:07:49Z

+
+def calculate_cost(
+    provider: str,
+    model: str,


The return type hint dict[str, float] is inaccurate because the returned dictionary contains mixed types, including int (billable_input_tokens), str (model_priced), and bool (priced). This will cause type-checking errors in strict environments.

Suggested change

model: str,

) -> dict[str, Any]:

References

When applying type hints to a method, ensure consistency by applying similar hints to other methods that share the same signature pattern or purpose.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6738fc07fb

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-22T10:08:21Z

+    for key, val in table.items():
+        if model.startswith(key):


Match the longest model prefix before pricing

The prefix fallback returns the first matching key, which misprices dated snapshots when one model id is a prefix of another (for example, gpt-5.5-pro-2026-04-23 matches gpt-5.5 before gpt-5.5-pro, and gpt-5.4-mini-... matches gpt-5.4 before gpt-5.4-mini). This produces materially wrong cost estimates for pro/mini/nano variants; the fallback should choose the most specific (longest) matching prefix.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-22T10:08:21Z

+ANTHROPIC_PRICING: dict[str, dict[str, float]] = {
+    "claude-opus-4-7": {"input_per_mtok": 5.0, "output_per_mtok": 25.0},
+    "claude-opus-4-6": {"input_per_mtok": 5.0, "output_per_mtok": 25.0},
+    "claude-opus-4-5-20251101": {"input_per_mtok": 5.0, "output_per_mtok": 25.0},


Add canonical Anthropic 4.5 model ids to pricing table

This table only includes dated 4.5 ids (for example claude-opus-4-5-20251101) and omits canonical ids like claude-opus-4-5 / claude-sonnet-4-5 / claude-haiku-4-5 that are used elsewhere in the backend defaults, so lookups for those common ids resolve as unpriced and the calculator returns zero cost (priced=false). Include canonical aliases (or normalize model ids) to avoid silently dropping cost for supported Anthropic models.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-22T10:08:21Z

+        if isinstance(details, dict):
+            cache_read = max(cache_read, int(details.get("cached_tokens") or 0))
+
+    out["billable_input_tokens"] = input_tokens + cache_creation + cache_read


Avoid double-counting OpenAI cached tokens in billable total

For OpenAI, input_tokens already includes cached tokens (read from input_tokens_details.cached_tokens), but billable_input_tokens adds cache_read on top of input_tokens, inflating the token total shown to clients whenever cache hits occur. In a typical case like 1,000,000 input tokens with 800,000 cached, this reports 1,800,000 billable input tokens even though only 1,000,000 were actually processed in total.

Useful? React with 👍 / 👎.

Address the cost-calculator review: - OpenAI prices were 2-6x under the actual published rates. Cross-checked the live developers.openai.com/api/docs/pricing page and replaced every entry. gpt-5.5 is 5/30, gpt-5.5-pro is 30/180, gpt-5.4 is 2.5/15, gpt-5.4-mini 0.75/4.5, gpt-5.4-nano 0.20/1.25, gpt-5.3-codex 1.75/14. Added chat-latest alias to the canonical chat-snapshot rate. Dropped o3 / o4 / gpt-4.5 rows that are no longer listed on the page; calculator returns priced=False instead of silently billing at zero. - billable_input_tokens was double-counting cached tokens for OpenAI. Anthropic excludes cache_* buckets from input_tokens so we add them; OpenAI folds cache_read_input_tokens into input_tokens already, so the tooltip read 1.8M for a 1.0M bill. Branched the math by provider and added a regression test. Sourcing notes in the module docstring updated.

for more information, see https://pre-commit.ci

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4a9f8d6633

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-22T10:40:00Z

+    base = prices["input_per_mtok"]
+    out_per = prices["output_per_mtok"]
+
+    out["input_usd"] = (input_tokens / 1_000_000.0) * base
+    out["output_usd"] = (output_tokens / 1_000_000.0) * out_per


Price OpenAI long-context requests at long-context rates

This calculation always uses a single input_per_mtok / output_per_mtok pair, but OpenAI now publishes separate short-context and long-context prices for the same model family. For requests that cross into long-context billing (for example on gpt-5.5/gpt-5.4), total_usd is systematically undercounted because the higher long-context rates are never selected.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-22T10:40:00Z

+    else:
+        # OpenAI: cache writes share the base input price (no premium).
+        # Only cache reads get the 0.1x multiplier; subtract those from
+        # the input_usd we already counted so we don't double-bill.
+        # Anthropic excludes cache buckets from input_tokens, but


Include OpenAI built-in tool fees in total_usd

The OpenAI branch only adjusts token charges for cached input and never adds any built-in tool surcharges, so turns that use tools (e.g., web search or hosted code execution/containers) report a materially low total_usd even though those calls are billed separately. Because server_tools_usd stays at 0.0 for OpenAI, sessions with tool use will consistently understate cost.

Useful? React with 👍 / 👎.

Three Codex P1 follow-ups on the cost calculator: 1. Canonical Anthropic 4.5 ids missing from ANTHROPIC_PRICING. claude-opus-4-5 / claude-sonnet-4-5 / claude-haiku-4-5 (no date suffix) are the ids used by backend defaults (PROVIDER_REGISTRY['anthropic'].default_models), but the table only had the dated forms. _lookup's prefix fallback doesn't help because the canonical id is SHORTER than the dated key, so str.startswith goes the wrong way and the calculator returned priced=False + zero cost. Added the canonical aliases for opus-4-5, sonnet-4-5, haiku-4-5, and opus-4-1. 2. OpenAI long-context tier. gpt-5.5 and gpt-5.4 cross over at 272k input tokens to a 2x input / 1.5x output rate (gpt-5.5: $5/$30 -> $10/$45; gpt-5.4: $2.50/$15 -> $5/$22.50). Turns past the threshold were systematically undercounted at headline rates. Added long_context_threshold / long_context_input_per_mtok / long_context_output_per_mtok columns and a tier-selection step in calculate_cost; model_priced gains a "(long-context >272000)" suffix when the higher tier applies so the tooltip can show which rate was used. gpt-5.5-pro / gpt-5.4-pro / mini / nano / codex have no published long-context tier today, so they keep a single rate. 3. OpenAI server-tool surcharges. web_search is $10/1000 calls and the hosted shell container is $0.03 per 20-minute session on the default 1g tier (~$0.09/hr). server_tools_usd was previously stuck at 0.0 for OpenAI even when web_search and shell tools fired, so sessions with tool use understated cost. Added OPENAI_WEB_SEARCH_USD_PER_1K and OPENAI_CONTAINER_USD_PER_HOUR constants plus a parallel of the Anthropic surcharge block that reads counts from usage["openai_tool_use"]. The SSE translator wires the counts in a follow-up commit; the calculator is now ready for them. pricing_snapshot also exposes both constants so the frontend tooltip can render the per-call rate. Existing tests updated to stay in the short-context tier where they were testing base rates; new tests pin canonical 4.5 lookups, long-context crossover on gpt-5.5/gpt-5.4, the absence of crossover on mini/nano/codex, and OpenAI tool surcharges (web_search, container hours, combined total).

for more information, see https://pre-commit.ci

…hat-style usage keys) (unslothai#5722) * Studio: longest-prefix pricing match + accept chat-style usage keys Two P1 / High follow-ups from PR 5690 review feedback: 1. Pricing prefix lookup returned the first key it iterated, so dated snapshots like ``gpt-5.4-mini-2026-04-23`` collided with the shorter ``gpt-5.4`` entry and overbilled by 3x+. Sort the table keys longest-first so the most specific entry wins. 2. ``calculate_cost`` only read ``input_tokens`` / ``output_tokens``, but Studio's OpenAI-Chat-style usage envelope re-emits ``prompt_tokens`` / ``completion_tokens`` (the OpenAI Chat Completions vocabulary). Callers handing in the chat-style shape silently got a zeroed bill. Accept either pair so the calculator works against both raw upstream usage and the Studio-translated envelope. Tests (4 new in test_pricing.py): dated mini/pro snapshots inherit the right rate; chat-style usage keys price correctly; raw key wins when both shapes are present. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Studio: dedupe cache buckets when costing chat-style Anthropic usage When the caller hands in Studio's chat-style envelope (``prompt_tokens`` emitted by ``_build_usage_chunk``) for Anthropic, that value already folds ``cache_creation_input_tokens`` + ``cache_read_input_tokens`` into the total. The previous follow-up accepted the chat-style key but then re-added both cache buckets in ``billable_input_tokens`` and ``input_usd``, double-counting cache tokens on every Anthropic chat-style call. Detect which envelope landed (``input_tokens`` present = raw upstream; absent + ``prompt_tokens`` present = Studio chat-style) and peel the cache buckets off for Anthropic before the downstream math so both envelopes produce identical costs. OpenAI: ``input_tokens`` and Studio's ``prompt_tokens`` both already include ``cache_read`` and exclude any notional ``cache_creation``, so the OpenAI path stays a straight passthrough. Tests (2 new): both envelopes match for Anthropic on a triple (uncached + cache_creation + cache_read); OpenAI envelopes match on a cached-tokens fixture. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Studio: prefer raw output_tokens over chat-style completion_tokens Codex flagged that the previous fallback chain 'usage.get("output_tokens") or usage.get("completion_tokens")' treats an explicit 0 as missing -- a mixed-envelope payload where 'output_tokens' is 0 but 'completion_tokens' is non-zero (or stale) bills the wrong amount. Mirror the has_input_tokens precedence pattern: when the raw key is present we use it even at 0; otherwise fall back to completion_tokens. * Studio: read OpenAI cached tokens from prompt_tokens_details too Codex flagged that the chat-style OpenAI envelope Studio re-emits via _build_usage_chunk surfaces cached prompt tokens under prompt_tokens_details.cached_tokens, not input_tokens_details. The OpenAI branch only checked input_tokens_details, so a cache-heavy chat-style turn billed every cached token at the full input rate instead of the 0.1x cache_read discount. Walk both keys when discovering the cached count. New regression test pins that the two envelopes price identically for a turn with 80k of 100k tokens cached. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Studio: tighten pricing prefix match + clamp corrupt usage Three follow-ups on the longest-prefix pricing match landed in this PR: - Prefix match now requires a dash boundary or end-of-string. The longest-key sort alone still falsely landed "claude-opus-4-15" on the "claude-opus-4-1" row, and "gpt-5.5-prod" on the "gpt-5.5-pro" row (a 6x overcharge). Demanding the next character be "-" rules out the lookalikes while keeping dated snapshots ("gpt-5.4-mini-2026-04-23", "claude-opus-4-7-20260414") landing on their canonical row. - Clamp every token count to >= 0. A corrupted upstream payload (negative cached count, off-by-one in a fixture) could previously produce a negative bill that masked real spend in the session total tooltip. - Tolerate a non-dict "cache_creation" (e.g. an upstream proxy folded the field down to a single int). The current code raised AttributeError mid-turn; now it falls back to the 5m-default bucket so the rest of the cost calculation still runs. Adds tests/test_pricing_edge.py with 20 adversarial cases covering the boundary check, negative / None / zero token values across both envelopes, cache_read > prompt corruption, the OpenAI long-context threshold crossover on cache-inflated billable input, malformed sub-objects, and unknown-provider degradation. Combined suite is 51 tests, all green. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Surface Anthropic cache-read fallback and forward 1h breakdown Two correctness gaps surfaced on the chat-style usage envelope: 1) Anthropic cache_read fell through to "uncached input" pricing when the envelope arrived without the native ``cache_read_input_tokens`` key (e.g. via a proxy that only emits the mirrored ``prompt_tokens_details.cached_tokens`` block). Studio's canonical ``_build_usage_chunk`` always sets both so production traffic was never affected, but the calculator should accept either as a defense-in-depth measure. Add a fallback to read the mirrored field when the native one is missing or zero; the native key still wins when both are present so the math stays deterministic. 2) ``_build_usage_chunk`` dropped the ``cache_creation`` 5m / 1h breakdown. Downstream ``calculate_cost`` then could not apply the 2x 1h premium and silently fell back to the 5m default, underbilling 1h cache writes by 2x on chat-style traffic. Forward the breakdown verbatim when the upstream usage carries it. Tests grow by 4 (20 -> 24): two for the prompt_tokens_details fallback (with native-precedence pin), one for the chunk shape, one for the end-to-end pricing parity check at 1h. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add Anthropic fast_mode pricing multiplier PR 5715 wires the fast-mode-2026-02-01 beta header + speed:"fast" field through to Anthropic, but the cost calculator never learnt about the matching 6x premium documented at https://platform.claude.com/docs/en/build-with-claude/fast-mode (Opus 4.7 standard $5/$25 per MTok, fast $30/$150). This adds: - ANTHROPIC_FAST_MODE_MULT = 6.0 constant. - calculate_cost(..., fast_mode=True) applies the 6x to base input AND output rates before any cache multipliers (cache mults stack on top of fast per Anthropic docs). - Provider+model gate: silently no-op on every model that is not claude-opus-4-6 / claude-opus-4-7 so a stray fast_mode=True on Sonnet/Haiku can never over-charge. - model_priced label tagged "(fast)" so the cost tooltip can surface which rate fired. - pricing_snapshot now exposes fast_mode_mult so the frontend cost panel doesn't have to hard-code 6. 7 new edge tests pin the math; existing 55 still pass. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Honor explicit zero cache_read_input_tokens on Anthropic envelopes The previous follow-up fell back to ``prompt_tokens_details.cached_tokens`` whenever the native ``cache_read_input_tokens`` was missing OR equal to 0, even though the commit message stated the native key always wins when present. A proxy that forwards a stale ``prompt_tokens_details`` block alongside an authoritative ``cache_read_input_tokens: 0`` would then inflate cache_read past the real native count, posting a false cache_read line and bumping billable_input_tokens. Switch the gate to native-key presence so an explicit zero stays authoritative; the mirror only kicks in when the native key is absent. Add a regression test pinning the explicit-zero precedence. * Move fast_mode pricing back to unslothai#5715 The fast_mode 6x multiplier landed in two places at once -- here (f66df7b) and on unslothai#5715 (4f1afdb) -- since both audits ran in parallel. Drop the duplicate from this branch so the change lives in its natural home (unslothai#5715, which introduces fast_mode itself); this PR stays focused on the cache-read fallback + 1h breakdown. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Shorten pricing comments for PR unslothai#5722 --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

…unslothai#5690) * Studio: per-session cost calculator + /api/providers/pricing endpoint Neither the Anthropic Messages API nor the OpenAI Responses API reports a `cost` field on the response. Both expose detailed token counts (input, output, cache hits, server-tool invocations); pricing multipliers live in the provider docs. The frontend's "cost so far" display was impossible without scraping the server log. Land the math + a snapshot endpoint so the cost calculator can run client-side from the existing usage chunk plumbing. The actual UI hookup belongs in a frontend follow-up (and is gated on PR unslothai#5670's usage-chunk emission landing so the frontend sees the usage block in the first place). Changes: - New `core/inference/pricing.py` with: - Per-MTok base pricing tables for every active Anthropic and gpt-5.x family member. Dated snapshots inherit the canonical-id price via prefix match so future snapshots cost the same as the canonical id until pricing changes. - Shared multipliers for Anthropic cache writes (5m: 1.25x, 1h: 2x) and reads (0.1x); OpenAI cache reads (0.1x); Anthropic server tool surcharges ($10 / 1k web_search, $0.05 / hour code_exec beyond the 50-hour daily free tier). - `calculate_cost(provider, model, usage)` returns a per-turn USD breakdown plus billable token counts, with priced=False for unknown models so the UI can still render token counts. - `pricing_snapshot()` returns the whole table for the frontend so it doesn't re-implement the multipliers. - New `GET /api/providers/pricing` returning the snapshot, scoped behind the existing auth dependency. - New `backend/tests/test_pricing.py` with 12 cases pinning the math against documented values: base input/output multiplication, 5m / 1h / read multipliers, default-to-5m fallback when the breakdown is absent, web_search per-1k pricing, code_execution per-hour pricing, dated-snapshot fallback, OpenAI cache-read discount accounting (cached tokens subtracted from full-price bucket and re-billed at 0.1x), unknown model graceful-degrade, and the snapshot endpoint shape. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Studio: verified OpenAI pricing + fix billable input double-count Address the cost-calculator review: - OpenAI prices were 2-6x under the actual published rates. Cross-checked the live developers.openai.com/api/docs/pricing page and replaced every entry. gpt-5.5 is 5/30, gpt-5.5-pro is 30/180, gpt-5.4 is 2.5/15, gpt-5.4-mini 0.75/4.5, gpt-5.4-nano 0.20/1.25, gpt-5.3-codex 1.75/14. Added chat-latest alias to the canonical chat-snapshot rate. Dropped o3 / o4 / gpt-4.5 rows that are no longer listed on the page; calculator returns priced=False instead of silently billing at zero. - billable_input_tokens was double-counting cached tokens for OpenAI. Anthropic excludes cache_* buckets from input_tokens so we add them; OpenAI folds cache_read_input_tokens into input_tokens already, so the tooltip read 1.8M for a 1.0M bill. Branched the math by provider and added a regression test. Sourcing notes in the module docstring updated. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Address review: canonical 4.5 ids, long-context tier, OpenAI tool fees Three Codex P1 follow-ups on the cost calculator: 1. Canonical Anthropic 4.5 ids missing from ANTHROPIC_PRICING. claude-opus-4-5 / claude-sonnet-4-5 / claude-haiku-4-5 (no date suffix) are the ids used by backend defaults (PROVIDER_REGISTRY['anthropic'].default_models), but the table only had the dated forms. _lookup's prefix fallback doesn't help because the canonical id is SHORTER than the dated key, so str.startswith goes the wrong way and the calculator returned priced=False + zero cost. Added the canonical aliases for opus-4-5, sonnet-4-5, haiku-4-5, and opus-4-1. 2. OpenAI long-context tier. gpt-5.5 and gpt-5.4 cross over at 272k input tokens to a 2x input / 1.5x output rate (gpt-5.5: $5/$30 -> $10/$45; gpt-5.4: $2.50/$15 -> $5/$22.50). Turns past the threshold were systematically undercounted at headline rates. Added long_context_threshold / long_context_input_per_mtok / long_context_output_per_mtok columns and a tier-selection step in calculate_cost; model_priced gains a "(long-context >272000)" suffix when the higher tier applies so the tooltip can show which rate was used. gpt-5.5-pro / gpt-5.4-pro / mini / nano / codex have no published long-context tier today, so they keep a single rate. 3. OpenAI server-tool surcharges. web_search is $10/1000 calls and the hosted shell container is $0.03 per 20-minute session on the default 1g tier (~$0.09/hr). server_tools_usd was previously stuck at 0.0 for OpenAI even when web_search and shell tools fired, so sessions with tool use understated cost. Added OPENAI_WEB_SEARCH_USD_PER_1K and OPENAI_CONTAINER_USD_PER_HOUR constants plus a parallel of the Anthropic surcharge block that reads counts from usage["openai_tool_use"]. The SSE translator wires the counts in a follow-up commit; the calculator is now ready for them. pricing_snapshot also exposes both constants so the frontend tooltip can render the per-call rate. Existing tests updated to stay in the short-context tier where they were testing base rates; new tests pin canonical 4.5 lookups, long-context crossover on gpt-5.5/gpt-5.4, the absence of crossover on mini/nano/codex, and OpenAI tool surcharges (web_search, container hours, combined total). * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

danielhanchen requested a review from rolandtannous as a code owner May 22, 2026 10:05

[pre-commit.ci] auto fixes from pre-commit.com hooks

6738fc0

for more information, see https://pre-commit.ci

gemini-code-assist Bot reviewed May 22, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed May 22, 2026

View reviewed changes

danielhanchen and others added 2 commits May 22, 2026 10:34

[pre-commit.ci] auto fixes from pre-commit.com hooks

4a9f8d6

for more information, see https://pre-commit.ci

chatgpt-codex-connector Bot reviewed May 22, 2026

View reviewed changes

danielhanchen and others added 2 commits May 22, 2026 11:25

[pre-commit.ci] auto fixes from pre-commit.com hooks

65996c9

for more information, see https://pre-commit.ci

danielhanchen merged commit 2201fd6 into main May 22, 2026
32 checks passed

danielhanchen deleted the feat/per-session-cost branch May 22, 2026 13:03

danielhanchen mentioned this pull request May 22, 2026

Studio: pricing follow-up to #5690 (longest-prefix match + chat-style usage keys) #5722

Merged

2 tasks

	cache_read = max(cache_read, int(details.get("cached_tokens") or 0))
	out["billable_input_tokens"] = input_tokens if provider == "openai" else (input_tokens + cache_creation + cache_read)

Uh oh!

Conversation

danielhanchen commented May 22, 2026

Summary

Changes

Verification

Sequencing

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant