Skip to content

Studio: per-session cost calculator + /api/providers/pricing endpoint#5690

Merged
danielhanchen merged 6 commits into
mainfrom
feat/per-session-cost
May 22, 2026
Merged

Studio: per-session cost calculator + /api/providers/pricing endpoint#5690
danielhanchen merged 6 commits into
mainfrom
feat/per-session-cost

Conversation

@danielhanchen

Copy link
Copy Markdown
Member

Summary

Neither the Anthropic Messages API nor the OpenAI Responses API reports a cost field on the response. Both expose detailed token counts (input, output, cache hits, server-tool invocations); pricing multipliers live in the provider docs. The frontend's "cost so far" display was impossible without scraping the server log.

This PR lands the math + a snapshot endpoint so the cost calculator can run client-side from the existing usage chunk plumbing. The UI hookup belongs in a frontend follow-up (and is gated on PR #5670's usage-chunk emission landing so the frontend actually sees the usage block).

Changes

  • New core/inference/pricing.py:
    • Per-MTok base pricing tables for every active Anthropic (claude-opus-4-7 down to deprecated 4.0 family) and gpt-5.x family member. Dated snapshots inherit the canonical-id price via prefix match so future snapshots cost the same until pricing officially changes.
    • Shared multipliers: Anthropic cache writes (5m: 1.25x, 1h: 2x), reads (0.1x); OpenAI cache reads (0.1x); Anthropic server-tool surcharges ($10 / 1k web_search, $0.05 / hour code execution beyond the 50-hour daily free tier).
    • calculate_cost(provider, model, usage) returns a per-turn USD breakdown:
      {
        "input_usd": 0.0042, "output_usd": 0.012,
        "cache_write_usd": 0.0001, "cache_read_usd": 0.0008,
        "server_tools_usd": 0.01, "total_usd": 0.0271,
        "billable_input_tokens": 5023,
        "billable_output_tokens": 480,
        "model_priced": "claude-opus-4-7",
        "priced": true
      }
      Unknown models -> priced: false and zeros, but token counts still report.
    • pricing_snapshot() returns the whole table for the frontend.
  • New GET /api/providers/pricing returning the snapshot, scoped behind the existing auth dependency.
  • backend/tests/test_pricing.py with 12 cases pinning the math: base I/O multiplication, 5m / 1h / read multipliers, default-to-5m fallback when no breakdown, web_search per-1k, code_execution per-hour, dated-snapshot prefix fallback, OpenAI cache-read discount accounting (cached tokens get subtracted from the full-price bucket and re-billed at 0.1x), unknown-model graceful-degrade, snapshot endpoint shape.

Verification

  • All 12 tests pass.
  • Pricing values verified live against the publicly documented per-MTok rates on the Claude models overview and OpenAI pricing page.

Sequencing

Test plan

Neither the Anthropic Messages API nor the OpenAI Responses API
reports a `cost` field on the response. Both expose detailed token
counts (input, output, cache hits, server-tool invocations); pricing
multipliers live in the provider docs. The frontend's "cost so far"
display was impossible without scraping the server log.

Land the math + a snapshot endpoint so the cost calculator can run
client-side from the existing usage chunk plumbing. The actual UI
hookup belongs in a frontend follow-up (and is gated on PR #5670's
usage-chunk emission landing so the frontend sees the usage block
in the first place).

Changes:

- New `core/inference/pricing.py` with:
  - Per-MTok base pricing tables for every active Anthropic and
    gpt-5.x family member. Dated snapshots inherit the canonical-id
    price via prefix match so future snapshots cost the same as the
    canonical id until pricing changes.
  - Shared multipliers for Anthropic cache writes (5m: 1.25x, 1h: 2x)
    and reads (0.1x); OpenAI cache reads (0.1x); Anthropic server
    tool surcharges ($10 / 1k web_search, $0.05 / hour code_exec
    beyond the 50-hour daily free tier).
  - `calculate_cost(provider, model, usage)` returns a per-turn USD
    breakdown plus billable token counts, with priced=False for
    unknown models so the UI can still render token counts.
  - `pricing_snapshot()` returns the whole table for the frontend
    so it doesn't re-implement the multipliers.
- New `GET /api/providers/pricing` returning the snapshot, scoped
  behind the existing auth dependency.
- New `backend/tests/test_pricing.py` with 12 cases pinning the
  math against documented values: base input/output multiplication,
  5m / 1h / read multipliers, default-to-5m fallback when the
  breakdown is absent, web_search per-1k pricing, code_execution
  per-hour pricing, dated-snapshot fallback, OpenAI cache-read
  discount accounting (cached tokens subtracted from full-price
  bucket and re-billed at 0.1x), unknown model graceful-degrade,
  and the snapshot endpoint shape.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a pricing module and an API endpoint to calculate and expose per-token costs for Anthropic and OpenAI models. The implementation includes static pricing tables, a cost calculation helper that handles provider-specific logic such as prompt caching and tool surcharges, and comprehensive unit tests. Review feedback identifies several critical issues: the prefix matching logic for model lookups is sensitive to key order and should prioritize longer matches; the OpenAI usage extraction uses incorrect keys (input_tokens instead of prompt_tokens); and cached tokens are currently double-counted for OpenAI in the billable input total. Additionally, the return type hint for the cost calculation function needs to be updated to reflect the mixed types in the output dictionary.

Comment on lines +98 to +100
# Fall back to a prefix match so date-suffixed snapshots
# ("gpt-5.5-2026-04-23") inherit the canonical-id prices.
for key, val in table.items():

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The prefix matching logic is sensitive to the order of keys in the pricing tables. Since gpt-5.4 appears before gpt-5.4-mini in OPENAI_PRICING, a model ID like gpt-5.4-mini-2026 will incorrectly match the gpt-5.4 entry instead of the more specific one. The matching should prioritize the longest prefix to ensure accuracy.

Suggested change
# Fall back to a prefix match so date-suffixed snapshots
# ("gpt-5.5-2026-04-23") inherit the canonical-id prices.
for key, val in table.items():
for key in sorted(table.keys(), key=len, reverse=True):
if model.startswith(key):
return table[key]

Comment on lines +146 to +154
}

input_tokens = int(usage.get("input_tokens") or 0)
output_tokens = int(usage.get("output_tokens") or 0)
cache_creation = int(usage.get("cache_creation_input_tokens") or 0)
cache_read = int(usage.get("cache_read_input_tokens") or 0)
# OpenAI Responses reports cached tokens under input_tokens_details:
if provider == "openai":
details = usage.get("input_tokens_details") or {}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The code uses input_tokens and input_tokens_details for OpenAI usage extraction. However, the standard OpenAI API response uses prompt_tokens and prompt_tokens_details. To ensure compatibility, use a helper like _is_openai_compatible() and route to provider-specific extraction logic, as these keys will otherwise fail to extract the correct token counts for OpenAI models.

References
  1. When proxying chat completions to multiple LLM providers, always check for OpenAI compatibility and route to provider-specific implementations if the standard endpoint is not supported.
  2. Avoid premature abstraction. If similar logic across different providers feeds off different data shapes, keep them separate if centralization would require complex normalization.

if provider == "openai":
details = usage.get("input_tokens_details") or {}
if isinstance(details, dict):
cache_read = max(cache_read, int(details.get("cached_tokens") or 0))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

For OpenAI, the input_tokens (prompt tokens) count already includes cached tokens. Adding cache_read again here results in double-counting for the billable_input_tokens field. This should be handled conditionally based on the provider, as Anthropic and OpenAI report these counts differently.

Suggested change
cache_read = max(cache_read, int(details.get("cached_tokens") or 0))
out["billable_input_tokens"] = input_tokens if provider == "openai" else (input_tokens + cache_creation + cache_read)
References
  1. Avoid premature abstraction. If similar logic across different providers feeds off different data shapes, keep them separate if centralization would require complex normalization and reduce clarity.


def calculate_cost(
provider: str,
model: str,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The return type hint dict[str, float] is inaccurate because the returned dictionary contains mixed types, including int (billable_input_tokens), str (model_priced), and bool (priced). This will cause type-checking errors in strict environments.

Suggested change
model: str,
) -> dict[str, Any]:
References
  1. When applying type hints to a method, ensure consistency by applying similar hints to other methods that share the same signature pattern or purpose.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6738fc07fb

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +100 to +101
for key, val in table.items():
if model.startswith(key):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Match the longest model prefix before pricing

The prefix fallback returns the first matching key, which misprices dated snapshots when one model id is a prefix of another (for example, gpt-5.5-pro-2026-04-23 matches gpt-5.5 before gpt-5.5-pro, and gpt-5.4-mini-... matches gpt-5.4 before gpt-5.4-mini). This produces materially wrong cost estimates for pro/mini/nano variants; the fallback should choose the most specific (longest) matching prefix.

Useful? React with 👍 / 👎.

ANTHROPIC_PRICING: dict[str, dict[str, float]] = {
"claude-opus-4-7": {"input_per_mtok": 5.0, "output_per_mtok": 25.0},
"claude-opus-4-6": {"input_per_mtok": 5.0, "output_per_mtok": 25.0},
"claude-opus-4-5-20251101": {"input_per_mtok": 5.0, "output_per_mtok": 25.0},

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Add canonical Anthropic 4.5 model ids to pricing table

This table only includes dated 4.5 ids (for example claude-opus-4-5-20251101) and omits canonical ids like claude-opus-4-5 / claude-sonnet-4-5 / claude-haiku-4-5 that are used elsewhere in the backend defaults, so lookups for those common ids resolve as unpriced and the calculator returns zero cost (priced=false). Include canonical aliases (or normalize model ids) to avoid silently dropping cost for supported Anthropic models.

Useful? React with 👍 / 👎.

if isinstance(details, dict):
cache_read = max(cache_read, int(details.get("cached_tokens") or 0))

out["billable_input_tokens"] = input_tokens + cache_creation + cache_read

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid double-counting OpenAI cached tokens in billable total

For OpenAI, input_tokens already includes cached tokens (read from input_tokens_details.cached_tokens), but billable_input_tokens adds cache_read on top of input_tokens, inflating the token total shown to clients whenever cache hits occur. In a typical case like 1,000,000 input tokens with 800,000 cached, this reports 1,800,000 billable input tokens even though only 1,000,000 were actually processed in total.

Useful? React with 👍 / 👎.

danielhanchen and others added 2 commits May 22, 2026 10:34
Address the cost-calculator review:

- OpenAI prices were 2-6x under the actual published rates.
  Cross-checked the live developers.openai.com/api/docs/pricing page
  and replaced every entry. gpt-5.5 is 5/30, gpt-5.5-pro is 30/180,
  gpt-5.4 is 2.5/15, gpt-5.4-mini 0.75/4.5, gpt-5.4-nano 0.20/1.25,
  gpt-5.3-codex 1.75/14. Added chat-latest alias to the canonical
  chat-snapshot rate. Dropped o3 / o4 / gpt-4.5 rows that are no
  longer listed on the page; calculator returns priced=False instead
  of silently billing at zero.

- billable_input_tokens was double-counting cached tokens for
  OpenAI. Anthropic excludes cache_* buckets from input_tokens so
  we add them; OpenAI folds cache_read_input_tokens into
  input_tokens already, so the tooltip read 1.8M for a 1.0M bill.
  Branched the math by provider and added a regression test.

Sourcing notes in the module docstring updated.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4a9f8d6633

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +175 to +179
base = prices["input_per_mtok"]
out_per = prices["output_per_mtok"]

out["input_usd"] = (input_tokens / 1_000_000.0) * base
out["output_usd"] = (output_tokens / 1_000_000.0) * out_per

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Price OpenAI long-context requests at long-context rates

This calculation always uses a single input_per_mtok / output_per_mtok pair, but OpenAI now publishes separate short-context and long-context prices for the same model family. For requests that cross into long-context billing (for example on gpt-5.5/gpt-5.4), total_usd is systematically undercounted because the higher long-context rates are never selected.

Useful? React with 👍 / 👎.

Comment on lines +207 to +211
else:
# OpenAI: cache writes share the base input price (no premium).
# Only cache reads get the 0.1x multiplier; subtract those from
# the input_usd we already counted so we don't double-bill.
# Anthropic excludes cache buckets from input_tokens, but

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Include OpenAI built-in tool fees in total_usd

The OpenAI branch only adjusts token charges for cached input and never adds any built-in tool surcharges, so turns that use tools (e.g., web search or hosted code execution/containers) report a materially low total_usd even though those calls are billed separately. Because server_tools_usd stays at 0.0 for OpenAI, sessions with tool use will consistently understate cost.

Useful? React with 👍 / 👎.

danielhanchen and others added 2 commits May 22, 2026 11:25
Three Codex P1 follow-ups on the cost calculator:

1. Canonical Anthropic 4.5 ids missing from ANTHROPIC_PRICING.
   claude-opus-4-5 / claude-sonnet-4-5 / claude-haiku-4-5 (no date
   suffix) are the ids used by backend defaults
   (PROVIDER_REGISTRY['anthropic'].default_models), but the table
   only had the dated forms. _lookup's prefix fallback doesn't help
   because the canonical id is SHORTER than the dated key, so
   str.startswith goes the wrong way and the calculator returned
   priced=False + zero cost. Added the canonical aliases for
   opus-4-5, sonnet-4-5, haiku-4-5, and opus-4-1.

2. OpenAI long-context tier. gpt-5.5 and gpt-5.4 cross over at
   272k input tokens to a 2x input / 1.5x output rate (gpt-5.5:
   $5/$30 -> $10/$45; gpt-5.4: $2.50/$15 -> $5/$22.50). Turns past
   the threshold were systematically undercounted at headline
   rates. Added long_context_threshold / long_context_input_per_mtok /
   long_context_output_per_mtok columns and a tier-selection step
   in calculate_cost; model_priced gains a "(long-context >272000)"
   suffix when the higher tier applies so the tooltip can show
   which rate was used. gpt-5.5-pro / gpt-5.4-pro / mini / nano /
   codex have no published long-context tier today, so they keep a
   single rate.

3. OpenAI server-tool surcharges. web_search is $10/1000 calls and
   the hosted shell container is $0.03 per 20-minute session on the
   default 1g tier (~$0.09/hr). server_tools_usd was previously
   stuck at 0.0 for OpenAI even when web_search and shell tools
   fired, so sessions with tool use understated cost. Added
   OPENAI_WEB_SEARCH_USD_PER_1K and OPENAI_CONTAINER_USD_PER_HOUR
   constants plus a parallel of the Anthropic surcharge block that
   reads counts from usage["openai_tool_use"]. The SSE translator
   wires the counts in a follow-up commit; the calculator is now
   ready for them. pricing_snapshot also exposes both constants so
   the frontend tooltip can render the per-call rate.

Existing tests updated to stay in the short-context tier where they
were testing base rates; new tests pin canonical 4.5 lookups,
long-context crossover on gpt-5.5/gpt-5.4, the absence of crossover
on mini/nano/codex, and OpenAI tool surcharges (web_search,
container hours, combined total).
@danielhanchen danielhanchen merged commit 2201fd6 into main May 22, 2026
32 checks passed
@danielhanchen danielhanchen deleted the feat/per-session-cost branch May 22, 2026 13:03
stophobia pushed a commit to stophobia/unsloth that referenced this pull request May 26, 2026
…hat-style usage keys) (unslothai#5722)

* Studio: longest-prefix pricing match + accept chat-style usage keys

Two P1 / High follow-ups from PR 5690 review feedback:

1. Pricing prefix lookup returned the first key it iterated, so
   dated snapshots like ``gpt-5.4-mini-2026-04-23`` collided with
   the shorter ``gpt-5.4`` entry and overbilled by 3x+. Sort the
   table keys longest-first so the most specific entry wins.

2. ``calculate_cost`` only read ``input_tokens`` / ``output_tokens``,
   but Studio's OpenAI-Chat-style usage envelope re-emits
   ``prompt_tokens`` / ``completion_tokens`` (the OpenAI Chat
   Completions vocabulary). Callers handing in the chat-style
   shape silently got a zeroed bill. Accept either pair so the
   calculator works against both raw upstream usage and the
   Studio-translated envelope.

Tests (4 new in test_pricing.py): dated mini/pro snapshots inherit
the right rate; chat-style usage keys price correctly; raw key wins
when both shapes are present.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Studio: dedupe cache buckets when costing chat-style Anthropic usage

When the caller hands in Studio's chat-style envelope (``prompt_tokens``
emitted by ``_build_usage_chunk``) for Anthropic, that value already
folds ``cache_creation_input_tokens`` + ``cache_read_input_tokens`` into
the total. The previous follow-up accepted the chat-style key but then
re-added both cache buckets in ``billable_input_tokens`` and ``input_usd``,
double-counting cache tokens on every Anthropic chat-style call.

Detect which envelope landed (``input_tokens`` present = raw upstream;
absent + ``prompt_tokens`` present = Studio chat-style) and peel the
cache buckets off for Anthropic before the downstream math so both
envelopes produce identical costs.

OpenAI: ``input_tokens`` and Studio's ``prompt_tokens`` both already
include ``cache_read`` and exclude any notional ``cache_creation``, so
the OpenAI path stays a straight passthrough.

Tests (2 new): both envelopes match for Anthropic on a triple
(uncached + cache_creation + cache_read); OpenAI envelopes match on a
cached-tokens fixture.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Studio: prefer raw output_tokens over chat-style completion_tokens

Codex flagged that the previous fallback chain
'usage.get("output_tokens") or usage.get("completion_tokens")'
treats an explicit 0 as missing -- a mixed-envelope payload where
'output_tokens' is 0 but 'completion_tokens' is non-zero (or
stale) bills the wrong amount. Mirror the has_input_tokens
precedence pattern: when the raw key is present we use it even at
0; otherwise fall back to completion_tokens.

* Studio: read OpenAI cached tokens from prompt_tokens_details too

Codex flagged that the chat-style OpenAI envelope Studio re-emits
via _build_usage_chunk surfaces cached prompt tokens under
prompt_tokens_details.cached_tokens, not input_tokens_details. The
OpenAI branch only checked input_tokens_details, so a cache-heavy
chat-style turn billed every cached token at the full input rate
instead of the 0.1x cache_read discount.

Walk both keys when discovering the cached count. New regression
test pins that the two envelopes price identically for a turn with
80k of 100k tokens cached.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Studio: tighten pricing prefix match + clamp corrupt usage

Three follow-ups on the longest-prefix pricing match landed in this PR:

- Prefix match now requires a dash boundary or end-of-string. The
  longest-key sort alone still falsely landed "claude-opus-4-15" on
  the "claude-opus-4-1" row, and "gpt-5.5-prod" on the "gpt-5.5-pro"
  row (a 6x overcharge). Demanding the next character be "-" rules
  out the lookalikes while keeping dated snapshots
  ("gpt-5.4-mini-2026-04-23", "claude-opus-4-7-20260414") landing on
  their canonical row.
- Clamp every token count to >= 0. A corrupted upstream payload
  (negative cached count, off-by-one in a fixture) could previously
  produce a negative bill that masked real spend in the session
  total tooltip.
- Tolerate a non-dict "cache_creation" (e.g. an upstream proxy
  folded the field down to a single int). The current code raised
  AttributeError mid-turn; now it falls back to the 5m-default
  bucket so the rest of the cost calculation still runs.

Adds tests/test_pricing_edge.py with 20 adversarial cases covering
the boundary check, negative / None / zero token values across both
envelopes, cache_read > prompt corruption, the OpenAI long-context
threshold crossover on cache-inflated billable input, malformed
sub-objects, and unknown-provider degradation. Combined suite is
51 tests, all green.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Surface Anthropic cache-read fallback and forward 1h breakdown

Two correctness gaps surfaced on the chat-style usage envelope:

1) Anthropic cache_read fell through to "uncached input" pricing when
   the envelope arrived without the native ``cache_read_input_tokens``
   key (e.g. via a proxy that only emits the mirrored
   ``prompt_tokens_details.cached_tokens`` block). Studio's canonical
   ``_build_usage_chunk`` always sets both so production traffic was
   never affected, but the calculator should accept either as a
   defense-in-depth measure. Add a fallback to read the mirrored
   field when the native one is missing or zero; the native key still
   wins when both are present so the math stays deterministic.

2) ``_build_usage_chunk`` dropped the ``cache_creation`` 5m / 1h
   breakdown. Downstream ``calculate_cost`` then could not apply the
   2x 1h premium and silently fell back to the 5m default,
   underbilling 1h cache writes by 2x on chat-style traffic. Forward
   the breakdown verbatim when the upstream usage carries it.

Tests grow by 4 (20 -> 24): two for the prompt_tokens_details
fallback (with native-precedence pin), one for the chunk shape, one
for the end-to-end pricing parity check at 1h.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add Anthropic fast_mode pricing multiplier

PR 5715 wires the fast-mode-2026-02-01 beta header + speed:"fast"
field through to Anthropic, but the cost calculator never learnt
about the matching 6x premium documented at
https://platform.claude.com/docs/en/build-with-claude/fast-mode
(Opus 4.7 standard $5/$25 per MTok, fast $30/$150).

This adds:
- ANTHROPIC_FAST_MODE_MULT = 6.0 constant.
- calculate_cost(..., fast_mode=True) applies the 6x to base input
  AND output rates before any cache multipliers (cache mults stack
  on top of fast per Anthropic docs).
- Provider+model gate: silently no-op on every model that is not
  claude-opus-4-6 / claude-opus-4-7 so a stray fast_mode=True on
  Sonnet/Haiku can never over-charge.
- model_priced label tagged "(fast)" so the cost tooltip can
  surface which rate fired.
- pricing_snapshot now exposes fast_mode_mult so the frontend cost
  panel doesn't have to hard-code 6.

7 new edge tests pin the math; existing 55 still pass.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Honor explicit zero cache_read_input_tokens on Anthropic envelopes

The previous follow-up fell back to ``prompt_tokens_details.cached_tokens``
whenever the native ``cache_read_input_tokens`` was missing OR equal to 0,
even though the commit message stated the native key always wins when
present. A proxy that forwards a stale ``prompt_tokens_details`` block
alongside an authoritative ``cache_read_input_tokens: 0`` would then
inflate cache_read past the real native count, posting a false cache_read
line and bumping billable_input_tokens. Switch the gate to native-key
presence so an explicit zero stays authoritative; the mirror only kicks
in when the native key is absent. Add a regression test pinning the
explicit-zero precedence.

* Move fast_mode pricing back to unslothai#5715

The fast_mode 6x multiplier landed in two places at once -- here
(f66df7b) and on unslothai#5715 (4f1afdb) -- since both audits ran in
parallel. Drop the duplicate from this branch so the change lives
in its natural home (unslothai#5715, which introduces fast_mode itself);
this PR stays focused on the cache-read fallback + 1h breakdown.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Shorten pricing comments for PR unslothai#5722

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
rsd-darshan pushed a commit to rsd-darshan/unsloth that referenced this pull request Jun 3, 2026
…unslothai#5690)

* Studio: per-session cost calculator + /api/providers/pricing endpoint

Neither the Anthropic Messages API nor the OpenAI Responses API
reports a `cost` field on the response. Both expose detailed token
counts (input, output, cache hits, server-tool invocations); pricing
multipliers live in the provider docs. The frontend's "cost so far"
display was impossible without scraping the server log.

Land the math + a snapshot endpoint so the cost calculator can run
client-side from the existing usage chunk plumbing. The actual UI
hookup belongs in a frontend follow-up (and is gated on PR unslothai#5670's
usage-chunk emission landing so the frontend sees the usage block
in the first place).

Changes:

- New `core/inference/pricing.py` with:
  - Per-MTok base pricing tables for every active Anthropic and
    gpt-5.x family member. Dated snapshots inherit the canonical-id
    price via prefix match so future snapshots cost the same as the
    canonical id until pricing changes.
  - Shared multipliers for Anthropic cache writes (5m: 1.25x, 1h: 2x)
    and reads (0.1x); OpenAI cache reads (0.1x); Anthropic server
    tool surcharges ($10 / 1k web_search, $0.05 / hour code_exec
    beyond the 50-hour daily free tier).
  - `calculate_cost(provider, model, usage)` returns a per-turn USD
    breakdown plus billable token counts, with priced=False for
    unknown models so the UI can still render token counts.
  - `pricing_snapshot()` returns the whole table for the frontend
    so it doesn't re-implement the multipliers.
- New `GET /api/providers/pricing` returning the snapshot, scoped
  behind the existing auth dependency.
- New `backend/tests/test_pricing.py` with 12 cases pinning the
  math against documented values: base input/output multiplication,
  5m / 1h / read multipliers, default-to-5m fallback when the
  breakdown is absent, web_search per-1k pricing, code_execution
  per-hour pricing, dated-snapshot fallback, OpenAI cache-read
  discount accounting (cached tokens subtracted from full-price
  bucket and re-billed at 0.1x), unknown model graceful-degrade,
  and the snapshot endpoint shape.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Studio: verified OpenAI pricing + fix billable input double-count

Address the cost-calculator review:

- OpenAI prices were 2-6x under the actual published rates.
  Cross-checked the live developers.openai.com/api/docs/pricing page
  and replaced every entry. gpt-5.5 is 5/30, gpt-5.5-pro is 30/180,
  gpt-5.4 is 2.5/15, gpt-5.4-mini 0.75/4.5, gpt-5.4-nano 0.20/1.25,
  gpt-5.3-codex 1.75/14. Added chat-latest alias to the canonical
  chat-snapshot rate. Dropped o3 / o4 / gpt-4.5 rows that are no
  longer listed on the page; calculator returns priced=False instead
  of silently billing at zero.

- billable_input_tokens was double-counting cached tokens for
  OpenAI. Anthropic excludes cache_* buckets from input_tokens so
  we add them; OpenAI folds cache_read_input_tokens into
  input_tokens already, so the tooltip read 1.8M for a 1.0M bill.
  Branched the math by provider and added a regression test.

Sourcing notes in the module docstring updated.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Address review: canonical 4.5 ids, long-context tier, OpenAI tool fees

Three Codex P1 follow-ups on the cost calculator:

1. Canonical Anthropic 4.5 ids missing from ANTHROPIC_PRICING.
   claude-opus-4-5 / claude-sonnet-4-5 / claude-haiku-4-5 (no date
   suffix) are the ids used by backend defaults
   (PROVIDER_REGISTRY['anthropic'].default_models), but the table
   only had the dated forms. _lookup's prefix fallback doesn't help
   because the canonical id is SHORTER than the dated key, so
   str.startswith goes the wrong way and the calculator returned
   priced=False + zero cost. Added the canonical aliases for
   opus-4-5, sonnet-4-5, haiku-4-5, and opus-4-1.

2. OpenAI long-context tier. gpt-5.5 and gpt-5.4 cross over at
   272k input tokens to a 2x input / 1.5x output rate (gpt-5.5:
   $5/$30 -> $10/$45; gpt-5.4: $2.50/$15 -> $5/$22.50). Turns past
   the threshold were systematically undercounted at headline
   rates. Added long_context_threshold / long_context_input_per_mtok /
   long_context_output_per_mtok columns and a tier-selection step
   in calculate_cost; model_priced gains a "(long-context >272000)"
   suffix when the higher tier applies so the tooltip can show
   which rate was used. gpt-5.5-pro / gpt-5.4-pro / mini / nano /
   codex have no published long-context tier today, so they keep a
   single rate.

3. OpenAI server-tool surcharges. web_search is $10/1000 calls and
   the hosted shell container is $0.03 per 20-minute session on the
   default 1g tier (~$0.09/hr). server_tools_usd was previously
   stuck at 0.0 for OpenAI even when web_search and shell tools
   fired, so sessions with tool use understated cost. Added
   OPENAI_WEB_SEARCH_USD_PER_1K and OPENAI_CONTAINER_USD_PER_HOUR
   constants plus a parallel of the Anthropic surcharge block that
   reads counts from usage["openai_tool_use"]. The SSE translator
   wires the counts in a follow-up commit; the calculator is now
   ready for them. pricing_snapshot also exposes both constants so
   the frontend tooltip can render the per-call rate.

Existing tests updated to stay in the short-context tier where they
were testing base rates; new tests pin canonical 4.5 lookups,
long-context crossover on gpt-5.5/gpt-5.4, the absence of crossover
on mini/nano/codex, and OpenAI tool surcharges (web_search,
container hours, combined total).

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant