Studio: pricing follow-up to #5690 (longest-prefix match + chat-style usage keys)#5722
Conversation
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1c2c68f969
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| # style usage chunk Studio re-emits (``_build_usage_chunk``) uses | ||
| # ``prompt_tokens`` / ``completion_tokens``. Either is fine here | ||
| # so a caller can hand whichever envelope landed first. | ||
| input_tokens = int(usage.get("input_tokens") or usage.get("prompt_tokens") or 0) |
There was a problem hiding this comment.
Normalize Anthropic prompt_tokens before costing
When usage comes from Studio’s chat-style envelope (_build_usage_chunk), Anthropic prompt_tokens already includes cache_creation_input_tokens and cache_read_input_tokens; assigning it directly to input_tokens here makes those cached tokens billed twice (once in input_usd, again in cache_write_usd/cache_read_usd) and also inflates billable_input_tokens in the Anthropic branch. This regression appears whenever callers pass the chat-style usage shape for Anthropic instead of raw input_tokens/output_tokens.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 52ba6229d6
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| output_tokens = int( | ||
| usage.get("output_tokens") or usage.get("completion_tokens") or 0 | ||
| ) |
There was a problem hiding this comment.
Prefer output_tokens over completion_tokens when both exist
The fallback usage.get("output_tokens") or usage.get("completion_tokens") treats 0 as missing, so a payload that includes both shapes can bill the wrong output amount whenever output_tokens is explicitly zero and completion_tokens is non-zero (or stale). This commit already enforces key-precedence for input tokens, but output tokens currently do not get the same deterministic behavior, which can inflate costs for mixed-envelope callers.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a4bcd8bb9c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| details = usage.get("input_tokens_details") or {} | ||
| if isinstance(details, dict): |
There was a problem hiding this comment.
Read OpenAI chat cached tokens from prompt_tokens_details
When calculate_cost receives Studio’s chat-style OpenAI usage shape (prompt_tokens/completion_tokens), cached-token counts are exposed under prompt_tokens_details.cached_tokens (see _build_usage_chunk), but this branch only checks input_tokens_details. In that case cache_read stays 0, so cached input is billed at full input_usd and cache_read_usd is never discounted, which overcharges cache-heavy turns. The new test uses cache_read_input_tokens for OpenAI, a key _build_usage_chunk does not emit, so this regression is not covered.
Useful? React with 👍 / 👎.
Two P1 / High follow-ups from PR 5690 review feedback: 1. Pricing prefix lookup returned the first key it iterated, so dated snapshots like ``gpt-5.4-mini-2026-04-23`` collided with the shorter ``gpt-5.4`` entry and overbilled by 3x+. Sort the table keys longest-first so the most specific entry wins. 2. ``calculate_cost`` only read ``input_tokens`` / ``output_tokens``, but Studio's OpenAI-Chat-style usage envelope re-emits ``prompt_tokens`` / ``completion_tokens`` (the OpenAI Chat Completions vocabulary). Callers handing in the chat-style shape silently got a zeroed bill. Accept either pair so the calculator works against both raw upstream usage and the Studio-translated envelope. Tests (4 new in test_pricing.py): dated mini/pro snapshots inherit the right rate; chat-style usage keys price correctly; raw key wins when both shapes are present.
for more information, see https://pre-commit.ci
When the caller hands in Studio's chat-style envelope (``prompt_tokens`` emitted by ``_build_usage_chunk``) for Anthropic, that value already folds ``cache_creation_input_tokens`` + ``cache_read_input_tokens`` into the total. The previous follow-up accepted the chat-style key but then re-added both cache buckets in ``billable_input_tokens`` and ``input_usd``, double-counting cache tokens on every Anthropic chat-style call. Detect which envelope landed (``input_tokens`` present = raw upstream; absent + ``prompt_tokens`` present = Studio chat-style) and peel the cache buckets off for Anthropic before the downstream math so both envelopes produce identical costs. OpenAI: ``input_tokens`` and Studio's ``prompt_tokens`` both already include ``cache_read`` and exclude any notional ``cache_creation``, so the OpenAI path stays a straight passthrough. Tests (2 new): both envelopes match for Anthropic on a triple (uncached + cache_creation + cache_read); OpenAI envelopes match on a cached-tokens fixture.
for more information, see https://pre-commit.ci
Codex flagged that the previous fallback chain
'usage.get("output_tokens") or usage.get("completion_tokens")'
treats an explicit 0 as missing -- a mixed-envelope payload where
'output_tokens' is 0 but 'completion_tokens' is non-zero (or
stale) bills the wrong amount. Mirror the has_input_tokens
precedence pattern: when the raw key is present we use it even at
0; otherwise fall back to completion_tokens.
Codex flagged that the chat-style OpenAI envelope Studio re-emits via _build_usage_chunk surfaces cached prompt tokens under prompt_tokens_details.cached_tokens, not input_tokens_details. The OpenAI branch only checked input_tokens_details, so a cache-heavy chat-style turn billed every cached token at the full input rate instead of the 0.1x cache_read discount. Walk both keys when discovering the cached count. New regression test pins that the two envelopes price identically for a turn with 80k of 100k tokens cached.
for more information, see https://pre-commit.ci
cb56e28 to
8d8bb1d
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8d8bb1dc77
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| for key in ("input_tokens_details", "prompt_tokens_details"): | ||
| details = usage.get(key) or {} | ||
| if isinstance(details, dict): | ||
| cache_read = max(cache_read, int(details.get("cached_tokens") or 0)) |
There was a problem hiding this comment.
Preserve details-key precedence for mixed OpenAI envelopes
When both usage shapes are present, this branch takes max() across input_tokens_details.cached_tokens and prompt_tokens_details.cached_tokens, so a stale chat-style value can override the authoritative raw Responses value and over-discount input cost. The same commit already enforces raw-key precedence for input_tokens/output_tokens in mixed envelopes, but cached-token details do not follow that rule, which can silently underbill whenever both detail objects are present with different counts.
Useful? React with 👍 / 👎.
Three follow-ups on the longest-prefix pricing match landed in this PR:
- Prefix match now requires a dash boundary or end-of-string. The
longest-key sort alone still falsely landed "claude-opus-4-15" on
the "claude-opus-4-1" row, and "gpt-5.5-prod" on the "gpt-5.5-pro"
row (a 6x overcharge). Demanding the next character be "-" rules
out the lookalikes while keeping dated snapshots
("gpt-5.4-mini-2026-04-23", "claude-opus-4-7-20260414") landing on
their canonical row.
- Clamp every token count to >= 0. A corrupted upstream payload
(negative cached count, off-by-one in a fixture) could previously
produce a negative bill that masked real spend in the session
total tooltip.
- Tolerate a non-dict "cache_creation" (e.g. an upstream proxy
folded the field down to a single int). The current code raised
AttributeError mid-turn; now it falls back to the 5m-default
bucket so the rest of the cost calculation still runs.
Adds tests/test_pricing_edge.py with 20 adversarial cases covering
the boundary check, negative / None / zero token values across both
envelopes, cache_read > prompt corruption, the OpenAI long-context
threshold crossover on cache-inflated billable input, malformed
sub-objects, and unknown-provider degradation. Combined suite is
51 tests, all green.
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
for more information, see https://pre-commit.ci
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
Two correctness gaps surfaced on the chat-style usage envelope: 1) Anthropic cache_read fell through to "uncached input" pricing when the envelope arrived without the native ``cache_read_input_tokens`` key (e.g. via a proxy that only emits the mirrored ``prompt_tokens_details.cached_tokens`` block). Studio's canonical ``_build_usage_chunk`` always sets both so production traffic was never affected, but the calculator should accept either as a defense-in-depth measure. Add a fallback to read the mirrored field when the native one is missing or zero; the native key still wins when both are present so the math stays deterministic. 2) ``_build_usage_chunk`` dropped the ``cache_creation`` 5m / 1h breakdown. Downstream ``calculate_cost`` then could not apply the 2x 1h premium and silently fell back to the 5m default, underbilling 1h cache writes by 2x on chat-style traffic. Forward the breakdown verbatim when the upstream usage carries it. Tests grow by 4 (20 -> 24): two for the prompt_tokens_details fallback (with native-precedence pin), one for the chunk shape, one for the end-to-end pricing parity check at 1h.
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
for more information, see https://pre-commit.ci
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
PR 5715 wires the fast-mode-2026-02-01 beta header + speed:"fast" field through to Anthropic, but the cost calculator never learnt about the matching 6x premium documented at https://platform.claude.com/docs/en/build-with-claude/fast-mode (Opus 4.7 standard $5/$25 per MTok, fast $30/$150). This adds: - ANTHROPIC_FAST_MODE_MULT = 6.0 constant. - calculate_cost(..., fast_mode=True) applies the 6x to base input AND output rates before any cache multipliers (cache mults stack on top of fast per Anthropic docs). - Provider+model gate: silently no-op on every model that is not claude-opus-4-6 / claude-opus-4-7 so a stray fast_mode=True on Sonnet/Haiku can never over-charge. - model_priced label tagged "(fast)" so the cost tooltip can surface which rate fired. - pricing_snapshot now exposes fast_mode_mult so the frontend cost panel doesn't have to hard-code 6. 7 new edge tests pin the math; existing 55 still pass.
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8d459691f4
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if cache_read == 0: | ||
| details = usage.get("prompt_tokens_details") or {} | ||
| if isinstance(details, dict): | ||
| cache_read = max(cache_read, int(details.get("cached_tokens") or 0)) | ||
| cache_read = max(0, int(details.get("cached_tokens") or 0)) |
There was a problem hiding this comment.
Respect explicit zero cache_read_input_tokens
The new fallback treats cache_read_input_tokens == 0 as “missing” and replaces it from prompt_tokens_details.cached_tokens, which breaks mixed-envelope Anthropic payloads when the mirrored chat-style details are stale. In that case a real zero cache-read gets reinterpreted as non-zero, inflating billable_input_tokens and cache_read_usd (and therefore total_usd) even though the authoritative native key explicitly said no cached reads.
Useful? React with 👍 / 👎.
The previous follow-up fell back to ``prompt_tokens_details.cached_tokens`` whenever the native ``cache_read_input_tokens`` was missing OR equal to 0, even though the commit message stated the native key always wins when present. A proxy that forwards a stale ``prompt_tokens_details`` block alongside an authoritative ``cache_read_input_tokens: 0`` would then inflate cache_read past the real native count, posting a false cache_read line and bumping billable_input_tokens. Switch the gate to native-key presence so an explicit zero stays authoritative; the mirror only kicks in when the native key is absent. Add a regression test pinning the explicit-zero precedence.
|
Audit follow-up: pushed 0f3c302 fixing one real precedence bug introduced by 6a939b4. The mirrored Switched the gate to native-key presence so explicit zero stays authoritative; mirror only kicks in when the native key is absent. Added a regression test pinning that precedence. Other reviewer findings (negative server-tool clamp, OpenAI dual-key explicit-zero asymmetry, Rates cross-checked against platform.claude.com and developers.openai.com on 2026-05-25: Opus 4.7 $5/$25, Sonnet 4.6 $3/$15, Haiku 4.5 $1/$5, gpt-5.5 $5/$30 (272k long-context 2x/1.5x), gpt-5.4 $2.5/$15, gpt-5.4-mini $0.75/$4.50, cache 1.25x/2.0x/0.1x. Dated snapshot resolution against |
The fast_mode 6x multiplier landed in two places at once -- here (f66df7b) and on #5715 (4f1afdb) -- since both audits ran in parallel. Drop the duplicate from this branch so the change lives in its natural home (#5715, which introduces fast_mode itself); this PR stays focused on the cache-read fallback + 1h breakdown.
for more information, see https://pre-commit.ci
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
|
Round 4 cross-cutting fix: merged origin/main into this branch (no conflicts) to bring in PR #5735 (orphan tool_call XML strip widening + 263-line test_tool_xml_strip.py). All 8 PRs in this audit cohort had been forked off a pre-#5735 main, so a squash-merge of any of them would have silently reverted the widened _TOOL_XML_RE regex and deleted the dedicated test file. Verified: diff against origin/main now shows zero unintended changes to routes/inference.py and test_tool_xml_strip.py outside the actual PR scope. |
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
…5690 # Conflicts: # studio/backend/core/inference/external_provider.py
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
Summary
Two P1 / High follow-ups from review on the merged #5690 pricing module.
gpt-5.4-mini-2026-04-23returned the first matching prefix, which was the shortergpt-5.4entry. That overbilled by 3x+ on the mini family. Sort table keys longest-first so the most specific entry wins.calculate_costonly readinput_tokens/output_tokens, but Studio's OpenAI-Chat-style usage envelope re-emitsprompt_tokens/completion_tokens(the Chat Completions vocabulary). Callers handing in the chat-style shape silently got a zeroed bill. Accept either pair so the calculator works against both raw upstream usage and the Studio-translated envelope.Reported by Codex + Gemini bots on #5690.
Test plan
cd studio && python -m pytest backend/tests/test_pricing.py -v(27 passing, 4 new):test_longest_prefix_match_wins_for_dated_mini_snapshottest_longest_prefix_match_wins_for_dated_pro_snapshottest_openai_chat_style_usage_keys_priced_correctlytest_input_tokens_preferred_when_both_keys_present