Skip to content

Studio: pricing follow-up to #5690 (longest-prefix match + chat-style usage keys)#5722

Merged
danielhanchen merged 19 commits into
mainfrom
fix/pricing-followup-5690
May 26, 2026
Merged

Studio: pricing follow-up to #5690 (longest-prefix match + chat-style usage keys)#5722
danielhanchen merged 19 commits into
mainfrom
fix/pricing-followup-5690

Conversation

@danielhanchen

Copy link
Copy Markdown
Member

Summary

Two P1 / High follow-ups from review on the merged #5690 pricing module.

  1. Longest-prefix pricing match. Looking up a dated snapshot like gpt-5.4-mini-2026-04-23 returned the first matching prefix, which was the shorter gpt-5.4 entry. That overbilled by 3x+ on the mini family. Sort table keys longest-first so the most specific entry wins.
  2. Accept OpenAI-Chat-style usage keys. calculate_cost only read input_tokens / output_tokens, but Studio's OpenAI-Chat-style usage envelope re-emits prompt_tokens / completion_tokens (the Chat Completions vocabulary). Callers handing in the chat-style shape silently got a zeroed bill. Accept either pair so the calculator works against both raw upstream usage and the Studio-translated envelope.

Reported by Codex + Gemini bots on #5690.

Test plan

  • cd studio && python -m pytest backend/tests/test_pricing.py -v (27 passing, 4 new):
    • test_longest_prefix_match_wins_for_dated_mini_snapshot
    • test_longest_prefix_match_wins_for_dated_pro_snapshot
    • test_openai_chat_style_usage_keys_priced_correctly
    • test_input_tokens_preferred_when_both_keys_present
  • All 23 prior pricing tests still green.

@gemini-code-assist

Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1c2c68f969

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

# style usage chunk Studio re-emits (``_build_usage_chunk``) uses
# ``prompt_tokens`` / ``completion_tokens``. Either is fine here
# so a caller can hand whichever envelope landed first.
input_tokens = int(usage.get("input_tokens") or usage.get("prompt_tokens") or 0)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Normalize Anthropic prompt_tokens before costing

When usage comes from Studio’s chat-style envelope (_build_usage_chunk), Anthropic prompt_tokens already includes cache_creation_input_tokens and cache_read_input_tokens; assigning it directly to input_tokens here makes those cached tokens billed twice (once in input_usd, again in cache_write_usd/cache_read_usd) and also inflates billable_input_tokens in the Anthropic branch. This regression appears whenever callers pass the chat-style usage shape for Anthropic instead of raw input_tokens/output_tokens.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 52ba6229d6

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +224 to +226
output_tokens = int(
usage.get("output_tokens") or usage.get("completion_tokens") or 0
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Prefer output_tokens over completion_tokens when both exist

The fallback usage.get("output_tokens") or usage.get("completion_tokens") treats 0 as missing, so a payload that includes both shapes can bill the wrong output amount whenever output_tokens is explicitly zero and completion_tokens is non-zero (or stale). This commit already enforces key-precedence for input tokens, but output tokens currently do not get the same deterministic behavior, which can inflate costs for mixed-envelope callers.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a4bcd8bb9c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 233 to 234
details = usage.get("input_tokens_details") or {}
if isinstance(details, dict):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Read OpenAI chat cached tokens from prompt_tokens_details

When calculate_cost receives Studio’s chat-style OpenAI usage shape (prompt_tokens/completion_tokens), cached-token counts are exposed under prompt_tokens_details.cached_tokens (see _build_usage_chunk), but this branch only checks input_tokens_details. In that case cache_read stays 0, so cached input is billed at full input_usd and cache_read_usd is never discounted, which overcharges cache-heavy turns. The new test uses cache_read_input_tokens for OpenAI, a key _build_usage_chunk does not emit, so this regression is not covered.

Useful? React with 👍 / 👎.

danielhanchen and others added 7 commits May 23, 2026 14:00
Two P1 / High follow-ups from PR 5690 review feedback:

1. Pricing prefix lookup returned the first key it iterated, so
   dated snapshots like ``gpt-5.4-mini-2026-04-23`` collided with
   the shorter ``gpt-5.4`` entry and overbilled by 3x+. Sort the
   table keys longest-first so the most specific entry wins.

2. ``calculate_cost`` only read ``input_tokens`` / ``output_tokens``,
   but Studio's OpenAI-Chat-style usage envelope re-emits
   ``prompt_tokens`` / ``completion_tokens`` (the OpenAI Chat
   Completions vocabulary). Callers handing in the chat-style
   shape silently got a zeroed bill. Accept either pair so the
   calculator works against both raw upstream usage and the
   Studio-translated envelope.

Tests (4 new in test_pricing.py): dated mini/pro snapshots inherit
the right rate; chat-style usage keys price correctly; raw key wins
when both shapes are present.
When the caller hands in Studio's chat-style envelope (``prompt_tokens``
emitted by ``_build_usage_chunk``) for Anthropic, that value already
folds ``cache_creation_input_tokens`` + ``cache_read_input_tokens`` into
the total. The previous follow-up accepted the chat-style key but then
re-added both cache buckets in ``billable_input_tokens`` and ``input_usd``,
double-counting cache tokens on every Anthropic chat-style call.

Detect which envelope landed (``input_tokens`` present = raw upstream;
absent + ``prompt_tokens`` present = Studio chat-style) and peel the
cache buckets off for Anthropic before the downstream math so both
envelopes produce identical costs.

OpenAI: ``input_tokens`` and Studio's ``prompt_tokens`` both already
include ``cache_read`` and exclude any notional ``cache_creation``, so
the OpenAI path stays a straight passthrough.

Tests (2 new): both envelopes match for Anthropic on a triple
(uncached + cache_creation + cache_read); OpenAI envelopes match on a
cached-tokens fixture.
Codex flagged that the previous fallback chain
'usage.get("output_tokens") or usage.get("completion_tokens")'
treats an explicit 0 as missing -- a mixed-envelope payload where
'output_tokens' is 0 but 'completion_tokens' is non-zero (or
stale) bills the wrong amount. Mirror the has_input_tokens
precedence pattern: when the raw key is present we use it even at
0; otherwise fall back to completion_tokens.
Codex flagged that the chat-style OpenAI envelope Studio re-emits
via _build_usage_chunk surfaces cached prompt tokens under
prompt_tokens_details.cached_tokens, not input_tokens_details. The
OpenAI branch only checked input_tokens_details, so a cache-heavy
chat-style turn billed every cached token at the full input rate
instead of the 0.1x cache_read discount.

Walk both keys when discovering the cached count. New regression
test pins that the two envelopes price identically for a turn with
80k of 100k tokens cached.
@danielhanchen danielhanchen force-pushed the fix/pricing-followup-5690 branch from cb56e28 to 8d8bb1d Compare May 23, 2026 14:00

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8d8bb1dc77

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +240 to +243
for key in ("input_tokens_details", "prompt_tokens_details"):
details = usage.get(key) or {}
if isinstance(details, dict):
cache_read = max(cache_read, int(details.get("cached_tokens") or 0))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve details-key precedence for mixed OpenAI envelopes

When both usage shapes are present, this branch takes max() across input_tokens_details.cached_tokens and prompt_tokens_details.cached_tokens, so a stale chat-style value can override the authoritative raw Responses value and over-discount input cost. The same commit already enforces raw-key precedence for input_tokens/output_tokens in mixed envelopes, but cached-token details do not follow that rule, which can silently underbill whenever both detail objects are present with different counts.

Useful? React with 👍 / 👎.

Three follow-ups on the longest-prefix pricing match landed in this PR:

- Prefix match now requires a dash boundary or end-of-string. The
  longest-key sort alone still falsely landed "claude-opus-4-15" on
  the "claude-opus-4-1" row, and "gpt-5.5-prod" on the "gpt-5.5-pro"
  row (a 6x overcharge). Demanding the next character be "-" rules
  out the lookalikes while keeping dated snapshots
  ("gpt-5.4-mini-2026-04-23", "claude-opus-4-7-20260414") landing on
  their canonical row.
- Clamp every token count to >= 0. A corrupted upstream payload
  (negative cached count, off-by-one in a fixture) could previously
  produce a negative bill that masked real spend in the session
  total tooltip.
- Tolerate a non-dict "cache_creation" (e.g. an upstream proxy
  folded the field down to a single int). The current code raised
  AttributeError mid-turn; now it falls back to the 5m-default
  bucket so the rest of the cost calculation still runs.

Adds tests/test_pricing_edge.py with 20 adversarial cases covering
the boundary check, negative / None / zero token values across both
envelopes, cache_read > prompt corruption, the OpenAI long-context
threshold crossover on cache-inflated billable input, malformed
sub-objects, and unknown-provider degradation. Combined suite is
51 tests, all green.
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

Two correctness gaps surfaced on the chat-style usage envelope:

1) Anthropic cache_read fell through to "uncached input" pricing when
   the envelope arrived without the native ``cache_read_input_tokens``
   key (e.g. via a proxy that only emits the mirrored
   ``prompt_tokens_details.cached_tokens`` block). Studio's canonical
   ``_build_usage_chunk`` always sets both so production traffic was
   never affected, but the calculator should accept either as a
   defense-in-depth measure. Add a fallback to read the mirrored
   field when the native one is missing or zero; the native key still
   wins when both are present so the math stays deterministic.

2) ``_build_usage_chunk`` dropped the ``cache_creation`` 5m / 1h
   breakdown. Downstream ``calculate_cost`` then could not apply the
   2x 1h premium and silently fell back to the 5m default,
   underbilling 1h cache writes by 2x on chat-style traffic. Forward
   the breakdown verbatim when the upstream usage carries it.

Tests grow by 4 (20 -> 24): two for the prompt_tokens_details
fallback (with native-precedence pin), one for the chunk shape, one
for the end-to-end pricing parity check at 1h.
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

danielhanchen and others added 2 commits May 25, 2026 13:27
PR 5715 wires the fast-mode-2026-02-01 beta header + speed:"fast"
field through to Anthropic, but the cost calculator never learnt
about the matching 6x premium documented at
https://platform.claude.com/docs/en/build-with-claude/fast-mode
(Opus 4.7 standard $5/$25 per MTok, fast $30/$150).

This adds:
- ANTHROPIC_FAST_MODE_MULT = 6.0 constant.
- calculate_cost(..., fast_mode=True) applies the 6x to base input
  AND output rates before any cache multipliers (cache mults stack
  on top of fast per Anthropic docs).
- Provider+model gate: silently no-op on every model that is not
  claude-opus-4-6 / claude-opus-4-7 so a stray fast_mode=True on
  Sonnet/Haiku can never over-charge.
- model_priced label tagged "(fast)" so the cost tooltip can
  surface which rate fired.
- pricing_snapshot now exposes fast_mode_mult so the frontend cost
  panel doesn't have to hard-code 6.

7 new edge tests pin the math; existing 55 still pass.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8d459691f4

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +242 to +245
if cache_read == 0:
details = usage.get("prompt_tokens_details") or {}
if isinstance(details, dict):
cache_read = max(cache_read, int(details.get("cached_tokens") or 0))
cache_read = max(0, int(details.get("cached_tokens") or 0))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Respect explicit zero cache_read_input_tokens

The new fallback treats cache_read_input_tokens == 0 as “missing” and replaces it from prompt_tokens_details.cached_tokens, which breaks mixed-envelope Anthropic payloads when the mirrored chat-style details are stale. In that case a real zero cache-read gets reinterpreted as non-zero, inflating billable_input_tokens and cache_read_usd (and therefore total_usd) even though the authoritative native key explicitly said no cached reads.

Useful? React with 👍 / 👎.

The previous follow-up fell back to ``prompt_tokens_details.cached_tokens``
whenever the native ``cache_read_input_tokens`` was missing OR equal to 0,
even though the commit message stated the native key always wins when
present. A proxy that forwards a stale ``prompt_tokens_details`` block
alongside an authoritative ``cache_read_input_tokens: 0`` would then
inflate cache_read past the real native count, posting a false cache_read
line and bumping billable_input_tokens. Switch the gate to native-key
presence so an explicit zero stays authoritative; the mirror only kicks
in when the native key is absent. Add a regression test pinning the
explicit-zero precedence.
@danielhanchen

Copy link
Copy Markdown
Member Author

Audit follow-up: pushed 0f3c302 fixing one real precedence bug introduced by 6a939b4.

The mirrored prompt_tokens_details.cached_tokens fallback was gated on cache_read == 0, which conflated the missing-key case with an explicit native zero. A proxy forwarding a stale details block alongside an authoritative cache_read_input_tokens: 0 therefore inflated cache_read past the native count, posting a false cache-read line and bumping billable_input_tokens. Reproed at $5.00 expected vs $5.50 billed on a 1M-input Anthropic prompt.

Switched the gate to native-key presence so explicit zero stays authoritative; mirror only kicks in when the native key is absent. Added a regression test pinning that precedence. pytest studio/backend/tests/test_pricing*.py -> 62 passed.

Other reviewer findings (negative server-tool clamp, OpenAI dual-key explicit-zero asymmetry, _build_usage_chunk not clamping, gpt-5.x-pro lacking long-context rows, bare int() on malformed strings) are pre-existing in pricing.py and not introduced by this PR's diff; leaving them out of scope here.

Rates cross-checked against platform.claude.com and developers.openai.com on 2026-05-25: Opus 4.7 $5/$25, Sonnet 4.6 $3/$15, Haiku 4.5 $1/$5, gpt-5.5 $5/$30 (272k long-context 2x/1.5x), gpt-5.4 $2.5/$15, gpt-5.4-mini $0.75/$4.50, cache 1.25x/2.0x/0.1x. Dated snapshot resolution against claude-opus-4-7-20260201 and gpt-5.5-2026-04-23 verified.

danielhanchen and others added 3 commits May 26, 2026 00:35
The fast_mode 6x multiplier landed in two places at once -- here
(f66df7b) and on #5715 (4f1afdb) -- since both audits ran in
parallel. Drop the duplicate from this branch so the change lives
in its natural home (#5715, which introduces fast_mode itself);
this PR stays focused on the cache-read fallback + 1h breakdown.
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@danielhanchen

Copy link
Copy Markdown
Member Author

Round 4 cross-cutting fix: merged origin/main into this branch (no conflicts) to bring in PR #5735 (orphan tool_call XML strip widening + 263-line test_tool_xml_strip.py). All 8 PRs in this audit cohort had been forked off a pre-#5735 main, so a squash-merge of any of them would have silently reverted the widened _TOOL_XML_RE regex and deleted the dedicated test file. Verified: diff against origin/main now shows zero unintended changes to routes/inference.py and test_tool_xml_strip.py outside the actual PR scope.

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

…5690

# Conflicts:
#	studio/backend/core/inference/external_provider.py
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@danielhanchen danielhanchen merged commit 542d743 into main May 26, 2026
28 of 30 checks passed
@danielhanchen danielhanchen deleted the fix/pricing-followup-5690 branch May 26, 2026 06:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant