
fix(fireworks_ai): account for cache read/creation tokens in cost calculator#24860

Open
GopalGB wants to merge 2 commits into BerriAI:main from GopalGB:fix/fireworks-ai-cache-token-pricing

Conversation


GopalGB commented Mar 31, 2026

Summary

  • Fixes Fireworks AI cost_per_token() to correctly price cache_read_input_tokens and cache_creation_input_tokens
  • Previously only calculated prompt_tokens * input_cost_per_token, ignoring cache-specific rates already defined in model_info
  • Adds regression test validating that cached token pricing produces lower costs than full-price input tokens

Root Cause

fireworks_ai/cost_calculator.py:cost_per_token used a simple multiplication of prompt_tokens * input_cost_per_token without checking for cache_read_input_tokens or cache_creation_input_tokens in the Usage object. Model info already had cache_read_input_token_cost set correctly (e.g., 1e-07 for kimi-k2p5), but it was never read.

Approach

Cache tokens are already included in prompt_tokens, so the fix applies the cost differential:

```python
prompt_cost += cache_read_tokens * (cache_rate - input_rate)
```

For cache reads this subtracts the overcharge and bills the cached tokens at the discounted rate; for cache creation tokens the same formula applies the corresponding surcharge when the creation rate exceeds the input rate.
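The adjustment can be sketched in isolation. The function and variable names below are illustrative, not the actual ones in fireworks_ai/cost_calculator.py; the rates are the kimi-k2p5 figures quoted in this PR:

```python
from typing import Optional

def adjusted_prompt_cost(
    prompt_tokens: int,
    cache_read_tokens: int,
    input_rate: float,
    cache_read_rate: Optional[float],
) -> float:
    # Cache tokens are already counted inside prompt_tokens, so first
    # charge everything at the base input rate...
    cost = prompt_tokens * input_rate
    # ...then move the cached portion from the base rate to the cache rate.
    if cache_read_rate is not None and cache_read_tokens:
        cost += cache_read_tokens * (cache_read_rate - input_rate)
    return cost

# Rates from the PR discussion (kimi-k2p5): input 6e-07, cache read 1e-07.
full_price = adjusted_prompt_cost(1000, 0, 6e-07, 1e-07)
discounted = adjusted_prompt_cost(1000, 800, 6e-07, 1e-07)
```

With 800 of 1000 prompt tokens served from cache, the cached cost works out to roughly a third of the full price (about 2e-4 versus 6e-4), since those 800 tokens are billed at 1e-07 instead of 6e-07.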

Test Plan

  • New test test_fireworks_ai_cache_token_pricing added
  • Validates completion cost unchanged by cache tokens
  • Validates prompt cost is lower when cache tokens are present (for models with cache pricing)
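The shape of that test can be sketched self-containedly with a stand-in cost function. Nothing here imports litellm's real cost_per_token or Usage; all names and the output rate are illustrative, and only the input and cache-read rates come from the PR discussion:

```python
from dataclasses import dataclass

INPUT_RATE = 6e-07       # example: kimi-k2p5 input rate from the PR
CACHE_READ_RATE = 1e-07  # example: kimi-k2p5 cache read rate from the PR
OUTPUT_RATE = 2.5e-06    # hypothetical output rate

@dataclass
class FakeUsage:  # stand-in for litellm's Usage object
    prompt_tokens: int
    completion_tokens: int
    cache_read_input_tokens: int = 0

def fake_cost_per_token(usage: FakeUsage):
    prompt_cost = usage.prompt_tokens * INPUT_RATE
    # differential adjustment for the cached portion of prompt_tokens
    prompt_cost += usage.cache_read_input_tokens * (CACHE_READ_RATE - INPUT_RATE)
    return prompt_cost, usage.completion_tokens * OUTPUT_RATE

with_cache = fake_cost_per_token(FakeUsage(1000, 50, cache_read_input_tokens=800))
no_cache = fake_cost_per_token(FakeUsage(1000, 50))
assert with_cache[1] == no_cache[1]  # completion cost unchanged by cache tokens
assert with_cache[0] < no_cache[0]   # cached prompt must be cheaper
```

Both bullets above map onto the two assertions: completion cost is untouched by cache tokens, and prompt cost drops whenever a cache rate below the input rate is defined.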

Fixes #24774

…culator

The Fireworks AI cost_per_token function only calculated costs using
prompt_tokens * input_cost_per_token, ignoring cache_read_input_tokens
and cache_creation_input_tokens from the Usage object. This caused
incorrect cost reporting when prompt caching was active.

Now adjusts the prompt cost by applying the differential between the
cache-specific rate and the standard input rate for cached tokens.

Fixes BerriAI#24774

vercel bot commented Mar 31, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| litellm | Ready | Preview, Comment | Mar 31, 2026 4:01pm |



CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.


codspeed-hq bot commented Mar 31, 2026

Merging this PR will not alter performance

✅ 16 untouched benchmarks


Comparing GopalGB:fix/fireworks-ai-cache-token-pricing (17dff85) with main (50a52f6)



greptile-apps bot commented Mar 31, 2026

Greptile Summary

This PR fixes fireworks_ai/cost_calculator.py to correctly account for cache_read_input_tokens and cache_creation_input_tokens when computing prompt costs, and adds a deterministic regression test using fireworks_ai/kimi-k2p5 (a model that actually has cache_read_input_token_cost in the pricing config).

  • Core fix: The old code charged all prompt_tokens at the flat input_cost_per_token rate. The new code applies a differential adjustment: since cache tokens are already counted in prompt_tokens, it adds cache_tokens * (cache_rate - input_rate) — a negative delta for cheaper cached reads, a positive delta for more-expensive cache creation. The arithmetic is correct.
  • Test: test_fireworks_ai_cache_token_pricing uses the local model cost map (no network calls), constructs mock Usage objects, and asserts prompt_cost_cached < prompt_cost_no_cache — an assertion that is deterministically exercised because kimi-k2p5 has cache_read_input_token_cost: 1e-07 < input_cost_per_token: 6e-07. The previous version of this test (from the prior review thread) had a conditional guard that silently skipped the critical assertion; that guard is gone in this iteration.
  • No security, auth, or backwards-compatibility concerns.
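The sign behavior of the delta described above is easy to check numerically. The cache creation rate below is hypothetical; only the input and cache-read rates come from the PR:

```python
input_rate = 6e-07             # base input rate (from the PR, kimi-k2p5)
cache_read_rate = 1e-07        # cheaper cached reads (from the PR)
cache_creation_rate = 7.5e-07  # hypothetical pricier cache creation rate

# Negative delta: cached reads cost less than the base rate.
read_delta = 800 * (cache_read_rate - input_rate)
# Positive delta: cache creation costs more than the base rate.
creation_delta = 200 * (cache_creation_rate - input_rate)
```

read_delta comes out negative (about -4.0e-4) and creation_delta positive (about +3.0e-5), matching the "negative delta for cheaper cached reads, positive delta for more-expensive cache creation" description.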

Confidence Score: 5/5

Safe to merge — the fix is a targeted, mathematically correct cost adjustment with a deterministic regression test.

No P0 or P1 issues found. The differential-rate adjustment logic is correct. The test now unconditionally exercises the critical assertion using a model with verified cache pricing in the config. No auth, security, or backwards-compatibility concerns.

No files require special attention.

Important Files Changed

Filename Overview
litellm/llms/fireworks_ai/cost_calculator.py Adds differential cost adjustment for cache read/creation tokens; math is correct (already-charged base rate adjusted to the cached rate); no issues found.
tests/local_testing/test_completion_cost.py Adds a pure unit test using kimi-k2p5 (which has cache_read_input_token_cost in the pricing config), no real network calls, both key assertions are always exercised.

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[cost_per_token called] --> B[get_model_info for fireworks_ai model]
    B --> C{Model found?}
    C -- No --> D[Fallback: get_base_model_for_pricing\ne.g. fireworks-ai-above-16b]
    D --> E[get_model_info for base tier]
    C -- Yes --> F[input_cost = prompt_tokens x input_cost_per_token]
    E --> F
    F --> G{cache_read_input_tokens > 0\nAND cache_read_input_token_cost set?}
    G -- Yes --> H[prompt_cost += cache_read_tokens\nx cache_read_cost - input_rate]
    G -- No --> I{cache_creation_input_tokens > 0\nAND cache_creation_input_token_cost set?}
    H --> I
    I -- Yes --> J[prompt_cost += cache_creation_tokens\nx cache_creation_cost - input_rate]
    I -- No --> K[completion_cost = completion_tokens\nx output_cost_per_token]
    J --> K
    K --> L[return prompt_cost, completion_cost]
```

Reviews (2): Last reviewed commit: "fix(test): use model with cache pricing ..."

Comment on lines +1240 to +1244
```python
if model_info.get("cache_read_input_token_cost") is not None:
    assert prompt_cost_cached < prompt_cost_no_cache, (
        "Prompt cost with 800 cache-read tokens should be less than "
        "full-price for the same total prompt tokens"
    )
```

P1 Critical assertion never executes — test provides zero coverage for the fix

The guarded assertion if model_info.get("cache_read_input_token_cost") is not None: will always be False for fireworks_ai/llama-v3p3-70b-instruct. That model has no direct entry in model_prices_and_context_window.json (only the accounts/fireworks/models/ long-form path exists, and it has no cache_read_input_token_cost). cost_per_token therefore falls back to the generic fireworks-ai-above-16b tier, which also has no cache pricing.

As a result, the key assertion — prompt_cost_cached < prompt_cost_no_cache — is never reached, meaning the test always passes but never validates that the fix actually works.

The model to use is fireworks_ai/kimi-k2p5 (which does have cache_read_input_token_cost in the pricing config, at a rate lower than input_cost_per_token), making the assertion deterministically exercised. The conditional if guard can then be removed entirely:

```diff
    prompt_cost_cached, completion_cost_cached = cost_per_token(
-       model="fireworks_ai/llama-v3p3-70b-instruct", usage=usage_with_cache
+       model="fireworks_ai/kimi-k2p5", usage=usage_with_cache
    )
    prompt_cost_no_cache, completion_cost_no_cache = cost_per_token(
-       model="fireworks_ai/llama-v3p3-70b-instruct", usage=usage_no_cache
+       model="fireworks_ai/kimi-k2p5", usage=usage_no_cache
    )

    assert completion_cost_cached == completion_cost_no_cache

-   model_info = litellm.get_model_info(
-       model="fireworks_ai/llama-v3p3-70b-instruct",
-       custom_llm_provider="fireworks_ai",
-   )
-   if model_info.get("cache_read_input_token_cost") is not None:
-       assert prompt_cost_cached < prompt_cost_no_cache, (
-           "Prompt cost with 800 cache-read tokens should be less than "
-           "full-price for the same total prompt tokens"
-       )
+   # kimi-k2p5 defines cache_read_input_token_cost < input_cost_per_token,
+   # so 800 cache-read tokens must yield a lower prompt cost.
+   assert prompt_cost_cached < prompt_cost_no_cache, (
+       "Prompt cost with 800 cache-read tokens should be less than "
+       "full-price for the same total prompt tokens"
+   )
```

Without this fix the regression test from the PR description never actually runs, defeating its purpose as a safeguard against future breakage.



GopalGB commented Mar 31, 2026

I have read the CLA Document and I hereby sign the CLA

Switch test from `fireworks_ai/llama-v3p3-70b-instruct` (no
cache_read_input_token_cost) to `fireworks_ai/kimi-k2p5` (has
cache pricing at 1e-07 vs input 6e-07). Remove the conditional
guard so the assertion always runs.

Addresses Greptile review feedback on BerriAI#24860.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


Development

Successfully merging this pull request may close these issues.

[Bug]: Fireworks AI cost calculator ignores cache token pricing
