Enable Anthropic prompt caching for system prompt#60
Conversation
SKILL.md (~18k tokens) and llms.txt (~4k tokens) are identical across all 60 problems in a benchmark run. With cache_control: ephemeral on the system prompt block, the first request writes the cache and the remaining 59 read from it at 90% discount. For a full Vera run on Claude this reduces effective input cost from ~1.1M tokens to ~200k tokens (~5x saving). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Warning Rate limit exceeded
Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 40 minutes and 48 seconds. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThe Anthropic client's Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes Suggested labels
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
vera_bench/models.py (1)
100-106:⚠️ Potential issue | 🟠 MajorBenchmark cost accounting incomplete — cache metrics not captured.
With
cache_control: ephemeralenabled on the system prompt, Anthropic's response object splits token usage into three fields:input_tokens(non-cached only),cache_creation_input_tokens(cache writes, billed at premium), andcache_read_input_tokens(cache reads, billed at discount). The current code captures onlyinput_tokens, leaving cache-related token counts unrecorded.After the first request or on cache hits,
LLMResponse.input_tokenswill drop to just the user message size, even though the model processes the entire cached system prompt each time. Any downstream cost or throughput analysis built on this field will misrepresent actual API usage and won't match what Anthropic charges.For a benchmark tool focused on cost measurement, this is misleading. Fold the cache counters into the reported total or expose them as separate fields so the benchmark accounts for all billed tokens:
Suggested fix
elapsed = time.monotonic() - start text = response.content[0].text if response.content else "" + usage = response.usage + cache_creation = getattr(usage, "cache_creation_input_tokens", 0) or 0 + cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0 return LLMResponse( text=text, - input_tokens=response.usage.input_tokens, + input_tokens=usage.input_tokens + cache_creation + cache_read, output_tokens=response.usage.output_tokens, wall_time_s=round(elapsed, 2), model=response.model, )Alternatively, add
cache_creation_input_tokensandcache_read_input_tokensfields toLLMResponseso cost analysis can apply the correct billing multipliers (1.25× for writes, 0.1× for reads).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@vera_bench/models.py` around lines 100 - 106, The LLMResponse construction currently only uses response.usage.input_tokens and ignores Anthropic cache fields; update the LLMResponse handling in vera_bench/models.py (the LLMResponse construction/constructor) to account for response.usage.cache_creation_input_tokens and response.usage.cache_read_input_tokens by either (A) adding and populating two new fields on LLMResponse named cache_creation_input_tokens and cache_read_input_tokens and leave input_tokens as the non-cached input, or (B) folding the cache counters into the reported total input_tokens (i.e., input_tokens += cache_creation_input_tokens + cache_read_input_tokens) so downstream cost calculations see all billed tokens; ensure you reference response.usage.cache_creation_input_tokens and response.usage.cache_read_input_tokens when populating the new fields or computing the total.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@vera_bench/models.py`:
- Around line 100-106: The LLMResponse construction currently only uses
response.usage.input_tokens and ignores Anthropic cache fields; update the
LLMResponse handling in vera_bench/models.py (the LLMResponse
construction/constructor) to account for
response.usage.cache_creation_input_tokens and
response.usage.cache_read_input_tokens by either (A) adding and populating two
new fields on LLMResponse named cache_creation_input_tokens and
cache_read_input_tokens and leave input_tokens as the non-cached input, or (B)
folding the cache counters into the reported total input_tokens (i.e.,
input_tokens += cache_creation_input_tokens + cache_read_input_tokens) so
downstream cost calculations see all billed tokens; ensure you reference
response.usage.cache_creation_input_tokens and
response.usage.cache_read_input_tokens when populating the new fields or
computing the total.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: fb9d1b2a-2baa-4095-bafc-54349f36c3da
📒 Files selected for processing (1)
vera_bench/models.py
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #60 +/- ##
==========================================
+ Coverage 83.27% 83.30% +0.03%
==========================================
Files 10 10
Lines 1363 1366 +3
==========================================
+ Hits 1135 1138 +3
Misses 228 228
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Hey @jasisz — thanks for this! Substantive change looks great; verified all 485 tests pass locally on the branch, and the caching semantics match our access pattern cleanly (the system prompt is Quick heads-up: CodeRabbit flagged one outside-diff finding you may have missed — it's stashed in the review's " The finding: with
The current Since the whole point of this PR is cost reduction, getting the accounting right here feels worth including. Would you mind folding in CodeRabbit's suggested Option B on the same PR? elapsed = time.monotonic() - start
text = response.content[0].text if response.content else ""
usage = response.usage
cache_creation = getattr(usage, "cache_creation_input_tokens", 0) or 0
cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
return LLMResponse(
text=text,
input_tokens=usage.input_tokens + cache_creation + cache_read,
output_tokens=response.usage.output_tokens,
wall_time_s=round(elapsed, 2),
model=response.model,
)One gotcha for the test: mock_resp.usage.cache_creation_input_tokens = 0
mock_resp.usage.cache_read_input_tokens = 0to the mock setup around line 105. Happy to push that as a commit on your branch if you've got "Allow edits from maintainers" enabled and that's easier — or you can roll it in yourself, whichever you prefer. Option A (separate |
After enabling cache_control, Anthropic splits usage into input_tokens (uncached), cache_creation_input_tokens (write), and cache_read_input_tokens (read). Sum all three so JSONL logs report the true total input tokens, not just the uncached portion. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Good catch — pushed the accounting fix in cdc6461. Didn't realize Anthropic splits the usage counters when caching is enabled, that's a gotcha worth knowing about. Thanks for the detailed breakdown! |
aallan
left a comment
There was a problem hiding this comment.
The only thing worth mentioning, and I'm doing it just so it's written down somewhere, is that historical JSONL cost comparisons pre- vs post-caching will be slightly apples-to-oranges, because pre-caching input_tokens included the system prompt every call while post-caching it's the total billed amount (which now has the 1.25×/0.1× weighting baked in implicitly). Not a correctness issue, just a subtle analytical caveat for anyone doing long-baseline cost trending.
That said, looks good to me!
|
Follow-up for the other providers filed as #61 — OpenAI has automatic caching already firing on our calls (we're just not instrumenting it, so the savings aren't visible in the JSONLs), and Moonshot has Context Caching but needs a bigger refactor to wire in. Phase 1 there (OpenAI cached-token visibility) is a small parallel of what you did here; Phase 2 (Moonshot) is parked until benchmark cadence justifies it. |
…N_ISSUES The dependency-audit job started failing on PR aallan#62 because actions/setup-python@v6 bakes pip 26.0.1 into its Python 3.12 image, and pip 26.0.1 has CVE-2026-3219 (archive handling). The fix landed in pip 26.1 on 2026-04-26 but won't reach the runner image until GitHub refreshes the toolchain. Workaround mirrors aallan/vera#537: a `pip install --upgrade pip` step before pip-audit runs, pulling pip 26.1 from PyPI to replace the bundled 26.0.1. Inline comment in ci.yml points at the tracking issue (aallan#63) so the workaround doesn't quietly outlive its reason. Also opens KNOWN_ISSUES.md as the catalogue location for active workarounds, dev-env gotchas, and analytical caveats — each with an explicit "removal trigger" so cleanup is straightforward later. Initial entries: - The CI workaround above (aallan#63) - assets/results-graph.png pinned to v0.0.7 content until the v0.0.9 narrative writeup - input_tokens semantic shift across PR aallan#60's prompt-caching merge (analytical caveat for cost trending across that boundary) - /opt/homebrew/bin/vera is not the Vera programming language (dev-env collision with an unrelated Homebrew package) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
cache_control: ephemeralto the Anthropic system prompt blockAnthropicClient.complete(), no behavioral differenceCost impact
For a full Vera benchmark run on Claude:
Aver runs benefit too (~4k system prompt), though the absolute saving is smaller.
Cache TTL is 5 minutes (Anthropic
ephemeral), which comfortably covers a sequential 60-problem run.Test plan
ruff checkclean🤖 Generated with Claude Code
Summary by CodeRabbit