Enable Anthropic prompt caching for system prompt by jasisz · Pull Request #60 · aallan/vera-bench

jasisz · 2026-04-17T16:20:32Z

Summary

- Add cache_control: ephemeral to the Anthropic system prompt block
- SKILL.md (~18k tokens) and llms.txt (~4k tokens) are identical across all 60 problems: the first request writes the cache, and the remaining 59 read it at a 90% discount
- Single-line change in AnthropicClient.complete(), no behavioral difference
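A minimal sketch of the request shape this change relies on, assuming the Anthropic Messages API accepts `system` as a list of text blocks. The helper name `build_cached_system` and the placeholder prompt text are hypothetical; this is not the actual body of AnthropicClient.complete().

```python
def build_cached_system(system_prompt: str) -> list[dict]:
    """Wrap a plain system prompt string in one cacheable text block.

    Hypothetical helper for illustration only; the real change is made
    inline in AnthropicClient.complete().
    """
    return [
        {
            "type": "text",
            "text": system_prompt,
            # Requests that the prefix ending at this block be cached.
            "cache_control": {"type": "ephemeral"},
        }
    ]


# Placeholder standing in for the real system prompt plus SKILL.md.
system_blocks = build_cached_system("<system prompt>\n\n<SKILL.md contents>")
# The list would then be passed as:
#   client.messages.create(system=system_blocks, ...)
```

Caching only pays off when the block text is byte-identical between requests, so anything that varies per problem belongs in the user message rather than in this block.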

Cost impact

For a full Vera benchmark run on Claude:

- Before: ~1.1M input tokens at full price
- After: ~18k write + ~1.1M cache read → effective cost ~200k tokens (~5x saving)

Aver runs benefit too (~4k system prompt), though the absolute saving is smaller.

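Cache TTL is 5 minutes (Anthropic ephemeral) and is refreshed each time the cached prefix is read, so a sequential 60-problem run stays covered as long as consecutive requests are less than 5 minutes apart.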
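The arithmetic behind these estimates can be sketched as follows. The function and parameter names are hypothetical. The 1.25× write and 0.1× read multipliers are the ones quoted later in this thread, and the 1,500-token user message is an assumption (the midpoint of the "~1-2k per call" mentioned in review). The sketch also assumes every request after the first is a cache hit.

```python
def effective_input_tokens(
    n_requests: int,
    system_tokens: int,
    user_tokens: int,
    write_multiplier: float = 1.25,
    read_multiplier: float = 0.1,
) -> tuple[int, float]:
    """Return (uncached, cached) input cost in full-price token equivalents."""
    # Without caching, every request pays full price for the whole prompt.
    uncached = n_requests * (system_tokens + user_tokens)
    cached = (
        system_tokens * write_multiplier  # first request writes the cache
        + (n_requests - 1) * system_tokens * read_multiplier  # the rest read it
        + n_requests * user_tokens  # user messages are never cached
    )
    return uncached, cached


before, after = effective_input_tokens(
    n_requests=60, system_tokens=18_000, user_tokens=1_500
)
saving = before / after
print(f"before={before:,} after={after:,.0f} saving={saving:.1f}x")
```

Under these assumptions the run costs roughly 1.17M token equivalents before and about 219k after, a saving of a little over 5x, which is in line with the figures quoted above.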

Test plan

- 485 tests pass
- ruff check clean
- No behavioral change: the cache is transparent to the caller

🤖 Generated with Claude Code

Summary by CodeRabbit

Refactor
- Updated internal system prompt handling for enhanced compatibility.

SKILL.md (~18k tokens) and llms.txt (~4k tokens) are identical across all 60 problems in a benchmark run. With cache_control: ephemeral on the system prompt block, the first request writes the cache and the remaining 59 read from it at 90% discount. For a full Vera run on Claude this reduces effective input cost from ~1.1M tokens to ~200k tokens (~5x saving).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-04-17T16:20:49Z

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: aa15c5ff-2bdb-42e8-8b9c-2b80ac20845e

📥 Commits

Reviewing files that changed from the base of the PR and between 51ab348 and cdc6461.

📒 Files selected for processing (2)

tests/test_models.py
vera_bench/models.py

📝 Walkthrough

Walkthrough

The Anthropic client's complete() call in vera_bench/models.py now wraps the system prompt in a structured message list with ephemeral cache control instead of passing it as a plain string.

Changes

Cohort / File(s)	Summary
Anthropic Client Configuration `vera_bench/models.py`	Modified system prompt argument from plain string to structured message list with cache control metadata (type: ephemeral) in Anthropic API call.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Suggested labels

harness

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Enable Anthropic prompt caching for system prompt' is concise, clear, and directly reflects the main change in the PR—adding cache_control to the Anthropic system prompt block for cost optimisation.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

vera_bench/models.py (1)
100-106: ⚠️ Potential issue | 🟠 Major

Benchmark cost accounting incomplete — cache metrics not captured.

With cache_control: ephemeral enabled on the system prompt, Anthropic's response object splits token usage into three fields: input_tokens (non-cached only), cache_creation_input_tokens (cache writes, billed at premium), and cache_read_input_tokens (cache reads, billed at discount). The current code captures only input_tokens, leaving cache-related token counts unrecorded.

After the first request or on cache hits, LLMResponse.input_tokens will drop to just the user message size, even though the model processes the entire cached system prompt each time. Any downstream cost or throughput analysis built on this field will misrepresent actual API usage and won't match what Anthropic charges.

For a benchmark tool focused on cost measurement, this is misleading. Fold the cache counters into the reported total or expose them as separate fields so the benchmark accounts for all billed tokens:
Suggested fix
        elapsed = time.monotonic() - start
        text = response.content[0].text if response.content else ""
+       usage = response.usage
+       cache_creation = getattr(usage, "cache_creation_input_tokens", 0) or 0
+       cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
        return LLMResponse(
            text=text,
-           input_tokens=response.usage.input_tokens,
+           input_tokens=usage.input_tokens + cache_creation + cache_read,
            output_tokens=response.usage.output_tokens,
            wall_time_s=round(elapsed, 2),
            model=response.model,
        )
Alternatively, add cache_creation_input_tokens and cache_read_input_tokens fields to LLMResponse so cost analysis can apply the correct billing multipliers (1.25× for writes, 0.1× for reads).
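A rough sketch of what that alternative could look like. The dataclass below is a stand-in that reuses the field names visible in the suggested fix; the two cache fields, both helper methods, and the choice of a dataclass are assumptions, not the project's actual LLMResponse definition.

```python
from dataclasses import dataclass


@dataclass
class LLMResponse:
    # Fields taken from the suggested fix above.
    text: str
    input_tokens: int  # under this option: uncached input only
    output_tokens: int
    wall_time_s: float
    model: str
    # Hypothetical additions; default to 0 for providers without caching.
    cache_creation_input_tokens: int = 0
    cache_read_input_tokens: int = 0

    def total_input_tokens(self) -> int:
        """All input tokens the model processed, regardless of billing rate."""
        return (
            self.input_tokens
            + self.cache_creation_input_tokens
            + self.cache_read_input_tokens
        )

    def billed_input_equivalent(self) -> float:
        """Input cost in full-price token equivalents (1x, 1.25x, 0.1x)."""
        return (
            self.input_tokens
            + 1.25 * self.cache_creation_input_tokens
            + 0.1 * self.cache_read_input_tokens
        )


# Example: a cache hit with an 18k-token cached prefix and a 1.5k user message.
hit = LLMResponse(
    text="",
    input_tokens=1_500,
    output_tokens=200,
    wall_time_s=1.0,
    model="example-model",
    cache_read_input_tokens=18_000,
)
```

Keeping the raw counters separate lets downstream analysis choose between throughput (total tokens) and cost (weighted tokens) without re-deriving either from a single folded number.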
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vera_bench/models.py` around lines 100 - 106, The LLMResponse construction
currently only uses response.usage.input_tokens and ignores Anthropic cache
fields; update the LLMResponse handling in vera_bench/models.py (the LLMResponse
construction/constructor) to account for
response.usage.cache_creation_input_tokens and
response.usage.cache_read_input_tokens by either (A) adding and populating two
new fields on LLMResponse named cache_creation_input_tokens and
cache_read_input_tokens and leave input_tokens as the non-cached input, or (B)
folding the cache counters into the reported total input_tokens (i.e.,
input_tokens += cache_creation_input_tokens + cache_read_input_tokens) so
downstream cost calculations see all billed tokens; ensure you reference
response.usage.cache_creation_input_tokens and
response.usage.cache_read_input_tokens when populating the new fields or
computing the total.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: fb9d1b2a-2baa-4095-bafc-54349f36c3da

📥 Commits

Reviewing files that changed from the base of the PR and between c82d28b and 51ab348.

📒 Files selected for processing (1)

vera_bench/models.py

codecov · 2026-04-17T16:26:42Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.30%. Comparing base (c82d28b) to head (cdc6461).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #60      +/-   ##
==========================================
+ Coverage   83.27%   83.30%   +0.03%     
==========================================
  Files          10       10              
  Lines        1363     1366       +3     
==========================================
+ Hits         1135     1138       +3     
  Misses        228      228

Flag	Coverage Δ
python	`83.30% <100.00%> (+0.03%)`	⬆️

aallan · 2026-04-17T16:32:55Z

Hey @jasisz — thanks for this! Substantive change looks great; verified all 485 tests pass locally on the branch, and the caching semantics match our access pattern cleanly (the system prompt is f"{SYSTEM_PROMPT}\n\n{skill_md}" — byte-identical across all 60 problems in a run).

Quick heads-up: CodeRabbit flagged one outside-diff finding you may have missed — it's stashed in the review's "⚠️ Outside diff range comments" collapsible on this review rather than posted inline, which makes it easy to skip.

The finding: with cache_control: ephemeral enabled, Anthropic splits response.usage into three input-token counters:

Field	Meaning	Billing rate
`input_tokens`	uncached (user message only, for us)	1×
`cache_creation_input_tokens`	written to cache on miss	1.25×
`cache_read_input_tokens`	read from cache on hit	0.1×

The current LLMResponse only captures input_tokens, which after this PR merges will drop to ~1-2k per call (user message only). Downstream consumers (runner.py → ProblemResult.input_tokens → JSONL logs) would then show misleadingly tiny numbers — anyone doing cost analysis from the JSONLs would be off by ~10×.

Since the whole point of this PR is cost reduction, getting the accounting right here feels worth including. Would you mind folding in CodeRabbit's suggested Option B on the same PR?

elapsed = time.monotonic() - start
text = response.content[0].text if response.content else ""
usage = response.usage
cache_creation = getattr(usage, "cache_creation_input_tokens", 0) or 0
cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
return LLMResponse(
    text=text,
    input_tokens=usage.input_tokens + cache_creation + cache_read,
    output_tokens=response.usage.output_tokens,
    wall_time_s=round(elapsed, 2),
    model=response.model,
)

One gotcha for the test: tests/test_models.py::TestAnthropicClient::test_complete_mock explicitly sets mock_resp.usage.input_tokens = 100 but doesn't set the cache fields. MagicMock auto-vivifies unset attributes as further MagicMocks (truthy, not int), so the `or 0` fallback never fires and usage.input_tokens + cache_creation + cache_read comes out as a MagicMock rather than an int, which would break any assertion on the token count. Easy fix, just add:

mock_resp.usage.cache_creation_input_tokens = 0
mock_resp.usage.cache_read_input_tokens = 0

to the mock setup around line 105.
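The pitfall can be reproduced in isolation with the standard library alone. In the sketch below, `total_input_tokens` is a hypothetical helper that mirrors the summing logic from the suggested fix; it is not a function in vera_bench.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock


def total_input_tokens(usage) -> int:
    """Sum uncached, cache-write and cache-read input tokens."""
    cache_creation = getattr(usage, "cache_creation_input_tokens", 0) or 0
    cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
    return usage.input_tokens + cache_creation + cache_read


# Mock without the cache fields: getattr() returns auto-created MagicMocks,
# which are truthy, so the `or 0` fallback is skipped and the result is
# not a plain int.
bare = MagicMock()
bare.input_tokens = 100
bare_total = total_input_tokens(bare)

# Mock with the cache fields pinned to 0, as proposed above.
fixed = MagicMock()
fixed.input_tokens = 100
fixed.cache_creation_input_tokens = 0
fixed.cache_read_input_tokens = 0
fixed_total = total_input_tokens(fixed)

# A plain object lacking the cache attributes (for example an SDK version
# that predates them) falls back to 0 through getattr's default.
legacy = SimpleNamespace(input_tokens=100)
legacy_total = total_input_tokens(legacy)
```

Only the bare mock misbehaves, which is why the fix belongs in the test setup rather than in the production code path.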

Happy to push that as a commit on your branch if you've got "Allow edits from maintainers" enabled and that's easier — or you can roll it in yourself, whichever you prefer. Option A (separate cache_creation_input_tokens / cache_read_input_tokens fields on LLMResponse) would be nicer for proper cost analysis later, but it's a bigger surface-area change — I'd treat that as a follow-up if we want it.

After enabling cache_control, Anthropic splits usage into input_tokens (uncached), cache_creation_input_tokens (write), and cache_read_input_tokens (read). Sum all three so JSONL logs report the true total input tokens, not just the uncached portion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

jasisz · 2026-04-17T16:40:05Z

Good catch — pushed the accounting fix in cdc6461. Didn't realize Anthropic splits the usage counters when caching is enabled, that's a gotcha worth knowing about. Thanks for the detailed breakdown!

aallan

The only thing worth mentioning, and I'm doing it just so it's written down somewhere, is that historical JSONL cost comparisons pre- vs post-caching will be slightly apples-to-oranges: pre-caching, every token in input_tokens was billed at 1×, while post-caching the field is the sum of three counters billed at different rates (1× uncached, 1.25× cache write, 0.1× cache read), so the same token total no longer implies the same cost. Not a correctness issue, just a subtle analytical caveat for anyone doing long-baseline cost trending.

That said, looks good to me!

aallan · 2026-04-18T11:49:57Z

Follow-up for the other providers filed as #61 — OpenAI has automatic caching already firing on our calls (we're just not instrumenting it, so the savings aren't visible in the JSONLs), and Moonshot has Context Caching but needs a bigger refactor to wire in. Phase 1 there (OpenAI cached-token visibility) is a small parallel of what you did here; Phase 2 (Moonshot) is parked until benchmark cadence justifies it.
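For the OpenAI side, a minimal sketch of what "cached-token visibility" might involve. It assumes the Chat Completions usage object reports cache hits as `prompt_tokens_details.cached_tokens`; check this against the SDK version in use. The helper name is hypothetical and the code is not taken from #61.

```python
from types import SimpleNamespace


def openai_cached_tokens(usage) -> int:
    """Return the number of prompt tokens served from cache, or 0 if unknown."""
    # Both levels are read defensively because the details object may be
    # missing or None when a response carries no cache information.
    details = getattr(usage, "prompt_tokens_details", None)
    return getattr(details, "cached_tokens", 0) or 0


# Stand-ins for SDK usage objects, built by hand for illustration.
with_cache = SimpleNamespace(
    prompt_tokens=19_500,
    prompt_tokens_details=SimpleNamespace(cached_tokens=17_920),
)
without_details = SimpleNamespace(prompt_tokens=19_500)

cached = openai_cached_tokens(with_cache)
uncached = with_cache.prompt_tokens - cached
```

Unlike the Anthropic counters, `prompt_tokens` here is assumed to already include the cached portion, so the cached count would be logged alongside the total rather than added to it.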

…N_ISSUES

The dependency-audit job started failing on PR aallan#62 because actions/setup-python@v6 bakes pip 26.0.1 into its Python 3.12 image, and pip 26.0.1 has CVE-2026-3219 (archive handling). The fix landed in pip 26.1 on 2026-04-26 but won't reach the runner image until GitHub refreshes the toolchain.

Workaround mirrors aallan/vera#537: a `pip install --upgrade pip` step before pip-audit runs, pulling pip 26.1 from PyPI to replace the bundled 26.0.1. Inline comment in ci.yml points at the tracking issue (aallan#63) so the workaround doesn't quietly outlive its reason.

Also opens KNOWN_ISSUES.md as the catalogue location for active workarounds, dev-env gotchas, and analytical caveats, each with an explicit "removal trigger" so cleanup is straightforward later. Initial entries:
- The CI workaround above (aallan#63)
- assets/results-graph.png pinned to v0.0.7 content until the v0.0.9 narrative writeup
- input_tokens semantic shift across PR aallan#60's prompt-caching merge (analytical caveat for cost trending across that boundary)
- /opt/homebrew/bin/vera is not the Vera programming language (dev-env collision with an unrelated Homebrew package)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jasisz requested a review from aallan as a code owner April 17, 2026 16:20

coderabbitai Bot reviewed Apr 17, 2026

aallan approved these changes Apr 17, 2026

aallan merged commit bd9b6d5 into aallan:main Apr 17, 2026
10 checks passed

aallan mentioned this pull request Apr 18, 2026

Prompt caching for other providers (OpenAI instrumentation, Moonshot Context Caching) #61

Open

5 tasks

aallan mentioned this pull request Apr 29, 2026

ci: work around CVE-2026-3219 in setup-python's bundled pip #64

Merged

5 tasks

This was referenced May 5, 2026

Baseline JSONL: populate bench_version field (currently empty) #66

Closed

Populate bench_version on baseline JSONL output (closes #66) #67

Merged

aallan merged 2 commits into aallan:main from jasisz:prompt-caching
