Skip to content

Enable Anthropic prompt caching for system prompt#60

Merged
aallan merged 2 commits into
aallan:mainfrom
jasisz:prompt-caching
Apr 17, 2026
Merged

Enable Anthropic prompt caching for system prompt#60
aallan merged 2 commits into
aallan:mainfrom
jasisz:prompt-caching

Conversation

@jasisz

@jasisz jasisz commented Apr 17, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Add cache_control: ephemeral to the Anthropic system prompt block
  • SKILL.md (~18k tokens) and llms.txt (~4k tokens) are identical across all 60 problems — first request writes the cache, remaining 59 read at 90% discount
  • Single-line change in AnthropicClient.complete(), no behavioral difference

Cost impact

For a full Vera benchmark run on Claude:

  • Before: ~1.1M input tokens at full price
  • After: ~18k write + ~1.1M cache read → effective cost ~200k tokens (~5x saving)

Aver runs benefit too (~4k system prompt), though the absolute saving is smaller.

Cache TTL is 5 minutes (Anthropic ephemeral), which comfortably covers a sequential 60-problem run.

Test plan

  • 485 tests pass
  • ruff check clean
  • No behavioral change — cache is transparent to the caller

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Refactor
    • Updated internal system prompt handling for enhanced compatibility.

SKILL.md (~18k tokens) and llms.txt (~4k tokens) are identical across
all 60 problems in a benchmark run. With cache_control: ephemeral on
the system prompt block, the first request writes the cache and the
remaining 59 read from it at 90% discount.

For a full Vera run on Claude this reduces effective input cost from
~1.1M tokens to ~200k tokens (~5x saving).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jasisz jasisz requested a review from aallan as a code owner April 17, 2026 16:20
@coderabbitai

coderabbitai Bot commented Apr 17, 2026

Copy link
Copy Markdown

Warning

Rate limit exceeded

@jasisz has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 40 minutes and 48 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 40 minutes and 48 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: aa15c5ff-2bdb-42e8-8b9c-2b80ac20845e

📥 Commits

Reviewing files that changed from the base of the PR and between 51ab348 and cdc6461.

📒 Files selected for processing (2)
  • tests/test_models.py
  • vera_bench/models.py
📝 Walkthrough

Walkthrough

The Anthropic client's complete() call in vera_bench/models.py now wraps the system prompt in a structured message list with ephemeral cache control instead of passing it as a plain string.

Changes

Cohort / File(s) Summary
Anthropic Client Configuration
vera_bench/models.py
Modified system prompt argument from plain string to structured message list with cache control metadata (type: ephemeral) in Anthropic API call.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Suggested labels

harness

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Enable Anthropic prompt caching for system prompt' is concise, clear, and directly reflects the main change in the PR—adding cache_control to the Anthropic system prompt block for cost optimisation.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
vera_bench/models.py (1)

100-106: ⚠️ Potential issue | 🟠 Major

Benchmark cost accounting incomplete — cache metrics not captured.

With cache_control: ephemeral enabled on the system prompt, Anthropic's response object splits token usage into three fields: input_tokens (non-cached only), cache_creation_input_tokens (cache writes, billed at premium), and cache_read_input_tokens (cache reads, billed at discount). The current code captures only input_tokens, leaving cache-related token counts unrecorded.

After the first request or on cache hits, LLMResponse.input_tokens will drop to just the user message size, even though the model processes the entire cached system prompt each time. Any downstream cost or throughput analysis built on this field will misrepresent actual API usage and won't match what Anthropic charges.

For a benchmark tool focused on cost measurement, this is misleading. Fold the cache counters into the reported total or expose them as separate fields so the benchmark accounts for all billed tokens:

Suggested fix
        elapsed = time.monotonic() - start
        text = response.content[0].text if response.content else ""
+       usage = response.usage
+       cache_creation = getattr(usage, "cache_creation_input_tokens", 0) or 0
+       cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
        return LLMResponse(
            text=text,
-           input_tokens=response.usage.input_tokens,
+           input_tokens=usage.input_tokens + cache_creation + cache_read,
            output_tokens=response.usage.output_tokens,
            wall_time_s=round(elapsed, 2),
            model=response.model,
        )

Alternatively, add cache_creation_input_tokens and cache_read_input_tokens fields to LLMResponse so cost analysis can apply the correct billing multipliers (1.25× for writes, 0.1× for reads).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vera_bench/models.py` around lines 100 - 106, The LLMResponse construction
currently only uses response.usage.input_tokens and ignores Anthropic cache
fields; update the LLMResponse handling in vera_bench/models.py (the LLMResponse
construction/constructor) to account for
response.usage.cache_creation_input_tokens and
response.usage.cache_read_input_tokens by either (A) adding and populating two
new fields on LLMResponse named cache_creation_input_tokens and
cache_read_input_tokens and leave input_tokens as the non-cached input, or (B)
folding the cache counters into the reported total input_tokens (i.e.,
input_tokens += cache_creation_input_tokens + cache_read_input_tokens) so
downstream cost calculations see all billed tokens; ensure you reference
response.usage.cache_creation_input_tokens and
response.usage.cache_read_input_tokens when populating the new fields or
computing the total.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@vera_bench/models.py`:
- Around line 100-106: The LLMResponse construction currently only uses
response.usage.input_tokens and ignores Anthropic cache fields; update the
LLMResponse handling in vera_bench/models.py (the LLMResponse
construction/constructor) to account for
response.usage.cache_creation_input_tokens and
response.usage.cache_read_input_tokens by either (A) adding and populating two
new fields on LLMResponse named cache_creation_input_tokens and
cache_read_input_tokens and leave input_tokens as the non-cached input, or (B)
folding the cache counters into the reported total input_tokens (i.e.,
input_tokens += cache_creation_input_tokens + cache_read_input_tokens) so
downstream cost calculations see all billed tokens; ensure you reference
response.usage.cache_creation_input_tokens and
response.usage.cache_read_input_tokens when populating the new fields or
computing the total.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: fb9d1b2a-2baa-4095-bafc-54349f36c3da

📥 Commits

Reviewing files that changed from the base of the PR and between c82d28b and 51ab348.

📒 Files selected for processing (1)
  • vera_bench/models.py

@codecov

codecov Bot commented Apr 17, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.30%. Comparing base (c82d28b) to head (cdc6461).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #60      +/-   ##
==========================================
+ Coverage   83.27%   83.30%   +0.03%     
==========================================
  Files          10       10              
  Lines        1363     1366       +3     
==========================================
+ Hits         1135     1138       +3     
  Misses        228      228              
Flag Coverage Δ
python 83.30% <100.00%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@aallan

aallan commented Apr 17, 2026

Copy link
Copy Markdown
Owner

Hey @jasisz — thanks for this! Substantive change looks great; verified all 485 tests pass locally on the branch, and the caching semantics match our access pattern cleanly (the system prompt is f"{SYSTEM_PROMPT}\n\n{skill_md}" — byte-identical across all 60 problems in a run).

Quick heads-up: CodeRabbit flagged one outside-diff finding you may have missed — it's stashed in the review's "⚠️ Outside diff range comments" collapsible on this review rather than posted inline, which makes it easy to skip.

The finding: with cache_control: ephemeral enabled, Anthropic splits response.usage into three input-token counters:

Field Meaning Billing rate
input_tokens uncached (user message only, for us)
cache_creation_input_tokens written to cache on miss 1.25×
cache_read_input_tokens read from cache on hit 0.1×

The current LLMResponse only captures input_tokens, which after this PR merges will drop to ~1-2k per call (user message only). Downstream consumers (runner.pyProblemResult.input_tokens → JSONL logs) would then show misleadingly tiny numbers — anyone doing cost analysis from the JSONLs would be off by ~10×.

Since the whole point of this PR is cost reduction, getting the accounting right here feels worth including. Would you mind folding in CodeRabbit's suggested Option B on the same PR?

elapsed = time.monotonic() - start
text = response.content[0].text if response.content else ""
usage = response.usage
cache_creation = getattr(usage, "cache_creation_input_tokens", 0) or 0
cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
return LLMResponse(
    text=text,
    input_tokens=usage.input_tokens + cache_creation + cache_read,
    output_tokens=response.usage.output_tokens,
    wall_time_s=round(elapsed, 2),
    model=response.model,
)

One gotcha for the test: tests/test_models.py::TestAnthropicClient::test_complete_mock explicitly sets mock_resp.usage.input_tokens = 100 but doesn't set the cache fields. MagicMock auto-vivifies unset attributes as further MagicMocks (truthy, not int), so usage.input_tokens + cache_creation + cache_read would blow up on addition. Easy fix — just add:

mock_resp.usage.cache_creation_input_tokens = 0
mock_resp.usage.cache_read_input_tokens = 0

to the mock setup around line 105.

Happy to push that as a commit on your branch if you've got "Allow edits from maintainers" enabled and that's easier — or you can roll it in yourself, whichever you prefer. Option A (separate cache_creation_input_tokens / cache_read_input_tokens fields on LLMResponse) would be nicer for proper cost analysis later, but it's a bigger surface-area change — I'd treat that as a follow-up if we want it.

After enabling cache_control, Anthropic splits usage into
input_tokens (uncached), cache_creation_input_tokens (write),
and cache_read_input_tokens (read). Sum all three so JSONL logs
report the true total input tokens, not just the uncached portion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jasisz

jasisz commented Apr 17, 2026

Copy link
Copy Markdown
Contributor Author

Good catch — pushed the accounting fix in cdc6461. Didn't realize Anthropic splits the usage counters when caching is enabled, that's a gotcha worth knowing about. Thanks for the detailed breakdown!

@aallan aallan left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing worth mentioning, and I'm doing it just so it's written down somewhere, is that historical JSONL cost comparisons pre- vs post-caching will be slightly apples-to-oranges, because pre-caching input_tokens included the system prompt every call while post-caching it's the total billed amount (which now has the 1.25×/0.1× weighting baked in implicitly). Not a correctness issue, just a subtle analytical caveat for anyone doing long-baseline cost trending.

That said, looks good to me!

@aallan aallan merged commit bd9b6d5 into aallan:main Apr 17, 2026
10 checks passed
@aallan

aallan commented Apr 18, 2026

Copy link
Copy Markdown
Owner

Follow-up for the other providers filed as #61 — OpenAI has automatic caching already firing on our calls (we're just not instrumenting it, so the savings aren't visible in the JSONLs), and Moonshot has Context Caching but needs a bigger refactor to wire in. Phase 1 there (OpenAI cached-token visibility) is a small parallel of what you did here; Phase 2 (Moonshot) is parked until benchmark cadence justifies it.

aallan added a commit to jasisz/vera-bench that referenced this pull request Apr 29, 2026
…N_ISSUES

The dependency-audit job started failing on PR aallan#62 because
actions/setup-python@v6 bakes pip 26.0.1 into its Python 3.12 image,
and pip 26.0.1 has CVE-2026-3219 (archive handling). The fix landed
in pip 26.1 on 2026-04-26 but won't reach the runner image until
GitHub refreshes the toolchain.

Workaround mirrors aallan/vera#537: a `pip install --upgrade pip`
step before pip-audit runs, pulling pip 26.1 from PyPI to replace
the bundled 26.0.1. Inline comment in ci.yml points at the tracking
issue (aallan#63) so the workaround doesn't quietly outlive its reason.

Also opens KNOWN_ISSUES.md as the catalogue location for active
workarounds, dev-env gotchas, and analytical caveats — each with an
explicit "removal trigger" so cleanup is straightforward later.

Initial entries:
- The CI workaround above (aallan#63)
- assets/results-graph.png pinned to v0.0.7 content until the
  v0.0.9 narrative writeup
- input_tokens semantic shift across PR aallan#60's prompt-caching merge
  (analytical caveat for cost trending across that boundary)
- /opt/homebrew/bin/vera is not the Vera programming language
  (dev-env collision with an unrelated Homebrew package)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants