feat(api): multi-turn caching + per-session cost tracker (#45)
- Add `cache_control` on conversation turns so agentic tool loops cache all prior messages, saving ~90% on repeated input tokens
- Add `CostTracker` that accumulates per-session token usage and computes real USD costs using Sonnet pricing
- Include `session_cost` summary in every SSE `done` event
- Export `CacheControlEphemeral` for `ContentBlock` caching
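As a sanity check on the ~90% figure in the first bullet: cache reads are billed at a deep discount relative to fresh input tokens. A minimal sketch, assuming a $0.30/M cache-read rate and a $3.00/M base input rate (the input rate is an illustration and is not stated in this PR):

```go
package main

import "fmt"

// cacheSavings returns the fractional discount from reading tokens out of
// the prompt cache instead of paying the normal input-token rate.
func cacheSavings(inputPerM, cacheReadPerM float64) float64 {
	return 1 - cacheReadPerM/inputPerM
}

func main() {
	// Assumed rates in USD per million tokens; the $3.00 input rate is
	// illustrative, not a figure taken from this PR.
	fmt.Printf("%.0f%% saved on cached input tokens\n", cacheSavings(3.00, 0.30)*100)
}
```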
📝 Walkthrough

Adds an AI token-cost tracker and pricing constants, extends content blocks with cache-control, integrates per-session cost accumulation and multi-turn caching, exposes `session_cost` in chat SSE "done" events, and replaces README content with a streamlined project overview.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant Server
    participant Session
    participant CostTracker
    Client->>Server: POST /chat (messages)
    activate Server
    Server->>Session: orchestrator.process(messages)
    activate Session
    Session->>Session: windowedMessages() -> addTurnCaching()
    Note over Session: annotate last turn's ContentBlock.CacheControl
    loop AI completions
        Session->>CostTracker: Add(TokenUsage)
        activate CostTracker
        CostTracker->>CostTracker: accumulate tokens & costs
        deactivate CostTracker
    end
    Session-->>Server: return session (with Cost)
    Server->>CostTracker: Cost() / Summary()
    activate CostTracker
    CostTracker-->>Server: cost metrics
    deactivate CostTracker
    Server-->>Client: SSE stream ... "done" (status: complete, session_cost)
    deactivate Server
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/ai/cost.go`:
- Around line 17-23: CostTracker currently uses hardcoded Sonnet rates; update
it to accept the selected model (or a resolved pricing table) so costs are
computed per-model: add a Model string field or PricingRates struct to
CostTracker, update its constructor/initializers where CostTracker is created
(thread s.model from ChatStream/session into the new field), and modify all
cost-calculation methods (the functions around the struct, previously using
Sonnet constants) to look up rates based on that model field instead of using
Sonnet-only constants so Haiku/Opus sessions report correct session_costs.
In `@internal/orchestrator/session.go`:
- Around line 631-655: windowedMessages() currently searches from len(msgs)-2
and bails out when len(msgs) < 4, causing the cache breakpoint to be placed one
user turn behind; change the logic to find and mark the most recent user message
(scan from len(msgs)-1 backwards to set cacheIdx) and relax/remove the overly
strict length check (e.g., only skip when msgs too small to have any user
message), then clone that message (m := msgs[cacheIdx]) and set
last.CacheControl = &ai.CacheControlEphemeral as before so the current user
turn—identified by the last user-role message—is the cached breakpoint.
In `@internal/server/chat.go`:
- Around line 195-198: The code emits two "done" SSEs with session_cost (once
from handleStreamEvent and again on the close path), causing duplicate session
cost in the UI; fix by tracking whether the terminal "done" has already been
sent and suppress the duplicate: introduce a boolean flag (e.g., doneSent or
sessionCostSent) in the streaming handler scope, set it to true when
writeSSE(..., "done", {... "session_cost": session.Cost.Summary()}) is called
(the occurrence in handleStreamEvent), and on the close/cleanup path (the
writeSSE call shown in the diff and the similar block at lines ~281-289) check
that flag and either skip emitting another "done" or emit "done" without the
session_cost field if doneSent is true.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1ce5b089-8ade-43e8-9458-e70d8c8a62bb
📒 Files selected for processing (4)
- internal/ai/cost.go
- internal/ai/models.go
- internal/orchestrator/session.go
- internal/server/chat.go
```go
// CostTracker accumulates token usage and computes costs.
type CostTracker struct {
	InputTokens              int
	OutputTokens             int
	CacheCreationInputTokens int
	CacheReadInputTokens     int
}
```
Make cost calculation depend on the selected model.
Sessions already carry s.model into ChatStream in internal/orchestrator/session.go Line 303, but this tracker always bills at Sonnet rates. Any Haiku or Opus session will emit a wrong session_cost summary even though separate model IDs/constants already exist. Thread the selected model, or a resolved pricing table, into CostTracker instead of hardcoding Sonnet here.
Also applies to: 36-50
```go
if len(msgs) < 4 {
	return msgs // too short to benefit from caching
}

// Find the last user message that isn't the very last message
// (we want to cache everything before the newest exchange).
cacheIdx := -1
for i := len(msgs) - 2; i >= 0; i-- {
	if msgs[i].Role == "user" {
		cacheIdx = i
		break
	}
}
if cacheIdx < 0 {
	return msgs
}

// Clone the message and add cache_control to its last content block.
m := msgs[cacheIdx]
blocks := make([]ai.ContentBlock, len(m.Content))
copy(blocks, m.Content)
if len(blocks) > 0 {
	last := &blocks[len(blocks)-1]
	last.CacheControl = &ai.CacheControlEphemeral
}
```
Cache the current user turn, not the previous one.
windowedMessages() is only called immediately before ChatStream, after a user/tool_result message has already been appended. Starting the scan at len(msgs)-2 and returning early for len(msgs) < 4 means the first request never writes a cache breakpoint, and later requests keep the breakpoint one user turn behind. In the common user -> tool_use -> tool_result -> final answer flow, the second API call cannot read from cache at all.
🔧 Suggested fix

```diff
 func addTurnCaching(msgs []ai.Message) []ai.Message {
-	if len(msgs) < 4 {
-		return msgs // too short to benefit from caching
+	if len(msgs) == 0 {
+		return msgs
 	}
-	// Find the last user message that isn't the very last message
-	// (we want to cache everything before the newest exchange).
+	// windowedMessages is only called right before ChatStream, so the
+	// newest user message/tool_result is the stable prefix we want to
+	// write for the next loop iteration.
 	cacheIdx := -1
-	for i := len(msgs) - 2; i >= 0; i-- {
+	for i := len(msgs) - 1; i >= 0; i-- {
 		if msgs[i].Role == "user" {
 			cacheIdx = i
 			break
 		}
 	}
```
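Putting the suggested fix together, the corrected helper behaves as below. The types here are simplified stand-ins for the real `internal/ai` definitions, not the project's actual structs:

```go
package main

import "fmt"

// Simplified stand-ins for the internal/ai types (assumptions, not the
// real struct definitions).
type CacheControl struct{ Type string }

type ContentBlock struct {
	Text         string
	CacheControl *CacheControl
}

type Message struct {
	Role    string
	Content []ContentBlock
}

var cacheControlEphemeral = CacheControl{Type: "ephemeral"}

// addTurnCaching marks the newest user message (which may carry a
// tool_result) as the cache breakpoint, per the suggested fix above. It
// clones the message so the stored history is not mutated.
func addTurnCaching(msgs []Message) []Message {
	if len(msgs) == 0 {
		return msgs
	}
	cacheIdx := -1
	for i := len(msgs) - 1; i >= 0; i-- {
		if msgs[i].Role == "user" {
			cacheIdx = i
			break
		}
	}
	if cacheIdx < 0 {
		return msgs
	}
	m := msgs[cacheIdx]
	blocks := make([]ContentBlock, len(m.Content))
	copy(blocks, m.Content)
	if len(blocks) > 0 {
		blocks[len(blocks)-1].CacheControl = &cacheControlEphemeral
	}
	m.Content = blocks
	out := make([]Message, len(msgs))
	copy(out, msgs)
	out[cacheIdx] = m
	return out
}

func main() {
	msgs := []Message{
		{Role: "user", Content: []ContentBlock{{Text: "run tests"}}},
		{Role: "assistant", Content: []ContentBlock{{Text: "tool_use"}}},
		{Role: "user", Content: []ContentBlock{{Text: "tool_result"}}},
	}
	out := addTurnCaching(msgs)
	// The breakpoint lands on the newest user turn, even in a short
	// user -> tool_use -> tool_result window.
	fmt.Println(out[2].Content[0].CacheControl != nil)
}
```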
```go
writeSSE(w, flusher, "done", map[string]interface{}{
	"status":       "complete",
	"session_cost": session.Cost.Summary(),
})
```
Emit session_cost on only one terminal done event.
handleStreamEvent() already sends a done event with session_cost, and the close path sends another done with the same field. The current SSE bridge forwards every done event, so once the UI starts rendering session_cost this will show the summary twice for a single request. Either suppress the close-path done after a normal completion, or keep session_cost only on the streamed done.
Also applies to: 281-289
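The suppression the comment describes can be sketched with a small wrapper around the terminal write. The `doneEmitter` type and its method names are hypothetical stand-ins, not the actual `internal/server` API:

```go
package main

import "fmt"

// doneEmitter guards the terminal SSE write so session_cost is attached to
// at most one "done" event per request. The emit callback stands in for the
// real writeSSE call.
type doneEmitter struct {
	doneSent bool
	emit     func(event string, payload map[string]interface{})
}

func (d *doneEmitter) sendDone(sessionCost interface{}) {
	if d.doneSent {
		// Close-path duplicate after a normal completion: skip it
		// (alternatively, emit "done" without session_cost).
		return
	}
	d.doneSent = true
	d.emit("done", map[string]interface{}{
		"status":       "complete",
		"session_cost": sessionCost,
	})
}

func main() {
	var events []map[string]interface{}
	e := &doneEmitter{emit: func(_ string, p map[string]interface{}) {
		events = append(events, p)
	}}
	e.sendDone("$0.42") // streamed done from handleStreamEvent
	e.sendDone("$0.42") // close-path done is suppressed
	fmt.Println(len(events)) // prints 1
}
```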
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@README.md`:
- Around line 83-92: The fenced code block showing REPL commands lacks a
language specifier; update the opening fence for that REPL commands block (the
triple backticks before the list of commands) to include a suitable language
label such as text or bash (e.g., change ``` to ```text) so documentation
viewers get proper syntax highlighting; ensure only the opening fence is
modified and the content of the commands (the listed /mode, /switch, /memory,
/reflect, /cost, /clear, /quit entries) remains unchanged.
- Around line 237-263: The fenced architecture map code block in README.md lacks
a language specifier; update the opening triple-backtick for that block (the one
showing "cmd/ghost/main.go CLI + daemon bootstrap" and the internal/ tree) to
include a language tag such as "text" (e.g., change ``` to ```text) so markdown
renderers apply monospaced formatting and preserve alignment.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: fd665f32-7f6a-41a1-bfe2-2b766060102a
⛔ Files ignored due to path filters (1)
`assets/ghost.png` is excluded by `!**/*.png`
📒 Files selected for processing (1)
README.md
````diff
 ```
 /mode <name>          Switch mode
 /switch <project>     Switch project
 /memory search <q>    Search memories
-/memory add <text>    Add a manual memory
-/reflect              Force memory consolidation
-/context              Show project context
-/cost                 Show token usage and spend
-/clear                Clear conversation (keep memories)
+/memory add <text>    Manual memory
+/reflect              Force consolidation
+/cost                 Token usage + spend
+/clear                Clear conversation
 /quit                 Exit
 ```
````
Add language specifier to fenced code block.
The REPL commands code block is missing a language specifier, which reduces readability and syntax highlighting support in documentation viewers.
📝 Proposed fix

````diff
-```
+```text
 /mode <name>          Switch mode
 /switch <project>     Switch project
 /memory search <q>    Search memories
````
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)
[warning] 83-83: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
````diff
 ```
 cmd/ghost/main.go  CLI + daemon bootstrap
 internal/
-  ai/              Claude API client + streaming + tool_use
+  ai/              Claude API client, streaming, tool_use, cost tracking
   memory/          SQLite + FTS5 + vector search + time-decay
   tool/            Tool registry + 10 built-in executors
-  orchestrator/    Multi-project session manager
-  reflection/      Haiku-based memory consolidation
+  orchestrator/    Multi-project sessions, context compression, multi-turn caching
+  reflection/      Haiku memory consolidation
   prompt/          3-block cached system prompt
-  mode/            Operating mode definitions
+  mode/            Operating modes
   project/         Auto-detection (language, tests, git)
   config/          Layered YAML/env/flag config (koanf)
-  tui/             Terminal REPL with streaming
+  tui/             Terminal REPL
   server/          HTTP REST API (chi)
   mcpserver/       MCP server (stdio)
-  telegram/        Telegram bot + approval forwarding
-  google/          Google Calendar + Gmail OAuth2 client
-  github/          Notification monitor + P0-P4 priority
-  scheduler/       Cron + one-shot reminders (gocron)
-  briefing/        Daily briefing aggregator
+  telegram/        Bot, approvals, session management
+  google/          Calendar + Gmail OAuth2
+  github/          Notification monitor
+  scheduler/       Cron + reminders (gocron)
+  briefing/        Daily briefing
   embedding/       Ollama async worker
-  mdv2/            MarkdownV2 escaping utilities
-  voice/           Voice pipeline interfaces (WIP)
+  mdv2/            MarkdownV2 escaping
+  voice/           Voice pipeline (WIP)
   provider/        Interface contracts
   audit/           Per-action cost + token logging
   migrations/      Embedded SQLite schema
 vscode-ghost/      VSCode extension (TypeScript)
 ```
````
Add language specifier to fenced code block.
The architecture map code block is missing a language specifier, which reduces readability in documentation viewers.
📝 Proposed fix
-```
+```text
cmd/ghost/main.go CLI + daemon bootstrap
internal/📝 Committable suggestion
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)
[warning] 237-237: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
* fix(vscode): complete tool output cleanup - remove XML generation and regex filter

  Completes fixes from commits 21e5fa6 and 4f38c54:
  - Remove remaining XML tag generation in session.go (lines 422-426)
  - Remove the now-unnecessary regex filter in webview-html.ts (line 524)

  All tool output now flows cleanly through tool_delta events with ID-based matching. No XML tags are generated or filtered. Tool indicators work correctly with concurrent tools and proper timing display.

* fix: revert to #45 state - undo #46 through #52 and related fixes

  Reverts 13 commits (55d2cc0..44e3a17) that introduced regressions in the VSCode extension webview, PDF/token features, and TUI. Restores the codebase to the stable multi-turn caching state (#45).
Summary

- `cache_control` on the last user turn before each API call so agentic tool loops cache all prior messages (~90% savings)
- Sonnet pricing ($15/M output, $0.30/M cache read)
- `session_cost` in the `done` event for display in VSCode/TUI

Test plan

- `go vet ./...` clean

Summary by CodeRabbit

- New Features
- Documentation