
feat(api): multi-turn caching + per-session cost tracker #45

Merged
wcatz merged 2 commits into main from feat/api-optimizations
Mar 15, 2026

@wcatz wcatz (Owner) commented Mar 15, 2026

Summary

  • Multi-turn conversation caching: adds cache_control on the last user turn
    before each API call so agentic tool loops cache all prior messages (~90% savings
    on repeated input tokens, since cache reads cost $0.30/M versus $3/M for fresh input)
  • CostTracker: accumulates per-session token usage with real Sonnet pricing ($3/M input,
    $15/M output, $0.30/M cache read)
  • Session cost summary included in every SSE done event for display in VSCode/TUI
  • Shows actual cost, savings from caching, and cache hit rate
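
The pricing arithmetic behind these bullets can be sketched as follows. This is a minimal sketch, not the PR's `internal/ai/cost.go`: the field names follow the `CostTracker` struct quoted in the review, the three rates are the ones listed above, and the cache-write rate (`cacheWritePerM`) is an assumption because the PR description does not state it. The method bodies are guesses at what `Cost`, `CostWithoutCache`, `Savings`, and `CacheHitRate` compute.

```go
package main

import "fmt"

// Sonnet rates in USD per million tokens, as listed in the PR description.
const (
	inputPerM     = 3.00
	outputPerM    = 15.00
	cacheReadPerM = 0.30
	// Assumption: cache writes are billed at 1.25x the input rate.
	// The PR description does not list this rate.
	cacheWritePerM = 3.75
)

// CostTracker accumulates token usage across the API calls of one session.
type CostTracker struct {
	InputTokens              int
	OutputTokens             int
	CacheCreationInputTokens int
	CacheReadInputTokens     int
}

// Add folds one response's usage counters into the running totals.
func (c *CostTracker) Add(in, out, cacheWrite, cacheRead int) {
	c.InputTokens += in
	c.OutputTokens += out
	c.CacheCreationInputTokens += cacheWrite
	c.CacheReadInputTokens += cacheRead
}

// Cost returns the actual spend in USD.
func (c *CostTracker) Cost() float64 {
	return (float64(c.InputTokens)*inputPerM +
		float64(c.OutputTokens)*outputPerM +
		float64(c.CacheCreationInputTokens)*cacheWritePerM +
		float64(c.CacheReadInputTokens)*cacheReadPerM) / 1e6
}

// CostWithoutCache prices every input token at the full input rate.
func (c *CostTracker) CostWithoutCache() float64 {
	allInput := c.InputTokens + c.CacheCreationInputTokens + c.CacheReadInputTokens
	return (float64(allInput)*inputPerM + float64(c.OutputTokens)*outputPerM) / 1e6
}

// Savings is the difference between uncached and actual cost.
func (c *CostTracker) Savings() float64 { return c.CostWithoutCache() - c.Cost() }

// CacheHitRate is the fraction of all input tokens that were read from cache.
func (c *CostTracker) CacheHitRate() float64 {
	total := c.InputTokens + c.CacheCreationInputTokens + c.CacheReadInputTokens
	if total == 0 {
		return 0
	}
	return float64(c.CacheReadInputTokens) / float64(total)
}

func main() {
	var ct CostTracker
	ct.Add(1_000, 500, 10_000, 0) // first call writes the cache
	ct.Add(1_000, 500, 0, 10_000) // second call reads it back
	fmt.Printf("cost=$%.4f savings=$%.4f hit=%.0f%%\n",
		ct.Cost(), ct.Savings(), 100*ct.CacheHitRate())
}
```

Under these assumed rates, the first call costs slightly more than an uncached call because of the write premium, and the saving only appears once later calls read the cached prefix.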

Test plan

  • go vet ./... clean
  • Cache control correctly placed on conversation turns
  • Cost tracker accumulates across agentic loop iterations

Summary by CodeRabbit

  • New Features

    • Per-session AI cost tracking with session cost reported in chat completion events.
    • Multi-turn caching support to improve message efficiency and reduce costs.
  • Documentation

    • Major README rewrite: streamlined feature descriptions, setup commands, and configuration/architecture guidance.

- Add cache_control on conversation turns so agentic tool loops cache
  all prior messages, saving ~90% on repeated input tokens
- Add CostTracker that accumulates per-session token usage and computes
  real USD costs using Sonnet pricing
- Include session_cost summary in every SSE done event
- Export CacheControlEphemeral for ContentBlock caching
@coderabbitai coderabbitai Bot commented Mar 15, 2026

Walkthrough

Adds an AI token-cost tracker and pricing constants, extends content blocks with cache-control, integrates per-session cost accumulation and multi-turn caching, exposes session_cost in chat SSE "done" events, and replaces README content with a streamlined project overview.

Changes

| Cohort / File(s) | Summary |
|---|---|
| AI Cost Tracking: `internal/ai/cost.go` | New file: introduces pricing constants for Sonnet/Haiku, the `CostTracker` type, and methods `Add`, `Cost`, `CostWithoutCache`, `Savings`, `CacheHitRate`, `Summary`. |
| Model Cache Control: `internal/ai/models.go` | Added `CacheControlEphemeral` and a new public field `CacheControl *cacheControl` on `ContentBlock` to support multi-turn caching directives. |
| Session / Orchestration: `internal/orchestrator/session.go` | Added `Cost ai.CostTracker` field on `Session`; added `addTurnCaching(msgs []ai.Message) []ai.Message`; integrates cost accumulation on assistant completion and applies cache control during windowing. |
| Chat Server Streaming: `internal/server/chat.go` | Updated `handleStreamEvent` signature to accept `session *orchestrator.Session`; SSE "done" event payload now includes `session_cost`; stream handler calls updated accordingly. |
| Documentation: `README.md` | Large rewrite: condensed architecture/features, updated commands and config descriptions, reorganized sections and phrasing across the README. |
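
For reference, a `cache_control` directive on a content block serializes as `{"type": "ephemeral"}` in the Anthropic Messages API. The sketch below shows how the new field might be declared so that it is omitted when unset. The JSON tags, the `Type`/`Text` fields, and the layout of the unexported `cacheControl` type are assumptions, not copied from `internal/ai/models.go`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// cacheControl is a stand-in for the unexported type named in the PR.
type cacheControl struct {
	Type string `json:"type"`
}

// CacheControlEphemeral is a package-level variable so callers can take
// its address, as in `&ai.CacheControlEphemeral`.
var CacheControlEphemeral = cacheControl{Type: "ephemeral"}

// ContentBlock is trimmed to the fields needed for this illustration.
type ContentBlock struct {
	Type         string        `json:"type"`
	Text         string        `json:"text,omitempty"`
	CacheControl *cacheControl `json:"cache_control,omitempty"`
}

// marshalBlock returns the JSON encoding of a block, or "" on error.
func marshalBlock(b ContentBlock) string {
	out, err := json.Marshal(b)
	if err != nil {
		return ""
	}
	return string(out)
}

func main() {
	plain := ContentBlock{Type: "text", Text: "hello"}
	cached := plain
	cached.CacheControl = &CacheControlEphemeral

	fmt.Println(marshalBlock(plain))  // no cache_control key
	fmt.Println(marshalBlock(cached)) // carries the ephemeral directive
}
```

Using a pointer with `omitempty` keeps blocks without a breakpoint byte-for-byte identical to what they were before the field existed.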

Sequence Diagram

sequenceDiagram
    participant Client
    participant Server
    participant Session
    participant CostTracker

    Client->>Server: POST /chat (messages)
    activate Server
    Server->>Session: orchestrator.process(messages)
    activate Session
    Session->>Session: windowedMessages() -> addTurnCaching()
    Note over Session: annotate last turn's ContentBlock.CacheControl
    loop AI completions
        Session->>CostTracker: Add(TokenUsage)
        activate CostTracker
        CostTracker->>CostTracker: accumulate tokens & costs
        deactivate CostTracker
    end
    Session-->>Server: return session (with Cost)
    Server->>CostTracker: Cost() / Summary()
    activate CostTracker
    CostTracker-->>Server: cost metrics
    deactivate CostTracker
    Server-->>Client: SSE stream ... "done" (status: complete, session_cost)
    deactivate Server

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I tally tokens hop by hop,
Sonnet, Haiku—prices on the clock,
Cache tucked snug in every turn,
Sessions hum and numbers churn,
Tiny rabbit counts the clock.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately reflects the main changes: introduces multi-turn caching and per-session cost tracking across files, and is concise and descriptive. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/ai/cost.go`:
- Around line 17-23: CostTracker currently uses hardcoded Sonnet rates; update
it to accept the selected model (or a resolved pricing table) so costs are
computed per-model: add a Model string field or PricingRates struct to
CostTracker, update its constructor/initializers where CostTracker is created
(thread s.model from ChatStream/session into the new field), and modify all
cost-calculation methods (the functions around the struct, previously using
Sonnet constants) to look up rates based on that model field instead of using
Sonnet-only constants so Haiku/Opus sessions report correct session_costs.

In `@internal/orchestrator/session.go`:
- Around line 631-655: addTurnCaching() (called from windowedMessages())
currently searches from len(msgs)-2 and bails out when len(msgs) < 4, causing
the cache breakpoint to be placed one user turn behind; change the logic to find
and mark the most recent user message (scan from len(msgs)-1 backwards to set
cacheIdx) and relax or remove the overly strict length check (e.g., only skip
when msgs is too small to contain any user message), then clone that message
(m := msgs[cacheIdx]) and set last.CacheControl = &ai.CacheControlEphemeral as
before so that the current user turn, identified by the last user-role message,
carries the cache breakpoint.

In `@internal/server/chat.go`:
- Around line 195-198: The code emits two "done" SSEs with session_cost (once
from handleStreamEvent and again on the close path), causing duplicate session
cost in the UI; fix by tracking whether the terminal "done" has already been
sent and suppress the duplicate: introduce a boolean flag (e.g., doneSent or
sessionCostSent) in the streaming handler scope, set it to true when
writeSSE(..., "done", {... "session_cost": session.Cost.Summary()}) is called
(the occurrence in handleStreamEvent), and on the close/cleanup path (the
writeSSE call shown in the diff and the similar block at lines ~281-289) check
that flag and either skip emitting another "done" or emit "done" without the
session_cost field if doneSent is true.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1ce5b089-8ade-43e8-9458-e70d8c8a62bb

📥 Commits

Reviewing files that changed from the base of the PR and between c1ee78a and d8d9fe5.

📒 Files selected for processing (4)
  • internal/ai/cost.go
  • internal/ai/models.go
  • internal/orchestrator/session.go
  • internal/server/chat.go

Comment thread internal/ai/cost.go
Comment on lines +17 to +23
// CostTracker accumulates token usage and computes costs.
type CostTracker struct {
	InputTokens              int
	OutputTokens             int
	CacheCreationInputTokens int
	CacheReadInputTokens     int
}

⚠️ Potential issue | 🟠 Major

Make cost calculation depend on the selected model.

Sessions already carry s.model into ChatStream in internal/orchestrator/session.go Line 303, but this tracker always bills at Sonnet rates. Any Haiku or Opus session will emit a wrong session_cost summary even though separate model IDs/constants already exist. Thread the selected model, or a resolved pricing table, into CostTracker instead of hardcoding Sonnet here.

Also applies to: 36-50
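
One way to thread the model into the tracker, as suggested above, is a pricing table keyed by model. The following is a hedged sketch: `PricingRates`, `ratesFor`, and the `"sonnet"` key are hypothetical names, only the Sonnet rates come from this PR, and the cache-write rate is left out because the PR does not list it.

```go
package main

import "fmt"

// PricingRates holds USD prices per million tokens for one model.
type PricingRates struct {
	Input, Output, CacheRead float64
}

// pricing maps a model key to its rates. Only the Sonnet row is taken from
// the PR description; the key string is a hypothetical placeholder and
// other models would need their published rates added here.
var pricing = map[string]PricingRates{
	"sonnet": {Input: 3.00, Output: 15.00, CacheRead: 0.30},
}

// ratesFor reports ok=false for unknown models instead of silently
// falling back to Sonnet rates.
func ratesFor(model string) (PricingRates, bool) {
	r, ok := pricing[model]
	return r, ok
}

// CostTracker carries the resolved rates alongside the token counters.
type CostTracker struct {
	Rates                                           PricingRates
	InputTokens, OutputTokens, CacheReadInputTokens int
}

// Cost prices the accumulated tokens with the tracker's own rates.
func (c *CostTracker) Cost() float64 {
	return (float64(c.InputTokens)*c.Rates.Input +
		float64(c.OutputTokens)*c.Rates.Output +
		float64(c.CacheReadInputTokens)*c.Rates.CacheRead) / 1e6
}

func main() {
	rates, ok := ratesFor("sonnet")
	if !ok {
		fmt.Println("unknown model: refusing to guess a price")
		return
	}
	ct := CostTracker{Rates: rates, InputTokens: 1_000_000, OutputTokens: 100_000}
	fmt.Printf("$%.2f\n", ct.Cost())
}
```

Resolving the rates once at session creation keeps the cost methods free of model lookups and makes an unknown model an explicit error rather than a wrong number.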


Comment thread internal/orchestrator/session.go
Comment on lines +631 to +655
if len(msgs) < 4 {
	return msgs // too short to benefit from caching
}

// Find the last user message that isn't the very last message
// (we want to cache everything before the newest exchange).
cacheIdx := -1
for i := len(msgs) - 2; i >= 0; i-- {
	if msgs[i].Role == "user" {
		cacheIdx = i
		break
	}
}
if cacheIdx < 0 {
	return msgs
}

// Clone the message and add cache_control to its last content block.
m := msgs[cacheIdx]
blocks := make([]ai.ContentBlock, len(m.Content))
copy(blocks, m.Content)
if len(blocks) > 0 {
	last := &blocks[len(blocks)-1]
	last.CacheControl = &ai.CacheControlEphemeral
}

⚠️ Potential issue | 🟠 Major

Cache the current user turn, not the previous one.

windowedMessages() is only called immediately before ChatStream, after a user/tool_result message has already been appended. Starting the scan at len(msgs)-2 and returning early for len(msgs) < 4 means the first request never writes a cache breakpoint, and later requests keep the breakpoint one user turn behind. In the common user -> tool_use -> tool_result -> final answer flow, the second API call cannot read from cache at all.

🔧 Suggested fix
 func addTurnCaching(msgs []ai.Message) []ai.Message {
-	if len(msgs) < 4 {
-		return msgs // too short to benefit from caching
+	if len(msgs) == 0 {
+		return msgs
 	}
 
-	// Find the last user message that isn't the very last message
-	// (we want to cache everything before the newest exchange).
+	// windowedMessages is only called right before ChatStream, so the
+	// newest user message/tool_result is the stable prefix we want to
+	// write for the next loop iteration.
 	cacheIdx := -1
-	for i := len(msgs) - 2; i >= 0; i-- {
+	for i := len(msgs) - 1; i >= 0; i-- {
 		if msgs[i].Role == "user" {
 			cacheIdx = i
 			break
 		}
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-	if len(msgs) < 4 {
-		return msgs // too short to benefit from caching
-	}
-	// Find the last user message that isn't the very last message
-	// (we want to cache everything before the newest exchange).
+	if len(msgs) == 0 {
+		return msgs
+	}
+	// windowedMessages is only called right before ChatStream, so the
+	// newest user message/tool_result is the stable prefix we want to
+	// write for the next loop iteration.
 	cacheIdx := -1
-	for i := len(msgs) - 2; i >= 0; i-- {
+	for i := len(msgs) - 1; i >= 0; i-- {
 		if msgs[i].Role == "user" {
 			cacheIdx = i
 			break
 		}
 	}
 	if cacheIdx < 0 {
 		return msgs
 	}
 	// Clone the message and add cache_control to its last content block.
 	m := msgs[cacheIdx]
 	blocks := make([]ai.ContentBlock, len(m.Content))
 	copy(blocks, m.Content)
 	if len(blocks) > 0 {
 		last := &blocks[len(blocks)-1]
 		last.CacheControl = &ai.CacheControlEphemeral
 	}
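
Filled out into a self-contained function, the suggested scan might look like the sketch below. The `Message` and `ContentBlock` types are trimmed stand-ins for the `ai` package types, and the tail of the function (copying the slice and swapping in the cloned message) is an assumption, since the review only quotes lines 631-655.

```go
package main

import "fmt"

// Trimmed stand-ins for the ai package types (assumed shapes).
type cacheControl struct{ Type string }

var CacheControlEphemeral = cacheControl{Type: "ephemeral"}

type ContentBlock struct {
	Text         string
	CacheControl *cacheControl
}

type Message struct {
	Role    string
	Content []ContentBlock
}

// addTurnCaching returns a copy of msgs with a cache breakpoint on the last
// content block of the most recent user message. The input is not mutated.
func addTurnCaching(msgs []Message) []Message {
	cacheIdx := -1
	for i := len(msgs) - 1; i >= 0; i-- {
		if msgs[i].Role == "user" {
			cacheIdx = i
			break
		}
	}
	if cacheIdx < 0 || len(msgs[cacheIdx].Content) == 0 {
		return msgs
	}

	// Clone the content blocks so the stored conversation stays untouched.
	blocks := make([]ContentBlock, len(msgs[cacheIdx].Content))
	copy(blocks, msgs[cacheIdx].Content)
	blocks[len(blocks)-1].CacheControl = &CacheControlEphemeral

	// Assumed tail: shallow-copy the slice and swap in the cloned blocks.
	out := make([]Message, len(msgs))
	copy(out, msgs)
	out[cacheIdx].Content = blocks
	return out
}

func main() {
	msgs := []Message{
		{Role: "user", Content: []ContentBlock{{Text: "list files"}}},
		{Role: "assistant", Content: []ContentBlock{{Text: "tool_use: ls"}}},
		{Role: "user", Content: []ContentBlock{{Text: "tool_result: main.go"}}},
	}
	for i, m := range addTurnCaching(msgs) {
		hasBreakpoint := m.Content[len(m.Content)-1].CacheControl != nil
		fmt.Println(i, m.Role, hasBreakpoint)
	}
}
```

With the scan starting at `len(msgs)-1`, the trailing tool_result message carries the breakpoint, so the next loop iteration can read everything up to it from cache.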

Comment thread internal/server/chat.go
Comment on lines +195 to +198
writeSSE(w, flusher, "done", map[string]interface{}{
	"status":       "complete",
	"session_cost": session.Cost.Summary(),
})

⚠️ Potential issue | 🟡 Minor

Emit session_cost on only one terminal done event.

handleStreamEvent() already sends a done event with session_cost, and the close path sends another done with the same field. The current SSE bridge forwards every done event, so once the UI starts rendering session_cost this will show the summary twice for a single request. Either suppress the close-path done after a normal completion, or keep session_cost only on the streamed done.

Also applies to: 281-289
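
The flag-based fix described above could be sketched like this. `doneGuard`, its `writeSSE` method, and the `recorder` writer are hypothetical helpers for illustration, not the functions in `internal/server/chat.go`; the frame layout follows the standard `event:`/`data:` SSE format.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
)

// recorder is a tiny io.Writer that keeps each written frame for inspection.
type recorder struct{ frames []string }

func (r *recorder) Write(p []byte) (int, error) {
	r.frames = append(r.frames, string(p))
	return len(p), nil
}

// doneGuard writes SSE frames and suppresses any "done" event after the first.
type doneGuard struct {
	w        io.Writer
	doneSent bool
}

// writeSSE emits one frame; a second "done" is dropped without error.
func (g *doneGuard) writeSSE(event string, payload map[string]interface{}) error {
	if event == "done" && g.doneSent {
		return nil // terminal event already emitted
	}
	data, err := json.Marshal(payload)
	if err != nil {
		return err
	}
	if _, err := fmt.Fprintf(g.w, "event: %s\ndata: %s\n\n", event, data); err != nil {
		return err
	}
	if event == "done" {
		g.doneSent = true // only mark after a successful write
	}
	return nil
}

func main() {
	rec := &recorder{}
	g := &doneGuard{w: rec}

	// Normal completion path sends the summary (placeholder value here).
	_ = g.writeSSE("done", map[string]interface{}{
		"status": "complete", "session_cost": "placeholder",
	})
	// Close path fires afterwards and is suppressed.
	_ = g.writeSSE("done", map[string]interface{}{"status": "complete"})

	fmt.Print(rec.frames[0])
	fmt.Println("frames written:", len(rec.frames))
}
```

Keeping the flag next to the writer means both the streamed path and the close path go through the same check, so neither needs to know about the other.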



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@README.md`:
- Around line 83-92: The fenced code block showing REPL commands lacks a
language specifier; update the opening fence for that REPL commands block (the
triple backticks before the list of commands) to include a suitable language
label such as text or bash (e.g., change ``` to ```text) so documentation
viewers get proper syntax highlighting; ensure only the opening fence is
modified and the content of the commands (the listed /mode, /switch, /memory,
/reflect, /cost, /clear, /quit entries) remains unchanged.
- Around line 237-263: The fenced architecture map code block in README.md lacks
a language specifier; update the opening triple-backtick for that block (the one
showing "cmd/ghost/main.go CLI + daemon bootstrap" and the internal/ tree) to
include a language tag such as "text" (e.g., change ``` to ```text) so markdown
renderers apply monospaced formatting and preserve alignment.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fd665f32-7f6a-41a1-bfe2-2b766060102a

📥 Commits

Reviewing files that changed from the base of the PR and between d8d9fe5 and d120830.

⛔ Files ignored due to path filters (1)
  • assets/ghost.png is excluded by !**/*.png
📒 Files selected for processing (1)
  • README.md

Comment thread README.md
Comment on lines +83 to 92
```
/mode <name> Switch mode
/switch <project> Switch project
/memory search <q> Search memories
/memory add <text> Add a manual memory
/reflect Force memory consolidation
/context Show project context
/cost Show token usage and spend
/clear Clear conversation (keep memories)
/memory add <text> Manual memory
/reflect Force consolidation
/cost Token usage + spend
/clear Clear conversation
/quit Exit
```

⚠️ Potential issue | 🟡 Minor

Add language specifier to fenced code block.

The REPL commands code block is missing a language specifier, which reduces readability and syntax highlighting support in documentation viewers.

📝 Proposed fix
-```
+```text
 /mode <name>       Switch mode
 /switch <project>  Switch project
 /memory search <q> Search memories
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```
/mode <name> Switch mode
/switch <project> Switch project
/memory search <q> Search memories
/memory add <text> Add a manual memory
/reflect Force memory consolidation
/context Show project context
/cost Show token usage and spend
/clear Clear conversation (keep memories)
/memory add <text> Manual memory
/reflect Force consolidation
/cost Token usage + spend
/clear Clear conversation
/quit Exit
```
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 83-83: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


Comment thread README.md
Comment on lines +237 to 263
```
cmd/ghost/main.go CLI + daemon bootstrap
internal/
ai/ Claude API client + streaming + tool_use
ai/ Claude API client, streaming, tool_use, cost tracking
memory/ SQLite + FTS5 + vector search + time-decay
tool/ Tool registry + 10 built-in executors
orchestrator/ Multi-project session manager
reflection/ Haiku-based memory consolidation
orchestrator/ Multi-project sessions, context compression, multi-turn caching
reflection/ Haiku memory consolidation
prompt/ 3-block cached system prompt
mode/ Operating mode definitions
mode/ Operating modes
project/ Auto-detection (language, tests, git)
config/ Layered YAML/env/flag config (koanf)
tui/ Terminal REPL with streaming
tui/ Terminal REPL
server/ HTTP REST API (chi)
mcpserver/ MCP server (stdio)
telegram/ Telegram bot + approval forwarding
google/ Google Calendar + Gmail OAuth2 client
github/ Notification monitor + P0-P4 priority
scheduler/ Cron + one-shot reminders (gocron)
briefing/ Daily briefing aggregator
telegram/ Bot, approvals, session management
google/ Calendar + Gmail OAuth2
github/ Notification monitor
scheduler/ Cron + reminders (gocron)
briefing/ Daily briefing
embedding/ Ollama async worker
mdv2/ MarkdownV2 escaping utilities
voice/ Voice pipeline interfaces (WIP)
mdv2/ MarkdownV2 escaping
voice/ Voice pipeline (WIP)
provider/ Interface contracts
audit/ Per-action cost + token logging
migrations/ Embedded SQLite schema
vscode-ghost/ VSCode extension (TypeScript)
```

⚠️ Potential issue | 🟡 Minor

Add language specifier to fenced code block.

The architecture map code block is missing a language specifier, which reduces readability in documentation viewers.

📝 Proposed fix
-```
+```text
 cmd/ghost/main.go          CLI + daemon bootstrap
 internal/
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```
cmd/ghost/main.go CLI + daemon bootstrap
internal/
ai/ Claude API client + streaming + tool_use
ai/ Claude API client, streaming, tool_use, cost tracking
memory/ SQLite + FTS5 + vector search + time-decay
tool/ Tool registry + 10 built-in executors
orchestrator/ Multi-project session manager
reflection/ Haiku-based memory consolidation
orchestrator/ Multi-project sessions, context compression, multi-turn caching
reflection/ Haiku memory consolidation
prompt/ 3-block cached system prompt
mode/ Operating mode definitions
mode/ Operating modes
project/ Auto-detection (language, tests, git)
config/ Layered YAML/env/flag config (koanf)
tui/ Terminal REPL with streaming
tui/ Terminal REPL
server/ HTTP REST API (chi)
mcpserver/ MCP server (stdio)
telegram/ Telegram bot + approval forwarding
google/ Google Calendar + Gmail OAuth2 client
github/ Notification monitor + P0-P4 priority
scheduler/ Cron + one-shot reminders (gocron)
briefing/ Daily briefing aggregator
telegram/ Bot, approvals, session management
google/ Calendar + Gmail OAuth2
github/ Notification monitor
scheduler/ Cron + reminders (gocron)
briefing/ Daily briefing
embedding/ Ollama async worker
mdv2/ MarkdownV2 escaping utilities
voice/ Voice pipeline interfaces (WIP)
mdv2/ MarkdownV2 escaping
voice/ Voice pipeline (WIP)
provider/ Interface contracts
audit/ Per-action cost + token logging
migrations/ Embedded SQLite schema
vscode-ghost/ VSCode extension (TypeScript)
```
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 237-237: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


@wcatz wcatz merged commit 55a46e6 into main Mar 15, 2026
4 checks passed
@wcatz wcatz deleted the feat/api-optimizations branch March 15, 2026 19:23
wcatz added a commit that referenced this pull request Mar 16, 2026
Reverts 13 commits (55d2cc0..44e3a17) that introduced regressions
in the VSCode extension webview, PDF/token features, and TUI.
Restores codebase to the stable multi-turn caching state (#45).
@wcatz wcatz mentioned this pull request Mar 16, 2026
2 tasks
wcatz added a commit that referenced this pull request Mar 16, 2026
* fix(vscode): complete tool output cleanup - remove XML generation and regex filter

Completes fixes from commits 21e5fa6 and 4f38c54:
- Remove remaining XML tag generation in session.go (lines 422-426)
- Remove now-unnecessary regex filter in webview-html.ts (line 524)

All tool output now flows cleanly through tool_delta events with ID-based
matching. No XML tags generated or filtered. Tool indicators work correctly
with concurrent tools and proper timing display.

* fix: revert to #45 state — undo #46 through #52 and related fixes

Reverts 13 commits (55d2cc0..44e3a17) that introduced regressions
in the VSCode extension webview, PDF/token features, and TUI.
Restores codebase to the stable multi-turn caching state (#45).