fix(context): align fold summary prefix with main agent for cache reuse#1565

Merged: esengine merged 4 commits into main from fix/fold-cache-aligned-summary (May 22, 2026)

Conversation

@esengine

Owner

Summary

The fold summary call was being sent with a bespoke "You compress conversation history…" system prompt and no tools, which produces a 0% prefix-cache hit against the main agent's just-cached request. Every fold paid full input price on tens of thousands of tokens.

This PR reshapes the summary request so its prefix mirrors the live agent's last call:

  • Same system prompt (was: bespoke "compress conversation" string)
  • Same tool list (was: omitted)
  • Same head conversation bytes (was: skill bodies stubbed mid-head, breaking the cache from the first skill onward)
  • Only the trailing user "summarize" instruction is new — that's the sole cache-miss boundary
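The request shape in the list above can be sketched roughly as follows. This is an illustration under assumptions, not the code from this PR: the `Message`, `Tool`, and `ChatRequest` types, the `buildSummaryRequest` helper, and the instruction text are all hypothetical names chosen for the example.

```typescript
// Hypothetical shapes for illustration only; not the project's real types.
type Message = { role: "system" | "user" | "assistant" | "tool"; content: string };
type Tool = { type: "function"; function: { name: string; description?: string } };
type ChatRequest = { model: string; messages: Message[]; tools: Tool[] };

// Build the fold-summary request so that everything before the final
// message is identical to what the live agent just sent.
function buildSummaryRequest(
  agent: { model: string; systemPrompt: string; tools: Tool[] },
  head: readonly Message[],
  summarizeInstruction: string,
): ChatRequest {
  return {
    model: agent.model, // same model as the live agent
    tools: agent.tools, // same tool list (previously omitted)
    messages: [
      { role: "system", content: agent.systemPrompt }, // same system prompt
      ...head, // head conversation passed through untouched
      { role: "user", content: summarizeInstruction }, // the only new bytes
    ],
  };
}

// Usage with made-up data.
const request = buildSummaryRequest(
  { model: "example-model", systemPrompt: "You are the main agent.", tools: [] },
  [{ role: "user", content: "hello" }, { role: "assistant", content: "hi" }],
  "Summarize the conversation above.",
);
console.log(request.messages.map((m) => m.role).join(" > "));
// prints "system > user > assistant > user"
```

Because the new instruction is appended last, any change to its wording affects only the tail of the request and leaves the shared prefix alone.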

Skill-pin handling was split: collectPinnedSkills is read-only and leaves head bytes intact. The summarize instruction now names the pinned skills so the model doesn't paraphrase their bodies (we still append them verbatim regardless).
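A rough sketch of the read-only skill-pin flow described above. How a skill body is actually marked inside the head is not shown in this PR description, so the `skillName` field, the `collectPinnedSkillsSketch`, `buildSummarizeInstruction`, and `finalizeSummary` helpers, and the instruction wording are all assumptions made for illustration.

```typescript
// Hypothetical message shape: `skillName` marks a message that carries a skill body.
type HeadMessage = { role: string; content: string; skillName?: string };
type PinnedSkill = { name: string; body: string };

// Read-only pass: gathers skill bodies and never mutates the head, so the
// bytes sent to the API stay identical to the agent's cached prefix.
function collectPinnedSkillsSketch(head: readonly Readonly<HeadMessage>[]): PinnedSkill[] {
  const pinned: PinnedSkill[] = [];
  for (const msg of head) {
    if (msg.skillName !== undefined) pinned.push({ name: msg.skillName, body: msg.content });
  }
  return pinned;
}

// Name the pinned skills in the instruction so the model skips paraphrasing them.
function buildSummarizeInstruction(pinned: readonly PinnedSkill[]): string {
  const base = "Summarize the conversation above.";
  if (pinned.length === 0) return base;
  const names = pinned.map((s) => s.name).join(", ");
  return `${base} Do not paraphrase these pinned skills; they are re-attached verbatim: ${names}.`;
}

// Append the pinned bodies verbatim, whatever the model wrote.
function finalizeSummary(modelSummary: string, pinned: readonly PinnedSkill[]): string {
  return [modelSummary, ...pinned.map((s) => `[skill: ${s.name}]\n${s.body}`)].join("\n\n");
}
```

Keeping collection separate from any rewriting means the head passed to the request builder is the same object the agent used, which is what preserves the cache prefix.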

Numbers

Measured on a real 60-message session at 48.7K prompt tokens (bench-fold-cache-live.mjs, real DeepSeek API):

| shape  | cache hit | input cost      |
|--------|-----------|-----------------|
| before | 0.0%      | $0.145 per fold |
| after  | 99.6%     | $0.015 per fold |

~89.6% saving per fold. Across 6 real session replays the saving holds steady at 88-90%. The remaining ~0.4% miss is just the trailing summarize instruction (~185 tokens).
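As a quick sanity check on the residual miss, the two figures quoted above (about 185 instruction tokens out of 48.7K prompt tokens) can be plugged into a few lines of arithmetic. The variable names are illustrative, and 48,700 is the rounded token count, so the result is approximate.

```typescript
const promptTokens = 48_700; // "48.7K prompt tokens", rounded
const instructionTokens = 185; // approximate size of the trailing summarize instruction

const missRate = instructionTokens / promptTokens; // fraction that cannot be served from cache
const hitRate = 1 - missRate;

console.log(`miss ${(missRate * 100).toFixed(1)}%, hit ${(hitRate * 100).toFixed(1)}%`);
// prints "miss 0.4%, hit 99.6%"
```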

Test plan

  • 7 new request-shape assertions in tests/context-manager-cache-aligned-fold.test.ts (system / tools / head bytes / trailing instruction / model pin / skill-pin verbatim)
  • All 10 pre-existing fold tests still pass (context-manager-skill-pin, context-manager-thinking-mode, context-manager-fold-timeout)
  • Full npm run verify: build + lint + typecheck + 3561 tests green
  • Live API check via tools/bench-fold-cache-live.mjs confirms 99.6% cache hit on real DeepSeek

reasonix added 4 commits May 22, 2026 08:45
…ult, lifecycle plans

Headline themes:
- TUI: Static-history renderer is the only path; virtual-viewport layers removed (#1529 stages 1-4)
- Chat: queued mid-turn steer handling so input mid-render doesn't drop or fight the live frame (#1501)
- Web search: default switches to Bing; dashboard engine switcher; Mojeek dropped (#1558)
- Plans: lifecycle evidence summaries surface why a plan is ready to accept (#1500)
- Desktop: native OS notifications for approvals + completion (#1519)
- i18n: CLI command output (/mcp /sessions /prune /theme) + approval-prompt labels translated (#1524, #1560)
- Security: SSRF block in web_fetch (#1544), edit-snapshot path containment (#1454), shell redirect sandbox (#1457), Task integrity guardrail (#1516)
- Tools: per-turn dispatch-rate limit (#1356); run_command discourages shell-based edits (#1514)
- Client: DeepSeek 429 → concurrency-limit hint (#1526); timeoutMs honored with AbortSignal (#1535); --no-proxy opt-out for direct route (#1507)
- Files: read/edit/restore preserves source encoding (GB18030 / UTF-8 BOM) (#1518)
- Context: pinned constraints survive folds + full tail capture (#1515, #1552)
- Refactor: lifecycle risk policy extracted into its own module (#1557)

See CHANGELOG for the full list.

The summarizer call was sending a bespoke "You compress conversation
history" system prompt and no tools, guaranteeing a 0% cache hit
against the main agent's just-cached prefix. Reshape the request so
system + tools + head bytes mirror the live agent's last call — the
only novel bytes are the trailing summarize instruction.

Skill-pin handling now collects bodies read-only instead of stubbing
mid-head, so the cache prefix stays unbroken. The summarize
instruction names pinned skills so the model knows not to paraphrase
their bodies (which we append verbatim regardless).

Measured on a real session at 48.7K prompt tokens:
  OLD shape: 0.0% cache hit  → $0.145 per fold
  NEW shape: 99.6% cache hit → $0.015 per fold
  saving: 89.6% per fold

bench-fold-cache-shape.mjs replays real session jsonls, simulates
OLD vs NEW summary-call shapes at the fold point, and reports
byte-level shared-prefix with the main agent's preceding request.
Pure local — no API required.
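The byte-level shared-prefix measurement mentioned above could look something like the sketch below. It is not taken from bench-fold-cache-shape.mjs; `sharedPrefixBytes` and `sharedPrefixRatio` are hypothetical helpers, and the sketch assumes both requests have already been serialized to strings in the same way.

```typescript
// Count how many leading UTF-8 bytes two serialized requests have in common.
function sharedPrefixBytes(a: string, b: string): number {
  const x = new TextEncoder().encode(a);
  const y = new TextEncoder().encode(b);
  const limit = Math.min(x.length, y.length);
  let i = 0;
  while (i < limit && x[i] === y[i]) i++;
  return i;
}

// Share of the candidate request that overlaps with the reference request's prefix.
function sharedPrefixRatio(reference: string, candidate: string): number {
  const total = new TextEncoder().encode(candidate).length;
  return total === 0 ? 0 : sharedPrefixBytes(reference, candidate) / total;
}

console.log(sharedPrefixBytes("abcdef", "abcxyz")); // prints 3
```

A byte-level comparison is only a local proxy: the provider caches on its own tokenized representation, which is why the live benchmark is still needed to confirm real hits.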

bench-fold-cache-live.mjs sends one priming + two summary calls to
DeepSeek and reports prompt_cache_hit_tokens / cost for each shape.
Used to confirm the shape change actually translates to API-side
cache hits.
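For reading the API-side result, a minimal sketch of turning the usage counters into a hit ratio. The commit message names only `prompt_cache_hit_tokens`; the companion `prompt_cache_miss_tokens` field is assumed here to be reported alongside it, and `cacheHitRatio` plus the sample numbers are made up for illustration.

```typescript
// Usage counters assumed to appear in the chat-completions response.
type CacheUsage = { prompt_cache_hit_tokens: number; prompt_cache_miss_tokens: number };

// Fraction of prompt tokens served from the provider's prefix cache.
function cacheHitRatio(usage: CacheUsage): number {
  const total = usage.prompt_cache_hit_tokens + usage.prompt_cache_miss_tokens;
  return total === 0 ? 0 : usage.prompt_cache_hit_tokens / total;
}

// Made-up usage numbers chosen to resemble the measurement in this PR.
const example: CacheUsage = { prompt_cache_hit_tokens: 48_515, prompt_cache_miss_tokens: 185 };
console.log((cacheHitRatio(example) * 100).toFixed(1) + "%"); // prints "99.6%"
```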
…d-summary

# Conflicts:
#	src/context-manager.ts
#	src/loop.ts
@esengine esengine merged commit 544714b into main May 22, 2026
3 checks passed
@esengine esengine deleted the fix/fold-cache-aligned-summary branch May 22, 2026 16:20
// POST the prepared request body to DeepSeek's chat-completions endpoint.
const resp = await fetch("https://api.deepseek.com/chat/completions", {
  method: "POST",
  headers: { Authorization: `Bearer ${apiKey}`, "Content-Type": "application/json" },
  body: JSON.stringify(body),
});
esengine added a commit that referenced this pull request May 22, 2026
read-before-edit gate (#1563) and cache-aligned fold summary (#1565)
landed after the release commit was written. Document them in the
0.49.0 entry before tagging so the published CHANGELOG matches what
ships.

Co-authored-by: reasonix <reasonix@deepseek.com>