Cache Efficiency Guardrails and Diagnostics #2314

Merged
esengine merged 2 commits into esengine:main from SivanCola:codex/cache-efficiency-guardrails
May 30, 2026

@SivanCola

Collaborator

Summary

This PR adds cache-efficiency guardrails for DeepSeek-style high cache-hit sessions. It focuses on keeping the immutable prefix byte-stable, making cache churn observable, and avoiding compaction decisions that reduce total cache efficiency.

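The changes fall into six areas, all aimed at a high cache hit rate: stabilizing the prefix shape, diagnosing cache churn, controlling tool schema cost, improving fold economics, trimming reasoning history, and completing probe/test/replay/UI compatibility.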

Major Changes: Principle and Effect

1. Prefix-shape diagnostics and cache churn attribution

Principle: the provider-side prompt cache is sensitive to the byte shape of the reusable request prefix. Even when the semantic content is unchanged, a change to the system prompt, the tool schema ordering, the few-shot payloads, or a transcript rewrite can turn a warm prefix into a cold one. The new diagnostics hash these prefix components and compare snapshots across turns.

Effect: runtime stats can now explain why a cache miss happened instead of only reporting that it happened. /status, telemetry stats, replay summaries, and cache-related UI surfaces can show miss tokens, schema tokens, prefix changes, and top tool schema contributors. This also complements the existing /cache-miss-report path from upstream by adding local shape-level attribution.
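
As an illustration of the per-component hashing described above, here is a minimal sketch. The `PrefixParts` shape and the `snapshotPrefix` and `diffSnapshots` helpers are hypothetical names chosen for this example, not the identifiers used in this PR, and the transcript component is left out for brevity.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape of the reusable request prefix; field names are illustrative.
interface PrefixParts {
  systemPrompt: string;
  toolSchemas: string; // serialized tool list, byte-for-byte as it would be sent
  fewShots: string;
}

type PrefixSnapshot = Record<keyof PrefixParts, string>;

const sha256 = (text: string): string =>
  createHash("sha256").update(text, "utf8").digest("hex");

// Hash each component separately so a change can be attributed to one part.
function snapshotPrefix(parts: PrefixParts): PrefixSnapshot {
  return {
    systemPrompt: sha256(parts.systemPrompt),
    toolSchemas: sha256(parts.toolSchemas),
    fewShots: sha256(parts.fewShots),
  };
}

// Return the names of the components whose hash differs between two turns.
function diffSnapshots(
  prev: PrefixSnapshot,
  next: PrefixSnapshot,
): Array<keyof PrefixParts> {
  const keys = Object.keys(prev) as Array<keyof PrefixParts>;
  return keys.filter((key) => prev[key] !== next[key]);
}

// Same logical tools, different serialized order: only toolSchemas changes.
const turn1 = snapshotPrefix({
  systemPrompt: "You are helpful.",
  toolSchemas: '[{"name":"a"},{"name":"b"}]',
  fewShots: "",
});
const turn2 = snapshotPrefix({
  systemPrompt: "You are helpful.",
  toolSchemas: '[{"name":"b"},{"name":"a"}]',
  fewShots: "",
});
const changed = diffSnapshots(turn1, turn2);
console.log(changed); // [ 'toolSchemas' ]
```

Hashing components separately, rather than the whole prefix at once, is what lets a miss be attributed to a specific cause.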

2. Stable tool schema ordering and schema cost governance

Principle: the complete tool list is part of the model request prefix. If MCP reconnects or dynamic registration produce the same logical tool set in a different order, the serialized request changes and cache reuse can be lost. Sorting tool specs by function name makes the schema prefix deterministic. Estimating each schema's token cost makes large tool definitions visible.

Effect: avoidable cold turns caused by MCP reconnect/order churn are reduced. The UI can surface expensive tool schemas in the context breakdown, so cache and token issues caused by large schemas are easier to diagnose. Tests now lock the reconnect/prefix invariants around this behavior.
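
The ordering and cost ideas can be sketched as follows. The `ToolSpec` type is a simplified assumption, the body of `sortToolSpecs` is an illustration rather than the code in this PR, and the four-characters-per-token estimate is a rough heuristic, not a tokenizer.

```typescript
// Hypothetical minimal tool spec; the real type in the repository may differ.
interface ToolSpec {
  function: { name: string; description: string; parameters: object };
}

// Order by function name with plain string comparison (UTF-16 code units),
// which does not depend on the host locale.
function sortToolSpecs(specs: readonly ToolSpec[]): ToolSpec[] {
  return [...specs].sort((a, b) => {
    const x = a.function.name;
    const y = b.function.name;
    return x < y ? -1 : x > y ? 1 : 0;
  });
}

// Rough heuristic (an assumption): about 4 characters per token.
function estimateSchemaTokens(spec: ToolSpec): number {
  return Math.ceil(JSON.stringify(spec).length / 4);
}

const readFile: ToolSpec = {
  function: {
    name: "read_file",
    description: "Read a file",
    parameters: { type: "object" },
  },
};
const grep: ToolSpec = {
  function: {
    name: "grep",
    description: "Search files for a pattern, with many options",
    parameters: { type: "object", properties: { pattern: { type: "string" } } },
  },
};

// Two registration orders serialize to the same bytes after sorting.
const beforeReconnect = JSON.stringify(sortToolSpecs([readFile, grep]));
const afterReconnect = JSON.stringify(sortToolSpecs([grep, readFile]));
console.log(beforeReconnect === afterReconnect); // true

// Rank schemas by estimated cost to find the top contributors.
const byCost = sortToolSpecs([readFile, grep])
  .map((spec) => ({ name: spec.function.name, tokens: estimateSchemaTokens(spec) }))
  .sort((a, b) => b.tokens - a.tokens);
console.log(byCost[0]?.name); // grep
```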

3. Fold economics for compaction decisions

Principle: summarization/folding is not free. A fold creates a new summary segment that is cold at first and adds immediate request cost. Normal-band folding should only happen when the expected multi-turn savings exceed the summary and post-fold cold tax. Aggressive folding still protects the context window when headroom is genuinely low.

Effect: the context manager avoids cost-negative folds that would lower cache efficiency in medium-length sessions, while still preserving safety near context limits. New tests cover both the conservative normal-band behavior and cache-aligned fold invariants.
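
The break-even reasoning can be made concrete with a small cost model. This is a sketch under stated assumptions: the `FoldEstimate` and `Prices` types, the relative prices, and the 10% headroom cutoff are all hypothetical and are not taken from the context manager in this PR.

```typescript
// All fields are hypothetical inputs to an illustrative cost model.
interface FoldEstimate {
  foldedTokens: number;   // history tokens the fold would replace
  summaryTokens: number;  // expected size of the generated summary
  coldTailTokens: number; // tokens after the fold point that go cold once
  expectedTurns: number;  // remaining turns expected in the session
}

// Per-token prices in relative units (assumed, not a real price list).
interface Prices {
  hit: number;    // cached input token
  miss: number;   // uncached input token
  output: number; // generated token
}

function foldNetSavings(e: FoldEstimate, p: Prices): number {
  // Every later turn sends the summary instead of the folded history,
  // both at the cached price once the new prefix is warm.
  const perTurnSaving = (e.foldedTokens - e.summaryTokens) * p.hit;
  // One-time cost: generate the summary, then pay the miss price instead of
  // the hit price for the summary and the tail on the first post-fold turn.
  const oneTimeCost =
    e.summaryTokens * p.output +
    (e.summaryTokens + e.coldTailTokens) * (p.miss - p.hit);
  return perTurnSaving * e.expectedTurns - oneTimeCost;
}

function shouldFold(
  e: FoldEstimate,
  p: Prices,
  headroomRatio: number,  // free fraction of the context window
  aggressiveBelow = 0.1,  // assumed cutoff for safety folding
): boolean {
  if (headroomRatio < aggressiveBelow) return true; // protect the window
  return foldNetSavings(e, p) > 0;                  // normal band: must pay off
}

const prices: Prices = { hit: 1, miss: 10, output: 20 };
const base = { foldedTokens: 20000, summaryTokens: 2000, coldTailTokens: 8000 };
const mediumSession: FoldEstimate = { ...base, expectedTurns: 3 };
const longSession: FoldEstimate = { ...base, expectedTurns: 10 };

console.log(foldNetSavings(mediumSession, prices));   // -76000
console.log(shouldFold(mediumSession, prices, 0.5));  // false
console.log(shouldFold(longSession, prices, 0.5));    // true
console.log(shouldFold(mediumSession, prices, 0.05)); // true
```

With these numbers the same fold is cost-negative over 3 turns and cost-positive over 10, and low headroom overrides the economics entirely.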

4. Reasoning retention and healing for tool-call history

Principle: thinking/reasoning models need reasoning fields to round-trip correctly for assistant messages that contain tool calls. However, stale plain assistant reasoning can bloat future request bodies and make prefix shape less stable. The healing path now strips stale plain reasoning while re-stamping only the tool-call assistant turns that require reasoning continuity.

Effect: tool-call transcripts remain API-safe for reasoning models, while unnecessary reasoning payload is removed from future requests. This reduces request bloat and lowers the chance of cache churn from stale assistant-only reasoning content.
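
A minimal sketch of such a healing pass follows. The `ChatMessage` shape and the `healReasoning` helper are hypothetical, the `reasoning_content` field name is assumed from DeepSeek-style APIs, and re-stamping is modelled here simply as guaranteeing the field is present on tool-call turns, which may differ from what the PR does.

```typescript
// Simplified message shape; the reasoning field name is an assumption.
interface ChatMessage {
  role: "system" | "user" | "assistant" | "tool";
  content: string | null;
  reasoning_content?: string;
  tool_calls?: Array<{ id: string; name: string; arguments: string }>;
}

function healReasoning(history: readonly ChatMessage[]): ChatMessage[] {
  return history.map((msg) => {
    if (msg.role !== "assistant") return msg;
    const hasToolCalls = (msg.tool_calls?.length ?? 0) > 0;
    if (hasToolCalls) {
      // Tool-call turns keep their reasoning; fall back to an empty string
      // so the field is always present (assumed re-stamp behaviour).
      return { ...msg, reasoning_content: msg.reasoning_content ?? "" };
    }
    // Plain assistant turns: drop stale reasoning from future requests.
    const stripped: ChatMessage = { ...msg };
    delete stripped.reasoning_content;
    return stripped;
  });
}

const history: ChatMessage[] = [
  { role: "user", content: "What is in config.json?" },
  { role: "assistant", content: "Let me check.", reasoning_content: "stale thoughts" },
  {
    role: "assistant",
    content: null,
    reasoning_content: "need to read the file",
    tool_calls: [{ id: "call_1", name: "read_file", arguments: '{"path":"config.json"}' }],
  },
  { role: "tool", content: '{"debug":true}' },
];

const healed = healReasoning(history);
console.log(Object.keys(healed[1] ?? {})); // [ 'role', 'content' ]
console.log(healed[2]?.reasoning_content); // need to read the file
```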

5. Probe and regression guardrails

Principle: cache behavior should be testable without relying only on provider internals. Deterministic shape tests verify local invariants, and live/offline probes measure whether those invariants translate into high cache-hit behavior over realistic loops.

Effect: this PR adds scripts/probe-cache-shape.mts, updates loop and long-session probes, and adds tests for cache shape and fold economics. I also ran the testing tool from PR #2306 against this branch via a temporary overlay, so the change is validated by the new offline cache guard scenarios as well.

6. Documentation, replay, and UI compatibility

Principle: adding cache summary fields is only useful if old transcripts and replay paths remain readable. UI and replay defaults must tolerate sessions that were recorded before these fields existed.

Effect: App, ReplayApp, transcript replay, localized labels, and real-world cache benchmark docs were updated together. Existing sessions stay backwards compatible, and benchmark documentation now explicitly calls out expected cold summary segments after compaction.
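
One way to tolerate older recordings is to fill defaults at read time, as in this sketch. The `CacheSummary` fields and the `readCacheSummary` helper are hypothetical names for illustration and may not match the fields this PR adds.

```typescript
// Hypothetical cache summary fields; names are illustrative only.
interface CacheSummary {
  missTokens: number;
  schemaTokens: number;
  prefixChanges: string[];
}

// Older transcripts have no `cache` key at all; newer ones may have some fields.
interface ReplayRecord {
  turn: number;
  cache?: Partial<CacheSummary>;
}

const EMPTY_SUMMARY: CacheSummary = { missTokens: 0, schemaTokens: 0, prefixChanges: [] };

// Merge recorded values over defaults so the UI never sees a missing field.
function readCacheSummary(record: ReplayRecord): CacheSummary {
  return {
    ...EMPTY_SUMMARY,
    ...record.cache,
    // Copy the array so callers cannot mutate the shared default.
    prefixChanges: [...(record.cache?.prefixChanges ?? [])],
  };
}

const oldRecord = JSON.parse('{"turn":1}') as ReplayRecord;
const newRecord = JSON.parse(
  '{"turn":2,"cache":{"missTokens":124,"prefixChanges":["toolSchemas"]}}',
) as ReplayRecord;

console.log(readCacheSummary(oldRecord)); // { missTokens: 0, schemaTokens: 0, prefixChanges: [] }
console.log(readCacheSummary(newRecord)); // { missTokens: 124, schemaTokens: 0, prefixChanges: [ 'toolSchemas' ] }
```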

Verification

  • git diff --cached --check: passed
  • npm run typecheck: passed
  • npm run lint: passed with one existing non-fatal warning in src/cli/ui/PlanPanel.tsx about a type-only React import
  • npx vitest run tests/cache-shape.test.ts tests/context-manager-fold-economics.test.ts tests/ctx-breakdown.test.ts tests/mcp-reconnect-prefix-invariant.test.ts tests/telemetry.test.ts tests/loop-r1-reasoning.test.ts tests/context-manager-cache-aligned-fold.test.ts: passed, 84 tests
  • npx tsx scripts/probe-cache-shape.mts: passed
  • npm run build:dashboard: passed
  • npx vitest run tests/loop.test.ts tests/dashboard-smoke.test.ts: passed, 85 passed and 1 skipped
  • Full local suite before push: npm test -- --run: passed, 315 files passed and 1 skipped, 4047 tests passed and 12 skipped
  • Pre-push verify hook: npm run build && npm run lint && npm run typecheck && npm run test --silent: passed, 316 files passed, 4050 tests passed and 9 skipped

PR #2306 cache guard run

I fetched the testing tool from #2306 and ran it against a temporary overlay of this final branch.

npm run cache:guard: passed

  • plain-dialogue: 98.4%, PASS
  • tool-roundtrip: 93.5%, PASS
  • multi-tool: 87.3%, PASS
  • reasoning-retention: 98.5%, PASS
  • long-session-resume: 98.3%, PASS
  • mcp-hot-add: 97.5%, PASS, breaks=1
  • pro-one-shot: req=4, min-hit 98.3%, max-miss 124, breaks=2, PASS
  • Overall threshold: 85.0%, PASS
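
How the tool from #2306 derives these percentages is not shown here. One plausible reading, sketched below as an assumption, is the share of prompt tokens served from cache across all requests in a scenario, compared against the 85% threshold. The usage field names follow the DeepSeek API usage object; `cacheHitRatio` and `passesGuard` are hypothetical helpers.

```typescript
// Usage fields as reported by DeepSeek-style APIs (assumed here).
interface Usage {
  prompt_cache_hit_tokens: number;
  prompt_cache_miss_tokens: number;
}

// Fraction of prompt tokens served from cache over a whole scenario.
function cacheHitRatio(usages: readonly Usage[]): number {
  let hit = 0;
  let miss = 0;
  for (const usage of usages) {
    hit += usage.prompt_cache_hit_tokens;
    miss += usage.prompt_cache_miss_tokens;
  }
  const total = hit + miss;
  return total === 0 ? 0 : hit / total;
}

function passesGuard(usages: readonly Usage[], threshold = 0.85): boolean {
  return cacheHitRatio(usages) >= threshold;
}

// A loop with a cold first request and a stable prefix afterwards.
const warmLoop: Usage[] = [
  { prompt_cache_hit_tokens: 0, prompt_cache_miss_tokens: 1000 },
  { prompt_cache_hit_tokens: 4800, prompt_cache_miss_tokens: 200 },
  { prompt_cache_hit_tokens: 9200, prompt_cache_miss_tokens: 800 },
];
// A loop whose prefix keeps changing, so most tokens miss.
const churnyLoop: Usage[] = [
  { prompt_cache_hit_tokens: 0, prompt_cache_miss_tokens: 1000 },
  { prompt_cache_hit_tokens: 1000, prompt_cache_miss_tokens: 4000 },
  { prompt_cache_hit_tokens: 5000, prompt_cache_miss_tokens: 5000 },
];

console.log(cacheHitRatio(warmLoop));   // 0.875
console.log(passesGuard(warmLoop));     // true
console.log(cacheHitRatio(churnyLoop)); // 0.375
console.log(passesGuard(churnyLoop));   // false
```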

npm run test -- tests/cache-guard.test.ts: passed, 3 tests

Risk Notes

  • Tool specs are now canonicalized by function name. This intentionally changes where hot-added tools appear in diagnostics and in the request: API payload order is now determined by name instead of registration timing.
  • Fold economics may delay normal-band summaries compared with the previous behavior. Aggressive/headroom-based folding still protects the model context window.
  • This PR intentionally excludes the local .gitignore change for AGENTS.md, which was already present in the working tree and is unrelated to cache efficiency.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 505a49801d

Comment thread src/loop.ts Outdated
esengine merged commit fccea10 into esengine:main on May 30, 2026
4 checks passed
esengine added a commit that referenced this pull request May 30, 2026
…ocale-independent (#2320)

Follow-up polish to the cache-efficiency guardrails (#2314):

- /status "cache detail" line was hard-coded English; route it through i18n
  (statusCacheDetail / statusCacheChurn) so it matches every other status row.
  EN + zh-CN translated; ja/de/ru inherit EN like the rest of their
  observability block.
- sortToolSpecs used localeCompare, which is locale-sensitive and could let
  the host locale reshuffle the serialized tool prefix and reintroduce the
  very cache churn the sort is meant to prevent. Switch to a stable
  codepoint compare. No change for ASCII tool names (all existing tests pass).

Co-authored-by: yhh <yhh@yhhdeMac-mini.local>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
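
For illustration, a locale-independent comparator of the kind described in the commit message above could look like the sketch below. `compareCodePoints` is a hypothetical name. Whether the actual fix compares Unicode code points or UTF-16 code units is not stated, and for ASCII tool names the two give the same order.

```typescript
// Compare two strings by Unicode code point, ignoring the host locale.
function compareCodePoints(a: string, b: string): number {
  // Array.from iterates a string by code point, so surrogate pairs stay whole.
  const pa = Array.from(a, (ch) => ch.codePointAt(0) ?? 0);
  const pb = Array.from(b, (ch) => ch.codePointAt(0) ?? 0);
  const shared = Math.min(pa.length, pb.length);
  for (let i = 0; i < shared; i++) {
    const diff = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (diff !== 0) return diff;
  }
  // Equal up to the shorter length: the shorter string sorts first.
  return pa.length - pb.length;
}

const names = ["read_file", "Grep", "grep", "Read_file"];
const sorted = [...names].sort(compareCodePoints);
console.log(sorted); // [ 'Grep', 'Read_file', 'grep', 'read_file' ]
```

A `localeCompare`-based sort may interleave upper and lower case differently depending on the host locale and ICU data, whereas this order is the same on every machine.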