refactor(core)!: replace tail-preservation compaction with summary + restoration attachments#4599
Conversation
📋 Review Summary
This PR replaces the tail-preservation auto-compaction model with a claude-code-style "summary + attachments" approach. The implementation is well-structured with comprehensive test coverage (295 tests across 5 files). The new model addresses critical failure modes for single-turn long-running tasks by preserving verbatim user messages, recently-touched files, and recent images with metadata.
Code Coverage Summary
CLI Package - Full Text Report
Core Package - Full Text Report
For detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run.
wenshao left a comment
Two findings outside the diff hunks:
- [Suggestion] PostCompact hook receives raw summary (`chatCompressionService.ts:480`): `firePostCompactEvent(postCompactTrigger, summary, signal)` passes the raw side-query output including the `<analysis>` scratchpad. Meanwhile `composePostCompactHistory` calls `postProcessSummary(summary)`, which strips `<analysis>`. Hook consumers receive the un-stripped version while the history gets the stripped version, which is inconsistent. Pass `postProcessSummary(summary)` to the hook.
- [Suggestion] Stale `geminiChat.ts` comment (~line 1543-1550): The comment describes `trigger: 'auto'` as preventing orphan-strip corruption, but the orphan-strip logic was removed entirely in this PR. The comment should be updated: `trigger` now only affects hook event selection.
… + ergonomics)
Seven follow-ups from wenshao's review of the compaction rewrite.
Critical:
- newTokenCount now includes restoration-block tokens via
estimateContentChars over extraHistory[2..]. Previously the formula
only counted side-query output, so up to 5 × 5K (files) + 3 × image
tokens were missing — letting the inflation guard miss and the
cheap-gate under-estimate the next prompt size (Finding 1).
- composePostCompactHistory now merges every file restoration block
and the image block into a single user Content following the model
ack. The previous output had consecutive user roles, which
geminiChat.test.ts:6289 enforces against and Gemini providers
reject with 400 "consecutive same-role content" (Finding 2).
- Preserve a trailing model+functionCall through compaction so a
pending functionResponse (sitting in sendMessageStream's
pendingUserMessage) has a matching call. Without this, hard-rescue
auto-compaction mid tool-use loop produces a user+functionResponse
with no preceding model+functionCall → API 400. This restores the
protection the split-point in-flight fallback used to provide.
When the funcCall lands without attachments it folds into the
ack's own model Content to avoid model→model adjacency (Finding 3).
- composePostCompactHistory now takes an optional workspaceRoot and
silently skips file paths that resolve outside it.
extractRecentFilePaths picks up paths from model functionCall args
regardless of whether the tool execution succeeded; without a
boundary check, an adversarial model that issued
read_file('/etc/passwd') — denied by the permission system —
would still have its path extracted and re-read into the next
prompt. compress() passes config.getTargetDir() as the boundary
(Finding 4).
Suggestions:
- composePostCompactHistory + buildFileRestorationBlocks +
readFileSizeAdaptive all take optional AbortSignal and short-
circuit / pass it to readFile's { signal } option. Cancelled
compactions stop on the next file read (Finding 5).
- postProcessSummary fallback no longer re-injects the raw
<analysis> block when the strip leaves nothing. The new
stripAnalysisBlock helper runs the closed-tag strip AND an
unclosed-tag strip (handles 'model ran out of output tokens
before closing'). If both leave nothing, postProcessSummary
emits '[Summary unavailable]' rather than leaking scratchpad
(Finding 6).
- firePostCompactEvent now receives stripAnalysisBlock(summary) so
hook consumers see the same text that lands in history. The
resume trailer stays out of the hook payload — that's wrapper
decoration for the next agent turn, not state for consumers
(Finding 8a).
Docs:
- Update the geminiChat.ts comment around `trigger: 'auto'` to
describe what the trigger actually does post-refactor (hook event
categorization) rather than the deleted manual-only orphan-strip
it used to guard against (Finding 8b).
Regression tests cover all six fixable code-path changes
(role alternation, trailing funcCall preservation, workspace
boundary, abort propagation, closed-tag fallback strip, unclosed-tag
fallback strip).
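
The role-alternation rules from Findings 2 and 3 can be sketched as below. This is a minimal illustration, not the PR's actual `composePostCompactHistory`: the `Part`/`Content` shapes are simplified stand-ins, and `composeSketch` and `rolesAlternate` are hypothetical names. Only the ack text and the three output shapes (3-entry, 4-entry, 2-entry fold) are taken from this page.

```typescript
// Simplified stand-ins for the provider's Content/Part shapes (assumed).
type FunctionCall = { name: string; args: Record<string, unknown> };
type Part = { text?: string; functionCall?: FunctionCall };
type Content = { role: "user" | "model"; parts: Part[] };

const ACK_TEXT = "Got it. Thanks for the additional context!";

/**
 * Hypothetical composer that never emits two adjacent turns with the same role.
 * - attachments: every restoration part (files + images), merged into ONE user turn
 * - trailingCall: a pending model+functionCall turn to preserve, if any
 */
function composeSketch(
  summary: string,
  attachments: Part[],
  trailingCall?: Content,
): Content[] {
  const summaryTurn: Content = { role: "user", parts: [{ text: summary }] };
  const ackTurn: Content = { role: "model", parts: [{ text: ACK_TEXT }] };

  if (attachments.length > 0) {
    const attachmentTurn: Content = { role: "user", parts: attachments };
    // 4-entry shape when a call is pending, 3-entry shape otherwise.
    return trailingCall
      ? [summaryTurn, ackTurn, attachmentTurn, trailingCall]
      : [summaryTurn, ackTurn, attachmentTurn];
  }

  if (trailingCall) {
    // Fold: keep only the functionCall parts inside the ack turn so the
    // history does not end with model -> model.
    const calls = trailingCall.parts.filter((p) => p.functionCall !== undefined);
    ackTurn.parts.push(...calls);
  }
  return [summaryTurn, ackTurn];
}

// True when no two neighbouring turns share a role.
function rolesAlternate(history: Content[]): boolean {
  return history.every((turn, i) => i === 0 || turn.role !== history[i - 1]?.role);
}
```

In the fold case the trailing turn's text is dropped on purpose, on the assumption that the summary already captures it.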
@wenshao thanks for the review. Replies to the two review-body findings:
Finding 8a (PostCompact hook receives raw summary): Fixed in 1f177d7. Extracted …
Finding 8b (stale geminiChat.ts comment): Updated in 1f177d7. The comment around …
Review round summary: commit 1f177d7
Triage of @wenshao's review (7 inline + 2 review-body findings): …
Regression tests cover all six fixed code-path changes (role alternation, trailing funcCall preservation, workspace boundary, abort propagation, closed-tag + unclosed-tag fallback strip). 157/157 tests pass in …
…omposer 4-entry branch
wenshao review round 2 on PR #4599.
- readFileSizeAdaptive now stats the file first and short-circuits to a reference when its byte size exceeds maxChars*4 (the safe UTF-8 upper bound: a file larger than that cannot fit within maxChars chars). This stops a multi-GB file the agent previously touched from being slurped into a Buffer and exhausting the heap mid-compaction, exactly when we're trying to reduce memory. A large binary file is now returned as a reference rather than being read for binary detection.
- Add a test for composePostCompactHistory's 4-entry branch (attachments + trailing model+functionCall) producing [user(summary), model(ack), user(attachments), model(fc)]. This is the common mid-tool-loop compaction case; a model->model adjacency here is a provider 400. Prior tests only covered the 2-entry fold (no attachments) and 3-entry (no trailing fc) shapes.
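
The stat-first short-circuit described in this commit can be sketched as follows. To keep the example self-contained, file access is hidden behind a hypothetical synchronous `FileProbe` interface; the real function is asynchronous and also handles the missing and binary cases, which are omitted here. `classifySketch` is an illustrative name.

```typescript
// Hypothetical file-access abstraction so the sketch needs no real disk I/O.
interface FileProbe {
  sizeBytes(path: string): number; // like a stat() call
  readText(path: string): string;  // like a full read + UTF-8 decode
}

type Classification =
  | { kind: "embed"; content: string }
  | { kind: "reference" };

function classifySketch(path: string, maxChars: number, probe: FileProbe): Classification {
  // Cheap gate: a file larger than maxChars * 4 bytes cannot decode to
  // maxChars characters or fewer, so it is never read into memory.
  if (probe.sizeBytes(path) > maxChars * 4) {
    return { kind: "reference" };
  }
  // Only now pay for the read; the final decision uses decoded length.
  const content = probe.readText(path);
  return content.length <= maxChars
    ? { kind: "embed", content }
    : { kind: "reference" };
}
```

The byte check is only an upper bound. Files that pass it are still measured by decoded character count, so a multi-byte text file is not misclassified.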
…tyle format
Replaces the <state_snapshot> XML template with a numbered 9-section structure that mandates verbatim preservation of user messages, including the historical chronological list (section 6). The new format is designed to pair with post-compact file/image restoration (separate work) so the agent can resume long single-turn tasks without losing intent.

…ompt
The user-turn trigger injected after the system prompt still said 'generate the <state_snapshot>' from the old XML prompt era. Updated to 'produce the 9-section summary' to match Task 1's new prompt format. Also tightens the prompt test to assert the specific user-message verbatim mandate (not just the word 'verbatim' anywhere) so a future regression that drops the mandate won't silently pass.
extractRecentFilePaths walks history newest-first and returns the top N unique file paths touched by read_file/write_file/edit/replace tool calls. Pure function, no side effects, no state cache — readiness for the next compaction-rewrite tasks.
Three small cleanups from code review:
- Map<string, number> -> Set<string> (the index value was never read)
- Guard against maxFiles <= 0 explicitly (avoids returning 1 result when caller passes 0 as a 'disable' sentinel)
- Document 'replace' as a legacy alias for 'edit' so a future cleanup pass does not delete it as apparent dead code
Adds one test covering the maxFiles=0 path.
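
A newest-first extractor with the `Set` and `maxFiles <= 0` guard described above might look roughly like this. It is a sketch under assumptions: the `Content`/`Part` shapes are simplified, the `file_path` argument key is a guess, and `extractRecentFilePathsSketch` is not the function from the PR.

```typescript
type FunctionCall = { name: string; args?: Record<string, unknown> };
type Part = { text?: string; functionCall?: FunctionCall };
type Content = { role: "user" | "model"; parts: Part[] };

// 'replace' is kept as a legacy alias for 'edit'.
const FILE_TOOLS = new Set(["read_file", "write_file", "edit", "replace"]);

function extractRecentFilePathsSketch(history: Content[], maxFiles: number): string[] {
  // Explicit guard: 0 (or a negative value) means "disabled", not "return one".
  if (maxFiles <= 0) return [];

  const seen = new Set<string>(); // insertion order == newest-first order
  for (const turn of history.slice().reverse()) {
    if (turn.role !== "model") continue;
    for (const part of turn.parts.slice().reverse()) {
      const call = part.functionCall;
      if (!call || !FILE_TOOLS.has(call.name)) continue;
      const filePath = call.args?.["file_path"]; // assumed argument key
      if (typeof filePath !== "string") continue;
      seen.add(filePath); // duplicates keep their first (most recent) slot
      if (seen.size >= maxFiles) return Array.from(seen);
    }
  }
  return Array.from(seen);
}
```

Because the function is pure, deduplication and ordering can be tested with hand-built histories and no disk access.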
extractRecentImages walks history newest-first, collects up to N image inlineData parts, and attributes each one to the model+functionCall that preceded it (when one exists). Returns chronological order so callers can render a meaningful 'last visual state ends here' strip.
readFileSizeAdaptive reads a file and returns one of: embed (full content for files ≤ maxTokens × 4 chars), reference (path-only for large files), missing (deleted since last touch), or binary (non-text content). The embed/reference distinction mirrors claude-code's compact_file_reference vs file attachment behavior, but without introducing new message types.
Three corrections from code review:
- Import CHARS_PER_TOKEN from tokenEstimation.ts (canonical) instead of redeclaring locally, preventing silent drift between modules.
- Compare decoded character length, not raw byte length, against the cap. Otherwise a 10k-char Chinese file would be ~30k bytes and would be mis-classified as 'reference' despite fitting the budget.
- Rename FileReadResult -> FileEmbedResult to avoid a name collision with the unrelated FileReadResult interface in fileUtils.ts.
Adds a CJK-text test that catches the byte/char regression.
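
The byte/char distinction behind the second correction can be shown with a small example. The helper names and the 20,000-character cap are illustrative only; `TextEncoder` is the standard UTF-8 encoder available in Node and browsers.

```typescript
// Compare against the cap using decoded characters (the corrected behaviour).
function fitsByChars(text: string, maxChars: number): boolean {
  return text.length <= maxChars;
}

// Compare against the cap using UTF-8 byte length (the buggy behaviour).
function fitsByBytes(text: string, maxChars: number): boolean {
  return new TextEncoder().encode(text).length <= maxChars;
}

// A 10,000-character CJK string: each of these characters is 3 bytes in UTF-8.
const cjkText = "中".repeat(10000);
const utf8Bytes = new TextEncoder().encode(cjkText).length; // 30000

const capChars = 20000; // illustrative cap
const byChars = fitsByChars(cjkText, capChars); // true: 10000 <= 20000
const byBytes = fitsByBytes(cjkText, capChars); // false: 30000 > 20000
```

With a byte-based comparison the file would be downgraded to a path-only reference even though its text fits the character budget.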
buildFileRestorationBlocks reads each candidate file, classifies it as embed/reference/missing/binary, and emits one consolidated reference block (path-only list) plus one user message per embedded small file. Total embed size is capped at POST_COMPACT_TOKEN_BUDGET; over-budget files downgrade to reference.
The previous version of this test wrote 3 files totalling 9k chars against a 200k char budget. The assertions trivially passed regardless of whether the budget check existed in the implementation. The new version writes 11 files of 20k chars (each at the per-file cap) so the budget is exhausted by the 10th and the 11th must downgrade from embed to reference. Asserts both: file 11 appears in the reference block, and file 11's content does NOT appear in any embed block.
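
The budget behaviour this test exercises can be sketched as a simple first-fit plan. The 20k per-file cap and 200k total budget mirror the numbers quoted above; `planRestorationSketch` and its types are hypothetical, and the real builder may order or downgrade files differently.

```typescript
type Candidate = { path: string; content: string };
type RestorationPlan = { embeds: Candidate[]; references: string[] };

function planRestorationSketch(
  files: Candidate[],
  perFileCapChars: number,
  totalBudgetChars: number,
): RestorationPlan {
  const plan: RestorationPlan = { embeds: [], references: [] };
  let used = 0;
  for (const file of files) {
    const size = file.content.length;
    const fitsFile = size <= perFileCapChars;
    const fitsBudget = used + size <= totalBudgetChars;
    if (fitsFile && fitsBudget) {
      plan.embeds.push(file);
      used += size;
    } else {
      // Over the per-file cap or over the remaining budget: path only.
      plan.references.push(file.path);
    }
  }
  return plan;
}

// Eleven files of 20,000 chars each against a 200,000-char budget.
const elevenFiles: Candidate[] = Array.from({ length: 11 }, (_, i) => ({
  path: `file${i + 1}.ts`,
  content: "x".repeat(20000),
}));
const budgetPlan = planRestorationSketch(elevenFiles, 20000, 200000);
```

Here the tenth file uses the last of the budget, so only the eleventh is downgraded to a reference.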
buildImageRestorationBlock emits a single user message whose first part is a metadata header (turn index + source tool name + args per image), followed by the inlineData parts themselves. Handles user-paste images (no source tool) by labeling them as 'user-provided'.
Assembles the full post-compact history in order: summary → model ack → file references → file embeds → image block. Each section is built by the per-concern extractors and builders added in previous tasks. This is the single integration point that chatCompressionService.compress() will call once the wire-up task lands.
Replaces the split-point + tail-preservation model with full-history compression + composePostCompactHistory. The entire curated history is sent to the summary side-query, and the post-compact history is assembled by the new composer (summary + ack + file restores + image restore). BREAKING: the previously-exported findCompressSplitPoint, splitPointRetainingTrailingPairs, COMPRESSION_PRESERVE_THRESHOLD, and TOOL_ROUND_RETAIN_COUNT will be removed in the next commit. Tests that exercise them remain failing temporarily.
Deletes findCompressSplitPoint, splitPointRetainingTrailingPairs, COMPRESSION_PRESERVE_THRESHOLD, MIN_COMPRESSION_FRACTION, and TOOL_ROUND_RETAIN_COUNT, plus the tests that exercised them. The new behavior is covered by composePostCompactHistory and its unit tests.
Also cleans up:
- Stale orphan-strip comment in compress() that described the deleted manual-trigger orphan-funcCall handling.
- TEST_ONLY.COMPRESSION_PRESERVE_THRESHOLD hatch in client.ts.
- Docstring references in config.ts and compactionInputSlimming.ts.

…th 9 claude-aligned sections
Replaces the 9-section numbered-text prompt with qwen-code's original <state_snapshot> XML envelope, but with the 9 inner section tags content-aligned to claude-code:
<primary_request_and_intent>
<key_technical_concepts>
<files_and_code_sections>
<errors_and_fixes>
<problem_solving>
<all_user_messages>
<pending_tasks>
<current_work>
<next_step>
Also:
- <scratchpad> -> <analysis>, stripped by postProcessSummary (saves ~600-800 tokens of CoT noise per compaction).
- "Resume directly..." trailer moved out of the prompt body and into postProcessSummary (no longer re-generated by the model every compaction; lives once in code with our own wording).
- Section 6 verbatim-policed mandate relaxed to "chronological, include short messages like 'ok' / 'continue'", which matches claude-code intent without forcing the model to literally copy long user messages.
E2E (qwen3.6-plus, 6 substantial .ts files + thorough analysis): raw history 6508 -> summary 1513 (after strip ~947), 38% history compression. Overall context 24642 -> 20647 reported (-16%), with another ~664 tokens actually saved by the post-strip but not reflected in the conservative token-math heuristic.

Four small follow-ups from review of 641a0ea:
- prompts.ts: rewrite getCompressionPrompt's stale JSDoc; it still described the deleted 9-section numbered-text format and the verbatim mandate that was relaxed.
- chatCompressionService.ts: clarify the token-math comment so it's obvious the ~1000 token deduction covers the full compression system prompt + kick-off user turn (not any single instruction) and that newTokenCount slightly over-counts because <analysis> gets stripped by postProcessSummary downstream.
- postCompactAttachments.ts: add a NOTE comment on the <analysis> strip regex covering its strict-tag-match assumption and multi-block / non-greedy semantics.
- postCompactAttachments.test.ts: replace the four lazy `await import('./postCompactAttachments.js')` calls inside the postProcessSummary describe block with one top-level static import, consistent with how every other describe in the file imports.
The R3.4 end-to-end auto-compression test drives the real ChatCompressionService, which reads config.getTargetDir() for the post-compact file-restoration workspace boundary. The geminiChat mock config lacked getTargetDir, so the test threw "config.getTargetDir is not a function" on CI. Add the mock to unblock the failing Test jobs.
…ot trigger
Add four env-overridable chatCompression settings (priority env > settings > default):
- maxRecentFilesToRetain (QWEN_COMPACT_MAX_RECENT_FILES, default 5)
- maxRecentImagesToRetain (QWEN_COMPACT_MAX_RECENT_IMAGES, default 3)
- enableScreenshotTrigger (QWEN_COMPACT_SCREENSHOT_TRIGGER, default true)
- screenshotTriggerThreshold (QWEN_COMPACT_SCREENSHOT_THRESHOLD, default 50)
The screenshot trigger fires auto-compaction once tool-returned images accumulate to the threshold even when token usage is below the auto tier, so computer-use sessions don't drown the model in stale screenshots. It counts only images nested in functionResponse.parts (tool results), not user pastes, and runs only in the would-be-NOOP path when enabled.
Fix a latent bug surfaced while wiring the trigger: extractRecentImages only inspected top-level inlineData parts, but convertToFunctionResponse nests tool media under functionResponse.parts, so post-compact restoration recovered ZERO tool screenshots in real sessions, while unit tests stayed green against a fabricated top-level shape. It now walks both shapes; the image counter and tests use the real nested shape.
Remove the now-defunct contextPercentageThreshold deprecation warning (the field was already dropped from ChatCompressionSettings) and its tests, and document the four new settings.
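
The "env > settings > default" priority can be sketched as below. The setting names, environment variable names, and defaults come from the commit message; the parsing rules (base-10 integers, `true`/`false`/`1`/`0` for booleans, silent fallback on invalid values) are assumptions, and `resolveTuningSketch` is not the PR's `resolveCompactionTuning`.

```typescript
type SettingsSketch = {
  maxRecentFilesToRetain?: number;
  maxRecentImagesToRetain?: number;
  enableScreenshotTrigger?: boolean;
  screenshotTriggerThreshold?: number;
};
type Env = Record<string, string | undefined>;

// env wins when it parses cleanly; otherwise settings; otherwise the default.
function resolveInt(raw: string | undefined, setting: number | undefined, fallback: number): number {
  if (raw !== undefined) {
    const parsed = parseInt(raw, 10);
    if (!isNaN(parsed) && parsed >= 0) return parsed;
  }
  return setting ?? fallback;
}

function resolveBool(raw: string | undefined, setting: boolean | undefined, fallback: boolean): boolean {
  if (raw === "true" || raw === "1") return true;
  if (raw === "false" || raw === "0") return false;
  return setting ?? fallback;
}

function resolveTuningSketch(env: Env, settings: SettingsSketch): Required<SettingsSketch> {
  return {
    maxRecentFilesToRetain: resolveInt(env["QWEN_COMPACT_MAX_RECENT_FILES"], settings.maxRecentFilesToRetain, 5),
    maxRecentImagesToRetain: resolveInt(env["QWEN_COMPACT_MAX_RECENT_IMAGES"], settings.maxRecentImagesToRetain, 3),
    enableScreenshotTrigger: resolveBool(env["QWEN_COMPACT_SCREENSHOT_TRIGGER"], settings.enableScreenshotTrigger, true),
    screenshotTriggerThreshold: resolveInt(env["QWEN_COMPACT_SCREENSHOT_THRESHOLD"], settings.screenshotTriggerThreshold, 50),
  };
}
```

Passing the environment in as a plain record keeps the resolver pure and easy to test without mutating `process.env`.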
…fix misleading docs
Code-review follow-up. The screenshot trigger counts only images nested in functionResponse.parts. Compaction replaces those with the summary and re-embeds survivors as TOP-LEVEL parts in the restoration block, which the counter ignores, so the tool-image count always resets to ~0 and the trigger cannot immediately re-fire, independent of maxRecentImages. The resolveCompactionTuning JSDoc and the settings.md note previously warned of a non-existent "maxRecentImages near threshold => compact every turn" loop. Correct both, and add a regression test asserting countToolResponseImages() is 0 on composePostCompactHistory output.
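
The counting rule explains why the trigger cannot immediately re-fire; a minimal sketch follows. The `Part` shape is a simplified assumption and `countToolResponseImagesSketch` is an illustrative stand-in for the real `countToolResponseImages()`.

```typescript
type InlineData = { mimeType: string; data: string };
type FunctionResponse = { name: string; parts?: Part[] };
type Part = { text?: string; inlineData?: InlineData; functionResponse?: FunctionResponse };
type Content = { role: "user" | "model"; parts: Part[] };

// Counts only images nested under functionResponse.parts (tool results).
// Top-level inlineData (user pastes, restored images) is deliberately ignored.
function countToolResponseImagesSketch(history: Content[]): number {
  let count = 0;
  for (const turn of history) {
    for (const part of turn.parts) {
      for (const nested of part.functionResponse?.parts ?? []) {
        if (nested.inlineData?.mimeType.startsWith("image/")) count += 1;
      }
    }
  }
  return count;
}

const png: InlineData = { mimeType: "image/png", data: "AAAA" };

// Before compaction: a screenshot arrives nested inside a tool result.
const beforeCompaction: Content[] = [
  { role: "user", parts: [{ functionResponse: { name: "screenshot", parts: [{ inlineData: png }] } }] },
];

// After compaction: the surviving image is re-embedded as a top-level part.
const afterCompaction: Content[] = [
  { role: "user", parts: [{ text: "summary" }] },
  { role: "model", parts: [{ text: "ack" }] },
  { role: "user", parts: [{ text: "image metadata header" }, { inlineData: png }] },
];
```

Since restored images are top-level parts, the counter sees none of them in the post-compact history.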
Independent verification: unit tests + live TUI E2E
Checked this PR out into a worktree and verified it two ways: the unit suite, and a real TUI session driven through …
Honest deltas in my setup (so you can judge what's actually covered): …

Unit tests: 307 passed

Live E2E: does it actually fix the stated problem?
PR's claim: the old split-point model drops the verbatim user prompt + recent state on compaction, leaving the agent "blind"; the new model keeps every user message verbatim (summary section 6) and restores recently-touched files.
I built a long session in a scratch workspace: read 4 small …
(1) The compaction side-query received 67,007 input tokens and returned a 3,075-token summary; that asymmetry is exactly why it shrinks.
(2) "User prompt survives verbatim": confirmed. The side-query got the FULL history (no split): 11 messages = …
(3) "Agent isn't blind after compaction": confirmed (the important one). The follow-up forbade re-reading; the agent still answered with file-specific detail that can only come from the restored embedded bodies: …
(4) Size-adaptive restore: confirmed. From the post-compact main call, the file block embeds small files in full but lists the large one as a path-only reference: …
(5) Post-compact assembly order matches the PR description exactly, and the …
Bonus: the inflation guard fires. My first … That's exactly the small-session degenerate case the Risk section calls out: slimmed history ≈ 990 tokens, summary alone was 1,645, so …

One observation (not a blocker)
The post-compact tail is a …
Net: full-history summary, verbatim user-message preservation, size-adaptive file restoration, assembly order, analysis-stripping, and the inflation guard all hold up against real API payloads. The screenshot path is covered by the regression unit test only, not live.

Chinese-language summary (translated)
Independently verified this PR along two lines: unit tests, plus …
Honest notes on environment differences: …
Unit tests: all 307 tests across 5 files pass.
End-to-end verification (against the problem this PR sets out to solve): built a long session (read 4 small files + extensive analysis + …
One observation (non-blocking): the post-compact history ends with a user turn (the file block), and the follow-up question is also a user turn, so the payload contains two consecutive user messages. The OpenAI-compatible endpoints for deepseek/qwen accept this, but the code's own comments note that the native Gemini protocol rejects consecutive same-role content. This is harmless for OpenAI-compatible endpoints (this PR's target), but worth keeping in mind if a native Gemini provider is supported later.
Conclusion: full-history summary, verbatim user-message preservation, size-adaptive file restoration, assembly order, analysis stripping, and the inflation guard were all verified against real API requests. The screenshot path is covered only by the regression unit test, with no live TUI verification.
…nst throws
wenshao review round 3 on PR #4599 (two Criticals).
- isInsideWorkspace now resolves symlinks via realpathSync (safeRealpath, with a lexical fallback for non-existent paths). A symlink living inside the workspace but pointing outside (e.g. workspace/.env -> ~/.ssh/id_rsa) previously passed the lexical boundary check and had its target read and embedded into the post-compact history sent to the provider. Added a RED-verified security regression test (secret embedded under the old lexical check; rejected under realpath).
- Wrap composePostCompactHistory in try/catch inside compress(). The summary side-query has already succeeded at that point, so a restoration-assembly throw (disk I/O / malformed history) previously escaped to sendMessageStream, crashing the active turn AND bypassing the COMPRESSION_FAILED breaker. It now degrades to summary + ack.
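
A symlink-aware boundary check along these lines could look like the sketch below. It assumes POSIX-style paths; the `realpath` parameter is injected so the behaviour can be exercised without touching the disk, and both function names are illustrative rather than the PR's actual helpers.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

type Realpath = (p: string) => string;

// Resolve symlinks when the path exists; fall back to a purely lexical
// resolution for paths that do not exist (realpath throws for those).
function safeRealpathSketch(p: string, realpath: Realpath): string {
  try {
    return realpath(p);
  } catch {
    return path.resolve(p);
  }
}

function isInsideWorkspaceSketch(
  candidate: string,
  workspaceRoot: string,
  realpath: Realpath = (p) => fs.realpathSync(p),
): boolean {
  const root = safeRealpathSketch(workspaceRoot, realpath);
  const target = safeRealpathSketch(candidate, realpath);
  const rel = path.relative(root, target);
  // Outside when the relative path climbs out of the root or is absolute.
  // Using path.relative (not a string prefix test) also rejects siblings
  // such as /ws-evil when the root is /ws.
  const escapes = rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel);
  return !escapes;
}
```

The check is applied to the resolved target, not the link itself, which is what rejects a `workspace/.env -> ~/.ssh/id_rsa` style link.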
d245704 to cf5da5a
Round 3 triage (commit …)
✅ Fixed (2 Critical): …
✅ Already addressed in …
❌ Declined as out of scope for this PR (replace tail-preservation with summary+restoration), round 3: …
    // image restore. No tail preservation, no continuation bridge.
    extraHistory = await composePostCompactHistory(curatedHistory, summary, {
      workspaceRoot: config.getTargetDir(),
      signal,
This reads as a placeholder/test comment ("test line 422") with no actionable content. Line 422 currently sits inside the new try/catch around composePostCompactHistory (added in cf5da5a). If there's a specific concern there, let me know and I'll address it.
    // CLAUDE-CODE-STYLE FULL-HISTORY COMPRESSION: the entire curated
    // history is sent to the summary side-query (no split, no tail
    // preservation), and the post-compact history is assembled by
    // composePostCompactHistory below (summary + model ack + recent
This appears to be a placeholder/test comment ("test") with no actionable content. Happy to address a specific concern at this line if you can share the details.
    // suggests. We accept that inaccuracy in favor of avoiding local
    // token estimation.
    if (
      typeof compressionInputTokenCount === 'number' &&
Same here — reads as a placeholder/test comment ("test") with no actionable content. Let me know the specific concern and I'll take a look.
wenshao left a comment
Additional findings not mappable to diff lines:
(A) extraHistory.slice(2) positional coupling (chatCompressionService.ts:489): restorationChars calculation hardcodes slice(2) assuming the composer always outputs [summary, ack] first. In the fold branch (no attachments, trailing fc), the output is [user(summary), model(ack+fc)] — only 2 entries — so slice(2) is empty and functionCall parts folded into the ack are NOT counted. Fix: return { summary, ack, attachments } from the composer, or compute as totalChars(extraHistory) - summaryTextChars.
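
The positional-coupling problem in (A) can be reproduced with a small sketch. The character estimator here is a crude stand-in (text length plus serialized functionCall length), not the real `estimateContentChars`, and the function names are hypothetical.

```typescript
type FunctionCall = { name: string; args: Record<string, unknown> };
type Part = { text?: string; functionCall?: FunctionCall };
type Content = { role: "user" | "model"; parts: Part[] };

// Crude per-part size estimate (assumption, for illustration only).
function partChars(part: Part): number {
  const textChars = part.text?.length ?? 0;
  const callChars = part.functionCall ? JSON.stringify(part.functionCall).length : 0;
  return textChars + callChars;
}

function totalChars(history: Content[]): number {
  return history.reduce(
    (sum, turn) => sum + turn.parts.reduce((s, p) => s + partChars(p), 0),
    0,
  );
}

// Positional variant: assumes entries 0 and 1 are always [summary, ack].
function restorationCharsBySlice(history: Content[]): number {
  return totalChars(history.slice(2));
}

// Position-independent variant suggested in the review: everything that is
// not the summary text counts, including parts folded into the ack turn.
function restorationCharsBySubtraction(history: Content[], summaryText: string): number {
  return totalChars(history) - summaryText.length;
}

// Fold branch: no attachments, trailing functionCall folded into the ack.
const summaryText = "S".repeat(100);
const foldHistory: Content[] = [
  { role: "user", parts: [{ text: summaryText }] },
  {
    role: "model",
    parts: [{ text: "ack" }, { functionCall: { name: "read_file", args: { file_path: "a.ts" } } }],
  },
];
```

On `foldHistory` the slice variant returns 0, while the subtraction variant still counts the ack text and the folded functionCall.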
(B) Test gaps: (1) restorationChars calculation — no test verifies the inflation guard fires when restoration attachments push post-compact size past original; (2) firePostCompactEvent receiving stripAnalysisBlock(summary) — existing tests use plain text, no test with <analysis> blocks; (3) isInsideWorkspace trailing-separator boundary — no test for sibling paths sharing a prefix.
(C) Restoration token estimation undercounts images: estimateContentChars uses flat imageTokenEstimate (default 1600 tokens) per image, but actual base64 screenshot payloads are 50K-125K tokens. In computer-use sessions, 3 restored screenshots could push post-compact history well past original size without triggering the inflation guard.
— qwen3.7-max via Qwen Code /review
wenshao review round 4 on PR #4599.
- isSummaryEmpty now checks the STRIPPED summary: a response that is only an <analysis> block (no <state_snapshot>) strips to empty, so it takes the COMPRESSION_FAILED_EMPTY_SUMMARY path instead of "succeeding" with `[Summary unavailable]` as the agent's only context (silent amnesia).
- Manual /compress strips a trailing ORPHANED model+functionCall before composing. It has no pending functionResponse, so preserving it would emit model[fc] then the next user text turn -> API 400. Auto-compaction still keeps it (the pending response pairs with it).
- The restoration-failure catch fallback now folds a trailing model+functionCall into the ack turn, so a pending functionResponse (auto mid-tool-loop) keeps its matching call even on the degraded path.
- extractRecentFilePaths skips file paths whose tool call FAILED (an error functionResponse), so a denied read_file is never re-read off disk during compaction, closing a permission-bypass side channel.
RED-verified regression tests for the empty-summary, orphan-strip, and permission-bypass fixes. Corrected the postProcessSummary comment.
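
The strip-then-check behaviour can be sketched with two regular expressions. The patterns and function names below are assumptions for illustration; the PR's `stripAnalysisBlock` and `isSummaryEmpty` may differ in detail.

```typescript
// Remove every closed <analysis>...</analysis> block (non-greedy, multi-block),
// then remove an unclosed trailing <analysis>... left behind when the model
// ran out of output tokens before writing the closing tag.
function stripAnalysisBlockSketch(raw: string): string {
  const withoutClosed = raw.replace(/<analysis>[\s\S]*?<\/analysis>/g, "");
  const withoutUnclosed = withoutClosed.replace(/<analysis>[\s\S]*$/, "");
  return withoutUnclosed.trim();
}

// Emptiness is judged on the STRIPPED text, so a response that contains only
// scratchpad is treated as a failed summary rather than a successful one.
function isSummaryEmptySketch(raw: string): boolean {
  return stripAnalysisBlockSketch(raw).length === 0;
}

const okResponse = "<analysis>thinking...</analysis>\n<state_snapshot>done</state_snapshot>";
const truncatedResponse = "<analysis>ran out of tokens mid-thought";
const scratchpadOnly = "<analysis>only scratchpad</analysis>";
```

For these inputs only `okResponse` yields a non-empty summary; the other two strip to the empty string.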
Round 4 triage (commit …)
✅ Fixed (4 Criticals + 1 comment): …
All three behavioural fixes have RED-verified regression tests.
✅ Already addressed in …
❌ Declined as out of scope for this PR (replace tail-preservation with summary+restoration): …
…fold text drop
wenshao review round 5 on PR #4599.
- Regression test for the restoration-failure catch fallback: mock composePostCompactHistory to reject and assert compaction still returns COMPRESSED (no escape to sendMessageStream / breaker bypass) with the trailing functionCall folded into the ack and the trailing text dropped.
- Document that the fold branch intentionally keeps only functionCall parts (the trailing turn's text is already captured in the summary); the asymmetry with the with-attachments branch is deliberate.
wenshao left a comment
No issues found. LGTM! ✅ The new regression test properly covers the catch-fallback degradation path that was the R4b Critical gap. The fold-branch documentation clarifies the deliberate asymmetry. — qwen3.7-max via Qwen Code /review
What this PR does
Replaces qwen-code's tail-preservation auto-compaction model — which split history by char-count, summarized the front 70%, and preserved the most recent 30% verbatim — with a claude-code-style model: full-history summary + post-compact restoration attachments. The new flow always sends the entire curated history to the summary side-query, then assembles the post-compact history from a 9-section structured summary, a synthetic model ack, the top 5 recently-touched files (size-adaptive: small files embed full current content read fresh from disk, large files list as path-only references), and the top 3 recently-captured images with per-image metadata (turn index + source tool + args).
Why it's needed
The old `findCompressSplitPoint` model has a critical failure mode for single-turn long-running tasks, which are the dominant pattern for any computer-use-style workflow ("open Safari, click the first result, scroll, take a screenshot"). The split rule requires at least two non-functionResponse user messages in the history (one early, one past the 70% char-count mark) to find a "clean" split. A single-turn task only ever has one such user message, so the scan never succeeds and falls through to fallback branches that, per the code comment at the old lines 193–204, "bias toward more compression rather than less". In the common case where compression fires after a tool result returns (history ends with `user+fr`) or after the model gives its final text response, the fallback returns `contents.length`, meaning all screenshots and the original user prompt are replaced by an opaque text summary. The agent resumes "blind": no visual context, no verbatim user intent, often looping back to clarifying questions or re-screenshotting from scratch.

The new model addresses both gaps. Section 6 of the new summary template ("All user messages, chronological, verbatim") mandates verbatim quoting of every user message, so the user prompt survives compaction even when there's only one. The image restoration block (top 3 most recent images) preserves recent visual state regardless of whether it came from a tool result or a user paste, and the metadata header tells the model which tool call produced each image so it can correlate visual state with the actions that produced it. The file restoration block (top 5 most recently touched files via `read_file`/`write_file`/`edit`/`replace`, size-adaptive embed vs. reference) means the model doesn't have to re-read files it was just working with. The pattern is: trust the summary for narrative continuity, trust selective restoration for state continuity.

Reviewer Test Plan
How to verify
Run the unit suite for the affected files (295 tests across 5 files all pass):

    cd packages/core
    npx vitest run \
      src/services/chatCompressionService.test.ts \
      src/services/postCompactAttachments.test.ts \
      src/core/prompts.test.ts \
      src/services/tokenEstimation.test.ts \
      src/core/client.test.ts

For an end-to-end check, launch the CLI in a workspace with a handful of files and exercise `/compress`: …

After `/compress` completes, inspect the most recent file in `/tmp/e2e-logs/` for the post-compact API call. The `request.messages` array should follow the order: `system` (system prompt) → `user` (Qwen Code context) → `model` (ack) → `user` (9-section summary from the compaction side-query) → `model` (ack "Got it. Thanks for the additional context!") → 5 × `user` (file restoration blocks, each `Recently accessed file (full current content embedded)` for small files or `reference only` for large files) → optional `user` (image restoration block) → `user` (follow-up question). The summary message must contain section 6 ("All user messages (chronological)") with the original user prompts quoted verbatim.

Evidence (Before & After)
This was run against
qwen3.6-plusin a workspace with 6 small TypeScript files (config.ts,logger.ts,database.ts,router.ts,auth.ts,cache.ts). Steps: ask the agent to read all 6 files and describe each, then ask for a detailed 5-paragraph analysis of each, then run/compress.Before compaction: 20,763 prompt tokens. After compaction: 14,307 prompt tokens (a 31% reduction). Context usage dropped from 2.1% to 1.4% of the model's 1M-token window. The post-compact history correctly contained: the 9-section summary with both original user prompts quoted verbatim in section 6, 5 file-embed blocks (one per file, full content), and a clean continuation when asked a follow-up ("Which of those 6 files seems most security-sensitive?") — the agent answered coherently based on the restored files without re-reading.
The single-turn computer-use regression test (packages/core/src/services/chatCompressionService.test.ts) reproduces the motivating scenario programmatically: a history of one user prompt followed by 5 rounds of (model+fc, user+fr with screenshot). It asserts that (a) the user prompt survives verbatim in the summary, (b) the 3 most recent screenshots are restored in the image block in chronological order (s3/s4/s5), and (c) the metadata header contains computer_use__get_app_state with "app":"Safari". This is the canary for the user-facing UX claim.
Tested on
✅ verified locally · ⚠️ CI
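The image-restoration behaviour asserted by the single-turn regression test described above (the 3 most recent screenshots, chronological order, a metadata header naming the source tool call) can be sketched as follows. This is an illustrative sketch only: ToolImage, selectRecentImages, metadataHeader, and the header layout are hypothetical names and formats, not the PR's actual code.

```typescript
// Hypothetical record for an image found in the history.
interface ToolImage {
  id: string;                     // e.g. "s1" .. "s5" in the regression test
  turnIndex: number;              // turn that produced the image
  toolName: string;               // source tool call
  args: Record<string, unknown>;  // arguments of that tool call
}

// Restoration cap described in the PR: the 3 most recent images.
const MAX_RESTORED_IMAGES = 3;

// Assumes `images` is already in chronological order (oldest first).
// slice(-n) keeps the last n entries and preserves that order.
function selectRecentImages(
  images: ToolImage[],
  limit: number = MAX_RESTORED_IMAGES,
): ToolImage[] {
  return limit > 0 ? images.slice(-limit) : [];
}

// Hypothetical header layout: turn index + source tool name + JSON-encoded args.
function metadataHeader(image: ToolImage): string {
  return `[turn ${image.turnIndex}] ${image.toolName} ${JSON.stringify(image.args)}`;
}
```

Keeping the header next to each image is what lets the model tie a screenshot back to the action that produced it.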
Environment (optional)
Local runtime: node dist/cli.js with --openai-logging against qwen3.6-plus on macOS arm64. Unit tests via vitest in packages/core.
Risk & Scope
- The inflation guard (newTokenCount > originalTokenCount → COMPRESSION_FAILED_INFLATED_TOKEN_COUNT) catches the small-session degenerate case where the new history would be larger than the original.
- Microcompaction (microcompaction/microcompact.ts) is not touched by this PR: the idle-trigger path still clears nested media (including computer-use screenshots) under the existing per-kind keepRecent=5. Improving that requires a whitelist-based approach and is a separate follow-up. Session resume (loading a compacted session from disk) is not specifically validated; runtime state reconstruction is a separate concern. The transcript-path pointer that claude-code includes in its summary (so the model can Read the source .jsonl for details) is not plumbed here.
- Removed symbols: findCompressSplitPoint, splitPointRetainingTrailingPairs, COMPRESSION_PRESERVE_THRESHOLD, MIN_COMPRESSION_FRACTION, TOOL_ROUND_RETAIN_COUNT. None of these are re-exported from packages/core/src/index.ts, so they are inaccessible to @qwen-code/sdk (TypeScript / Python / Java) and packages/acp-bridge consumers (verified by grep). The only in-repo reference outside chatCompressionService.ts was an internal test hatch, TEST_ONLY.COMPRESSION_PRESERVE_THRESHOLD in packages/core/src/core/client.ts, removed in the same change. Docstring references in config.ts, compactionInputSlimming.ts, and tokenEstimation.ts were updated to point at the new flow.
Linked Issues
Closes #4592
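Chinese description
What this PR does
Replaces qwen-code's old auto-compaction model (split the history by character count, compress the first 70% into a summary, keep the most recent 30% verbatim) with a claude-code-style "summary + post-compaction restoration attachments" model. The new flow always sends the full curated history to the summary side-query, then assembles the post-compaction history in this order: the 9-section structured summary + the model ack + the 5 most recently read or written files (adaptive: small files embed the full content freshly read from disk, large files carry only a path reference) + the 3 most recent images (each with a metadata header: turn index + source tool name + args).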
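Why it's needed
The old findCompressSplitPoint model has a fatal failure mode for single-turn long-running tasks, which is exactly the dominant pattern in every computer-use-style workflow ("open Safari, click the first search result, scroll down, take a screenshot"). The split rule requires at least two non-functionResponse user messages in the history, one earlier and one past the 70% character mark, before it can find a "clean" split. A single-turn task has only one such message (the user's original prompt, at index 0), so the scan never finds a split and falls into the fallback branch, which the original code comments (old lines 193–204) explicitly describe as "bias toward more compression rather than less". In the common case (compaction fires after a tool result returns, so the history ends with user+fr), the fallback returns contents.length, meaning every screenshot and the user's original prompt are replaced by one opaque text summary. The agent resumes "blind": with no visual context and no verbatim user intent, it often degrades into asking clarifying questions again or re-taking screenshots from scratch.
The new model closes both gaps. Section 6 of the new summary template ("All user messages, chronological, verbatim") forces every user message to be quoted verbatim, so even a lone user message survives compaction. The image restoration block (the 3 most recent images) preserves recent visual state, whether the images came from tool results or were pasted by the user, and the metadata header tells the model which tool call produced each image, making it easy to match visual state to the action that produced it. The file restoration block (the 5 most recent files touched by read_file/write_file/edit/replace, embedded or referenced adaptively by size) means the model does not have to re-read files it was just working with. The overall pattern: use the summary for narrative continuity and selective restoration for state continuity.
Reviewer Test Plan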
How to verify
Run the unit tests for these files; all 295 tests pass:
    cd packages/core
    npx vitest run \
      src/services/chatCompressionService.test.ts \
      src/services/postCompactAttachments.test.ts \
      src/core/prompts.test.ts \
      src/services/tokenEstimation.test.ts \
      src/core/client.test.ts
For an end-to-end check, launch the CLI in a workspace containing a few files and run /compress.
After /compress completes, look at request.messages in the newest JSON file under /tmp/e2e-logs/. The order should be: system (system prompt) → user (Qwen Code context) → model (ack) → user (9-section summary produced by the compaction side-query) → model (ack "Got it. Thanks for the additional context!") → 5 × user (file restoration blocks, each either "Recently accessed file (full current content embedded)" or "reference only") → optional user (image restoration block) → user (follow-up question). Section 6 of the summary message ("All user messages (chronological)") must quote every user message verbatim.
Evidence (Before & After)
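Measured with qwen3.6-plus in a workspace with 6 small TS files (config.ts, logger.ts, database.ts, router.ts, auth.ts, cache.ts): the agent was asked to read and describe all 6 files, then to write a detailed 5-paragraph analysis of each, and then /compress was run.
Before compaction: 20,763 prompt tokens. After compaction: 14,307 prompt tokens (-31%). Context usage dropped from 2.1% to 1.4% (1M-token window). The post-compaction history correctly contained the 9-section summary (section 6 quoted both original user prompts verbatim) and 5 file-embed blocks (one per file, full content). When asked the follow-up "which of the 6 files is most sensitive", the agent gave a coherent answer directly from the restored files without re-reading any of them.
The "single-turn computer-use regression" test in packages/core/src/services/chatCompressionService.test.ts reproduces this scenario programmatically: a history of 1 user prompt + 5 rounds of (model+fc, user+fr with screenshot). Assertions: (a) the user's original words appear verbatim in the summary; (b) the 3 most recent screenshots appear in the image block in chronological order (s3/s4/s5); (c) the metadata header contains computer_use__get_app_state and "app":"Safari". This is the canary for the user-facing UX claim.
Tested on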
✅ verified locally · ⚠️ left to CI
Environment (optional)
Local runtime: node dist/cli.js on macOS arm64, running qwen3.6-plus with --openai-logging. Unit tests: vitest in packages/core.
Risk & Scope
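- The inflation guard (newTokenCount > originalTokenCount → COMPRESSION_FAILED_INFLATED_TOKEN_COUNT) catches the small-session degenerate case.
- Microcompaction (microcompaction/microcompact.ts) is out of scope for this PR: the idle-trigger path still clears nested media (including computer-use screenshots) under the per-kind keepRecent=5. Improving it needs a whitelist-based approach and is a separate follow-up. Session resume was not specifically validated; runtime state reconstruction is a separate concern. The transcript-path pointer in claude-code's summary (which lets the model Read the original .jsonl for details) is not wired in by this PR.
- Removed symbols: findCompressSplitPoint, splitPointRetainingTrailingPairs, COMPRESSION_PRESERVE_THRESHOLD, MIN_COMPRESSION_FRACTION, TOOL_ROUND_RETAIN_COUNT. None of them are re-exported from packages/core/src/index.ts, so @qwen-code/sdk (TypeScript / Python / Java) and packages/acp-bridge consumers cannot reach them (verified by grep). The only reference in the repo outside chatCompressionService.ts was TEST_ONLY.COMPRESSION_PRESERVE_THRESHOLD in packages/core/src/core/client.ts (an internal test hatch), deleted in the same commit. The docstring references in config.ts, compactionInputSlimming.ts, and tokenEstimation.ts were also updated to point at the new flow.
Linked Issues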
Closes #4592