feat(cli): add /compress-fast command for no-LLM rule-based context compression #4893

Merged
yiliang114 merged 4 commits into QwenLM:main from ZijianZhang989:feat/compress-fast on Jun 12, 2026

@ZijianZhang989

Collaborator

What this PR does

Adds /compress-fast, a new slash command that compresses conversation context without any LLM side-query. It combines two rule-based steps: (1) forcing microcompaction to clear old tool results and media parts while keeping the most recent N, and (2) stripping thought parts from all model turns. The result is a significantly smaller history, typically freeing thousands of tokens, at zero API latency.

A chat_compression checkpoint is written to JSONL so --resume works exactly as it does after /compress.
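As a rough illustration of the two rule-based steps, the sketch below uses simplified, hypothetical `Part` and `Content` shapes and an assumed placeholder string. The function names are made up for this example and are not the PR's actual implementation. It keeps every tool-result part in place and only blanks the payload of older ones, which is one way the call/result pairing could stay valid for the API.

```typescript
// Hypothetical, simplified shapes; the real history types are richer.
interface Part {
  text?: string;
  thought?: boolean;
  functionResponse?: { name: string; response: unknown };
}
interface Content {
  role: 'user' | 'model';
  parts: Part[];
}

// Assumed placeholder text for a cleared tool result.
const CLEARED_PLACEHOLDER = '[old tool result cleared]';

// Step 1: keep the most recent `keep` tool results and blank out older ones.
function clearOldToolResults(history: Content[], keep: number): Content[] {
  let total = 0;
  for (const content of history) {
    for (const part of content.parts) {
      if (part.functionResponse !== undefined) total++;
    }
  }
  let toClear = Math.max(0, total - keep);
  // History is ordered oldest first, so the oldest results are cleared first.
  return history.map((content) => ({
    ...content,
    parts: content.parts.map((part) => {
      if (part.functionResponse === undefined || toClear === 0) return part;
      toClear--;
      return {
        functionResponse: {
          name: part.functionResponse.name,
          response: CLEARED_PLACEHOLDER,
        },
      };
    }),
  }));
}

// Step 2: drop thought parts from model turns; other parts are untouched.
// This sketch does not handle turns whose parts all turn out to be thoughts.
function stripThoughtParts(history: Content[]): Content[] {
  return history.map((content) =>
    content.role === 'model'
      ? { ...content, parts: content.parts.filter((p) => !p.thought) }
      : content,
  );
}

function compressFastSketch(history: Content[], keep: number): Content[] {
  return stripThoughtParts(clearOldToolResults(history, keep));
}
```

Both steps are pure functions over the history array, so no network call is involved at any point.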

Why it's needed

/compress relies on an LLM side-query (~2-5s, ~30K tokens) to summarise history. For local model deployments and users who just want quick space reclamation, this is too slow. /compress-fast is entirely rule-based: no API call, no token cost, instant feedback. It complements /compress: use /compress-fast when you need space right now, and /compress when you want semantic summary quality.

Resolves #4264.

Reviewer Test Plan

How to verify

# Unit tests
npx vitest run \
  packages/core/src/services/microcompaction/microcompact.test.ts \
  packages/core/src/core/geminiChat.test.ts \
  packages/cli/src/ui/commands/compressFastCommand.test.ts \
  packages/cli/src/services/BuiltinCommandLoader.test.ts

Manual smoke test in interactive mode:

npm run build && npm run start
# 1. Ask the model to read files and use tools:
> Take a look at package.json and tsconfig.json for me
# 2. Run the fast compress:
/compress-fast
#    → COMPRESSION card shows before/after token counts
# 3. Run again immediately:
/compress-fast
#    → "No compression needed" (nothing left to clean)
# 4. Verify model still works:
> Which files did we just read?
#    → Model responds normally, no tool_use_id errors
# 5. Verify context preserved:
> What did we talk about before this message?
#    → Model remembers conversation structure (dialogue skeleton intact)
# 6. Verify existing /compress still works:
/compress
#    → LLM compression runs as before
# 7. Verify resume:
qwen-code --resume
#    → Session restores and model responds to follow-ups

Evidence (Before & After)

TUI change: a new COMPRESSION history item appears after running /compress-fast, showing the token reduction (e.g. 15,432 → 8,210). The UX is identical to that of /compress.

Non-UI artifacts: the JSONL transcript gains a chat_compression record with compressionStatus: COMPRESSED and triggerReason: manual, matching the /compress checkpoint format.

Tested on

OS Status
🍏 macOS
🪟 Windows ⚠️
🐧 Linux ⚠️

Environment (optional)

Local: npm run dev on macOS, Node 22.

Risk & Scope

  • Main risk or tradeoff: Stripping thinking blocks discards the model's internal reasoning. For very long tool-use chains where the model refers back to its own earlier reasoning, this could degrade answer quality. In practice text parts and tool results carry enough visible state; the original /compress is available if deeper summarization is needed.
  • Not validated / out of scope: Performance on extremely large histories (>100K tokens) — the token estimation is fast but estimateContentTokens may have edge cases. The command intentionally does NOT rebuild the session via startChat() — deferred tools survive, unlike a /clear. This is both a feature (fast, preserves state) and a limitation (does not reclaim system prompt tokens).
  • Breaking changes / migration notes: None. All changes are additive. microcompactHistory() gains an optional { force: true } parameter that existing callers don't pass. stripThoughtPartsFromContent remains module-private.

Linked Issues

Closes #4264

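Chinese description

What this PR does

Adds the /compress-fast slash command, which compresses conversation context without issuing any LLM side-query. It combines two rule-based steps: (1) forcing microcompaction to clear old tool results and media content while keeping the most recent N, and (2) stripping thought parts from all model replies. The result is a significantly smaller history token count at zero API latency.

A chat_compression checkpoint is written to JSONL, so --resume behaves exactly as it does after /compress.

Why it's needed

/compress relies on an LLM side-query to generate a summary (about 2-5 seconds, consuming about 30K tokens). That is too slow for local model deployments or for users who only want to free space quickly. /compress-fast is purely rule-driven: no API call, no token cost, instant response. It complements /compress: use /compress-fast when space must be freed immediately, and /compress when semantic summary quality matters.

Resolves #4264

Reviewer Test Plan

(Test steps are the same as above and are omitted here for readability.)

Risk and scope

  • Main risk and tradeoff: Stripping thinking discards the model's internal reasoning. In very long tool-call chains, the model may forget why it called a particular tool, which can affect answer quality. In practice, text content and tool results already carry enough visible state; /compress is available when deeper summarization is needed.
  • Not validated / out of scope: Performance on extremely large histories (>100K tokens): token estimation is fast, but estimateContentTokens may have edge cases. The command intentionally does not rebuild the session via startChat(), so deferred tools are preserved, unlike with /clear. This is both an advantage (fast, preserves state) and a limitation (system prompt tokens cannot be reclaimed).
  • Breaking changes / migration notes: None. All changes are additive. microcompactHistory() gains an optional { force: true } parameter that existing callers do not pass. stripThoughtPartsFromContent remains module-private.

Linked Issues

Closes #4264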

@tanzhenxin tanzhenxin added the type/feature-request New feature or enhancement request label Jun 9, 2026
Comment thread packages/core/src/core/geminiChat.ts Outdated
Comment thread packages/core/src/core/client.ts
Comment thread packages/core/src/core/client.ts Outdated
// Lightweight: setHistory() already called in compressFast().
// Reuse microcompaction's surgical FileReadCache disarm pattern.
const m = microcompactMeta;
const fileReadCache = this.config.getFileReadCache();

[Suggestion] This FileReadCache disarm block is a structural copy of the auto-compression path (lines 1557-1597). Two copies of the same ~20-line branching logic will drift over time. The new copy also drops two debug log messages present in the original (the success log for surgical disarm and the "unresolvable path" explanation before blanket clear), reducing observability in the /compress-fast path.

Consider extracting into a private method on GeminiClient:

private async disarmFileReadCacheAfterEviction(
  meta: MicrocompactMeta,
  logTag: string,
): Promise<void> {
  // shared disarm logic with debug logs
}

Then both call sites reduce to await this.disarmFileReadCacheAfterEviction(m, 'compress-fast') / await this.disarmFileReadCacheAfterEviction(m, 'microcompaction').

— qwen3.7-max via Qwen Code /review
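
One way the shared disarm logic could be structured is sketched below. Only `markReadEvictedFromHistory()` and `clear()` are named in this PR; their signatures, the `FileReadCacheLike` interface, the `resolveKey` callback (standing in for the stat-based path resolution), and the return values are all assumptions made for illustration.

```typescript
// Illustrative stand-in for the real cache; both method signatures are assumed.
interface FileReadCacheLike {
  markReadEvictedFromHistory(key: string): boolean;
  clear(): void;
}

// Hypothetical resolver: maps a path to a cache key, or undefined if unresolvable.
type ResolveKey = (path: string) => Promise<string | undefined>;

async function disarmAfterEviction(
  cache: FileReadCacheLike,
  evictedPaths: string[],
  resolveKey: ResolveKey,
  log: (msg: string) => void,
): Promise<'noop' | 'surgical' | 'cleared'> {
  if (evictedPaths.length === 0) return 'noop';

  // Resolve every path first so a late failure cannot leave a half-disarmed cache.
  const keys: string[] = [];
  for (const path of evictedPaths) {
    const key = await resolveKey(path);
    if (key === undefined) {
      log(`unresolvable path ${path}; clearing the whole cache`);
      cache.clear();
      return 'cleared';
    }
    keys.push(key);
  }

  for (const key of keys) cache.markReadEvictedFromHistory(key);
  log(`surgically disarmed ${keys.length} entries`);
  return 'surgical';
}
```

Passing the logger and resolver as parameters keeps both debug messages in one place, so every call site gets the same observability.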

expect(firstModel?.parts).toEqual([{ text: 'response text' }]);
});

it('NOOP when no tool calls and no thinking', () => {

[Suggestion] Two compressFast tests use weak assertions that cannot fail:

  1. This test accepts both NOOP and COMPRESSED — tautological since those are the only two possible CompressionStatus values. A bug that always returns either one would go undetected.

  2. The "updates lastPromptTokenCount on COMPRESSED" test gates its core assertion behind if (result.info.compressionStatus === COMPRESSED). If NOOP fires (possible given the small history), the test silently passes.

For test (1), set lastPromptTokenCount to a value that guarantees NOOP, then assert strictly:

expect(result.info.compressionStatus).toBe(CompressionStatus.NOOP);

For test (2), construct a history with substantial thinking parts and set lastPromptTokenCount high enough to guarantee COMPRESSED, then assert unconditionally.

— qwen3.7-max via Qwen Code /review
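
The point about tautological assertions can be shown without a test runner. The sketch below assumes a hypothetical two-value `Status` enum, as the comment describes, and a deliberately buggy function; none of these names come from the codebase.

```typescript
// Hypothetical two-value status, mirroring the reviewer's description.
enum Status {
  NOOP = 'NOOP',
  COMPRESSED = 'COMPRESSED',
}

// A deliberately buggy implementation that ignores its input.
function alwaysCompressed(_historyChars: number): Status {
  return Status.COMPRESSED;
}

// Weak check: true for every possible status, so it cannot catch the bug.
function weakCheck(s: Status): boolean {
  return s === Status.NOOP || s === Status.COMPRESSED;
}

// Strict check: the fixture (an empty history) is built so only NOOP is correct.
function strictCheck(s: Status): boolean {
  return s === Status.NOOP;
}

const buggyResult = alwaysCompressed(0);
console.log(weakCheck(buggyResult)); // true: the bug slips through
console.log(strictCheck(buggyResult)); // false: the bug is caught
```

The strict form only works when the fixture pins down a single correct outcome, which is why the comment asks for fixtures that guarantee NOOP or COMPRESSED.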

Comment thread packages/cli/src/ui/commands/compressFastCommand.ts
Comment thread packages/core/src/core/geminiChat.ts Outdated
});
this.setHistory(newHistory);
clearDetailedSpanState();
this.lastPromptTokenCount = afterTokens;

[Suggestion] afterTokens comes from estimateContentTokens() (char/4 heuristic, ~30% underestimate). Overwriting lastPromptTokenCount here replaces the API-authoritative count with a heuristic that persists until the next API call. This degrades the hard-rescue threshold gate and auto-compaction trigger — exactly the safety nets that exist to prevent context-overflow 400 errors.

The existing /compress path avoids this because tryCompress derives newTokenCount from the compression API response (authoritative), not a heuristic.

Suggested change
- this.lastPromptTokenCount = afterTokens;
+ // Don't overwrite API-authoritative count with char/4 heuristic.
+ // Next sendMessageStream API response will update it.
+ this.telemetryService?.setLastPromptTokenCount(afterTokens);

— qwen3.7-max via Qwen Code /review

@qwen-code-ci-bot qwen-code-ci-bot left a comment

Collaborator

No high-confidence issues found. All 224 tests pass, no lint/typecheck errors. LGTM! ✅ — qwen3.7-max via Qwen Code /review

@qwen-code-ci-bot qwen-code-ci-bot left a comment

Collaborator

No issues found. LGTM! ✅ — qwen3.7-max via Qwen Code /review

@qwen-code-ci-bot qwen-code-ci-bot left a comment

Collaborator

All previously raised issues have been addressed in this revision. The implementation is well-structured, well-tested, and follows existing conventions. LGTM! ✅ — qwen3.7-max via Qwen Code /review

@yiliang114

Collaborator

The main flow looks right to me — the earlier review rounds (FileReadCache disarm dedup, delta-based token adjustment, test assertions) are all properly addressed, and CI is green across the matrix. Two follow-ups worth tightening before merge:

  1. Telemetry: the /compress path emits a compression event via logChatCompression(makeChatCompressionEvent(...)) in chatCompressionService.ts, but compressFast() only records to chatRecordingService and updates lastPromptTokenCount. Once this merges we won't be able to see /compress-fast usage or savings in telemetry — and triggerReason: 'manual' exists exactly to distinguish this case. Could we emit the same event here?

  2. Docs: docs/users/features/commands.md lists /compress but not the new command. Worth adding a /compress-fast row so it's discoverable.

Two non-blocking nits:

  1. In compressFastCommand.ts, the NOOP check relies on originalTokenCount === newTokenCount; compressed.compressionStatus === CompressionStatus.NOOP is already in the return value and doesn't depend on the counts happening to be equal.

  2. 'Compressing context (fast)...' and the Context compressed (...) strings skip t() — same as the existing compressCommand, so fine to leave for a separate pass.


const doCompress = async () => await geminiClient.tryCompressChatFast();

if (executionMode === 'acp') {

[Suggestion] compressFastCommand does not read context.abortSignal or pass it through to tryCompressChatFast. The sibling compressCommand extracts abortSignal (line 34), passes it to tryCompressChat (line 83), and guards both the post-compression path (if (abortSignal?.aborted) { return; } at line 141) and the error path.

While /compress-fast is fast (no LLM call), the post-compression disarmFileReadCacheAfterEviction performs async fsPromises.stat() calls. If the user presses ESC during this window, the UI pending item stays visible and history mutations proceed despite cancellation intent.

Add abortSignal extraction and guards matching the /compress pattern:

Suggested change
- if (executionMode === 'acp') {
+ const { ui } = context;
+ const abortSignal = context.abortSignal;
+ const executionMode = context.executionMode ?? 'interactive';

Then add if (abortSignal?.aborted) { return; } after await doCompress() (line 95) and in the catch block.

— qwen3.7-max via Qwen Code /review
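
The guard pattern requested here might look like the following sketch. `AbortSignalLike`, `CommandContextLike`, `addHistoryItem`, and `runCompressFast` are minimal stand-ins invented for this example; the real command context and client types are not reproduced.

```typescript
// Minimal stand-ins; the real command context is richer.
interface AbortSignalLike {
  readonly aborted: boolean;
}
interface CommandContextLike {
  abortSignal?: AbortSignalLike;
  addHistoryItem(text: string): void;
}

async function runCompressFast(
  context: CommandContextLike,
  doCompress: () => Promise<{ before: number; after: number }>,
): Promise<void> {
  const abortSignal = context.abortSignal;
  try {
    const { before, after } = await doCompress();
    // The user may have cancelled while the async work was in flight.
    if (abortSignal?.aborted) return;
    context.addHistoryItem(`Context compressed (${before} -> ${after})`);
  } catch (err) {
    // After cancellation, errors are dropped instead of being surfaced.
    if (abortSignal?.aborted) return;
    context.addHistoryItem(`Compression failed: ${String(err)}`);
  }
}
```

Checking the signal after the `await`, and not only before it, is what covers the window in which the asynchronous stat calls run.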

@qwen-code-ci-bot qwen-code-ci-bot left a comment

Collaborator

No issues found. LGTM! ✅ — qwen3.7-max via Qwen Code /review

@yiliang114

yiliang114 commented Jun 11, 2026

Collaborator

Ran this end-to-end locally on the PR head (da6f728) before signing off — built the branch and drove the real TUI through a read → compress → resume cycle against a live model, not mocks.

Setup: built dist/cli.js from this branch, drove it in tmux through an OpenAI-compatible DashScope endpoint (qwen3-coder-plus), isolated HOME, workspace with 11 files for the model to read.

Unit tests (the PR's own suites): 397 passed — microcompact (37), client (173), compressFastCommand (7), BuiltinCommandLoader (11), geminiChat (169).

E2E in the real TUI:

  1. Read 3 + 8 files → 11 ReadFile tool results in history.
  2. /compress-fast: 30320 → 25869 tokens, instant, no side-query.
  3. /compress-fast again → "No compression needed" (NOOP correctly detected).
  4. Post-compress follow-up ("which files did we read?") → model lists all 11 correctly, no tool_use_id errors — the compressed history is still valid for the API.
  5. --resume → session restored with the compression checkpoint intact; a follow-up about beta.md answered correctly from the restored history.

JSONL checkpoint matches the /compress format, with triggerReason: "manual" recorded as intended:

{"info":{"originalTokenCount":30481,"newTokenCount":25901,"compressionStatus":1,"triggerReason":"manual"},"compressedHistory":[]}
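
For readers who want to inspect such checkpoints, the record quoted above can be parsed and summarised in a few lines. The `CompressionInfo` interface and the `summarize` helper are illustrative and simply mirror the fields visible in the record.

```typescript
// Field names are taken from the record above; the interface itself is illustrative.
interface CompressionInfo {
  originalTokenCount: number;
  newTokenCount: number;
  compressionStatus: number;
  triggerReason: string;
}

// The checkpoint line quoted above, verbatim.
const checkpointLine =
  '{"info":{"originalTokenCount":30481,"newTokenCount":25901,"compressionStatus":1,"triggerReason":"manual"},"compressedHistory":[]}';

// Returns the tokens saved and the saving as a percentage with one decimal place.
function summarize(jsonl: string): { saved: number; percent: number } {
  const { info } = JSON.parse(jsonl) as { info: CompressionInfo };
  const saved = info.originalTokenCount - info.newTokenCount;
  const percent = Math.round((saved / info.originalTokenCount) * 1000) / 10;
  return { saved, percent };
}

console.log(summarize(checkpointLine)); // { saved: 4580, percent: 15 }
```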

One honest caveat: qwen3-coder-plus emitted no thought parts in this run (thought count was 0 in the transcript), so all the savings came from the force-microcompaction (clear old tool results) step. The thought-stripping step was effectively a no-op here and is only covered by the unit test geminiChat > compressFast > strips thinking. Exercising that half live would need a reasoning model that emits thinking.

Code-wise this LGTM.

Real terminal output (live tmux run, API key redacted; prompts were issued in Chinese)
   ▄▄▄▄▄▄  ▄▄     ▄▄ ▄▄▄▄▄▄▄ ▄▄▄    ▄▄   ┌──────────────────────────────────────────────┐
  ██╔═══██╗██║    ██║██╔════╝████╗  ██║  │ >_ Qwen Code (v0.17.1)                       │
  ██║   ██║██║ █╗ ██║█████╗  ██╔██╗ ██║  │ API Key | qwen3-coder-plus (/model to change)│
  ╚██████╔╝╚███╔███╔╝███████╗██║ ╚████║  │ /private/tmp/qwen-pr4893-harness/workspace   │
   ╚══▀▀═╝  ╚══╝╚══╝ ╚══════╝╚═╝  ╚═══╝  └──────────────────────────────────────────────┘

  > [read alpha.md, beta.md, gamma.md and summarize each in one line]

  ✦ Reading the three files and summarizing each.

  ╭──────────────────────────────────────╮
  │ ✓  ReadFile alpha.md                 │
  ╰──────────────────────────────────────╯
  ╭──────────────────────────────────────╮
  │ ✓  ReadFile beta.md                  │
  ╰──────────────────────────────────────╯
  ╭──────────────────────────────────────╮
  │ ✓  ReadFile gamma.md                 │
  ╰──────────────────────────────────────╯

  ✦  - alpha.md: session bootstrap — reads config, resolves auth provider, hands a ready GeminiClient to the REPL.
     - beta.md: context-window accounting — tracks lastPromptTokenCount and decides when microcompaction fires.
     - gamma.md: persists the JSONL transcript so --resume can restore, and writes a checkpoint on compression.

  > [read doc1.txt .. doc8.txt one by one, then report file count and total lines]

  ✦ Reading doc1.txt through doc8.txt, then counting.

  ╭──────────────────────────────────────╮
  │ ✓  ReadFile doc1.txt                 │
  ╰──────────────────────────────────────╯
  …… (doc2 .. doc7 likewise ✓ ReadFile, elided) ……
  ╭──────────────────────────────────────╮
  │ ✓  ReadFile doc8.txt                 │
  ╰──────────────────────────────────────╯

  ✦ Read all 8 docs (doc1.txt .. doc8.txt). 8 files, 320 lines total.

  > /compress-fast

  ✦ Chat history compressed from 30320 tokens to 25869 tokens.      <- 1st run: applied, instant, zero API

  > /compress-fast

  ● No compression needed.                                          <- 2nd run: NOOP correctly detected

  > [after compression, which files did we read? names only]

  ✦  - alpha.md
     - beta.md
     - gamma.md
     - doc1.txt
     - doc2.txt
     - doc3.txt
     - doc4.txt
     - doc5.txt
     - doc6.txt
     - doc7.txt
     - doc8.txt                                                     <- skeleton intact, model still works

─────────────────────  $ qwen --resume  ─────────────────────

  To continue this session, run: qwen --resume 043290f4-8f6c-44f0-bee8-29bd80a79a99

╭──────────────────────────────────────────────────────────────────────────╮
│ Resume Session                                                             │
│ Press / to search                                                          │
│ › [read alpha.md, beta.md, gamma.md ...]   (just now)                      │
│   [read alpha.md, beta.md, gamma.md ...]   (9 minutes ago)                 │
│ Space to preview · ↑↓ to navigate · Type to search · Esc to cancel         │
╰──────────────────────────────────────────────────────────────────────────╯

  (selected -> session fully restored, incl. the prior compress / NOOP / file list)

  > [continue: what module is beta.md about? one line]

  ✦ beta.md is the context-window accounting module — tracks lastPromptTokenCount and decides when microcompaction fires.
                                                                    <- restored session answers correctly

@yiliang114

Collaborator

The red CI here is unrelated to your changes — the failing tests are yaml-parser.test.ts > known limitations (pin until js-yaml lands), which this PR doesn't touch. They were already removed on main in #4980, so your branch is just missing that cleanup. Could you rebase / merge latest main? That should clear the test matrix, and we're good to approve once it's green. Thanks!

俊良 and others added 3 commits June 11, 2026 19:22
…ompression

Adds /compress-fast, a new slash command that compresses context without
any LLM side-query. It combines two rule-based steps:

1. Force microcompaction — clears old tool results and media parts,
   keeping the most recent N (default 5, configurable via
   toolResultsNumToKeep). Uses a new { force: true } option on
   microcompactHistory() to skip the time-based trigger.

2. Strip thinking blocks — removes thought parts from all model turns,
   keeping text and tool_use parts intact.

Uses setHistory() for zero-latency history replacement (no session
rebuild, deferred tools survive). Writes a chat_compression checkpoint
to JSONL so --resume works identically to /compress.

Post-compression, tryCompressChatFast() surgically disarms affected
file paths from FileReadCache via markReadEvictedFromHistory(), falling
back to clear() only when paths can't be resolved.

Resolves QwenLM#4264.
- Add test coverage for tryCompressChatFast FileReadCache disarming
  (NOOP, clear, surgical disarm with inode miss, full success)
- Fix weak assertions in geminiChat compressFast tests:
  - NOOP test now strictly asserts CompressionStatus.NOOP
  - lastPromptTokenCount test guarantees COMPRESSED with larger history
- Register 'No compression needed.' i18n key in en/zh/zh-TW locales

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
- Fix token estimation: use same estimator (estimateContentTokens) on
  both sides of the NOOP gate, then delta-adjust API-authoritative
  lastPromptTokenCount instead of replacing it with char/4 heuristic
- Handle lastPromptTokenCount=0 fallback for fresh/continued sessions
- Extract duplicated FileReadCache disarm logic into shared
  disarmFileReadCacheAfterEviction() method with debug logs
- Remove redundant setLastPromptTokenCount call from tryCompressChatFast
- Update tests for delta-adjustment and zero-fallback behavior
- Add telemetry: emit logChatCompression event in compressFast() for usage tracking
- Add /compress-fast to docs/users/features/commands.md
- Use CompressionStatus.NOOP enum instead of token count comparison for NOOP detection
- Deduplicate disarm logic in microcompactIdleHistory to use shared disarmFileReadCacheAfterEviction method (resolves conflict with upstream QwenLM#4840)
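
The delta-adjustment idea in the commit notes above can be sketched as follows. The char/4 estimator follows the heuristic described in the review; `estimateTokens` and `adjustPromptTokenCount` are hypothetical names, and the zero-count fallback shown here is an assumption about what "fallback" means, not the PR's confirmed behaviour.

```typescript
// char/4 heuristic as described in the review; the real estimator may differ.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function adjustPromptTokenCount(
  lastPromptTokenCount: number, // API-reported count, or 0 if none exists yet
  beforeText: string,
  afterText: string,
): number {
  // Use the same estimator on both sides so its bias largely cancels in the delta.
  const before = estimateTokens(beforeText);
  const after = estimateTokens(afterText);
  // Assumed fallback: with no authoritative count, use the estimate directly.
  if (lastPromptTokenCount === 0) return after;
  // Otherwise shift the authoritative count by the estimated saving.
  return Math.max(0, lastPromptTokenCount - (before - after));
}

console.log(adjustPromptTokenCount(1000, 'a'.repeat(400), 'a'.repeat(200))); // 950
```

Subtracting a delta keeps the API-reported count as the baseline, so any systematic underestimate affects only the size of the saving and not the whole count.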

@qwen-code-ci-bot qwen-code-ci-bot left a comment

Collaborator

No issues found. LGTM! ✅ — qwen3.7-max via Qwen Code /review

@yiliang114 yiliang114 left a comment

Collaborator

No further issues from my pass. The latest head matches the reviewed revision, the focused local test suites passed, and CI is green. Thanks for the careful follow-ups.

@yiliang114 yiliang114 merged commit f00d145 into QwenLM:main Jun 12, 2026
31 of 33 checks passed

Labels

type/feature-request New feature or enhancement request

Development

Successfully merging this pull request may close these issues.

Feature Requrest: /compress-fast non-AI assisted context reduction

4 participants