feat(core)!: redesign auto-compaction thresholds with three-tier ladder#4168
feat(core)!: redesign auto-compaction thresholds with three-tier ladder#4168LaZzyMan wants to merge 14 commits into
Conversation
📋 Review SummaryThis PR redesigns qwen-code's auto-compaction threshold system from a single 70% proportional threshold to a three-tier ladder (warn/auto/hard) combining proportional fallback with absolute reservation. The implementation aligns with claude-code's design, recovers significant wasted context on large windows (~267K on 1M models), and bundles several related improvements: 3-strike failure circuit breaker, local token estimation for accurate threshold gating, and predictable output budget control via 🔍 General Feedback
🎯 Specific Feedback🟡 High Priority Issues
🟢 Medium Priority Issues
🔵 Low Priority Suggestions
✅ Highlights
|
1dcef8c to
d270af0
Compare
Adds a defensive guard in ChatCompressionService.compress() that detects when the side-query summary hit COMPACT_MAX_OUTPUT_TOKENS (20K). In that case the summary is likely truncated mid-content, so we drop it and return NOOP rather than persist a half-summary. The next send re-tries; reactive overflow still catches the catastrophic case where the API rejects the next request as too large. Documented in the design doc as risk #2; the bot reviewer on PR #4168 correctly pushed for it to land alongside the threshold redesign rather than as a follow-up since the new 20K cap is what makes truncation likely in the first place. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Review response — commit
|
| # | Outcome | Notes |
|---|---|---|
🟡 H1 — MAX_TOKENS guard |
✅ Fixed | Added a defensive check in compress() that NOOPs when compressionOutputTokenCount >= COMPACT_MAX_OUTPUT_TOKENS so a truncated summary isn't persisted. Includes a unit test asserting the path. Commit 6ce81e73c. |
🟡 H2 — estimateContentTokens footgun |
❌ Declined (false-positive) | The "MUST pass precomputedCharCounts" warning the comment references is on findCompressSplitPoint, not estimateContentTokens. The latter's imageTokenEstimate parameter has a benign default (DEFAULT_IMAGE_TOKEN_ESTIMATE = 1600) that matches the splitter's default, keeping the two estimators in sync. Different functions, different contracts. |
🟢 M1 — Hide TOOL_ROUND_RETAIN_COUNT |
❌ Declined (out of PR scope) | TOOL_ROUND_RETAIN_COUNT was already exported in chatCompressionService.ts before this PR. Reducing export visibility is a separate cleanup that I don't want to bundle into a threshold-redesign change. |
🟢 M2 — contextCommand.ts:177-183 still references the deprecated field |
❌ Declined (stale read) | This was actually rewritten by Task 11 of the redesign — contextCommand.ts now imports computeThresholds (line 28) and uses computeThresholds(contextWindowSize) (line 190). Grep contextPercentageThreshold in packages/cli/src/ui/commands/contextCommand.ts is empty. |
🟢 M3 — consecutiveFailures across --continue |
❌ Declined (works as intended) | consecutiveFailures is a private field on GeminiChat initialized to 0. --continue constructs a fresh GeminiChat (history is restored separately), so the counter naturally resets — which is the correct semantics (a restarted session should get a fresh 3-strike budget rather than inheriting a latched breaker from a previous run). |
🔵 L1 — JSDoc @example with threshold table |
❌ Declined (filter 3) | The same table lives in docs/design/auto-compaction-threshold-redesign.md (committed in this PR). Duplicating it in JSDoc creates two sources of truth that can drift independently when the constants are tuned. |
🔵 L2 — Missing BYTES_PER_TOKEN_JSON = 2 |
❌ Declined (not in code) | BYTES_PER_TOKEN_JSON doesn't exist in tokenEstimation.ts. The design doc only mentions it as a future possibility for json-dense content; the implementation deliberately uses a single BYTES_PER_TOKEN = 4 ratio (matching claude-code's approach). |
| 🔵 L3 / L4 — Verify files in PR | N/A | Both packages/cli/src/services/tips/tipRegistry.ts and packages/core/src/index.ts are in this PR (commit 28eb867a8); please re-check the changed-files view. |
Net: 1 fix accepted, 5 declined, 2 hallucinations dismissed. Force-pushed earlier (rebase onto main + the consecutiveFailures test fixup d270af030); this comment lands on top of the new merge-conflict-free branch tip 6ce81e73c.
🤖 Drafted with Claude Code using the review-response skill.
Code Coverage Summary
CLI Package - Full Text ReportCore Package - Full Text ReportFor detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run. |
The Task 11 redesign updated the non-interactive text formatter
(formatContextUsageText) but left ContextUsage.tsx — the interactive
React component that real /context users see — unchanged. As a result
the TUI still showed the old single "Autocompact buffer" line and none
of the new warn/auto/hard ladder.
Adds a "Compaction thresholds" section after the per-category breakdown:
- Effective window
- Warn / Auto / Hard threshold rows with a ▶ marker on the row the
current usage has crossed
- Current tier label coloured by severity (safe→green, warn/auto→
yellow, hard→red)
The existing progress bar legend (Used / Free / Autocompact buffer)
is preserved because it's tied to the three-segment progress bar
visualisation; the new section adds the absolute numbers + tier badge
on top of that.
Caught by the tmux e2e test (PR #4168 ci-monitor follow-up). Pre-fix
the assertion 'Compaction thresholds' missed completely from the TUI;
post-fix the new section renders correctly for fresh and live sessions
on 1M / 200K / 128K windows.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
E2E 测试报告跟进 review/CI 反馈,针对核心功能补做了多轮 E2E 验证。最终覆盖矩阵: ✅ 真实模型 E2E(最重要)配置:
实测 97.7% 上下文 reduction,所有 6 项断言通过:
✅ TUI
|
| ); | ||
| return { | ||
| newHistory: null, | ||
| info: { |
There was a problem hiding this comment.
[Suggestion] Truncated summary returns NOOP — circuit breaker never trips
When compression output hits the 20K COMPACT_MAX_OUTPUT_TOKENS cap, compress() returns CompressionStatus.NOOP. Since isCompressionFailureStatus() does not match NOOP, consecutiveFailures is never incremented. If the model consistently produces max-length summaries, every subsequent send wastes an API call on a compaction attempt that will be dropped.
Consider treating the MAX_TOKENS truncation as a recoverable failure (increment the counter without locking) so the breaker can trip after repeated occurrences:
| info: { | |
| config | |
| .getDebugLogger() | |
| .warn( | |
| `[chat-compression] summary output reached the ` + | |
| `COMPACT_MAX_OUTPUT_TOKENS cap (${COMPACT_MAX_OUTPUT_TOKENS}); ` + | |
| `dropping potentially-truncated result.`, | |
| ); | |
| return { | |
| newHistory: null, | |
| info: { | |
| originalTokenCount, | |
| newTokenCount: originalTokenCount, | |
| compressionStatus: CompressionStatus.FAILURE, | |
| }, | |
| }; |
— glm-5.1 via Qwen Code /review
There was a problem hiding this comment.
✅ Fixed in 181393c — MAX_TOKENS truncation guard now returns COMPRESSION_FAILED_EMPTY_SUMMARY (a failure status), so the consecutive-failure breaker ticks on repeated truncations instead of wasting an API call per send. Unit test updated to assert the new status.
| // suppress further auto-compaction since the chat clearly | ||
| // can't shrink — trip the breaker to its NOOP threshold so | ||
| // subsequent unforced sends short-circuit at the cheap-gate. | ||
| self.consecutiveFailures = MAX_CONSECUTIVE_FAILURES; |
There was a problem hiding this comment.
[Suggestion] Reactive failure permanently disables auto-compaction on a single transient error
self.consecutiveFailures = MAX_CONSECUTIVE_FAILURES directly sets the counter to the maximum rather than incrementing. A single transient network error during reactive compression permanently disables auto-compaction until the next hard-threshold crossing resets it. While the comment explains the rationale (forced compression already failed), consider incrementing instead so that N distinct failures are required:
| self.consecutiveFailures = MAX_CONSECUTIVE_FAILURES; | |
| self.consecutiveFailures += 1; |
Or, if the current behavior is intentional, add a short comment noting that hard-tier rescue is the designated recovery path.
— glm-5.1 via Qwen Code /review
There was a problem hiding this comment.
✅ Fixed in 181393c — reactive failure now does self.consecutiveFailures += 1 instead of = MAX. Comment notes hard-tier rescue as the designated recovery path. Test marks failed reactive compression attempts... updated to assert counter == 1 (not MAX).
| DEFAULT_TOKEN_LIMIT; | ||
| const { hard } = computeThresholds(contextLimit); | ||
| const effectiveTokens = estimatePromptTokens( | ||
| this.getHistory(true), |
There was a problem hiding this comment.
[Suggestion] Redundant deep clones + token estimation on every send
The hard-tier rescue calls estimatePromptTokens(this.getHistory(true), ...) (deep clone + full history walk). Then tryCompress → cheap-gate calls estimatePromptTokens(chat.getHistory(true), ...) again (second deep clone + walk). If the cheap-gate passes, compress() calls chat.getHistory(true) a third time.
Every sendMessageStream pays for 2–3 full-history clones and 2 estimation traversals even when no compaction is needed. Consider computing the effective token count once here and passing it into tryCompress as a pre-computed value:
| this.getHistory(true), | |
| const effectiveTokens = estimatePromptTokens( | |
| this.getHistory(true), | |
| userContent, | |
| this.lastPromptTokenCount, | |
| resolveSlimmingConfig(chatCompressionSettings).imageTokenEstimate, | |
| ); |
This also fixes a minor inconsistency: this call uses the default imageTokenEstimate (1600) while the cheap-gate inside tryCompress uses the user's configured value.
— glm-5.1 via Qwen Code /review
There was a problem hiding this comment.
✅ Fixed in 181393c — sendMessageStream now computes effectiveTokens once and threads it through TryCompressOptions.precomputedEffectiveTokens; service.compress skips its own estimation pass when supplied. Also uses resolveSlimmingConfig(chatCompressionSettings).imageTokenEstimate so the rescue and cheap-gate paths see the same value. Steady-state path (count>0) skips the costly getHistory(true) clone since estimatePromptTokens only needs the user message in that branch — drops the per-send clone count from 2–3 to 1.
| const pendingUserMessage = opts.pendingUserMessage; | ||
| const effectiveTokens = pendingUserMessage | ||
| ? estimatePromptTokens( | ||
| chat.getHistory(true), |
There was a problem hiding this comment.
[Suggestion] computeThresholds() called twice per send + redundant getHistory(true)
computeThresholds(contextLimit) is called here in the cheap-gate, and again in geminiChat.ts hard-tier rescue for the same contextLimit. Similarly, estimatePromptTokens(chat.getHistory(true), ...) clones the full history even when originalTokenCount > 0 (only estimateContentTokens([userMessage]) is needed in that case — the history is unused).
Consider:
- Moving
computeThresholdsinto a lazy field on the service (or passing the result from the caller) - Guarding the
getHistory(true)call behind theoriginalTokenCount === 0branch to avoid the wasted clone
| chat.getHistory(true), | |
| const thresholds = computeThresholds(contextLimit); | |
| const effectiveTokens = | |
| originalTokenCount > 0 | |
| ? estimateContentTokens([pendingUserMessage ?? []]) | |
| : estimatePromptTokens( | |
| chat.getHistory(true), | |
| pendingUserMessage, | |
| 0, | |
| slimmingConfig.imageTokenEstimate, | |
| ); |
— glm-5.1 via Qwen Code /review
There was a problem hiding this comment.
✅ Fixed in 181393c alongside the geminiChat-side change — precomputedEffectiveTokens lets sendMessageStream pass the value computed there directly, so the service's cheap-gate no longer recomputes computeThresholds + getHistory(true) + estimatePromptTokens. The steady-state branch (lastPromptTokenCount > 0) also skips history cloning entirely now.
| id: 'context-critical', | ||
| content: | ||
| 'Context is almost full! Run /compress now or start /new to continue.', | ||
| 'Context near hard limit — auto-compact will force on next send. Consider /clear if you want to start fresh.', |
There was a problem hiding this comment.
[Suggestion] Dead code: getContextUsagePercent has zero callers
The three context-* tips were all rewired to use ctx.thresholds directly (as visible here — the getContextUsagePercent(ctx) >= 95 call on the old L39 is replaced by ctx.thresholds.hard). However, getContextUsagePercent itself (defined at L41) is no longer called anywhere in the codebase but remains exported from ./index.ts. Consider removing it.
— DeepSeek/deepseek-v4-pro via Qwen Code /review
There was a problem hiding this comment.
✅ Fixed in 181393c — getContextUsagePercent and its index.ts re-export removed. Confirmed no callers in source (grep clean across packages/cli/src and packages/core/src).
| * `breakdown.thresholds` + `breakdown.currentTier`, which the context command | ||
| * derives from `computeThresholds()` in core. | ||
| */ | ||
| const CompactionThresholds: React.FC<{ |
There was a problem hiding this comment.
[Suggestion] New CompactionThresholds/ThresholdRow components have no tests
The new ~60 lines of JSX (ThresholdRow at L149, CompactionThresholds at L202, tierColor switch at ~L158) that render the three-tier ladder with color-coded tier labels and isCurrent arrow indicators are untested. No test file exists for ContextUsage.tsx, and this PR adds none. While the underlying data calculations are covered in contextCommand.test.ts, the rendering behavior (tier color mapping, arrow positioning, conditional visibility) is not verified in CI.
— DeepSeek/deepseek-v4-pro via Qwen Code /review
There was a problem hiding this comment.
✅ Partially fixed in 181393c — added packages/cli/src/ui/components/views/ContextUsage.test.tsx with 4 ink-testing-library cases covering the new Compaction-thresholds section: header + 4 threshold rows render, ▶ marker placement per current tier (safe/warn/hard), and the colored tier label. Skipped a true snapshot because the precise frame layout drifts with terminal width — string-match assertions on labels/marker presence are more durable.
wenshao
left a comment
There was a problem hiding this comment.
Two additional observations that don't map to specific diff lines:
-
Missing test:
contextPercentageThresholdsetting is silently ignored. Theconfig.tsdeprecation warns on stderr, but no test verifies the behavioral change — that settingcontextPercentageThreshold: 0(which previously disabled auto-compaction) now has no effect. Consider adding a test pinning this. -
tierTokens = 0whenisEstimatedshows misleading "safe" tier. IncontextCommand.ts, when no API data exists yet (first render,--continue),tierTokens = 0makescurrentTieralways'safe'even for large inherited history. Consider using estimated overhead as the tier input whenisEstimated.
— mimo-v2.5-pro via Qwen Code /review
| * Average bytes-per-token for char-based token estimation. | ||
| * Matches claude-code's roughTokenCountEstimation default (tokens.ts). | ||
| */ | ||
| export const BYTES_PER_TOKEN = 4; |
There was a problem hiding this comment.
[Suggestion] BYTES_PER_TOKEN is a misleading name — the value divides character counts (from estimateContentChars → string.length), not byte counts. For CJK text (3 bytes/char UTF-8), the name actively misleads. The adjacent module compactionInputSlimming.ts correctly names the identical ratio TOKEN_TO_CHAR_RATIO = 4.
| export const BYTES_PER_TOKEN = 4; | |
| /** | |
| * Average characters-per-token for char-based token estimation. | |
| * Matches the inverse of TOKEN_TO_CHAR_RATIO in compactionInputSlimming.ts. | |
| */ | |
| export const CHARS_PER_TOKEN = 4; |
Then update the two usages (Math.ceil(totalChars / CHARS_PER_TOKEN) at line 39 and the return at line 66).
— mimo-v2.5-pro via Qwen Code /review
There was a problem hiding this comment.
✅ Fixed in 181393c — renamed BYTES_PER_TOKEN → CHARS_PER_TOKEN (the inputs are character counts from string.length, not byte counts; old name misleads on CJK). Doc updated to reference TOKEN_TO_CHAR_RATIO in compactionInputSlimming.ts as the inverse.
| if ( | ||
| !isSummaryEmpty && | ||
| typeof compressionOutputTokenCount === 'number' && | ||
| compressionOutputTokenCount >= COMPACT_MAX_OUTPUT_TOKENS |
There was a problem hiding this comment.
[Suggestion] Off-by-one in truncation guard: >= COMPACT_MAX_OUTPUT_TOKENS rejects summaries that are exactly 20,000 tokens. Since maxOutputTokens is set to the same value, a model that produces a valid summary at exactly the cap limit gets dropped. The > operator is more appropriate — it catches outputs that exceeded the cap (impossible with the API budget, but defensive), while allowing outputs that landed exactly at the limit.
| compressionOutputTokenCount >= COMPACT_MAX_OUTPUT_TOKENS | |
| compressionOutputTokenCount > COMPACT_MAX_OUTPUT_TOKENS |
Note: the dropped NOOP does NOT trip the circuit breaker (NOOP is not a failure status in isCompressionFailureStatus), so the worst case is a retry loop rather than permanent disablement. Still worth fixing to avoid unnecessary retry cycles.
— mimo-v2.5-pro via Qwen Code /review
There was a problem hiding this comment.
❌ Declined — the >= operator is intentionally conservative: with maxOutputTokens=20K, p99.99 of real summaries is ~17K (per claude-code data), so landing exactly at the cap is far more likely truncation than a clean stop. > would make the guard effectively dead (the API can't return more than the cap). With R1.1 now in place (this batch), persistent truncation trips the breaker after MAX_CONSECUTIVE_FAILURES, so the worst-case wasted-retry cost is bounded. The proper long-term fix is plumbing finish_reason through runSideQuery, which is out of scope here.
| estimateContentTokens([userMessage], imageTokenEstimate) | ||
| ); | ||
| } | ||
| return estimateContentTokens([...history, userMessage], imageTokenEstimate); |
There was a problem hiding this comment.
[Suggestion] When lastPromptTokenCount === 0 (first send after --continue or inherited history), the fallback estimates only history + userMessage. It misses system prompt (~8-15K tokens), tool definitions (~5K), skill content, and cached content. This underestimates by ~15-20K tokens, which could cause the hard-tier rescue to not fire when it should.
The docstring correctly warns "using it to SKIP compaction is not [safe]" and the reactive overflow is the safety net, but closing the gap would improve first-send behavior. Consider adding a configurable or estimated overhead baseline:
| return estimateContentTokens([...history, userMessage], imageTokenEstimate); | |
| // Fallback: estimate from history + user message. Note this underestimates | |
| // by ~15-20K tokens (system prompt, tool definitions, skills) — the reactive | |
| // overflow handler is the safety net if the hard-tier rescue misses. | |
| return estimateContentTokens([...history, userMessage], imageTokenEstimate); |
— mimo-v2.5-pro via Qwen Code /review
There was a problem hiding this comment.
✅ Partially fixed in 181393c — added an explanatory comment on the fallback branch documenting the ~15-20K under-estimate (system prompt + tool definitions + skills + cache headers) and that reactive overflow is the safety net. Skipped adding a magic overhead constant because the actual overhead is per-config (depends on tools loaded, skills active) and a fixed value would be a different kind of guess.
| }); | ||
|
|
||
| it('clears hasFailedCompressionAttempt after a forced successful compression', async () => { | ||
| it('forwards the pending user message to the compression cheap-gate', async () => { |
There was a problem hiding this comment.
[Suggestion] This test mocks compress entirely, so the real estimatePromptTokens(history, pendingUserMessage, 0) full-history fallback path is never exercised end-to-end. The "first send after --continue / sub-agent inherited history" scenario — where lastPromptTokenCount === 0 and only the full-history estimate can cross the auto threshold — has no integration test.
Consider adding a test that uses the real ChatCompressionService (not a mock) with a GeminiChat seeded with large inherited history and lastPromptTokenCount = 0, then asserts that sendMessageStream triggers compaction.
— mimo-v2.5-pro via Qwen Code /review
There was a problem hiding this comment.
✅ Fixed in 181393c — added triggers compaction end-to-end through the real ChatCompressionService when lastPromptTokenCount === 0 and inherited history is large to geminiChat.test.ts. Uses the real service (no vi.spyOn on compress.prototype), seeds 400K chars of history, sets lastPromptTokenCount=0, mocks runSideQuery at the baseLlmClient layer, and asserts the stream emits a COMPRESSED event.
| expect(estimateContentTokens([c], 1600)).toBe(1600); | ||
| }); | ||
|
|
||
| it('estimates functionCall (json-dense) contributes some positive count', () => { |
There was a problem hiding this comment.
[Suggestion] estimateContentTokens has a test for functionCall but no test for functionResponse, which has a distinct branch in estimateContentChars (nested parts walk, 64-char floor for wrapper metadata). Tool-heavy conversations are the exact scenario where context grows fastest.
| it('estimates functionCall (json-dense) contributes some positive count', () => { | |
| it('estimates functionResponse (json-dense) contributes some positive count', () => { | |
| const c: Content = { | |
| role: 'user', | |
| parts: [{ functionResponse: { name: 'tool', response: { result: 'data'.repeat(100) } } }], | |
| }; | |
| const result = estimateContentTokens([c]); | |
| expect(result).toBeGreaterThan(0); | |
| }); | |
| it('estimates functionCall (json-dense) contributes some positive count', () => { |
— mimo-v2.5-pro via Qwen Code /review
There was a problem hiding this comment.
✅ Fixed in 181393c — added estimates functionResponse (nested parts) contributes some positive count to tokenEstimation.test.ts. Tool-heavy conversations were the gap.
Adds a defensive guard in ChatCompressionService.compress() that detects when the side-query summary hit COMPACT_MAX_OUTPUT_TOKENS (20K). In that case the summary is likely truncated mid-content, so we drop it and return NOOP rather than persist a half-summary. The next send re-tries; reactive overflow still catches the catastrophic case where the API rejects the next request as too large. Documented in the design doc as risk #2; the bot reviewer on PR #4168 correctly pushed for it to land alongside the threshold redesign rather than as a follow-up since the new 20K cap is what makes truncation likely in the first place. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Task 11 redesign updated the non-interactive text formatter
(formatContextUsageText) but left ContextUsage.tsx — the interactive
React component that real /context users see — unchanged. As a result
the TUI still showed the old single "Autocompact buffer" line and none
of the new warn/auto/hard ladder.
Adds a "Compaction thresholds" section after the per-category breakdown:
- Effective window
- Warn / Auto / Hard threshold rows with a ▶ marker on the row the
current usage has crossed
- Current tier label coloured by severity (safe→green, warn/auto→
yellow, hard→red)
The existing progress bar legend (Used / Free / Autocompact buffer)
is preserved because it's tied to the three-segment progress bar
visualisation; the new section adds the absolute numbers + tier badge
on top of that.
Caught by the tmux e2e test (PR #4168 ci-monitor follow-up). Pre-fix
the assertion 'Compaction thresholds' missed completely from the TUI;
post-fix the new section renders correctly for fresh and live sessions
on 1M / 200K / 128K windows.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Behavior fixes: - MAX_TOKENS truncation guard now returns COMPRESSION_FAILED_EMPTY_SUMMARY instead of NOOP so the consecutive-failure breaker actually trips after repeated max-length summaries (R1.1). - Reactive overflow failure increments consecutiveFailures by 1 instead of latching to MAX in one shot, so a transient network blip doesn't permanently disable auto-compaction. The hard-tier rescue resets the counter, which remains the designated recovery path (R1.2). - /context current-tier classification uses rawOverhead (system + tools + memory + skills) as the tier input when API data is not yet available, rather than 0 — large inherited contexts no longer silently show 'safe' (R2.2). Performance: - sendMessageStream computes effectiveTokens ONCE and passes it through TryCompressOptions.precomputedEffectiveTokens, so the cheap-gate inside service.compress doesn't redo the estimation. Also fixes the imageTokenEstimate inconsistency between the rescue and cheap-gate paths (R1.3 + R1.4). - Steady-state path (lastPromptTokenCount > 0) skips the costly getHistory(true) clone — estimatePromptTokens only needs the user message in that branch. Code hygiene: - BYTES_PER_TOKEN → CHARS_PER_TOKEN (inputs are char counts, not byte counts; CJK text would mislead under the old name) (R3.1). - Drop dead getContextUsagePercent helper + index re-export — no callers in source after the threshold rewire (R1.5). - Add a comment on estimatePromptTokens' first-send fallback documenting the ~15-20K under-estimate (system prompt + tools + skills) and that reactive overflow is the safety net (R3.3). Tests: - New CLI ContextUsage.test.tsx exercises the React renderer for the three-tier section: section presence, ▶ marker placement per tier, current-tier label coloring (R1.6). - New chatCompressionService.test.ts case pins that a stale contextPercentageThreshold: 0 value in user settings no longer short-circuits compaction (R2.1). - New tokenEstimation.test.ts case covers functionResponse (distinct nested-parts branch from functionCall) (R3.5). - New geminiChat.test.ts integration test exercises the real ChatCompressionService — not a mock — for the first-send-after- inherited-history scenario where lastPromptTokenCount=0 and only the full-history estimate can cross the auto threshold (R3.4). Declined: R3.2 (change `>=` to `>` on the MAX_TOKENS guard). The current operator catches the at-cap case as suspicious, which is intentional — landing exactly at the output cap is far more likely truncation than clean stop given p99.99 ≈ 17K. With R1.1 in place, persistent truncations trip the breaker after MAX_CONSECUTIVE_FAILURES so the worst case is bounded. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3786355 to
181393c
Compare
Review batch 4 — commit
|
| # | Outcome | One-line |
|---|---|---|
| R1.1 truncation→NOOP doesn't trip breaker | ✅ Fixed | guard returns COMPRESSION_FAILED_EMPTY_SUMMARY so counter ticks |
| R1.2 reactive failure latches breaker | ✅ Fixed | consecutiveFailures += 1 instead of = MAX |
| R1.3 redundant deep clones + double estimation | ✅ Fixed | precomputedEffectiveTokens threaded through opts; steady-state path skips getHistory(true) |
R1.4 computeThresholds × 2 + redundant getHistory |
✅ Fixed | same change as R1.3 + resolved imageTokenEstimate consistency |
R1.5 getContextUsagePercent dead code |
✅ Fixed | removed, no callers in source |
| R1.6 new TUI components untested | ✅ Partially | added ContextUsage.test.tsx with 4 ink-testing-library cases (header, tier markers, color) |
| R2.1 missing test for deprecated field ignored | ✅ Fixed | added unit test asserting contextPercentageThreshold: 0 no longer disables compaction |
| R2.2 isEstimated → misleading 'safe' | ✅ Fixed | tier classification uses rawOverhead (not 0) when API data absent |
R3.1 BYTES_PER_TOKEN misleading on CJK |
✅ Fixed | renamed to CHARS_PER_TOKEN; doc updated |
R3.2 >= should be > |
❌ Declined | guard intentionally conservative — at-cap is more often truncation than clean stop; R1.1 bounds worst case |
| R3.3 first-send fallback underestimates | ✅ Partially | added comment documenting the ~15-20K under-estimate + reactive safety net; declined the magic-overhead constant |
| R3.4 first-send-after-continue not integration-tested | ✅ Fixed | added real-service integration test in geminiChat.test.ts |
R3.5 functionResponse not tested |
✅ Fixed | added estimateContentTokens test for nested-parts branch |
Verification
- Affected unit/integration tests: 172/172 pass (
chatCompressionService.test.ts68 ·geminiChat.test.ts96 ·tokenEstimation.test.ts8) - New
ContextUsage.test.tsx: 4/4 pass npm run typecheckclean across all workspacesnpm run lintclean (the 11 errors on.claude/skills/e2e-testing/scripts/*.jsare pre-existing and unrelated to this PR)- Rebased onto current main (
1c529e4f0); 3 conflict files (geminiChat.ts, chatCompressionService.ts/test.ts) resolved to merge main'sbypassTokenThresholdheap-pressure path with this PR'sconsecutiveFailuresbreaker — both mechanisms now coexist (heap-pressure bypass overrides the breaker carve-out)
11 inline threads have per-thread gh api replies pointing at this commit. Ready for next pass.
🤖 Triaged via the review-response skill.
wenshao
left a comment
There was a problem hiding this comment.
总评
Reviewed all 22 changed files focusing on chatCompressionService.ts / geminiChat.ts / tokenEstimation.ts / contextCommand.ts / tipRegistry.ts / ContextUsage.tsx 及对应测试与文档。
质量高,可以合并。 computeThresholds 是纯函数 + 覆盖 32K / 64K / 128K / 200K / 1M / 10K-极端 六个边界窗口的单元用例,数学清晰且可证;max(proportional, absolute) 的组合让小窗口自动降级到比例分支、大窗口完全由绝对分支主导;hard = max(rawHard, auto) 的 collapse-to-auto 兜住了小窗口下 hard < auto 的逻辑错。estimatePromptTokens 注释非常诚实地写明了首发 fallback 的 ~15-20K 偏低与 reactive overflow 作为 safety net 的角色,设计-代码-注释一致。steady-state 上 lastPromptTokenCount > 0 时传 [] 给 estimatePromptTokens 跳过 getHistory(true) clone 是合理的 hot-path 优化。
最赞的是 R3.4 的端到端集成测试 —— 真 sendMessageStream → tryCompress → 真 ChatCompressionService → 真 cheap-gate → splitter → mocked baseLlmClient → persistence 整条链走通,直接覆盖了历史上最容易出 bug 的 lastPromptTokenCount === 0 分支。
发现(以 inline 评论标注)
- 建议:
contextCommand.ts:313的 tier 分类用rawOverhead,不包含 messagesTokens;一个--continue恢复了 100K 历史消息的 session 仍会在/context显示 "safe",但下一条 send 立刻在 cheap-gate 触发压缩 —— UI 与 runtime 判断不一致。建议要么从 chat 取 history 复用estimatePromptTokens,要么收紧注释 scope。 - 建议:
chatCompressionService.ts:553的>= COMPACT_MAX_OUTPUT_TOKENS截断判据是 heuristic,正好 20K 的合法 summary 会被误判为截断;同时该路径复用COMPRESSION_FAILED_EMPTY_SUMMARY会让 telemetry 分不清 prompt 质量问题 vs 容量问题。建议:挂TODO(finish_reason)+ 加COMPRESSION_FAILED_OUTPUT_TRUNCATED子状态。 - 建议:
geminiChat.ts:752hard-rescue 在tryCompress(force=true)前 reset counter,而 force=true 路径在 service 失败分支又 skip 自增 —— 结果 hard-rescue 失败永远不累加 counter,只能靠 reactive overflow 兜底。语义合理但字段命名误导,在consecutiveFailures字段 JSDoc 加一行说明。 - 应修:
CompressOptions.hasFailedCompressionAttempt: boolean→consecutiveFailures: number是 SDK breaking change,PR description / release notes 里目前没列出。
风险审计声明
我反向审计了:
- hard-rescue 失败的无限循环可能 —— ✅ 不会,失败后 API 仍发出,reactive overflow 接管;
COMPACT_MAX_OUTPUT_TOKENS = 20K永久压不下的可能 ——⚠️ 三次 false-positive 截断会熔断,见发现 #2;- 跨 provider 行为一致性 —— ✅ thinking 关 + maxOutputTokens 钉死把不确定性最小化(Anthropic thinking budget / OpenAI reasoning tokens / Gemini 模型差异都被规避);
consecutiveFailures在 force / heap-pressure / reactive 各路径的累计边界 —— ✅ 看了 R1.2 / R1.4 注释,逻辑闭环,但需要发现 #3 的注释补充;/contextUI 与 cheap-gate 的同源性 ——⚠️ 发现 #1;- 兼容性 ——
⚠️ 发现 #4。
docs/plans/2026-05-14-auto-compaction-threshold-redesign.md 1752 行在 PR 里也注意到了 —— 风格层面,plan doc 在 tree 里的体量偏大,但因为不进运行时,不阻断合并。
| // should not silently show "safe" just because the API hasn't been hit. | ||
| // The estimate is a lower bound (excludes message body until first turn) | ||
| // so the tier may under-classify, but never over-classifies. (R2.2) | ||
| const tierTokens = isEstimated ? rawOverhead : apiTotalTokens; |
There was a problem hiding this comment.
/context 估算分支可能仍误报 "safe":这条注释说的目标是 "restored session with 800K of inherited history should not silently show safe",但 rawOverhead 只覆盖 system prompt / builtin tools / mcp tools / memory / skills,不包含 messagesTokens。
实际后果:一个 --continue 恢复了 100K 历史消息(不是 memory/skills overhead)的 session,首发前跑 /context 仍然显示 safe,但下一条 send 立刻在 cheap-gate 触发压缩 —— UI 与 runtime 的判断不一致。
cheap-gate 在 chatCompressionService.ts 用 estimatePromptTokens(history, pendingUserMessage, lastPromptTokenCount=0) 才能正确捕获,这里用的 rawOverhead 是完全不同的数据源。
建议二选一:
- (a) 把 chat history 透传给
collectContextData,在估算分支调用estimatePromptTokens(history, undefined, 0, imageTokenEstimate),与 cheap-gate 同源; - (b) 收紧注释,明说 "这条 fix 只覆盖 overhead-heavy 场景(memory / skills / MCP),message-heavy 场景(
--continue恢复)仍要等首发 cheap-gate 才会被纠正"。
R3.4 的端到端测试也佐证了这点:那个用例用 chat.setHistory([400K 字符的 user 消息, ...]) + lastPromptTokenCount === 0,cheap-gate 能正确触发压缩 —— 但同一时刻 /context 会显示 safe。
There was a problem hiding this comment.
✅ Partially fixed in 681b705 — went with reviewer's option (b): tightened the inline comment to scope the fix to overhead-heavy sessions, and added a TODO to plumb chat history into collectContextData for same-source-of-truth as the cheap-gate. Option (a) requires adding a parameter to collectContextData (Config doesn't expose the active chat today), which is a bigger plumbing change deferred.
| // perspective a truncated summary is unusable just like an empty | ||
| // one. `isCompressionFailureStatus()` returns true for this enum, | ||
| // so non-force callers will tick the consecutive-failure counter. | ||
| compressionStatus: CompressionStatus.COMPRESSION_FAILED_EMPTY_SUMMARY, |
There was a problem hiding this comment.
输出截断 guard 有两个可以加强的点:
(1) 判据用 finish_reason 而不是 >= 20K(L553):compressionOutputTokenCount >= COMPACT_MAX_OUTPUT_TOKENS 是一个 heuristic —— 恰好等于 20K 的合法 summary 也会被判成 truncated → 计 failure → 三次后熔断。PR description 里也承认 "A finish_reason === MAX_TOKENS NOOP guard would be a sensible follow-up"。至少在这里挂一个 TODO(finish_reason) 注释锚住跟进项:
// TODO: switch to finish_reason === MAX_TOKENS when sideQuery surfaces it
// — the current >= cap heuristic false-positives on legitimate 20K summaries.(2) 复用 COMPRESSION_FAILED_EMPTY_SUMMARY 会模糊 telemetry(L572):空 summary(prompt 质量问题,可能要调 prompt / 切分点)和 truncated(容量问题,要调 cap 或 splitter)是两个完全不同的运营信号。合并到同一 enum 值后,从日志 / telemetry 看不出到底是哪一类。
建议加一个 COMPRESSION_FAILED_OUTPUT_TRUNCATED,让 isCompressionFailureStatus() 同样返回 true(对持久化层等价),但日志和 telemetry 能区分这两种失败模式。
注释 L562 已经写了 "Reuse the empty-summary status: from the persistence layer's perspective a truncated summary is unusable just like an empty one" —— 是合理的近似,但代价是 observability 损失。
There was a problem hiding this comment.
✅ Fixed in 681b705 (both parts): (1) added TODO(finish_reason) comment at the truncation guard documenting that >= cap is a heuristic awaiting runSideQuery to surface finish_reason; (2) added CompressionStatus.COMPRESSION_FAILED_OUTPUT_TRUNCATED enum value distinct from EMPTY_SUMMARY so telemetry can separate prompt-quality failures from capacity failures. isCompressionFailureStatus() returns true for both, so persistence/breaker behaviour is unchanged. Updated the truncation test to expect the new status.
| ); | ||
| const shouldForceFromHard = effectiveTokens >= hard; | ||
| if (shouldForceFromHard) { | ||
| this.consecutiveFailures = 0; |
There was a problem hiding this comment.
Hard-rescue 与熔断计数器的实际语义需要写明:这里在调 tryCompress(force=true) 之前先把 counter 清 0;而 force=true 路径在 failure 分支又跳过了 this.consecutiveFailures += 1(chatCompressionService.ts 那一侧 if (!force) 守卫)。
两者组合的后果:hard-rescue 反复失败时,consecutiveFailures 永远不会累加,熔断对这条路径完全无效 —— 只能靠 reactive overflow 兜底,而 reactive overflow 自己也只会 +1 一次(那段代码也是 force=true,作为补偿才显式 self.consecutiveFailures += 1)。
这是合理的 fail-open 设计(hard 已经预测会爆,继续退让最有意义),但 consecutiveFailures 字段的字面含义和实际语义不一致:它实际是 "非 force、非 hard-rescue 的连续失败次数",而不是字面意义上的所有失败。
建议在 GeminiChat.consecutiveFailures 的 JSDoc(geminiChat.ts:459 附近)或这里的 reset 处加 1-2 行注释明说:
// Hard-rescue is a fail-open exception to the breaker: we reset the counter
// because the runtime decided overflow is imminent regardless of recent
// failure history. Combined with `if (!force)` in the service's failure
// branch, hard-rescue failures never accumulate — reactive overflow is the
// real safety net for this path (it explicitly bumps the counter by 1).
this.consecutiveFailures = 0;这样未来调试 "为什么 hard 一直触发但 counter 是 0" 时不至于困惑。
There was a problem hiding this comment.
✅ Fixed in 681b705 — expanded the consecutiveFailures field JSDoc on GeminiChat to spell out the real semantics: it tracks "non-force, non-hard-rescue consecutive failures". Listed each path's interaction with the counter (auto +1, manual /compress skipped, hard-rescue resets BEFORE force=true), and called out reactive overflow as the actual safety net (it explicitly bumps the counter by +1). Future debug-time confusion about "why is hard-rescue firing but counter is 0" now has a one-line answer in the field doc.
| * force=true call resets it. | ||
| */ | ||
| hasFailedCompressionAttempt: boolean; | ||
| consecutiveFailures: number; |
There was a problem hiding this comment.
SDK breaking change 需要补到 release notes:hasFailedCompressionAttempt: boolean → consecutiveFailures: number 在 CompressOptions 上是字段名 + 类型双变。CompressOptions 通过 ChatCompressionService 间接暴露在 @qwen-code/qwen-code-core 的 public surface,下游 SDK 直接调 service.compress({ ..., hasFailedCompressionAttempt: true }) 的代码会拿到 TS 编译错误,且语义也变了(布尔指示 vs 计数累加,默认值 0 不等价于 false 在某些边界上)。
PR description 的 "Breaking changes / migration notes" 段落里列了 contextPercentageThreshold 移除,但没提这条。建议补一行(并在 release notes / CHANGELOG 同步):
CompressOptions.hasFailedCompressionAttempt: boolean重命名为consecutiveFailures: number。SDK 消费者需要从 "传 true 表示已失败" 改为 "传当前累计失败次数(通常由 GeminiChat 维护)"。语义变化:true旧含义是 "永久禁用 auto",新的>= MAX_CONSECUTIVE_FAILURES等价。
这条改动我审计了一下使用面:仓库内只有 GeminiChat.tryCompress 一个调用方在传这个字段,所以内部 migration 风险很低;但 core 包的 d.ts 是会发出去的,对外仍是 breaking。
There was a problem hiding this comment.
✅ Fixed in 681b705 — added a dedicated SDK-Breaking-change subsection to the design doc (docs/design/auto-compaction-threshold-redesign.md) covering the CompressOptions.hasFailedCompressionAttempt: boolean → consecutiveFailures: number rename with a side-by-side semantics table and migration guide (true → MAX_CONSECUTIVE_FAILURES, false → 0). The PR description's release-notes block is sourced from the design doc, so this propagates.
- R5.1: tighten /context tier comment + TODO. The rawOverhead-based fix doesn't cover `--continue` restores with many history messages (since rawOverhead excludes messagesTokens). UI may still show 'safe' for one render until the first send. Documented inline and added a TODO to plumb chat history into collectContextData for same-source-of-truth as the cheap-gate. - R5.2a: add TODO(finish_reason) at the truncation guard. The `>= cap` heuristic false-positives on legitimate at-cap summaries; the proper signal is finish_reason which runSideQuery doesn't surface today. - R5.2b: split telemetry — new CompressionStatus.COMPRESSION_FAILED_OUTPUT_TRUNCATED enum value. Distinct from EMPTY_SUMMARY so logs/telemetry can tell prompt-quality failures (tune prompt / splitter) from capacity failures (raise cap / shrink splitter input). isCompressionFailureStatus() treats both as failures so the breaker behavior is unchanged. - R5.3: expand consecutiveFailures JSDoc to clarify it tracks "non-force, non-hard-rescue consecutive failures" — hard-rescue resets the counter and force=true skips increments, so the counter is the "regular path" health signal only; reactive overflow is the real safety net for the force-only paths. - R5.4: document the CompressOptions field rename (hasFailedCompressionAttempt: boolean → consecutiveFailures: number) as an SDK breaking change in the design doc with migration guide. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Review batch 5 — commit
|
| # | Outcome | One-line |
|---|---|---|
R5.1 /context tier 漏 messagesTokens |
✅ Partial | 选 path (b): 收紧注释 + TODO;path (a) 需要改 Config 接口暴露 chat instance,是更大改动 |
R5.2a >= heuristic 应基于 finish_reason |
✅ Fixed | 加 TODO(finish_reason) 锚定 follow-up |
| R5.2b 截断与空 summary 模糊 telemetry | ✅ Fixed | 新增 CompressionStatus.COMPRESSION_FAILED_OUTPUT_TRUNCATED 枚举值 |
| R5.3 hard-rescue counter 语义不一致 | ✅ Fixed | consecutiveFailures JSDoc 扩写,列明所有路径的累加/重置语义 |
| R5.4 SDK breaking change 未列 | ✅ Fixed | design doc Breaking-change 章节补全 `hasFailedCompressionAttempt` → `consecutiveFailures` 含迁移指引 |
Verification
- typecheck 跨 4 workspace clean
- core 测试 172/172 pass (
chatCompressionService.test.ts·geminiChat.test.ts·tokenEstimation.test.ts) - cli 测试 15/15 pass (
contextCommand.test.ts·ContextUsage.test.tsx·tipRegistry.test.ts) - 新增
COMPRESSION_FAILED_OUTPUT_TRUNCATEDenum 值附加在 NOOP 之后(不 shift 现有 enum 数值,二进制兼容)
Reviewer 风险审计回应
针对 reviewer 总评中的
- Where is the config saved? #2 三次 false-positive 熔断风险:已通过 R5.2b 的 enum 拆分让运维能从 telemetry 区分"提示词质量" vs "容量";finish_reason 路径上线后 R5.2a 的 TODO 可一次清理掉 false-positive
- pre-release: fix ci #1
/contextUI vs cheap-gate 同源性:注释和 TODO 已说明只覆盖 overhead-heavy 场景;message-heavy 在首发 cheap-gate 即纠正(≤1 render lag) - Are you interested in AI Terminal? #4 SDK 兼容性:design doc 已含 migration guide,仓库内只
GeminiChat.tryCompress一个内部调用方,外部影响面小
plan doc 体量
reviewer 提到 docs/plans/2026-05-14-auto-compaction-threshold-redesign.md 1752 行偏大、不进运行时。同意它是一次性产物,仅作为本 PR 的实施记录。如果你认为应该移到 .github/closed-PRs/ 这类归档位置,可以单独提一个 follow-up。
5 个 inline 评论都已在原 thread 单条回复并指向 681b70501。
🤖 Triaged via the review-response skill.
wenshao
left a comment
There was a problem hiding this comment.
Test coverage gaps (no specific diff line to anchor):
- [Suggestion]
compress()precomputedEffectiveTokenspath (path 1 — skip estimation) has no unit test at the service level. If future refactoring breaks the priority logic, the regression is invisible at this layer. (chatCompressionService.test.ts) - [Suggestion] Hard-tier rescue tests verify
force+pendingUserMessageare passed but never assertprecomputedEffectiveTokensis forwarded. The estimation-reuse optimization (a core perf win of this PR) is unprotected against regression. (geminiChat.test.ts) - [Suggestion]
COMPRESSION_FAILED_OUTPUT_TRUNCATEDis included inisCompressionFailureStatus()but never exercised as a failure variant in any geminiChat-level circuit breaker test. OnlyINFLATEDandEMPTY_SUMMARYare tested. (geminiChat.test.ts) - [Suggestion] The
originalTokenCount === 0+pendingUserMessagepath (which falls through to the first-send estimation fallback) has no dedicated test case. The existing "estimated tokens exceed threshold" test usesoriginalTokenCount = 160_000(non-zero), so the zero-path branch is uncovered at the service level. (chatCompressionService.test.ts)
— DeepSeek/deepseek-v4-pro via Qwen Code /review
| consecutiveFailures >= MAX_CONSECUTIVE_FAILURES && | ||
| !force && | ||
| !bypassTokenThreshold | ||
| ) { |
There was a problem hiding this comment.
[Critical] Circuit breaker NOOP is completely silent. When consecutiveFailures >= MAX_CONSECUTIVE_FAILURES causes compress() to return CompressionStatus.NOOP, there is zero log output. The only observable signal that auto-compaction has stopped is the absence of compression — a nightmare to debug at 3 AM when the only visible symptom is "API context overflow."
| ) { | |
| this.config | |
| .getDebugLogger() | |
| .warn( | |
| `Auto-compaction breaker tripped: consecutiveFailures=${opts.consecutiveFailures} >= MAX=${MAX_CONSECUTIVE_FAILURES}. Use /compress to reset.`, | |
| ); | |
| return { compressionStatus: CompressionStatus.NOOP }; |
— DeepSeek/deepseek-v4-pro via Qwen Code /review
There was a problem hiding this comment.
✅ Fixed in 1030557 — added a warn-level log on the breaker NOOP path quoting consecutiveFailures, MAX, and recovery instruction (/compress). Was: silent. Now: shows up in logs so absence-of-compaction is no longer invisible.
| // trying after MAX_CONSECUTIVE_FAILURES strikes rather than burn an API | ||
| // call on every send. Reactive overflow still catches the catastrophic | ||
| // case. See docs/design/auto-compaction-threshold-redesign.md risk #2. | ||
| // |
There was a problem hiding this comment.
[Suggestion] The truncation guard compressionOutputTokenCount >= COMPACT_MAX_OUTPUT_TOKENS (>=) can false-positive on a legitimate summary that happens to land exactly at 20K tokens. Each false-positive counts as a compression failure, incrementing the circuit breaker — 3 false-positives permanently disable auto-compaction. The code's own TODO acknowledges the correct approach is finish_reason === 'length' / MAX_TOKENS. Until runSideQuery surfaces finish_reason, consider using > instead of >= to shrink the false-positive window (a model producing exactly 20K is far less likely than one exceeding 20K).
— DeepSeek/deepseek-v4-pro via Qwen Code /review
There was a problem hiding this comment.
✅ Fixed in 1030557 — >= changed to >. With R5.2b's new COMPRESSION_FAILED_OUTPUT_TRUNCATED status now counting toward the breaker, the false-positive cost (3 strikes → permanent disable) was too high. The proper finish_reason path is still TODO'd; > will essentially never fire today (API enforces hard cap) but is the right semantics once finish_reason lands.
| this.lastPromptTokenCount, | ||
| imageTokenEstimate, | ||
| ); | ||
| const shouldForceFromHard = effectiveTokens >= hard; |
There was a problem hiding this comment.
[Suggestion] Hard-tier rescue resets consecutiveFailures = 0 before calling tryCompress(force=true), and force=true causes tryCompress's failure branch to skip the consecutiveFailures += 1 increment. This means repeated hard-rescue compression failures never trip the breaker — each failing send burns one doomed compression API call with no limit. The opposite extreme of the proactive path's breaker; consider either saving/restoring the pre-call counter value on failure, or adding a separate hard-rescue cooldown counter (e.g., skip hard rescue for N turns after M consecutive hard-rescue failures).
— DeepSeek/deepseek-v4-pro via Qwen Code /review
There was a problem hiding this comment.
✅ Fixed in 1030557 — added a dedicated hardRescueFailureCount field on GeminiChat, bounded by MAX_CONSECUTIVE_FAILURES. After that many consecutive hard-rescue failures, the rescue stops firing and reactive overflow takes over. Resets on any compression success. Documented in JSDoc + observable via the new warn-log when the rescue fails.
|
|
||
| export type ContextTier = 'safe' | 'warn' | 'auto' | 'hard'; | ||
|
|
||
| export interface ContextThresholds { |
There was a problem hiding this comment.
[Suggestion] ContextThresholds (CLI package) has the exact same shape as CompactionThresholds (core package) — four identically-typed fields (effectiveWindow, warn, auto, hard). Maintaining a duplicate type definition across packages creates a silent drift risk: if core's shape changes, the CLI type becomes stale with no compiler error (because the runtime value still flows through untyped boundaries). Consider re-exporting CompactionThresholds from core's public API and referencing it here, or using Pick<CompactionThresholds, ...> to make the relationship explicit.
— DeepSeek/deepseek-v4-pro via Qwen Code /review
There was a problem hiding this comment.
✅ Fixed in 1030557 — ContextThresholds reduced to a type alias of core's CompactionThresholds (export type ContextThresholds = CompactionThresholds;). The four fields now have a single source of truth; the previously-silent drift risk is gone.
| self.hasFailedCompressionAttempt = true; | ||
| // Reactive compression is force=true so tryCompress's | ||
| // failure branch did not increment the counter. Count it | ||
| // explicitly as one strike — a single transient error |
There was a problem hiding this comment.
[Suggestion] The reactive overflow path (self.consecutiveFailures += 1 at L984) shares the same consecutiveFailures counter as the proactive cheap-gate path. This means reactive compression failures (transient model issues producing poor summaries) disable proactive auto-compaction. These two paths have different failure semantics — reactive is a last-resort recovery, proactive is a performance optimization — and sharing a breaker undermines fault isolation. Consider a separate reactiveConsecutiveFailures counter, or at minimum document the coupling in the field's JSDoc.
— DeepSeek/deepseek-v4-pro via Qwen Code /review
There was a problem hiding this comment.
❌ Declined — coupling is documented in R5.3 JSDoc and the related R1.2 fix already softened reactive's impact: reactive failure increments by +1, not =MAX, so it takes MAX_CONSECUTIVE_FAILURES reactive failures to disable proactive. That's the correct outcome for a chat where reactive consistently fails — splitting the counter would just delay the bound without changing the steady-state behavior. R6.7 (this batch) additionally fixed reactive's missing increment in the catch block so the bound is now uniformly enforced across both reactive failure modes.
wenshao
left a comment
There was a problem hiding this comment.
Additional observation (not tied to a diff line): docs/users/configuration/settings.md still documents model.chatCompression.contextPercentageThreshold as a live setting (including "Use 0 to disable compression entirely"), but this field has been removed from ChatCompressionSettings and the value is silently ignored. The settings docs should be updated to reflect the removal.
— qwen-latest-series-invite-beta-v28 via Qwen Code /review
| imageTokenEstimate, | ||
| ); | ||
| const shouldForceFromHard = effectiveTokens >= hard; | ||
| if (shouldForceFromHard) { |
There was a problem hiding this comment.
[Critical] Hard-tier rescue fires with zero logging
When shouldForceFromHard is true, the code silently resets consecutiveFailures = 0 and passes force=true to tryCompress. There is no debugLogger.info or debugLogger.warn at this trigger point.
The JSDoc at L469 says "If you're debugging 'why is hard-rescue firing but the counter is 0', that's by design" — but there's nothing in the log to confirm that IS what's happening. An oncall engineer seeing repeated forced compaction with counter stuck at 0 cannot distinguish "hard-tier rescue by design" from "something is resetting the counter in a loop" without reading source code.
| if (shouldForceFromHard) { | |
| if (shouldForceFromHard) { | |
| debugLogger.info( | |
| `[compaction] hard-tier rescue: effectiveTokens=${effectiveTokens} >= hard=${hard}, forcing compaction (consecutiveFailures ${this.consecutiveFailures} → 0)`, | |
| ); | |
| this.consecutiveFailures = 0; | |
| } |
— qwen-latest-series-invite-beta-v28 via Qwen Code /review
There was a problem hiding this comment.
✅ Fixed in 1030557 — added a debugLogger.info call at the rescue trigger point capturing effectiveTokens, the hard threshold, and the counter transition (consecutiveFailures N → 0). Now the field JSDoc's claim that "hard-rescue resets the counter by design" has a matching observable signal.
| // the intended recovery path. (review #4168 R1.2) | ||
| self.consecutiveFailures += 1; | ||
| } | ||
| } catch (compressionError) { |
There was a problem hiding this comment.
[Critical] Reactive overflow catch block doesn't increment consecutiveFailures
The status-based failure path at L984 correctly increments self.consecutiveFailures += 1, but thrown exceptions (network errors, model 500s, timeouts) in this catch block bypass the increment entirely. If reactive compression consistently throws rather than returning a failure status, the circuit breaker never trips — the system burns a failed reactive compression API call on every overflow indefinitely.
| } catch (compressionError) { | |
| } catch (compressionError) { | |
| if ( | |
| params.config?.abortSignal?.aborted || | |
| isAbortError(compressionError) | |
| ) { | |
| throw compressionError; | |
| } | |
| debugLogger.warn('Reactive compression failed.', compressionError); | |
| self.consecutiveFailures += 1; |
— qwen-latest-series-invite-beta-v28 via Qwen Code /review
There was a problem hiding this comment.
✅ Fixed in 1030557 — self.consecutiveFailures += 1 added inside the catch block (was: status-based path only). Thrown exceptions (network errors, model 5xx, timeouts) now also count toward the breaker, matching the status-based pattern.
| id: 'context-critical', | ||
| content: | ||
| 'Context is almost full! Run /compress now or start /new to continue.', | ||
| 'Context near hard limit — auto-compact will force on next send. Consider /clear if you want to start fresh.', |
There was a problem hiding this comment.
[Suggestion] Tip text says "will force on next send" but hard-tier rescue already ran on current send
The isRelevant check fires when lastPromptTokenCount >= thresholds.hard — meaning the hard-tier rescue in sendMessageStream has already forced compaction on the send that just produced this response. Users are told compaction "will force on next send" when it already forced on this send.
| 'Context near hard limit — auto-compact will force on next send. Consider /clear if you want to start fresh.', | |
| 'Context near hard limit — auto-compact was forced on this turn. Consider /clear if context remains tight.', |
— qwen-latest-series-invite-beta-v28 via Qwen Code /review
There was a problem hiding this comment.
✅ Fixed in 1030557 — tip text now reads "Context near hard limit — auto-compact was forced on this turn. Consider /clear if context remains tight." Tense matches the actual timing (rescue already ran by the time the tip renders).
| ) { | ||
| // eslint-disable-next-line no-console | ||
| console.warn( | ||
| '[qwen-code] chatCompression.contextPercentageThreshold has been removed ' + |
There was a problem hiding this comment.
[Suggestion] contextPercentageThreshold: 0 disable mechanism removed with no replacement
Users who set contextPercentageThreshold: 0 to disable auto-compaction (e.g., for debugging compression issues, or for sessions that must retain full history) now have no migration path. The one-time console.warn at startup is easy to miss, and doesn't inform users that compaction can no longer be disabled at all.
Consider either: (1) adding a chatCompression.enabled boolean as a replacement escape hatch, (2) adding an env var like QWEN_AUTO_COMPACT=false, or (3) at minimum, updating the warning to explicitly state that auto-compaction cannot currently be disabled.
— qwen-latest-series-invite-beta-v28 via Qwen Code /review
There was a problem hiding this comment.
✅ Partially fixed in 1030557 — went with reviewer's option (3): the deprecation warning now explicitly states "auto-compaction cannot currently be disabled" and points users to /clear / open an issue if they need a replacement. Did not add a new enabled flag or env var — those would re-introduce the disable mechanism this PR removed. The choice trades one round of forced-on auto-compact for the cleaner threshold semantics; the warning makes that trade-off visible to affected users.
| label: string; | ||
| tokens: number; | ||
| isCurrent?: boolean; | ||
| hint?: string; |
There was a problem hiding this comment.
[Suggestion] Dead code: ThresholdRow declares hint?: string prop but no caller passes it
The hint prop is declared, destructured, and rendered, but none of the ThresholdRow invocations in CompactionThresholds ever pass a hint value. Remove the prop until a caller actually needs it.
| hint?: string; | |
| const ThresholdRow: React.FC<{ | |
| label: string; | |
| tokens: number; | |
| isCurrent?: boolean; | |
| }> = ({ label, tokens, isCurrent }) => { |
— qwen-latest-series-invite-beta-v28 via Qwen Code /review
There was a problem hiding this comment.
✅ Fixed in 1030557 — hint prop removed from ThresholdRow. The only caller that ever passed it was dropped in an earlier round (the "Effective window" hint was moved out to avoid wrapping). Now matches the actual usage.
| // single render — that resolves the moment any send happens. | ||
| // | ||
| // TODO: plumb the chat history into collectContextData and use | ||
| // estimatePromptTokens(history, undefined, 0, imageTokenEstimate) here |
There was a problem hiding this comment.
[Suggestion] TODO references a type-incorrect function call
The TODO proposes estimatePromptTokens(history, undefined, 0, imageTokenEstimate), but userMessage: Content is a non-optional parameter in estimatePromptTokens. A future developer following this TODO will hit a compile error.
Update the TODO to show a type-correct call, or note that userMessage needs to be made optional first.
— qwen-latest-series-invite-beta-v28 via Qwen Code /review
There was a problem hiding this comment.
✅ Fixed in 1030557 — rewrote the TODO with a 3-step implementation sketch that calls out the prerequisite signature change (estimatePromptTokens.userMessage needs to be made optional first) and the plumbing path (add a chat?: GeminiChat parameter to collectContextData). A future developer following the TODO won't hit the compile error you flagged.
| */ | ||
| const DEFAULT_COMPRESSION_THRESHOLD = 0.7; | ||
| function currentTier( | ||
| tokens: number, |
There was a problem hiding this comment.
[Suggestion] hard === auto makes the 'auto' tier unreachable for small windows
computeThresholds collapses hard to auto for windows below ~82.5K tokens. Since this function checks >= hard first, when hard === auto the first branch always wins — the function returns 'hard' and the 'auto' tier is never displayed. Users with smaller-window models see the display jump from 'warn' directly to 'hard'.
Consider checking auto before hard when they're equal, or displaying 'auto/hard' when the thresholds are identical.
— qwen-latest-series-invite-beta-v28 via Qwen Code /review
There was a problem hiding this comment.
✅ Fixed in 1030557 — currentTier() now only returns 'hard' when thresholds.hard > thresholds.auto. For small windows where computeThresholds collapses them, the function returns 'auto' instead, so the tier label is reachable for all windows.
| breakdown.autocompactBuffer, | ||
| contextWindowSize, | ||
| ), | ||
| ` Effective window: ${formatNum(breakdown.thresholds.effectiveWindow)} (window − 20K reserve)`, |
There was a problem hiding this comment.
[Suggestion] formatContextUsageText hardcodes English labels and a stale-prone magic constant
The text formatter embeds threshold labels in English ("Warn threshold", "Auto threshold", etc.) and includes a literal "(window − 20K reserve)" string. The interactive ContextUsage.tsx component uses t() for i18n. Additionally, "20K" is a hard-coded reference to SUMMARY_RESERVE — if that constant changes, this string silently goes stale.
Consider using t() for labels and deriving the reserve text from the constant.
— qwen-latest-series-invite-beta-v28 via Qwen Code /review
There was a problem hiding this comment.
✅ Fixed in 1030557 — formatContextUsageText labels now go through t() (matching the interactive ContextUsage.tsx). "20K reserve" is now derived from SUMMARY_RESERVE (exported from core) via Math.round(SUMMARY_RESERVE / 1000) + "K", so it stays in sync if the constant ever changes.
| const effectiveTokens = | ||
| opts.precomputedEffectiveTokens !== undefined | ||
| ? opts.precomputedEffectiveTokens | ||
| : pendingUserMessage |
There was a problem hiding this comment.
[Suggestion] Cheap-gate path #2 (pendingUserMessage without precomputedEffectiveTokens) is dead code
The cheap-gate has three estimation paths, but no current caller ever passes pendingUserMessage without also passing precomputedEffectiveTokens: sendMessageStream passes both, tryCompressChat (manual /compress) passes neither, and heap-pressure bypass passes neither. This path is the only one that calls chat.getHistory(true) inside the cheap-gate — an expensive clone that is never actually reached.
Removing this branch (or adding an assertion) would simplify the gate and eliminate the latent clone risk if a future caller accidentally hits it.
— qwen-latest-series-invite-beta-v28 via Qwen Code /review
There was a problem hiding this comment.
✅ Fixed in 1030557 — removed the dead pendingUserMessage-only branch. The cheap-gate now uses opts.precomputedEffectiveTokens ?? originalTokenCount. Production callers (sendMessageStream) always pass precomputed; direct service callers fall back to originalTokenCount without cloning history. Eliminates the latent double-clone risk.
Adds a defensive guard in ChatCompressionService.compress() that detects when the side-query summary hit COMPACT_MAX_OUTPUT_TOKENS (20K). In that case the summary is likely truncated mid-content, so we drop it and return NOOP rather than persist a half-summary. The next send re-tries; reactive overflow still catches the catastrophic case where the API rejects the next request as too large. Documented in the design doc as risk #2; the bot reviewer on PR #4168 correctly pushed for it to land alongside the threshold redesign rather than as a follow-up since the new 20K cap is what makes truncation likely in the first place. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Task 11 redesign updated the non-interactive text formatter
(formatContextUsageText) but left ContextUsage.tsx — the interactive
React component that real /context users see — unchanged. As a result
the TUI still showed the old single "Autocompact buffer" line and none
of the new warn/auto/hard ladder.
Adds a "Compaction thresholds" section after the per-category breakdown:
- Effective window
- Warn / Auto / Hard threshold rows with a ▶ marker on the row the
current usage has crossed
- Current tier label coloured by severity (safe→green, warn/auto→
yellow, hard→red)
The existing progress bar legend (Used / Free / Autocompact buffer)
is preserved because it's tied to the three-segment progress bar
visualisation; the new section adds the absolute numbers + tier badge
on top of that.
Caught by the tmux e2e test (PR #4168 ci-monitor follow-up). Pre-fix
the assertion 'Compaction thresholds' missed completely from the TUI;
post-fix the new section renders correctly for fresh and live sessions
on 1M / 200K / 128K windows.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Behavior fixes: - MAX_TOKENS truncation guard now returns COMPRESSION_FAILED_EMPTY_SUMMARY instead of NOOP so the consecutive-failure breaker actually trips after repeated max-length summaries (R1.1). - Reactive overflow failure increments consecutiveFailures by 1 instead of latching to MAX in one shot, so a transient network blip doesn't permanently disable auto-compaction. The hard-tier rescue resets the counter, which remains the designated recovery path (R1.2). - /context current-tier classification uses rawOverhead (system + tools + memory + skills) as the tier input when API data is not yet available, rather than 0 — large inherited contexts no longer silently show 'safe' (R2.2). Performance: - sendMessageStream computes effectiveTokens ONCE and passes it through TryCompressOptions.precomputedEffectiveTokens, so the cheap-gate inside service.compress doesn't redo the estimation. Also fixes the imageTokenEstimate inconsistency between the rescue and cheap-gate paths (R1.3 + R1.4). - Steady-state path (lastPromptTokenCount > 0) skips the costly getHistory(true) clone — estimatePromptTokens only needs the user message in that branch. Code hygiene: - BYTES_PER_TOKEN → CHARS_PER_TOKEN (inputs are char counts, not byte counts; CJK text would mislead under the old name) (R3.1). - Drop dead getContextUsagePercent helper + index re-export — no callers in source after the threshold rewire (R1.5). - Add a comment on estimatePromptTokens' first-send fallback documenting the ~15-20K under-estimate (system prompt + tools + skills) and that reactive overflow is the safety net (R3.3). Tests: - New CLI ContextUsage.test.tsx exercises the React renderer for the three-tier section: section presence, ▶ marker placement per tier, current-tier label coloring (R1.6). - New chatCompressionService.test.ts case pins that a stale contextPercentageThreshold: 0 value in user settings no longer short-circuits compaction (R2.1). - New tokenEstimation.test.ts case covers functionResponse (distinct nested-parts branch from functionCall) (R3.5). - New geminiChat.test.ts integration test exercises the real ChatCompressionService — not a mock — for the first-send-after- inherited-history scenario where lastPromptTokenCount=0 and only the full-history estimate can cross the auto threshold (R3.4). Declined: R3.2 (change `>=` to `>` on the MAX_TOKENS guard). The current operator catches the at-cap case as suspicious, which is intentional — landing exactly at the output cap is far more likely truncation than clean stop given p99.99 ≈ 17K. With R1.1 in place, persistent truncations trip the breaker after MAX_CONSECUTIVE_FAILURES so the worst case is bounded. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- R5.1: tighten /context tier comment + TODO. The rawOverhead-based fix doesn't cover `--continue` restores with many history messages (since rawOverhead excludes messagesTokens). UI may still show 'safe' for one render until the first send. Documented inline and added a TODO to plumb chat history into collectContextData for same-source-of-truth as the cheap-gate. - R5.2a: add TODO(finish_reason) at the truncation guard. The `>= cap` heuristic false-positives on legitimate at-cap summaries; the proper signal is finish_reason which runSideQuery doesn't surface today. - R5.2b: split telemetry — new CompressionStatus.COMPRESSION_FAILED_OUTPUT_TRUNCATED enum value. Distinct from EMPTY_SUMMARY so logs/telemetry can tell prompt-quality failures (tune prompt / splitter) from capacity failures (raise cap / shrink splitter input). isCompressionFailureStatus() treats both as failures so the breaker behavior is unchanged. - R5.3: expand consecutiveFailures JSDoc to clarify it tracks "non-force, non-hard-rescue consecutive failures" — hard-rescue resets the counter and force=true skips increments, so the counter is the "regular path" health signal only; reactive overflow is the real safety net for the force-only paths. - R5.4: document the CompressOptions field rename (hasFailedCompressionAttempt: boolean → consecutiveFailures: number) as an SDK breaking change in the design doc with migration guide. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Task 11 redesign updated the non-interactive text formatter
(formatContextUsageText) but left ContextUsage.tsx — the interactive
React component that real /context users see — unchanged. As a result
the TUI still showed the old single "Autocompact buffer" line and none
of the new warn/auto/hard ladder.
Adds a "Compaction thresholds" section after the per-category breakdown:
- Effective window
- Warn / Auto / Hard threshold rows with a ▶ marker on the row the
current usage has crossed
- Current tier label coloured by severity (safe→green, warn/auto→
yellow, hard→red)
The existing progress bar legend (Used / Free / Autocompact buffer)
is preserved because it's tied to the three-segment progress bar
visualisation; the new section adds the absolute numbers + tier badge
on top of that.
Caught by the tmux e2e test (PR #4168 ci-monitor follow-up). Pre-fix
the assertion 'Compaction thresholds' missed completely from the TUI;
post-fix the new section renders correctly for fresh and live sessions
on 1M / 200K / 128K windows.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Behavior fixes: - MAX_TOKENS truncation guard now returns COMPRESSION_FAILED_EMPTY_SUMMARY instead of NOOP so the consecutive-failure breaker actually trips after repeated max-length summaries (R1.1). - Reactive overflow failure increments consecutiveFailures by 1 instead of latching to MAX in one shot, so a transient network blip doesn't permanently disable auto-compaction. The hard-tier rescue resets the counter, which remains the designated recovery path (R1.2). - /context current-tier classification uses rawOverhead (system + tools + memory + skills) as the tier input when API data is not yet available, rather than 0 — large inherited contexts no longer silently show 'safe' (R2.2). Performance: - sendMessageStream computes effectiveTokens ONCE and passes it through TryCompressOptions.precomputedEffectiveTokens, so the cheap-gate inside service.compress doesn't redo the estimation. Also fixes the imageTokenEstimate inconsistency between the rescue and cheap-gate paths (R1.3 + R1.4). - Steady-state path (lastPromptTokenCount > 0) skips the costly getHistory(true) clone — estimatePromptTokens only needs the user message in that branch. Code hygiene: - BYTES_PER_TOKEN → CHARS_PER_TOKEN (inputs are char counts, not byte counts; CJK text would mislead under the old name) (R3.1). - Drop dead getContextUsagePercent helper + index re-export — no callers in source after the threshold rewire (R1.5). - Add a comment on estimatePromptTokens' first-send fallback documenting the ~15-20K under-estimate (system prompt + tools + skills) and that reactive overflow is the safety net (R3.3). Tests: - New CLI ContextUsage.test.tsx exercises the React renderer for the three-tier section: section presence, ▶ marker placement per tier, current-tier label coloring (R1.6). - New chatCompressionService.test.ts case pins that a stale contextPercentageThreshold: 0 value in user settings no longer short-circuits compaction (R2.1). - New tokenEstimation.test.ts case covers functionResponse (distinct nested-parts branch from functionCall) (R3.5). - New geminiChat.test.ts integration test exercises the real ChatCompressionService — not a mock — for the first-send-after- inherited-history scenario where lastPromptTokenCount=0 and only the full-history estimate can cross the auto threshold (R3.4). Declined: R3.2 (change `>=` to `>` on the MAX_TOKENS guard). The current operator catches the at-cap case as suspicious, which is intentional — landing exactly at the output cap is far more likely truncation than clean stop given p99.99 ≈ 17K. With R1.1 in place, persistent truncations trip the breaker after MAX_CONSECUTIVE_FAILURES so the worst case is bounded. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- R5.1: tighten /context tier comment + TODO. The rawOverhead-based fix doesn't cover `--continue` restores with many history messages (since rawOverhead excludes messagesTokens). UI may still show 'safe' for one render until the first send. Documented inline and added a TODO to plumb chat history into collectContextData for same-source-of-truth as the cheap-gate. - R5.2a: add TODO(finish_reason) at the truncation guard. The `>= cap` heuristic false-positives on legitimate at-cap summaries; the proper signal is finish_reason which runSideQuery doesn't surface today. - R5.2b: split telemetry — new CompressionStatus.COMPRESSION_FAILED_OUTPUT_TRUNCATED enum value. Distinct from EMPTY_SUMMARY so logs/telemetry can tell prompt-quality failures (tune prompt / splitter) from capacity failures (raise cap / shrink splitter input). isCompressionFailureStatus() treats both as failures so the breaker behavior is unchanged. - R5.3: expand consecutiveFailures JSDoc to clarify it tracks "non-force, non-hard-rescue consecutive failures" — hard-rescue resets the counter and force=true skips increments, so the counter is the "regular path" health signal only; reactive overflow is the real safety net for the force-only paths. - R5.4: document the CompressOptions field rename (hasFailedCompressionAttempt: boolean → consecutiveFailures: number) as an SDK breaking change in the design doc with migration guide. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Observability (R6.1 + R6.6): - chatCompressionService.compress() now warn-logs when the breaker trips the NOOP path; previously the only signal was the absence of compaction - sendMessageStream info-logs hard-tier rescue trigger + warn-logs on rescue failure so debugging matches the consecutiveFailures JSDoc Counter accounting (R6.3 + R6.7): - New `hardRescueFailureCount` field on GeminiChat bounds hard-rescue retries to MAX_CONSECUTIVE_FAILURES — without it a chat whose history can't shrink would burn an API call per send forever (force=true skipped the regular increment AND the rescue's pre-call reset wiped state). After MAX failures, hard rescue stops firing and reactive overflow takes over as the next defense layer. Reset on any compression success. - Reactive overflow catch block now increments consecutiveFailures so thrown exceptions (network, 5xx, timeouts) also count toward the breaker — previously only status-based reactive failures incremented. UI corrections (R6.8 + R6.9 + R6.12): - context-critical tip: tense corrected from "will force on next send" to "was forced on this turn" — the rescue already ran by the time the tip renders - Deprecation warning explicitly states auto-compaction can no longer be disabled (no replacement for `contextPercentageThreshold: 0`) - currentTier() returns 'auto' (not 'hard') when hard collapses to auto on small windows — previously the 'auto' tier was unreachable for those sessions Code hygiene (R6.2 / R6.4 / R6.10 / R6.11 / R6.13 / R6.14): - Truncation guard `>=` → `>`: legitimate at-cap summaries no longer treated as truncation (was particularly costly because R5.2b made these count toward the breaker) - ContextThresholds reduced to a type alias of core's CompactionThresholds to eliminate silent-drift risk - Removed dead `hint` prop on ThresholdRow (no caller after R5 refactor) - TODO at contextCommand.ts now shows a type-correct call sketch - formatContextUsageText uses t() for labels; "20K" derived from SUMMARY_RESERVE constant (exported from core) - cheap-gate dead branch removed: production callers always pass precomputedEffectiveTokens; direct service callers fall back to originalTokenCount instead of double-cloning history Tests (R6.15): - New: COMPRESSION_FAILED_OUTPUT_TRUNCATED counts toward the breaker - New: precomputedEffectiveTokens path skips estimation work - New: cheap-gate falls back to originalTokenCount when no precomputed - Hard-rescue test now asserts precomputedEffectiveTokens is forwarded Docs (R6.16): - docs/users/configuration/settings.md table entry for `model.chatCompression.contextPercentageThreshold` updated to mark the field REMOVED with link to PR rationale Declined: R6.5 (separate reactive/proactive counter). The R5.3 JSDoc already documents the coupling intentionally; R1.2 reduced reactive's weight to +1 (not =MAX), so it takes MAX_CONSECUTIVE_FAILURES reactive failures to disable proactive — which is the correct outcome for a chat where reactive consistently fails. A separate counter would add state without changing observable behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R7.1 critical (scratchpad data-retention): with includeThoughts=false,
the compression model emits its private <scratchpad> reasoning as plain
text alongside <state_snapshot>, and the entire concatenation was being
persisted as the chat's compressed memory — leaking sensitive tool
output (API keys, paths, file fragments) into every subsequent turn.
Extract just the <state_snapshot> envelope from the response; surface a
no-match as COMPRESSION_FAILED_EMPTY_SUMMARY so the breaker reacts to
prompt-format drift.
R7.2 / R7.3 critical (hard-rescue counter accounting): pessimistic
increment pattern. The previous post-call accounting silently leaked
two failure shapes:
- throw (provider 5xx / abort): post-handler unreachable, counter
stuck → infinite re-fire on every send.
- NOOP (history too small to split): neither failure-status nor
COMPRESSED branch matched → same infinite re-fire.
Increment hardRescueFailureCount BEFORE tryCompress(force=true); rely on
the existing success-branch reset in tryCompress to refund the strike
on COMPRESSED. Throws, NOOPs, and failure statuses all keep the strike
uniformly.
R7.4 critical (constant coupling): lifted TOKEN_TO_CHAR_RATIO to the
single declaration in compactionInputSlimming.ts; tokenEstimation.ts's
CHARS_PER_TOKEN is now a re-export. Silent-drift risk between splitter
sizing and gate sizing is gone.
R7.5: removed dead `pendingUserMessage` field from CompressOptions /
TryCompressOptions — unused since R6.14 collapsed its consumer.
R7.6: breaker-NOOP path returns the caller's `originalTokenCount`
rather than 0 so telemetry sees real session token counts on the trip
event, not a misleading zero.
R7.7: log at warn level when hard-rescue is skipped due to budget
exhaustion (hardRescueFailureCount >= MAX). Closes the "why isn't
rescue firing" oncall blind spot.
R7.8: reverted R6.2's `>` back to `>=` on the truncation guard. With
the API hard-capping output at COMPACT_MAX_OUTPUT_TOKENS, `>` could
never fire — making the guard dead code that silently persisted
truncated summaries. `>=` catches exact-at-cap (almost always
truncated); the breaker bounds 3 strikes. Declined the reviewer's
alternative `>= cap * 0.95` heuristic — broadens false positives into
the p99-realistic range (~19K) without addressing the root cause
(finish_reason plumbing, still TODO'd).
R7.9: throttle the breaker warn log via a `breakerWarningEmitted` flag
on GeminiChat. Fires once when the breaker first trips, resets when
consecutiveFailures returns to 0. Service stays stateless.
R7.10: neutral tip wording — "Run /compress or /clear to free space"
is correct whether hard-rescue ran, failed, or was budget-suppressed.
Previous past-tense ("was forced on this turn") was wrong in the
budget-exhausted case.
R7.11: 4 new test cases pinning the hardRescueFailureCount + reactive
overflow counter contracts (budget exhaustion via failures, via NOOPs,
via thrown exceptions; reactive throw increments consecutiveFailures).
Tests: packages/core 205 passing in changed files (chatCompression +
geminiChat + tokenEstimation + compactionInputSlimming);
packages/cli 33 passing (tips + ContextUsage + contextCommand). Pre-
existing serve/* breakage and timeout-flaky utils/filesearch tests
unaffected.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…er + throttle)
The R7.1 <state_snapshot> extraction shipped in round 7 turned out to
be incomplete on three fronts that the round-8 review caught:
R8.1 critical (format-violation diagnostic): when the model produced
non-empty raw output but no <state_snapshot> tags, the path silently
classified as COMPRESSION_FAILED_EMPTY_SUMMARY — indistinguishable
from a model that genuinely returned nothing, and three such sends
trip the breaker with no actionable signal. Added a warn-level log
on the !isRawEmpty && isSummaryEmpty branch that includes length and
the first 200 chars of the raw output, so an oncall can distinguish
"prompt drift / model misbehaviour" from "provider error".
R8.6 (regex bypass): the non-greedy `<state_snapshot>[\s\S]*?</...>`
match captured from the FIRST occurrence of the opening tag. Because
the compression prompt instructs the model to "generate the
<state_snapshot>", the scratchpad is plausibly going to mention the
tag literally — and the match would then start at the scratchpad
mention and capture the scratchpad's reasoning through to the real
closing tag, defeating the data-retention fix. Anchored on the LAST
opening tag via `[\s\S]*<state_snapshot>([\s\S]*?)</state_snapshot>`
plus `${`<state_snapshot>${...}</state_snapshot>`}` reconstruction.
R8.7 (token math): the persisted history contains only the snapshot
envelope, but newTokenCount used the raw API `candidatesTokenCount`
which counts scratchpad+snapshot. Scaling by `summary.length /
rawSummaryText.length` while keeping the API count as the base
preserves tokenizer fidelity for the snapshot portion. Test scenario
of ~3x scratchpad vs snapshot drops the bookkeeping from 1024 → ~260,
which is materially closer to what the next cheap-gate actually sees.
R8.4 (throttle asymmetry): the R7.7 budget-exhausted warn fired on
every send when a session stayed above the hard threshold —
asymmetric with R7.9's `breakerWarningEmitted`. Added matching
`budgetExhaustedWarningEmitted` flag, cleared in the same COMPRESSED
success branch as the other resets.
R8.2 / R8.3 / R8.5 (test coverage gaps): added 6 tests pinning
contracts the previous rounds left unverified:
- exact-cap (20_000) truncation guard (R7.8 regression guard)
- scratchpad-strip end-to-end persistence assertion (R7.1)
- format-violation EMPTY_SUMMARY + warn (R8.1/R8.3b combined)
- breaker-tripped NOOP returns originalTokenCount (R7.6 telemetry)
- hardRescueFailureCount recovery after COMPRESSED success (R8.5)
- regex-anchor on literal scratchpad mention (R8.6)
- newTokenCount accounts for only persisted snapshot (R8.7)
Phase 5 ordering: R8.1, R8.6, R8.7 were written test-first
(RED → fix → GREEN); R8.4 mirrors R7.9 structurally. Phase 6
self-review checklist run and documented in the PR reply.
All 2126 core tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R9.1 (telemetry assertion): pre-existing breaker-NOOP test only checked status — added explicit token-count assertions so a regression to `0/0` would surface instead of silently corrupting trip-event telemetry. R9.2 critical (NOOP refund): the R7.2/R7.3 pessimistic increment was overcautious for the NOOP case. A forced rescue NOOPs when the compressible slice is too small to split this turn — not because the compression mechanism is broken. Refund the strike on NOOP so a session whose first few turns happen to be too small doesn't permanently disable hard-rescue. Throws and failure statuses still cost a strike. Flipped the R7.11 NOOP test to assert the new contract (budget does NOT exhaust on NOOPs). R9.3 critical (cross-file silent coupling): the `<state_snapshot>` tag name was hard-coded in both `prompts.ts` (literal XML in the template) and `chatCompressionService.ts` (extraction regex). A rename in one without the other was a silent failure mode (every compaction → EMPTY_SUMMARY → breaker trips after 3 sends → auto- compaction permanently off, looking like "model can't follow format"). Lifted `COMPRESSION_SNAPSHOT_TAG = 'state_snapshot'` as a shared constant; prompt template uses it via template literal, regex constructs from it via `new RegExp`. R9.4 (stale breaker flag): hard-rescue resets `consecutiveFailures = 0` in the pre-call path but pre-R9.4 left `breakerWarningEmitted` true. After a session sequence "breaker trips → warn emitted → hard-rescue resets counter → counter re-trips", the second trip emitted no warn. Clear the flag alongside the counter in the rescue pre-call path. R9.5 (tip small-window collapse): the `context-critical` tip fired at `>= thresholds.hard`, but on small windows (32K) `computeThresholds` collapses hard to equal auto — the tip would claim "near hard limit" when there is no distinct hard limit. Mirror the `currentTier` guard (`hard > auto`) so the `context-high` band `[auto, hard)` handles small windows cleanly. R9.6 declined as filter-1 false-positive: the cited inflation was fixed in R8.7 (current code scales `compressionOutputTokenCount` by the snapshot/raw char ratio). Reviewer was reading a stale snapshot. R9.7 (preserve valid snapshots): the truncation guard fired whenever `compressionOutputTokenCount >= COMPACT_MAX_OUTPUT_TOKENS` regardless of extraction success. When the model emits a complete `<state_snapshot>...</state_snapshot>` envelope and the cap was consumed by scratchpad, dropping the snapshot throws away a valid result. Gated the guard on `!snapshotMatch` so it now only fires when the envelope is incomplete (no closing tag) — strong evidence of mid-snapshot truncation. Existing R7.8/R8.2 truncation tests updated to use no-closing-tag mocks (the actual shape of mid- snapshot truncation); added new test for the "complete envelope + cap hit → preserved" contract. Phase 5 ordering: R9.2 / R9.4 / R9.7 were RED-first (the R7.11 NOOP test flip is the explicit RED for R9.2; R9.4 has a fresh internals-peek test; R9.7 has a fresh test that fails against the pre-R9.7 code which would return TRUNCATED instead of COMPRESSED). R9.3 is a constant-lift with no behavior change. R9.5 has a new small-window-collapse test. Tests: 2128 core + 24 CLI all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
9cbcfd2 to
27200f7
Compare
R11.1 critical (NaN propagation): R10.1's `?? 0` only catches
null/undefined; NaN passes through and poisons every subsequent
`lastPromptTokenCount + NaN + ...` arithmetic — `NaN >= hard` is
always false, silently disabling hard-tier rescue for the session.
Guard with `Number.isFinite` so NaN / Infinity / non-numbers coerce
to 0. RED-first via hostile-NaN-payload test.
R11.2 (self-inflicted regression from R9.5): adding `hard > auto`
to context-critical left context-high's `[auto, hard)` band empty
when hard === auto (small windows 32K/64K). Users at the auto
threshold lost ALL contextual tips. Accept `>= auto` in
context-high when hard === auto so there's always exactly one tip
in the high-utilization range. RED-first via collapsed-window test.
R11.3 critical (per-strike observability): pre-R11.3, proactive
auto-compaction failures produced ZERO logs until the breaker
tripped on strike 3. An oncall investigating "auto-compaction
stopped" couldn't distinguish EMPTY_SUMMARY / OUTPUT_TRUNCATED /
INFLATED / TOKEN_COUNT_ERROR without source-diving. Added
info-level per-strike log citing status and strike-of-MAX. Declined
the second half of the suggestion (promote breaker/budget warns to
console.warn for user visibility) — that's UI noise; users without
DEBUG=QWEN_CODE_CHAT enabled see reactive overflow recovery
working, which is the intended UX.
R11.4 critical (disable escape hatch restored): the removal of
`contextPercentageThreshold: 0` was scope-collateral, not intent.
Users with compliance / debugging / audit-trail needs require a
way to opt out of auto-compaction entirely. Added
`chatCompression.disabled: boolean` field. Service-level cheap-gate
gates `!force && !bypassTokenThreshold` (proactive only); hard-
rescue gated at SOURCE in sendMessageStream since force=true would
bypass the service gate. Manual /compress (user-initiated force=true
via tryCompressChat) and reactive overflow (API-layer safety net)
remain active — matching the old contextPercentageThreshold=0
semantics that only gated the proactive path.
R11.5 declined-design: the counter asymmetry between
`consecutiveFailures` (proactive cheap-gate health) and
`hardRescueFailureCount` (rescue-budget pessimistic) is intentional
and documented in the JSDoc — they track different mechanisms with
legitimately different reset semantics. The "regular breaker
reports healthy while every compression fails" scenario the
reviewer describes IS the design: a flaky hard-rescue eventually
exhausts its own budget, then the proactive cheap-gate accumulates
strikes, then the cheap-gate breaker latches. Reactive overflow
catches the actual API failure throughout. The save/restore pattern
suggested would complicate the state machine without changing the
recovery shape.
R11.6 (sensitive content in warn log): R8.1's `slice(0, 200)` of
raw model output captured exactly the window where scratchpad's
sensitive content (quoted API keys, paths from tool output) is
most likely to appear. Length-only message preserves the
operationally actionable distinction ("model returned content but
no tags" vs "model returned nothing") without the leak risk.
Actual content is recoverable from provider-side logging.
R11.7 (regex hoist): the snapshot extraction regex depends only on
the immutable `COMPRESSION_SNAPSHOT_TAG` constant. Hoisted to
module-scope `SNAPSHOT_REGEX` — removes per-call `new RegExp()`
overhead and signals to readers that the pattern is a fixed
contract, not parameterised.
R11.8 (i18n hygiene): `breakdown.currentTier` value was interpolated
raw at 2 sites (contextCommand text formatter + ContextUsage Ink
component). Wrapped in `t()` so non-English locales don't see
mixed-language output. Sibling sweep via grep confirmed exactly 2
unwrapped render sites; the other `currentTier` references are
code comparisons against tier-name string literals (not user-facing
strings).
2361 core + 35 CLI tests passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
27200f7 to
e861d07
Compare
R12.1 critical (sibling sweep of R11.1): R11.1 added Number.isFinite
to `lastCandidatesTokenCount`, but `lastPromptTokenCount` (assigned
3 lines above) and `cachedContentTokenCount` had no guard. Also,
Number.isFinite(-1) is true — a negative value would still poison
arithmetic. Factored `coerceUsageCount(value)` enforcing
(finite ∧ >= 0) and routed all 4 API-value capture sites through it.
RED-first via Infinity/NaN/-1/-1e9 injection test.
R12.2 critical (computeThresholds NaN propagation): a provider
returning `"context_window": null` surfaces as `contextWindowSize:
NaN`. Pre-fix, NaN propagated to all 4 thresholds, every downstream
`tokens >= NaN` comparison evaluated false, and the entire three-tier
gate silently disabled. Guard with `!Number.isFinite || <= 0` →
return Infinity thresholds (gate falls through to NOOP) + 0
effectiveWindow. RED-first against NaN/0/-1/-Inf inputs.
R12.3 critical (R8.7 self-inflicted undercount): pure scaling
collapses on extreme scratchpad/snapshot ratios. Example: 200K
scratchpad + 5K snapshot with 15K API tokens scaled to ~375 tokens.
Floor by `estimateContentTokens` on the persisted summary — `Math.max(
scaledApi, charBased)` keeps API tokenizer fidelity when scratchpad is
reasonable, clamps when it isn't. RED-first via 200K/5K extreme test.
R12.4 critical (disabled NOOP observability): the R11.4 disable-knob
NOOP returned silently, leaving oncall unable to distinguish "user
disabled" from "system broken". Added once-per-process warn (module-
level flag because `ChatCompressionService` is per-call). Symmetric
with R7.9 `breakerWarningEmitted` / R8.4 `budgetExhaustedWarningEmitted`.
R12.5 critical (test gap for R11.4 source gate): R11.4's hard-rescue
source-level disable check had no regression guard. Added test
mocking `getChatCompression: { disabled: true }` + lastPromptTokenCount
above hard threshold; asserts no force=true call to tryCompress.
Test passes against current code — pins the contract against future
refactor removing the source gate.
R12.6 (deprecation text contradiction): the R11.4 commit added
`disabled: true` but left the deprecation warning saying
"auto-compaction cannot currently be disabled". Updated to mention
the new field.
R12.7 declined-design: `imageTokenEstimate: 0` silently clamping to
100 violates user intent on a user-configurable knob. The reviewer's
concern (user accidentally disabling image weight) is real but the
fix is wrong shape — silent override of explicit values is filter-5
defensive bloat. Users explicitly setting 0 are signaling intent;
config-validation warnings at load are a future enhancement if
real-world complaints surface.
R12.8 (locale baseline): the 8+ new t() keys in /context output
(`Compaction thresholds`, `Effective window`, `Warn/Auto/Hard
threshold`, `Current tier`, tier names, `window − {{reserve}}
reserve`) had no entries in en.js. Added as baseline; other locales
fall back to the literal key (existing Used/Free behavior).
Not flagged in mustTranslateKeys.ts — would force breaking-CI on
locale maintainers; same precedent as existing Used/Free which also
aren't flagged.
R12.9 + R12.10 (discoverability): added `model.chatCompression.disabled`
and `model.chatCompression.imageTokenEstimate` rows to settings.md;
updated the REMOVED row for `contextPercentageThreshold` to mention
the new `disabled: true` migration path per gpt-5.5's exact suggested
text. Schema entry in settingsSchema.ts deliberately NOT changed —
adding nested sub-properties for chatCompression would require
rewriting the schema design for ALL existing sub-fields
(imageTokenEstimate) and is out of scope for this round; TypeScript's
ChatCompressionSettings interface already provides IDE-side autocomplete.
2405 core + 43 CLI tests in touched files passing. Pre-existing
serve/* import resolution failures in CLI workspace unaffected.
| ) { | ||
| // eslint-disable-next-line no-console | ||
| console.warn( | ||
| '[qwen-code] chatCompression.contextPercentageThreshold has been removed ' + |
There was a problem hiding this comment.
[Critical] Deprecation warning contradicts the PR's own replacement mechanism
The console.warn message says "auto-compaction cannot currently be disabled" and tells users to open an issue "so we can consider a replacement." But this same PR adds ChatCompressionSettings.disabled as the first-class replacement for the removed contextPercentageThreshold: 0 escape hatch. Users who see this warning will be misled into thinking no disable mechanism exists and may file unnecessary issues or resort to unsafe workarounds.
| '[qwen-code] chatCompression.contextPercentageThreshold has been removed ' + | |
| 'Note: the "contextPercentageThreshold" setting is removed. ' + | |
| 'To disable auto-compaction, set "chatCompression": { "disabled": true } instead.' |
— DeepSeek/deepseek-v4-pro via Qwen Code /review
| // disable gate catches the proactive cheap-gate (force=false); | ||
| // hard-rescue uses force=true to bypass the breaker, so it would | ||
| // otherwise sidestep that gate. Skip at the source. | ||
| const autoCompactionDisabled = |
There was a problem hiding this comment.
[Critical] No test coverage for hard-rescue suppression when chatCompression.disabled === true
The sendMessageStream hard-rescue gate checks !autoCompactionDisabled before computing wantHardRescue. This is the sole protection against hard-rescue (which uses force=true) bypassing the service-layer disabled gate. Only the service-layer disable check is tested (chatCompressionService.test.ts "honors chatCompression.disabled"); the geminiChat guard has zero coverage. A regression that drops this guard would cause hard-rescue to silently ignore disabled: true.
| const autoCompactionDisabled = | |
| // Add a test in the "sendMessageStream hard-tier rescue" describe block: | |
| // set mockConfig.getChatCompression().mockReturnValue({ disabled: true }), | |
| // seed lastPromptTokenCount above hard threshold, | |
| // assert compressSpy.mock.calls[0][1].force is false |
— DeepSeek/deepseek-v4-pro via Qwen Code /review
| imageTokenEstimate: number = DEFAULT_IMAGE_TOKEN_ESTIMATE, | ||
| ): number { | ||
| let totalChars = 0; | ||
| for (const content of contents) { |
There was a problem hiding this comment.
[Critical] char/4 token estimation severely underestimates for CJK text (2-4x error)
The code comment claims "char/4 is a conservative lower bound (real tokenizers vary ±30%)," but for Chinese/Japanese/Korean the opposite is true — most tokenizers encode CJK at 1-2 chars per token, not 4. This means for CJK-heavy conversations the cheap-gate triggers much later than it should, hard-tier rescue misses the window before API overflow, and the /context display shows misleadingly low token counts. For CJK users this is a systematic correctness issue that silently undermines the entire three-tier ladder.
Consider applying a language-aware correction factor, or at minimum lowering the ratio for safety. The current CHARS_PER_TOKEN=4 is only "conservative" for Latin-alphabet languages.
— DeepSeek/deepseek-v4-pro via Qwen Code /review
| // the API layer is still the last-ditch safety net. Replaces the | ||
| // removed `contextPercentageThreshold: 0` escape hatch. | ||
| if (chatCompressionSettings?.disabled && !force && !bypassTokenThreshold) { | ||
| return { |
There was a problem hiding this comment.
[Critical] The disabled NOOP path returns with zero observability
When chatCompressionSettings.disabled === true, the early-return NOOP is entirely silent — no debug log, no warn, no telemetry event. This replaces the old contextPercentageThreshold: 0 escape hatch, but unlike the old path which at least produced NOOP telemetry, this leaves zero audit trail. An oncall investigating "why isn't compression running?" has no signal to find the disabled flag short of inspecting runtime config.
| return { | |
| if (chatCompressionSettings?.disabled && !force && !bypassTokenThreshold) { | |
| debugLogger.debug( | |
| '[compaction] NOOP: chatCompression.disabled is true', | |
| ); | |
| return { ... }; | |
| } |
— DeepSeek/deepseek-v4-pro via Qwen Code /review
| * (autoCompact.ts:30) which is based on p99.99 of real compaction outputs. | ||
| */ | ||
| export const COMPACT_MAX_OUTPUT_TOKENS = 20_000; | ||
|
|
There was a problem hiding this comment.
[Suggestion] SUMMARY_RESERVE and COMPACT_MAX_OUTPUT_TOKENS are both hardcoded to 20_000 with a prose comment requiring them to stay equal — but there is no compile-time enforcement
If someone changes one constant without the other, computeThresholds computes effectiveWindow = window - SUMMARY_RESERVE while the actual output cap is COMPACT_MAX_OUTPUT_TOKENS. The threshold ladder silently drifts from reality. This is a classic "works by coincidence" coupling with no compiler error, no runtime exception, and no test failure to catch the drift.
| export const COMPACT_MAX_OUTPUT_TOKENS = 20_000; | |
| export const SUMMARY_RESERVE = COMPACT_MAX_OUTPUT_TOKENS; |
— DeepSeek/deepseek-v4-pro via Qwen Code /review
| /** Token count at which UI warn tier triggers. */ | ||
| readonly warn: number; | ||
| /** Token count at which auto-compaction triggers. */ | ||
| readonly auto: number; |
There was a problem hiding this comment.
[Suggestion] computeThresholds(window) has no input validation — NaN/Infinity input silently disables all compression
If called with NaN (e.g., contextWindowSize misread from config), NaN - SUMMARY_RESERVE = NaN, Math.max(NaN, 0) = NaN, and all thresholds become NaN. Since NaN >= anything is always false, the cheap-gate always passes (NOOP) and hard-rescue never fires. Compression is silently disabled with no error. Extremely unlikely but catastrophic when it happens.
| readonly auto: number; | |
| if (!Number.isFinite(window) || window <= 0) { | |
| const infinite = Number.MAX_SAFE_INTEGER; | |
| return { warn: infinite, auto: infinite, hard: infinite }; | |
| } |
— DeepSeek/deepseek-v4-pro via Qwen Code /review
| * the steady-state estimate lags by one response (typically 500–5000 | ||
| * tokens) and the hard-tier rescue (which sits only HARD_BUFFER ≈ 3K | ||
| * from the window edge) fires late, costing a doomed API round-trip | ||
| * before reactive recovery catches the overflow. |
There was a problem hiding this comment.
[Suggestion] estimatePromptTokens accepts a history parameter that is unused in the steady-state path — leaky abstraction
The primary caller (sendMessageStream) passes [] for history to avoid a getHistory(true) clone. In the steady-state branch (lastPromptTokenCount > 0), the history argument is never read. This optimization leaks into the callee's API surface: future callers may not realize they can pass [], or may mistakenly pass [] on the cold-start path, silently getting a severe underestimate.
Consider splitting into two functions or using an options object so the cold-start case is explicitly opt-in:
| * before reactive recovery catches the overflow. | |
| export function estimateSteadyStateTokens( | |
| lastPromptTokenCount: number, | |
| lastCandidatesTokenCount: number, | |
| userMessage: Content, | |
| imageTokenEstimate?: number, | |
| ): number { ... } | |
| export function estimateColdStartTokens( | |
| history: Content[], | |
| userMessage: Content, | |
| imageTokenEstimate?: number, | |
| ): number { ... } |
— DeepSeek/deepseek-v4-pro via Qwen Code /review
| /** | ||
| * Number of consecutive auto-compaction failures for this chat. The | ||
| * cheap-gate NOOPs once this reaches MAX_CONSECUTIVE_FAILURES (default 3) | ||
| * until a successful compress (forced or not) resets it to 0. Replaces the |
There was a problem hiding this comment.
[Suggestion] Inconsistent naming between two adjacent failure counters: consecutiveFailures vs hardRescueFailureCount
One uses bare "Failures" without a count suffix, the other uses "FailureCount". Additionally, consecutiveFailures doesn't specify what fails — consecutive what? Without reading the JSDoc, it's ambiguous. Both track related-but-different failure domains within the same class.
Consider renaming both for symmetry and clarity, e.g., autoCompactionConsecutiveFailures and hardRescueConsecutiveFailures.
— DeepSeek/deepseek-v4-pro via Qwen Code /review
| */ | ||
| function tierColor(tier: ContextTier): string { | ||
| switch (tier) { | ||
| case 'safe': |
There was a problem hiding this comment.
[Suggestion] warn and auto tiers share the same color (theme.status.warning), making them visually indistinguishable
The three-tier ladder has four states (safe, warn, auto, hard) but only three distinct colors — the warn-to-auto escalation has no visual signal. A user seeing /context output cannot tell from color alone whether they're in the "fire a tip" tier (warn) or the "actually compacting" tier (auto).
Use a distinct color for auto (e.g., a lighter orange distinct from the warn yellow) so each of the four states has a visually unique badge.
— DeepSeek/deepseek-v4-pro via Qwen Code /review
wenshao
left a comment
There was a problem hiding this comment.
| '[qwen-code] chatCompression.contextPercentageThreshold has been removed ' + | ||
| 'and is now controlled by built-in thresholds. Setting will be ignored. ' + | ||
| 'Note: auto-compaction cannot currently be disabled — the old ' + | ||
| '"set threshold to 0 to disable" escape hatch is gone. If you need ' + |
There was a problem hiding this comment.
[Critical] Deprecation warning contradicts disabled field
This warning states "auto-compaction cannot currently be disabled" but the same PR introduces chatCompression.disabled: boolean (config.ts:281) which does exactly that. Users migrating from contextPercentageThreshold: 0 will believe there is no replacement.
The same false claim appears in docs/users/configuration/settings.md:148 ("There is currently no replacement to disable auto-compaction").
| '"set threshold to 0 to disable" escape hatch is gone. If you need ' + | |
| 'Note: to disable auto-compaction, set chatCompression.disabled to true instead. ' + | |
| 'If you need to retain full history for other reasons, use /clear between conversations.', |
— qwen-latest-series-invite-beta-v34 via Qwen Code /review
| // would silently disable hard-tier rescue for the rest of | ||
| // the session. | ||
| this.lastCandidatesTokenCount = Number.isFinite( | ||
| usageMetadata.candidatesTokenCount, |
There was a problem hiding this comment.
[Critical] Negative candidatesTokenCount not clamped
The R11.1 Number.isFinite guard catches NaN and Infinity but NOT negative finite values. Number.isFinite(-100_000) returns true, so a hostile/buggy provider reporting candidatesTokenCount: -100000 passes through and is added to the estimate in estimatePromptTokens, effectively subtracting from it.
Concretely: lastPromptTokenCount=176_999 + lastCandidatesTokenCount=-100_000 + small user message → effectiveTokens ≈ 77K, far below any hard threshold. Hard-tier rescue never fires regardless of actual prompt size.
| usageMetadata.candidatesTokenCount, | |
| this.lastCandidatesTokenCount = Number.isFinite( | |
| usageMetadata.candidatesTokenCount, | |
| ) | |
| ? Math.max(0, usageMetadata.candidatesTokenCount as number) | |
| : 0; |
— qwen-latest-series-invite-beta-v34 via Qwen Code /review
| @@ -1375,6 +1667,23 @@ export class GeminiChat { | |||
| // Always update the per-chat counter so this chat (including | |||
There was a problem hiding this comment.
[Critical] totalTokenCount fallback double-counts candidates tokens
When promptTokenCount is falsy (0 or undefined), this falls back to totalTokenCount which already includes candidates tokens. Then estimatePromptTokens (tokenEstimation.ts:82-84) adds lastCandidatesTokenCount on top, producing promptTokens + 2×candidatesTokens.
On providers that omit promptTokenCount (some OpenAI-compatible endpoints), this over-estimates by 500–5000 tokens, causing false-positive hard-tier rescue triggers (the HARD_BUFFER is only 3K from the window edge).
| // Always update the per-chat counter so this chat (including | |
| usageMetadata.promptTokenCount ?? usageMetadata.totalTokenCount; |
Additionally, estimatePromptTokens should only add lastCandidatesTokenCount when lastPromptTokenCount was sourced from promptTokenCount (not totalTokenCount). Consider exposing a flag or restructuring the estimator.
— qwen-latest-series-invite-beta-v34 via Qwen Code /review
| // sent on THIS turn. | ||
| // | ||
| // R11.1: use Number.isFinite so a hostile / buggy provider | ||
| // payload (NaN, Infinity, non-number) coerces to 0 instead |
There was a problem hiding this comment.
[Critical] lastPromptTokenCount missing Number.isFinite guard
Unlike lastCandidatesTokenCount (guarded 18 lines below with the R11.1 Number.isFinite check), lastPromptTokenCount is assigned directly without validation. The R11.1 comment explicitly warns that "NaN >= hard is always false" would "silently disable hard-tier rescue" — yet the same class of bug exists on the far more impactful primary field.
A provider returning -Infinity for promptTokenCount passes the if (lastPromptTokenCount) truthiness check and is stored. -Infinity < auto is always false → cheap-gate never fires → auto-compaction silently disabled.
| // payload (NaN, Infinity, non-number) coerces to 0 instead | |
| const rawPromptTokens = | |
| usageMetadata.promptTokenCount ?? usageMetadata.totalTokenCount; | |
| if (rawPromptTokens) { | |
| this.lastPromptTokenCount = Number.isFinite(rawPromptTokens) | |
| ? (rawPromptTokens as number) | |
| : this.lastPromptTokenCount; |
— qwen-latest-series-invite-beta-v34 via Qwen Code /review
| // hard-rescue uses force=true to bypass the breaker, so it would | ||
| // otherwise sidestep that gate. Skip at the source. | ||
| const autoCompactionDisabled = | ||
| this.config.getChatCompression()?.disabled === true; |
There was a problem hiding this comment.
[Critical] autoCompactionDisabled hard-rescue guard has no integration test
The guard suppressing hard-tier rescue when chatCompression.disabled === true has no test at the sendMessageStream level. The service-level cheap-gate NOOP test exists (chatCompressionService.test.ts:675), but nothing verifies that shouldForceFromHard stays false when disabled.
Without this test, a regression removing !autoCompactionDisabled from the wantHardRescue expression would silently force-compress sessions whose user explicitly opted out.
Suggested test: set mockConfig.getChatCompression to return { disabled: true }, set lastPromptTokenCount above the hard threshold, send a message, and assert compressSpy.mock.calls[0][1].force === false.
— qwen-latest-series-invite-beta-v34 via Qwen Code /review
| // Services | ||
| // ============================================================================ | ||
|
|
||
| export { |
There was a problem hiding this comment.
[Critical] MAX_CONSECUTIVE_FAILURES not exported from barrel
The design doc's migration guide instructs SDK consumers to "change true to MAX_CONSECUTIVE_FAILURES", but this constant is not re-exported from the package barrel. Consumers following the documented path cannot import { MAX_CONSECUTIVE_FAILURES } from '@qwen-code/qwen-code-core'.
| export { | |
| export { | |
| computeThresholds, | |
| MAX_CONSECUTIVE_FAILURES, | |
| SUMMARY_RESERVE, | |
| type CompactionThresholds, | |
| } from './services/chatCompressionService.js'; |
— qwen-latest-series-invite-beta-v34 via Qwen Code /review
| persistedOutputTokens, | ||
| ); | ||
| } | ||
| } |
There was a problem hiding this comment.
[Suggestion] Hook contract change: firePostCompactEvent now receives XML-wrapped summary
The summary parameter passed to hooks is now <state_snapshot>...</state_snapshot> (constructed at line 598), whereas previously it was raw model text (thoughts filtered). External hook consumers that parse, display, or store compact_summary will now see XML envelope tags they did not expect.
Consider passing snapshotMatch[1] (the inner content without wrapper tags), or documenting the format change in the hook-event type definition.
— qwen-latest-series-invite-beta-v34 via Qwen Code /review
| ); | ||
| lines.push(''); | ||
| lines.push(`**${t('Compaction thresholds')}**`); | ||
| // R6.13: i18n the labels + derive the reserve hint from the |
There was a problem hiding this comment.
[Suggestion] Inconsistent i18n: new labels use t(), existing category labels do not
New threshold labels are wrapped in t() (e.g., t('Effective window'), t('Warn threshold')), but adjacent category labels remain raw strings: fmtCategoryRow('System prompt', ...), fmtCategoryRow('Built-in tools', ...), etc.
Non-English locales will see a mixed-language render. Either wrap all labels in t() or add a // TODO(i18n): wrap remaining category labels comment to track the gap.
— qwen-latest-series-invite-beta-v34 via Qwen Code /review
| * runtime would treat the session as sitting in. | ||
| */ | ||
| const DEFAULT_COMPRESSION_THRESHOLD = 0.7; | ||
| function currentTier( |
There was a problem hiding this comment.
[Suggestion] currentTier small-window degradation branch untested
When thresholds.hard === thresholds.auto (e.g., 32K window: both are 22,400), the guard thresholds.hard > thresholds.auto prevents misclassifying the tier as 'hard'. All /context tests use a 200K window where hard > auto, so this branch is never exercised.
Suggested test: use makeMockConfig(32_000) with lastPromptTokenCount = 25_000, assert currentTier === 'auto' (not 'hard').
— qwen-latest-series-invite-beta-v34 via Qwen Code /review
| * design discussion. | ||
| */ | ||
| const SNAPSHOT_REGEX = new RegExp( | ||
| `[\\s\\S]*<${COMPRESSION_SNAPSHOT_TAG}>([\\s\\S]*?)</${COMPRESSION_SNAPSHOT_TAG}>`, |
There was a problem hiding this comment.
[Suggestion] Truncation guard false-positive on partial closing tag
SNAPSHOT_REGEX requires the full closing </state_snapshot> tag. When the model writes a complete snapshot body but hits COMPACT_MAX_OUTPUT_TOKENS mid-closing-tag (e.g., emits </state_snap), the regex returns no match and a semantically-complete summary is discarded as COMPRESSION_FAILED_OUTPUT_TRUNCATED.
After 3 such events the breaker trips and auto-compaction stops. Consider accepting partial closing tags or verifying the captured content contains expected child tags (<overall_goal>, <current_plan>) as a more robust completeness check.
— qwen-latest-series-invite-beta-v34 via Qwen Code /review
|
Closing in favor of #4345 — a clean re-cut of the original spec + early-discovered real bugs (R1-R5, 6 commits, 4288 LOC, no AI-driven scope creep). This PR ran 12 review rounds and accumulated ~2700 LOC of review-driven additions, much of it self-inflicted regression chains and AI-reviewer scope creep. The PR became un-reviewable. Substantive review-driven refinements (hard-rescue counter, scratchpad envelope extraction, disable escape hatch, hostile-provider hardening, etc.) are being split into focused follow-up issues for independent review. Branch state preserved at tag The triage retrospective that drove this revert is also being folded into the project's pr-triage skill — adds a round-weighted decline bar (defaults Suggestions/test gaps/docs/observability to |
…istoryShallow) Main landed #4286 (replace structuredClone with shallow copy) which: - Reverted #4186's heap-pressure auto-compaction safety net (#4286 removed HEAP_PRESSURE_COMPRESSION_RATIO because the underlying OOM cause was fixed by the shallow-copy refactor) - Reverted #4168's consecutiveFailures ladder back to single-shot hasFailedCompressionAttempt - Introduced getHistoryShallow() / peekLastHistoryEntry() to replace structuredClone-based history access - Added a Chinese-language design doc draft for this exact redesign Resolution strategy: - Take OUR redesign everywhere it conflicts: three-tier threshold ladder, consecutiveFailures circuit breaker, hard-rescue, token estimator, hard-rescue debug log, CompressOptions plumbing for pendingUserMessage / precomputedEffectiveTokens / trigger. - DROP all bypassTokenThreshold / heapPressureCompressionCooldownUntil / HEAP_PRESSURE_* / mockGetHeapStatistics / mockHeapPressure code (heap-pressure mechanism is gone on main; we're not reviving it). - Use main's new getHistoryShallow(true) in chatCompressionService and in the hard-tier rescue estimator path (was getHistory(true) before main's refactor; the shallow path is what other compaction call sites now use). - For chatCompressionService.test.ts inline mockChat objects, alias getHistoryShallow to the same vi.fn() as getHistory so existing .mockReturnValue() calls drive both methods. - For the design doc, keep our resolved Open Question 2 closure rationale and prepend the round-2 blockquote clarifying the Background section describes pre-redesign behavior; take main's slightly more thorough SUMMARY_RESERVE paragraph where it explains both with/without-thinking cases. - Replace the round-2 test that asserted "hard-rescue forwards consecutiveFailures=3" with a test compatible with the post-merge history-access shape (now using getHistoryShallow). 346 core tests passing; CLI typecheck clean for affected files. Pre-existing provider-config typecheck errors from main's #4287 refactor are unrelated to this PR and not touched here.
…er (#4345) * feat(core)!: redesign auto-compaction thresholds with three-tier ladder Replaces the single 70% proportional threshold with a three-tier ladder (warn/auto/hard) that combines proportional fallback with absolute reservation. Large-window models (>=128K) now reserve ~33K instead of 30% of the window, freeing tens of thousands of context tokens that the old formula wasted. Other improvements bundled in the same redesign: - Compression sideQuery now disables thinking and caps maxOutputTokens at 20K, matching claude-code so the buffer math is predictable across providers (Anthropic/OpenAI/Gemini handle thinking budgets inconsistently) - Failure handling upgraded from one-shot permanent lock to a 3-strike circuit breaker; reactive overflow still latches immediately - New estimatePromptTokens helper closes the lag-by-one-turn and first-send-is-0 gaps in lastPromptTokenCount - Hard-tier rescue pulls reactive overflow recovery forward to before the API call, saving an oversized round-trip - /context command displays the three-tier ladder + current tier - tipRegistry's context-* tips track the new thresholds instead of fixed 50/80/95 percentages BREAKING CHANGE: chatCompression.contextPercentageThreshold setting is removed. Settings files containing the field log a one-line deprecation warning at startup and the value is ignored; behaviour is now controlled by built-in thresholds via the new computeThresholds() function. Design: docs/design/auto-compaction-threshold-redesign.md Plan: docs/plans/2026-05-14-auto-compaction-threshold-redesign.md * test(core): fix leftover hasFailedCompressionAttempt option in compress test A pre-existing test case at chatCompressionService.test.ts:678 still passed `hasFailedCompressionAttempt: false` in the CompressOptions shape; rebasing onto current main surfaced this as a typecheck error because the field was renamed to `consecutiveFailures` (Task 7 of the three-tier ladder migration). Update to `consecutiveFailures: 0` — semantically equivalent, the test asserts the side-query is called when `force: true`, no other behaviour change. * fix(core): drop compaction summary when output hits maxOutputTokens cap Adds a defensive guard in ChatCompressionService.compress() that detects when the side-query summary hit COMPACT_MAX_OUTPUT_TOKENS (20K). In that case the summary is likely truncated mid-content, so we drop it and return NOOP rather than persist a half-summary. The next send re-tries; reactive overflow still catches the catastrophic case where the API rejects the next request as too large. Documented in the design doc as risk #2; the bot reviewer on PR #4168 correctly pushed for it to land alongside the threshold redesign rather than as a follow-up since the new 20K cap is what makes truncation likely in the first place. * fix(cli): render three-tier thresholds in /context TUI view The Task 11 redesign updated the non-interactive text formatter (formatContextUsageText) but left ContextUsage.tsx — the interactive React component that real /context users see — unchanged. As a result the TUI still showed the old single "Autocompact buffer" line and none of the new warn/auto/hard ladder. Adds a "Compaction thresholds" section after the per-category breakdown: - Effective window - Warn / Auto / Hard threshold rows with a ▶ marker on the row the current usage has crossed - Current tier label coloured by severity (safe→green, warn/auto→ yellow, hard→red) The existing progress bar legend (Used / Free / Autocompact buffer) is preserved because it's tied to the three-segment progress bar visualisation; the new section adds the absolute numbers + tier badge on top of that. Caught by the tmux e2e test (PR #4168 ci-monitor follow-up). Pre-fix the assertion 'Compaction thresholds' missed completely from the TUI; post-fix the new section renders correctly for fresh and live sessions on 1M / 200K / 128K windows. * fix(core,cli): address PR #4168 review batch 4 Behavior fixes: - MAX_TOKENS truncation guard now returns COMPRESSION_FAILED_EMPTY_SUMMARY instead of NOOP so the consecutive-failure breaker actually trips after repeated max-length summaries (R1.1). - Reactive overflow failure increments consecutiveFailures by 1 instead of latching to MAX in one shot, so a transient network blip doesn't permanently disable auto-compaction. The hard-tier rescue resets the counter, which remains the designated recovery path (R1.2). - /context current-tier classification uses rawOverhead (system + tools + memory + skills) as the tier input when API data is not yet available, rather than 0 — large inherited contexts no longer silently show 'safe' (R2.2). Performance: - sendMessageStream computes effectiveTokens ONCE and passes it through TryCompressOptions.precomputedEffectiveTokens, so the cheap-gate inside service.compress doesn't redo the estimation. Also fixes the imageTokenEstimate inconsistency between the rescue and cheap-gate paths (R1.3 + R1.4). - Steady-state path (lastPromptTokenCount > 0) skips the costly getHistory(true) clone — estimatePromptTokens only needs the user message in that branch. Code hygiene: - BYTES_PER_TOKEN → CHARS_PER_TOKEN (inputs are char counts, not byte counts; CJK text would mislead under the old name) (R3.1). - Drop dead getContextUsagePercent helper + index re-export — no callers in source after the threshold rewire (R1.5). - Add a comment on estimatePromptTokens' first-send fallback documenting the ~15-20K under-estimate (system prompt + tools + skills) and that reactive overflow is the safety net (R3.3). Tests: - New CLI ContextUsage.test.tsx exercises the React renderer for the three-tier section: section presence, ▶ marker placement per tier, current-tier label coloring (R1.6). - New chatCompressionService.test.ts case pins that a stale contextPercentageThreshold: 0 value in user settings no longer short-circuits compaction (R2.1). - New tokenEstimation.test.ts case covers functionResponse (distinct nested-parts branch from functionCall) (R3.5). - New geminiChat.test.ts integration test exercises the real ChatCompressionService — not a mock — for the first-send-after- inherited-history scenario where lastPromptTokenCount=0 and only the full-history estimate can cross the auto threshold (R3.4). Declined: R3.2 (change `>=` to `>` on the MAX_TOKENS guard). The current operator catches the at-cap case as suspicious, which is intentional — landing exactly at the output cap is far more likely truncation than clean stop given p99.99 ≈ 17K. With R1.1 in place, persistent truncations trip the breaker after MAX_CONSECUTIVE_FAILURES so the worst case is bounded. * fix(core,cli): address PR #4168 review batch 5 - R5.1: tighten /context tier comment + TODO. The rawOverhead-based fix doesn't cover `--continue` restores with many history messages (since rawOverhead excludes messagesTokens). UI may still show 'safe' for one render until the first send. Documented inline and added a TODO to plumb chat history into collectContextData for same-source-of-truth as the cheap-gate. - R5.2a: add TODO(finish_reason) at the truncation guard. The `>= cap` heuristic false-positives on legitimate at-cap summaries; the proper signal is finish_reason which runSideQuery doesn't surface today. - R5.2b: split telemetry — new CompressionStatus.COMPRESSION_FAILED_OUTPUT_TRUNCATED enum value. Distinct from EMPTY_SUMMARY so logs/telemetry can tell prompt-quality failures (tune prompt / splitter) from capacity failures (raise cap / shrink splitter input). isCompressionFailureStatus() treats both as failures so the breaker behavior is unchanged. - R5.3: expand consecutiveFailures JSDoc to clarify it tracks "non-force, non-hard-rescue consecutive failures" — hard-rescue resets the counter and force=true skips increments, so the counter is the "regular path" health signal only; reactive overflow is the real safety net for the force-only paths. - R5.4: document the CompressOptions field rename (hasFailedCompressionAttempt: boolean → consecutiveFailures: number) as an SDK breaking change in the design doc with migration guide. * fix(core): disambiguate hard-rescue from manual /compress orphan-strip Self-review (dual reviewer / pr-triage round 1) caught a correctness regression in the hard-rescue path: `sendMessageStream` calls `tryCompress(force=true)` from inside the pre-push window when `effectiveTokens >= hard`. The service's orphan-strip predicate at `chatCompressionService.ts:426-429` gated on `force` alone, which conflated two distinct call shapes: - manual `/compress` (force=true, trigger='manual'): user-initiated between turns; trailing model funcCall IS orphaned because no funcResponse is coming - hard-rescue (force=true, trigger='auto'): automatic mid-turn; trailing model funcCall is ACTIVE because its matching funcResponse is sitting in the pending `userContent` waiting to be pushed The strip fired for both, so a hard-rescue triggered mid tool-use loop would drop the active funcCall. After compression returned and `userContent` (the funcResponse) was pushed, the next API request carried tool_result with no matching tool_use → provider validation error. The in-code comment at L422-424 already documented this exact constraint for the auto-compress case (`force=false`), but reusing `force=true` for hard-rescue silently violated the same constraint. Fix: - Gate `hasOrphanedFuncCall` on `compactTrigger === 'manual'` instead of `force`. The trigger field already disambiguates intent. - `sendMessageStream` hard-rescue now passes `trigger: 'auto'` explicitly (without it, `force=true` defaults to `trigger='manual'` via the `?? (force ? 'manual' : 'auto')` resolver). Sibling audit for "force=true non-manual callsites": - `GeminiClient.tryCompressChat` (manual /compress): correct — manual - `sendMessageStream` hard-rescue: fixed in this commit - `sendMessageStream` reactive overflow catch: already passes trigger='auto'; runs AFTER API call (userContent in history), so if it observes a trailing funcCall it IS orphaned but findCompressSplitPoint handles the case without needing the strip RED-first regression test added: `preserves trailing model+funcCall under hard-rescue (force=true + trigger=auto)` in `chatCompressionService.test.ts`. Failed against pre-fix code (the strip dropped the funcCall); passes against the fix. Adjacent fixes from the same triage round: - `docs/users/configuration/settings.md`: the `chatCompression.contextPercentageThreshold` row still said "use 0 to disable compression entirely" — code has ignored the value since the removal commit. Marked the row REMOVED with migration guidance pointing at the design doc. - `packages/core/src/config/config.ts`: the deprecation warning now tells users how to silence it (remove the key) and where to read current behavior, instead of just announcing the removal. - `docs/design/auto-compaction-threshold-redesign.md`: closed Open Question 2 (small-window hard/auto collapse) — decision is to NOT annotate `/context`, with rationale on file. Tests: 2395 core tests passing, typecheck clean. * docs(core): fix tier-collapse direction in auto-compaction design doc Self-review on the 50bac97 commit caught a direction error in the M2a Open Question 2 closure note: said `currentTier` skips `'hard'` and goes to `'auto'` on collapsed windows, which is backwards. `contextCommand.ts:43-44` checks `tokens >= thresholds.hard` first (no `hard > auto` guard — that fix lives in a separate follow-up), so when `hard === auto` the `'hard'` branch matches first and the `'auto'` band is the empty one. Updated the rationale to describe the actual collapse direction and cite the source-of-truth file:line. Conclusion of the open question (don't annotate `/context`) is unchanged — only the explanation is corrected. * refactor(core): extract shared in-flight funcCall fixture in compression tests The auto-compress and hard-rescue tests for "trailing funcCall is active, not orphaned" shared a byte-identical 4-message history and mock setup. Pull both into setupInFlightFuncCallFixture() inside the describe block so each test only contains the scenario name, the compress() call shape, and its own assertions. Net -29 LOC, no behavior change. * fix(core,cli): address PR #4345 round-2 review feedback - geminiChat: remove pre-call consecutiveFailures reset in hard-rescue. force=true already bypasses the breaker check in chatCompressionService; the pre-reset was redundant on success (post-call L614 already handles it) and *broke* the breaker on failure paths — hard-rescue failures don't increment via tryCompress (force=true skips that branch), only the reactive overflow path at L992 explicitly increments. With the pre-reset the counter oscillated 0↔1 every send and MAX_CONSECUTIVE_FAILURES=3 was unreachable. Wrote a RED test asserting the forwarded counter is the latched value, not zero; the test failed against the old code and passes with the reset removed. - geminiChat: log hard-tier-rescue triggers via debugLogger.warn including effectiveTokens, hard, and the current consecutiveFailures so operators debugging "compaction stopped working" have a breadcrumb. - chatCompressionService: clamp effectiveWindow to >= 0 in computeThresholds so the value surfaced in /context stays meaningful for tiny windows (window < SUMMARY_RESERVE). auto/warn/hard outputs are unaffected because each is Math.max(proportional, absolute) and the proportional branch dominates whenever the absolute branch goes negative. - turn.ts: rewrite COMPRESSION_FAILED_OUTPUT_TRUNCATED docstring. Drop the misleading "compression succeeded" framing (the summary is dropped and isCompressionFailureStatus returns true) and reference the full enum name COMPRESSION_FAILED_EMPTY_SUMMARY instead of the abbreviation. - contextCommand.test.ts: reword the no-API-data-session test comment. collectContextData classifies estimated sessions against rawOverhead; with default fixtures rawOverhead lands in `safe`, but heavy system-prompt / skill / MCP loads can push it into warn/auto/hard. - design doc Background: prepend a blockquote clarifying the section describes pre-redesign behavior and that the inline file:line references point at code before PR #4345 (which removes them). - ui/types: replace the duplicated ContextThresholds interface with a type alias to the core's CompactionThresholds. Field-by-field copy in contextCommand.ts becomes a direct spread. ContextUsage.tsx keeps its CompactionThresholds React component name — the alias avoids the collision a direct import would have caused. - contextCommand: interpolate the actual reserve value into the "(window − 20K reserve)" annotation so SUMMARY_RESERVE retuning doesn't leave the text stale. * fix(core): address PR #4345 round-3 + round-4 review feedback R3-1: rewrite the stale "Hard-tier rescue resets the counter" comment in the reactive-overflow path. The R2 commit removed the pre-call reset from hard-rescue; the only counter-reset path is now the post-call COMPRESSED branch in tryCompress. Two contradicting comments in the same file would mislead a future maintainer tracing the lifecycle. R3-2: rewrite the JSDoc on CompactionThresholds.hard. The "(resets failure counter)" phrasing was true under the pre-R2 design; after R2 the hard threshold force-triggers compaction and bypasses the breaker, but does not reset the counter (which only happens on COMPRESSED success via the post-call branch). The type is consumed by both geminiChat and the CLI UI (via ContextThresholds alias), so the authoritative description had to match the actual contract. R3-3: add a Step 3 to the hard-rescue regression test. The test title claims "success recovers via the post-call branch" but the original Steps 1-2 only verified the latched counter was forwarded INTO the call. Step 3 follows up with a below-hard send and asserts the forwarded counter is 0 — proving geminiChat.ts:614 ran on the COMPRESSED result. R3-4: assert effectiveWindow === 0 on the existing extreme-small-window test and add a separate zero-window edge case. The Math.max(0, ...) clamp from R2 was previously unasserted; a regression that removed the clamp would go undetected. R4-1: forward originalTokenCount on the breaker-NOOP path in chatCompressionService.compress() to match the adjacent threshold-NOOP path (L368-369). Returning {originalTokenCount: 0, newTokenCount: 0} masked "breaker tripped at N tokens" as "empty session" in telemetry dashboards. R4-2a: add debugLogger.warn at the two consecutiveFailures increment sites (cheap-gate path L586 and reactive-overflow path L955) when the counter reaches MAX_CONSECUTIVE_FAILURES. The breaker is one of the PR's headline safety features but, prior to this round, had zero observability when it tripped. Required importing MAX_CONSECUTIVE_FAILURES into geminiChat.ts. R4-3: programmatically link tokenEstimation.ts's CHARS_PER_TOKEN to compactionInputSlimming.ts's TOKEN_TO_CHAR_RATIO. Both are 4 today and represent the same generic char/token conversion. Exporting from compactionInputSlimming and aliasing in tokenEstimation eliminates the silent-drift hazard the JSDoc already warned about. Declined (round-weighted bar at round 4): - R3-5: debugLogger test for hard-rescue trigger — observability test coverage is overthinking at round 3+; the log is informational. - R4-2b: expose breaker state in /context — new feature; out of scope. - R4-4: render test for auto-tier marker — test coverage gap on working code, defer to follow-up PR per round-weighted bar. - R4-5a: extract makeFakeChat/makeFakeConfig shared factory — pure test refactor at round 4, not a fix. - R4-5b: direct unit test for precomputedEffectiveTokens — exercised indirectly via hard-rescue path tests in geminiChat.test.ts. - R4-6: truncation-guard fallback test for missing candidatesTokenCount — code already has a TODO acknowledging the heuristic is imperfect (chatCompressionService.ts:549-553); defer. * fix(core): address PR #4345 round-5 review feedback R5-1: assert breaker-NOOP forwards originalTokenCount. R4-1 changed the breaker-NOOP return from `{0, 0}` to `{originalTokenCount, originalTokenCount}` so telemetry can distinguish "breaker tripped at N tokens" from "empty session", but the existing test only checked compressionStatus and newHistory. Now seeds a non-zero originalTokenCount (120K) and asserts both fields forward it. R5-2: forward originalTokenCount on the empty-history NOOP. This was sibling drift on R4-1 — I fixed the cited breaker-NOOP site but missed the empty-history NOOP. Of 5 NOOP return sites in chatCompressionService, 4 now forward originalTokenCount (breaker, threshold-gate, post-split, min-compression-fraction) and 1 (this one) was still returning `{0, 0}`, breaking the project-wide invariant. Now consistent. R5-3: replace 10 stale line-number references with semantic anchors. After the R3+R4 push, the line refs in my R2/R3 comments (`geminiChat.ts:614`, `chatCompressionService.ts:339`, `line 992`, `L627`, `line 944`) no longer pointed at their original targets — `geminiChat.ts:614` now points at `setSystemInstruction`'s body, completely unrelated to compaction. The pattern itself is fragile; semantic phrasing ("the post-call reset in tryCompress's COMPRESSED handler") doesn't drift when lines shift. 347/347 affected core tests passing locally; typecheck clean. * fix(core): address PR #4345 round-6 review feedback (R6 sweep) R6-1: rewrite the stale JSDoc bullet on `consecutiveFailures` (the "Hard-tier rescue failures" bullet). The old wording said "the counter is reset to 0 BEFORE the rescue call" — that contradicted R5 which explicitly removed the pre-call reset. Now the bullet matches the actual behavior: counter is NOT pre-reset, force=true bypasses the breaker, post-call COMPRESSED handler resets on success, reactive overflow is the explicit-increment safety net. My R5 stale-comment sweep only grep'd inline `//` comments; this JSDoc on the field declaration slipped through. Re-audited "reset to 0 BEFORE" / "pre-reset" across both packages — single site remaining. R6-7: assert `passedOpts.trigger === 'auto'` in the hard-rescue test. This field is the orphan-strip safety wire added by the C1 fix (the service's `compactTrigger === 'manual'` check would otherwise strip the trailing active funcCall mid tool-loop). The test asserted force and pendingUserMessage but not the trigger; a refactor dropping the 'auto' from `trigger: shouldForceFromHard ? 'auto' : undefined` would silently break orphan-strip safety. Now regression-guarded with a single-line expect. 164/164 affected core tests passing locally. Declined per round-weighted bar (round 6 defaults Suggestion / Test coverage / Style to overthinking): - R6-2/3/6: test-coverage gaps on working code — defer to follow-up - R6-4: redundant truthy guard on always-set fields — style nit - R6-5: text-vs-UI inconsistency on /context — existing test enforces current behavior; treat as design decision (offer follow-up if reviewer escalates) - R6-8 (tipRegistry small-window context-high): explicitly closed in design doc's Open Question 2 — small windows have empty context-high band by design; UI work is out-of-scope for this PR - R6-9: wasted clone on rare fallback path — Suggestion-level perf - R6-10 (CompressionMessage missing case): file not in this PR's diff; reviewer themselves proposed it as follow-up
Summary
maxOutputTokenson the compression sideQuery, upgrades failure handling from a one-shot lock to a 3-strike circuit breaker, adds a local token estimator for the cheap-gate, plumbs a hard-tier rescue intosendMessageStream, rewires the/contextcommand andtipRegistrytips around the new thresholds, and removes thechatCompression.contextPercentageThresholdsetting.--continuecoverage (lastPromptTokenCount = 0previously bypassed all gates), and predictable buffer math across providers (thinking budget semantics vary).packages/core/src/services/chatCompressionService.ts— newcomputeThresholds(), tier constants, cheap-gatepackages/core/src/services/tokenEstimation.ts— local char/4 estimatorpackages/core/src/core/geminiChat.ts— hard-tier rescue +consecutiveFailuresbreakerpackages/cli/src/ui/commands/contextCommand.ts—/contextdisplayValidation
```bash
npm run typecheck # clean (4 workspaces)
npm run lint # clean (project files; pre-existing e2e-testing/scripts/*.js untouched)
cd packages/core && npx vitest run src/services src/core # 1930/1930 pass
cd packages/cli && npx vitest run # 5995/5995 pass + 9 skipped
```
Threshold table across windows (matches design doc):
Scope / Risk
Testing Matrix
Testing matrix notes:
Design references
Both are committed in this PR so the rationale is visible alongside the code.