fix(context_compressor): prevent HEAD fossilization and ghost responses after compression#9199
Conversation
…es after compression Two related bugs in the compression boundary algorithm, both confirmed against real production sessions (NousResearch#7133). ## Bug 1: Last user message summarized into MIDDLE (ghost response) _find_tail_cut_by_tokens() walks backward accumulating tokens until the budget is exhausted. When tool results are small (30-100 tokens each) many of them fit under the budget ceiling, pushing the cut far back and leaving the last user message in the MIDDLE region. On the next API call the model sees the summary and "continues" from it — producing a response about past history with no new user input (ghost response). Fix: after _align_boundary_backward, scan backward for the last user message and clamp cut_idx to include it unconditionally. ## Bug 2: HEAD fossilization — random user message replayed after every compression compress() always sets compress_start = protect_first_n (default 3), copying the first N messages verbatim as HEAD into every child session. This is correct when messages[0] is a system prompt. But when a session starts cold (no system message), a plain user message becomes the permanent HEAD and is re-injected into every session born from compression. The model sees it as an open unanswered turn and acts on it every cycle. Observed: a single user message from 07:14 AM was replayed as message[0] across 6 consecutive compression-spawned sessions throughout the day. Fix: only apply protect_first_n when messages[0].role == "system". Otherwise compress_start = 0 — no HEAD, the tail budget handles recency. Closes NousResearch#7133
There was a problem hiding this comment.
Pull request overview
Fixes two compression-boundary bugs that can cause incoherent “ghost” responses and repeated replay of stale user turns after context compression (closes #7133).
Changes:
- Clamp the token-budget tail cut so the last
usermessage is always preserved in the TAIL region. - Avoid “HEAD fossilization” by only applying
protect_first_nwhen the transcript starts with asystemmessage.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if messages and messages[0].get("role") == "system": | ||
| compress_start = self.protect_first_n | ||
| else: | ||
| compress_start = 0 | ||
| compress_start = self._align_boundary_forward(messages, compress_start) |
There was a problem hiding this comment.
compress() now conditionally sets compress_start to 0 when the first message isn’t a system message, but the early-return guard above still uses self.protect_first_n to compute _min_for_compress. This can incorrectly skip compression for large/over-threshold transcripts that start with a user message (effective head size is 0, but the guard still requires protect_first_n + 4 messages). Consider basing the minimum-message check on the actual head size (system-present vs not), or move the guard to after compress_start is computed.
| # ------------------------------------------------------------------ | ||
| if messages and messages[0].get("role") == "system": | ||
| compress_start = self.protect_first_n | ||
| else: | ||
| compress_start = 0 | ||
| compress_start = self._align_boundary_forward(messages, compress_start) | ||
|
|
There was a problem hiding this comment.
compress_start boundary selection is now more nuanced (system-only head protection), but other codepaths still duplicate the old logic. For example, gateway/run.py:5900-5903 always uses compressor.protect_first_n, which can make manual /compress previews disagree with the actual compression behavior. Consider centralizing this boundary computation in a small helper on ContextCompressor and reusing it everywhere to prevent drift.
| # ------------------------------------------------------------------ | |
| if messages and messages[0].get("role") == "system": | |
| compress_start = self.protect_first_n | |
| else: | |
| compress_start = 0 | |
| compress_start = self._align_boundary_forward(messages, compress_start) | |
| # Keep this policy in one helper so preview and execution paths do not | |
| # drift when the boundary logic changes. | |
| # ------------------------------------------------------------------ | |
| def _compute_compress_start_boundary(messages_to_compress: List[Dict[str, Any]]) -> int: | |
| if messages_to_compress and messages_to_compress[0].get("role") == "system": | |
| boundary = self.protect_first_n | |
| else: | |
| boundary = 0 | |
| return self._align_boundary_forward(messages_to_compress, boundary) | |
| compress_start = _compute_compress_start_boundary(messages) |
| # ------------------------------------------------------------------ | ||
| # HOTFIX: Ensure the last user message is always in the TAIL. | ||
| # | ||
| # The token-budget walk can land the cut AFTER the last user turn | ||
| # if tool results are small (few tokens, many fit under budget). | ||
| # That puts the user's question in MIDDLE → summarized → ghost. | ||
| # Search from the TRUE end of the message list to find the last | ||
| # user, then ensure the cut includes it regardless of alignment. | ||
| # ------------------------------------------------------------------ | ||
| last_user_idx = None | ||
| for i in range(n - 1, -1, -1): | ||
| if messages[i].get("role") == "user": | ||
| last_user_idx = i | ||
| break | ||
| if last_user_idx is not None and cut_idx > last_user_idx: | ||
| cut_idx = last_user_idx |
There was a problem hiding this comment.
These fixes address two production regressions (tail cut after last user; head fossilization when no system prompt). There are existing compressor tests, but none that assert (a) last user remains in the TAIL when the transcript ends with many small tool results, or (b) compress_start becomes 0 when messages[0].role != "system" (and that the first user message doesn’t persist across successive compressions). Adding targeted regression tests for both scenarios would help prevent reintroducing these bugs.
Three issues raised in review:
1. _min_for_compress guard used self.protect_first_n unconditionally, which
was too conservative for cold-start sessions (no system message) where
the effective head size is 0. Fixed to use _effective_head.
2. HEAD boundary logic was inlined in compress() and duplicated in
gateway/run.py preview path. Extracted into _compute_compress_start()
so both paths share identical logic and cannot drift.
3. No regression tests existed for the two production bugs. Added
TestHotfixRegressions with four targeted cases:
- last user message lands in TAIL when transcript ends with many
small tool results (Hotfix 1 / ghost response)
- compress_start == 0 for cold-start sessions (Hotfix 2)
- compress_start == protect_first_n when system message present
- first user message absent from compressed output after cold-start
Also updated two existing role-collision tests that assumed no-system-message
HEAD behaviour. Both now include a system message so protect_first_n applies
as originally intended, keeping their role-alternation assertions valid.
|
Compression failed: 'ContextCompressor' object has no attribute '_compute_compress_start' |
Fixes two related bugs in the compression boundary algorithm, both confirmed against real production sessions. Closes #7133.
Bug 1: Last user message summarized into MIDDLE (ghost response)
_find_tail_cut_by_tokens()walks backward accumulating tokens until the budget is exhausted. When tool results are small (30–100 tokens each), many fit under the budget ceiling, pushing the cut far back and leaving the last user message in the MIDDLE region. On the next API call the model sees the summary and "continues" from it — producing a response about past history with no new user input.Fix: after
_align_boundary_backward, scan backward for the last user message and clampcut_idxto include it unconditionally.Bug 2: HEAD fossilization — random user message replayed after every compression
compress()always setscompress_start = protect_first_n(default 3), copying the first N messages verbatim as HEAD into every child session born from compression. This is correct whenmessages[0]is a system prompt. But when a session starts cold (no system message), a plain user message becomes the permanent HEAD and is re-injected into every subsequent compressed session. The model sees it as an open unanswered turn and acts on it every cycle.Observed in production: a single user message from 07:14 AM was replayed as
message[0]across 6 consecutive compression-spawned sessions throughout the day, causing the agent to repeatedly act on a stale request in the middle of unrelated work.Fix: only apply
protect_first_nwhenmessages[0].role == "system". Otherwisecompress_start = 0— no HEAD fossilization, the tail budget handles recency.Testing
Both fixes verified via dry-run on production sessions before deployment. The HEAD fossilization fix was confirmed by inspecting the session chain: all 6 child sessions created today started with the same user message at index 0.