fix(cli): cap per-turn compaction attempts#9344

Merged
alex-alecu merged 1 commit into main from fix/infinite-compact on Apr 22, 2026

@alex-alecu


Why

If a model kept saying the conversation was too big after every compaction, the chat would get stuck in a forever "busy" loop and eventually look like it finished normally even though nothing happened.

What changed

Each turn now tracks how many times it has tried to shrink the conversation. After three attempts, the chat stops looping, marks the turn as an error, and shows a clear "context overflow" message instead of pretending everything was fine. Until that cap is reached, everything behaves exactly as it did before, so normal overflow-then-recover cases still work. The error also reaches anything listening for turn-close events, so tools and UIs now see "error" instead of "completed" for these stuck turns.
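
As a rough illustration of the capped loop described above, here is a minimal sketch. Every name in it (runTurn, StepResult, TurnOutcome, MAX_COMPACTION_ATTEMPTS) is hypothetical and chosen for the example; it is not the actual SessionPrompt.runLoop code, and the exact point at which the real implementation counts an attempt may differ.

```typescript
// Hypothetical sketch of a per-turn compaction cap; not the real runLoop.
type StepResult = "ok" | "overflow";
type TurnOutcome = {
  reason: "completed" | "error";
  error?: string;
  compactions: number;
};

// The PR description states a cap of three attempts per turn.
const MAX_COMPACTION_ATTEMPTS = 3;

function runTurn(step: () => StepResult, compact: () => void): TurnOutcome {
  let compactions = 0;
  while (true) {
    // A successful model step ends the turn normally.
    if (step() === "ok") return { reason: "completed", compactions };
    // The model reported an overflow. Once the cap is exhausted, close the
    // turn as an error instead of compacting again.
    if (compactions >= MAX_COMPACTION_ATTEMPTS) {
      return { reason: "error", error: "ContextOverflowError", compactions };
    }
    compact();
    compactions++;
  }
}
```

With a step that always overflows, this sketch compacts three times and then returns an error outcome; with a step that recovers after one or two compactions, it completes as before.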

How to test

  1. Point the CLI at a fake or local provider that always replies with HTTP 400 {"error":{"code":"context_length_exceeded"}}.
  2. Start a new chat and send any message.
  3. The chat should try to compact a few times, then stop and show a red "context overflow" error on the last assistant message instead of hanging on "busy".
  4. From packages/opencode/, run bun test test/kilocode/session-compaction-cap.test.ts — all five cases should pass.
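
For step 1, the fake provider only needs to return the same response for every request. The sketch below builds that response as plain data so it can be plugged into whatever local HTTP server is convenient; the FakeResponse type and the alwaysOverflow name are assumptions made for this example, and how the CLI is pointed at the server is not covered here.

```typescript
// Hypothetical helper for a fake provider that always reports an overflow.
type FakeResponse = {
  status: number;
  headers: Record<string, string>;
  body: string;
};

function alwaysOverflow(): FakeResponse {
  return {
    status: 400,
    headers: { "content-type": "application/json" },
    // Same payload shape as in step 1 of the test instructions.
    body: JSON.stringify({ error: { code: "context_length_exceeded" } }),
  };
}
```

Returning this for every request, regardless of path or payload, should be enough to keep the CLI on the overflow path until the cap is hit.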

When every compaction round still overflowed the model context, SessionPrompt.runLoop would keep calling compaction forever and report the turn as completed. Cap attempts at three per turn and surface exhaustion as a ContextOverflowError on the assistant message with TurnClose reason=error.
@alex-alecu


High-level issues on the cloud repo side

There is one function on the cloud side that is implicated in this bug:

apps/web/src/lib/ai-gateway/llm-proxy-helpers.ts:168, makeErrorReadable()

It has three distinct problems, only one of which matters for the infinite-loop bug, but all of which deserve attention:

Problem 1 — it rewrites any upstream 4xx/5xx into context_length_exceeded

The structure of the check is:

if (response.status < 400) return undefined;             // only run on errors
if (isUserByok) { ... }                                  // BYOK branch
const model = kiloExclusiveModels.find(m => m.public_id === requestedModel);
if (model) {
  const estimatedTokenCount = estimateTokenCount(request);
  if (estimatedTokenCount >= model.context_length) {
    // REWRITE: replace upstream error with a synthesized context_length_exceeded
    return NextResponse.json(
      { error, error_type: ProxyErrorType.context_length_exceeded, message: error },
      { status: response.status }
    );
  }
}

The stated intent (in the comment immediately above) is:

Sometimes we get generic or nonsensical errors when the context length is exceeded (such as "Internal Server Error" or "No allowed providers are available for the selected model")

— i.e. translate ambiguous overflow errors into a clearer message.

In practice there is no check that the upstream error was actually overflow-related. The rewrite fires for:

  • Genuine overflows ✅ correct
  • Provider outages (Novita/Minimax 500/502/503) ❌ wrong
  • Rate limits (429) ❌ wrong
  • Parse errors / malformed upstream responses ❌ wrong
  • Any other 4xx/5xx that happens to land on a large request ❌ wrong

So yes, we are wrongly transforming errors. Any time a Kilo-exclusive model has an upstream hiccup on a moderately sized request, the client sees context_length_exceeded instead of the real cause. The CLI then does what it's designed to do for that signal: compact and retry. If the upstream keeps failing, the cloud keeps translating and the CLI keeps compacting, which is the exact loop PR #9344 caps.
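
To make the missing check concrete, the decision can be reduced to a small model. This is a simplified sketch of the non-BYOK branch for a Kilo-exclusive model, not the real makeErrorReadable; wouldRewrite and the sample token estimate are hypothetical.

```typescript
// Simplified model of the rewrite decision: only status and size are consulted.
function wouldRewrite(
  status: number,
  estimatedTokens: number,
  contextLength: number,
): boolean {
  if (status < 400) return false; // only runs on errors
  // The upstream error body is never inspected here.
  return estimatedTokens >= contextLength;
}

const contextLength = 204_800; // minimax-m2.5:free, per the numbers in this comment
const estimated = 210_000; // hypothetical estimate for a large request

// Rate limits and provider outages are all rewritten once the estimate is high.
const rewritten = [429, 500, 502, 503].filter((status) =>
  wouldRewrite(status, estimated, contextLength),
);
```

In this model every status in the list ends up in rewritten, which mirrors the list of wrongly rewritten cases above.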

Problem 2 — the token estimate is badly inaccurate and over-counts

function estimateTokenCount(request: GatewayRequest) {
  return Math.round(JSON.stringify(request).length / 4 + (getMaxTokens(request) ?? 0));
}

Two issues:

  1. JSON.stringify(...).length / 4 — the "4 chars ≈ 1 token" heuristic is for plaintext English. JSON carries 50–100% overhead from quotes, braces, escape sequences, tool-definition scaffolding, cache-control markers, etc. So the "character side" of the estimate over-counts real tokens by roughly 1.5–2×.
  2. Adding max_tokens — the model's context_length is total budget (input + output). max_tokens is the output budget. Adding them is formally correct, but clients typically send max_tokens = model.max_completion_tokens (the model's advertised max output). For minimax/minimax-m2.5:free that's 131_072 out of a 204_800 context — so the comparison effectively becomes "is the estimated input alone ≥ 73,728 tokens?", and with the 2× over-count, an actual 37k-token input is already enough to trigger the rewrite.

Net effect: the rewrite fires on inputs that would actually fit in the model. On kilo-auto/free (minimax-m2.5:free, 204.8k context, 131k max output) this is very easy to reach after just a few tool calls.
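
The arithmetic behind that claim can be worked through directly. The sketch below uses the minimax-m2.5:free figures quoted above; estimateFromLength is a hypothetical variant of estimateTokenCount that takes the serialized length as a number, and the 2× over-count factor is the upper end of the rough range given earlier, not a measured value.

```typescript
const CONTEXT_LENGTH = 204_800; // total budget (input + output)
const MAX_OUTPUT = 131_072; // max_tokens as clients typically send it

// Same formula as estimateTokenCount, applied to a serialized length.
function estimateFromLength(jsonChars: number, maxTokens: number): number {
  return Math.round(jsonChars / 4 + maxTokens);
}

// Input budget left once max_tokens is added to the estimate.
const inputBudget = CONTEXT_LENGTH - MAX_OUTPUT; // 73_728 estimated tokens

// Serialized request size at which the check flips to true.
const triggerChars = inputBudget * 4; // 294_912 characters of JSON

// Assumption: JSON over-counts real tokens by about 2x.
const OVERCOUNT = 2;
const realTokensAtTrigger = inputBudget / OVERCOUNT; // 36_864 real tokens
```

Under these assumptions, a request of roughly 295k JSON characters, or about 37k real input tokens, is enough for the estimate to reach the model's full 204,800-token context.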

Problem 3 — the rewrite preserves the upstream status code

return NextResponse.json(
  { error, error_type: ProxyErrorType.context_length_exceeded, message: error },
  { status: response.status }        // <-- upstream status preserved
);

So a 500 upstream error becomes a 500 response with error.code === "context_length_exceeded". That shape confuses the CLI's error classifier. parseAPICallError treats statusCode === 413 as one overflow signal, which a preserved 500 never triggers, but it also treats body.error.code === "context_length_exceeded" as an overflow signal, and this path matches that regardless of the underlying status. The effect: the CLI treats a genuine 500 outage as recoverable via compaction.
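
The two signals can be modelled in a few lines. This is a hypothetical stand-in for the classification described above, not the real parseAPICallError; looksLikeOverflow and ErrorBody are names invented for the example.

```typescript
type ErrorBody = { error?: { code?: string } };

// Hypothetical classifier with the two overflow signals described above.
function looksLikeOverflow(statusCode: number, body: ErrorBody): boolean {
  // Signal 1: the status code alone.
  if (statusCode === 413) return true;
  // Signal 2: the error code in the body, checked for any status.
  return body.error?.code === "context_length_exceeded";
}
```

With this shape, a 500 whose body carries the rewritten code is classified as an overflow, while the same 500 with its original body is not.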

Timeline / PRs on the cloud side

The bug itself — the transformation code

| Commit | PR | Author | Date | What it did |
| --- | --- | --- | --- | --- |
| bc8179c70 | initial commit | Remon Oldenbeuving (remonoldenbeuving) | 2026-02-04 | The makeErrorReadable function with the overflow-rewrite block was already present in the initial commit of this cloud repo. Predates this bug report by months. |
| c4ff5bebb | fix(llm-gateway): add context-length exceeded error translation for Kilo free models | Igor Šćekić (iscekic) | 2026-03-03 | Ported the same logic into the (now-retired) llm-gateway Cloudflare Worker. Commit message explicitly calls the web app version "the reference". |

None of the recent cloud PRs introduced the buggy rewrite. It has been the dominant error-translation path for Kilo-exclusive models since February.

The amplifiers — what made the rewrite fire frequently in the last week

| PR | Commit | Author | Date | Effect on overflow frequency |
| --- | --- | --- | --- | --- |
| #2491 feat(proxy): add error_type zod enum to all LLM proxy error responses | dff71cbac | AI-authored via kilo-code-bot (no human named in the PR body) | 2026-04-16 | Added error_type enum everywhere. Did not change the rewrite logic; neutral for the bug. |
| #2509 Route 10% of kilo-auto/free to Step Flash | ae6033fa3 | Christiaan Arnoldus (chrarnoldus) | 2026-04-16 | 10% of kilo-auto/free sessions now hit a different backing model; any transient upstream error there gets rewritten via the same path. |
| #2518 Update Claude Opus model IDs and names to 4.7 | 48cb77744 | Christiaan Arnoldus (chrarnoldus) | 2026-04-16 | Name bump, minor. |
| #2526 Add xhigh output effort / verbosity for Opus 4.7 | 1353dc14d | Christiaan Arnoldus (chrarnoldus) | 2026-04-16 | New xhigh/max variants inflate request max_tokens. estimateTokenCount adds that directly, so the check flips true sooner. |
| #2502 feat(auto): replace kilo-auto/small backing with Gemma 4 | 416ca73a9 | AI-authored by anthropic/claude-opus-4.6, merged via kilo-code-bot | 2026-04-17 | New backing models for kilo-auto/small. |
| #2576 Enable reasoning summaries by default | dc74d3b46 | Christiaan Arnoldus (chrarnoldus) | 2026-04-20 | Every reasoning request now carries thinking.display: 'summarized' / reasoning.summary: 'auto'. Responses are larger, so the next turn's input is larger and estimateTokenCount creeps up. |
| #2621 Disable Trinity Large Thinking free and notify affected users | f9302ea43 | AI-drafted via "Kilo for Slack" at request of Ari Messer, merged via kilo-code-bot | 2026-04-20 | Pushed a block of users off Trinity (262k ctx) onto Kilo Auto Free (minimax-m2.5:free, 204.8k ctx, the tightest of the exclusive models). |

Existing cloud-repo issues about this

I searched Kilo-Org/cloud for open issues about context_length_exceeded, makeErrorReadable, or estimateTokenCount. There are none. The bug is tracked entirely on the CLI side (Kilo-Org/kilocode#9285 by Zindaar, confirmed by visonforcoding). The cloud repo does not have an issue filed for the wrong-error-transformation problem yet.

Diagram

flowchart TD
    U[Upstream provider] -->|any 4xx or 5xx:<br/>500, 502, 503, 429, 400, ...| M[makeErrorReadable]

    M --> C1{BYOK?}
    C1 -->|yes| R1[BYOK-specific message]
    C1 -->|no| C2{Kilo-exclusive model?}
    C2 -->|no| Pass[pass upstream error through]
    C2 -->|yes| C3["estimateTokenCount >= context_length?<br/>(JSON.stringify/4 + max_tokens)"]
    C3 -->|no| C4{Stealth model?}
    C3 -->|yes| Rewrite["REWRITE to<br/>error_type: context_length_exceeded<br/>status: upstream status"]

    Rewrite --> CLI[CLI classifies as<br/>context_overflow]
    CLI --> Compact[auto-compact]
    Compact --> U

    style Rewrite fill:#fee,stroke:#c00
    style C3 fill:#ffe,stroke:#cc0

The red box is where we're wrongly transforming. The yellow box is where the transformation decision is made on a badly inflated heuristic. Together they produce the loop that PR #9344 now caps on the CLI side.

Recommended cloud-side follow-ups (not fixed by #9344)

  1. Gate the rewrite on upstream error content, not just status + size. Only rewrite if the upstream body actually contains an ambiguous overflow signature (e.g. matches /maximum context|context.*length|token.*exceed/i or is empty/generic) — not for every 4xx/5xx.
  2. Fix the over-counting estimate. Either (a) use an actual tokenizer (tiktoken / model-specific) before rewriting, or (b) widen the trigger threshold to context_length * 1.5 to compensate for JSON overhead.
  3. Don't preserve the upstream status code on rewrite. If we're confident enough to call it overflow, return 413 (the canonical overflow status) so clients can rely on status code alone.
  4. Add a cloud-repo issue tracking this — right now the only record is on the CLI side.
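
Follow-ups 1 and 3 could be combined roughly as sketched below. This is one possible shape under stated assumptions, not a proposed patch: isAmbiguousOverflow, rewriteStatus, and the GENERIC_MESSAGES list are hypothetical, the regex is the one suggested in item 1, and the two generic messages are the examples quoted in the code comment cited earlier.

```typescript
// Pattern suggested in follow-up 1.
const OVERFLOW_PATTERN = /maximum context|context.*length|token.*exceed/i;

// Illustrative list of generic upstream messages; an assumption for this sketch.
const GENERIC_MESSAGES = [
  "Internal Server Error",
  "No allowed providers are available for the selected model",
];

// True only when the upstream body is empty, generic, or overflow-like.
function isAmbiguousOverflow(upstreamBody: string): boolean {
  const text = upstreamBody.trim();
  if (text === "") return true;
  if (OVERFLOW_PATTERN.test(text)) return true;
  return GENERIC_MESSAGES.some((message) => text.includes(message));
}

// Follow-up 3: when rewriting, return 413 instead of the upstream status.
function rewriteStatus(
  upstreamStatus: number,
  upstreamBody: string,
  overBudget: boolean,
): number {
  return overBudget && isAmbiguousOverflow(upstreamBody) ? 413 : upstreamStatus;
}
```

Note that a generic "Internal Server Error" still qualifies for the rewrite here whenever the size check passes, so this gate only helps if the size estimate from follow-up 2 is also made more accurate.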

@alex-alecu alex-alecu merged commit 740c2b4 into main Apr 22, 2026
19 checks passed
@alex-alecu alex-alecu deleted the fix/infinite-compact branch April 22, 2026 13:10
ausard pushed a commit to ausard/kilocode that referenced this pull request Apr 28, 2026
Trim pre-summary history for any completed summary (not only those whose
parent has a compaction part) and strip image/PDF attachments from
historical turns once a summary exists. This stops the outgoing request
from re-shipping multi-MB base-64 attachments on every follow-up turn,
which was causing gateway body-size rejections and cascading compaction
loops even after PR Kilo-Org#9344's attempt cap kicked in.
slamj1 pushed a commit to slamj1/kilocode that referenced this pull request May 16, 2026
jliounis pushed a commit to jliounis/kilocode that referenced this pull request May 18, 2026
jliounis pushed a commit to jliounis/kilocode that referenced this pull request May 18, 2026
fix(cli): cap per-turn compaction attempts
jliounis pushed a commit to jliounis/kilocode that referenced this pull request May 18, 2026
Trim pre-summary history for any completed summary (not only those whose
parent has a compaction part) and strip image/PDF attachments from
historical turns once a summary exists. This stops the outgoing request
from re-shipping multi-MB base-64 attachments on every follow-up turn,
which was causing gateway body-size rejections and cascading compaction
loops even after PR Kilo-Org#9344's attempt cap kicked in.