Audit countTokens callers — add upper-char bound to defend non-MCP paths from pathological BPE input #558

@esengine

Summary

A user report flagged the pure-TS BPE tokenizer as O(n²) on
pathological repetitive input (AAAA…) and claimed multiple call
paths could hang for 30s+. Most of the report misread our code —
specifically, the "Never tokenizes full input — pathological repetitive text (AAAA…) costs 30s+..." comment at
src/mcp/registry.ts:152 is defense documentation, not a bug
admission. truncateForModelByTokens is deliberately designed to
never feed full input into BPE; sizePrefixToTokens only tokenizes
budget-bounded slices with a 6-iter cap. That whole MCP truncation
path is already safe.

But there's a narrower real surface that the report stumbled into,
worth fixing.

The actual exposure

Three call sites tokenize full message content with only a
lower-bound short-circuit (length <= maxTokens skips), and no upper
char cap:

  1. src/loop/shrink.ts:52 shrinkOversizedToolResultsByTokens:
    if (content.length <= maxTokens) return msg;
    const beforeTokens = countTokens(content);  // full content
  2. src/loop/shrink.ts:83 — same pattern for tool_calls.arguments.
  3. src/tokenizer.ts:258 estimateConversationTokens — preflight
    sums countTokens(m.content) over every message in the
    conversation; no per-message char cap.

Most of the time these are safe because upstream caps (MCP default
maxResultChars: 8000, subagent default 8000) keep individual
strings small. But:

  • Non-MCP tools (read_file, shell stdout) don't share the MCP cap —
    a read_file of a 200KB log file lands in the message log at full
    size, and the next preflight tokenizes the whole thing.
  • A repetitive payload (CSV columns, log timestamps, base64 chunks)
    hits the BPE inner loop's worst case where merges keep finding
    matches across the full string.
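
To make that worst case concrete, here is a toy sketch of a naive BPE merge loop. It is not the project's tokenizer: `naiveBpe`, the `Ranks` table, and the `pairChecks` counter are hypothetical names, used only to show how rescanning the whole sequence after every merge grows quadratically on repetitive input.

```typescript
// Toy merge table (hypothetical): lower rank merges first.
type Ranks = Map<string, number>;

// Naive BPE: after every merge, rescan all adjacent pairs for the best one.
// pairChecks counts how many adjacent pairs were inspected in total.
function naiveBpe(
  text: string,
  ranks: Ranks,
): { tokens: string[]; pairChecks: number } {
  const parts = Array.from(text);
  let pairChecks = 0;

  while (parts.length > 1) {
    let bestIdx = -1;
    let bestRank = Infinity;

    // Full scan over the current sequence: O(n) work per merge.
    for (let i = 0; i < parts.length - 1; i++) {
      pairChecks++;
      const rank = ranks.get(parts[i] + parts[i + 1]);
      if (rank !== undefined && rank < bestRank) {
        bestRank = rank;
        bestIdx = i;
      }
    }

    if (bestIdx === -1) break; // no mergeable pair left

    // Merge exactly one pair, then start over.
    parts.splice(bestIdx, 2, parts[bestIdx] + parts[bestIdx + 1]);
  }

  return { tokens: parts, pairChecks };
}

// Repetitive input: every scan finds another "AA" to merge.
const ranks: Ranks = new Map([["AA", 0]]);
const short = naiveBpe("A".repeat(256), ranks);
const long = naiveBpe("A".repeat(512), ranks);
console.log(short.pairChecks, long.pairChecks);
```

In this toy, doubling the length of an all-`A` input roughly quadruples `pairChecks`. Whether the real port degrades the same way depends on its merge strategy and pre-tokenization, which this sketch does not model.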

Worst case isn't 30s on realistic inputs — that number was
speculation in the report — but tokenizing 200KB of moderately
repetitive content on the pure-TS port does take seconds, and it
runs synchronously on the main thread. That's enough to:

  • Stall the loop's preflight
  • Make decidePreflight (src/context-manager.ts:107) feel like a
    freeze on long sessions
  • Compound when shrink + preflight + estimate all touch the same
    oversized message in one turn

Proposed fix

Add a bounded-tokenize helper alongside countTokens:

// Returns an exact count when feasible, an estimate for oversized
// inputs. Never tokenizes more than `maxChars` of the input.
export function countTokensBounded(text: string, maxChars: number): number;

Implementation: when text.length <= maxChars, just delegate to
countTokens. Otherwise tokenize a head + tail sample (same shape as
registry.ts:170-182 already uses for size estimation) and scale by
char/token ratio. The math is already in-tree at registry.ts:180:

const ratio = sampleChars > 0 ? sampleTokens / sampleChars : 0.3;
const estTotalTokens = Math.ceil(s.length * ratio);
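
A minimal sketch of how the helper might look, under stated assumptions: the `countTokens` below is a stand-in (roughly one token per four chars) so the example runs on its own, and the even head/tail split is a guess at the sampling shape, not a copy of registry.ts.

```typescript
// Stand-in for the real countTokens in src/tokenizer.ts (assumption:
// ~1 token per 4 chars). The real function runs BPE.
function countTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Sketch of the proposed helper. Exact when the input fits in maxChars,
// otherwise an estimate scaled from a head + tail sample. Never passes
// more than maxChars characters to countTokens.
function countTokensBounded(text: string, maxChars: number): number {
  if (text.length <= maxChars) return countTokens(text);

  // Split the sampling budget between the start and the end of the input.
  const headLen = Math.max(0, Math.ceil(maxChars / 2));
  const tailLen = Math.max(0, maxChars - headLen);
  const head = text.slice(0, headLen);
  const tail = tailLen > 0 ? text.slice(text.length - tailLen) : "";

  const sampleChars = head.length + tail.length;
  const sampleTokens = countTokens(head) + countTokens(tail);

  // Same ratio math as quoted from registry.ts:180, including the 0.3 fallback.
  const ratio = sampleChars > 0 ? sampleTokens / sampleChars : 0.3;
  return Math.ceil(text.length * ratio);
}

// Small input: exact path. Large input: sampled estimate.
console.log(countTokensBounded("hello world", 1024));
console.log(countTokensBounded("x".repeat(200_000), 32 * 1024));
```

Sampling both ends is a hedge against inputs whose head is not representative (for example a short text header followed by a long base64 body); a head-only sample would be simpler but easier to fool.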

Then swap the three sites above to call countTokensBounded with a
sane cap (e.g. 32KB — well above any realistic single message that
matters for a token-budget decision, well below the slow zone).

For shrinkOversizedToolResultsByTokens a loose count is fine
because the function only has to decide "is this over budget";
an estimate within ±10% is enough to gate the truncation pass.
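
As a rough illustration of the swap at the shrink call site, the gate could look like the sketch below. `shouldShrink`, `Msg`, and the injected `Counter` are hypothetical names; the counter is passed in so the example stays self-contained, and the 32 * 1024 default assumes the cap is measured in characters.

```typescript
// Hypothetical shapes for illustration only.
type Msg = { role: string; content: string };
type Counter = (text: string, maxChars: number) => number;

// Decide whether a message needs the truncation pass.
function shouldShrink(
  msg: Msg,
  maxTokens: number,
  countBounded: Counter,
  maxChars: number = 32 * 1024,
): boolean {
  // Existing cheap short-circuit from the issue: short content is skipped
  // without tokenizing at all.
  if (msg.content.length <= maxTokens) return false;

  // Bounded count: exact for small inputs, an estimate for oversized ones.
  const beforeTokens = countBounded(msg.content, maxChars);
  return beforeTokens > maxTokens;
}

// Usage with a stand-in counter (assumption: ~1 token per 4 chars).
const roughCounter: Counter = (text) => Math.ceil(text.length / 4);
const bigResult: Msg = { role: "tool", content: "x".repeat(200_000) };
console.log(shouldShrink(bigResult, 4000, roughCounter));
```

Because the result only gates truncation, an estimate that lands slightly above or below the true count changes the decision only for messages already sitting near the budget.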

Out of scope

  • Replacing the pure-TS BPE port with a native binding. The TS port
    is "good enough for budgeting decisions" by design; native is a
    separate, much larger change.
  • Caching countTokens results. Useful but orthogonal.
  • The MCP truncation path. Already defended; do not touch.

Credit

Original report flagged the right family of risk even though the
specific claims were wrong. The 30s number isn't real on observed
inputs and the cited comment was misread, but the underlying "we
sometimes tokenize unbounded user input on the main thread" exposure
is genuine and worth closing.

Labels: bug, enhancement
