fix: CJK-aware token estimation with shared utility (6× correction for CJK/emoji)#344

Merged: jalehman merged 2 commits into Martian-Engineering:main from jetd1:fix/cjk-token-v2 on Apr 9, 2026

Conversation

jetd1 (Contributor) commented Apr 9, 2026

Problem

estimateTokens() uses text.length / 4 to estimate token counts. In JavaScript, String.length counts UTF-16 code units, not Unicode code points. This causes severe underestimation for non-ASCII text.

CJK text (~6× underestimate)

Chinese/Japanese/Korean characters are typically tokenized at ~1.5 tokens per character, but length / 4 treats them as ~0.25 tokens per character (1 UTF-16 unit ÷ 4).

Emoji / Supplementary Plane (~2-4× underestimate)

Emoji outside the Basic Multilingual Plane are encoded as UTF-16 surrogate pairs ("🔥".length === 2), so length / 4 yields 0.5, which rounds up to 1 token each. Real tokenization is typically 2-4 tokens per emoji.

| Text | Old (length/4) | New (CJK-aware) | Actual ~tokens |
|---|---|---|---|
| Hello world | 3 | 3 | ~3 |
| 你好世界 | 1 | 6 | ~6 |
| こんにちは | 2 | 8 | ~8 |
| 안녕하세요 | 2 | 8 | ~8 |
| 🔥🎉💯 | 2 | 6 | ~6-12 |
| mixed 你好 🔥 | 4 | 7 | ~7-10 |
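
To make the "Old" column concrete, here is a minimal sketch of the pre-fix heuristic next to a code point count. The names `estimateTokensOld` and `codePointCount` are hypothetical, and the use of `Math.ceil` is an assumption inferred from the "rounded up" wording above rather than taken from the repository.

```typescript
// Assumed shape of the old heuristic: UTF-16 code units divided by 4, rounded up.
function estimateTokensOld(text: string): number {
  return Math.ceil(text.length / 4);
}

// Array.from iterates a string by Unicode code point, so a surrogate pair counts once.
function codePointCount(text: string): number {
  return Array.from(text).length;
}

const samples: string[] = ["Hello world", "你好世界", "こんにちは", "안녕하세요", "🔥🎉💯"];
for (const s of samples) {
  // Columns: text, UTF-16 code units, code points, old estimate.
  console.log(s, s.length, codePointCount(s), estimateTokensOld(s));
}
```

For BMP CJK text the code unit count and the code point count agree, so the error there comes from the 4-characters-per-token assumption; for supplementary-plane emoji the two counts differ by a factor of two.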

Impact

When LCM underestimates token counts for CJK-heavy conversations:

  • Compaction triggers too late — context grows far beyond the configured threshold
  • Assembly budgets are effectively 6× too lenient for Chinese text
  • Root cause of a context explosion incident: 388K+ tokens accumulated before compaction triggered
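
As a rough illustration of the late trigger, the sketch below appends CJK text until the old estimate reaches a threshold and then compares it with a ~1.5 tokens/char figure. The threshold value, the helper names, and the loop itself are assumptions made for illustration; they are not LCM's actual compaction logic.

```typescript
// Hypothetical threshold; the real configured value is not stated in this PR.
const COMPACTION_THRESHOLD = 1000;

// Assumed shape of the old heuristic (UTF-16 code units / 4, rounded up).
const oldEstimate = (text: string): number => Math.ceil(text.length / 4);

// Rough stand-in for real tokenization of BMP CJK text at ~1.5 tokens per character.
const approxCjkTokens = (text: string): number =>
  Math.ceil(Array.from(text).length * 1.5);

// Keep appending a 100-character CJK message until the old estimate hits the threshold.
const message = "你好世界".repeat(25);
let context = "";
while (oldEstimate(context) < COMPACTION_THRESHOLD) {
  context += message;
}

// By the time the old estimate reports 1000, the approximate real size is 6000.
console.log(oldEstimate(context), approxCjkTokens(context));
```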

Fix

Extract a shared src/estimate-tokens.ts utility and replace all 6 inline estimateTokens definitions across the codebase:

  • src/engine.ts
  • src/assembler.ts
  • src/compaction.ts
  • src/retrieval.ts
  • src/summarize.ts
  • src/plugin/lcm-doctor-apply.ts

The shared implementation uses for (const char of text) for correct Unicode code point iteration and applies per-character-class weighting:

  • CJK Ideographs (Extensions A-F), Hiragana, Katakana, Hangul, CJK Symbols/Punctuation, Fullwidth Forms: 1.5 tokens/char
  • Emoji / Supplementary Plane (cp > 0xFFFF): 2 tokens/char
  • ASCII / Latin: 0.25 tokens/char (unchanged)
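
The weighting above can be sketched as follows. This is not the merged src/estimate-tokens.ts: the function name `estimateTokensSketch`, the exact range table, the order of the checks (CJK before the supplementary-plane test, so Extension B counts as CJK), and the final `Math.ceil` are all assumptions chosen to match the description and the table in this PR.

```typescript
// Illustrative Unicode ranges for the classes listed above (assumed, not copied from the PR).
const CJK_RANGES: ReadonlyArray<readonly [number, number]> = [
  [0x3000, 0x303f], // CJK Symbols and Punctuation
  [0x3040, 0x309f], // Hiragana
  [0x30a0, 0x30ff], // Katakana
  [0x3400, 0x4dbf], // CJK Unified Ideographs Extension A
  [0x4e00, 0x9fff], // CJK Unified Ideographs
  [0xac00, 0xd7af], // Hangul Syllables
  [0xff00, 0xffef], // Halfwidth and Fullwidth Forms
  [0x20000, 0x2ebef], // CJK Unified Ideographs Extensions B-F
];

function isCjk(cp: number): boolean {
  return CJK_RANGES.some(([lo, hi]) => cp >= lo && cp <= hi);
}

function estimateTokensSketch(text: string): number {
  let total = 0;
  // for...of walks the string by code point, so surrogate pairs arrive as one character.
  for (const ch of text) {
    const cp = ch.codePointAt(0) ?? 0;
    if (isCjk(cp)) {
      total += 1.5; // CJK classes
    } else if (cp > 0xffff) {
      total += 2; // emoji and other supplementary-plane characters
    } else {
      total += 0.25; // ASCII / Latin and everything else
    }
  }
  return Math.ceil(total);
}

console.log(estimateTokensSketch("你好世界")); // 4 chars × 1.5 = 6
```

Rounding once at the end, rather than per character, keeps ASCII text at roughly four characters per token.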

Compared to other open PRs

PR #47 PR #256 This PR
Call sites patched 2 of 6 5 of 6 6 of 6
Shared utility ❌ (inline)
Emoji / Supplementary Plane
Hiragana / Katakana
Hangul
lcm-doctor-apply.ts
Test coverage

Tests

Added test/estimate-tokens.test.ts with 11 test cases covering:

  • ASCII text
  • CJK Han ideographs
  • Hiragana / Katakana
  • Hangul
  • Emoji / Supplementary Plane
  • Mixed text (ASCII + CJK + emoji)
  • CJK Extension B (supplementary plane Han)
  • Fullwidth forms
  • CJK punctuation
  • Empty string

All 636 tests pass (39 suites) including the new estimator tests.

Performance

The new estimator is O(n) in text length, versus O(1) for length / 4, but the added cost is negligible: the compaction bottleneck is the LLM call (seconds), not token estimation (microseconds).

Closes #47, Closes #250, Closes #256, Closes #266

jetd1 force-pushed the fix/cjk-token-v2 branch from 7e1d8e6 to d177fee on April 9, 2026 13:24
Replace naive text.length/4 token estimation across all 6 call sites
with a shared code-point-aware estimator in src/estimate-tokens.ts.

- CJK (Chinese/Japanese/Korean): ~1.5 tokens/char
- Emoji / Supplementary Plane: ~2 tokens/char
- ASCII / Latin: ~0.25 tokens/char (~4 chars/token)

The old formula used String.length (UTF-16 code units) which
underestimates CJK by ~6x and emoji by ~2-4x, causing compaction
to trigger far too late for non-English conversations.

Closes #47, Closes #250, Closes #256, Closes #266
jetd1 force-pushed the fix/cjk-token-v2 branch from d177fee to 6a9af64 on April 9, 2026 13:25
Keep compaction hard caps and deterministic fallback summaries inside their intended token budgets after switching to the shared Unicode-aware estimator. Add CJK-heavy regression coverage for both the summary cap path and fallback truncation, and add a patch changeset for the release notes.

Regeneration-Prompt: |
  Review PR #344's shared Unicode-aware token estimator for downstream callers that still assume 4 characters per token. Fix compaction so both the hard-cap path and the deterministic fallback truncate by estimated token budget instead of raw string length, preserving surrogate pairs and working for CJK-heavy or emoji-heavy text. Add regression tests in the compaction integration suite that prove capped summaries and fallback summaries stay within budget for CJK-heavy content, and add a patch changeset because this is user-visible compaction behavior.
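
Truncating by estimated token budget instead of raw string length, as the commit message above describes, might look roughly like this. The names `charWeight` and `truncateToTokenBudget` are hypothetical, and the simplified range check stands in for the shared estimator; the actual compaction code may differ.

```typescript
// Simplified per-character weight using the PR's figures (assumed ranges, for illustration only).
function charWeight(ch: string): number {
  const cp = ch.codePointAt(0) ?? 0;
  const isCjkLike =
    (cp >= 0x3000 && cp <= 0x30ff) || // CJK punctuation, Hiragana, Katakana
    (cp >= 0x3400 && cp <= 0x9fff) || // Han ideographs (Extension A and the main block)
    (cp >= 0xac00 && cp <= 0xd7af) || // Hangul Syllables
    (cp >= 0xff00 && cp <= 0xffef); // Fullwidth forms
  if (isCjkLike) return 1.5;
  if (cp > 0xffff) return 2;
  return 0.25;
}

// Keep whole code points until adding the next one would push the estimate past the budget.
function truncateToTokenBudget(text: string, maxTokens: number): string {
  let total = 0;
  let out = "";
  for (const ch of text) {
    const next = total + charWeight(ch);
    if (Math.ceil(next) > maxTokens) break;
    total = next;
    out += ch;
  }
  return out;
}

console.log(truncateToTokenBudget("你好世界你好世界", 6)); // keeps the first four characters
```

Because the loop iterates by code point, it never cuts between the two halves of a surrogate pair, which a raw `slice` on UTF-16 indices can do.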
jalehman (Contributor) commented Apr 9, 2026

Thank you!

jalehman merged commit 897a953 into Martian-Engineering:main on Apr 9, 2026
1 check passed
github-actions (bot) mentioned this pull request on Apr 9, 2026

Development

Successfully merging this pull request may close these issues.

  • 'summary.ts:estimateTokens' could underestimate the Chinese context.
  • [Bug] CJK token estimation uses length/4 causing severe underestimation

2 participants