fix(tokens): CJK-aware token estimation — 3x accuracy for CJK content#256
fchaudhryspear wants to merge 1 commit
Replace naive length/4 with per-character Unicode weighting:

- CJK Unified Ideographs (U+4E00–U+9FFF) + Extension A–F: 1.5 tokens/char
- CJK Symbols + Fullwidth Forms (U+3000, U+FF00): 1.5 tokens/char
- Everything else: 0.25 tokens/char (4 chars/token, unchanged)

Rationale: Claude/GPT tokenizers encode CJK at ~1.5 tokens/char vs 0.25 for ASCII. The old method underestimated CJK-heavy sessions by 3x, causing premature 200k context exhaustion (issue Martian-Engineering#250).

The change is purely estimation-side and has no effect on actual tokenization. Zero new dependencies. Backward compatible.

Fixes: Martian-Engineering#250
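The weighting described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the function names, the exact block boundaries for the symbol and fullwidth ranges, and the single approximate span used for Extensions B through F are all assumptions.

```typescript
// Hypothetical sketch of the per-character weighting; names and exact
// range boundaries are assumptions, not taken from the PR diff.
const CJK_RANGES: ReadonlyArray<readonly [number, number]> = [
  [0x4e00, 0x9fff],   // CJK Unified Ideographs
  [0x3400, 0x4dbf],   // CJK Unified Ideographs Extension A
  [0x20000, 0x2ebef], // Extensions B through F (approximate span, assumed)
  [0x3000, 0x303f],   // CJK Symbols and Punctuation
  [0xff00, 0xffef],   // Halfwidth and Fullwidth Forms
];

function isWeightedCjk(codePoint: number): boolean {
  return CJK_RANGES.some(([lo, hi]) => codePoint >= lo && codePoint <= hi);
}

function estimateTokens(text: string): number {
  let total = 0;
  let i = 0;
  while (i < text.length) {
    // Read a full code point so surrogate pairs count as one character.
    const cp = text.codePointAt(i) ?? 0;
    total += isWeightedCjk(cp) ? 1.5 : 0.25;
    // Code points above U+FFFF occupy two UTF-16 code units.
    i += cp > 0xffff ? 2 : 1;
  }
  return Math.ceil(total);
}
```

Under these assumed weights, `estimateTokens("你好世界")` evaluates to 6, where `Math.ceil("你好世界".length / 4)` gives 1. Plain ASCII is unaffected: `"hello world"` comes out as 3 under both formulas.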
jalehman left a comment
The shared estimator refactor is fine, but this still needs changes before merge.
- `src/estimate-tokens.ts` does not cover Japanese kana or Korean Hangul. The new helper only weights Han ideographs, CJK punctuation, and fullwidth forms as "CJK", so strings like `こんにちは` and `안녕하세요` still fall back to the old `0.25` tokens-per-character heuristic. That means the PR only fixes Han-heavy text, not the broader CJK cases described in the PR title/body.
- There is no regression coverage for the new estimator. Existing helpers in `test/lcm-integration.test.ts` and `test/engine.test.ts` still hard-code `Math.ceil(length / 4)`, so the tests are now out of sync with production behavior and would not catch the missing kana/Hangul coverage.
Please extend the estimator to cover kana/Hangul (or align it exactly with the upstream reference implementation) and add direct tests for ASCII, Han, kana, and Hangul inputs.
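One way the requested change could look is sketched below: the kana and Hangul blocks are added to the 1.5-weight class, followed by direct checks for ASCII, Han, kana, and Hangul inputs. The name `estimateTokensCjk`, the choice of blocks, and the plain `throw`-based checks are assumptions made for illustration; the repository's real estimator and test framework may differ.

```typescript
type Range = readonly [number, number];

// Hypothetical range table: the Han/symbol ranges from the PR description
// plus the kana and Hangul blocks the review asks for.
const CJK_WEIGHTED: ReadonlyArray<Range> = [
  [0x4e00, 0x9fff], // CJK Unified Ideographs
  [0x3400, 0x4dbf], // CJK Unified Ideographs Extension A
  [0x3000, 0x303f], // CJK Symbols and Punctuation
  [0xff00, 0xffef], // Halfwidth and Fullwidth Forms
  [0x3040, 0x309f], // Hiragana
  [0x30a0, 0x30ff], // Katakana
  [0x1100, 0x11ff], // Hangul Jamo
  [0xac00, 0xd7af], // Hangul Syllables
];

function estimateTokensCjk(text: string): number {
  let total = 0;
  let i = 0;
  while (i < text.length) {
    const cp = text.codePointAt(i) ?? 0;
    const isCjk = CJK_WEIGHTED.some(([lo, hi]) => cp >= lo && cp <= hi);
    total += isCjk ? 1.5 : 0.25;
    i += cp > 0xffff ? 2 : 1; // skip the low surrogate of a pair
  }
  return Math.ceil(total);
}

// Direct checks for the four input classes named in the review.
const cases: ReadonlyArray<readonly [string, string, number]> = [
  ["ASCII", "hello world", 3],  // 11 * 0.25 = 2.75
  ["Han", "你好世界", 6],        // 4 * 1.5
  ["kana", "こんにちは", 8],     // 5 * 1.5 = 7.5
  ["Hangul", "안녕하세요", 8],   // 5 * 1.5 = 7.5
];

for (const [label, input, expected] of cases) {
  const actual = estimateTokensCjk(input);
  if (actual !== expected) {
    throw new Error(`${label}: expected ${expected}, got ${actual}`);
  }
}
```

For comparison, a helper that still hard-codes `Math.ceil(length / 4)` returns 2 for both `こんにちは` and `안녕하세요`, so a test built on it would pass whether or not kana and Hangul are weighted.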
+1
…r CJK/emoji) (Martian-Engineering#344)

* fix: CJK-aware token estimation with shared utility

Replace naive text.length/4 token estimation across all 6 call sites with a shared code-point-aware estimator in src/estimate-tokens.ts.

- CJK (Chinese/Japanese/Korean): ~1.5 tokens/char
- Emoji / Supplementary Plane: ~2 tokens/char
- ASCII / Latin: ~0.25 tokens/char (~4 chars/token)

The old formula used String.length (UTF-16 code units), which underestimates CJK by ~6x and emoji by ~2-4x, causing compaction to trigger far too late for non-English conversations.

Closes Martian-Engineering#47, Closes Martian-Engineering#250, Closes Martian-Engineering#256, Closes Martian-Engineering#266

* fix: enforce unicode-aware compaction truncation

Keep compaction hard caps and deterministic fallback summaries inside their intended token budgets after switching to the shared Unicode-aware estimator.

Add CJK-heavy regression coverage for both the summary cap path and fallback truncation, and add a patch changeset for the release notes.

Regeneration-Prompt: |
  Review PR Martian-Engineering#344's shared Unicode-aware token estimator for downstream callers that still assume 4 characters per token. Fix compaction so both the hard-cap path and the deterministic fallback truncate by estimated token budget instead of raw string length, preserving surrogate pairs and working for CJK-heavy or emoji-heavy text. Add regression tests in the compaction integration suite that prove capped summaries and fallback summaries stay within budget for CJK-heavy content, and add a patch changeset because this is user-visible compaction behavior.

---------

Co-authored-by: jet <dev@jetd.one>
Co-authored-by: Josh Lehman <josh@martian.engineering>
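Truncating by estimated token budget rather than raw string length, as the second commit describes, might look something like the sketch below. The names `tokenWeight` and `truncateToTokenBudget`, the BMP range list, and the choice to weight every supplementary-plane code point at 2 are assumptions based on the commit message, not the merged implementation.

```typescript
// Hypothetical BMP ranges treated as CJK (weight 1.5); an assumption.
const CJK_BMP: ReadonlyArray<readonly [number, number]> = [
  [0x1100, 0x11ff], // Hangul Jamo
  [0x3000, 0x30ff], // CJK Symbols and Punctuation, Hiragana, Katakana
  [0x3400, 0x4dbf], // CJK Unified Ideographs Extension A
  [0x4e00, 0x9fff], // CJK Unified Ideographs
  [0xac00, 0xd7af], // Hangul Syllables
  [0xff00, 0xffef], // Halfwidth and Fullwidth Forms
];

function tokenWeight(cp: number): number {
  if (cp > 0xffff) return 2; // supplementary plane, including most emoji
  if (CJK_BMP.some(([lo, hi]) => cp >= lo && cp <= hi)) return 1.5;
  return 0.25; // ASCII / Latin and everything else
}

// Returns the longest prefix whose estimated token count fits maxTokens.
// The cut always lands on a code point boundary, so a surrogate pair is
// either kept whole or dropped whole.
function truncateToTokenBudget(text: string, maxTokens: number): string {
  let used = 0;
  let i = 0;
  while (i < text.length) {
    const cp = text.codePointAt(i) ?? 0;
    const weight = tokenWeight(cp);
    if (Math.ceil(used + weight) > maxTokens) break;
    used += weight;
    i += cp > 0xffff ? 2 : 1;
  }
  return text.slice(0, i);
}
```

With a budget of 4 tokens, `truncateToTokenBudget("你好世界你好", 4)` keeps `"你好"` (estimated 3 tokens), whereas a length-based cut such as `text.slice(0, 4 * 4)` would keep all six characters, which these weights estimate at 9 tokens.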
Replace naive Math.ceil(text.length / 4) token estimation with per-character Unicode weighting. CJK-aware estimateTokens() now correctly estimates ~1.5 tokens per CJK character instead of 0.25. Fixes #250.