fix(tokens): CJK-aware token estimation — 3x accuracy for CJK content by fchaudhryspear · Pull Request #256 · Martian-Engineering/lossless-claw

fchaudhryspear · 2026-04-03T21:38:43Z

Replace naive Math.ceil(text.length / 4) token estimation with per-character Unicode weighting. CJK-aware estimateTokens() now correctly estimates ~1.5 tokens per CJK character instead of 0.25. Fixes #250.

Replace naive length/4 with per-character Unicode weighting: - CJK Unified Ideographs (U+4E00–U+9FFF) + Extension A–F: 1.5 tokens/char - CJK Symbols + Fullwidth Forms (U+3000, U+FF00): 1.5 tokens/char - Everything else: 0.25 tokens/char (4 chars/token, unchanged) Rationale: Claude/GPT tokenizers encode CJK at ~1.5 tokens/char vs 0.25 for ASCII. Old method underestimated CJK-heavy sessions by 3x, causing premature 200k context exhaustion (issue Martian-Engineering#250). Change is purely estimation-side — no effect on actual tokenization. Zero new dependencies. Backward compatible. Fixes: Martian-Engineering#250

jalehman

The shared estimator refactor is fine, but this still needs changes before merge.

src/estimate-tokens.ts does not cover Japanese kana or Korean Hangul. The new helper only weights Han ideographs, CJK punctuation, and fullwidth forms as "CJK", so strings like こんにちは and 안녕하세요 still fall back to the old 0.25 tokens-per-character heuristic. That means the PR only fixes Han-heavy text, not the broader CJK cases described in the PR title/body.
There is no regression coverage for the new estimator. Existing helpers in test/lcm-integration.test.ts and test/engine.test.ts still hard-code Math.ceil(length / 4), so the tests are now out of sync with production behavior and would not catch the missing kana/Hangul coverage.

Please extend the estimator to cover kana/Hangul (or align it exactly with the upstream reference implementation) and add direct tests for ASCII, Han, kana, and Hangul inputs.

freivokh · 2026-04-08T20:07:32Z

+1

…r CJK/emoji) (Martian-Engineering#344) * fix: CJK-aware token estimation with shared utility Replace naive text.length/4 token estimation across all 6 call sites with a shared code-point-aware estimator in src/estimate-tokens.ts. - CJK (Chinese/Japanese/Korean): ~1.5 tokens/char - Emoji / Supplementary Plane: ~2 tokens/char - ASCII / Latin: ~0.25 tokens/char (~4 chars/token) The old formula used String.length (UTF-16 code units) which underestimates CJK by ~6x and emoji by ~2-4x, causing compaction to trigger far too late for non-English conversations. Closes Martian-Engineering#47, Closes Martian-Engineering#250, Closes Martian-Engineering#256, Closes Martian-Engineering#266 * fix: enforce unicode-aware compaction truncation Keep compaction hard caps and deterministic fallback summaries inside their intended token budgets after switching to the shared Unicode-aware estimator. Add CJK-heavy regression coverage for both the summary cap path and fallback truncation, and add a patch changeset for the release notes. Regeneration-Prompt: | Review PR Martian-Engineering#344's shared Unicode-aware token estimator for downstream callers that still assume 4 characters per token. Fix compaction so both the hard-cap path and the deterministic fallback truncate by estimated token budget instead of raw string length, preserving surrogate pairs and working for CJK-heavy or emoji-heavy text. Add regression tests in the compaction integration suite that prove capped summaries and fallback summaries stay within budget for CJK-heavy content, and add a patch changeset because this is user-visible compaction behavior. --------- Co-authored-by: jet <dev@jetd.one> Co-authored-by: Josh Lehman <josh@martian.engineering>

fchaudhryspear force-pushed the main branch from d802dcb to 91d8a1f Compare April 3, 2026 21:40

jalehman requested changes Apr 3, 2026

View reviewed changes

This was referenced Apr 9, 2026

fix: CJK-aware token estimation (6× correction for Chinese/Japanese/Korean) #343

Closed

fix: CJK-aware token estimation with shared utility (6× correction for CJK/emoji) #344

Merged

jalehman closed this in #344 Apr 9, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(tokens): CJK-aware token estimation — 3x accuracy for CJK content#256

fix(tokens): CJK-aware token estimation — 3x accuracy for CJK content#256
fchaudhryspear wants to merge 1 commit into
Martian-Engineering:mainfrom
fchaudhryspear:main

fchaudhryspear commented Apr 3, 2026

Uh oh!

jalehman left a comment

Uh oh!

freivokh commented Apr 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

fchaudhryspear commented Apr 3, 2026

Uh oh!

jalehman left a comment

Choose a reason for hiding this comment

Uh oh!

freivokh commented Apr 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants