Skip to content

fix(tokens): CJK-aware token estimation — 3x accuracy for CJK content#256

Closed
fchaudhryspear wants to merge 1 commit into
Martian-Engineering:mainfrom
fchaudhryspear:main
Closed

fix(tokens): CJK-aware token estimation — 3x accuracy for CJK content#256
fchaudhryspear wants to merge 1 commit into
Martian-Engineering:mainfrom
fchaudhryspear:main

Conversation

@fchaudhryspear

Copy link
Copy Markdown

Replace naive Math.ceil(text.length / 4) token estimation with per-character Unicode weighting. CJK-aware estimateTokens() now correctly estimates ~1.5 tokens per CJK character instead of 0.25. Fixes #250.

Replace naive length/4 with per-character Unicode weighting:
- CJK Unified Ideographs (U+4E00–U+9FFF) + Extension A–F: 1.5 tokens/char
- CJK Symbols + Fullwidth Forms (U+3000, U+FF00): 1.5 tokens/char
- Everything else: 0.25 tokens/char (4 chars/token, unchanged)

Rationale: Claude/GPT tokenizers encode CJK at ~1.5 tokens/char vs 0.25
for ASCII. Old method underestimated CJK-heavy sessions by 3x, causing
premature 200k context exhaustion (issue Martian-Engineering#250).

Change is purely estimation-side — no effect on actual tokenization.
Zero new dependencies. Backward compatible.

Fixes: Martian-Engineering#250

@jalehman jalehman left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The shared estimator refactor is fine, but this still needs changes before merge.

  1. src/estimate-tokens.ts does not cover Japanese kana or Korean Hangul. The new helper only weights Han ideographs, CJK punctuation, and fullwidth forms as "CJK", so strings like こんにちは and 안녕하세요 still fall back to the old 0.25 tokens-per-character heuristic. That means the PR only fixes Han-heavy text, not the broader CJK cases described in the PR title/body.

  2. There is no regression coverage for the new estimator. Existing helpers in test/lcm-integration.test.ts and test/engine.test.ts still hard-code Math.ceil(length / 4), so the tests are now out of sync with production behavior and would not catch the missing kana/Hangul coverage.

Please extend the estimator to cover kana/Hangul (or align it exactly with the upstream reference implementation) and add direct tests for ASCII, Han, kana, and Hangul inputs.

@freivokh

freivokh commented Apr 8, 2026

Copy link
Copy Markdown

+1

@jalehman jalehman closed this in #344 Apr 9, 2026
torchy55 pushed a commit to torchy55/lossless-claw that referenced this pull request Apr 9, 2026
…r CJK/emoji) (Martian-Engineering#344)

* fix: CJK-aware token estimation with shared utility

Replace naive text.length/4 token estimation across all 6 call sites
with a shared code-point-aware estimator in src/estimate-tokens.ts.

- CJK (Chinese/Japanese/Korean): ~1.5 tokens/char
- Emoji / Supplementary Plane: ~2 tokens/char
- ASCII / Latin: ~0.25 tokens/char (~4 chars/token)

The old formula used String.length (UTF-16 code units) which
underestimates CJK by ~6x and emoji by ~2-4x, causing compaction
to trigger far too late for non-English conversations.

Closes Martian-Engineering#47, Closes Martian-Engineering#250, Closes Martian-Engineering#256, Closes Martian-Engineering#266

* fix: enforce unicode-aware compaction truncation

Keep compaction hard caps and deterministic fallback summaries inside their intended token budgets after switching to the shared Unicode-aware estimator. Add CJK-heavy regression coverage for both the summary cap path and fallback truncation, and add a patch changeset for the release notes.

Regeneration-Prompt: |
  Review PR Martian-Engineering#344's shared Unicode-aware token estimator for downstream callers that still assume 4 characters per token. Fix compaction so both the hard-cap path and the deterministic fallback truncate by estimated token budget instead of raw string length, preserving surrogate pairs and working for CJK-heavy or emoji-heavy text. Add regression tests in the compaction integration suite that prove capped summaries and fallback summaries stay within budget for CJK-heavy content, and add a patch changeset because this is user-visible compaction behavior.

---------

Co-authored-by: jet <dev@jetd.one>
Co-authored-by: Josh Lehman <josh@martian.engineering>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] CJK token estimation uses length/4 causing severe underestimation

3 participants