fix: CJK-aware token estimation with shared utility (6× correction for CJK/emoji) by jetd1 · Pull Request #344 · Martian-Engineering/lossless-claw

jetd1 · 2026-04-09T13:21:07Z

Problem

estimateTokens() uses text.length / 4 to estimate token counts. In JavaScript, String.length counts UTF-16 code units, not Unicode code points. This causes severe underestimation for non-ASCII text.

CJK text (~6× underestimate)

Chinese/Japanese/Korean characters are typically tokenized at ~1.5 tokens per character, but length / 4 treats them as ~0.25 tokens per character (1 UTF-16 unit ÷ 4).

Emoji / Supplementary Plane (~2-4× underestimate)

Most emoji are encoded as UTF-16 surrogate pairs ("🔥".length === 2), so length / 4 gives 0.5 per emoji, rounded up to 1 token each. Real tokenization is typically 2-4 tokens per emoji.

Text	Old (`length/4`)	New (CJK-aware)	Actual ~tokens
`Hello world`	3	3	~3
`你好世界`	1	6	~6
`こんにちは`	2	8	~8
`안녕하세요`	2	8	~8
`🔥🎉💯`	2	6	~6-12
`mixed 你好 🔥`	4	7	~7-10
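
As a rough illustration, the "Old" column for the single-script rows can be reproduced with a few lines of TypeScript. The function name oldEstimateTokens is hypothetical; the formula (UTF-16 code units divided by 4, rounded up) is the one described above.

```typescript
// Sketch of the pre-PR formula as described above: UTF-16 code units / 4, rounded up.
// `oldEstimateTokens` is a hypothetical name used only for this illustration.
function oldEstimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

const samples: string[] = [
  "Hello world",
  "你好世界",
  "こんにちは",
  "안녕하세요",
  "\u{1F525}\u{1F389}\u{1F4AF}", // 🔥🎉💯 written as code point escapes
];

for (const s of samples) {
  // Array.from iterates by code point, so it counts each emoji once,
  // while String.length counts each surrogate pair as two units.
  console.log(
    s,
    "code units:", s.length,
    "code points:", Array.from(s).length,
    "old estimate:", oldEstimateTokens(s),
  );
}
```

For the emoji row, the code unit count is twice the code point count, which is the surrogate-pair effect described above. For CJK text the two counts match, so the underestimate there comes entirely from the fixed divide-by-4 ratio.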

Impact

When LCM underestimates token counts for CJK-heavy conversations:

Compaction triggers too late — context grows far beyond the configured threshold
Assembly budgets are effectively 6× too lenient for Chinese text
Root cause of a context explosion incident: 388K+ tokens accumulated before compaction triggered

Fix

Extract a shared src/estimate-tokens.ts utility and replace all 6 inline estimateTokens definitions across the codebase:

src/engine.ts
src/assembler.ts
src/compaction.ts
src/retrieval.ts
src/summarize.ts
src/plugin/lcm-doctor-apply.ts

The shared implementation uses for (const char of text) for correct Unicode code point iteration and applies per-character-class weighting:

CJK Ideographs (Extensions A-F), Hiragana, Katakana, Hangul, CJK Symbols/Punctuation, Fullwidth Forms: 1.5 tokens/char
Emoji / Supplementary Plane (cp > 0xFFFF): 2 tokens/char
ASCII / Latin: 0.25 tokens/char (unchanged)
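
The weighting above could be sketched roughly as follows. This is not the PR's actual source: the exact code point ranges are assumptions taken from the standard Unicode block boundaries, and the final Math.ceil is inferred from the rounded values in the comparison table.

```typescript
// Minimal sketch of a code-point-aware estimator. The ranges below are assumed
// from Unicode block boundaries; the real src/estimate-tokens.ts may differ.
const CJK_RANGES: ReadonlyArray<readonly [number, number]> = [
  [0x3000, 0x303f], // CJK Symbols and Punctuation
  [0x3040, 0x309f], // Hiragana
  [0x30a0, 0x30ff], // Katakana
  [0x3400, 0x4dbf], // CJK Unified Ideographs Extension A
  [0x4e00, 0x9fff], // CJK Unified Ideographs
  [0xac00, 0xd7af], // Hangul Syllables
  [0xff00, 0xffef], // Halfwidth and Fullwidth Forms
  [0x20000, 0x2ebef], // CJK Unified Ideographs Extensions B-F
];

function isCjk(cp: number): boolean {
  return CJK_RANGES.some(([lo, hi]) => cp >= lo && cp <= hi);
}

function estimateTokens(text: string): number {
  let total = 0;
  // for...of iterates by code point, so a surrogate pair is visited once.
  for (const char of text) {
    const cp = char.codePointAt(0) ?? 0;
    if (isCjk(cp)) {
      total += 1.5; // CJK classes, including supplementary-plane Han
    } else if (cp > 0xffff) {
      total += 2; // emoji and other supplementary-plane characters
    } else {
      total += 0.25; // ASCII / Latin and everything else in the BMP
    }
  }
  // Rounding up is an assumption consistent with the table (7.5 becomes 8).
  return Math.ceil(total);
}

console.log(estimateTokens("你好世界"));
console.log(estimateTokens("\u{1F525}\u{1F389}\u{1F4AF}"));
```

The CJK check runs before the cp > 0xFFFF check so that Extension B-F ideographs, which live in the supplementary plane, get the CJK weight rather than the emoji weight.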

Compared to other open PRs

	PR #47	PR #256	This PR
Call sites patched	2 of 6	5 of 6	6 of 6
Shared utility	❌ (inline)	✅	✅
Emoji / Supplementary Plane	❌	❌	✅
Hiragana / Katakana	❌	❌	✅
Hangul	❌	❌	✅
`lcm-doctor-apply.ts`	❌	❌	✅
Test coverage	❌	❌	✅

Tests

Added test/estimate-tokens.test.ts with 11 test cases covering:

ASCII text
CJK Han ideographs
Hiragana / Katakana
Hangul
Emoji / Supplementary Plane
Mixed text (ASCII + CJK + emoji)
CJK Extension B (supplementary plane Han)
Fullwidth forms
CJK punctuation
Empty string

All 636 tests pass (39 suites) including the new estimator tests.

Performance

The new estimator is O(n) in the text length, versus O(1) for the old formula, but the cost is negligible: the compaction bottleneck is the LLM call (seconds), not token estimation (microseconds).

Closes #47, Closes #250, Closes #256, Closes #266

Replace naive text.length/4 token estimation across all 6 call sites with a shared code-point-aware estimator in src/estimate-tokens.ts.

- CJK (Chinese/Japanese/Korean): ~1.5 tokens/char
- Emoji / Supplementary Plane: ~2 tokens/char
- ASCII / Latin: ~0.25 tokens/char (~4 chars/token)

The old formula used String.length (UTF-16 code units) which underestimates CJK by ~6x and emoji by ~2-4x, causing compaction to trigger far too late for non-English conversations.

Closes #47, Closes #250, Closes #256, Closes #266

Keep compaction hard caps and deterministic fallback summaries inside their intended token budgets after switching to the shared Unicode-aware estimator. Add CJK-heavy regression coverage for both the summary cap path and fallback truncation, and add a patch changeset for the release notes.

Regeneration-Prompt: |
  Review PR #344's shared Unicode-aware token estimator for downstream callers that still assume 4 characters per token. Fix compaction so both the hard-cap path and the deterministic fallback truncate by estimated token budget instead of raw string length, preserving surrogate pairs and working for CJK-heavy or emoji-heavy text. Add regression tests in the compaction integration suite that prove capped summaries and fallback summaries stay within budget for CJK-heavy content, and add a patch changeset because this is user-visible compaction behavior.
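
Truncating by estimated token budget rather than raw string length might look something like the sketch below. The names tokenWeight and truncateToTokenBudget are hypothetical, and the per-character weights simply mirror the three classes from the PR description; the actual compaction code is not shown in this PR text and may work differently.

```typescript
// Hypothetical helper: weight of a single code point, mirroring the PR's
// three classes (CJK 1.5, supplementary plane 2, everything else 0.25).
// The ranges are assumed from Unicode block boundaries.
function tokenWeight(cp: number): number {
  const cjk =
    (cp >= 0x3000 && cp <= 0x30ff) || // CJK punctuation, Hiragana, Katakana
    (cp >= 0x3400 && cp <= 0x4dbf) || // Extension A
    (cp >= 0x4e00 && cp <= 0x9fff) || // CJK Unified Ideographs
    (cp >= 0xac00 && cp <= 0xd7af) || // Hangul Syllables
    (cp >= 0xff00 && cp <= 0xffef) || // Halfwidth and Fullwidth Forms
    (cp >= 0x20000 && cp <= 0x2ebef); // Extensions B-F
  if (cjk) return 1.5;
  return cp > 0xffff ? 2 : 0.25;
}

// Hypothetical sketch: keep the longest prefix whose estimated token count
// (rounded up) fits in maxTokens, cutting only on code point boundaries.
function truncateToTokenBudget(text: string, maxTokens: number): string {
  let used = 0;
  let end = 0; // cut position, measured in UTF-16 code units
  for (const char of text) {
    const next = used + tokenWeight(char.codePointAt(0) ?? 0);
    if (Math.ceil(next) > maxTokens) break;
    used = next;
    end += char.length; // 2 for a surrogate pair, so pairs are never split
  }
  return text.slice(0, end);
}

console.log(truncateToTokenBudget("你好世界", 3));
console.log(truncateToTokenBudget("\u{1F525}\u{1F389}\u{1F4AF}", 3));
```

Because the cut position advances by char.length per code point, the slice can never end between the two halves of a surrogate pair, which a plain text.slice(0, maxTokens * 4) could do.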

jalehman · 2026-04-09T20:30:39Z

Thank you!

jetd1 force-pushed the fix/cjk-token-v2 branch from 7e1d8e6 to d177fee · April 9, 2026 13:24

jetd1 force-pushed the fix/cjk-token-v2 branch from d177fee to 6a9af64 · April 9, 2026 13:25

jalehman merged commit 897a953 into Martian-Engineering:main Apr 9, 2026
1 check passed

github-actions Bot mentioned this pull request Apr 9, 2026

chore: version packages #348

Merged

jalehman mentioned this pull request Apr 16, 2026

[Feature Request] Add CJK-aware token estimation (like win4r fork) #452

Closed

jalehman merged 2 commits into Martian-Engineering:main from jetd1:fix/cjk-token-v2