fix: CJK-aware token estimation with shared utility (6× correction for CJK/emoji)#344
Merged
Merged
Conversation
Replace naive text.length/4 token estimation across all 6 call sites with a shared code-point-aware estimator in src/estimate-tokens.ts. - CJK (Chinese/Japanese/Korean): ~1.5 tokens/char - Emoji / Supplementary Plane: ~2 tokens/char - ASCII / Latin: ~0.25 tokens/char (~4 chars/token) The old formula used String.length (UTF-16 code units) which underestimates CJK by ~6x and emoji by ~2-4x, causing compaction to trigger far too late for non-English conversations. Closes #47, Closes #250, Closes #256, Closes #266
Keep compaction hard caps and deterministic fallback summaries inside their intended token budgets after switching to the shared Unicode-aware estimator. Add CJK-heavy regression coverage for both the summary cap path and fallback truncation, and add a patch changeset for the release notes. Regeneration-Prompt: | Review PR #344's shared Unicode-aware token estimator for downstream callers that still assume 4 characters per token. Fix compaction so both the hard-cap path and the deterministic fallback truncate by estimated token budget instead of raw string length, preserving surrogate pairs and working for CJK-heavy or emoji-heavy text. Add regression tests in the compaction integration suite that prove capped summaries and fallback summaries stay within budget for CJK-heavy content, and add a patch changeset because this is user-visible compaction behavior.
Contributor
|
Thank you! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
estimateTokens()usestext.length / 4to estimate token counts. In JavaScript,String.lengthcounts UTF-16 code units, not Unicode code points. This causes severe underestimation for non-ASCII text.CJK text (~6× underestimate)
Chinese/Japanese/Korean characters are typically tokenized at ~1.5 tokens per character, but
length / 4treats them as ~0.25 tokens per character (1 UTF-16 unit ÷ 4).Emoji / Supplementary Plane (~2-4× underestimate)
Emoji are UTF-16 surrogate pairs (
🔥.length === 2), solength / 4 = 0.5, rounded up to 1 token each. Real tokenization is typically 2-4 tokens per emoji.length/4)Hello world你好世界こんにちは안녕하세요🔥🎉💯mixed 你好 🔥Impact
When LCM underestimates token counts for CJK-heavy conversations:
Fix
Extract a shared
src/estimate-tokens.tsutility and replace all 6 inlineestimateTokensdefinitions across the codebase:src/engine.tssrc/assembler.tssrc/compaction.tssrc/retrieval.tssrc/summarize.tssrc/plugin/lcm-doctor-apply.tsThe shared implementation uses
for (const char of text)for correct Unicode code point iteration and applies per-character-class weighting:Compared to other open PRs
lcm-doctor-apply.tsTests
Added
test/estimate-tokens.test.tswith 11 test cases covering:All 636 tests pass (39 suites) including the new estimator tests.
Performance
O(n) vs O(1) but negligible — compaction bottleneck is the LLM call (seconds), not token estimation (microseconds).
Closes #47, Closes #250, Closes #256, Closes #266