Fix CJK token underestimation in _estimate_tokens fallback#661

Merged
MaojiaSheng merged 1 commit into volcengine:main from jnMetaCode:fix/cjk-token-estimation
Mar 16, 2026
@jnMetaCode
Contributor

Summary

When tiktoken is unavailable, _estimate_tokens() falls back to len(text) // 3, which assumes ~3 characters per token. This is reasonable for Latin/ASCII text but severely underestimates token counts for CJK (Chinese, Japanese, Korean) text, where each character typically maps to 1–2 tokens.

The underestimation means CJK text that actually exceeds the 8192-token API limit gets an estimated count well below the threshold, so _chunk_text() never triggers chunking. The full text is then sent to the embedding API, which rejects it with a BadRequestError.
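To make the failure concrete, here is a minimal sketch of the old fallback applied to a CJK string. The helper name `old_estimate`, the constant `TOKEN_LIMIT`, and the 9000-character sample are illustrative assumptions; only the `len(text) // 3` expression and the 8192-token limit come from the description above.

```python
TOKEN_LIMIT = 8192  # API limit cited in this PR


def old_estimate(text: str) -> int:
    """Old fallback: assumes roughly 3 characters per token."""
    return len(text) // 3


# Hypothetical input: 9000 Chinese characters. At 1-2 tokens per
# character, the real count would be somewhere around 9000-18000.
cjk_text = "中" * 9000

estimate = old_estimate(cjk_text)
needs_chunking = estimate > TOKEN_LIMIT

print(estimate, needs_chunking)  # 3000 False
```

Because the estimate stays under the limit, a chunking check of this shape would let the whole string through unchunked.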

Fix

Replace:

return len(text) // 3

With:

return max(len(text) // 3, len(text.encode("utf-8")) // 4)

Each CJK character is 3 bytes in UTF-8, so len(text.encode("utf-8")) // 4 yields ~0.75 tokens per CJK character, which is still below the actual 1–2 range but much closer than the ~0.33 produced by len(text) // 3. The max() picks whichever estimate is larger, and therefore more conservative: for ASCII-heavy text the byte count equals the character count, so the char-based estimate wins, while for CJK-heavy text the byte-based estimate takes over.
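The arithmetic can be checked with a short sketch. The function name `estimate_tokens_fallback` and the sample strings are assumptions for illustration; the return expression is the one proposed in this PR.

```python
def estimate_tokens_fallback(text: str) -> int:
    """Fallback estimate: the larger of the char-based and byte-based counts."""
    char_based = len(text) // 3                   # ~3 characters per token
    byte_based = len(text.encode("utf-8")) // 4   # ~4 UTF-8 bytes per token
    return max(char_based, byte_based)


ascii_text = "a" * 1200   # 1200 characters, 1200 bytes
cjk_text = "中" * 1200    # 1200 characters, 3600 bytes

print(estimate_tokens_fallback(ascii_text))  # 400 = max(400, 300)
print(estimate_tokens_fallback(cjk_text))    # 900 = max(400, 900)
```

For the same character count, the CJK string now gets 900 instead of 400, while the ASCII result is unchanged. At 0.75 tokens per character the CJK estimate still sits below the 1–2 range, so this narrows the gap rather than closing it.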

Fixes #616
Fixes #634

When tiktoken is unavailable, the fallback `len(text) // 3` severely
underestimates tokens for CJK text (Chinese/Japanese/Korean characters
are ~1-2 tokens each, not 0.33). This causes text exceeding the 8192-token
API limit to bypass chunking, resulting in BadRequestError.

Use `max(len(text) // 3, len(text.encode("utf-8")) // 4)` instead, which
picks the more conservative estimate. For ASCII-heavy text the char-based
estimate still wins; for CJK text the byte-based estimate correctly
produces ~0.75 tokens per character.

Fixes volcengine#616, fixes volcengine#634

Signed-off-by: JiangNan <1394485448@qq.com>
@CLAassistant

CLAassistant commented Mar 16, 2026

CLA assistant check
All committers have signed the CLA.

@MaojiaSheng MaojiaSheng merged commit 6471868 into volcengine:main Mar 16, 2026
1 check was pending
@github-project-automation github-project-automation bot moved this from Backlog to Done in OpenViking project Mar 16, 2026