Fix CJK token underestimation in _estimate_tokens fallback #661
Merged
MaojiaSheng merged 1 commit into volcengine:main on Mar 16, 2026
Conversation
When tiktoken is unavailable, the fallback `len(text) // 3` severely
underestimates tokens for CJK text (Chinese/Japanese/Korean characters
are ~1-2 tokens each, not 0.33). This causes text exceeding the 8192-token
API limit to bypass chunking, resulting in BadRequestError.
Use `max(len(text) // 3, len(text.encode("utf-8")) // 4)` instead, which
picks the more conservative estimate. For ASCII-heavy text the char-based
estimate still wins; for CJK text the byte-based estimate correctly
produces ~0.75 tokens per character.
Fixes volcengine#616, fixes volcengine#634
Signed-off-by: JiangNan <1394485448@qq.com>
MaojiaSheng approved these changes on Mar 16, 2026
Summary
When `tiktoken` is unavailable, `_estimate_tokens()` falls back to `len(text) // 3`, which assumes ~3 characters per token. This is reasonable for Latin/ASCII text but severely underestimates token counts for CJK (Chinese, Japanese, Korean) text, where each character typically maps to 1–2 tokens.

The underestimation means CJK text that actually exceeds the 8192-token API limit gets an estimated count well below the threshold, so `_chunk_text()` never triggers chunking. The full text is then sent to the embedding API, which rejects it with a `BadRequestError`.
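To make the failure concrete, a rough illustration; the example string, its length, and the tokens-per-character figure are assumptions for the sake of the example, not measurements from the linked issues:

```python
# A long CJK document: 12,000 characters, each 3 bytes in UTF-8.
text = "深度学习模型训练" * 1500

estimate = len(text) // 3   # 4,000 -- comfortably "under" the 8192-token limit

# Real tokenizers map each CJK character to roughly 1-2 tokens, so the true
# count is on the order of 12,000-24,000. With the 4,000 estimate,
# _chunk_text() never splits the text, and the embedding API rejects the
# oversized request with a BadRequestError.
```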
Fix

Replace `len(text) // 3` with `max(len(text) // 3, len(text.encode("utf-8")) // 4)`.

Each CJK character is 3 bytes in UTF-8, so `len(text.encode("utf-8")) // 4` yields ~0.75 tokens per CJK character, much closer to the actual 1–2 range. The `max()` ensures whichever estimate is more conservative wins: the char-based estimate is still used for ASCII-heavy text, while the byte-based estimate kicks in for CJK-heavy text.

Fixes #616
Fixes #634
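For illustration, a minimal sketch of the fallback in context. The `_estimate_tokens` name and the `max()` expression come from the PR; the tiktoken encoding name and the surrounding function body are assumptions for the sketch, not the repository's actual code:

```python
def _estimate_tokens(text: str) -> int:
    """Estimate the token count of `text`, preferring tiktoken when available."""
    try:
        import tiktoken
        enc = tiktoken.get_encoding("cl100k_base")  # encoding name assumed for this sketch
        return len(enc.encode(text))
    except ImportError:
        # Fallback: take the more conservative of two heuristics.
        #   len(text) // 3                 -> ~3 chars per token, good for ASCII/Latin text
        #   len(text.encode("utf-8")) // 4 -> ~0.75 tokens per 3-byte CJK character
        return max(len(text) // 3, len(text.encode("utf-8")) // 4)
```

With this in place, ASCII-heavy text still uses the char-based estimate (e.g. 8,400 ASCII characters give `max(2800, 2100) == 2800`), while a 12,000-character CJK string yields `max(4000, 9000) == 9000` and correctly crosses the 8192 threshold, so chunking is triggered.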