fix(language): add threshold for ko/ru/ar detection to avoid misclassification #658
Merged
qin-ctx merged 1 commit into volcengine:main on Mar 16, 2026

The old logic detected Korean/Russian/Arabic with just 1 character, causing mixed CJK+Cyrillic text to be misclassified.

Changes:
- Require >=2 chars AND >=10% of text for ko/ru/ar detection
- Keep original detection order (ko/ru/ar first, then CJK)
- Add test cases for mixed-language scenarios

Fixes issue where Chinese text with a single Cyrillic letter such as (。>д<) was incorrectly detected as Russian.
qin-ctx approved these changes on Mar 16, 2026
Description
Fix language detection logic in memory_extractor.py to prevent misclassification of CJK text containing isolated Cyrillic/Arabic characters.

The original logic detected Korean/Russian/Arabic as long as there was at least 1 matching character, causing mixed-language text like "这是中文 Д 再继续" to be incorrectly classified as Russian.
And the Testing section:
Tested with the following scenarios:
- zh-CN (was ru)
- ja (was ru)
- ru (unchanged)
- en (was ru)

This PR adds a threshold requirement: ko/ru/ar detection now requires at least 2 characters AND those characters must constitute at least 10% of the non-whitespace text.
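
The threshold rule can be illustrated with a minimal sketch. The helper name `meets_threshold` and the Cyrillic range used here are assumptions for illustration; the PR applies the condition inline in `_detect_output_language()`, and the real patterns may differ.

```python
import re

# Assumed Cyrillic range (basic Cyrillic block); not taken from the PR.
CYRILLIC = re.compile(r"[\u0400-\u04ff]")


def meets_threshold(text: str, pattern: re.Pattern) -> bool:
    """Hypothetical helper: True if `pattern` matches at least 2 characters
    and those matches make up at least 10% of the non-whitespace text."""
    total_chars = len(re.sub(r"\s", "", text))  # count non-whitespace characters
    if total_chars == 0:
        return False  # avoid division by zero on empty or blank input
    score = len(pattern.findall(text))  # one match per character
    return score >= 2 and score / total_chars >= 0.10


# The example from the description: 8 non-whitespace characters, 1 Cyrillic.
mixed = "这是中文 Д 再继续"
russian = "Привет мир"
print(meets_threshold(mixed, CYRILLIC))
print(meets_threshold(russian, CYRILLIC))
```

For the mixed sample the ratio is 1/8 = 12.5%, which would pass the percentage check on its own; it is the minimum count of 2 that rejects it. This is one reason to require both conditions rather than the ratio alone.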
Related Issue
Type of Change
Changes Made
Modified _detect_output_language() in openviking/session/memory_extractor.py:
- Added total_chars calculation for threshold comparison
- Changed the detection condition from score > 0 to score >= 2 and score / total_chars >= 0.10

Added test cases in tests/session/test_memory_extractor_language.py:
- test_detect_output_language_chinese_with_single_cyrillic() - Chinese with 1 Cyrillic char → zh-CN
- test_detect_output_language_japanese_with_single_cyrillic() - Japanese with 1 Cyrillic char → ja
- test_detect_output_language_russian_with_threshold() - Russian text meeting threshold → ru
- test_detect_output_language_insufficient_cyrillic_fallback() - 1 Cyrillic char among Latin → en

Testing
Checklist
Screenshots (if applicable)
Additional Notes
The detection order remains unchanged: ko/ru/ar are still checked before CJK, but now require meeting the threshold to return early. This preserves
the original design intent while fixing the false positive issue.
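
A rough sketch of the detection order described above, under stated assumptions: the function name `detect_language_sketch`, the Unicode ranges, the kana-before-Han CJK check, and the `en` fallback are all illustrative and are not taken from memory_extractor.py.

```python
import re

# Assumed Unicode ranges; the actual patterns in the project may differ.
EARLY_SCRIPTS = [
    ("ko", re.compile(r"[\uac00-\ud7af]")),  # Hangul syllables
    ("ru", re.compile(r"[\u0400-\u04ff]")),  # Cyrillic
    ("ar", re.compile(r"[\u0600-\u06ff]")),  # Arabic
]
KANA = re.compile(r"[\u3040-\u30ff]")  # Hiragana and Katakana
HAN = re.compile(r"[\u4e00-\u9fff]")   # CJK Unified Ideographs


def detect_language_sketch(text: str, fallback: str = "en") -> str:
    """Hypothetical stand-in for _detect_output_language()."""
    total_chars = len(re.sub(r"\s", "", text))
    if total_chars == 0:
        return fallback
    # ko/ru/ar are still checked first, but only return early
    # when they meet the threshold.
    for code, pattern in EARLY_SCRIPTS:
        score = len(pattern.findall(text))
        if score >= 2 and score / total_chars >= 0.10:
            return code
    # CJK checks run afterwards (assumed: kana implies Japanese, else Han implies Chinese).
    if KANA.search(text):
        return "ja"
    if HAN.search(text):
        return "zh-CN"
    return fallback


for sample in ["这是中文 Д 再继续", "これは日本語 Д です", "Привет, как дела?", "Hello Д world"]:
    print(sample, "->", detect_language_sketch(sample))
```

With these assumed ranges, the four samples mirror the scenarios covered by the new tests: a single Cyrillic character no longer triggers an early `ru` return, so the CJK checks or the fallback decide the result.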