fix(language): add threshold for ko/ru/ar detection to avoid misclass… by KorenKrita · Pull Request #658 · volcengine/OpenViking

KorenKrita · 2026-03-16T09:09:14Z

The old logic detected Korean/Russian/Arabic with just 1 character, causing mixed CJK+Cyrillic text to be misclassified.

Changes:

- Require >=2 chars AND >=10% of text for ko/ru/ar detection
- Keep original detection order (ko/ru/ar first, then CJK)
- Add test cases for mixed-language scenarios

Fixes issue where Chinese text with a single Cyrillic letter such as (｡>д<) was incorrectly detected as Russian.

Description

Fix language detection logic in memory_extractor.py to prevent misclassification of CJK text containing isolated Cyrillic/Arabic characters.

The original logic detected Korean/Russian/Arabic as long as there was
at least 1 matching character, causing mixed-language text like
"这是中文 Д 再继续" to be incorrectly classified as Russian.

And the Testing section:

Tested with the following scenarios:

- "这是中文 Д 再继续" → correctly detected as zh-CN (was ru)
- "これは日本語 Я " → correctly detected as ja (was ru)
- "Это русский текст" → correctly detected as ru (unchanged)
- "Hello Ф world" → correctly falls back to en (was ru)

This PR adds a threshold requirement: ko/ru/ar detection now requires at least 2 characters AND those characters must constitute at least 10% of the
non-whitespace text.
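
The threshold described above can be sketched as a small predicate. This is only an illustration, not the code from memory_extractor.py: the helper names `meets_threshold` and `count_non_whitespace` and the parameter names `min_chars` and `min_ratio` are hypothetical, and only the two conditions (score >= 2 and score / total_chars >= 0.10) come from this PR.

```python
def count_non_whitespace(text: str) -> int:
    """Count the characters that are not whitespace (the ratio's denominator)."""
    return sum(1 for ch in text if not ch.isspace())


def meets_threshold(score: int, total_chars: int,
                    min_chars: int = 2, min_ratio: float = 0.10) -> bool:
    """Return True only if both the count and the ratio conditions hold.

    `score` is the number of characters matching one script (hypothetical name).
    """
    if total_chars == 0:
        # Empty or whitespace-only text can never meet the threshold.
        return False
    return score >= min_chars and score / total_chars >= min_ratio


# "这是中文 Д 再继续" has 8 non-whitespace characters, 1 of them Cyrillic.
mixed_total = count_non_whitespace("这是中文 Д 再继续")
mixed_ok = meets_threshold(1, mixed_total)      # fails the count condition

# "Это русский текст" has 15 non-whitespace characters, all Cyrillic.
russian_total = count_non_whitespace("Это русский текст")
russian_ok = meets_threshold(15, russian_total)  # passes both conditions
```

Both conditions matter: the count condition rejects a lone stray character in a short string, and the ratio condition rejects a couple of stray characters in a long one (for example, 2 matches out of 25 characters is 8%).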

Related Issue

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Refactoring (no functional changes)
Performance improvement
Test update

Changes Made

Modified _detect_output_language() in openviking/session/memory_extractor.py:
- Added total_chars calculation for threshold comparison
- Changed detection threshold from score > 0 to score >= 2 and score / total_chars >= 0.10
- Added clarifying comment about the threshold purpose
Added test cases in tests/session/test_memory_extractor_language.py:
- test_detect_output_language_chinese_with_single_cyrillic() - Chinese with 1 Cyrillic char → zh-CN
- test_detect_output_language_japanese_with_single_cyrillic() - Japanese with 1 Cyrillic char → ja
- test_detect_output_language_russian_with_threshold() - Russian text meeting threshold → ru
- test_detect_output_language_insufficient_cyrillic_fallback() - 1 Cyrillic char among Latin → en

Testing

I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I have tested this on the following platforms:
- Linux
- macOS
- Windows (or use your actual platform)

Checklist

My code follows the project's coding style
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
Any dependent changes have been merged and published

Screenshots (if applicable)

Additional Notes

The detection order remains unchanged: ko/ru/ar are still checked before CJK, but now require meeting the threshold to return early. This preserves
the original design intent while fixing the false positive issue.
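
For reviewers, here is a minimal sketch of the detection order described above, under stated assumptions. The function name `detect_language_sketch`, the Unicode ranges, the kana-before-Han CJK step, and the `fallback` parameter are all assumptions made for illustration; the actual `_detect_output_language()` may use different patterns and a different CJK strategy. Only the ordering (ko/ru/ar first, then CJK) and the threshold come from this PR.

```python
import re

# Assumed Unicode ranges; the real patterns in memory_extractor.py may differ.
HANGUL = re.compile(r"[\uac00-\ud7af]")
CYRILLIC = re.compile(r"[\u0400-\u04ff]")
ARABIC = re.compile(r"[\u0600-\u06ff]")
KANA = re.compile(r"[\u3040-\u30ff]")
HAN = re.compile(r"[\u4e00-\u9fff]")


def detect_language_sketch(text: str, fallback: str = "en") -> str:
    """Illustrative stand-in for the detection order, not the project's code."""
    total_chars = sum(1 for ch in text if not ch.isspace())
    if total_chars == 0:
        return fallback

    # Step 1: ko/ru/ar are still checked first, but they only return early
    # when they meet the threshold (>= 2 chars and >= 10% of the text).
    for lang, pattern in (("ko", HANGUL), ("ru", CYRILLIC), ("ar", ARABIC)):
        score = len(pattern.findall(text))
        if score >= 2 and score / total_chars >= 0.10:
            return lang

    # Step 2: CJK check (assumed: any kana means Japanese, otherwise Han
    # characters mean Chinese).
    if KANA.search(text):
        return "ja"
    if HAN.search(text):
        return "zh-CN"

    return fallback
```

Under these assumptions the four scenarios from the Testing notes behave as listed: the two CJK strings with a single Cyrillic letter fall through step 1 and are classified in step 2, the fully Russian string returns early in step 1, and the Latin string with one Cyrillic letter reaches the fallback.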


CLAassistant · 2026-03-16T09:09:24Z

All committers have signed the CLA.

github-project-automation bot added this to OpenViking project Mar 16, 2026

github-project-automation bot moved this to Backlog in OpenViking project Mar 16, 2026

qin-ctx approved these changes Mar 16, 2026


qin-ctx merged commit 59ff67b into volcengine:main Mar 16, 2026
1 check was pending

github-project-automation bot moved this from Backlog to Done in OpenViking project Mar 16, 2026

qin-ctx merged 1 commit into volcengine:main from KorenKrita:fix/language-detection
