fix(language): add threshold for ko/ru/ar detection to avoid misclass…#658

Merged
qin-ctx merged 1 commit into volcengine:main from KorenKrita:fix/language-detection
Mar 16, 2026
@KorenKrita
Contributor

The old logic detected Korean/Russian/Arabic with just 1 character, causing mixed CJK+Cyrillic text to be misclassified.

Changes:

  • Require >=2 chars AND >=10% of text for ko/ru/ar detection
  • Keep original detection order (ko/ru/ar first, then CJK)
  • Add test cases for mixed-language scenarios

Fixes an issue where Chinese text containing a single Cyrillic letter, such as the д in the emoticon (。>д<), was incorrectly detected as Russian.

Description

Fix language detection logic in memory_extractor.py to prevent misclassification of CJK text containing isolated Cyrillic/Arabic characters.

The original logic detected Korean/Russian/Arabic as long as there was
at least 1 matching character, causing mixed-language text like
"这是中文 Д 再继续" to be incorrectly classified as Russian.

Testing section:

Tested with the following scenarios:

  • "这是中文 Д 再继续" → correctly detected as zh-CN (was ru)
  • "これは日本語 Я " → correctly detected as ja (was ru)
  • "Это русский текст" → correctly detected as ru (unchanged)
  • "Hello Ф world" → correctly falls back to en (was ru)

This PR adds a threshold requirement: ko/ru/ar detection now requires at least 2 matching characters, and those characters must make up at least 10% of the non-whitespace text.
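
As a rough illustration of the threshold described above, here is a minimal sketch in Python. The helper name `passes_threshold`, the `CYRILLIC` pattern, and its Unicode range are assumptions for illustration; the actual code in `memory_extractor.py` may be structured differently.

```python
import re

# Assumed Unicode block for Cyrillic; the project's real pattern may differ.
CYRILLIC = re.compile(r"[\u0400-\u04FF]")


def passes_threshold(text, pattern, min_chars=2, min_ratio=0.10):
    """Hypothetical helper: True if `pattern` matches enough of the text."""
    # Count only non-whitespace characters, as the PR description states.
    total_chars = sum(1 for ch in text if not ch.isspace())
    if total_chars == 0:
        return False  # empty input can never meet a ratio
    score = len(pattern.findall(text))
    # Both conditions must hold: an absolute count and a share of the text.
    return score >= min_chars and score / total_chars >= min_ratio
```

For "这是中文 Д 再继续" the score is 1 out of 8 non-whitespace characters, so the count check alone rejects it. The ratio check matters for longer text: 2 Cyrillic letters in 30 characters is about 6.7%, which is below 10%.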

Related Issue

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test update

Changes Made

  • Modified _detect_output_language() in openviking/session/memory_extractor.py:

    • Added total_chars calculation for threshold comparison
    • Changed detection threshold from score > 0 to score >= 2 and score / total_chars >= 0.10
    • Added clarifying comment about the threshold purpose
  • Added test cases in tests/session/test_memory_extractor_language.py:

    • test_detect_output_language_chinese_with_single_cyrillic() - Chinese with 1 Cyrillic char → zh-CN
    • test_detect_output_language_japanese_with_single_cyrillic() - Japanese with 1 Cyrillic char → ja
    • test_detect_output_language_russian_with_threshold() - Russian text meeting threshold → ru
    • test_detect_output_language_insufficient_cyrillic_fallback() - 1 Cyrillic char among Latin → en

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have tested this on the following platforms:
    • Linux
    • macOS
    • Windows (or use your actual platform)

Checklist

  • My code follows the project's coding style
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Screenshots (if applicable)

Additional Notes

The detection order remains unchanged: ko/ru/ar are still checked before CJK, but now require meeting the threshold to return early. This preserves
the original design intent while fixing the false positive issue.
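
To make the ordering concrete, the following is a sketch of how the early-return checks might sit in front of the CJK checks. Everything here is an assumption for illustration: the function name `detect_language_sketch`, the Unicode ranges, the kana-before-Han order, and the `"en"` fallback are not taken from the project's source.

```python
import re

# Assumed script ranges; the real implementation may use different patterns.
THRESHOLD_SCRIPTS = [
    ("ko", re.compile(r"[\uAC00-\uD7AF]")),  # Hangul syllables
    ("ru", re.compile(r"[\u0400-\u04FF]")),  # Cyrillic
    ("ar", re.compile(r"[\u0600-\u06FF]")),  # Arabic
]
KANA = re.compile(r"[\u3040-\u30FF]")  # Hiragana and Katakana
HAN = re.compile(r"[\u4E00-\u9FFF]")   # CJK Unified Ideographs


def detect_language_sketch(text, fallback="en"):
    """Hypothetical stand-in for the detection order described in this PR."""
    total_chars = sum(1 for ch in text if not ch.isspace())
    if total_chars == 0:
        return fallback
    # ko/ru/ar are still checked first, but only return early when the
    # threshold (>= 2 chars and >= 10% of non-whitespace text) is met.
    for lang, pattern in THRESHOLD_SCRIPTS:
        score = len(pattern.findall(text))
        if score >= 2 and score / total_chars >= 0.10:
            return lang
    # CJK checks run afterwards; kana is tested before Han because
    # Japanese text usually contains both.
    if KANA.search(text):
        return "ja"
    if HAN.search(text):
        return "zh-CN"
    return fallback
```

With these assumptions, the four scenarios listed under testing resolve to `zh-CN`, `ja`, `ru`, and `en` respectively.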

@CLAassistant

CLAassistant commented Mar 16, 2026

CLA assistant check
All committers have signed the CLA.

@qin-ctx qin-ctx merged commit 59ff67b into volcengine:main Mar 16, 2026
1 check was pending
@github-project-automation github-project-automation bot moved this from Backlog to Done in OpenViking project Mar 16, 2026