fix(memory): add CJK support to MMR tokenize() for diversity detection #28016
buyitsydney wants to merge 2 commits into openclaw:main
Conversation
The tokenize() function in mmr.ts only matched ASCII [a-z0-9_]+, returning an empty set for pure CJK text. This made Jaccard similarity always return 1 (both sets empty = identical) or 0 (one side has ASCII), rendering MMR diversity detection completely ineffective for CJK content.

Added CJK character (unigram) and bigram extraction:
- Individual CJK characters for broad matching
- Consecutive character pairs for phrase-level similarity
- Language-agnostic: works for Chinese, Japanese kanji, Korean hanja
- No external tokenizer dependency

Fixes openclaw#28000
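The approach described above can be sketched as follows. This is a minimal illustration of the first commit's behavior, not the exact patch; the function body is an assumption reconstructed from the diff fragments quoted in the reviews below.

```typescript
// Sketch (assumed, not the merged code): ASCII word tokens plus CJK unigrams
// and bigrams collected into a single token set for Jaccard comparison.
const CJK = /[\u4e00-\u9fff\u3400-\u4dbf]/;

function tokenize(text: string): Set<string> {
  const lower = text.toLowerCase();
  const tokens: string[] = lower.match(/[a-z0-9_]+/g) ?? []; // ASCII tokens
  const cjkChars = Array.from(lower).filter((c) => CJK.test(c));
  tokens.push(...cjkChars); // unigrams for broad matching
  for (let i = 0; i < cjkChars.length - 1; i++) {
    tokens.push(cjkChars[i] + cjkChars[i + 1]); // bigrams for phrase-level similarity
  }
  return new Set(tokens);
}
```

For example, tokenize("你好 world") yields the tokens "world", "你", "好", and the bigram "你好".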
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0d0f26f2a2
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
src/memory/mmr.ts
Outdated
```typescript
return new Set(tokens);
const lower = text.toLowerCase();
const ascii = lower.match(/[a-z0-9_]+/g) ?? [];
const cjkChars = Array.from(lower).filter((c) => /[\u4e00-\u9fff\u3400-\u4dbf]/.test(c));
```
Expand CJK matcher to include kana and hangul
The new tokenizer only matches \u3400-\u4DBF and \u4E00-\u9FFF, which are Han ideographs; kana- or hangul-only text still produces an empty token set. In this code path, two empty sets are treated as similarity 1, so snippets like こんにちは vs さようなら (or Korean hangul sentences) will still be treated as identical and MMR diversity remains ineffective for large portions of Japanese/Korean content.
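The failure mode the reviewer describes is easy to reproduce with a typical Jaccard implementation. This is a sketch; the actual jaccardSimilarity in mmr.ts may differ, and the Han-only token regex below mirrors the ranges quoted in the diff.

```typescript
// Sketch: a common Jaccard similarity that treats two empty sets as identical.
function jaccardSimilarity(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1; // both empty => "identical"
  const inter = [...a].filter((t) => b.has(t)).length;
  const union = new Set([...a, ...b]).size;
  return inter / union;
}

// A Han-only matcher leaves kana-only strings with empty token sets.
const hanOnly = /[a-z0-9_]+|[\u4e00-\u9fff\u3400-\u4dbf]/g;
const tok = (s: string) => new Set(s.match(hanOnly) ?? []);
// "こんにちは" vs "さようなら": two empty sets, so similarity is 1.
```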
src/memory/mmr.ts
Outdated
```typescript
const cjkChars = Array.from(lower).filter((c) => /[\u4e00-\u9fff\u3400-\u4dbf]/.test(c));
const bigrams: string[] = [];
for (let i = 0; i < cjkChars.length - 1; i++) {
  bigrams.push(cjkChars[i] + cjkChars[i + 1]);
```
Build CJK bigrams only from contiguous runs
Bigrams are generated after filtering all Han characters out of the full string, so separators are lost and non-adjacent characters get merged into fake bigrams (for example, 你a好 yields 你好). This inflates token overlap across unrelated snippets and can incorrectly penalize diverse results during MMR reranking whenever Han characters are separated by ASCII, punctuation, or whitespace.
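One way to implement the contiguous-run suggestion is to extract whole Han runs first and build bigrams only within each run. This is a sketch of the reviewer's idea, not the PR's actual fix:

```typescript
// Sketch: bigrams only within contiguous Han runs, so "你a好" yields no
// "你好" bigram while the unbroken "你好" does.
function cjkBigrams(text: string): string[] {
  const runs = text.match(/[\u4e00-\u9fff\u3400-\u4dbf]+/g) ?? []; // contiguous Han runs
  const bigrams: string[] = [];
  for (const run of runs) {
    const chars = Array.from(run);
    for (let i = 0; i < chars.length - 1; i++) {
      bigrams.push(chars[i] + chars[i + 1]);
    }
  }
  return bigrams;
}
```

With this version, "我喜欢hello world你好" produces the bigrams 我喜, 喜欢, 你好, and never the spurious 欢你.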
Greptile Summary

Added CJK (Chinese/Japanese/Korean) tokenization support to MMR diversity detection by extracting CJK characters as unigrams and consecutive character pairs as bigrams.

Confidence Score: 3/5
Last reviewed commit: 0d0f26f
src/memory/mmr.ts
Outdated
```typescript
const cjkChars = Array.from(lower).filter((c) => /[\u4e00-\u9fff\u3400-\u4dbf]/.test(c));
const bigrams: string[] = [];
for (let i = 0; i < cjkChars.length - 1; i++) {
  bigrams.push(cjkChars[i] + cjkChars[i + 1]);
}
```
bigrams created from non-consecutive CJK characters when ASCII separates them
For example, with input "我喜欢hello world你好", cjkChars becomes ["我", "喜", "欢", "你", "好"], creating the bigram "欢你" even though these characters aren't consecutive in the original text (separated by "hello world").
Consider tracking original positions:

```typescript
const chars = Array.from(lower);
const cjkData: { char: string; index: number }[] = [];
chars.forEach((c, i) => {
  if (/[\u4e00-\u9fff\u3400-\u4dbf]/.test(c)) {
    cjkData.push({ char: c, index: i });
  }
});
const bigrams: string[] = [];
for (let i = 0; i < cjkData.length - 1; i++) {
  if (cjkData[i + 1].index === cjkData[i].index + 1) {
    bigrams.push(cjkData[i].char + cjkData[i + 1].char);
  }
}
const cjkChars = cjkData.map(d => d.char);
```
- Expand CJK regex to include Hiragana, Katakana, Hangul Syllables, and Hangul Jamo — Japanese/Korean-only text now tokenizes correctly
- Track original character positions so bigrams are only created from adjacent characters (fixes spurious bigrams like '欢你' from '欢hello你')
- Add test cases: Japanese hiragana, Korean hangul, non-adjacent guard
Addressed both review comments in e102c9f:

P1 — Kana/Hangul support: Expanded the CJK regex to include Hiragana, Katakana, Hangul Syllables, and Hangul Jamo.

P2 — Non-adjacent bigrams: Now tracking original character positions so bigrams are only created from characters that are actually consecutive in the source text.

All 25 tests pass. Thanks for the thorough reviews! 🙏
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e102c9f415
```typescript
 * - Hiragana & Katakana (Japanese)
 * - Hangul Syllables & Jamo (Korean)
 */
const CJK_RE = /[\u3040-\u309f\u30a0-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af\u1100-\u11ff]/;
```
Include astral Han code points in CJK regex
The new CJK_RE only matches BMP blocks, so Han characters in supplementary planes (for example U+20000 from CJK Extension B) are still dropped by tokenize. When a snippet is composed of those characters, tokenization can still return an empty set and jaccardSimilarity will treat two unrelated snippets as identical (1), which defeats MMR diversity for that subset of Chinese text. Please expand matching to supplementary Han code points (or use Unicode script properties with the u flag) so these inputs are actually covered.
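The Unicode-script-property alternative the reviewer mentions can be sketched with a `u`-flag property escape, which covers supplementary-plane Han code points without enumerating block ranges:

```typescript
// Sketch: \p{Script=Han} with the /u flag matches astral Han characters
// (e.g. U+20000 from CJK Extension B) that BMP-only ranges miss.
const bmpHan = /[\u3400-\u4dbf\u4e00-\u9fff]/;      // BMP blocks only
const scriptHan = /\p{Script=Han}/u;                // script property, all planes

const extB = "\u{20000}"; // a CJK Extension B character (supplementary plane)
// bmpHan.test(extB) is false; scriptHan.test(extB) is true.
```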
Closing in favor of #29396, which now includes the full CJK/Kana/Hangul coverage from this PR, plus an existing approval. All changes are consolidated there.
Problem

The tokenize() function in src/memory/mmr.ts only matches ASCII tokens via /[a-z0-9_]+/g, returning an empty set for pure CJK (Chinese/Japanese/Korean) text. This makes MMR diversity detection completely ineffective for CJK content: two empty token sets yield Jaccard similarity = 1, so distinct CJK snippets are treated as identical.

Fix

Added CJK character and bigram extraction to tokenize():
- Individual CJK characters (unigrams) for broad matching
- Consecutive character pairs (bigrams) for phrase-level similarity

Tests

Added 3 new test cases for CJK tokenization + 2 for CJK text similarity. All 25 existing tests pass unchanged.

Related

Fixes openclaw#28000
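As an illustration of the behavior the new tests cover, the sketch below uses the expanded CJK_RE quoted in the review above; the tokenizer and similarity bodies are assumptions for illustration, not the exact merged code:

```typescript
// Sketch: unigram-only tokenizer using the PR's expanded CJK_RE.
// Hiragana and hangul now produce non-empty token sets, so distinct
// Japanese/Korean snippets no longer collapse to similarity 1.
const CJK_RE = /[\u3040-\u309f\u30a0-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af\u1100-\u11ff]/;

function tokenizeSketch(text: string): Set<string> {
  const lower = text.toLowerCase();
  const tokens: string[] = lower.match(/[a-z0-9_]+/g) ?? [];
  for (const c of Array.from(lower)) {
    if (CJK_RE.test(c)) tokens.push(c);
  }
  return new Set(tokens);
}

function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1;
  const inter = [...a].filter((t) => b.has(t)).length;
  return inter / new Set([...a, ...b]).size;
}
```

With this sketch, "こんにちは" and "さようなら" each tokenize to five distinct characters and share none, so their similarity is 0 instead of the old degenerate 1.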