fix(memory): MMR tokenize() returns empty set for CJK text#29396
fix(memory): MMR tokenize() returns empty set for CJK text#29396buyitsydney wants to merge 1 commit intoopenclaw:mainfrom
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4d327133b8
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
src/memory/mmr.ts
Outdated
| const cjkChars = Array.from(lower).filter((c) => /[\u4e00-\u9fff\u3400-\u4dbf]/.test(c)); | ||
| const bigrams: string[] = []; | ||
| for (let i = 0; i < cjkChars.length - 1; i++) { | ||
| bigrams.push(cjkChars[i] + cjkChars[i + 1]); |
There was a problem hiding this comment.
Build CJK bigrams from contiguous text only
In tokenize, CJK bigrams are built after filtering the full string down to only CJK characters, so separators are discarded before pairing. This creates tokens that never appear contiguously in the source (for example, "你 abc 好" yields the synthetic bigram "你好"), which inflates Jaccard overlap and can cause MMR to over-penalize otherwise diverse snippets. Generating bigrams per contiguous CJK run avoids this false similarity.
Useful? React with 👍 / 👎.
Greptile SummaryFixed CJK text tokenization in MMR algorithm by extracting individual Han characters (U+4E00–U+9FFF, U+3400–U+4DBF) and character bigrams, resolving the issue where Key changes:
Issue found:
Note on scope:
Confidence Score: 3/5
Last reviewed commit: 4d32713 |
src/memory/mmr.ts
Outdated
| const cjkChars = Array.from(lower).filter((c) => /[\u4e00-\u9fff\u3400-\u4dbf]/.test(c)); | ||
| const bigrams: string[] = []; | ||
| for (let i = 0; i < cjkChars.length - 1; i++) { | ||
| bigrams.push(cjkChars[i] + cjkChars[i + 1]); |
There was a problem hiding this comment.
Bigram generation creates incorrect bigrams when CJK characters are separated by ASCII text. For example, tokenize("A你B好C") would generate the bigram "你好" even though these characters aren't adjacent in the original text (separated by "B").
To fix, track character positions and only generate bigrams for truly adjacent CJK characters:
const chars = Array.from(lower);
const cjkChars: string[] = [];
const bigrams: string[] = [];
for (let i = 0; i < chars.length; i++) {
if (/[\u4e00-\u9fff\u3400-\u4dbf]/.test(chars[i])) {
cjkChars.push(chars[i]);
if (i + 1 < chars.length && /[\u4e00-\u9fff\u3400-\u4dbf]/.test(chars[i + 1])) {
bigrams.push(chars[i] + chars[i + 1]);
}
}
}Prompt To Fix With AI
This is a comment left during a code review.
Path: src/memory/mmr.ts
Line: 35-38
Comment:
Bigram generation creates incorrect bigrams when CJK characters are separated by ASCII text. For example, `tokenize("A你B好C")` would generate the bigram `"你好"` even though these characters aren't adjacent in the original text (separated by `"B"`).
To fix, track character positions and only generate bigrams for truly adjacent CJK characters:
```typescript
const chars = Array.from(lower);
const cjkChars: string[] = [];
const bigrams: string[] = [];
for (let i = 0; i < chars.length; i++) {
if (/[\u4e00-\u9fff\u3400-\u4dbf]/.test(chars[i])) {
cjkChars.push(chars[i]);
if (i + 1 < chars.length && /[\u4e00-\u9fff\u3400-\u4dbf]/.test(chars[i + 1])) {
bigrams.push(chars[i] + chars[i + 1]);
}
}
}
```
How can I resolve this? If you propose a fix, please make it concise.
nikolasdehor
left a comment
There was a problem hiding this comment.
Good fix — the CJK bigram approach is well-suited for logographic text where whitespace tokenization fails. Two observations: (1) The Unicode ranges cover CJK Unified Ideographs but miss Japanese kana (\u3040-\u30ff), Korean Hangul syllables (\uac00-\ud7af), and CJK Extension B+. If the goal is full CJK support, consider widening the range in a follow-up. (2) Nit: the blank line between tokenize() and jaccardSimilarity() was removed — please restore for readability. LGTM overall.
4d32713 to
5c1c157
Compare
|
Good catch @greptile-apps — fixed in the updated commit. What changed: Bigrams are now only generated between positionally adjacent CJK characters in the original text. Previously, all CJK characters were extracted first (losing position info), then bigrams were generated — which could pair characters separated by ASCII. Added a test case to verify: All 29 tests pass. |
|
Perfect! The fix is correct. By checking |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5c1c157158
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
src/memory/mmr.ts
Outdated
| return new Set(tokens); | ||
| const lower = text.toLowerCase(); | ||
| const ascii = lower.match(/[a-z0-9_]+/g) ?? []; | ||
| const isCJK = (c: string): boolean => /[\u4e00-\u9fff\u3400-\u4dbf]/.test(c); |
There was a problem hiding this comment.
Support Hangul and Kana when tokenizing CJK text
The new CJK tokenizer only matches Han ideograph ranges (\u3400-\u4DBF, \u4E00-\u9FFF), so Korean Hangul and Japanese kana inputs still produce empty token sets. In those cases, two unrelated snippets in those scripts are treated as maximally similar (jaccardSimilarity returns 1 when both sets are empty), so MMR diversity remains broken for a large part of CJK-language content. Expanding the script coverage (or using Unicode script properties) is needed for this fix to actually cover CJK text broadly.
Useful? React with 👍 / 👎.
…ity detection The tokenize() function only matched [a-z0-9_]+ patterns, returning an empty set for CJK-only text. This made Jaccard similarity always 0 (or always 1 for two empty sets) for CJK content, effectively disabling MMR diversity detection. Add support for: - CJK Unified Ideographs (U+4E00–U+9FFF, U+3400–U+4DBF) - Hiragana (U+3040–U+309F) and Katakana (U+30A0–U+30FF) - Hangul Syllables (U+AC00–U+D7AF) and Jamo (U+1100–U+11FF) Characters are extracted as unigrams, and bigrams are generated only from characters that are adjacent in the original text (no spurious bigrams across ASCII boundaries). Fixes openclaw#28000
5c1c157 to
2dfc47f
Compare
|
Updated: merged the broader CJK coverage from #28016 into this PR (which already has an Approve). What changed in this push:
Closing #28016 in favor of this PR. All 25 tests pass. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2dfc47fa92
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| * - Hiragana & Katakana (Japanese) | ||
| * - Hangul Syllables & Jamo (Korean) | ||
| */ | ||
| const CJK_RE = /[\u3040-\u309f\u30a0-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af\u1100-\u11ff]/; |
There was a problem hiding this comment.
Match astral Han ideographs when tokenizing CJK text
CJK_RE only includes BMP ranges, so Han Extension B+ characters (for example 𠀀) are not recognized as CJK tokens; tokenize("𠀀𠀁") currently returns an empty set. In that case, two unrelated snippets composed of those ideographs still hit the jaccardSimilarity empty-set path (1), which recreates the same MMR de-duplication failure this change is trying to fix for CJK content. Please include astral Han ranges (or a Unicode-script matcher with u) so non-BMP Han text is tokenized too.
Useful? React with 👍 / 👎.
|
This pull request has been automatically marked as stale due to inactivity. |
|
Still active — this PR has an Approve and all 25 tests pass. Awaiting maintainer merge. The latest commit includes full CJK coverage (Han + Japanese kana + Korean Hangul) with position-tracked bigrams. Happy to address any feedback. |
Problem
tokenize()insrc/memory/mmr.tsonly matches[a-z0-9_]+, returning an empty set for any CJK-only text. This makes Jaccard similarity always 0 between CJK content, effectively disabling MMR diversity detection for Chinese, Japanese, and Korean users.When MMR cannot detect content overlap, search results contain near-duplicate snippets instead of diverse results.
Fix
Extract individual CJK characters (U+4E00–U+9FFF, U+3400–U+4DBF) and character bigrams, merged with the existing ASCII tokens. This gives MMR meaningful tokens to compute similarity on CJK content.
Before:
tokenize("你好世界")→Set(0) {}(empty)After:
tokenize("你好世界")→Set(7) {"你","好","世","界","你好","好世","世界"}Tests
Added 3 test cases:
All 28 tests pass.
Context
This is part of a cluster of CJK search issues:
This PR addresses the MMR layer; the others address the BM25 layer. Together they would significantly improve memory search quality for CJK users.
/cc @Takhoffman @steipete — would appreciate a review when you have a moment. Small change (2 files, +35/−4), all tests green.