Skip to content

fix(memory): add CJK support to MMR tokenize() for diversity detection#28016

Closed
buyitsydney wants to merge 2 commits intoopenclaw:mainfrom
buyitsydney:fix/mmr-cjk-tokenize
Closed

fix(memory): add CJK support to MMR tokenize() for diversity detection#28016
buyitsydney wants to merge 2 commits intoopenclaw:mainfrom
buyitsydney:fix/mmr-cjk-tokenize

Conversation

@buyitsydney
Copy link

Problem

The tokenize() function in src/memory/mmr.ts only matches ASCII tokens via /[a-z0-9_]+/g, returning an empty set for pure CJK (Chinese/Japanese/Korean) text. This makes MMR diversity detection completely ineffective for CJK content:

  • Two different CJK documents → both tokenize to Set{} → Jaccard similarity = 1 (treated as identical)
  • MMR cannot detect or penalize near-duplicate CJK results

Fix

Added CJK character and bigram extraction to tokenize():

const cjkChars = Array.from(lower).filter((c) => /[\u4e00-\u9fff\u3400-\u4dbf]/.test(c));
const bigrams: string[] = [];
for (let i = 0; i < cjkChars.length - 1; i++) {
  bigrams.push(cjkChars[i] + cjkChars[i + 1]);
}
return new Set([...ascii, ...bigrams, ...cjkChars]);
  • Unigrams: Individual CJK characters for broad matching
  • Bigrams: Consecutive character pairs for phrase-level similarity
  • Language-agnostic: Works for Chinese, Japanese kanji, Korean hanja without external tokenization libraries
  • Backward-compatible: ASCII tokenization unchanged

Tests

Added 3 new test cases for CJK tokenization + 2 for CJK text similarity:

  • Pure CJK text produces unigrams + bigrams
  • Mixed ASCII/CJK text produces both token types
  • Single CJK character (no bigrams)
  • Similar CJK texts have positive similarity
  • Disjoint CJK texts have zero similarity

All 25 existing tests pass unchanged.

Related

The tokenize() function in mmr.ts only matched ASCII [a-z0-9_]+,
returning an empty set for pure CJK text. This made Jaccard similarity
always return 1 (both sets empty = identical) or 0 (one has ASCII),
rendering MMR diversity detection completely ineffective for CJK content.

Added CJK character (unigram) and bigram extraction:
- Individual CJK characters for broad matching
- Consecutive character pairs for phrase-level similarity
- Language-agnostic: works for Chinese, Japanese kanji, Korean hanja
- No external tokenizer dependency

Fixes openclaw#28000
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0d0f26f2a2

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

return new Set(tokens);
const lower = text.toLowerCase();
const ascii = lower.match(/[a-z0-9_]+/g) ?? [];
const cjkChars = Array.from(lower).filter((c) => /[\u4e00-\u9fff\u3400-\u4dbf]/.test(c));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Expand CJK matcher to include kana and hangul

The new tokenizer only matches \u3400-\u4DBF and \u4E00-\u9FFF, which are Han ideographs; kana- or hangul-only text still produces an empty token set. In this code path, two empty sets are treated as similarity 1, so snippets like こんにちは vs さようなら (or Korean hangul sentences) will still be treated as identical and MMR diversity remains ineffective for large portions of Japanese/Korean content.

Useful? React with 👍 / 👎.

Comment on lines +40 to +43
const cjkChars = Array.from(lower).filter((c) => /[\u4e00-\u9fff\u3400-\u4dbf]/.test(c));
const bigrams: string[] = [];
for (let i = 0; i < cjkChars.length - 1; i++) {
bigrams.push(cjkChars[i] + cjkChars[i + 1]);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Build CJK bigrams only from contiguous runs

Bigrams are generated after filtering all Han characters out of the full string, so separators are lost and non-adjacent characters get merged into fake bigrams (for example, 你a好 yields 你好). This inflates token overlap across unrelated snippets and can incorrectly penalize diverse results during MMR reranking whenever Han characters are separated by ASCII, punctuation, or whitespace.

Useful? React with 👍 / 👎.

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 26, 2026

Greptile Summary

Added CJK (Chinese/Japanese/Korean) tokenization support to MMR diversity detection by extracting CJK characters as unigrams and consecutive character pairs as bigrams.

Key changes:

  • tokenize() now extracts CJK characters using Unicode ranges \u4e00-\u9fff and \u3400-\u4dbf
  • Generates both unigrams (individual characters) and bigrams (character pairs) for phrase-level matching
  • Added 5 new test cases covering pure CJK, mixed ASCII/CJK, and similarity scenarios
  • Backward compatible with existing ASCII tokenization

Issue identified:

  • Bigram generation doesn't verify that CJK characters are actually consecutive in the original text. When ASCII text separates CJK characters (e.g., "我喜欢hello你好"), spurious bigrams are created from the filtered sequence, which may affect similarity scores.

Confidence Score: 3/5

  • Safe to merge with a known edge case issue that should be addressed
  • Core functionality is solid with good test coverage, but the bigram generation logic has a flaw that creates incorrect bigrams when ASCII text separates CJK characters. This affects similarity scores in mixed-content scenarios. The issue is not critical since most real-world CJK text keeps characters grouped, but should be fixed for correctness.
  • Pay close attention to src/memory/mmr.ts line 40-44 for the bigram generation logic

Last reviewed commit: 0d0f26f

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Comment on lines +40 to +44
const cjkChars = Array.from(lower).filter((c) => /[\u4e00-\u9fff\u3400-\u4dbf]/.test(c));
const bigrams: string[] = [];
for (let i = 0; i < cjkChars.length - 1; i++) {
bigrams.push(cjkChars[i] + cjkChars[i + 1]);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bigrams created from non-consecutive CJK characters when ASCII separates them

For example, with input "我喜欢hello world你好", cjkChars becomes ["我", "喜", "欢", "你", "好"], creating the bigram "欢你" even though these characters aren't consecutive in the original text (separated by "hello world").

Consider tracking original positions:

const chars = Array.from(lower);
const cjkData: { char: string; index: number }[] = [];
chars.forEach((c, i) => {
  if (/[\u4e00-\u9fff\u3400-\u4dbf]/.test(c)) {
    cjkData.push({ char: c, index: i });
  }
});
const bigrams: string[] = [];
for (let i = 0; i < cjkData.length - 1; i++) {
  if (cjkData[i + 1].index === cjkData[i].index + 1) {
    bigrams.push(cjkData[i].char + cjkData[i + 1].char);
  }
}
const cjkChars = cjkData.map(d => d.char);
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/memory/mmr.ts
Line: 40-44

Comment:
bigrams created from non-consecutive CJK characters when ASCII separates them

For example, with input "我喜欢hello world你好", `cjkChars` becomes `["我", "喜", "欢", "你", "好"]`, creating the bigram "欢你" even though these characters aren't consecutive in the original text (separated by "hello world").

Consider tracking original positions:
```
const chars = Array.from(lower);
const cjkData: { char: string; index: number }[] = [];
chars.forEach((c, i) => {
  if (/[\u4e00-\u9fff\u3400-\u4dbf]/.test(c)) {
    cjkData.push({ char: c, index: i });
  }
});
const bigrams: string[] = [];
for (let i = 0; i < cjkData.length - 1; i++) {
  if (cjkData[i + 1].index === cjkData[i].index + 1) {
    bigrams.push(cjkData[i].char + cjkData[i + 1].char);
  }
}
const cjkChars = cjkData.map(d => d.char);
```

How can I resolve this? If you propose a fix, please make it concise.

- Expand CJK regex to include Hiragana, Katakana, Hangul Syllables,
  and Hangul Jamo — Japanese/Korean-only text now tokenizes correctly
- Track original character positions so bigrams are only created from
  adjacent characters (fixes spurious bigrams like '欢你' from '欢hello你')
- Add test cases: Japanese hiragana, Korean hangul, non-adjacent guard
@buyitsydney
Copy link
Author

Addressed both review comments in e102c9f:

P1 — Kana/Hangul support: Expanded the CJK regex to include Hiragana (\u3040-\u309f), Katakana (\u30a0-\u30ff), Hangul Syllables (\uac00-\ud7af), and Hangul Jamo (\u1100-\u11ff). Added test cases for Japanese and Korean text.

P2 — Non-adjacent bigrams: Now tracking original character positions so bigrams are only created from characters that are actually consecutive in the source text. "我a好" correctly produces ["a", "我", "好"] with no spurious "我好" bigram. Added a dedicated test case for this guard.

All 25 tests pass. Thanks for the thorough reviews! 🙏

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e102c9f415

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

* - Hiragana & Katakana (Japanese)
* - Hangul Syllables & Jamo (Korean)
*/
const CJK_RE = /[\u3040-\u309f\u30a0-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af\u1100-\u11ff]/;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Include astral Han code points in CJK regex

The new CJK_RE only matches BMP blocks, so Han characters in supplementary planes (for example U+20000 from CJK Extension B) are still dropped by tokenize. When a snippet is composed of those characters, tokenization can still return an empty set and jaccardSimilarity will treat two unrelated snippets as identical (1), which defeats MMR diversity for that subset of Chinese text. Please expand matching to supplementary Han code points (or use Unicode script properties with the u flag) so these inputs are actually covered.

Useful? React with 👍 / 👎.

@buyitsydney
Copy link
Author

Closing in favor of #29396 which now includes the full CJK/Kana/Hangul coverage from this PR, plus an existing Approve. All changes consolidated there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: MMR tokenize() returns empty set for CJK text, making diversity detection ineffective

1 participant