fix(memory): add CJK support to MMR tokenize() for diversity detection by buyitsydney · Pull Request #28016 · openclaw/openclaw

buyitsydney · 2026-02-26T23:09:13Z

Problem

The tokenize() function in src/memory/mmr.ts only matches ASCII tokens via /[a-z0-9_]+/g, returning an empty set for pure CJK (Chinese/Japanese/Korean) text. This makes MMR diversity detection completely ineffective for CJK content:

Two different CJK documents → both tokenize to Set{} → Jaccard similarity = 1 (treated as identical)
MMR cannot detect or penalize near-duplicate CJK results

Fix

Added CJK character and bigram extraction to tokenize():

const cjkChars = Array.from(lower).filter((c) => /[\u4e00-\u9fff\u3400-\u4dbf]/.test(c));
const bigrams: string[] = [];
for (let i = 0; i < cjkChars.length - 1; i++) {
  bigrams.push(cjkChars[i] + cjkChars[i + 1]);
}
return new Set([...ascii, ...bigrams, ...cjkChars]);

Unigrams: Individual CJK characters for broad matching
Bigrams: Consecutive character pairs for phrase-level similarity
Language-agnostic: Works for Chinese, Japanese kanji, Korean hanja without external tokenization libraries
Backward-compatible: ASCII tokenization unchanged

Tests

Added 3 new test cases for CJK tokenization + 2 for CJK text similarity:

Pure CJK text produces unigrams + bigrams
Mixed ASCII/CJK text produces both token types
Single CJK character (no bigrams)
Similar CJK texts have positive similarity
Disjoint CJK texts have zero similarity

All 25 existing tests pass unchanged.

Fixes [Bug]: MMR tokenize() returns empty set for CJK text, making diversity detection ineffective #28000
Related: memorySearch BM25 (FTS5) does not work for CJK text #20730 (FTS5 CJK tokenization — different component, same language family)
Related: [Bug]: bm25RankToScore always return same value #5767 / fix(memory): bm25RankToScore returns constant 1.0 for all negative BM25 ranks #14005 (bm25RankToScore — different bug, same search pipeline)

The tokenize() function in mmr.ts only matched ASCII [a-z0-9_]+, returning an empty set for pure CJK text. This made Jaccard similarity always return 1 (both sets empty = identical) or 0 (one has ASCII), rendering MMR diversity detection completely ineffective for CJK content. Added CJK character (unigram) and bigram extraction: - Individual CJK characters for broad matching - Consecutive character pairs for phrase-level similarity - Language-agnostic: works for Chinese, Japanese kanji, Korean hanja - No external tokenizer dependency Fixes openclaw#28000

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0d0f26f2a2

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-02-26T23:10:50Z

src/memory/mmr.ts

-  return new Set(tokens);
+  const lower = text.toLowerCase();
+  const ascii = lower.match(/[a-z0-9_]+/g) ?? [];
+  const cjkChars = Array.from(lower).filter((c) => /[\u4e00-\u9fff\u3400-\u4dbf]/.test(c));


Expand CJK matcher to include kana and hangul

The new tokenizer only matches \u3400-\u4DBF and \u4E00-\u9FFF, which are Han ideographs; kana- or hangul-only text still produces an empty token set. In this code path, two empty sets are treated as similarity 1, so snippets like こんにちは vs さようなら (or Korean hangul sentences) will still be treated as identical and MMR diversity remains ineffective for large portions of Japanese/Korean content.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-02-26T23:10:50Z

src/memory/mmr.ts

+  const cjkChars = Array.from(lower).filter((c) => /[\u4e00-\u9fff\u3400-\u4dbf]/.test(c));
+  const bigrams: string[] = [];
+  for (let i = 0; i < cjkChars.length - 1; i++) {
+    bigrams.push(cjkChars[i] + cjkChars[i + 1]);


Build CJK bigrams only from contiguous runs

Bigrams are generated after filtering all Han characters out of the full string, so separators are lost and non-adjacent characters get merged into fake bigrams (for example, 你a好 yields 你好). This inflates token overlap across unrelated snippets and can incorrectly penalize diverse results during MMR reranking whenever Han characters are separated by ASCII, punctuation, or whitespace.

Useful? React with 👍 / 👎.

greptile-apps · 2026-02-26T23:12:26Z

Greptile Summary

Added CJK (Chinese/Japanese/Korean) tokenization support to MMR diversity detection by extracting CJK characters as unigrams and consecutive character pairs as bigrams.

Key changes:

tokenize() now extracts CJK characters using Unicode ranges \u4e00-\u9fff and \u3400-\u4dbf
Generates both unigrams (individual characters) and bigrams (character pairs) for phrase-level matching
Added 5 new test cases covering pure CJK, mixed ASCII/CJK, and similarity scenarios
Backward compatible with existing ASCII tokenization

Issue identified:

Bigram generation doesn't verify that CJK characters are actually consecutive in the original text. When ASCII text separates CJK characters (e.g., "我喜欢hello你好"), spurious bigrams are created from the filtered sequence, which may affect similarity scores.

Confidence Score: 3/5

Safe to merge with a known edge case issue that should be addressed
Core functionality is solid with good test coverage, but the bigram generation logic has a flaw that creates incorrect bigrams when ASCII text separates CJK characters. This affects similarity scores in mixed-content scenarios. The issue is not critical since most real-world CJK text keeps characters grouped, but should be fixed for correctness.
Pay close attention to src/memory/mmr.ts line 40-44 for the bigram generation logic

_{Last reviewed commit: 0d0f26f}

greptile-apps

_{2 files reviewed, 1 comment}

_{Edit Code Review Agent Settings | Greptile}

greptile-apps · 2026-02-26T23:12:29Z

src/memory/mmr.ts

+  const cjkChars = Array.from(lower).filter((c) => /[\u4e00-\u9fff\u3400-\u4dbf]/.test(c));
+  const bigrams: string[] = [];
+  for (let i = 0; i < cjkChars.length - 1; i++) {
+    bigrams.push(cjkChars[i] + cjkChars[i + 1]);
+  }


bigrams created from non-consecutive CJK characters when ASCII separates them

For example, with input "我喜欢hello world你好", cjkChars becomes ["我", "喜", "欢", "你", "好"], creating the bigram "欢你" even though these characters aren't consecutive in the original text (separated by "hello world").

Consider tracking original positions:

const chars = Array.from(lower); const cjkData: { char: string; index: number }[] = []; chars.forEach((c, i) => { if (/[\u4e00-\u9fff\u3400-\u4dbf]/.test(c)) { cjkData.push({ char: c, index: i }); } }); const bigrams: string[] = []; for (let i = 0; i < cjkData.length - 1; i++) { if (cjkData[i + 1].index === cjkData[i].index + 1) { bigrams.push(cjkData[i].char + cjkData[i + 1].char); } } const cjkChars = cjkData.map(d => d.char);

Prompt To Fix With AI

This is a comment left during a code review. Path: src/memory/mmr.ts Line: 40-44 Comment: bigrams created from non-consecutive CJK characters when ASCII separates them For example, with input "我喜欢hello world你好", `cjkChars` becomes `["我", "喜", "欢", "你", "好"]`, creating the bigram "欢你" even though these characters aren't consecutive in the original text (separated by "hello world"). Consider tracking original positions: ``` const chars = Array.from(lower); const cjkData: { char: string; index: number }[] = []; chars.forEach((c, i) => { if (/[\u4e00-\u9fff\u3400-\u4dbf]/.test(c)) { cjkData.push({ char: c, index: i }); } }); const bigrams: string[] = []; for (let i = 0; i < cjkData.length - 1; i++) { if (cjkData[i + 1].index === cjkData[i].index + 1) { bigrams.push(cjkData[i].char + cjkData[i + 1].char); } } const cjkChars = cjkData.map(d => d.char); ``` How can I resolve this? If you propose a fix, please make it concise.

- Expand CJK regex to include Hiragana, Katakana, Hangul Syllables, and Hangul Jamo — Japanese/Korean-only text now tokenizes correctly - Track original character positions so bigrams are only created from adjacent characters (fixes spurious bigrams like '欢你' from '欢hello你') - Add test cases: Japanese hiragana, Korean hangul, non-adjacent guard

buyitsydney · 2026-02-26T23:15:47Z

Addressed both review comments in e102c9f:

P1 — Kana/Hangul support: Expanded the CJK regex to include Hiragana (\u3040-\u309f), Katakana (\u30a0-\u30ff), Hangul Syllables (\uac00-\ud7af), and Hangul Jamo (\u1100-\u11ff). Added test cases for Japanese and Korean text.

P2 — Non-adjacent bigrams: Now tracking original character positions so bigrams are only created from characters that are actually consecutive in the source text. "我a好" correctly produces ["a", "我", "好"] with no spurious "我好" bigram. Added a dedicated test case for this guard.

All 25 tests pass. Thanks for the thorough reviews! 🙏

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e102c9f415

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-02-26T23:18:42Z

src/memory/mmr.ts

+ * - Hiragana & Katakana (Japanese)
+ * - Hangul Syllables & Jamo (Korean)
+ */
+const CJK_RE = /[\u3040-\u309f\u30a0-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af\u1100-\u11ff]/;


Include astral Han code points in CJK regex

The new CJK_RE only matches BMP blocks, so Han characters in supplementary planes (for example U+20000 from CJK Extension B) are still dropped by tokenize. When a snippet is composed of those characters, tokenization can still return an empty set and jaccardSimilarity will treat two unrelated snippets as identical (1), which defeats MMR diversity for that subset of Chinese text. Please expand matching to supplementary Han code points (or use Unicode script properties with the u flag) so these inputs are actually covered.

Useful? React with 👍 / 👎.

buyitsydney · 2026-02-28T10:34:21Z

Closing in favor of #29396 which now includes the full CJK/Kana/Hangul coverage from this PR, plus an existing Approve. All changes consolidated there.

openclaw-barnacle bot added the size: S label Feb 26, 2026

chatgpt-codex-connector bot reviewed Feb 26, 2026

View reviewed changes

greptile-apps bot reviewed Feb 26, 2026

View reviewed changes

chatgpt-codex-connector bot reviewed Feb 26, 2026

View reviewed changes

buyitsydney mentioned this pull request Feb 28, 2026

fix(memory): MMR tokenize() returns empty set for CJK text #29396

Open

buyitsydney closed this Feb 28, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(memory): add CJK support to MMR tokenize() for diversity detection#28016

fix(memory): add CJK support to MMR tokenize() for diversity detection#28016
buyitsydney wants to merge 2 commits intoopenclaw:mainfrom
buyitsydney:fix/mmr-cjk-tokenize

buyitsydney commented Feb 26, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

chatgpt-codex-connector bot Feb 26, 2026

Uh oh!

chatgpt-codex-connector bot Feb 26, 2026

Uh oh!

greptile-apps bot commented Feb 26, 2026

Uh oh!

greptile-apps bot left a comment

Uh oh!

greptile-apps bot Feb 26, 2026

Uh oh!

buyitsydney commented Feb 26, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

chatgpt-codex-connector bot Feb 26, 2026

Uh oh!

buyitsydney commented Feb 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

buyitsydney commented Feb 26, 2026

Problem

Fix

Tests

Related

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Feb 26, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector bot Feb 26, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot commented Feb 26, 2026

Greptile Summary

Confidence Score: 3/5

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Feb 26, 2026

Choose a reason for hiding this comment

Uh oh!

buyitsydney commented Feb 26, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Feb 26, 2026

Choose a reason for hiding this comment

Uh oh!

buyitsydney commented Feb 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant