fix(memory): MMR tokenize() returns empty set for CJK text by buyitsydney · Pull Request #29396 · openclaw/openclaw

buyitsydney · 2026-02-28T03:43:33Z

Problem

tokenize() in src/memory/mmr.ts only matches [a-z0-9_]+, returning an empty set for any CJK-only text. This makes Jaccard similarity always 0 between CJK content, effectively disabling MMR diversity detection for Chinese, Japanese, and Korean users.

When MMR cannot detect content overlap, search results contain near-duplicate snippets instead of diverse results.

Fix

Extract individual CJK characters (U+4E00–U+9FFF, U+3400–U+4DBF) and character bigrams, merged with the existing ASCII tokens. This gives MMR meaningful tokens to compute similarity on CJK content.

Before: tokenize("你好世界") → Set(0) {} (empty)
After: tokenize("你好世界") → Set(7) {"你","好","世","界","你好","好世","世界"}

Tests

Added 3 test cases:

Pure CJK tokenization (characters + bigrams)
Mixed CJK + ASCII
Real-world Traditional Chinese text (from memorySearch BM25 (FTS5) does not work for CJK text #20730 user report)

All 28 tests pass.

Context

This is part of a cluster of CJK search issues:

memorySearch BM25 (FTS5) does not work for CJK text #20730 — FTS5 BM25 does not work for CJK text (tokenization)
fix(memory): bm25RankToScore returns constant 1.0 for all negative BM25 ranks #14005 — BM25 score collapses to constant 1.0 (scoring)
[Bug]: bm25RankToScore always return same value #5767 — bm25RankToScore always returns same value (report)

This PR addresses the MMR layer; the others address the BM25 layer. Together they would significantly improve memory search quality for CJK users.

/cc @Takhoffman @steipete — would appreciate a review when you have a moment. Small change (2 files, +35/−4), all tests green.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4d327133b8

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-02-28T03:45:21Z

src/memory/mmr.ts

+  const cjkChars = Array.from(lower).filter((c) => /[\u4e00-\u9fff\u3400-\u4dbf]/.test(c));
+  const bigrams: string[] = [];
+  for (let i = 0; i < cjkChars.length - 1; i++) {
+    bigrams.push(cjkChars[i] + cjkChars[i + 1]);


Build CJK bigrams from contiguous text only

In tokenize, CJK bigrams are built after filtering the full string down to only CJK characters, so separators are discarded before pairing. This creates tokens that never appear contiguously in the source (for example, "你 abc 好" yields the synthetic bigram "你好"), which inflates Jaccard overlap and can cause MMR to over-penalize otherwise diverse snippets. Generating bigrams per contiguous CJK run avoids this false similarity.

Useful? React with 👍 / 👎.

greptile-apps · 2026-02-28T03:46:57Z

Greptile Summary

Fixed CJK text tokenization in MMR algorithm by extracting individual Han characters (U+4E00–U+9FFF, U+3400–U+4DBF) and character bigrams, resolving the issue where tokenize() returned empty sets for CJK-only text.

Key changes:

Modified tokenize() in src/memory/mmr.ts to handle CJK characters alongside ASCII tokens
Added character-level tokenization and bigram generation for CJK text
Added 3 test cases covering pure CJK, mixed CJK+ASCII, and real-world Traditional Chinese text

Issue found:

Bigram generation has a bug where it creates bigrams for CJK characters that aren't adjacent in the original text (when separated by ASCII). This occurs because the code filters out all non-CJK characters before generating bigrams, losing positional information.

Note on scope:

Implementation covers Han characters (Chinese/Japanese Kanji) but doesn't include Japanese kana (Hiragana/Katakana) or Korean Hangul, which use different Unicode ranges. The PR description explicitly states the targeted Unicode ranges, so this limitation appears intentional.

Confidence Score: 3/5

Safe to merge after fixing the bigram generation bug for mixed CJK+ASCII text
Score reflects a logical bug in bigram generation where non-adjacent CJK characters (separated by ASCII) are incorrectly paired. The core fix for pure CJK text is sound, tests pass, and the change is well-targeted. However, the bigram generation needs correction to handle mixed text properly before merging.
src/memory/mmr.ts needs the bigram generation logic fixed to track character positions

_{Last reviewed commit: 4d32713}

greptile-apps

_{2 files reviewed, 1 comment}

_{Edit Code Review Agent Settings | Greptile}

greptile-apps · 2026-02-28T03:47:01Z

src/memory/mmr.ts

+  const cjkChars = Array.from(lower).filter((c) => /[\u4e00-\u9fff\u3400-\u4dbf]/.test(c));
+  const bigrams: string[] = [];
+  for (let i = 0; i < cjkChars.length - 1; i++) {
+    bigrams.push(cjkChars[i] + cjkChars[i + 1]);


Bigram generation creates incorrect bigrams when CJK characters are separated by ASCII text. For example, tokenize("A你B好C") would generate the bigram "你好" even though these characters aren't adjacent in the original text (separated by "B").

To fix, track character positions and only generate bigrams for truly adjacent CJK characters:

const chars = Array.from(lower); const cjkChars: string[] = []; const bigrams: string[] = []; for (let i = 0; i < chars.length; i++) { if (/[\u4e00-\u9fff\u3400-\u4dbf]/.test(chars[i])) { cjkChars.push(chars[i]); if (i + 1 < chars.length && /[\u4e00-\u9fff\u3400-\u4dbf]/.test(chars[i + 1])) { bigrams.push(chars[i] + chars[i + 1]); } } }

Prompt To Fix With AI

This is a comment left during a code review. Path: src/memory/mmr.ts Line: 35-38 Comment: Bigram generation creates incorrect bigrams when CJK characters are separated by ASCII text. For example, `tokenize("A你B好C")` would generate the bigram `"你好"` even though these characters aren't adjacent in the original text (separated by `"B"`). To fix, track character positions and only generate bigrams for truly adjacent CJK characters: ```typescript const chars = Array.from(lower); const cjkChars: string[] = []; const bigrams: string[] = []; for (let i = 0; i < chars.length; i++) { if (/[\u4e00-\u9fff\u3400-\u4dbf]/.test(chars[i])) { cjkChars.push(chars[i]); if (i + 1 < chars.length && /[\u4e00-\u9fff\u3400-\u4dbf]/.test(chars[i + 1])) { bigrams.push(chars[i] + chars[i + 1]); } } } ``` How can I resolve this? If you propose a fix, please make it concise.

nikolasdehor

Good fix — the CJK bigram approach is well-suited for logographic text where whitespace tokenization fails. Two observations: (1) The Unicode ranges cover CJK Unified Ideographs but miss Japanese kana (\u3040-\u30ff), Korean Hangul syllables (\uac00-\ud7af), and CJK Extension B+. If the goal is full CJK support, consider widening the range in a follow-up. (2) Nit: the blank line between tokenize() and jaccardSimilarity() was removed — please restore for readability. LGTM overall.

buyitsydney · 2026-02-28T03:55:05Z

Good catch @greptile-apps — fixed in the updated commit.

What changed: Bigrams are now only generated between positionally adjacent CJK characters in the original text. Previously, all CJK characters were extracted first (losing position info), then bigrams were generated — which could pair characters separated by ASCII.

// Before (buggy): "你 test 好" → bigrams: ["你好"] ← wrong, not adjacent
// After (fixed):  "你 test 好" → bigrams: []      ← correct
// Pure CJK still works: "你好世界" → bigrams: ["你好","好世","世界"] ← correct

Added a test case to verify: expect(tokenize("你 test 好")).not.toContain("你好")

All 29 tests pass.

greptile-apps · 2026-02-28T03:55:50Z

Perfect! The fix is correct. By checking isCJK(chars[i + 1]) immediately after detecting a CJK character, you ensure bigrams are only created for consecutive CJK characters in the original text. The test case confirms "你 test 好" won't produce a "你好" bigram while "你好世界" still generates all 3 correct bigrams. LGTM! ✅

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5c1c157158

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-02-28T03:57:23Z

src/memory/mmr.ts

-  return new Set(tokens);
+  const lower = text.toLowerCase();
+  const ascii = lower.match(/[a-z0-9_]+/g) ?? [];
+  const isCJK = (c: string): boolean => /[\u4e00-\u9fff\u3400-\u4dbf]/.test(c);


Support Hangul and Kana when tokenizing CJK text

The new CJK tokenizer only matches Han ideograph ranges (\u3400-\u4DBF, \u4E00-\u9FFF), so Korean Hangul and Japanese kana inputs still produce empty token sets. In those cases, two unrelated snippets in those scripts are treated as maximally similar (jaccardSimilarity returns 1 when both sets are empty), so MMR diversity remains broken for a large part of CJK-language content. Expanding the script coverage (or using Unicode script properties) is needed for this fix to actually cover CJK text broadly.

Useful? React with 👍 / 👎.

…ity detection The tokenize() function only matched [a-z0-9_]+ patterns, returning an empty set for CJK-only text. This made Jaccard similarity always 0 (or always 1 for two empty sets) for CJK content, effectively disabling MMR diversity detection. Add support for: - CJK Unified Ideographs (U+4E00–U+9FFF, U+3400–U+4DBF) - Hiragana (U+3040–U+309F) and Katakana (U+30A0–U+30FF) - Hangul Syllables (U+AC00–U+D7AF) and Jamo (U+1100–U+11FF) Characters are extracted as unigrams, and bigrams are generated only from characters that are adjacent in the original text (no spurious bigrams across ASCII boundaries). Fixes openclaw#28000

buyitsydney · 2026-02-28T10:34:10Z

Updated: merged the broader CJK coverage from #28016 into this PR (which already has an Approve).

What changed in this push:

Added Hiragana/Katakana (Japanese) and Hangul (Korean) support — not just Han characters
Uses position-tracked bigram generation (original approach from fix(memory): add CJK support to MMR tokenize() for diversity detection #28016)
Added test cases for Japanese (こんにちは), Korean (안녕하세요), and CJK text similarity

Closing #28016 in favor of this PR. All 25 tests pass.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2dfc47fa92

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-02-28T10:36:10Z

src/memory/mmr.ts

+ * - Hiragana & Katakana (Japanese)
+ * - Hangul Syllables & Jamo (Korean)
+ */
+const CJK_RE = /[\u3040-\u309f\u30a0-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af\u1100-\u11ff]/;


Match astral Han ideographs when tokenizing CJK text

CJK_RE only includes BMP ranges, so Han Extension B+ characters (for example 𠀀) are not recognized as CJK tokens; tokenize("𠀀𠀁") currently returns an empty set. In that case, two unrelated snippets composed of those ideographs still hit the jaccardSimilarity empty-set path (1), which recreates the same MMR de-duplication failure this change is trying to fix for CJK content. Please include astral Han ranges (or a Unicode-script matcher with u) so non-BMP Han text is tokenized too.

Useful? React with 👍 / 👎.

openclaw-barnacle · 2026-03-06T04:07:07Z

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

buyitsydney · 2026-03-08T01:01:18Z

Still active — this PR has an Approve and all 25 tests pass. Awaiting maintainer merge. The latest commit includes full CJK coverage (Han + Japanese kana + Korean Hangul) with position-tracked bigrams. Happy to address any feedback.

openclaw-barnacle bot added the size: XS label Feb 28, 2026

This was referenced Feb 28, 2026

fix(memory): bm25RankToScore returns constant 1.0 for all negative BM25 ranks #14005

Closed

memorySearch BM25 (FTS5) does not work for CJK text #20730

Open

chatgpt-codex-connector bot reviewed Feb 28, 2026

View reviewed changes

greptile-apps bot reviewed Feb 28, 2026

View reviewed changes

nikolasdehor approved these changes Feb 28, 2026

View reviewed changes

buyitsydney force-pushed the fix/mmr-tokenize-cjk branch from 4d32713 to 5c1c157 Compare February 28, 2026 03:54

openclaw-barnacle bot added size: S and removed size: XS labels Feb 28, 2026

chatgpt-codex-connector bot reviewed Feb 28, 2026

View reviewed changes

buyitsydney force-pushed the fix/mmr-tokenize-cjk branch from 5c1c157 to 2dfc47f Compare February 28, 2026 10:33

buyitsydney mentioned this pull request Feb 28, 2026

fix(memory): add CJK support to MMR tokenize() for diversity detection #28016

Closed

chatgpt-codex-connector bot reviewed Feb 28, 2026

View reviewed changes

buyitsydney mentioned this pull request Mar 4, 2026

[Bug]: bm25RankToScore always return same value #5767

Closed

openclaw-barnacle bot added the stale Marked as stale due to inactivity label Mar 6, 2026

buyitsydney mentioned this pull request Mar 8, 2026

[Bug]: MMR tokenize() returns empty set for CJK text, making diversity detection ineffective #28000

Open

Uh oh!

Conversation

buyitsydney commented Feb 28, 2026

Problem

Fix

Tests

Context

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Feb 28, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot commented Feb 28, 2026

Greptile Summary

Confidence Score: 3/5

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Feb 28, 2026

Choose a reason for hiding this comment

Uh oh!

nikolasdehor left a comment

Choose a reason for hiding this comment

Uh oh!

buyitsydney commented Feb 28, 2026

Uh oh!

greptile-apps bot commented Feb 28, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Feb 28, 2026

Choose a reason for hiding this comment

Uh oh!

buyitsydney commented Feb 28, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Feb 28, 2026

Choose a reason for hiding this comment

Uh oh!

openclaw-barnacle bot commented Mar 6, 2026

Uh oh!

buyitsydney commented Mar 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants