
fix(memory): MMR tokenize() returns empty set for CJK text #29396

Open

buyitsydney wants to merge 1 commit into openclaw:main from buyitsydney:fix/mmr-tokenize-cjk

Conversation

@buyitsydney

Problem

tokenize() in src/memory/mmr.ts only matches [a-z0-9_]+, returning an empty set for any CJK-only text. This makes Jaccard similarity degenerate for CJK content (always 0, or 1 when both token sets are empty), effectively disabling MMR diversity detection for Chinese, Japanese, and Korean users.

When MMR cannot detect content overlap, search results contain near-duplicate snippets instead of diverse results.
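The failure mode can be sketched as follows. The tokenizer regex is the one quoted above; the `jaccard` helper and all names are illustrative, not the exact `mmr.ts` code:

```typescript
// Minimal sketch of the failure mode (illustrative; mmr.ts may differ in detail).
function tokenizeOld(text: string): Set<string> {
  // Only ASCII word characters are matched, so CJK-only input yields Set(0) {}.
  return new Set(text.toLowerCase().match(/[a-z0-9_]+/g) ?? []);
}

// Hypothetical Jaccard helper for illustration; some implementations
// return 1 (not 0) when both sets are empty.
function jaccard(a: Set<string>, b: Set<string>): number {
  const union = new Set([...a, ...b]);
  if (union.size === 0) return 0;
  const inter = [...a].filter((t) => b.has(t)).length;
  return inter / union.size;
}

const a = tokenizeOld("你好世界");       // empty
const b = tokenizeOld("完全不同的内容"); // also empty, despite different text
console.log(a.size, b.size, jaccard(a, b)); // 0 0 0 — no similarity signal
```

With every CJK snippet tokenizing to the same empty set, MMR has no way to tell duplicates from diverse results.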

Fix

Extract individual CJK characters (U+4E00–U+9FFF, U+3400–U+4DBF) and character bigrams, merged with the existing ASCII tokens. This gives MMR meaningful tokens to compute similarity on CJK content.

Before: tokenize("你好世界") → Set(0) {} (empty)
After: tokenize("你好世界") → Set(7) {"你","好","世","界","你好","好世","世界"}
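The described approach can be sketched as below (illustrative, not the exact `mmr.ts` code): keep the ASCII tokens, then add CJK unigrams plus bigrams of characters that are adjacent in the original text.

```typescript
// Sketch of the described tokenizer (assumed structure; names are illustrative).
const CJK = /[\u4e00-\u9fff\u3400-\u4dbf]/;

function tokenize(text: string): Set<string> {
  const lower = text.toLowerCase();
  const tokens: string[] = lower.match(/[a-z0-9_]+/g) ?? []; // existing ASCII tokens
  const chars = Array.from(lower); // iterate code points, not UTF-16 units
  for (let i = 0; i < chars.length; i++) {
    if (!CJK.test(chars[i])) continue;
    tokens.push(chars[i]); // CJK unigram
    if (i + 1 < chars.length && CJK.test(chars[i + 1])) {
      tokens.push(chars[i] + chars[i + 1]); // bigram of adjacent CJK characters
    }
  }
  return new Set(tokens);
}

// tokenize("你好世界") → 7 tokens: 你, 好, 世, 界, 你好, 好世, 世界
```

Bigrams matter because single-character overlap is very common in Chinese; two-character sequences approximate words and give Jaccard a much stronger diversity signal.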

Tests

Added 3 test cases covering pure CJK, mixed CJK+ASCII, and real-world Traditional Chinese text.

All 28 tests pass.

Context

This is part of a cluster of CJK search issues: this PR addresses the MMR layer, while the others address the BM25 layer. Together they would significantly improve memory search quality for CJK users.

/cc @Takhoffman @steipete — would appreciate a review when you have a moment. Small change (2 files, +35/−4), all tests green.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4d327133b8

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +35 to +38

```typescript
const cjkChars = Array.from(lower).filter((c) => /[\u4e00-\u9fff\u3400-\u4dbf]/.test(c));
const bigrams: string[] = [];
for (let i = 0; i < cjkChars.length - 1; i++) {
  bigrams.push(cjkChars[i] + cjkChars[i + 1]);
```


P2: Build CJK bigrams from contiguous text only

In tokenize, CJK bigrams are built after filtering the full string down to only CJK characters, so separators are discarded before pairing. This creates tokens that never appear contiguously in the source (for example, "你 abc 好" yields the synthetic bigram "你好"), which inflates Jaccard overlap and can cause MMR to over-penalize otherwise diverse snippets. Generating bigrams per contiguous CJK run avoids this false similarity.
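One way to implement this suggestion is to split the text into contiguous CJK runs first and pair characters only within each run. This is a sketch under that assumption; the helper name is hypothetical:

```typescript
// Pair only characters inside the same contiguous CJK run, so separators
// (spaces, ASCII words) never produce synthetic bigrams. Hypothetical helper.
const CJK_RUN = /[\u4e00-\u9fff\u3400-\u4dbf]+/gu;

function cjkBigrams(text: string): string[] {
  const bigrams: string[] = [];
  for (const run of text.match(CJK_RUN) ?? []) {
    const chars = Array.from(run);
    for (let i = 0; i < chars.length - 1; i++) {
      bigrams.push(chars[i] + chars[i + 1]);
    }
  }
  return bigrams;
}

// cjkBigrams("你 abc 好") → []            (no synthetic "你好")
// cjkBigrams("你好世界")  → ["你好", "好世", "世界"]
```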

Useful? React with 👍 / 👎.

@greptile-apps
Contributor

greptile-apps bot commented Feb 28, 2026

Greptile Summary

Fixed CJK text tokenization in MMR algorithm by extracting individual Han characters (U+4E00–U+9FFF, U+3400–U+4DBF) and character bigrams, resolving the issue where tokenize() returned empty sets for CJK-only text.

Key changes:

  • Modified tokenize() in src/memory/mmr.ts to handle CJK characters alongside ASCII tokens
  • Added character-level tokenization and bigram generation for CJK text
  • Added 3 test cases covering pure CJK, mixed CJK+ASCII, and real-world Traditional Chinese text

Issue found:

  • Bigram generation has a bug where it creates bigrams for CJK characters that aren't adjacent in the original text (when separated by ASCII). This occurs because the code filters out all non-CJK characters before generating bigrams, losing positional information.

Note on scope:

  • Implementation covers Han characters (Chinese/Japanese Kanji) but doesn't include Japanese kana (Hiragana/Katakana) or Korean Hangul, which use different Unicode ranges. The PR description explicitly states the targeted Unicode ranges, so this limitation appears intentional.

Confidence Score: 3/5

  • Safe to merge after fixing the bigram generation bug for mixed CJK+ASCII text
  • Score reflects a logical bug in bigram generation where non-adjacent CJK characters (separated by ASCII) are incorrectly paired. The core fix for pure CJK text is sound, tests pass, and the change is well-targeted. However, the bigram generation needs correction to handle mixed text properly before merging.
  • src/memory/mmr.ts needs the bigram generation logic fixed to track character positions

Last reviewed commit: 4d32713

Contributor

@greptile-apps greptile-apps bot left a comment


2 files reviewed, 1 comment


Comment on lines +35 to +38

```typescript
const cjkChars = Array.from(lower).filter((c) => /[\u4e00-\u9fff\u3400-\u4dbf]/.test(c));
const bigrams: string[] = [];
for (let i = 0; i < cjkChars.length - 1; i++) {
  bigrams.push(cjkChars[i] + cjkChars[i + 1]);
```


Bigram generation creates incorrect bigrams when CJK characters are separated by ASCII text. For example, tokenize("A你B好C") would generate the bigram "你好" even though these characters aren't adjacent in the original text (separated by "B").

To fix, track character positions and only generate bigrams for truly adjacent CJK characters:

```typescript
const chars = Array.from(lower);
const cjkChars: string[] = [];
const bigrams: string[] = [];

for (let i = 0; i < chars.length; i++) {
  if (/[\u4e00-\u9fff\u3400-\u4dbf]/.test(chars[i])) {
    cjkChars.push(chars[i]);
    if (i + 1 < chars.length && /[\u4e00-\u9fff\u3400-\u4dbf]/.test(chars[i + 1])) {
      bigrams.push(chars[i] + chars[i + 1]);
    }
  }
}
```


@nikolasdehor nikolasdehor left a comment


Good fix — the CJK bigram approach is well-suited for logographic text where whitespace tokenization fails. Two observations: (1) The Unicode ranges cover CJK Unified Ideographs but miss Japanese kana (\u3040-\u30ff), Korean Hangul syllables (\uac00-\ud7af), and CJK Extension B+. If the goal is full CJK support, consider widening the range in a follow-up. (2) Nit: the blank line between tokenize() and jaccardSimilarity() was removed — please restore for readability. LGTM overall.

@buyitsydney
Author

Good catch @greptile-apps — fixed in the updated commit.

What changed: Bigrams are now only generated between positionally adjacent CJK characters in the original text. Previously, all CJK characters were extracted first (losing position info), then bigrams were generated — which could pair characters separated by ASCII.

```typescript
// Before (buggy): "你 test 好" → bigrams: ["你好"] ← wrong, not adjacent
// After (fixed):  "你 test 好" → bigrams: []      ← correct
// Pure CJK still works: "你好世界" → bigrams: ["你好","好世","世界"] ← correct
```

Added a test case to verify: expect(tokenize("你 test 好")).not.toContain("你好")

All 29 tests pass.

@greptile-apps
Contributor

greptile-apps bot commented Feb 28, 2026

Perfect! The fix is correct. By checking isCJK(chars[i + 1]) immediately after detecting a CJK character, you ensure bigrams are only created for consecutive CJK characters in the original text. The test case confirms "你 test 好" won't produce a "你好" bigram while "你好世界" still generates all 3 correct bigrams. LGTM! ✅


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5c1c157158


```typescript
return new Set(tokens);
const lower = text.toLowerCase();
const ascii = lower.match(/[a-z0-9_]+/g) ?? [];
const isCJK = (c: string): boolean => /[\u4e00-\u9fff\u3400-\u4dbf]/.test(c);
```


P1: Support Hangul and Kana when tokenizing CJK text

The new CJK tokenizer only matches Han ideograph ranges (\u3400-\u4DBF, \u4E00-\u9FFF), so Korean Hangul and Japanese kana inputs still produce empty token sets. In those cases, two unrelated snippets in those scripts are treated as maximally similar (jaccardSimilarity returns 1 when both sets are empty), so MMR diversity remains broken for a large part of CJK-language content. Expanding the script coverage (or using Unicode script properties) is needed for this fix to actually cover CJK text broadly.
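The gap can be illustrated as follows (example strings chosen here; the `jaccardSimilarity` empty-set behavior is as described in this comment):

```typescript
// Hangul and kana contain no Han ideographs, so a Han-only character class
// never matches them and tokenization falls through to the empty set.
const HAN_ONLY = /[\u4e00-\u9fff\u3400-\u4dbf]/;
const hasHan = (s: string) => Array.from(s).some((c) => HAN_ONLY.test(c));

const hangul = "안녕하세요"; // Korean "hello"
const kana = "こんにちは";   // Japanese "hello" in hiragana

console.log(hasHan(hangul)); // false → a Han-only tokenizer yields Set(0) {}
console.log(hasHan(kana));   // false → same empty-set path
// If jaccardSimilarity treats two empty sets as identical (similarity 1),
// unrelated Hangul or kana snippets look like duplicates to MMR.
```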

Useful? React with 👍 / 👎.

…ity detection

The tokenize() function only matched [a-z0-9_]+ patterns, returning an
empty set for CJK-only text. This made Jaccard similarity always 0 (or
always 1 for two empty sets) for CJK content, effectively disabling MMR
diversity detection.

Add support for:
- CJK Unified Ideographs (U+4E00–U+9FFF, U+3400–U+4DBF)
- Hiragana (U+3040–U+309F) and Katakana (U+30A0–U+30FF)
- Hangul Syllables (U+AC00–U+D7AF) and Jamo (U+1100–U+11FF)

Characters are extracted as unigrams, and bigrams are generated only
from characters that are adjacent in the original text (no spurious
bigrams across ASCII boundaries).

Fixes openclaw#28000
@buyitsydney
Author

Updated: merged the broader CJK coverage from #28016 into this PR (which already has an approval). Closing #28016 in favor of this PR. All 25 tests pass.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2dfc47fa92


```typescript
 * - Hiragana & Katakana (Japanese)
 * - Hangul Syllables & Jamo (Korean)
 */
const CJK_RE = /[\u3040-\u309f\u30a0-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af\u1100-\u11ff]/;
```


P2: Match astral Han ideographs when tokenizing CJK text

CJK_RE only includes BMP ranges, so Han Extension B+ characters (for example 𠀀) are not recognized as CJK tokens; tokenize("𠀀𠀁") currently returns an empty set. In that case, two unrelated snippets composed of those ideographs still hit the jaccardSimilarity empty-set path (1), which recreates the same MMR de-duplication failure this change is trying to fix for CJK content. Please include astral Han ranges (or a Unicode-script matcher with u) so non-BMP Han text is tokenized too.
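A Unicode script property escape (which requires the `u` flag) is one way to cover astral Han characters; a sketch of the difference:

```typescript
// BMP-only ranges miss astral Han; \p{Script=Han} with the `u` flag covers both.
const HAN_BMP = /[\u3400-\u4dbf\u4e00-\u9fff]/;
const HAN_SCRIPT = /\p{Script=Han}/u;

const astral = "𠀀"; // U+20000, CJK Unified Ideographs Extension B
console.log(HAN_BMP.test(astral));    // false — surrogate halves miss the ranges
console.log(HAN_SCRIPT.test(astral)); // true
// When iterating characters, use Array.from(text) (or for…of) so surrogate
// pairs stay together as single code points.
```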

Useful? React with 👍 / 👎.

@openclaw-barnacle

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

@buyitsydney
Author

Still active — this PR has an Approve and all 25 tests pass. Awaiting maintainer merge. The latest commit includes full CJK coverage (Han + Japanese kana + Korean Hangul) with position-tracked bigrams. Happy to address any feedback.


Labels

size: S · stale (marked as stale due to inactivity)
