fix(memory-core): use CJK-aware tokenizer for dreaming dedupe (#80613) #86645
The dreaming-phases dedupe path's local `tokenizeSnippet` split on `/[^a-z0-9]+/i`, producing empty token sets for pure-CJK snippets and dropping all CJK content for mixed snippets. That had two failure modes:

1. Two close paraphrases of the same Chinese fact tokenized to empty sets, fell back to exact-string match, returned similarity 0, and ended up as duplicate candidates in MEMORY.md.
2. Two semantically distinct CJK snippets that happened to share ASCII tokens (e.g. `Plan` + `exRule`) returned similarity 1.0, so the dedupe path silently dropped one of the two distinct memories.

The memory MMR layer already has a CJK-aware tokenizer (`extensions/memory-core/src/memory/mmr.ts`: unigrams + adjacent bigrams + ASCII alphanumerics). This change extracts it into `extensions/memory-core/src/memory/tokenize.ts` and routes the dreaming dedupe path through the same helper via `textSimilarity`. `mmr.ts` re-exports `tokenize` / `jaccardSimilarity` / `textSimilarity` so existing imports (including `mmr.test.ts`) continue to work without churn.

Verification with the patched module against the reporter's CJK scenarios:

- Pure-CJK paraphrase pair textSimilarity: 0 -> 0.622 (dedup threshold 0.5 now succeeds).
- Mixed-CJK distinct pair textSimilarity: 1.000 -> 0.056 (two distinct facts now kept).
- English paraphrase: 0.600 (Latin behavior unchanged).
- Unrelated short snippets: 0.000 (no over-collapse).

Scope: Bug 2 from issue #80613 only. Bug 1 (promotion rehydration leaks managed dreaming block lines into MEMORY.md) is a separate end-to-end fixture problem that clawsweeper flagged as not high-confidence-reproducible from source alone; it should be addressed in a separate PR with a targeted promotion-path reproduction. This PR is the narrow CJK dedupe repair that clawsweeper directly endorsed.
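To make the tokenizer difference concrete, here is a minimal sketch of the two approaches described above. It is not the code in `tokenize.ts`: the character ranges in `CJK_CHAR`, the `legacyTokenize` name, and the lowercasing and filtering details are assumptions chosen for illustration. Only the split regex and the "unigrams + adjacent bigrams + ASCII alphanumerics" scheme come from the PR description.

```typescript
// Hypothetical sketch; the real tokenize.ts may differ in ranges and details.
// Assumed CJK ranges: kana, CJK Extension A, CJK Unified Ideographs, Hangul syllables.
const CJK_CHAR = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]/;

// Approximation of the old ASCII-only behavior: split on non-alphanumerics.
function legacyTokenize(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/[^a-z0-9]+/i).filter(Boolean));
}

// CJK-aware tokenizer: ASCII word tokens, CJK unigrams, adjacent CJK bigrams.
function tokenize(text: string): Set<string> {
  const lower = text.toLowerCase();
  const tokens = new Set<string>(lower.match(/[a-z0-9]+/g) ?? []);
  const chars = Array.from(lower);
  for (let i = 0; i < chars.length; i++) {
    const ch = chars[i];
    if (ch === undefined || !CJK_CHAR.test(ch)) continue;
    tokens.add(ch); // unigram
    const next = chars[i + 1];
    if (next !== undefined && CJK_CHAR.test(next)) {
      tokens.add(ch + next); // bigram of two adjacent CJK characters
    }
  }
  return tokens;
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|. Two empty sets score 1.
function jaccardSimilarity(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1;
  let intersection = 0;
  a.forEach((token) => {
    if (b.has(token)) intersection++;
  });
  return intersection / (a.size + b.size - intersection);
}
```

Under these assumptions, `legacyTokenize("用户喜欢")` is empty, while `tokenize("用户喜欢")` yields four unigrams and three bigrams. The snippets `深色模式` and `深色主题` share 3 of 11 distinct tokens, so their Jaccard similarity is about 0.27 rather than collapsing to 0 or 1.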
oxlint flagged Array#sort() in the new regression test; use Array#toSorted() instead. Non-functional change — test logic and output are unchanged.
…e to empty (#80613)

Addresses chatgpt-codex-connector P1 review on #80620.

textSimilarity is used by dreaming dedupeEntries to merge near-duplicate recall entries. The shared tokenize() only emits ASCII word-tokens and CJK uni-/bigrams, so inputs in other scripts (Cyrillic, Arabic, emoji-only, punctuation-only) tokenize to the empty set. Raw Jaccard returns 1 for two empty sets. That is the correct, intentional semantics for MMR re-ranking and is asserted by mmr.test.ts, but for the dedupe path it would collapse distinct non-tokenized snippets into one and drop data.

Add a literal normalized-string equality fallback inside textSimilarity for the both-empty case only. Non-empty cases (the existing MMR path) keep Jaccard semantics unchanged.

Add a regression test in mmr.test.ts covering Cyrillic, Arabic, emoji-only, and punctuation-only snippets: distinct stays 0, identical stays 1.
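The both-empty fallback described in this commit can be sketched roughly as follows. This is an illustration, not the patched source: `tokenizeSketch`, `jaccard`, and `normalize` are hypothetical helpers, the CJK handling is reduced to unigrams for brevity, and the normalization (trim, collapse whitespace, lowercase) is an assumption about what "normalized-string equality" means.

```typescript
// Hypothetical sketch of the both-empty fallback; helper names are assumptions.
function tokenizeSketch(text: string): Set<string> {
  const lower = text.toLowerCase();
  const tokens = new Set<string>(lower.match(/[a-z0-9]+/g) ?? []);
  // Simplified CJK handling (unigrams only) to keep the example short.
  Array.from(lower).forEach((ch) => {
    if (/[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff]/.test(ch)) tokens.add(ch);
  });
  return tokens;
}

// Raw Jaccard keeps the "two empty sets are identical" semantics.
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1;
  let intersection = 0;
  a.forEach((token) => {
    if (b.has(token)) intersection++;
  });
  return intersection / (a.size + b.size - intersection);
}

// Assumed normalization: trim, collapse whitespace, lowercase.
function normalize(text: string): string {
  return text.trim().replace(/\s+/g, " ").toLowerCase();
}

function textSimilarity(a: string, b: string): number {
  const tokensA = tokenizeSketch(a);
  const tokensB = tokenizeSketch(b);
  if (tokensA.size === 0 && tokensB.size === 0) {
    // Neither input produced tokens: compare the strings literally so that
    // distinct Cyrillic, Arabic, emoji-only, or punctuation-only snippets
    // are not merged by the dedupe path.
    return normalize(a) === normalize(b) ? 1 : 0;
  }
  return jaccard(tokensA, tokensB);
}
```

With this shape, two different Cyrillic snippets score 0 and two identical ones score 1, while snippets that do tokenize still go through plain Jaccard.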
…#80613)

Upstream lint sweep #83542 (chore(lint): remove underscore-dangle allow list) removed the `__testing` alias from the lint allow list, exposing that the 4 new CJK regression tests added in c497966 referenced `__testing.dedupeEntries` while the import statement only brought in `testing`. After upstream's rebase merge into this branch, tsgo reported TS2552 on dreaming-phases.test.ts:3028,3042,3054,3063 and a TS7006 implicit any on the inferred entry param (cascade from the missing identifier).

Fix: use the imported `testing.dedupeEntries` directly. The `testing as __testing` alias still exists in dreaming-phases.ts for any other consumers; this only adjusts the local test references.

Verification: pnpm tsgo:extensions:test reports 0 errors in dreaming-phases.test.ts (the 6 remaining errors are pre-existing infra issues unrelated to this branch: @openclaw/proxyline resolution, src/plugin-sdk/file-lock.ts type narrowing).
Codex review: passed. Reviewed May 25, 2026, 5:47 PM ET / 21:47 UTC.

Summary

PR surface: Source +15, Tests +96. Total +111 across 5 files.

Reproducibility: yes. Current main has an ASCII-only tokenizeSnippet path in dreaming dedupe, and the source PR includes terminal before/after proof against production source bytes for the CJK failure modes; I did not run tests locally because this review is read-only.

Review metrics: 1 noteworthy metric.
Review details

Best possible solution: Land the tokenizer fix after changing the PR body so the remaining managed-block leak stays open or is tracked by a separate canonical issue.

Do we have a high-confidence way to reproduce the issue? Yes. Current main has an ASCII-only tokenizeSnippet path in dreaming dedupe, and the source PR includes terminal before/after proof against production source bytes for the CJK failure modes; I did not run tests locally because this review is read-only.

Is this the best way to solve the issue? Yes for the CJK dedupe portion: extracting the existing CJK-aware tokenizer and preserving mmr.ts re-exports is the narrow maintainable fix. The PR body should not close the whole linked two-part issue until the promotion leak is tracked or fixed.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 5b6d03e3e2f1.
…aw#80613) (openclaw#86645)

Summary:
- The PR extracts the CJK-aware memory tokenizer into a shared helper, routes dreaming dedupe through it, preserves MMR re-exports, and adds regression coverage for CJK and empty-token cases.
- PR surface: Source +15, Tests +96. Total +111 across 5 files.
- Reproducibility: yes. Current main has an ASCII-only tokenizeSnippet path in dreaming dedupe, and the source ... ction source bytes for the CJK failure modes; I did not run tests locally because this review is read-only.

Automerge notes:
- PR branch already contained follow-up commit before automerge: fix(memory-core): use Array.toSorted for openclaw#80613 lint fix
- PR branch already contained follow-up commit before automerge: fix(memory-core): preserve dedupe identity when both snippets tokeniz…
- PR branch already contained follow-up commit before automerge: fix(memory-core): rename __testing to testing in CJK regression tests…
- PR branch already contained follow-up commit before automerge: fix(memory-core): use CJK-aware tokenizer for dreaming dedupe (openclaw#80613)

Validation:
- ClawSweeper review passed for head ca9c027.
- Required merge gates passed before the squash merge.

Prepared head SHA: ca9c027
Review: openclaw#86645 (comment)

Co-authored-by: MoerAI <friendnt@g.skku.edu>
Co-authored-by: clawsweeper <274271284+clawsweeper[bot]@users.noreply.github.com>
Co-authored-by: clawsweeper[bot] <274271284+clawsweeper[bot]@users.noreply.github.com>
Makes #80620 merge-ready for the ClawSweeper automerge loop.
The edit pass should inspect the live PR diff, review comments, and failing checks; rebase if needed; keep the contributor branch credited; and stop only when validation is green or an external blocker is proven.
ClawSweeper 🐠 replacement reef notes:
Inherited issue-closing references from the source PR:
Closes #80613
Co-author credit kept:
fish notes: model gpt-5.5, reasoning high; reviewed against ca9c027.