fix(memory-core): use CJK-aware tokenizer for dreaming dedupe (#80613)#80620
fix(memory-core): use CJK-aware tokenizer for dreaming dedupe (#80613)#80620MoerAI wants to merge 4 commits into
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c497966ee6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (setA.size === 0 && setB.size === 0) { | ||
| return 1; |
There was a problem hiding this comment.
Preserve empty-token fallback for non-CJK scripts
tokenize only emits ASCII/CJK tokens, so inputs in other scripts (for example Cyrillic, Arabic, emoji-only, or punctuation-only snippets) produce empty sets on both sides. With jaccardSimilarity now returning 1 when both sets are empty, dedupeEntries in dreaming-phases.ts will treat distinct snippets from the same path as duplicates and drop one whenever the threshold is <= 1. The previous dedupe logic only treated empty-token pairs as equal when the normalized full strings matched exactly, so this change introduces false merges and data loss for non-tokenized languages/content.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Applied in 0d30320 — added an exact-string-equality fallback inside textSimilarity for the both-empty case so distinct non-CJK/non-ASCII snippets no longer collapse. Jaccard semantics for non-empty inputs (and the MMR re-ranking suite) are unchanged. Regression added in mmr.test.ts covering Cyrillic, Arabic, emoji-only, and punctuation-only snippets. See full reply: #80620 (comment)
|
Codex review: found issues before merge. Latest ClawSweeper review: 2026-05-22 06:53 UTC / May 22, 2026, 2:53 AM ET. Workflow note: Future ClawSweeper reviews update this same comment in place. How this review workflow works
Summary Reproducibility: yes. Current main's PR rating Rank-up moves:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. Real behavior proof Risk before merge
Maintainer options:
Next step before merge Security Review findings
Review detailsBest possible solution: Land the shared-tokenizer fix after retargeting the linked issue reference and refreshing the stale comment, while keeping the promotion-leak work tracked separately. Do we have a high-confidence way to reproduce the issue? Yes. Current main's Is this the best way to solve the issue? Yes for the code path: extracting the existing CJK-aware MMR tokenizer is the narrow maintainable fix. The merge path should still retarget the broader issue-closing metadata before landing. Label justifications:
Full review comments:
Overall correctness: patch is correct What I checked:
Likely related people:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 170f72d5a161. |
|
@clawsweeper automerge |
|
🦞🔧
Draft PRs stay fix-only until GitHub marks them ready for review. Pause with Automerge progress:
|
|
🦞✅ Source: I added |
…ydration (openclaw#80613) Daily memory notes can interleave human content with managed `<!-- openclaw:dreaming:light:* -->` and `<!-- openclaw:dreaming:rem:* -->` blocks. The chunk builder strips those regions before snippet capture, but `rehydratePromotionCandidate` re-reads the raw source file and feeds it to `relocateCandidateRange`, whose fuzzy window search will happily latch onto a window that straddles the human bullet and the adjacent dreaming bullets. That leaks `- Candidate: …` / `confidence: …` / `status: staged` lines into `MEMORY.md`. Add `redactManagedDreamingLines` and call it on the source split before relocation, mirroring the chunk-side `stripManagedDailyDreamingLines` heading-walk so the `## Light Sleep` / `## REM Sleep` heading is also zeroed when it sits directly above the start marker. Unterminated managed blocks are redacted through the end of file rather than left as a partial window. Cover with a unit test of the helper (terminated, unterminated, multiple markers) and an integration test that writes a note with a `## Light Sleep` dreaming block and asserts the promoted `MEMORY.md` keeps the human bullet and contains no `Candidate:` / `confidence:` / `status: staged` / `openclaw:dreaming:light` traces. Refs openclaw#80620 (CJK dedupe) — that PR fixes the second sub-bug from the issue; this one only addresses the promotion-leak half.
…ydration (openclaw#80613) Daily memory notes can interleave human content with managed `<!-- openclaw:dreaming:light:* -->` and `<!-- openclaw:dreaming:rem:* -->` blocks. The chunk builder strips those regions before snippet capture, but `rehydratePromotionCandidate` re-reads the raw source file and feeds it to `relocateCandidateRange`, whose fuzzy window search will happily latch onto a window that straddles the human bullet and the adjacent dreaming bullets. That leaks `- Candidate: …` / `confidence: …` / `status: staged` lines into `MEMORY.md`. Add `redactManagedDreamingLines` and call it on the source split before relocation, mirroring the chunk-side `stripManagedDailyDreamingLines` heading-walk so the `## Light Sleep` / `## REM Sleep` heading is also zeroed when it sits directly above the start marker. Unterminated managed blocks are redacted through the end of file rather than left as a partial window. Cover with a unit test of the helper (terminated, unterminated, multiple markers) and an integration test that writes a note with a `## Light Sleep` dreaming block and asserts the promoted `MEMORY.md` keeps the human bullet and contains no `Candidate:` / `confidence:` / `status: staged` / `openclaw:dreaming:light` traces. Refs openclaw#80620 (CJK dedupe) — that PR fixes the second sub-bug from the issue; this one only addresses the promotion-leak half.
|
When can it be merged into the main branch? I think this bug still has a significant impact on CJK users. |
|
Thanks for chiming in, @p0pfan — the impact on CJK users is exactly what motivated this fix. Status as of today:
The PR is sitting at the maintainer-human-review gate, not a code-quality gate. Flagging for visibility — happy to address any concerns or rebase if needed. @steipete @Takhoffman |
|
@clawsweeper automerge |
…e to empty (openclaw#80613) Addresses chatgpt-codex-connector P1 review on openclaw#80620. textSimilarity is used by dreaming dedupeEntries to merge near-duplicate recall entries. The shared tokenize() only emits ASCII word-tokens and CJK uni-/bigrams, so inputs in other scripts (Cyrillic, Arabic, emoji-only, punctuation-only) tokenize to the empty set. Raw Jaccard returns 1 for two empty sets — that is the correct, intentional semantics for MMR re-ranking and is asserted by mmr.test.ts — but for the dedupe path it would collapse distinct non-tokenized snippets into one and drop data. Add a literal normalized-string equality fallback inside textSimilarity for the both-empty case only. Non-empty cases (the existing MMR path) keep Jaccard semantics unchanged. Add a regression test in mmr.test.ts covering Cyrillic, Arabic, emoji-only, and punctuation-only snippets: distinct stays 0, identical stays 1.
|
Applied in You're right — Narrow fix at the
Why not change Verification:
|
|
Hi @MoerAI , I just find that this bug fix still hasn't been merged. |
…aw#80613) The dreaming-phases dedupe path's local `tokenizeSnippet` split on `/[^a-z0-9]+/i`, producing empty token sets for pure-CJK snippets and dropping all CJK content for mixed snippets. That had two failure modes: 1. Two close paraphrases of the same Chinese fact tokenized to empty sets, fell back to exact-string match, returned similarity 0, and ended up as duplicate candidates in MEMORY.md. 2. Two semantically distinct CJK snippets that happened to share ASCII tokens (e.g. `Plan` + `exRule`) returned similarity 1.0, so the dedupe path silently dropped one of the two distinct memories. The memory MMR layer already has a CJK-aware tokenizer (`extensions/memory-core/src/memory/mmr.ts`: unigrams + adjacent bigrams + ASCII alphanumerics). This change extracts it into `extensions/memory-core/src/memory/tokenize.ts` and routes the dreaming dedupe path through the same helper via `textSimilarity`. `mmr.ts` re-exports `tokenize` / `jaccardSimilarity` / `textSimilarity` so existing imports (including `mmr.test.ts`) continue to work without churn. Verification with the patched module against the reporter's CJK scenarios: - Pure-CJK paraphrase pair textSimilarity: 0 -> 0.622 (dedup threshold 0.5 now succeeds). - Mixed-CJK distinct pair textSimilarity: 1.000 -> 0.056 (two distinct facts now kept). - English paraphrase: 0.600 (Latin behavior unchanged). - Unrelated short snippets: 0.000 (no over-collapse). Scope: Bug 2 from issue openclaw#80613 only. The Bug 1 (promotion rehydration leaks managed dreaming block lines into MEMORY.md) is a separate end-to-end fixture problem that clawsweeper flagged as not high-confidence-reproducible from source alone; it should be addressed in a separate PR with a targeted promotion-path reproduction. This PR is the narrow CJK dedupe repair that clawsweeper directly endorsed.
oxlint flagged Array#sort() in the new regression test; use Array#toSorted() instead. Non-functional change — test logic and output are unchanged.
…e to empty (openclaw#80613) Addresses chatgpt-codex-connector P1 review on openclaw#80620. textSimilarity is used by dreaming dedupeEntries to merge near-duplicate recall entries. The shared tokenize() only emits ASCII word-tokens and CJK uni-/bigrams, so inputs in other scripts (Cyrillic, Arabic, emoji-only, punctuation-only) tokenize to the empty set. Raw Jaccard returns 1 for two empty sets — that is the correct, intentional semantics for MMR re-ranking and is asserted by mmr.test.ts — but for the dedupe path it would collapse distinct non-tokenized snippets into one and drop data. Add a literal normalized-string equality fallback inside textSimilarity for the both-empty case only. Non-empty cases (the existing MMR path) keep Jaccard semantics unchanged. Add a regression test in mmr.test.ts covering Cyrillic, Arabic, emoji-only, and punctuation-only snippets: distinct stays 0, identical stays 1.
…openclaw#80613) Upstream lint sweep openclaw#83542 (chore(lint): remove underscore-dangle allow list) removed the `__testing` alias from the lint allow list, exposing that the 4 new CJK regression tests added in c497966 referenced `__testing.dedupeEntries` while the import statement only brought in `testing`. After upstream's rebase merge into this branch, tsgo reported TS2552 on dreaming-phases.test.ts:3028,3042,3054,3063 and a TS7006 implicit any on the inferred entry param (cascade from the missing identifier). Fix: use the imported `testing.dedupeEntries` directly. The `testing as __testing` alias still exists in dreaming-phases.ts for any other consumers; this only adjusts the local test references. Verification: pnpm tsgo:extensions:test reports 0 errors in dreaming-phases.test.ts (the 6 remaining errors are pre-existing infra issues unrelated to this branch: @openclaw/proxyline resolution, src/plugin-sdk/file-lock.ts type narrowing).
0d30320 to
84bcee9
Compare
|
Hi @p0pfan — thanks for the nudge, you're right that this should have landed already. What happened: after the chatgpt-codex-connector P1 fix in Just pushed
Should clear |
|
ClawSweeper PR egg ✨ Hatched: 🥚 common Pearl Shellbean Hatch commandComment Hatchability rules:
Rarity: 🥚 common. What is this egg doing here?
|
|
@clawsweeper automerge |
1 similar comment
|
@clawsweeper automerge |
|
Why this pr is still waiting |
|
Status update: rebase head
The earlier @Takhoffman — happy for you to re-issue |
|
@clawsweeper approve |
|
🦞👀 Command router queued. I will update this comment with the next step. |
|
ClawSweeper 🐠 reef update Thanks for the work on this. ClawSweeper could not push to this branch with the permissions available, so it opened a narrow replacement PR to keep the fix swimming forward without losing the contributor trail. not your fault, just GitHub branch-permission tides. Why replacement: ClawSweeper could not update the source PR branch directly; GitHub did not grant sufficient push rights to the bot for that branch.
fish notes: model gpt-5.5, reasoning high; reviewed against ca9c027. |
Summary
The dreaming-phases dedupe path's local
tokenizeSnippetsplit on/[^a-z0-9]+/i, producing empty token sets for pure-CJK snippets and dropping all CJK content for mixed snippets. That had two failure modes on current main:0, and ended up as duplicate candidates in MEMORY.md.Plan+exRule) returned similarity1.0, so the dedupe path silently dropped one of the two distinct memories.The memory MMR layer at
extensions/memory-core/src/memory/mmr.tsalready has a CJK-aware tokenizer (unigrams + adjacent bigrams + ASCII alphanumerics). This PR extracts it intoextensions/memory-core/src/memory/tokenize.tsand routes the dreaming dedupe path through the same helper viatextSimilarity.mmr.tsre-exportstokenize/jaccardSimilarity/textSimilarityso existing imports (includingmmr.test.ts) continue to work without churn.Root Cause
extensions/memory-core/src/dreaming-phases.ts:1347(before):function tokenizeSnippet(snippet) { return new Set(snippet.toLowerCase().split(/[^a-z0-9]+/i).map(t => t.trim()).filter(Boolean)); }. CJK characters fall outside[a-z0-9], so they are split into delimiters and dropped.extensions/memory-core/src/dreaming-phases.ts:1357(before):function jaccardSimilarity(left, right)calls the broken tokenizer and, when either token set is empty, falls back toleft.trim().toLowerCase() === right.trim().toLowerCase() ? 1 : 0. That makes close-but-not-identical CJK pairs return 0 (missed dedup) and identical-ASCII pairs return 1 (spurious dedup, drops distinct content).dedupeEntries({ entries, threshold })(dreaming-phases.ts:1373) →jaccardSimilarity(candidate.snippet, entry.snippet) >= threshold→ either misses CJK duplicates or wrongly merges distinct CJK candidates → light dreaming output →MEMORY.mdaccumulates duplicate or loses unique entries.Changes
extensions/memory-core/src/memory/tokenize.ts(NEW): extract the CJK-awaretokenize,jaccardSimilarity, andtextSimilarityhelpers frommmr.tsinto a shared module so both consumers route through one source of truth.extensions/memory-core/src/memory/mmr.ts: delete the in-fileCJK_RE,tokenize,jaccardSimilarity,textSimilaritybodies; import them from./tokenize.jsand re-export verbatim so existing imports (mmr.test.tsetc.) keep working.extensions/memory-core/src/dreaming-phases.ts: delete the ASCII-onlytokenizeSnippetand localjaccardSimilarity. Replace the single call site indedupeEntrieswithsnippetSimilarity(aliased fromtextSimilarityin./memory/tokenize.js). ExposededupeEntriesvia__testingfor the regression test.extensions/memory-core/src/dreaming-phases.test.ts: add adedupeEntries — CJK-aware snippet similarity (#80613)describe with 4 colocated regression cases: pure-CJK dedup, mixed-CJK kept distinct, English paraphrase unchanged, unrelated short snippets stay separate.Net diff: +86 / -103 LOC across 4 files (1 new + 3 modified) — a reduction from removing the duplicate ASCII-only tokenizer.
Real behavior proof
教训:配置中实验开关字段是叫做规则and教训:配置里实验开关的字段叫做规则) both tokenize to empty sets, fall through to exact-match, return similarity0, and both reach MEMORY.md instead of one merging into the other. (2) Two distinct CJK snippets that share ASCII tokens (Plan 实验开关字段叫做 exRulevsPlan 整个产品体系彻底重构 exRule) return similarity1.0, so the dedupe pass silently drops one of two semantically different memories. Issue: [Bug]: dreaming pipeline leaks raw candidate content into MEMORY.md and CJK dedup is ineffective in tokenizeSnippet #80613../openclaw-80613on Windows 11 + Node 22.14. The before-fix smoke script readsdreaming-phases.tsfrom currentupstream/main, extracts the productiontokenizeSnippet+jaccardSimilaritybytes verbatim, and runs them against the issue's CJK scenarios. The after-fix smoke script reads the patchedextensions/memory-core/src/memory/tokenize.tsfrom this PR head, prints its SHA-256, and runs the sharedtokenize/textSimilarityagainst the same scenarios. Both scripts run vianode --experimental-strip-types— the production function bytes drive the assertions, no mocks. PR head:c497966ee6d12659dc750a1819969d38817f53d1.git checkout fix/dreaming-cjk-tokenizer && node --experimental-strip-types ./smoke-verify-80613.mts, wheresmoke-verify-80613.mtsreads the patchedtokenize.tsfromextensions/memory-core/src/memory/tokenize.ts, hashes it, then runs the CJK-awaretokenizeandtextSimilarityagainst the same four scenarios used in the colocated regression test (pure-CJK paraphrase, mixed-CJK distinct, English paraphrase control, unrelated short).c497966ee6(Windows 11 + Node 22.14, the patchedextensions/memory-core/src/memory/tokenize.tsexercised vianode --experimental-strip-typesagainst verbatim source bytes; tokenize.ts SHA-2569cd5a672bb1866c2db2384af578a8efd12d0009a69553b6d7a84cc6ee048596b).0.000(was missed by ASCII-only tokenizer falling through to exact-match) to 0.622 — above the typical0.5–0.6dedupe threshold, so the second copy is now correctly merged. The mixed-CJK distinct pair went from1.000(spurious merge that silently dropped one of two distinct memories) to 0.056 — well below the0.7threshold, so both semantically different snippets are kept. The English paraphrase control stayed at0.600(Latin-script behavior unchanged). Unrelated short snippets stayed at0.000(no over-collapse). The patchedtokenize.tsis pinned by SHA-2569cd5a672bb1866c2db2384af578a8efd12d0009a69553b6d7a84cc6ee048596b, so the proof is bound to the exact source bytes this PR ships.~/.openclaw/workspacewith Chinese daily memory files and a configured agent). The fix is a pure-function tokenizer change at thededupeEntriesboundary; the downstream snippet-promotion path (short-term-promotion.ts) consumes the returnedShortTermRecallEntry[]verbatim, so the only behavioral change is the now-correct dedupe arithmetic. Bug 1 from the same issue (managed dreaming block content leaking into promoted MEMORY.md entries viabuildDailySnippetChunks/stripManagedDailyDreamingLinesboundary mismatch) is intentionally out of scope here — clawsweeper flagged it as not high-confidence-reproducible from source alone ("the raw managed-block leak has a credible source path through promotion rehydration, but still needs a focused regression to pin the exact current-main fixture"), and the right shape is a separate promotion-path PR with a targeted fixture. Maintainers may applyproof: overrideif a live CJK workspace recording is required.Test
pnpm test extensions/memory-core/src/dreaming-phases.test.ts— newdedupeEntries — CJK-aware snippet similarity (#80613)describe with 4 cases (CJK paraphrase dedup / mixed-CJK kept distinct / English paraphrase control / unrelated short kept separate); all existing cases continue to pass.pnpm test extensions/memory-core/src/memory/mmr.test.ts— unchanged; the re-exports oftokenize/jaccardSimilarity/textSimilarityfrommmr.tspreserve the existing import surface so this test stays green.pnpm tsgoagainst the touched files — clean (LSP diagnostics ontokenize.ts,mmr.ts,dreaming-phases.ts,dreaming-phases.test.tsshow only pre-existing environment-onlyopenclaw/plugin-sdk/*resolution warnings; no new TypeScript errors introduced).Notes
mmr.tsre-exportstokenize/jaccardSimilarity/textSimilarityso every existing call site (includingmmr.test.ts) keeps working without an import-path change. The CJK-aware tokenizer is byte-for-byte the same algorithmmmr.tsalready shipped, so MMR behavior is unchanged.Closes #80613