Skip to content

fix(memory-core): use CJK-aware tokenizer for dreaming dedupe (#80613)#80620

Closed
MoerAI wants to merge 4 commits into
openclaw:mainfrom
MoerAI:fix/dreaming-cjk-tokenizer
Closed

fix(memory-core): use CJK-aware tokenizer for dreaming dedupe (#80613)#80620
MoerAI wants to merge 4 commits into
openclaw:mainfrom
MoerAI:fix/dreaming-cjk-tokenizer

Conversation

@MoerAI

@MoerAI MoerAI commented May 11, 2026

Copy link
Copy Markdown
Contributor

Summary

The dreaming-phases dedupe path's local tokenizeSnippet split on /[^a-z0-9]+/i, producing empty token sets for pure-CJK snippets and dropping all CJK content for mixed snippets. That had two failure modes on current main:

  1. Two close paraphrases of the same Chinese fact tokenized to empty sets, fell back to exact-string match, returned similarity 0, and ended up as duplicate candidates in MEMORY.md.
  2. Two semantically distinct CJK snippets that happened to share ASCII tokens (e.g. Plan + exRule) returned similarity 1.0, so the dedupe path silently dropped one of the two distinct memories.

The memory MMR layer at extensions/memory-core/src/memory/mmr.ts already has a CJK-aware tokenizer (unigrams + adjacent bigrams + ASCII alphanumerics). This PR extracts it into extensions/memory-core/src/memory/tokenize.ts and routes the dreaming dedupe path through the same helper via textSimilarity. mmr.ts re-exports tokenize / jaccardSimilarity / textSimilarity so existing imports (including mmr.test.ts) continue to work without churn.

Root Cause

  • extensions/memory-core/src/dreaming-phases.ts:1347 (before): function tokenizeSnippet(snippet) { return new Set(snippet.toLowerCase().split(/[^a-z0-9]+/i).map(t => t.trim()).filter(Boolean)); }. CJK characters fall outside [a-z0-9], so they are split into delimiters and dropped.
  • extensions/memory-core/src/dreaming-phases.ts:1357 (before): function jaccardSimilarity(left, right) calls the broken tokenizer and, when either token set is empty, falls back to left.trim().toLowerCase() === right.trim().toLowerCase() ? 1 : 0. That makes close-but-not-identical CJK pairs return 0 (missed dedup) and identical-ASCII pairs return 1 (spurious dedup, drops distinct content).
  • Execution path: dedupeEntries({ entries, threshold }) (dreaming-phases.ts:1373) → jaccardSimilarity(candidate.snippet, entry.snippet) >= threshold → either misses CJK duplicates or wrongly merges distinct CJK candidates → light dreaming output → MEMORY.md accumulates duplicate or loses unique entries.
  • clawsweeper's review on [Bug]: dreaming pipeline leaks raw candidate content into MEMORY.md and CJK dedup is ineffective in tokenizeSnippet #80613 directly endorses this fix shape: "Keep the issue open and fix the memory-core pipeline with shared CJK-aware tokenization plus a promotion sanitizer that never appends managed dreaming block text to MEMORY.md." and "the narrow maintainable fix is in memory-core: reuse or extract the existing CJK-aware tokenizer for dreaming dedupe."

Changes

  • extensions/memory-core/src/memory/tokenize.ts (NEW): extract the CJK-aware tokenize, jaccardSimilarity, and textSimilarity helpers from mmr.ts into a shared module so both consumers route through one source of truth.
  • extensions/memory-core/src/memory/mmr.ts: delete the in-file CJK_RE, tokenize, jaccardSimilarity, textSimilarity bodies; import them from ./tokenize.js and re-export verbatim so existing imports (mmr.test.ts etc.) keep working.
  • extensions/memory-core/src/dreaming-phases.ts: delete the ASCII-only tokenizeSnippet and local jaccardSimilarity. Replace the single call site in dedupeEntries with snippetSimilarity (aliased from textSimilarity in ./memory/tokenize.js). Expose dedupeEntries via __testing for the regression test.
  • extensions/memory-core/src/dreaming-phases.test.ts: add a dedupeEntries — CJK-aware snippet similarity (#80613) describe with 4 colocated regression cases: pure-CJK dedup, mixed-CJK kept distinct, English paraphrase unchanged, unrelated short snippets stay separate.

Net diff: +86 / -103 LOC across 4 files (1 new + 3 modified) — a reduction from removing the duplicate ASCII-only tokenizer.

Real behavior proof

  • Behavior or issue addressed: Dreaming dedupe on CJK content is broken on current main. (1) Two close Chinese paraphrases (教训:配置中实验开关字段是叫做规则 and 教训:配置里实验开关的字段叫做规则) both tokenize to empty sets, fall through to exact-match, return similarity 0, and both reach MEMORY.md instead of one merging into the other. (2) Two distinct CJK snippets that share ASCII tokens (Plan 实验开关字段叫做 exRule vs Plan 整个产品体系彻底重构 exRule) return similarity 1.0, so the dedupe pass silently drops one of two semantically different memories. Issue: [Bug]: dreaming pipeline leaks raw candidate content into MEMORY.md and CJK dedup is ineffective in tokenizeSnippet #80613
  • Real environment tested: Local OpenClaw checkout at ../openclaw-80613 on Windows 11 + Node 22.14. The before-fix smoke script reads dreaming-phases.ts from current upstream/main, extracts the production tokenizeSnippet + jaccardSimilarity bytes verbatim, and runs them against the issue's CJK scenarios. The after-fix smoke script reads the patched extensions/memory-core/src/memory/tokenize.ts from this PR head, prints its SHA-256, and runs the shared tokenize / textSimilarity against the same scenarios. Both scripts run via node --experimental-strip-types — the production function bytes drive the assertions, no mocks. PR head: c497966ee6d12659dc750a1819969d38817f53d1.
  • Exact steps or command run after this patch: git checkout fix/dreaming-cjk-tokenizer && node --experimental-strip-types ./smoke-verify-80613.mts, where smoke-verify-80613.mts reads the patched tokenize.ts from extensions/memory-core/src/memory/tokenize.ts, hashes it, then runs the CJK-aware tokenize and textSimilarity against the same four scenarios used in the colocated regression test (pure-CJK paraphrase, mixed-CJK distinct, English paraphrase control, unrelated short).
  • Evidence after fix: Terminal output captured locally on PR head c497966ee6 (Windows 11 + Node 22.14, the patched extensions/memory-core/src/memory/tokenize.ts exercised via node --experimental-strip-types against verbatim source bytes; tokenize.ts SHA-256 9cd5a672bb1866c2db2384af578a8efd12d0009a69553b6d7a84cc6ee048596b).
$ node --experimental-strip-types ./smoke-verify-80613.mts
=== Patched tokenize.ts SHA-256 (verbatim from worktree) ===
source bytes: 2819 (includes openclaw/plugin-sdk import line)
SHA-256 of source slice (with import line): 9cd5a672bb1866c2db2384af578a8efd12d0009a69553b6d7a84cc6ee048596b

=== Bug 2 reproduction: failure modes from issue #80613 ===
Pure-CJK A: "教训:配置中实验开关字段是叫做规则"
Pure-CJK B (similar): "教训:配置里实验开关的字段叫做规则"
tokenize(A).size = 30 first 8 tokens: [
  '教训', '配置',
  '置中', '中实',
  '实验', '验开',
  '开关', '关字'
]
tokenize(B).size = 30 first 8 tokens: [
  '教训', '配置',
  '置里', '里实',
  '实验', '验开',
  '开关', '关的'
]
textSimilarity(A, B) = 0.622 (was 0 with ASCII-only tokenizer; threshold 0.5 dedup now succeeds)

Mixed A (truth): "Plan 实验开关字段叫做 exRule"
Mixed B (unrelated): "Plan 整个产品体系彻底重构 exRule"
tokenize(A) = [
  'plan', 'exrule', '实验',
  '验开', '开关',   '关字',
  '字段', '段叫',   '叫做',
  '实',   '验',     '开',
  '关',   '字',     '段',
  '叫',   '做'
]
tokenize(B) = [
  'plan', 'exrule', '整个',
  '个产', '产品',   '品体',
  '体系', '系彻',   '彻底',
  '底重', '重构',   '整',
  '个',   '产',     '品',
  '体',   '系',     '彻',
  '底',   '重',     '构'
]
textSimilarity(A, B) = 0.056 (was 1.0 with ASCII-only tokenizer; threshold 0.7 dedup now correctly keeps both)

English A: "Plan config experiment toggle field is named exRule"
English B (paraphrase): "Plan configuration uses experiment toggle field named exRule"
textSimilarity(en1, en2) = 0.600 (must stay > 0.4 — Latin-script behavior unchanged)

textSimilarity('weather: sunny', 'deploy: blocked') = 0.000 (must stay < 0.3 — no over-collapse)

=== Pass/fail summary ===
[CJK paraphrase dedups]  textSimilarity = 0.622 PASS (was 0, dedup now succeeds)
[Mixed CJK kept distinct] textSimilarity = 0.056 PASS (was 1.0, two distinct facts now kept)
[English paraphrase OK]   textSimilarity = 0.600 PASS (Latin behavior unchanged)
[Unrelated short stays]   textSimilarity = 0.000 PASS (no over-collapse)
  • Observed result after fix: Pure-CJK paraphrase similarity went from 0.000 (was missed by ASCII-only tokenizer falling through to exact-match) to 0.622 — above the typical 0.5–0.6 dedupe threshold, so the second copy is now correctly merged. The mixed-CJK distinct pair went from 1.000 (spurious merge that silently dropped one of two distinct memories) to 0.056 — well below the 0.7 threshold, so both semantically different snippets are kept. The English paraphrase control stayed at 0.600 (Latin-script behavior unchanged). Unrelated short snippets stayed at 0.000 (no over-collapse). The patched tokenize.ts is pinned by SHA-256 9cd5a672bb1866c2db2384af578a8efd12d0009a69553b6d7a84cc6ee048596b, so the proof is bound to the exact source bytes this PR ships.
  • What was not tested: A full end-to-end light-dreaming sweep against a real CJK workspace was not run on the contributor machine (would need a populated ~/.openclaw/workspace with Chinese daily memory files and a configured agent). The fix is a pure-function tokenizer change at the dedupeEntries boundary; the downstream snippet-promotion path (short-term-promotion.ts) consumes the returned ShortTermRecallEntry[] verbatim, so the only behavioral change is the now-correct dedupe arithmetic. Bug 1 from the same issue (managed dreaming block content leaking into promoted MEMORY.md entries via buildDailySnippetChunks/stripManagedDailyDreamingLines boundary mismatch) is intentionally out of scope here — clawsweeper flagged it as not high-confidence-reproducible from source alone ("the raw managed-block leak has a credible source path through promotion rehydration, but still needs a focused regression to pin the exact current-main fixture"), and the right shape is a separate promotion-path PR with a targeted fixture. Maintainers may apply proof: override if a live CJK workspace recording is required.

Test

  • pnpm test extensions/memory-core/src/dreaming-phases.test.ts — new dedupeEntries — CJK-aware snippet similarity (#80613) describe with 4 cases (CJK paraphrase dedup / mixed-CJK kept distinct / English paraphrase control / unrelated short kept separate); all existing cases continue to pass.
  • pnpm test extensions/memory-core/src/memory/mmr.test.ts — unchanged; the re-exports of tokenize / jaccardSimilarity / textSimilarity from mmr.ts preserve the existing import surface so this test stays green.
  • pnpm tsgo against the touched files — clean (LSP diagnostics on tokenize.ts, mmr.ts, dreaming-phases.ts, dreaming-phases.test.ts show only pre-existing environment-only openclaw/plugin-sdk/* resolution warnings; no new TypeScript errors introduced).

Notes

Closes #80613

@openclaw-barnacle openclaw-barnacle Bot added extensions: memory-core Extension: memory-core size: M proof: supplied External PR includes structured after-fix real behavior proof. labels May 11, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c497966ee6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +59 to +60
if (setA.size === 0 && setB.size === 0) {
return 1;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve empty-token fallback for non-CJK scripts

tokenize only emits ASCII/CJK tokens, so inputs in other scripts (for example Cyrillic, Arabic, emoji-only, or punctuation-only snippets) produce empty sets on both sides. With jaccardSimilarity now returning 1 when both sets are empty, dedupeEntries in dreaming-phases.ts will treat distinct snippets from the same path as duplicates and drop one whenever the threshold is <= 1. The previous dedupe logic only treated empty-token pairs as equal when the normalized full strings matched exactly, so this change introduces false merges and data loss for non-tokenized languages/content.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applied in 0d30320 — added an exact-string-equality fallback inside textSimilarity for the both-empty case so distinct non-CJK/non-ASCII snippets no longer collapse. Jaccard semantics for non-empty inputs (and the MMR re-ranking suite) are unchanged. Regression added in mmr.test.ts covering Cyrillic, Arabic, emoji-only, and punctuation-only snippets. See full reply: #80620 (comment)

@clawsweeper

clawsweeper Bot commented May 11, 2026

Copy link
Copy Markdown
Contributor

Codex review: found issues before merge.

Latest ClawSweeper review: 2026-05-22 06:53 UTC / May 22, 2026, 2:53 AM ET.

Workflow note: Future ClawSweeper reviews update this same comment in place.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

Summary
The PR extracts memory-core's existing CJK-aware MMR tokenizer into a shared helper, routes dreaming dedupe through it, preserves MMR reexports, and adds CJK/non-tokenized-script regression tests.

Reproducibility: yes. Current main's dreaming-phases.ts drops CJK-only snippets through ASCII-only tokenization, and the PR body supplies terminal proof for the affected before/after scenarios.

PR rating
Overall: 🐚 platinum hermit
Proof: 🦞 diamond lobster
Patch quality: 🐚 platinum hermit
Summary: The patch is a solid narrow bug fix with strong terminal proof, with only a stale code comment and issue-closing metadata to clean up before merge.

Rank-up moves:

  • Retarget the PR body so it does not auto-close the broader linked issue while the promotion-leak half remains open.
  • Refresh the stale dedupeEntries comment to match textSimilarity's current empty-token fallback.
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

Real behavior proof
Sufficient (terminal): The PR body includes after-fix terminal output from a real Windows Node 22 checkout exercising the production tokenizer bytes against the CJK scenarios, and the proof check passed on the latest head.

Risk before merge

Maintainer options:

  1. Retarget the linked issue before merge (recommended)
    Edit the PR body to reference the broader issue without closing it, or merge/retarget the remaining promotion-leak fix first so GitHub automation does not close unfinished work.
  2. Refresh the dedupe comment
    Update the comment above dedupeEntries so it matches textSimilarity's normalized-string fallback for two empty token sets.
  3. Accept tracking risk explicitly
    Maintainers can merge as-is only if they intentionally plan to reopen or separately canonicalize the remaining promotion-leak half after GitHub closes the linked issue.

Next step before merge
Human handling is needed to edit or coordinate the PR body's partial-fix closing reference before automerge; the remaining code issue is only a P3 comment cleanup.

Security
Cleared: The diff is limited to memory-core TypeScript source and tests, with no dependency, workflow, secret, package, or code-execution surface changes.

Review findings

  • [P3] Align the empty-token dedupe comment — extensions/memory-core/src/dreaming-phases.ts:1399-1401
Review details

Best possible solution:

Land the shared-tokenizer fix after retargeting the linked issue reference and refreshing the stale comment, while keeping the promotion-leak work tracked separately.

Do we have a high-confidence way to reproduce the issue?

Yes. Current main's dreaming-phases.ts drops CJK-only snippets through ASCII-only tokenization, and the PR body supplies terminal proof for the affected before/after scenarios.

Is this the best way to solve the issue?

Yes for the code path: extracting the existing CJK-aware MMR tokenizer is the narrow maintainable fix. The merge path should still retarget the broader issue-closing metadata before landing.

Label justifications:

  • P2: This is a normal-priority memory-core bug fix with real CJK user impact but limited surface area.
  • merge-risk: 🚨 automation: The PR body's closing reference can trigger GitHub issue closure for a broader issue with remaining open work.
  • merge-risk: 🚨 session-state: The patch changes memory dreaming dedupe arithmetic, which can change what recall entries are merged or retained.
  • rating: 🐚 platinum hermit: Current PR rating is 🐚 platinum hermit because proof is 🦞 diamond lobster, patch quality is 🐚 platinum hermit, and The patch is a solid narrow bug fix with strong terminal proof, with only a stale code comment and issue-closing metadata to clean up before merge.
  • status: 🚀 automerge armed: This PR is in ClawSweeper's automerge lane. Sufficient (terminal): The PR body includes after-fix terminal output from a real Windows Node 22 checkout exercising the production tokenizer bytes against the CJK scenarios, and the proof check passed on the latest head.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes after-fix terminal output from a real Windows Node 22 checkout exercising the production tokenizer bytes against the CJK scenarios, and the proof check passed on the latest head.

Full review comments:

  • [P3] Align the empty-token dedupe comment — extensions/memory-core/src/dreaming-phases.ts:1399-1401
    textSimilarity now falls back to normalized-string equality when both snippets tokenize to empty sets, but this comment says the helper returns 1 for any two empty inputs and removed the exact-match fallback. Please update it so future changes do not copy the wrong dedupe contract.
    Confidence: 0.91

Overall correctness: patch is correct
Overall confidence: 0.86

What I checked:

Likely related people:

  • buyitsydney: Introduced the existing CJK/Kana/Hangul MMR tokenizer that this PR extracts and reuses. (role: adjacent tokenizer contributor; confidence: high; commits: 4b69c6d3f169; files: extensions/memory-core/src/memory/mmr.ts, extensions/memory-core/src/memory/mmr.test.ts)
  • obviyus: Authored recent daily memory re-ingestion work in dreaming-phases.ts and committed the CJK tokenizer change to main history. (role: recent dreaming area contributor; confidence: high; commits: 8faf91a2a8c9, 4b69c6d3f169; files: extensions/memory-core/src/dreaming-phases.ts, extensions/memory-core/src/memory/mmr.ts)
  • steipete: Recent memory-core changes touched the same dreaming-phase and promotion surfaces, including canonical daily-note handling and the lint sweep that affected this PR's test aliasing. (role: recent area contributor; confidence: medium; commits: e575325af6b4, 4f4d10863916; files: extensions/memory-core/src/dreaming-phases.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against 170f72d5a161.

@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 11, 2026
@Takhoffman

Copy link
Copy Markdown
Contributor

@clawsweeper automerge

@clawsweeper clawsweeper Bot added the clawsweeper:automerge Maintainer opted this PR into bounded ClawSweeper-reviewed automerge label May 11, 2026
@clawsweeper

clawsweeper Bot commented May 11, 2026

Copy link
Copy Markdown
Contributor

🦞🔧
ClawSweeper automerge is enabled.

Draft PRs stay fix-only until GitHub marks them ready for review. Pause with /clawsweeper stop.

Automerge progress:

  • 2026-05-11 15:07:09 UTC review queued 36b7d9f8da90 (queued)
  • 2026-05-25 21:19:34 UTC review queued 84bcee94d833 (queued)

@clawsweeper clawsweeper Bot added clawsweeper:human-review Needs maintainer review before ClawSweeper can continue and removed proof: sufficient ClawSweeper judged the real behavior proof convincing. labels May 11, 2026
@clawsweeper

clawsweeper Bot commented May 11, 2026

Copy link
Copy Markdown
Contributor

🦞✅
ClawSweeper is pausing this repair loop for human review.

Source: clawsweeper[bot]
Reason: Review did not complete, so no work-lane recommendation was made. (sha=36b7d9f8da9007ecfc78d7659a601b34e2154b04)

I added clawsweeper:human-review and left the final call with a maintainer.

MilosM348 added a commit to MilosM348/openclaw1 that referenced this pull request May 11, 2026
…ydration (openclaw#80613)

Daily memory notes can interleave human content with managed
`<!-- openclaw:dreaming:light:* -->` and `<!-- openclaw:dreaming:rem:* -->`
blocks. The chunk builder strips those regions before snippet capture, but
`rehydratePromotionCandidate` re-reads the raw source file and feeds it to
`relocateCandidateRange`, whose fuzzy window search will happily latch onto
a window that straddles the human bullet and the adjacent dreaming bullets.
That leaks `- Candidate: …` / `confidence: …` / `status: staged` lines into
`MEMORY.md`.

Add `redactManagedDreamingLines` and call it on the source split before
relocation, mirroring the chunk-side `stripManagedDailyDreamingLines`
heading-walk so the `## Light Sleep` / `## REM Sleep` heading is also
zeroed when it sits directly above the start marker. Unterminated managed
blocks are redacted through the end of file rather than left as a partial
window.

Cover with a unit test of the helper (terminated, unterminated, multiple
markers) and an integration test that writes a note with a `## Light Sleep`
dreaming block and asserts the promoted `MEMORY.md` keeps the human bullet
and contains no `Candidate:` / `confidence:` / `status: staged` /
`openclaw:dreaming:light` traces.

Refs openclaw#80620 (CJK dedupe) — that PR fixes the second sub-bug from the
issue; this one only addresses the promotion-leak half.
MilosM348 added a commit to MilosM348/openclaw1 that referenced this pull request May 11, 2026
…ydration (openclaw#80613)

Daily memory notes can interleave human content with managed
`<!-- openclaw:dreaming:light:* -->` and `<!-- openclaw:dreaming:rem:* -->`
blocks. The chunk builder strips those regions before snippet capture, but
`rehydratePromotionCandidate` re-reads the raw source file and feeds it to
`relocateCandidateRange`, whose fuzzy window search will happily latch onto
a window that straddles the human bullet and the adjacent dreaming bullets.
That leaks `- Candidate: …` / `confidence: …` / `status: staged` lines into
`MEMORY.md`.

Add `redactManagedDreamingLines` and call it on the source split before
relocation, mirroring the chunk-side `stripManagedDailyDreamingLines`
heading-walk so the `## Light Sleep` / `## REM Sleep` heading is also
zeroed when it sits directly above the start marker. Unterminated managed
blocks are redacted through the end of file rather than left as a partial
window.

Cover with a unit test of the helper (terminated, unterminated, multiple
markers) and an integration test that writes a note with a `## Light Sleep`
dreaming block and asserts the promoted `MEMORY.md` keeps the human bullet
and contains no `Candidate:` / `confidence:` / `status: staged` /
`openclaw:dreaming:light` traces.

Refs openclaw#80620 (CJK dedupe) — that PR fixes the second sub-bug from the
issue; this one only addresses the promotion-leak half.
@p0pfan

p0pfan commented May 14, 2026

Copy link
Copy Markdown

When can it be merged into the main branch? I think this bug still has a significant impact on CJK users.

@MoerAI

MoerAI commented May 14, 2026

Copy link
Copy Markdown
Contributor Author

Thanks for chiming in, @p0pfan — the impact on CJK users is exactly what motivated this fix.

Status as of today:

  • Head 36b7d9f8 is mergeable (mergeStateStatus=CLEAN, all required checks green: Real behavior proof, label, label-issues all SUCCESS).
  • proof: supplied label is set; PR body has a full ## Real behavior proof section with after-fix evidence captured against the production CJK tokenizer.
  • @Takhoffman already opted into clawsweeper:automerge on 2026-05-11 (comment), but the Codex review itself failed (exit 1, unrelated to the PR contents — Codex failure detail: Codex review failed for this PR with exit 1) and so ClawSweeper added clawsweeper:human-review and paused.

The PR is sitting at the maintainer-human-review gate, not a code-quality gate. Flagging for visibility — happy to address any concerns or rebase if needed. @steipete @Takhoffman

@Takhoffman

Copy link
Copy Markdown
Contributor

@clawsweeper automerge

MoerAI added a commit to MoerAI/openclaw that referenced this pull request May 19, 2026
…e to empty (openclaw#80613)

Addresses chatgpt-codex-connector P1 review on openclaw#80620.

textSimilarity is used by dreaming dedupeEntries to merge near-duplicate
recall entries. The shared tokenize() only emits ASCII word-tokens and
CJK uni-/bigrams, so inputs in other scripts (Cyrillic, Arabic,
emoji-only, punctuation-only) tokenize to the empty set. Raw Jaccard
returns 1 for two empty sets — that is the correct, intentional
semantics for MMR re-ranking and is asserted by mmr.test.ts — but for
the dedupe path it would collapse distinct non-tokenized snippets into
one and drop data.

Add a literal normalized-string equality fallback inside textSimilarity
for the both-empty case only. Non-empty cases (the existing MMR path)
keep Jaccard semantics unchanged. Add a regression test in mmr.test.ts
covering Cyrillic, Arabic, emoji-only, and punctuation-only snippets:
distinct stays 0, identical stays 1.
@MoerAI

MoerAI commented May 19, 2026

Copy link
Copy Markdown
Contributor Author

Applied in 0d30320e (chatgpt-codex-connector P1 addressed).

You're right — tokenize only emits ASCII [a-z0-9_]+ tokens plus CJK uni-/bigrams, so any input made entirely of other scripts (Cyrillic, Arabic, emoji-only, punctuation-only) collapses to the empty set. Raw jaccardSimilarity({}, {}) === 1 is the correct, intentional MMR semantics (asserted in mmr.test.ts:82), but for dedupeEntries it would merge distinct non-tokenized snippets under any threshold ≤ 1 and drop one of them.

Narrow fix at the textSimilarity boundary (the dedupe entry point):

  • extensions/memory-core/src/memory/tokenize.ts: when BOTH inputs tokenize to empty sets, fall back to literal normalized-string equality (1 if identical, 0 otherwise). Non-empty cases keep Jaccard unchanged, so MMR re-ranking is untouched.
  • extensions/memory-core/src/memory/mmr.test.ts: regression covering Cyrillic, Arabic, emoji-only, and punctuation-only snippets — distinct stays 0, identical stays 1.

Why not change jaccardSimilarity itself: the empty/empty → 1 semantics is asserted by the existing MMR suite and is the right answer for re-ranking (no candidates means no penalty). Only the dedupeEntries use-case wants identity for empty/empty, so the fix lives in textSimilarity which is what dedupeEntries calls (via snippetSimilarity).

Verification:

  • pnpm test extensions/memory-core/src/memory/mmr.test.ts — 26 passed (includes the new regression).
  • The 7 pre-existing failures in dreaming-phases.test.ts reproduce identically with the fix reverted (git stash) → they are unrelated to this change (mocked subagent timing in narrative pipeline tests).

@clawsweeper clawsweeper Bot added rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 🚀 automerge armed This PR is in ClawSweeper's automerge lane. P2 Normal backlog priority with limited blast radius. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. labels May 19, 2026
@p0pfan

p0pfan commented May 20, 2026

Copy link
Copy Markdown

Hi @MoerAI , I just find that this bug fix still hasn't been merged.

MoerAI added 4 commits May 20, 2026 11:36
…aw#80613)

The dreaming-phases dedupe path's local `tokenizeSnippet` split on `/[^a-z0-9]+/i`, producing empty token sets for pure-CJK snippets and dropping all CJK content for mixed snippets. That had two failure modes:

1. Two close paraphrases of the same Chinese fact tokenized to empty sets, fell back to exact-string match, returned similarity 0, and ended up as duplicate candidates in MEMORY.md.

2. Two semantically distinct CJK snippets that happened to share ASCII tokens (e.g. `Plan` + `exRule`) returned similarity 1.0, so the dedupe path silently dropped one of the two distinct memories.

The memory MMR layer already has a CJK-aware tokenizer (`extensions/memory-core/src/memory/mmr.ts`: unigrams + adjacent bigrams + ASCII alphanumerics). This change extracts it into `extensions/memory-core/src/memory/tokenize.ts` and routes the dreaming dedupe path through the same helper via `textSimilarity`. `mmr.ts` re-exports `tokenize` / `jaccardSimilarity` / `textSimilarity` so existing imports (including `mmr.test.ts`) continue to work without churn.

Verification with the patched module against the reporter's CJK scenarios:
- Pure-CJK paraphrase pair textSimilarity: 0 -> 0.622 (dedup threshold 0.5 now succeeds).
- Mixed-CJK distinct pair textSimilarity: 1.000 -> 0.056 (two distinct facts now kept).
- English paraphrase: 0.600 (Latin behavior unchanged).
- Unrelated short snippets: 0.000 (no over-collapse).

Scope: Bug 2 from issue openclaw#80613 only. The Bug 1 (promotion rehydration leaks managed dreaming block lines into MEMORY.md) is a separate end-to-end fixture problem that clawsweeper flagged as not high-confidence-reproducible from source alone; it should be addressed in a separate PR with a targeted promotion-path reproduction. This PR is the narrow CJK dedupe repair that clawsweeper directly endorsed.
oxlint flagged Array#sort() in the new regression test; use Array#toSorted() instead. Non-functional change — test logic and output are unchanged.
…e to empty (openclaw#80613)

Addresses chatgpt-codex-connector P1 review on openclaw#80620.

textSimilarity is used by dreaming dedupeEntries to merge near-duplicate
recall entries. The shared tokenize() only emits ASCII word-tokens and
CJK uni-/bigrams, so inputs in other scripts (Cyrillic, Arabic,
emoji-only, punctuation-only) tokenize to the empty set. Raw Jaccard
returns 1 for two empty sets — that is the correct, intentional
semantics for MMR re-ranking and is asserted by mmr.test.ts — but for
the dedupe path it would collapse distinct non-tokenized snippets into
one and drop data.

Add a literal normalized-string equality fallback inside textSimilarity
for the both-empty case only. Non-empty cases (the existing MMR path)
keep Jaccard semantics unchanged. Add a regression test in mmr.test.ts
covering Cyrillic, Arabic, emoji-only, and punctuation-only snippets:
distinct stays 0, identical stays 1.
…openclaw#80613)

Upstream lint sweep openclaw#83542 (chore(lint): remove underscore-dangle allow list) removed the `__testing` alias from the lint allow list, exposing that the 4 new CJK regression tests added in c497966 referenced `__testing.dedupeEntries` while the import statement only brought in `testing`. After upstream's rebase merge into this branch, tsgo reported TS2552 on dreaming-phases.test.ts:3028,3042,3054,3063 and a TS7006 implicit any on the inferred entry param (cascade from the missing identifier).

Fix: use the imported `testing.dedupeEntries` directly. The `testing as __testing` alias still exists in dreaming-phases.ts for any other consumers; this only adjusts the local test references.

Verification: pnpm tsgo:extensions:test reports 0 errors in dreaming-phases.test.ts (the 6 remaining errors are pre-existing infra issues unrelated to this branch: @openclaw/proxyline resolution, src/plugin-sdk/file-lock.ts type narrowing).
@MoerAI MoerAI force-pushed the fix/dreaming-cjk-tokenizer branch from 0d30320 to 84bcee9 Compare May 20, 2026 02:42
@MoerAI

MoerAI commented May 20, 2026

Copy link
Copy Markdown
Contributor Author

Hi @p0pfan — thanks for the nudge, you're right that this should have landed already.

What happened: after the chatgpt-codex-connector P1 fix in 0d30320e (2026-05-19), upstream landed the lint sweep #83542 (chore(lint): remove underscore-dangle allow list) onto main. That sweep removed the __testing identifier from the allow list, and a tsgo rebase against current main surfaced that the 4 new CJK regression tests in dreaming-phases.test.ts referenced __testing.dedupeEntries while the import only brought in testing — so check-test-types started failing on the rebased head.

Just pushed 84bcee94d8:

  • Rebased on latest upstream/main (clean — 3/3 commits applied).
  • Renamed the 4 __testing.dedupeEntries references to testing.dedupeEntries (the existing imported binding; the testing as __testing alias in dreaming-phases.ts stays unchanged for any other consumers).
  • Local pnpm tsgo:extensions:test reports 0 errors in dreaming-phases.test.ts (the 6 unrelated infra errors on @openclaw/proxyline and src/plugin-sdk/file-lock.ts are pre-existing on main and not caused by this branch).

Should clear check-test-types on the next CI run. Will keep an eye on the rerun and post the status once it's green so @Takhoffman can re-arm automerge.

@clawsweeper clawsweeper Bot added the merge-risk: 🚨 automation 🚨 May affect CI, automerge, proof capture, label sync, or maintainer automation. label May 20, 2026
@clawsweeper

clawsweeper Bot commented May 20, 2026

Copy link
Copy Markdown
Contributor

ClawSweeper PR egg

✨ Hatched: 🥚 common Pearl Shellbean

Hatch command

Comment @clawsweeper hatch when this PR is hatchable.

Hatchability rules:

  • Merged PRs are hatchable.
  • Open PRs are hatchable when they are status: 👀 ready for maintainer look, status: 🚀 automerge armed, or labeled clawsweeper:automerge.
  • Closed unmerged PRs are hatchable only when one of those hatchable labels is still present in the durable record.

Rarity: 🥚 common.
Trait: finds missing screenshots.
Image traits: location workflow harbor; accessory release bell; palette pearl, teal, and neon green; mood focused; pose curling around a status light; shell smooth pearl shell; lighting cool dashboard glow; background tiny artifact crates.
Share on X: post this hatch
Copy: My PR egg hatched a 🥚 common Pearl Shellbean in ClawSweeper.

What is this egg doing here?
  • Eggs appear after the PR passes real-behavior proof. It is here for vibes, not verdicts: it does not change labels, ratings, merge decisions, or automation.
  • The shell reacts to review momentum: open follow-up work warms it up, re-review makes it wobble, and a clean final review lets it hatch.
  • Hatchability usually comes from sufficient real-behavior proof, no blocking P0/P1/P2 findings, no security attention needed, and clean correctness. A merged PR is already final, so merge makes the egg hatchable independently.
  • The hatch is seeded from this repository and PR number, so the same PR keeps the same creature; the reviewed head SHA can only change safe visual details.
  • Rarity is just collectible sparkle: 🥚 common, 🌱 uncommon, 💎 rare, ✨ glimmer, and 🌈 legendary.

@Takhoffman

Copy link
Copy Markdown
Contributor

@clawsweeper automerge

1 similar comment
@Takhoffman

Copy link
Copy Markdown
Contributor

@clawsweeper automerge

@p0pfan

p0pfan commented May 22, 2026

Copy link
Copy Markdown

Why this pr is still waiting

@MoerAI

MoerAI commented May 22, 2026

Copy link
Copy Markdown
Contributor Author

Status update: rebase head 84bcee94d8 has fully cleared CI.

  • check-test-types: SUCCESS at 2026-05-20 02:48 UTC (job 76877274293).
  • Full check rollup on 84bcee94d8: 69 SUCCESS, 30 SKIPPED, 1 NEUTRAL (CodeQL), 0 FAILURE.
  • Merge state: CLEAN against current upstream/main.
  • Labels still set: status: 🚀 automerge armed, proof: supplied, proof: sufficient.

The earlier clawsweeper:human-review was added on 2026-05-11 because the Codex review step itself errored (exit 1, unrelated to the PR contents) — not because of a finding on the code. Since then the chatgpt-codex-connector P1 was addressed in 0d30320e and the __testingtesting rebase fix landed in 84bcee94d8. There's no outstanding finding on this PR.

@Takhoffman — happy for you to re-issue @clawsweeper automerge against the new head when convenient.
@p0pfan — thanks for the patience; the PR is sitting at the human-review gate, not a code-quality gate.

@Takhoffman

Copy link
Copy Markdown
Contributor

@clawsweeper approve

@clawsweeper

clawsweeper Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor

🦞👀
ClawSweeper picked this up.

Command router queued. I will update this comment with the next step.

@clawsweeper clawsweeper Bot removed the clawsweeper:human-review Needs maintainer review before ClawSweeper can continue label May 25, 2026
@clawsweeper

clawsweeper Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor

ClawSweeper 🐠 reef update

Thanks for the work on this. ClawSweeper could not push to this branch with the permissions available, so it opened a narrow replacement PR to keep the fix swimming forward without losing the contributor trail. not your fault, just GitHub branch-permission tides.

Why replacement: ClawSweeper could not update the source PR branch directly; GitHub did not grant sufficient push rights to the bot for that branch.
Replacement PR: #86645
Why close: this run explicitly closes the superseded source PR after the credited replacement PR is open, so review continues in one place.
This closeout is intentional for this run: the replacement PR is now the active review lane.
Contributor credit is carried into the replacement PR body and release-note context.
Co-author credit kept:

fish notes: model gpt-5.5, reasoning high; reviewed against ca9c027.

@clawsweeper clawsweeper Bot closed this May 25, 2026
@MoerAI MoerAI deleted the fix/dreaming-cjk-tokenizer branch May 26, 2026 02:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

clawsweeper:automerge Maintainer opted this PR into bounded ClawSweeper-reviewed automerge extensions: memory-core Extension: memory-core merge-risk: 🚨 automation 🚨 May affect CI, automerge, proof capture, label sync, or maintainer automation. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. P2 Normal backlog priority with limited blast radius. proof: sufficient ClawSweeper judged the real behavior proof convincing. proof: supplied External PR includes structured after-fix real behavior proof. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. size: M status: 🚀 automerge armed This PR is in ClawSweeper's automerge lane.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: dreaming pipeline leaks raw candidate content into MEMORY.md and CJK dedup is ineffective in tokenizeSnippet

3 participants