fix(memory-core): use CJK-aware tokenizer for dreaming dedupe (#80613) by MoerAI · Pull Request #80620 · openclaw/openclaw

MoerAI · 2026-05-11T10:05:09Z

Summary

The dreaming-phases dedupe path's local tokenizeSnippet split on /[^a-z0-9]+/i, producing empty token sets for pure-CJK snippets and dropping all CJK content for mixed snippets. That had two failure modes on current main:

Two close paraphrases of the same Chinese fact tokenized to empty sets, fell back to exact-string match, returned similarity 0, and ended up as duplicate candidates in MEMORY.md.
Two semantically distinct CJK snippets that happened to share ASCII tokens (e.g. Plan + exRule) returned similarity 1.0, so the dedupe path silently dropped one of the two distinct memories.

The memory MMR layer at extensions/memory-core/src/memory/mmr.ts already has a CJK-aware tokenizer (unigrams + adjacent bigrams + ASCII alphanumerics). This PR extracts it into extensions/memory-core/src/memory/tokenize.ts and routes the dreaming dedupe path through the same helper via textSimilarity. mmr.ts re-exports tokenize / jaccardSimilarity / textSimilarity so existing imports (including mmr.test.ts) continue to work without churn.
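To make the shape of the shared helper concrete, here is a minimal sketch of a tokenizer of the kind described above (ASCII word tokens, CJK unigrams, and bigrams of directly adjacent CJK characters). It is not the code from tokenize.ts: the character ranges in CJK_CHAR are an assumption, and the function bodies are reconstructed from the prose in this thread.

```typescript
// Hypothetical character class covering Kana, Han, and Hangul syllables (BMP only).
// The real CJK_RE in mmr.ts / tokenize.ts may cover different ranges.
const CJK_CHAR = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]/;

function tokenize(text: string): Set<string> {
  const lower = text.toLowerCase();
  const tokens = new Set<string>();

  // ASCII word tokens.
  (lower.match(/[a-z0-9_]+/g) ?? []).forEach((word) => tokens.add(word));

  // CJK unigrams, plus one bigram for each pair of directly adjacent CJK characters.
  // Punctuation or ASCII between two CJK characters breaks the run, so no bigram spans it.
  let previous: string | null = null;
  for (const ch of lower) {
    if (CJK_CHAR.test(ch)) {
      tokens.add(ch);
      if (previous !== null) tokens.add(previous + ch);
      previous = ch;
    } else {
      previous = null;
    }
  }
  return tokens;
}

function jaccardSimilarity(a: Set<string>, b: Set<string>): number {
  // Two empty sets count as identical here.
  if (a.size === 0 && b.size === 0) return 1;
  let shared = 0;
  a.forEach((token) => {
    if (b.has(token)) shared += 1;
  });
  return shared / (a.size + b.size - shared);
}

function textSimilarity(left: string, right: string): number {
  return jaccardSimilarity(tokenize(left), tokenize(right));
}
```

Under these assumptions the sketch yields 30 tokens for each of the two pure-CJK snippets quoted in the proof section and a similarity of about 0.622 between them, which agrees with the terminal output in this PR.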

Root Cause

extensions/memory-core/src/dreaming-phases.ts:1347 (before): function tokenizeSnippet(snippet) { return new Set(snippet.toLowerCase().split(/[^a-z0-9]+/i).map(t => t.trim()).filter(Boolean)); }. CJK characters fall outside [a-z0-9], so they are treated as delimiters and dropped.
extensions/memory-core/src/dreaming-phases.ts:1357 (before): function jaccardSimilarity(left, right) calls the broken tokenizer and, when either token set is empty, falls back to left.trim().toLowerCase() === right.trim().toLowerCase() ? 1 : 0. As a result, close-but-not-identical pure-CJK pairs return 0 (missed dedup), and mixed pairs whose ASCII tokens are identical return 1 (spurious dedup that drops distinct content).
Execution path: dedupeEntries({ entries, threshold }) (dreaming-phases.ts:1373) → jaccardSimilarity(candidate.snippet, entry.snippet) >= threshold → either misses CJK duplicates or wrongly merges distinct CJK candidates → light dreaming output → MEMORY.md accumulates duplicate or loses unique entries.
clawsweeper's review on [Bug]: dreaming pipeline leaks raw candidate content into MEMORY.md and CJK dedup is ineffective in tokenizeSnippet #80613 directly endorses this fix shape: "Keep the issue open and fix the memory-core pipeline with shared CJK-aware tokenization plus a promotion sanitizer that never appends managed dreaming block text to MEMORY.md." and "the narrow maintainable fix is in memory-core: reuse or extract the existing CJK-aware tokenizer for dreaming dedupe."
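The two failure modes in the root cause above can be reproduced with a short sketch. tokenizeSnippet is copied from the "before" snippet quoted above; the body of the similarity function is only reconstructed from the description (its original source is not quoted in this thread), and the name legacyJaccardSimilarity is hypothetical.

```typescript
// Copied from the "before" snippet quoted in the root cause.
function tokenizeSnippet(snippet: string): Set<string> {
  return new Set(
    snippet
      .toLowerCase()
      .split(/[^a-z0-9]+/i)
      .map((t) => t.trim())
      .filter(Boolean),
  );
}

// Reconstruction of the pre-fix similarity, based on the description only.
function legacyJaccardSimilarity(left: string, right: string): number {
  const leftTokens = tokenizeSnippet(left);
  const rightTokens = tokenizeSnippet(right);
  // Exact-match fallback when either side has no tokens.
  if (leftTokens.size === 0 || rightTokens.size === 0) {
    return left.trim().toLowerCase() === right.trim().toLowerCase() ? 1 : 0;
  }
  let shared = 0;
  leftTokens.forEach((token) => {
    if (rightTokens.has(token)) shared += 1;
  });
  return shared / (leftTokens.size + rightTokens.size - shared);
}

// Failure mode 1: pure-CJK paraphrases tokenize to empty sets and fail the exact match.
console.log(
  legacyJaccardSimilarity("教训：配置中实验开关字段是叫做规则", "教训：配置里实验开关的字段叫做规则"),
); // 0

// Failure mode 2: only "plan" and "exrule" survive tokenization, so the sets are identical.
console.log(
  legacyJaccardSimilarity("Plan 实验开关字段叫做 exRule", "Plan 整个产品体系彻底重构 exRule"),
); // 1
```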

Changes

extensions/memory-core/src/memory/tokenize.ts (NEW): extract the CJK-aware tokenize, jaccardSimilarity, and textSimilarity helpers from mmr.ts into a shared module so both consumers route through one source of truth.
extensions/memory-core/src/memory/mmr.ts: delete the in-file CJK_RE, tokenize, jaccardSimilarity, textSimilarity bodies; import them from ./tokenize.js and re-export verbatim so existing imports (mmr.test.ts etc.) keep working.
extensions/memory-core/src/dreaming-phases.ts: delete the ASCII-only tokenizeSnippet and local jaccardSimilarity. Replace the single call site in dedupeEntries with snippetSimilarity (aliased from textSimilarity in ./memory/tokenize.js). Expose dedupeEntries via __testing for the regression test.
extensions/memory-core/src/dreaming-phases.test.ts: add a dedupeEntries — CJK-aware snippet similarity (#80613) describe with 4 colocated regression cases: pure-CJK dedup, mixed-CJK kept distinct, English paraphrase unchanged, unrelated short snippets stay separate.
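The single call-site swap in dedupeEntries can be pictured with the sketch below. It is an illustration only: the Entry shape is a reduced, hypothetical stand-in for ShortTermRecallEntry, the same-path condition is inferred from the review discussion rather than from the source, and the similarity function is injected as a parameter (the real code calls snippetSimilarity directly).

```typescript
// Hypothetical, reduced entry shape; the real ShortTermRecallEntry has more fields.
type Entry = { path: string; snippet: string };
type Similarity = (left: string, right: string) => number;

function dedupeEntriesSketch(params: {
  entries: Entry[];
  threshold: number;
  similarity: Similarity;
}): Entry[] {
  const kept: Entry[] = [];
  for (const candidate of params.entries) {
    // A candidate is dropped when an already-kept entry from the same path
    // scores at or above the threshold (assumed comparison scope).
    const isDuplicate = kept.some(
      (entry) =>
        entry.path === candidate.path &&
        params.similarity(candidate.snippet, entry.snippet) >= params.threshold,
    );
    if (!isDuplicate) kept.push(candidate);
  }
  return kept;
}

// Trivial stand-in similarity for the demo; the PR uses the shared textSimilarity here.
const exactMatch: Similarity = (left, right) => (left === right ? 1 : 0);

const demo = dedupeEntriesSketch({
  entries: [
    { path: "memory/a.md", snippet: "deploy: blocked" },
    { path: "memory/a.md", snippet: "deploy: blocked" },
    { path: "memory/a.md", snippet: "weather: sunny" },
  ],
  threshold: 0.5,
  similarity: exactMatch,
});
console.log(demo.length); // 2
```

Because the similarity function is the only thing that changes, every dedupe decision follows directly from the tokenizer's arithmetic, which is why the regression tests target dedupeEntries itself.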

Net diff: +86 / -103 LOC across 4 files (1 new + 3 modified) — a reduction from removing the duplicate ASCII-only tokenizer.

Real behavior proof

Behavior or issue addressed: Dreaming dedupe on CJK content is broken on current main. (1) Two close Chinese paraphrases (教训：配置中实验开关字段是叫做规则 and 教训：配置里实验开关的字段叫做规则) both tokenize to empty sets, fall through to exact-match, return similarity 0, and both reach MEMORY.md instead of one merging into the other. (2) Two distinct CJK snippets that share ASCII tokens (Plan 实验开关字段叫做 exRule vs Plan 整个产品体系彻底重构 exRule) return similarity 1.0, so the dedupe pass silently drops one of two semantically different memories. Issue: [Bug]: dreaming pipeline leaks raw candidate content into MEMORY.md and CJK dedup is ineffective in tokenizeSnippet #80613
Real environment tested: Local OpenClaw checkout at ../openclaw-80613 on Windows 11 + Node 22.14. The before-fix smoke script reads dreaming-phases.ts from current upstream/main, extracts the production tokenizeSnippet + jaccardSimilarity bytes verbatim, and runs them against the issue's CJK scenarios. The after-fix smoke script reads the patched extensions/memory-core/src/memory/tokenize.ts from this PR head, prints its SHA-256, and runs the shared tokenize / textSimilarity against the same scenarios. Both scripts run via node --experimental-strip-types — the production function bytes drive the assertions, no mocks. PR head: c497966ee6d12659dc750a1819969d38817f53d1.
Exact steps or command run after this patch: git checkout fix/dreaming-cjk-tokenizer && node --experimental-strip-types ./smoke-verify-80613.mts, where smoke-verify-80613.mts reads the patched tokenize.ts from extensions/memory-core/src/memory/tokenize.ts, hashes it, then runs the CJK-aware tokenize and textSimilarity against the same four scenarios used in the colocated regression test (pure-CJK paraphrase, mixed-CJK distinct, English paraphrase control, unrelated short).
Evidence after fix: Terminal output captured locally on PR head c497966ee6 (Windows 11 + Node 22.14, the patched extensions/memory-core/src/memory/tokenize.ts exercised via node --experimental-strip-types against verbatim source bytes; tokenize.ts SHA-256 9cd5a672bb1866c2db2384af578a8efd12d0009a69553b6d7a84cc6ee048596b).

$ node --experimental-strip-types ./smoke-verify-80613.mts
=== Patched tokenize.ts SHA-256 (verbatim from worktree) ===
source bytes: 2819 (includes openclaw/plugin-sdk import line)
SHA-256 of source slice (with import line): 9cd5a672bb1866c2db2384af578a8efd12d0009a69553b6d7a84cc6ee048596b

=== Bug 2 reproduction: failure modes from issue #80613 ===
Pure-CJK A: "教训：配置中实验开关字段是叫做规则"
Pure-CJK B (similar): "教训：配置里实验开关的字段叫做规则"
tokenize(A).size = 30 first 8 tokens: [
  '教训', '配置',
  '置中', '中实',
  '实验', '验开',
  '开关', '关字'
]
tokenize(B).size = 30 first 8 tokens: [
  '教训', '配置',
  '置里', '里实',
  '实验', '验开',
  '开关', '关的'
]
textSimilarity(A, B) = 0.622 (was 0 with ASCII-only tokenizer; threshold 0.5 dedup now succeeds)

Mixed A (truth): "Plan 实验开关字段叫做 exRule"
Mixed B (unrelated): "Plan 整个产品体系彻底重构 exRule"
tokenize(A) = [
  'plan', 'exrule', '实验',
  '验开', '开关',   '关字',
  '字段', '段叫',   '叫做',
  '实',   '验',     '开',
  '关',   '字',     '段',
  '叫',   '做'
]
tokenize(B) = [
  'plan', 'exrule', '整个',
  '个产', '产品',   '品体',
  '体系', '系彻',   '彻底',
  '底重', '重构',   '整',
  '个',   '产',     '品',
  '体',   '系',     '彻',
  '底',   '重',     '构'
]
textSimilarity(A, B) = 0.056 (was 1.0 with ASCII-only tokenizer; threshold 0.7 dedup now correctly keeps both)

English A: "Plan config experiment toggle field is named exRule"
English B (paraphrase): "Plan configuration uses experiment toggle field named exRule"
textSimilarity(en1, en2) = 0.600 (must stay > 0.4 — Latin-script behavior unchanged)

textSimilarity('weather: sunny', 'deploy: blocked') = 0.000 (must stay < 0.3 — no over-collapse)

=== Pass/fail summary ===
[CJK paraphrase dedups]  textSimilarity = 0.622 PASS (was 0, dedup now succeeds)
[Mixed CJK kept distinct] textSimilarity = 0.056 PASS (was 1.0, two distinct facts now kept)
[English paraphrase OK]   textSimilarity = 0.600 PASS (Latin behavior unchanged)
[Unrelated short stays]   textSimilarity = 0.000 PASS (no over-collapse)

Observed result after fix: Pure-CJK paraphrase similarity went from 0.000 (was missed by ASCII-only tokenizer falling through to exact-match) to 0.622 — above the typical 0.5–0.6 dedupe threshold, so the second copy is now correctly merged. The mixed-CJK distinct pair went from 1.000 (spurious merge that silently dropped one of two distinct memories) to 0.056 — well below the 0.7 threshold, so both semantically different snippets are kept. The English paraphrase control stayed at 0.600 (Latin-script behavior unchanged). Unrelated short snippets stayed at 0.000 (no over-collapse). The patched tokenize.ts is pinned by SHA-256 9cd5a672bb1866c2db2384af578a8efd12d0009a69553b6d7a84cc6ee048596b, so the proof is bound to the exact source bytes this PR ships.
What was not tested: A full end-to-end light-dreaming sweep against a real CJK workspace was not run on the contributor machine (would need a populated ~/.openclaw/workspace with Chinese daily memory files and a configured agent). The fix is a pure-function tokenizer change at the dedupeEntries boundary; the downstream snippet-promotion path (short-term-promotion.ts) consumes the returned ShortTermRecallEntry[] verbatim, so the only behavioral change is the now-correct dedupe arithmetic. Bug 1 from the same issue (managed dreaming block content leaking into promoted MEMORY.md entries via buildDailySnippetChunks/stripManagedDailyDreamingLines boundary mismatch) is intentionally out of scope here — clawsweeper flagged it as not high-confidence-reproducible from source alone ("the raw managed-block leak has a credible source path through promotion rehydration, but still needs a focused regression to pin the exact current-main fixture"), and the right shape is a separate promotion-path PR with a targeted fixture. Maintainers may apply proof: override if a live CJK workspace recording is required.
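The similarity scores in the terminal output above can be cross-checked by hand. The token counts below are derived manually, assuming the tokenizer emits ASCII word tokens, CJK unigrams, and bigrams of directly adjacent CJK characters (so no bigram spans the full-width colon); they agree with the printed token lists and set sizes.

```latex
\[
J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}
\]
\[
\text{pure-CJK pair: } |A| = |B| = 16 + 14 = 30,\quad |A \cap B| = 14 + 9 = 23,\quad
J = \frac{23}{30 + 30 - 23} = \frac{23}{37} \approx 0.622
\]
\[
\text{mixed pair: } |A| = 17,\quad |B| = 21,\quad |A \cap B| = 2,\quad
J = \frac{2}{17 + 21 - 2} = \frac{2}{36} \approx 0.056
\]
\[
\text{English pair: } |A| = |B| = 8,\quad |A \cap B| = 6,\quad
J = \frac{6}{8 + 8 - 6} = \frac{6}{10} = 0.600
\]
```

In the pure-CJK pair, the two snippets share 14 of 16 unigrams and 9 of 14 bigrams; in the mixed pair, the only shared tokens are plan and exrule.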

Test

pnpm test extensions/memory-core/src/dreaming-phases.test.ts — new dedupeEntries — CJK-aware snippet similarity (#80613) describe with 4 cases (CJK paraphrase dedup / mixed-CJK kept distinct / English paraphrase control / unrelated short kept separate); all existing cases continue to pass.
pnpm test extensions/memory-core/src/memory/mmr.test.ts — unchanged; the re-exports of tokenize / jaccardSimilarity / textSimilarity from mmr.ts preserve the existing import surface so this test stays green.
pnpm tsgo against the touched files — clean (LSP diagnostics on tokenize.ts, mmr.ts, dreaming-phases.ts, dreaming-phases.test.ts show only pre-existing environment-only openclaw/plugin-sdk/* resolution warnings; no new TypeScript errors introduced).

Notes

Scope: Bug 2 from issue [Bug]: dreaming pipeline leaks raw candidate content into MEMORY.md and CJK dedup is ineffective in tokenizeSnippet #80613 only (the CJK dedupe path that clawsweeper marked as source-reproducible). Bug 1 (managed-block leak into MEMORY.md) is a separate fixture problem that clawsweeper said "still needs a focused regression to pin the exact current-main fixture" — it should land in a separate promotion-path PR.
Compatibility: mmr.ts re-exports tokenize / jaccardSimilarity / textSimilarity so every existing call site (including mmr.test.ts) keeps working without an import-path change. The CJK-aware tokenizer is byte-for-byte the same algorithm mmr.ts already shipped, so MMR behavior is unchanged.
This follows the conservative direction in clawsweeper's review on [Bug]: dreaming pipeline leaks raw candidate content into MEMORY.md and CJK dedup is ineffective in tokenizeSnippet #80613: "the narrow maintainable fix is in memory-core: reuse or extract the existing CJK-aware tokenizer for dreaming dedupe... rather than adding a new config or policy knob."

Closes #80613

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c497966ee6

chatgpt-codex-connector · 2026-05-11T10:07:01Z

+  if (setA.size === 0 && setB.size === 0) {
+    return 1;


Preserve empty-token fallback for non-CJK scripts

tokenize only emits ASCII/CJK tokens, so inputs in other scripts (for example Cyrillic, Arabic, emoji-only, or punctuation-only snippets) produce empty sets on both sides. With jaccardSimilarity now returning 1 when both sets are empty, dedupeEntries in dreaming-phases.ts will treat distinct snippets from the same path as duplicates and drop one whenever the threshold is <= 1. The previous dedupe logic only treated empty-token pairs as equal when the normalized full strings matched exactly, so this change introduces false merges and data loss for non-tokenized languages/content.

Applied in 0d30320 — added an exact-string-equality fallback inside textSimilarity for the both-empty case so distinct non-CJK/non-ASCII snippets no longer collapse. Jaccard semantics for non-empty inputs (and the MMR re-ranking suite) are unchanged. Regression added in mmr.test.ts covering Cyrillic, Arabic, emoji-only, and punctuation-only snippets. See full reply: #80620 (comment)

clawsweeper · 2026-05-11T10:08:31Z

Codex review: found issues before merge.

Latest ClawSweeper review: 2026-05-22 06:53 UTC / May 22, 2026, 2:53 AM ET.

Workflow note: Future ClawSweeper reviews update this same comment in place.

Summary
The PR extracts memory-core's existing CJK-aware MMR tokenizer into a shared helper, routes dreaming dedupe through it, preserves MMR reexports, and adds CJK/non-tokenized-script regression tests.

Reproducibility: yes. Current main's dreaming-phases.ts drops CJK-only snippets through ASCII-only tokenization, and the PR body supplies terminal proof for the affected before/after scenarios.

PR rating
Overall: 🐚 platinum hermit
Proof: 🦞 diamond lobster
Patch quality: 🐚 platinum hermit
Summary: The patch is a solid narrow bug fix with strong terminal proof, with only a stale code comment and issue-closing metadata to clean up before merge.

Rank-up moves:

Retarget the PR body so it does not auto-close the broader linked issue while the promotion-leak half remains open.
Refresh the stale dedupeEntries comment to match textSimilarity's current empty-token fallback.

What the crustacean ranks mean

🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

Real behavior proof
Sufficient (terminal): The PR body includes after-fix terminal output from a real Windows Node 22 checkout exercising the production tokenizer bytes against the CJK scenarios, and the proof check passed on the latest head.

Risk before merge

The PR body currently uses GitHub closing syntax for [Bug]: dreaming pipeline leaks raw candidate content into MEMORY.md and CJK dedup is ineffective in tokenizeSnippet #80613 even though this PR intentionally fixes only the CJK dedupe half; merging as-is could auto-close the remaining managed-block leak work.
This changes memory-core dreaming dedupe decisions, so tokenizer or threshold mistakes can affect which recall entries merge or survive in memory state.
The comment above dedupeEntries is stale after the empty-token fallback change and now contradicts textSimilarity's actual empty/empty contract.

Maintainer options:

Retarget the linked issue before merge (recommended)
Edit the PR body to reference the broader issue without closing it, or merge/retarget the remaining promotion-leak fix first so GitHub automation does not close unfinished work.
Refresh the dedupe comment
Update the comment above dedupeEntries so it matches textSimilarity's normalized-string fallback for two empty token sets.
Accept tracking risk explicitly
Maintainers can merge as-is only if they intentionally plan to reopen or separately canonicalize the remaining promotion-leak half after GitHub closes the linked issue.

Next step before merge
Human handling is needed to edit or coordinate the PR body's partial-fix closing reference before automerge; the remaining code issue is only a P3 comment cleanup.

Security
Cleared: The diff is limited to memory-core TypeScript source and tests, with no dependency, workflow, secret, package, or code-execution surface changes.

Review findings

[P3] Align the empty-token dedupe comment — extensions/memory-core/src/dreaming-phases.ts:1399-1401

Review details

Best possible solution:

Land the shared-tokenizer fix after retargeting the linked issue reference and refreshing the stale comment, while keeping the promotion-leak work tracked separately.

Do we have a high-confidence way to reproduce the issue?

Yes. Current main's dreaming-phases.ts drops CJK-only snippets through ASCII-only tokenization, and the PR body supplies terminal proof for the affected before/after scenarios.

Is this the best way to solve the issue?

Yes for the code path: extracting the existing CJK-aware MMR tokenizer is the narrow maintainable fix. The merge path should still retarget the broader issue-closing metadata before landing.

Label justifications:

P2: This is a normal-priority memory-core bug fix with real CJK user impact but limited surface area.
merge-risk: 🚨 automation: The PR body's closing reference can trigger GitHub issue closure for a broader issue with remaining open work.
merge-risk: 🚨 session-state: The patch changes memory dreaming dedupe arithmetic, which can change what recall entries are merged or retained.
rating: 🐚 platinum hermit: Current PR rating is 🐚 platinum hermit because proof is 🦞 diamond lobster, patch quality is 🐚 platinum hermit, and the patch is a solid narrow bug fix with strong terminal proof, with only a stale code comment and issue-closing metadata to clean up before merge.
status: 🚀 automerge armed: This PR is in ClawSweeper's automerge lane. Sufficient (terminal): The PR body includes after-fix terminal output from a real Windows Node 22 checkout exercising the production tokenizer bytes against the CJK scenarios, and the proof check passed on the latest head.
proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes after-fix terminal output from a real Windows Node 22 checkout exercising the production tokenizer bytes against the CJK scenarios, and the proof check passed on the latest head.

Full review comments:

[P3] Align the empty-token dedupe comment — extensions/memory-core/src/dreaming-phases.ts:1399-1401
textSimilarity now falls back to normalized-string equality when both snippets tokenize to empty sets, but this comment says the helper returns 1 for any two empty inputs and removed the exact-match fallback. Please update it so future changes do not copy the wrong dedupe contract.
Confidence: 0.91

Overall correctness: patch is correct
Overall confidence: 0.86

What I checked:

Current main still has ASCII-only dreaming dedupe: Current main tokenizes dreaming snippets with split(/[^a-z0-9]+/i) and falls back to exact string equality when either token set is empty, which is the source-reproducible CJK failure this PR targets. (extensions/memory-core/src/dreaming-phases.ts:1395, 170f72d5a161)
Existing CJK-aware implementation is already in memory MMR: Current main's MMR tokenizer already extracts ASCII tokens plus CJK/Kana/Hangul unigrams and adjacent bigrams, making extraction into a shared helper a low-drift fix shape. (extensions/memory-core/src/memory/mmr.ts:37, 170f72d5a161)
PR routes dedupe through shared text similarity: At PR head, dedupeEntries calls snippetSimilarity, which is imported from the new shared tokenizer helper instead of using the local ASCII-only tokenizer. (extensions/memory-core/src/dreaming-phases.ts:1396, 84bcee94d833)
Empty-token regression is handled in the shared helper: At PR head, textSimilarity falls back to normalized-string equality only when both token sets are empty, preventing false merges for Cyrillic, Arabic, emoji-only, and punctuation-only snippets while leaving non-empty Jaccard behavior intact. (extensions/memory-core/src/memory/tokenize.ts:91, 84bcee94d833)
Focused regression coverage was added: The PR adds dreaming dedupe tests for pure CJK paraphrases, mixed CJK snippets sharing ASCII tokens, English paraphrases, unrelated short snippets, and MMR textSimilarity tests for non-CJK/non-ASCII empty-token inputs. (extensions/memory-core/src/dreaming-phases.test.ts:2997, 84bcee94d833)
PR still closes a broader open issue: Live PR metadata reports a closing reference to [Bug]: dreaming pipeline leaks raw candidate content into MEMORY.md and CJK dedup is ineffective in tokenizeSnippet #80613, while that issue remains open and fix(memory-core): treat dreaming fence marker lines as inside-fence in promotion guard (#80613) #83718 is an open separate PR for the promotion-fence leak half. (84bcee94d833)

Likely related people:

buyitsydney: Introduced the existing CJK/Kana/Hangul MMR tokenizer that this PR extracts and reuses. (role: adjacent tokenizer contributor; confidence: high; commits: 4b69c6d3f169; files: extensions/memory-core/src/memory/mmr.ts, extensions/memory-core/src/memory/mmr.test.ts)
obviyus: Authored recent daily memory re-ingestion work in dreaming-phases.ts and committed the CJK tokenizer change to main history. (role: recent dreaming area contributor; confidence: high; commits: 8faf91a2a8c9, 4b69c6d3f169; files: extensions/memory-core/src/dreaming-phases.ts, extensions/memory-core/src/memory/mmr.ts)
steipete: Recent memory-core changes touched the same dreaming-phase and promotion surfaces, including canonical daily-note handling and the lint sweep that affected this PR's test aliasing. (role: recent area contributor; confidence: medium; commits: e575325af6b4, 4f4d10863916; files: extensions/memory-core/src/dreaming-phases.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against 170f72d5a161.

Takhoffman · 2026-05-11T15:05:28Z

@clawsweeper automerge

clawsweeper · 2026-05-11T15:07:11Z

🦞🔧
ClawSweeper automerge is enabled.

Head: 84bcee94d833
Label: clawsweeper:automerge
Action: repair worker queued. Run: https://github.com/openclaw/clawsweeper/actions/runs/26420138520
Flow: review this head, repair/rebase only if needed, then re-review the exact repaired head before merge.

Draft PRs stay fix-only until GitHub marks them ready for review. Pause with /clawsweeper stop.

Automerge progress:

2026-05-11 15:07:09 UTC review queued 36b7d9f8da90 (queued)

2026-05-25 21:17:44 UTC repair queued 84bcee94d833 (autonomous) Run: https://github.com/openclaw/clawsweeper/actions/runs/26420138520

2026-05-25 21:19:16 UTC repair started (running) in 1s Run: https://github.com/openclaw/clawsweeper/actions/runs/26420138520 automerge-openclaw-openclaw-80620

2026-05-25 21:19:33 UTC validation plan (passed) in 18s Run: https://github.com/openclaw/clawsweeper/actions/runs/26420138520 pnpm check:changed; pnpm lint; pnpm check:test-types

2026-05-25 21:19:46 UTC Codex write preflight (passed) in 31s Run: https://github.com/openclaw/clawsweeper/actions/runs/26420138520 danger-full-access

2026-05-25 21:25:45 UTC Codex edit 1 3bd784c44ac1 (complete) in 6m 30s Run: https://github.com/openclaw/clawsweeper/actions/runs/26420138520 exit 0

2026-05-25 21:19:34 UTC review queued 84bcee94d833 (queued)

2026-05-25 21:35:19 UTC validation and review 1 ca9c02734c53 (base moved) in 16m 4s Run: https://github.com/openclaw/clawsweeper/actions/runs/26420138520 rebased

2026-05-25 21:36:01 UTC repair finished ca9c02734c53 (opened) in 16m 45s Run: https://github.com/openclaw/clawsweeper/actions/runs/26420138520 open_fix_pr

clawsweeper · 2026-05-11T15:13:55Z

🦞✅
ClawSweeper is pausing this repair loop for human review.

Source: clawsweeper[bot]
Reason: Review did not complete, so no work-lane recommendation was made. (sha=36b7d9f8da9007ecfc78d7659a601b34e2154b04)

I added clawsweeper:human-review and left the final call with a maintainer.

…ydration (openclaw#80613)

Daily memory notes can interleave human content with managed `` and `` blocks. The chunk builder strips those regions before snippet capture, but `rehydratePromotionCandidate` re-reads the raw source file and feeds it to `relocateCandidateRange`, whose fuzzy window search will happily latch onto a window that straddles the human bullet and the adjacent dreaming bullets. That leaks `- Candidate: …` / `confidence: …` / `status: staged` lines into `MEMORY.md`.

Add `redactManagedDreamingLines` and call it on the source split before relocation, mirroring the chunk-side `stripManagedDailyDreamingLines` heading-walk so the `## Light Sleep` / `## REM Sleep` heading is also zeroed when it sits directly above the start marker. Unterminated managed blocks are redacted through the end of file rather than left as a partial window.

Cover with a unit test of the helper (terminated, unterminated, multiple markers) and an integration test that writes a note with a `## Light Sleep` dreaming block and asserts the promoted `MEMORY.md` keeps the human bullet and contains no `Candidate:` / `confidence:` / `status: staged` / `openclaw:dreaming:light` traces.

Refs openclaw#80620 (CJK dedupe): that PR fixes the second sub-bug from the issue; this one only addresses the promotion-leak half.

p0pfan · 2026-05-14T02:22:46Z

When can it be merged into the main branch? I think this bug still has a significant impact on CJK users.

MoerAI · 2026-05-14T09:35:28Z

Thanks for chiming in, @p0pfan — the impact on CJK users is exactly what motivated this fix.

Status as of today:

Head 36b7d9f8 is mergeable (mergeStateStatus=CLEAN, all required checks green: Real behavior proof, label, label-issues all SUCCESS).
proof: supplied label is set; PR body has a full ## Real behavior proof section with after-fix evidence captured against the production CJK tokenizer.
@Takhoffman already opted into clawsweeper:automerge on 2026-05-11 (comment), but the Codex review itself failed (exit 1, unrelated to the PR contents — Codex failure detail: Codex review failed for this PR with exit 1) and so ClawSweeper added clawsweeper:human-review and paused.

The PR is sitting at the maintainer-human-review gate, not a code-quality gate. Flagging for visibility — happy to address any concerns or rebase if needed. @steipete @Takhoffman

Takhoffman · 2026-05-17T18:43:57Z

@clawsweeper automerge

…e to empty (openclaw#80613)

Addresses chatgpt-codex-connector P1 review on openclaw#80620.

textSimilarity is used by dreaming dedupeEntries to merge near-duplicate recall entries. The shared tokenize() only emits ASCII word-tokens and CJK uni-/bigrams, so inputs in other scripts (Cyrillic, Arabic, emoji-only, punctuation-only) tokenize to the empty set. Raw Jaccard returns 1 for two empty sets (the correct, intentional semantics for MMR re-ranking, asserted by mmr.test.ts), but for the dedupe path it would collapse distinct non-tokenized snippets into one and drop data.

Add a literal normalized-string equality fallback inside textSimilarity for the both-empty case only. Non-empty cases (the existing MMR path) keep Jaccard semantics unchanged.

Add a regression test in mmr.test.ts covering Cyrillic, Arabic, emoji-only, and punctuation-only snippets: distinct stays 0, identical stays 1.

MoerAI · 2026-05-19T02:27:21Z

Applied in 0d30320e (chatgpt-codex-connector P1 addressed).

You're right — tokenize only emits ASCII [a-z0-9_]+ tokens plus CJK uni-/bigrams, so any input made entirely of other scripts (Cyrillic, Arabic, emoji-only, punctuation-only) collapses to the empty set. Raw jaccardSimilarity({}, {}) === 1 is the correct, intentional MMR semantics (asserted in mmr.test.ts:82), but for dedupeEntries it would merge distinct non-tokenized snippets under any threshold ≤ 1 and drop one of them.

Narrow fix at the textSimilarity boundary (the dedupe entry point):

extensions/memory-core/src/memory/tokenize.ts: when BOTH inputs tokenize to empty sets, fall back to literal normalized-string equality (1 if identical, 0 otherwise). Non-empty cases keep Jaccard unchanged, so MMR re-ranking is untouched.
extensions/memory-core/src/memory/mmr.test.ts: regression covering Cyrillic, Arabic, emoji-only, and punctuation-only snippets — distinct stays 0, identical stays 1.

Why not change jaccardSimilarity itself: the empty/empty → 1 semantics is asserted by the existing MMR suite and is the right answer for re-ranking (no candidates means no penalty). Only the dedupeEntries use-case wants identity for empty/empty, so the fix lives in textSimilarity which is what dedupeEntries calls (via snippetSimilarity).
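The split described above (Jaccard keeps empty/empty → 1, while the text-level helper falls back to string identity) could look roughly like the sketch below. It is not the code from tokenize.ts: the tokenizer is passed in as a parameter to keep the example self-contained, normalize is a hypothetical trim-and-lowercase normalization, and asciiTokenize is a stand-in whose only purpose is to make non-ASCII input tokenize to an empty set.

```typescript
type Tokenizer = (text: string) => Set<string>;

function jaccardSimilarity(a: Set<string>, b: Set<string>): number {
  // Set-level semantics stay as they are: two empty sets are identical.
  if (a.size === 0 && b.size === 0) return 1;
  let shared = 0;
  a.forEach((token) => {
    if (b.has(token)) shared += 1;
  });
  return shared / (a.size + b.size - shared);
}

// Hypothetical normalization; the real helper may normalize differently.
function normalize(text: string): string {
  return text.trim().toLowerCase();
}

function textSimilarityWithFallback(left: string, right: string, tokenize: Tokenizer): number {
  const leftTokens = tokenize(left);
  const rightTokens = tokenize(right);
  // Only when BOTH sides have no tokens do we compare the normalized strings.
  if (leftTokens.size === 0 && rightTokens.size === 0) {
    return normalize(left) === normalize(right) ? 1 : 0;
  }
  return jaccardSimilarity(leftTokens, rightTokens);
}

// Stand-in tokenizer: ASCII word tokens only, so Cyrillic or punctuation yields an empty set.
const asciiTokenize: Tokenizer = (text) =>
  new Set<string>(text.toLowerCase().match(/[a-z0-9_]+/g) ?? []);

console.log(textSimilarityWithFallback("привет", "пока", asciiTokenize)); // 0
console.log(textSimilarityWithFallback("привет", "привет", asciiTokenize)); // 1
```

Keeping the fallback at the text level means callers that work with token sets directly still see the unchanged Jaccard contract.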

Verification:

pnpm test extensions/memory-core/src/memory/mmr.test.ts — 26 passed (includes the new regression).
The 7 pre-existing failures in dreaming-phases.test.ts reproduce identically with the fix reverted (git stash) → they are unrelated to this change (mocked subagent timing in narrative pipeline tests).

p0pfan · 2026-05-20T02:19:34Z

Hi @MoerAI, I just noticed that this bug fix still hasn't been merged.

…aw#80613)

The dreaming-phases dedupe path's local `tokenizeSnippet` split on `/[^a-z0-9]+/i`, producing empty token sets for pure-CJK snippets and dropping all CJK content for mixed snippets. That had two failure modes:

1. Two close paraphrases of the same Chinese fact tokenized to empty sets, fell back to exact-string match, returned similarity 0, and ended up as duplicate candidates in MEMORY.md.
2. Two semantically distinct CJK snippets that happened to share ASCII tokens (e.g. `Plan` + `exRule`) returned similarity 1.0, so the dedupe path silently dropped one of the two distinct memories.

The memory MMR layer already has a CJK-aware tokenizer (`extensions/memory-core/src/memory/mmr.ts`: unigrams + adjacent bigrams + ASCII alphanumerics). This change extracts it into `extensions/memory-core/src/memory/tokenize.ts` and routes the dreaming dedupe path through the same helper via `textSimilarity`. `mmr.ts` re-exports `tokenize` / `jaccardSimilarity` / `textSimilarity` so existing imports (including `mmr.test.ts`) continue to work without churn.

Verification with the patched module against the reporter's CJK scenarios:

- Pure-CJK paraphrase pair textSimilarity: 0 -> 0.622 (dedup threshold 0.5 now succeeds).
- Mixed-CJK distinct pair textSimilarity: 1.000 -> 0.056 (two distinct facts now kept).
- English paraphrase: 0.600 (Latin behavior unchanged).
- Unrelated short snippets: 0.000 (no over-collapse).

Scope: Bug 2 from issue openclaw#80613 only. Bug 1 (promotion rehydration leaks managed dreaming block lines into MEMORY.md) is a separate end-to-end fixture problem that clawsweeper flagged as not high-confidence-reproducible from source alone; it should be addressed in a separate PR with a targeted promotion-path reproduction. This PR is the narrow CJK dedupe repair that clawsweeper directly endorsed.

oxlint flagged Array#sort() in the new regression test; use Array#toSorted() instead. Non-functional change — test logic and output are unchanged.

…e to empty (openclaw#80613)

Addresses chatgpt-codex-connector P1 review on openclaw#80620.

textSimilarity is used by dreaming dedupeEntries to merge near-duplicate recall entries. The shared tokenize() only emits ASCII word-tokens and CJK uni-/bigrams, so inputs in other scripts (Cyrillic, Arabic, emoji-only, punctuation-only) tokenize to the empty set. Raw Jaccard returns 1 for two empty sets (the correct, intentional semantics for MMR re-ranking, asserted by mmr.test.ts), but for the dedupe path it would collapse distinct non-tokenized snippets into one and drop data.

Add a literal normalized-string equality fallback inside textSimilarity for the both-empty case only. Non-empty cases (the existing MMR path) keep Jaccard semantics unchanged.

Add a regression test in mmr.test.ts covering Cyrillic, Arabic, emoji-only, and punctuation-only snippets: distinct stays 0, identical stays 1.
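The both-empty fallback can be sketched as follows. This is an illustration under stated assumptions, not the patched `textSimilarity`: `tokenizeAsciiOnly` is a hypothetical stand-in that only emits ASCII words (enough to make other scripts tokenize to the empty set), and `normalize` assumes "normalized" means trim plus lowercase, which the commit message does not spell out.

```typescript
// Hypothetical stand-in tokenizer: ASCII alphanumeric words only.
function tokenizeAsciiOnly(text: string): Set<string> {
  const words: string[] = text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  return new Set(words);
}

// Raw Jaccard keeps the MMR semantics: two empty sets are treated as identical.
function rawJaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1;
  let intersection = 0;
  a.forEach((t) => {
    if (b.has(t)) intersection++;
  });
  return intersection / (a.size + b.size - intersection);
}

// Assumption: normalization is trim + lowercase.
function normalize(text: string): string {
  return text.trim().toLowerCase();
}

// Sketch of the dedupe-safe wrapper: only the both-empty case is special-cased.
function textSimilaritySketch(left: string, right: string): number {
  const a = tokenizeAsciiOnly(left);
  const b = tokenizeAsciiOnly(right);
  if (a.size === 0 && b.size === 0) {
    // Neither side produced tokens: fall back to literal equality so that
    // distinct untokenizable snippets are not merged.
    return normalize(left) === normalize(right) ? 1 : 0;
  }
  return rawJaccard(a, b);
}

console.log(textSimilaritySketch("Привет мир", "Доброе утро")); // 0
console.log(textSimilaritySketch("Привет мир", " привет мир ")); // 1
console.log(textSimilaritySketch("!!!", "???")); // 0
console.log(rawJaccard(new Set<string>(), new Set<string>())); // 1
```

Keeping the special case in the wrapper rather than in the Jaccard function is what lets the MMR path keep its empty-set semantics while the dedupe path stops collapsing distinct snippets.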

…openclaw#80613)

Upstream lint sweep openclaw#83542 (chore(lint): remove underscore-dangle allow list) removed the `__testing` alias from the lint allow list, exposing that the 4 new CJK regression tests added in c497966 referenced `__testing.dedupeEntries` while the import statement only brought in `testing`. After upstream's rebase merge into this branch, tsgo reported TS2552 on dreaming-phases.test.ts:3028,3042,3054,3063 and a TS7006 implicit any on the inferred entry param (cascade from the missing identifier).

Fix: use the imported `testing.dedupeEntries` directly. The `testing as __testing` alias still exists in dreaming-phases.ts for any other consumers; this only adjusts the local test references.

Verification: pnpm tsgo:extensions:test reports 0 errors in dreaming-phases.test.ts (the 6 remaining errors are pre-existing infra issues unrelated to this branch: @openclaw/proxyline resolution, src/plugin-sdk/file-lock.ts type narrowing).

MoerAI · 2026-05-20T02:43:04Z

Hi @p0pfan — thanks for the nudge, you're right that this should have landed already.

What happened: after the chatgpt-codex-connector P1 fix in 0d30320e (2026-05-19), upstream landed the lint sweep #83542 (chore(lint): remove underscore-dangle allow list) onto main. That sweep removed the __testing identifier from the allow list, and a tsgo rebase against current main surfaced that the 4 new CJK regression tests in dreaming-phases.test.ts referenced __testing.dedupeEntries while the import only brought in testing — so check-test-types started failing on the rebased head.

Just pushed 84bcee94d8:

Rebased on latest upstream/main (clean — 3/3 commits applied).
Renamed the 4 __testing.dedupeEntries references to testing.dedupeEntries (the existing imported binding; the testing as __testing alias in dreaming-phases.ts stays unchanged for any other consumers).
Local pnpm tsgo:extensions:test reports 0 errors in dreaming-phases.test.ts (the 6 unrelated infra errors on @openclaw/proxyline and src/plugin-sdk/file-lock.ts are pre-existing on main and not caused by this branch).

Should clear check-test-types on the next CI run. Will keep an eye on the rerun and post the status once it's green so @Takhoffman can re-arm automerge.

clawsweeper · 2026-05-20T02:49:43Z

ClawSweeper PR egg

✨ Hatched: 🥚 common Pearl Shellbean

Hatch command

Comment @clawsweeper hatch when this PR is hatchable.

Hatchability rules:

Merged PRs are hatchable.
Open PRs are hatchable when they are status: 👀 ready for maintainer look, status: 🚀 automerge armed, or labeled clawsweeper:automerge.
Closed unmerged PRs are hatchable only when one of those hatchable labels is still present in the durable record.

Rarity: 🥚 common.
Trait: finds missing screenshots.
Image traits: location workflow harbor; accessory release bell; palette pearl, teal, and neon green; mood focused; pose curling around a status light; shell smooth pearl shell; lighting cool dashboard glow; background tiny artifact crates.

What is this egg doing here?

Eggs appear after the PR passes real-behavior proof. It is here for vibes, not verdicts: it does not change labels, ratings, merge decisions, or automation.
The shell reacts to review momentum: open follow-up work warms it up, re-review makes it wobble, and a clean final review lets it hatch.
Hatchability usually comes from sufficient real-behavior proof, no blocking P0/P1/P2 findings, no security attention needed, and clean correctness. A merged PR is already final, so merge makes the egg hatchable independently.
The hatch is seeded from this repository and PR number, so the same PR keeps the same creature; the reviewed head SHA can only change safe visual details.
Rarity is just collectible sparkle: 🥚 common, 🌱 uncommon, 💎 rare, ✨ glimmer, and 🌈 legendary.

Takhoffman · 2026-05-20T03:28:10Z

@clawsweeper automerge

Takhoffman · 2026-05-20T05:04:19Z

@clawsweeper automerge

p0pfan · 2026-05-22T05:13:21Z

Why is this PR still waiting?

MoerAI · 2026-05-22T06:46:55Z

Status update: rebase head 84bcee94d8 has fully cleared CI.

check-test-types: SUCCESS at 2026-05-20 02:48 UTC (job 76877274293).
Full check rollup on 84bcee94d8: 69 SUCCESS, 30 SKIPPED, 1 NEUTRAL (CodeQL), 0 FAILURE.
Merge state: CLEAN against current upstream/main.
Labels still set: status: 🚀 automerge armed, proof: supplied, proof: sufficient.

The earlier clawsweeper:human-review was added on 2026-05-11 because the Codex review step itself errored (exit 1, unrelated to the PR contents) — not because of a finding on the code. Since then the chatgpt-codex-connector P1 was addressed in 0d30320e and the __testing → testing rebase fix landed in 84bcee94d8. There's no outstanding finding on this PR.

@Takhoffman — happy for you to re-issue @clawsweeper automerge against the new head when convenient.
@p0pfan — thanks for the patience; the PR is sitting at the human-review gate, not a code-quality gate.

Takhoffman · 2026-05-25T21:17:06Z

@clawsweeper approve

clawsweeper · 2026-05-25T21:17:08Z

🦞👀
ClawSweeper picked this up.

Command router queued. I will update this comment with the next step.

clawsweeper · 2026-05-25T21:35:58Z

ClawSweeper 🐠 reef update

Thanks for the work on this. ClawSweeper could not push to this branch with the permissions available, so it opened a narrow replacement PR to keep the fix swimming forward without losing the contributor trail. Not your fault, just GitHub branch-permission tides.

Why replacement: ClawSweeper could not update the source PR branch directly; GitHub did not grant sufficient push rights to the bot for that branch.
Replacement PR: #86645
Why close: this run explicitly closes the superseded source PR after the credited replacement PR is open, so review continues in one place.
This closeout is intentional for this run: the replacement PR is now the active review lane.
Contributor credit is carried into the replacement PR body and release-note context.
Co-author credit kept:

@MoerAI: Co-authored-by: ToToKr 26067127+MoerAI@users.noreply.github.com

fish notes: model gpt-5.5, reasoning high; reviewed against ca9c027.

openclaw-barnacle Bot added extensions: memory-core Extension: memory-core size: M proof: supplied External PR includes structured after-fix real behavior proof. labels May 11, 2026

chatgpt-codex-connector Bot reviewed May 11, 2026


clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 11, 2026

clawsweeper Bot mentioned this pull request May 11, 2026

feat(tts): add word-level timestamps and memory dreaming fixes #80646

Open

clawsweeper Bot added the clawsweeper:automerge Maintainer opted this PR into bounded ClawSweeper-reviewed automerge label May 11, 2026

clawsweeper Bot added clawsweeper:human-review Needs maintainer review before ClawSweeper can continue and removed proof: sufficient ClawSweeper judged the real behavior proof convincing. labels May 11, 2026

MilosM348 mentioned this pull request May 11, 2026

fix(memory-core): redact managed dreaming blocks during promotion rehydration (#80613) #80702

Closed

25 tasks

This was referenced May 15, 2026

fix(memory): Unicode support for MMR and FTS tokenizers #38945

Closed

[Bug]: dreaming pipeline leaks raw candidate content into MEMORY.md and CJK dedup is ineffective in tokenizeSnippet #80613

Closed

clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 16, 2026

grifjef mentioned this pull request May 18, 2026

fix(memory-core): treat dreaming fence marker lines as inside-fence in promotion guard (#80613) #83718

Open

15 tasks

MoerAI added 4 commits May 20, 2026 11:36

fix(memory-core): use Array.toSorted for openclaw#80613 lint fix

7f95e0d


MoerAI force-pushed the fix/dreaming-cjk-tokenizer branch from 0d30320 to 84bcee9 Compare May 20, 2026 02:42

clawsweeper Bot added the merge-risk: 🚨 automation 🚨 May affect CI, automerge, proof capture, label sync, or maintainer automation. label May 20, 2026

clawsweeper Bot removed the clawsweeper:human-review Needs maintainer review before ClawSweeper can continue label May 25, 2026

clawsweeper Bot mentioned this pull request May 25, 2026

fix(memory-core): use CJK-aware tokenizer for dreaming dedupe (#80613) #86645

Merged

clawsweeper Bot closed this May 25, 2026

MoerAI deleted the fix/dreaming-cjk-tokenizer branch May 26, 2026 02:05
