Bug type
Regression (worked before, now fails)
Beta release blocker
No
Summary
Two bugs found in OpenClaw's memory dreaming pipeline affecting CJK (Chinese/Japanese/Korean) workspaces:
- Dreaming block content leaks into promoted MEMORY.md snippets: raw "- Candidate:" / "- confidence:" / "- status: staged" dreaming format text gets included in snippets promoted to MEMORY.md instead of only clean human-written content.
- CJK Jaccard deduplication is ineffective: tokenizeSnippet in dreaming-phases.ts uses ASCII-only tokenization, producing empty token sets for CJK text and making similarity-based dedup fall back to exact string match only.
Steps to reproduce
Bug 1: Dreaming content leaks into promoted snippets
- Have a daily memory file (e.g. memory/2026-04-12.md) with both human-written content and a managed dreaming block:
## 教训:Plan 开关字段
- Plan 配置中实验开关字段是 `exRule`,不是 `abConfig`
- 配置中每个 Plan 的开关在 `exRule` 字段
## Light Sleep
<!-- openclaw:dreaming:light:start -->
- Candidate: 教训:Plan 实验开关字段: ...
- confidence: 0.00
- status: staged
<!-- openclaw:dreaming:light:end -->
- Run openclaw memory promote --apply
- Check MEMORY.md for the promoted entry
Bug 2: CJK dedup ineffective
- Configure workspace with CJK (Chinese) memory content
- Have two similar (but not identical) Chinese snippets in short-term recall
- Run dreaming sweep
- Observe both snippets pass dedup and appear as separate candidates in light dreaming output
Expected behavior
- Promoted snippets in MEMORY.md should contain only clean human-written content. All managed dreaming block content (- Candidate:, - confidence:, - status: staged, etc.) must be fully stripped before ingestion.
- Similar CJK snippets (e.g. two Chinese sentences describing the same concept with minor wording differences) should be correctly identified as duplicates and merged, same as ASCII content.
Actual behavior
Bug 1: Dreaming content leaks into promoted snippets
MEMORY.md receives a promoted entry containing raw dreaming format lines:
<!-- openclaw-memory-promotion:memory:memory/2026-04-12.md:1:19 -->
- ## 教训:Plan 实验开关字段 - Plan 配置中实验开关字段是 `exRule`...
## Light Sleep <!-- openclaw:dreaming:light:start -->
- Candidate: 教训:Plan 实验开关字段: ...
- confidence: 0.00
- status: staged
Root cause: buildDailySnippetChunks begins accumulating a chunk from the human-written lines before the <!-- openclaw:dreaming:light:start --> marker. The marker is not treated as a flush boundary, so chunk accumulation continues into the dreaming block. Even though stripManagedDailyDreamingLines runs first, the startLine/endLine metadata recorded in the chunk still references the pre-strip line positions, causing the first few dreaming lines to be captured in the snippet.
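One way to picture the flush-boundary fix described above is the following minimal sketch. It is not the actual buildDailySnippetChunks code: the function name `buildChunksWithMarkerFlush`, the `SnippetChunk` shape, and the marker regex (generalized from the `light` markers shown in this report) are all assumptions made for illustration. The idea is to flush the pending chunk as soon as a managed marker is seen and to skip every line until the matching end marker, so startLine/endLine never span a dreaming block.

```typescript
// Hypothetical sketch: not the actual buildDailySnippetChunks implementation.
interface SnippetChunk {
  startLine: number; // 1-based, inclusive
  endLine: number; // 1-based, inclusive
  text: string;
}

// Assumed marker shape, generalized from the "light" markers in this report.
const MANAGED_MARKER_RE = /^<!--\s*openclaw:dreaming:[a-z-]+:(start|end)\s*-->$/;

function buildChunksWithMarkerFlush(lines: string[]): SnippetChunk[] {
  const chunks: SnippetChunk[] = [];
  let pending: string[] = [];
  let pendingStart = 0;
  let insideManagedBlock = false;

  // Emit the accumulated human-written lines, if any, as one chunk.
  const flush = (endLine: number): void => {
    if (pending.length > 0) {
      chunks.push({ startLine: pendingStart, endLine, text: pending.join("\n") });
      pending = [];
    }
  };

  for (let i = 0; i < lines.length; i++) {
    const lineNo = i + 1;
    const marker = MANAGED_MARKER_RE.exec(lines[i].trim());
    if (marker) {
      // A managed marker is always a flush boundary.
      flush(lineNo - 1);
      insideManagedBlock = marker[1] === "start";
      continue;
    }
    if (insideManagedBlock) continue; // never accumulate dreaming lines
    if (pending.length === 0) pendingStart = lineNo;
    pending.push(lines[i]);
  }
  flush(lines.length);
  return chunks;
}
```

Note that this sketch only handles content between the markers. A heading such as "## Light Sleep" that sits outside the markers would still be kept and would need separate handling.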
Bug 2: CJK dedup ineffective in tokenizeSnippet
Two similar Chinese snippets describing the same concept with minor wording differences both pass through dedupeEntries with similarity score 0, producing duplicate candidates in the light dreaming output and eventually redundant entries in MEMORY.md.
Root cause: tokenizeSnippet splits on /[^a-z0-9]+/, so every CJK character is treated as a separator and CJK-only input produces an empty Set. With no tokens to compare, jaccardSimilarity then falls back to exact string comparison:
// dreaming-phases.ts
function tokenizeSnippet(snippet) {
  // CJK input → empty Set → jaccard falls back to exact-match only
  return new Set(
    snippet.toLowerCase().split(/[^a-z0-9]+/i)
      .map(token => token.trim())
      .filter(Boolean)
  );
}
Contrast with the correct CJK-aware implementation in mmr.ts:
const CJK_RE = /[\u3040-\u309f\u30a0-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af\u1100-\u11ff]/;
// extracts CJK unigrams + adjacent bigrams → similarity works correctly
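For illustration, a CJK-aware tokenizer along the lines described for mmr.ts might look like the sketch below. This is not the mmr.ts code itself: the names `tokenizeSnippetCjkAware` and `jaccard` are hypothetical, and only the character ranges are taken from the CJK_RE quoted above. It keeps the existing ASCII word tokens and adds CJK unigrams plus bigrams of directly adjacent CJK characters, so similar CJK snippets share tokens and get a non-zero Jaccard score.

```typescript
// Hypothetical sketch: not the actual mmr.ts or dreaming-phases.ts code.
// Character ranges copied from the CJK_RE quoted in this report.
const CJK_CHAR_RE =
  /[\u3040-\u309f\u30a0-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af\u1100-\u11ff]/;

function tokenizeSnippetCjkAware(snippet: string): Set<string> {
  const tokens = new Set<string>();
  const lower = snippet.toLowerCase();

  // ASCII words, same splitting rule as the existing tokenizer.
  for (const word of lower.split(/[^a-z0-9]+/)) {
    if (word) tokens.add(word);
  }

  // CJK unigrams, plus bigrams of directly adjacent CJK characters.
  // All ranges above are in the BMP, so indexing by UTF-16 unit is safe here.
  for (let i = 0; i < lower.length; i++) {
    const ch = lower.charAt(i);
    if (!CJK_CHAR_RE.test(ch)) continue;
    tokens.add(ch);
    const next = lower.charAt(i + 1);
    if (next && CJK_CHAR_RE.test(next)) tokens.add(ch + next);
  }
  return tokens;
}

// Plain Jaccard similarity: |A ∩ B| / |A ∪ B|, defined as 0 for two empty sets.
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 0;
  let intersection = 0;
  a.forEach((t) => {
    if (b.has(t)) intersection++;
  });
  return intersection / (a.size + b.size - intersection);
}
```

With this kind of tokenization, two Chinese snippets that share most of their characters yield overlapping token sets, so a threshold-based dedup can treat them the same way it treats similar ASCII snippets.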
OpenClaw version
v2026.5.6
Operating system
Linux 5.10 (x64)
Install method
pnpm dev
Model
claude-sonnet-4.6
Provider / routing chain
idealab-anthropic → claude-sonnet-4-6
Additional provider/model setup details
No response
Logs, screenshots, and evidence
Impact and severity
No response
Additional information
No response