[Bug]: dreaming pipeline leaks raw candidate content into MEMORY.md and CJK dedup is ineffective in tokenizeSnippet #80613

@p0pfan

Description

Bug type

Regression (worked before, now fails)

Beta release blocker

No

Summary

Two bugs found in OpenClaw's memory dreaming pipeline affecting CJK (Chinese/Japanese/Korean) workspaces:

  1. Dreaming block content leaks into promoted MEMORY.md snippets: raw dreaming-format lines (- Candidate:, - confidence:, - status: staged) get included in snippets promoted to MEMORY.md instead of only clean human-written content.
  2. CJK Jaccard deduplication is ineffective: tokenizeSnippet in dreaming-phases.ts uses ASCII-only tokenization, producing empty token sets for CJK text and making similarity-based dedup fall back to exact string match only.

Steps to reproduce

Bug 1: Dreaming content leaks into promoted snippets

  1. Have a daily memory file (e.g. memory/2026-04-12.md) with both human-written content and a managed dreaming block:
## 教训:Plan 开关字段
- Plan 配置中实验开关字段是 `exRule`,不是 `abConfig`
- 配置中每个 Plan 的开关在 `exRule` 字段

## Light Sleep
<!-- openclaw:dreaming:light:start -->
- Candidate: 教训:Plan 实验开关字段: ...
  - confidence: 0.00
  - status: staged
<!-- openclaw:dreaming:light:end -->
  2. Run openclaw memory promote --apply
  3. Check MEMORY.md for the promoted entry

Bug 2: CJK dedup ineffective

  1. Configure workspace with CJK (Chinese) memory content
  2. Have two similar (but not identical) Chinese snippets in short-term recall
  3. Run dreaming sweep
  4. Observe both snippets pass dedup and appear as separate candidates in light dreaming output

Expected behavior

  1. Promoted snippets in MEMORY.md should contain only clean human-written content. All managed dreaming block content (- Candidate:, - confidence:, - status: staged, etc.) must be fully stripped before ingestion.
  2. Similar CJK snippets (e.g. two Chinese sentences describing the same concept with minor wording differences) should be correctly identified as duplicates and merged, same as ASCII content.

Actual behavior

Bug 1: Dreaming content leaks into promoted snippets

MEMORY.md receives a promoted entry containing raw dreaming format lines:

<!-- openclaw-memory-promotion:memory:memory/2026-04-12.md:1:19 -->
- ## 教训:Plan 实验开关字段 - Plan 配置中实验开关字段是 `exRule`...
  ## Light Sleep <!-- openclaw:dreaming:light:start -->
  - Candidate: 教训:Plan 实验开关字段: ...
    - confidence: 0.00
    - status: staged

Root cause: buildDailySnippetChunks begins accumulating a chunk from the human-written lines before the <!-- openclaw:dreaming:light:start --> marker. The marker is not treated as a flush boundary, so chunk accumulation continues into the dreaming block. Even though stripManagedDailyDreamingLines runs first, the startLine/endLine metadata recorded in the chunk still references the pre-strip line positions, causing the first few dreaming lines to be captured in the snippet.
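
One possible shape for a fix, as a minimal sketch and not the actual OpenClaw code: treat the managed-block markers as flush boundaries, skip everything between them, and record line numbers from the original file so that startLine/endLine never point into a dreaming block. The function name, the SnippetChunk type, the marker regexes, the blank-line chunking rule, and the decision to drop the heading directly above the start marker are all assumptions made for illustration.

```typescript
// Hypothetical types and names; none of this is taken from the OpenClaw source.
interface SnippetChunk {
  startLine: number; // 1-based, inclusive, position in the ORIGINAL file
  endLine: number; // 1-based, inclusive, position in the ORIGINAL file
  text: string;
}

// Assumed marker shape, based on the markers quoted in this report.
const BLOCK_START = /<!--\s*openclaw:dreaming:[a-z]+:start\s*-->/;
const BLOCK_END = /<!--\s*openclaw:dreaming:[a-z]+:end\s*-->/;

function buildChunksSkippingManagedBlocks(source: string): SnippetChunk[] {
  const chunks: SnippetChunk[] = [];
  let pending: { line: number; text: string }[] = [];
  let insideManagedBlock = false;

  // Emit the accumulated human-written lines as one chunk.
  const flush = (): void => {
    if (pending.length === 0) return;
    chunks.push({
      startLine: pending[0].line,
      endLine: pending[pending.length - 1].line,
      text: pending.map((entry) => entry.text).join("\n"),
    });
    pending = [];
  };

  source.split("\n").forEach((text, index) => {
    if (BLOCK_START.test(text)) {
      // Assumption: a heading directly above the marker (e.g. "## Light Sleep")
      // belongs to the managed block, so drop it before flushing.
      const last = pending[pending.length - 1];
      if (last !== undefined && /^\s*#/.test(last.text)) pending.pop();
      flush(); // the start marker is a hard flush boundary
      insideManagedBlock = true;
      return;
    }
    if (BLOCK_END.test(text)) {
      insideManagedBlock = false;
      return;
    }
    if (insideManagedBlock) return; // never accumulate managed lines
    if (text.trim() === "") {
      flush(); // assumption: blank lines separate chunks
      return;
    }
    pending.push({ line: index + 1, text });
  });

  flush();
  return chunks;
}
```

Run against the daily file from the reproduction steps, this sketch would yield a single chunk covering lines 1 to 3, with no Candidate, confidence, or status lines in its text.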

Bug 2: CJK dedup ineffective in tokenizeSnippet

Two similar Chinese snippets describing the same concept with minor wording differences both pass through dedupeEntries with similarity score 0, producing duplicate candidates in the light dreaming output and eventually redundant entries in MEMORY.md.

Root cause: tokenizeSnippet splits on /[^a-z0-9]+/, which produces an empty Set for pure-CJK input (mixed input keeps only its ASCII tokens). jaccardSimilarity then falls back to exact string comparison:

// dreaming-phases.ts
function tokenizeSnippet(snippet) {
  return new Set(
    snippet.toLowerCase().split(/[^a-z0-9]+/i)
      .map(token => token.trim())
      .filter(Boolean)
  );
  // CJK input → empty Set → jaccard falls back to exact-match only
}
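
The effect of the ASCII-only split can be checked in isolation. The snippet below is a standalone re-creation of the tokenizer quoted above, under the hypothetical name tokenizeAsciiOnly; it is not imported from OpenClaw, and the sample strings are illustrative.

```typescript
// Standalone re-creation of the quoted tokenizer (hypothetical name).
function tokenizeAsciiOnly(snippet: string): Set<string> {
  return new Set(
    snippet
      .toLowerCase()
      .split(/[^a-z0-9]+/)
      .filter(Boolean),
  );
}

// Pure CJK text: every character is a separator, so no tokens survive.
const pureCjk = tokenizeAsciiOnly("配置中每个开关字段");
console.log(pureCjk.size); // 0

// Mixed text: only the ASCII words survive; the CJK part is ignored.
const mixed = tokenizeAsciiOnly("Plan 配置中实验开关字段是 exRule");
console.log(Array.from(mixed)); // [ 'plan', 'exrule' ]
```

Two mixed snippets can therefore still match on their shared ASCII identifiers, while the CJK wording that carries most of the meaning contributes nothing to the similarity score.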

Contrast with the correct CJK-aware implementation in mmr.ts:

const CJK_RE = /[\u3040-\u309f\u30a0-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af\u1100-\u11ff]/;
// extracts CJK unigrams + adjacent bigrams → similarity works correctly
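
A CJK-aware tokenizeSnippet could follow the same idea. The sketch below is an assumption about what such a fix might look like, not the code in mmr.ts: it keeps the existing ASCII word tokens and adds CJK unigrams plus adjacent bigrams, reusing the character ranges quoted above. The names tokenizeCjkAware and jaccard are hypothetical, and returning 0 for two empty sets is a choice made here, not confirmed behavior of jaccardSimilarity.

```typescript
// Character ranges copied from the CJK_RE quoted in this report.
const CJK_CHAR =
  /[\u3040-\u309f\u30a0-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af\u1100-\u11ff]/;

// Hypothetical replacement tokenizer: ASCII words + CJK unigrams + CJK bigrams.
function tokenizeCjkAware(snippet: string): Set<string> {
  const tokens = new Set<string>();
  const lower = snippet.toLowerCase();

  // ASCII words, same split as the existing tokenizer.
  for (const word of lower.split(/[^a-z0-9]+/)) {
    if (word) tokens.add(word);
  }

  // All ranges above are in the BMP, so indexing by UTF-16 code unit is safe.
  for (let i = 0; i < lower.length; i++) {
    const current = lower.charAt(i);
    if (!CJK_CHAR.test(current)) continue;
    tokens.add(current); // unigram
    const next = lower.charAt(i + 1);
    if (next !== "" && CJK_CHAR.test(next)) tokens.add(current + next); // bigram
  }
  return tokens;
}

// Plain Jaccard similarity: |A ∩ B| / |A ∪ B|.
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 0; // assumption: no tokens, no evidence
  let shared = 0;
  a.forEach((token) => {
    if (b.has(token)) shared++;
  });
  return shared / (a.size + b.size - shared);
}

// Two pure-CJK phrases that overlap on "开关字段" now score above zero.
const left = tokenizeCjkAware("实验开关字段");
const right = tokenizeCjkAware("开关字段配置");
console.log(jaccard(left, right) > 0); // true
```

Bigrams make the score sensitive to character order, which unigrams alone would miss. Whether the resulting scores clear the threshold used by dedupeEntries would need to be checked against the real pipeline.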

OpenClaw version

v2026.5.6

Operating system

Linux 5.10 (x64)

Install method

pnpm dev

Model

claude-sonnet-4.6

Provider / routing chain

idealab-anthropic → claude-sonnet-4-6

Additional provider/model setup details

No response

Logs, screenshots, and evidence

Impact and severity

No response

Additional information

No response
