[Bug]: dreaming pipeline leaks raw candidate content into MEMORY.md and CJK dedup is ineffective in tokenizeSnippet #80613

### Bug type

Regression (worked before, now fails)

### Beta release blocker

No

### Summary

Two bugs in OpenClaw's memory dreaming pipeline affect CJK (Chinese/Japanese/Korean) workspaces:

1. **Dreaming block content leaks into promoted MEMORY.md snippets** — Raw `- Candidate: / - confidence: / - status: staged` dreaming format text gets included in snippets promoted to `MEMORY.md` instead of only clean human-written content.

2. **CJK Jaccard deduplication is ineffective** — `tokenizeSnippet` in `dreaming-phases.ts` uses ASCII-only tokenization, producing empty token sets for CJK text and making similarity-based dedup fall back to exact string match only.


### Steps to reproduce

### Bug 1: Dreaming content leaks into promoted snippets

1. Have a daily memory file (e.g. `memory/2026-04-12.md`) with both human-written content and a managed dreaming block:

```markdown
## 教训：Plan 开关字段
- Plan 配置中实验开关字段是 `exRule`，不是 `abConfig`
- 配置中每个 Plan 的开关在 `exRule` 字段

## Light Sleep

- Candidate: 教训：Plan 实验开关字段: ...
  - confidence: 0.00
  - status: staged

```

2. Run `openclaw memory promote --apply`
3. Check `MEMORY.md` for the promoted entry

### Bug 2: CJK dedup ineffective

1. Configure a workspace with CJK (Chinese) memory content
2. Have two similar (but not identical) Chinese snippets in short-term recall
3. Run a dreaming sweep
4. Observe that both snippets pass dedup and appear as separate candidates in the light dreaming output
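
The tokenization half of this can be checked in isolation, without running a sweep. The sketch below copies the ASCII-only tokenizer quoted in the root-cause analysis of this report (type annotations added) and feeds it two sample sentences. The sample strings and variable names are made up for illustration and are not taken from a real workspace.

```ts
// Copy of the ASCII-only tokenizer quoted in this report, with types added.
function tokenizeSnippet(snippet: string): Set<string> {
  return new Set(
    snippet
      .toLowerCase()
      .split(/[^a-z0-9]+/i)
      .map((token) => token.trim())
      .filter(Boolean),
  );
}

// Illustrative inputs (assumed, not from a real memory file).
const reproPureCjkA = "配置中的实验开关字段";
const reproPureCjkB = "配置里的实验开关字段";
const reproMixed = "Plan 配置中实验开关字段是 exRule";

console.log(tokenizeSnippet(reproPureCjkA).size); // 0
console.log(tokenizeSnippet(reproPureCjkB).size); // 0
// Mixed text keeps only its ASCII words; every CJK character is dropped.
console.log(Array.from(tokenizeSnippet(reproMixed))); // ["plan", "exrule"]
```

With empty token sets on both sides, a token-overlap score carries no information about how similar the two sentences are.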

### Expected behavior

1. Promoted snippets in `MEMORY.md` should contain only clean human-written content. All managed dreaming block content (`- Candidate:`, `- confidence:`, `- status: staged`, etc.) must be fully stripped before ingestion.
2. Similar CJK snippets (e.g. two Chinese sentences describing the same concept with minor wording differences) should be identified as duplicates and merged, the same as ASCII content.

### Actual behavior


### Bug 1: Dreaming content leaks into promoted snippets

`MEMORY.md` receives a promoted entry containing raw dreaming format lines:

```

- ## 教训：Plan 实验开关字段 - Plan 配置中实验开关字段是 `exRule`...
  ## Light Sleep 
  - Candidate: 教训：Plan 实验开关字段: ...
    - confidence: 0.00
    - status: staged
```

**Root cause:** `buildDailySnippetChunks` begins accumulating a chunk from the human-written lines before the managed dreaming block's start marker. The marker is not treated as a flush boundary, so chunk accumulation continues into the dreaming block. Even though `stripManagedDailyDreamingLines` runs first, the `startLine`/`endLine` metadata recorded in the chunk still references the pre-strip line positions, so the first few dreaming lines are captured in the snippet.
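
One possible direction for a fix, as a minimal sketch rather than a patch against the real code: treat the managed block's start marker as a flush boundary and skip everything up to the end marker, while recording line numbers against the original file. Everything named here is an assumption made for illustration: the marker strings, the `SnippetChunk` shape, the function name, and the use of blank lines as chunk boundaries. The actual `buildDailySnippetChunks` may chunk differently.

```ts
// Hypothetical marker strings; the real marker text is not shown in this report.
const MANAGED_START = "<!-- managed-dreaming:start -->";
const MANAGED_END = "<!-- managed-dreaming:end -->";

// Assumed chunk shape; line numbers are 1-based positions in the original file.
interface SnippetChunk {
  startLine: number;
  endLine: number;
  text: string;
}

function buildChunksSkippingManagedBlocks(lines: string[]): SnippetChunk[] {
  const chunks: SnippetChunk[] = [];
  let pending: string[] = [];
  let startLine = 0;
  let endLine = 0;
  let insideManaged = false;

  // Emit the pending chunk, if any, and reset the accumulator.
  const flush = (): void => {
    if (pending.length > 0) {
      chunks.push({ startLine, endLine, text: pending.join("\n") });
    }
    pending = [];
  };

  for (let i = 0; i < lines.length; i++) {
    const lineNo = i + 1;
    const trimmed = lines[i].trim();
    if (trimmed === MANAGED_START) {
      flush(); // the marker ends the human-written chunk
      insideManaged = true;
      continue;
    }
    if (trimmed === MANAGED_END) {
      insideManaged = false;
      continue;
    }
    if (insideManaged) continue; // never accumulate managed lines
    if (trimmed === "") {
      flush(); // assumed: blank lines separate chunks
      continue;
    }
    if (pending.length === 0) startLine = lineNo;
    pending.push(lines[i]);
    endLine = lineNo;
  }
  flush();
  return chunks;
}
```

Because the managed lines are skipped during accumulation instead of being removed beforehand, `startLine`/`endLine` always describe ranges of the original file that contain only human-written lines.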

### Bug 2: CJK dedup ineffective in `tokenizeSnippet`

Two similar Chinese snippets describing the same concept with minor wording differences both pass through `dedupeEntries` with similarity score `0`, producing duplicate candidates in the light dreaming output and eventually redundant entries in `MEMORY.md`.

**Root cause:** `tokenizeSnippet` splits on `/[^a-z0-9]+/i`, which produces an empty `Set` for any CJK-only input. `jaccardSimilarity` then falls back to exact string comparison:

```ts
// dreaming-phases.ts
function tokenizeSnippet(snippet) {
  return new Set(
    snippet.toLowerCase().split(/[^a-z0-9]+/i)
      .map(token => token.trim())
      .filter(Boolean)
  );
  // CJK input → empty Set → jaccard falls back to exact-match only
}
```

Contrast with the correct CJK-aware implementation in `mmr.ts`:

```ts
const CJK_RE = /[\u3040-\u309f\u30a0-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af\u1100-\u11ff]/;
// extracts CJK unigrams + adjacent bigrams → similarity works correctly
```
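
A CJK-aware `tokenizeSnippet` along the same lines might look like the sketch below. It reuses the character ranges from the `CJK_RE` quoted above and emits ASCII words plus CJK unigrams and adjacent bigrams. This is an illustration of the approach, not the code in `mmr.ts`; the names `tokenizeSnippetCjkAware` and `jaccard` are hypothetical, and the real `jaccardSimilarity` may handle empty sets differently.

```ts
// Same character ranges as the CJK_RE quoted from mmr.ts.
const CJK_CHAR = /[\u3040-\u309f\u30a0-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af\u1100-\u11ff]/;

// Hypothetical replacement tokenizer: ASCII words + CJK unigrams + CJK bigrams.
function tokenizeSnippetCjkAware(snippet: string): Set<string> {
  const tokens = new Set<string>();
  const lower = snippet.toLowerCase();

  // ASCII words, split the same way as the current tokenizer.
  for (const word of lower.split(/[^a-z0-9]+/)) {
    if (word) tokens.add(word);
  }

  // Each CJK character is a token; each pair of directly adjacent
  // CJK characters adds a bigram token.
  for (let i = 0; i < lower.length; i++) {
    const ch = lower.charAt(i);
    if (!CJK_CHAR.test(ch)) continue;
    tokens.add(ch);
    const next = lower.charAt(i + 1);
    if (next !== "" && CJK_CHAR.test(next)) tokens.add(ch + next);
  }
  return tokens;
}

// Plain Jaccard index: |A ∩ B| / |A ∪ B|. Returns 0 when both sets are empty.
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 0;
  let shared = 0;
  a.forEach((token) => {
    if (b.has(token)) shared++;
  });
  return shared / (a.size + b.size - shared);
}

// Two illustrative sentences that differ by one character (中 vs 里).
const similarityA = tokenizeSnippetCjkAware("配置中的实验开关字段");
const similarityB = tokenizeSnippetCjkAware("配置里的实验开关字段");
console.log(jaccard(similarityA, similarityB)); // 16 shared tokens out of 22, about 0.73
```

With a score like this, the two sentences can be compared against whatever similarity threshold `dedupeEntries` already applies to ASCII content, instead of always scoring `0`.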


### OpenClaw version

v2026.5.6

### Operating system

Linux 5.10 (x64)

### Install method

pnpm dev

### Model

claude-sonnet-4.6

### Provider / routing chain

idealab-anthropic → claude-sonnet-4-6

### Additional provider/model setup details

_No response_

### Logs, screenshots, and evidence

_No response_

### Impact and severity

_No response_

### Additional information

_No response_
