fix: preserve persona and language continuity in compaction summaries #10456
function truncateUnicodeSafe(s: string, maxCodePoints: number): string {
  if (s.length <= maxCodePoints) return s;
  const chars = Array.from(s);
  return chars.length <= maxCodePoints ? s : chars.slice(0, maxCodePoints).join("");
}
Incorrect truncation condition
truncateUnicodeSafe compares s.length (UTF-16 code units) to maxCodePoints. This will skip truncation for strings where s.length <= 800 but the number of Unicode code points is > 800 (e.g., many combining-mark sequences), allowing instructions to exceed the intended 800 code point cap. Use Array.from(s).length (or a shared code-point count) for the early-return check so the limit is actually enforced for all Unicode inputs.
Already fixed in dc37ce4 — the early-return now uses Array.from(s).length (line 27), which counts code points instead of UTF-16 code units.
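As a minimal sketch of the code-point-based truncation discussed in this thread (the helper name truncateByCodePoints is hypothetical and is not the PR's actual code), the cap can be checked and applied in the same unit. One fact worth keeping in mind: every code point occupies one or two UTF-16 code units, so Array.from(s).length is never larger than s.length. Counting code points in the early return mainly keeps the check in the same unit as the cap.

```typescript
// Illustrative sketch only; the helper name is hypothetical, not the PR's code.
function truncateByCodePoints(s: string, maxCodePoints: number): string {
  // Array.from walks the string by code point, so a surrogate pair is one entry.
  const codePoints = Array.from(s);
  if (codePoints.length <= maxCodePoints) return s;
  // Slicing the array (not the string) can never cut a pair in half.
  return codePoints.slice(0, maxCodePoints).join("");
}

const grin = "\u{1F600}";
console.log(grin.length); // 2 (UTF-16 code units)
console.log(Array.from(grin).length); // 1 (code point)

const cut = truncateByCodePoints("X" + grin.repeat(5), 3);
console.log(Array.from(cut).length); // 3
console.log(cut.length); // 5
```

The trade-off is that Array.from allocates an array for the whole input; for an 800 code point cap on short instruction strings that cost is likely negligible.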
it("does not split surrogate pair when cut lands inside a pair", () => {
  const input = "X" + "\u{1F600}".repeat(800);
  const result = resolveCompactionInstructions(input, undefined);
  const codePoints = Array.from(result);
  expect(codePoints).toHaveLength(800);
  expect(codePoints[0]).toBe("X");
  const lastCodeUnit = result.charCodeAt(result.length - 1);
  const isLowSurrogate = lastCodeUnit >= 0xdc00 && lastCodeUnit <= 0xdfff;
  expect(isLowSurrogate).toBe(true);
});
Surrogate-pair test is wrong
This test asserts the last code unit is a low surrogate (0xDC00–0xDFFF) after truncation, but if truncation is “surrogate-safe” the string should never end with an unmatched surrogate at all. As written, it would pass for an output that ends with a dangling low surrogate (which is a broken string) and fail for a correctly-truncated output that ends on a complete code point. Please adjust the assertion to validate you don’t end with a lone high/low surrogate (and/or that the result round-trips via Array.from without replacement chars).
Already fixed in dc37ce4 — the test now asserts no lone surrogates exist (lines 158–161). Each code point is checked against the 0xD800–0xDFFF range, so a dangling surrogate would fail the test.
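The check described in the reply above could look roughly like the following. This is a sketch under assumptions: hasLoneSurrogate is a hypothetical helper, not the PR's actual test code. It scans every code point rather than only the final code unit.

```typescript
// Illustrative sketch; hasLoneSurrogate is a hypothetical helper, not the PR's test code.
function hasLoneSurrogate(s: string): boolean {
  // Array.from yields one entry per code point. A well-formed pair decodes to a
  // value >= 0x10000, so any entry inside 0xD800-0xDFFF is an unpaired surrogate.
  return Array.from(s).some((ch) => {
    const cp = ch.codePointAt(0);
    return cp !== undefined && cp >= 0xd800 && cp <= 0xdfff;
  });
}

const text = "X" + "\u{1F600}".repeat(2); // 5 UTF-16 code units, 3 code points
const byCodeUnit = text.slice(0, 2); // cuts the first emoji in half
const byCodePoint = Array.from(text).slice(0, 2).join(""); // keeps it whole

console.log(hasLoneSurrogate(byCodeUnit)); // true
console.log(hasLoneSurrogate(byCodePoint)); // false
```

A complete pair also ends in a low surrogate, so inspecting only the last code unit cannot distinguish a whole emoji from a dangling low surrogate. Scanning each code point can.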
Force-pushed from 5bcf43d to dc37ce4
All Node.js checks are passing (lint, test, build, format, protocol on both Linux and Windows). The two remaining failures are unrelated to this PR:
These same failures appear across other PRs in the repo as well.
Force-pushed from d54882f to be89676
Force-pushed from bfc1ccb to f92900f
This pull request has been automatically marked as stale due to inactivity.
Force-pushed from be89676 to d73c2a5
Force-pushed from 1ad3abf to 8ec2f80
SDK auto-compaction hardcodes customInstructions to undefined, causing summaries to always be generated in English. This breaks persona and language continuity for non-English agents after context compaction.

Add a DEFAULT_COMPACTION_INSTRUCTIONS constant that instructs the summarizer to preserve the conversation language and persona cues. Wire a config → runtime → safeguard fallback chain so users can override via compaction.customInstructions in the config.

Changes:
- New compaction-instructions.ts with resolve/compose utilities
- Config schema + types: add optional customInstructions field
- Runtime type: add customInstructions to WeakMap registry
- Extension builder: pass config value to runtime
- Safeguard extension: use precedence chain (event → config → default) across all three summarization paths (dropped/history/split-turn)
- 35 unit tests covering precedence, normalization, Unicode-safe truncation, and split-turn composition
Replace "Preserve persona, character, and speaking-style cues" with "Focus on factual content: what was discussed, decisions made, and current state" to prevent the summarizer from injecting persona descriptions into the summary (wasting tokens and potentially conflicting with system prompt persona).
Force-pushed from 8ec2f80 to 3d432e4
Running a Russian-language persona via SOUL.md and hit this exact problem: post-compaction the agent switches to English for several turns, narration text leaks into Telegram, and the persona tone goes flat until enough context rebuilds. The current compaction prompt has no language or persona anchoring at all, so the summarizer defaults to English regardless of what the actual conversation language is. This PR's approach (inheriting language + persona from the active SOUL.md) is the right fix. Bumping this; it would be a shame to lose it to stale-bot.
Force-pushed from ea8bd5d to 4518fb2
Merged via squash.
Thanks @keepitmello!
…openclaw#10456) Merged via squash. Prepared head SHA: 4518fb2 Co-authored-by: keepitmello <71975659+keepitmello@users.noreply.github.com> Co-authored-by: jalehman <550978+jalehman@users.noreply.github.com> Reviewed-by: @jalehman
Background
I run a Korean-language persona agent on OpenClaw (custom SOUL.md + IDENTITY.md setup). After long conversations, when auto-compaction kicks in, the agent suddenly starts responding in English for a few turns before recovering. The English narration text also leaks into Telegram messages because of blockStreamingBreak: "text_end".

After digging into it, I found the root cause: the SDK's autoCompact() in agent-session.js hardcodes customInstructions: undefined when emitting session_before_compact. The summarization prompt and system prompt are both English-only, so the summary always comes out in English. Since the summary gets injected as a user message (via COMPACTION_SUMMARY_PREFIX), the large block of English text right before the model's next response biases it toward English output.

The system prompt (with SOUL.md etc.) is correctly re-injected every run, so the persona eventually recovers, but for the first few turns after compaction the agent is broken.
Approach
The customInstructions parameter already exists in the SDK's generateSummary() pipeline; it's just never populated during auto-compaction. Since we can't change the SDK directly, this PR works within the safeguard extension layer:

- Config field: adds compaction.customInstructions to the agent config schema, so users can provide explicit instructions if needed.
- Default instructions: when no config is set, a DEFAULT_COMPACTION_INSTRUCTIONS constant is injected that tells the summarizer to:
- Precedence chain: event (SDK) → config (runtime) → default constant, with normalization (trim, empty-string-to-undefined) to prevent blank values from short-circuiting the chain.
- All three summarization paths covered: dropped messages, history, and split-turn prefixes all go through the same resolver. The split-turn path composes the existing TURN_PREFIX_INSTRUCTIONS with the resolved instructions.

Changes
- compaction-instructions.ts: resolveCompactionInstructions(), composeSplitTurnInstructions(), Unicode-safe truncation (800 char cap)
- compaction-instructions.test.ts: unit tests
- zod-schema.agent-defaults.ts: add customInstructions to compaction schema
- types.agent-defaults.ts: add customInstructions to AgentCompactionConfig
- compaction-safeguard-runtime.ts: add customInstructions to CompactionSafeguardRuntimeValue
- extensions.ts: pass the config value via setCompactionSafeguardRuntime()
- compaction-safeguard.ts: use resolveCompactionInstructions() across all three paths

Notes
- Applies to safeguard mode; default mode is untouched.
- Truncation uses Array.from() to avoid splitting surrogate pairs (emoji, CJK supplementary characters, etc.).

Test plan
- tsc --noEmit passes
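
As a rough illustration of the precedence chain and split-turn composition described under Approach, the logic might be sketched as follows. The function names (normalize, resolveSketch, composeSketch), the placeholder default text, and the blank-line join are all assumptions made for this example; only the event → config → default order, the trim and empty-to-undefined normalization, and the 800 cap come from the description.

```typescript
// Illustrative sketch only. Function names, the placeholder strings, and the
// blank-line join in composeSketch are assumptions, not the PR's actual code.
const PLACEHOLDER_DEFAULT = "Write the summary in the conversation's language.";
const MAX_CODE_POINTS = 800; // cap stated in the PR description

// Trim, and map empty or whitespace-only input to undefined so it cannot win.
function normalize(value: string | undefined): string | undefined {
  const trimmed = value?.trim();
  return trimmed ? trimmed : undefined;
}

// Precedence: event (SDK) -> config (runtime) -> default constant.
function resolveSketch(eventValue?: string, configValue?: string): string {
  const chosen =
    normalize(eventValue) ?? normalize(configValue) ?? PLACEHOLDER_DEFAULT;
  const codePoints = Array.from(chosen);
  return codePoints.length <= MAX_CODE_POINTS
    ? chosen
    : codePoints.slice(0, MAX_CODE_POINTS).join("");
}

// Split-turn path: prepend the turn-prefix text to the resolved instructions.
function composeSketch(turnPrefix: string, resolved: string): string {
  return `${turnPrefix}\n\n${resolved}`;
}

console.log(resolveSketch("   ", "Summarize in Korean.")); // Summarize in Korean.
console.log(resolveSketch(undefined, "") === PLACEHOLDER_DEFAULT); // true
console.log(composeSketch("<turn prefix text>", resolveSketch("From event", "From config")));
```

Normalizing before applying the ?? operator is what keeps a blank event or config value from short-circuiting the chain, since ?? only falls through on null or undefined.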