fix: strip leaked model control tokens from user-facing text#42173
fix: strip leaked model control tokens from user-facing text#42173odysseus0 merged 1 commit intoopenclaw:mainfrom
Conversation
…w#40020) Models like GLM-5 and DeepSeek sometimes emit internal delimiter tokens (e.g. `<|assistant|>`, `<|begin▁of▁sentence|>`) in their responses. These use the universal `<|...|>` convention and should never reach users. Uses a generic pattern `<[||]...[||]>` instead of a brittle allowlist, following the same architecture as `stripMinimaxToolCallXml` — provider bug defense in the text extraction pipeline. Closes openclaw#40020 Co-authored-by: imwyvern <100903837+imwyvern@users.noreply.github.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8d5b542ee6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| * @see https://github.com/openclaw/openclaw/issues/40020 | ||
| */ | ||
| // Match both ASCII pipe <|...|> and full-width pipe <|...|> (U+FF5C) variants. | ||
| const MODEL_SPECIAL_TOKEN_RE = /<[||][^||]*[||]>/g; |
There was a problem hiding this comment.
Restrict token matcher to avoid removing legitimate
<|...|> text
The new MODEL_SPECIAL_TOKEN_RE is broad enough to treat any <|...|> span as a control token, so normal assistant output that contains language syntax or examples using <| and |> (for example f <| x |> g) will be stripped and corrupted before display/replay via extractAssistantText and sanitizeTextContent. Because this sanitizer now runs on all user-facing assistant/tool text, the current pattern can delete real content that is not a provider control token.
Useful? React with 👍 / 👎.
Greptile SummaryThis PR adds
Confidence Score: 4/5
Last reviewed commit: 8d5b542 |
| const MODEL_SPECIAL_TOKEN_RE = /<[||][^||]*[||]>/g; | ||
|
|
||
| export function stripModelSpecialTokens(text: string): string { | ||
| if (!text) { | ||
| return text; | ||
| } | ||
| if (!MODEL_SPECIAL_TOKEN_RE.test(text)) { | ||
| return text; | ||
| } | ||
| MODEL_SPECIAL_TOKEN_RE.lastIndex = 0; | ||
| return text.replace(MODEL_SPECIAL_TOKEN_RE, " ").replace(/ +/g, " ").trim(); |
There was a problem hiding this comment.
Stateful global regex on module-level constant
The g flag makes MODEL_SPECIAL_TOKEN_RE stateful via lastIndex. The current code handles this correctly — when test() returns true, lastIndex is explicitly reset to 0 before calling replace(); when test() returns false, JavaScript automatically resets lastIndex to 0. So there is no bug here.
However, this pattern is a well-known JavaScript footgun. If a future change adds another test() / exec() call against this constant without a corresponding lastIndex reset, it will silently skip matches for every other call on the same string. Consider making the intent explicit with a comment, or avoid the footgun entirely by splitting into a non-global test regex and a local regex literal in replace():
| const MODEL_SPECIAL_TOKEN_RE = /<[||][^||]*[||]>/g; | |
| export function stripModelSpecialTokens(text: string): string { | |
| if (!text) { | |
| return text; | |
| } | |
| if (!MODEL_SPECIAL_TOKEN_RE.test(text)) { | |
| return text; | |
| } | |
| MODEL_SPECIAL_TOKEN_RE.lastIndex = 0; | |
| return text.replace(MODEL_SPECIAL_TOKEN_RE, " ").replace(/ +/g, " ").trim(); | |
| // Match both ASCII pipe <|...|> and full-width pipe <|...|> (U+FF5C) variants. | |
| // Note: no /g flag here — used only for the fast-path existence check (stateless). | |
| const MODEL_SPECIAL_TOKEN_RE_TEST = /<[||][^||]*[||]>/; | |
| // Separate global regex for replace(); defined here to avoid recompiling on every call. | |
| const MODEL_SPECIAL_TOKEN_RE = /<[||][^||]*[||]>/g; | |
| export function stripModelSpecialTokens(text: string): string { | |
| if (!text) { | |
| return text; | |
| } | |
| if (!MODEL_SPECIAL_TOKEN_RE_TEST.test(text)) { | |
| return text; | |
| } | |
| return text.replace(MODEL_SPECIAL_TOKEN_RE, " ").replace(/ +/g, " ").trim(); | |
| } |
This removes the need for the manual lastIndex = 0 reset and makes the two distinct usages self-documenting.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/agents/pi-embedded-utils.ts
Line: 49-59
Comment:
**Stateful global regex on module-level constant**
The `g` flag makes `MODEL_SPECIAL_TOKEN_RE` stateful via `lastIndex`. The current code handles this correctly — when `test()` returns `true`, `lastIndex` is explicitly reset to `0` before calling `replace()`; when `test()` returns `false`, JavaScript automatically resets `lastIndex` to `0`. So there is no bug here.
However, this pattern is a well-known JavaScript footgun. If a future change adds another `test()` / `exec()` call against this constant without a corresponding `lastIndex` reset, it will silently skip matches for every other call on the same string. Consider making the intent explicit with a comment, or avoid the footgun entirely by splitting into a non-global test regex and a local regex literal in `replace()`:
```suggestion
// Match both ASCII pipe <|...|> and full-width pipe <|...|> (U+FF5C) variants.
// Note: no /g flag here — used only for the fast-path existence check (stateless).
const MODEL_SPECIAL_TOKEN_RE_TEST = /<[||][^||]*[||]>/;
// Separate global regex for replace(); defined here to avoid recompiling on every call.
const MODEL_SPECIAL_TOKEN_RE = /<[||][^||]*[||]>/g;
export function stripModelSpecialTokens(text: string): string {
if (!text) {
return text;
}
if (!MODEL_SPECIAL_TOKEN_RE_TEST.test(text)) {
return text;
}
return text.replace(MODEL_SPECIAL_TOKEN_RE, " ").replace(/ +/g, " ").trim();
}
```
This removes the need for the manual `lastIndex = 0` reset and makes the two distinct usages self-documenting.
How can I resolve this? If you propose a fix, please make it concise.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
* main: (43 commits) docs: add openclaw#42173 to CHANGELOG — strip leaked model control tokens (openclaw#42216) Agents: align onPayload callback and OAuth imports docs: add Tengji (George) Zhang to maintainer table (openclaw#42190) fix: strip leaked model control tokens from user-facing text (openclaw#42173) Changelog: add unreleased March 9 entries chore: add .dev-state to .gitignore (openclaw#41848) fix(agents): avoid duplicate same-provider cooldown probes in fallback runs (openclaw#41711) fix(mattermost): preserve markdown formatting and native tables (openclaw#18655) feat(acp): add resumeSessionId to sessions_spawn for ACP session resume (openclaw#41847) ACPX: bump bundled acpx to 0.1.16 (openclaw#41975) mattermost: fix DM media upload for unprefixed user IDs (openclaw#29925) fix(msteams): use General channel conversation ID as team key for Bot Framework compatibility (openclaw#41838) fix(mattermost): read replyTo param in plugin handleAction send (openclaw#41176) fix(sandbox): pass real workspace to sessions_spawn when workspaceAccess is ro (openclaw#40757) fix(ui): replace Manual RPC text input with sorted method dropdown (openclaw#14967) CI: select Swift 6.2 toolchain for CodeQL (openclaw#41787) fix(agents): forward memory flush write path (openclaw#41761) fix(telegram): move network fallback to resolver-scoped dispatchers (openclaw#40740) fix(security): harden replaceMarkers() to catch space/underscore boundary marker variants (openclaw#35983) fix(web-search): recover OpenRouter Perplexity citations from message annotations (openclaw#40881) ...
…w#42173) Models like GLM-5 and DeepSeek sometimes emit internal delimiter tokens in their responses. Uses generic pattern in the text extraction pipeline, following the same architecture as stripMinimaxToolCallXml. Closes openclaw#40020 Supersedes openclaw#40573 Co-authored-by: imwyvern <100903837+imwyvern@users.noreply.github.com>
…w#42173) Models like GLM-5 and DeepSeek sometimes emit internal delimiter tokens in their responses. Uses generic pattern in the text extraction pipeline, following the same architecture as stripMinimaxToolCallXml. Closes openclaw#40020 Supersedes openclaw#40573 Co-authored-by: imwyvern <100903837+imwyvern@users.noreply.github.com>
…w#42173) Models like GLM-5 and DeepSeek sometimes emit internal delimiter tokens in their responses. Uses generic pattern in the text extraction pipeline, following the same architecture as stripMinimaxToolCallXml. Closes openclaw#40020 Supersedes openclaw#40573 Co-authored-by: imwyvern <100903837+imwyvern@users.noreply.github.com>
…w#42173) Models like GLM-5 and DeepSeek sometimes emit internal delimiter tokens in their responses. Uses generic pattern in the text extraction pipeline, following the same architecture as stripMinimaxToolCallXml. Closes openclaw#40020 Supersedes openclaw#40573 Co-authored-by: imwyvern <100903837+imwyvern@users.noreply.github.com>
…w#42173) Models like GLM-5 and DeepSeek sometimes emit internal delimiter tokens in their responses. Uses generic pattern in the text extraction pipeline, following the same architecture as stripMinimaxToolCallXml. Closes openclaw#40020 Supersedes openclaw#40573 Co-authored-by: imwyvern <100903837+imwyvern@users.noreply.github.com>
…w#42173) Models like GLM-5 and DeepSeek sometimes emit internal delimiter tokens in their responses. Uses generic pattern in the text extraction pipeline, following the same architecture as stripMinimaxToolCallXml. Closes openclaw#40020 Supersedes openclaw#40573 Co-authored-by: imwyvern <100903837+imwyvern@users.noreply.github.com>
…w#42173) Models like GLM-5 and DeepSeek sometimes emit internal delimiter tokens in their responses. Uses generic pattern in the text extraction pipeline, following the same architecture as stripMinimaxToolCallXml. Closes openclaw#40020 Supersedes openclaw#40573 Co-authored-by: imwyvern <100903837+imwyvern@users.noreply.github.com>
…w#42173) Models like GLM-5 and DeepSeek sometimes emit internal delimiter tokens in their responses. Uses generic pattern in the text extraction pipeline, following the same architecture as stripMinimaxToolCallXml. Closes openclaw#40020 Supersedes openclaw#40573 Co-authored-by: imwyvern <100903837+imwyvern@users.noreply.github.com>
…w#42173) Models like GLM-5 and DeepSeek sometimes emit internal delimiter tokens in their responses. Uses generic pattern in the text extraction pipeline, following the same architecture as stripMinimaxToolCallXml. Closes openclaw#40020 Supersedes openclaw#40573 Co-authored-by: imwyvern <100903837+imwyvern@users.noreply.github.com>
Summary
Rework of #40573 by @imwyvern — same fix, different architecture.
Models like GLM-5 and DeepSeek sometimes emit internal delimiter tokens (e.g.
<|assistant|>,<|tool_call_result_begin|>,<|begin▁of▁sentence|>) in their responses. This is a provider bug (#40020).What changed vs #40573:
<[||]...[||]>) instead of a brittle allowlist — all<|...|>tokens are internal control tokens by convention, no legitimate user text uses this formatstripModelSpecialTokens()inpi-embedded-utils.tsnext tostripMinimaxToolCallXml(), following the same provider-bug-defense patternextractAssistantText(embedded output) andsanitizeTextContent(session replay)errors.ts— original PR's changes there reverted (wrong layer, would affect error messages)Changes
src/agents/pi-embedded-utils.ts— newstripModelSpecialTokens(), wired intoextractAssistantTextsanitization chainsrc/agents/tools/sessions-helpers.ts— same function wired intosanitizeTextContentsrc/agents/pi-embedded-utils.strip-model-special-tokens.test.ts— 4 focused tests (core stripping, full-width variant, HTML false-positive guard, passthrough)Closes #40020
Supersedes #40573
Co-authored-by: imwyvern 100903837+imwyvern@users.noreply.github.com
Test plan
vitest run pi-embedded-utils.strip-model-special-tokens— 4/4 pass