fix: strip leaked model control tokens from user-facing text by odysseus0 · Pull Request #42173 · openclaw/openclaw

odysseus0 · 2026-03-10T13:25:21Z

Summary

Rework of #40573 by @imwyvern — same fix, different architecture.

Models like GLM-5 and DeepSeek sometimes emit internal delimiter tokens (e.g. <|assistant|>, <|tool_call_result_begin|>, <｜begin▁of▁sentence｜>) in their responses. This is a provider bug (#40020).

What changed vs #40573:

Generic pattern (<[|｜]...[|｜]>) instead of a brittle allowlist — all <|...|> tokens are internal control tokens by convention, no legitimate user text uses this format
Right architectural layer — stripModelSpecialTokens() in pi-embedded-utils.ts next to stripMinimaxToolCallXml(), following the same provider-bug-defense pattern
Both call sites covered — extractAssistantText (embedded output) and sanitizeTextContent (session replay)
No changes to errors.ts — original PR's changes there reverted (wrong layer, would affect error messages)

Changes

src/agents/pi-embedded-utils.ts — new stripModelSpecialTokens(), wired into extractAssistantText sanitization chain
src/agents/tools/sessions-helpers.ts — same function wired into sanitizeTextContent
src/agents/pi-embedded-utils.strip-model-special-tokens.test.ts — 4 focused tests (core stripping, full-width variant, HTML false-positive guard, passthrough)

Closes #40020
Supersedes #40573

Co-authored-by: imwyvern 100903837+imwyvern@users.noreply.github.com

Test plan

vitest run pi-embedded-utils.strip-model-special-tokens — 4/4 pass
Full suite: 881/882 (pre-existing unrelated failure)
Build passes, linter passes

…w#40020) Models like GLM-5 and DeepSeek sometimes emit internal delimiter tokens (e.g. `<|assistant|>`, `<｜begin▁of▁sentence｜>`) in their responses. These use the universal `<|...|>` convention and should never reach users. Uses a generic pattern `<[|｜]...[|｜]>` instead of a brittle allowlist, following the same architecture as `stripMinimaxToolCallXml` — provider bug defense in the text extraction pipeline. Closes openclaw#40020 Co-authored-by: imwyvern <100903837+imwyvern@users.noreply.github.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8d5b542ee6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-10T13:28:49Z

src/agents/pi-embedded-utils.ts

+ * @see https://github.com/openclaw/openclaw/issues/40020
+ */
+// Match both ASCII pipe <|...|> and full-width pipe <｜...｜> (U+FF5C) variants.
+const MODEL_SPECIAL_TOKEN_RE = /<[|｜][^|｜]*[|｜]>/g;


Restrict token matcher to avoid removing legitimate <|...|> text

The new MODEL_SPECIAL_TOKEN_RE is broad enough to treat any <|...|> span as a control token, so normal assistant output that contains language syntax or examples using <| and |> (for example f <| x |> g) will be stripped and corrupted before display/replay via extractAssistantText and sanitizeTextContent. Because this sanitizer now runs on all user-facing assistant/tool text, the current pattern can delete real content that is not a provider control token.

Useful? React with 👍 / 👎.

greptile-apps · 2026-03-10T13:32:54Z

Greptile Summary

This PR adds stripModelSpecialTokens() to sanitize control tokens (e.g. <|assistant|>, <｜begin▁of▁sentence｜>) leaked by GLM-5 and DeepSeek models before they reach end users. The fix is placed at the correct architectural layer alongside stripMinimaxToolCallXml() and is wired into both extractAssistantText (embedded output) and sanitizeTextContent (session replay).

The regex /<[|｜][^|｜]*[|｜]>/g correctly handles both ASCII and full-width pipe variants and avoids false-positives on plain HTML tags.
The g-flagged module-level constant is handled correctly (explicit lastIndex = 0 reset after a successful test()), but the pattern is a JavaScript footgun — splitting into a non-global test regex and a separate global replace regex would make the intent clearer and eliminate the manual reset.
Tests cover the main cases: ASCII tokens, full-width tokens, HTML passthrough, and clean passthrough. The trim behaviour on token-containing strings is implicitly covered.

Confidence Score: 4/5

Safe to merge — the logic is correct and the fix is well-scoped; only a minor stylistic concern about the stateful global regex constant.
The implementation correctly handles the lastIndex state of the global regex (explicit reset after a successful test(), automatic reset after a failed one), the sanitization pipeline ordering is sound, and the HTML false-positive guard in the regex is verified by tests. The one deduction is for the fragile module-level stateful regex pattern that could silently break if a future contributor adds another test()/exec() call without the corresponding reset.
src/agents/pi-embedded-utils.ts — specifically lines 49–59 where the module-level global regex and its lastIndex reset live.

_{Last reviewed commit: 8d5b542}

greptile-apps · 2026-03-10T13:32:58Z

src/agents/pi-embedded-utils.ts

+const MODEL_SPECIAL_TOKEN_RE = /<[|｜][^|｜]*[|｜]>/g;
+
+export function stripModelSpecialTokens(text: string): string {
+  if (!text) {
+    return text;
+  }
+  if (!MODEL_SPECIAL_TOKEN_RE.test(text)) {
+    return text;
+  }
+  MODEL_SPECIAL_TOKEN_RE.lastIndex = 0;
+  return text.replace(MODEL_SPECIAL_TOKEN_RE, " ").replace(/  +/g, " ").trim();


Stateful global regex on module-level constant

The g flag makes MODEL_SPECIAL_TOKEN_RE stateful via lastIndex. The current code handles this correctly — when test() returns true, lastIndex is explicitly reset to 0 before calling replace(); when test() returns false, JavaScript automatically resets lastIndex to 0. So there is no bug here.

However, this pattern is a well-known JavaScript footgun. If a future change adds another test() / exec() call against this constant without a corresponding lastIndex reset, it will silently skip matches for every other call on the same string. Consider making the intent explicit with a comment, or avoid the footgun entirely by splitting into a non-global test regex and a local regex literal in replace():

Suggested change

const MODEL_SPECIAL_TOKEN_RE = /<[|｜][^|｜]*[|｜]>/g;

export function stripModelSpecialTokens(text: string): string {

if (!text) {

return text;

}

if (!MODEL_SPECIAL_TOKEN_RE.test(text)) {

return text;

}

MODEL_SPECIAL_TOKEN_RE.lastIndex = 0;

return text.replace(MODEL_SPECIAL_TOKEN_RE, " ").replace(/ +/g, " ").trim();

// Match both ASCII pipe <|...|> and full-width pipe <｜...｜> (U+FF5C) variants.

// Note: no /g flag here — used only for the fast-path existence check (stateless).

const MODEL_SPECIAL_TOKEN_RE_TEST = /<[|｜][^|｜]*[|｜]>/;

// Separate global regex for replace(); defined here to avoid recompiling on every call.

const MODEL_SPECIAL_TOKEN_RE = /<[|｜][^|｜]*[|｜]>/g;

export function stripModelSpecialTokens(text: string): string {

if (!text) {

return text;

}

if (!MODEL_SPECIAL_TOKEN_RE_TEST.test(text)) {

return text;

}

return text.replace(MODEL_SPECIAL_TOKEN_RE, " ").replace(/ +/g, " ").trim();

}

This removes the need for the manual lastIndex = 0 reset and makes the two distinct usages self-documenting.

Prompt To Fix With AI

This is a comment left during a code review. Path: src/agents/pi-embedded-utils.ts Line: 49-59 Comment: **Stateful global regex on module-level constant** The `g` flag makes `MODEL_SPECIAL_TOKEN_RE` stateful via `lastIndex`. The current code handles this correctly — when `test()` returns `true`, `lastIndex` is explicitly reset to `0` before calling `replace()`; when `test()` returns `false`, JavaScript automatically resets `lastIndex` to `0`. So there is no bug here. However, this pattern is a well-known JavaScript footgun. If a future change adds another `test()` / `exec()` call against this constant without a corresponding `lastIndex` reset, it will silently skip matches for every other call on the same string. Consider making the intent explicit with a comment, or avoid the footgun entirely by splitting into a non-global test regex and a local regex literal in `replace()`: ```suggestion // Match both ASCII pipe <|...|> and full-width pipe <｜...｜> (U+FF5C) variants. // Note: no /g flag here — used only for the fast-path existence check (stateless). const MODEL_SPECIAL_TOKEN_RE_TEST = /<[|｜][^|｜]*[|｜]>/; // Separate global regex for replace(); defined here to avoid recompiling on every call. const MODEL_SPECIAL_TOKEN_RE = /<[|｜][^|｜]*[|｜]>/g; export function stripModelSpecialTokens(text: string): string { if (!text) { return text; } if (!MODEL_SPECIAL_TOKEN_RE_TEST.test(text)) { return text; } return text.replace(MODEL_SPECIAL_TOKEN_RE, " ").replace(/ +/g, " ").trim(); } ``` This removes the need for the manual `lastIndex = 0` reset and makes the two distinct usages self-documenting. How can I resolve this? If you propose a fix, please make it concise.

_{Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!}

@imwyvern

…kens Thanks @imwyvern.

@imwyvern

…2216) Thanks @imwyvern.

* main: (43 commits) docs: add openclaw#42173 to CHANGELOG — strip leaked model control tokens (openclaw#42216) Agents: align onPayload callback and OAuth imports docs: add Tengji (George) Zhang to maintainer table (openclaw#42190) fix: strip leaked model control tokens from user-facing text (openclaw#42173) Changelog: add unreleased March 9 entries chore: add .dev-state to .gitignore (openclaw#41848) fix(agents): avoid duplicate same-provider cooldown probes in fallback runs (openclaw#41711) fix(mattermost): preserve markdown formatting and native tables (openclaw#18655) feat(acp): add resumeSessionId to sessions_spawn for ACP session resume (openclaw#41847) ACPX: bump bundled acpx to 0.1.16 (openclaw#41975) mattermost: fix DM media upload for unprefixed user IDs (openclaw#29925) fix(msteams): use General channel conversation ID as team key for Bot Framework compatibility (openclaw#41838) fix(mattermost): read replyTo param in plugin handleAction send (openclaw#41176) fix(sandbox): pass real workspace to sessions_spawn when workspaceAccess is ro (openclaw#40757) fix(ui): replace Manual RPC text input with sorted method dropdown (openclaw#14967) CI: select Swift 6.2 toolchain for CodeQL (openclaw#41787) fix(agents): forward memory flush write path (openclaw#41761) fix(telegram): move network fallback to resolver-scoped dispatchers (openclaw#40740) fix(security): harden replaceMarkers() to catch space/underscore boundary marker variants (openclaw#35983) fix(web-search): recover OpenRouter Perplexity citations from message annotations (openclaw#40881) ...

…w#42173) Models like GLM-5 and DeepSeek sometimes emit internal delimiter tokens in their responses. Uses generic pattern in the text extraction pipeline, following the same architecture as stripMinimaxToolCallXml. Closes openclaw#40020 Supersedes openclaw#40573 Co-authored-by: imwyvern <100903837+imwyvern@users.noreply.github.com>

@imwyvern

…kens (openclaw#42216) Thanks @imwyvern.

…w#42173) Models like GLM-5 and DeepSeek sometimes emit internal delimiter tokens in their responses. Uses generic pattern in the text extraction pipeline, following the same architecture as stripMinimaxToolCallXml. Closes openclaw#40020 Supersedes openclaw#40573 Co-authored-by: imwyvern <100903837+imwyvern@users.noreply.github.com>

@imwyvern

…kens (openclaw#42216) Thanks @imwyvern.

…w#42173) Models like GLM-5 and DeepSeek sometimes emit internal delimiter tokens in their responses. Uses generic pattern in the text extraction pipeline, following the same architecture as stripMinimaxToolCallXml. Closes openclaw#40020 Supersedes openclaw#40573 Co-authored-by: imwyvern <100903837+imwyvern@users.noreply.github.com>

@imwyvern

…kens (openclaw#42216) Thanks @imwyvern.

…w#42173) Models like GLM-5 and DeepSeek sometimes emit internal delimiter tokens in their responses. Uses generic pattern in the text extraction pipeline, following the same architecture as stripMinimaxToolCallXml. Closes openclaw#40020 Supersedes openclaw#40573 Co-authored-by: imwyvern <100903837+imwyvern@users.noreply.github.com>

@imwyvern

…kens (openclaw#42216) Thanks @imwyvern.

…w#42173) Models like GLM-5 and DeepSeek sometimes emit internal delimiter tokens in their responses. Uses generic pattern in the text extraction pipeline, following the same architecture as stripMinimaxToolCallXml. Closes openclaw#40020 Supersedes openclaw#40573 Co-authored-by: imwyvern <100903837+imwyvern@users.noreply.github.com>

@imwyvern

…kens (openclaw#42216) Thanks @imwyvern.

…w#42173) Models like GLM-5 and DeepSeek sometimes emit internal delimiter tokens in their responses. Uses generic pattern in the text extraction pipeline, following the same architecture as stripMinimaxToolCallXml. Closes openclaw#40020 Supersedes openclaw#40573 Co-authored-by: imwyvern <100903837+imwyvern@users.noreply.github.com>

@imwyvern

…kens (openclaw#42216) Thanks @imwyvern.

…w#42173) Models like GLM-5 and DeepSeek sometimes emit internal delimiter tokens in their responses. Uses generic pattern in the text extraction pipeline, following the same architecture as stripMinimaxToolCallXml. Closes openclaw#40020 Supersedes openclaw#40573 Co-authored-by: imwyvern <100903837+imwyvern@users.noreply.github.com>

@imwyvern

…kens (openclaw#42216) Thanks @imwyvern.

…w#42173) Models like GLM-5 and DeepSeek sometimes emit internal delimiter tokens in their responses. Uses generic pattern in the text extraction pipeline, following the same architecture as stripMinimaxToolCallXml. Closes openclaw#40020 Supersedes openclaw#40573 Co-authored-by: imwyvern <100903837+imwyvern@users.noreply.github.com>

@imwyvern

…kens (openclaw#42216) Thanks @imwyvern.

odysseus0 mentioned this pull request Mar 10, 2026

fix(sanitize): strip GLM/DeepSeek special tokens from user-facing text #40573

Closed

openclaw-barnacle bot added agents Agent runtime and tooling size: S labels Mar 10, 2026

odysseus0 merged commit 309162f into openclaw:main Mar 10, 2026
30 checks passed

chatgpt-codex-connector bot reviewed Mar 10, 2026

View reviewed changes

greptile-apps bot reviewed Mar 10, 2026

View reviewed changes

odysseus0 added a commit to odysseus0/openclaw that referenced this pull request Mar 10, 2026

docs: add openclaw#42173 to CHANGELOG — strip leaked model control to…

356894f

…kens Thanks @imwyvern.

odysseus0 mentioned this pull request Mar 10, 2026

docs: add #42173 to CHANGELOG #42216

Merged

odysseus0 added a commit that referenced this pull request Mar 10, 2026

docs: add #42173 to CHANGELOG — strip leaked model control tokens (#4…

f50fc29

…2216) Thanks @imwyvern.

github-actions bot mentioned this pull request Mar 10, 2026

📡 Upstream Digest — 2026-03-10 14:42 UTC curtismercier/openclaw-mods#227

Open

This was referenced Mar 10, 2026

Cherry-pick: Brave snippets fix, memory flush, web-search citations, OAuth imp remoteclaw/remoteclaw#929

Open

Cherry-pick: Telegram duplicate fix, memory flush changelog, Swift CI, PR revi remoteclaw/remoteclaw#930

Closed

Moshiii pushed a commit to Moshiii/openclaw that referenced this pull request Mar 11, 2026

docs: add openclaw#42173 to CHANGELOG — strip leaked model control to…

b97781b

…kens (openclaw#42216) Thanks @imwyvern.

Moshiii pushed a commit to Moshiii/openclaw that referenced this pull request Mar 11, 2026

docs: add openclaw#42173 to CHANGELOG — strip leaked model control to…

52df005

…kens (openclaw#42216) Thanks @imwyvern.

frankekn pushed a commit to MoerAI/openclaw that referenced this pull request Mar 11, 2026

docs: add openclaw#42173 to CHANGELOG — strip leaked model control to…

cde41fc

…kens (openclaw#42216) Thanks @imwyvern.

frankekn pushed a commit to Effet/openclaw that referenced this pull request Mar 11, 2026

docs: add openclaw#42173 to CHANGELOG — strip leaked model control to…

4434685

…kens (openclaw#42216) Thanks @imwyvern.

frankekn pushed a commit to ImLukeF/openclaw that referenced this pull request Mar 11, 2026

docs: add openclaw#42173 to CHANGELOG — strip leaked model control to…

0e0583d

…kens (openclaw#42216) Thanks @imwyvern.

dhoman pushed a commit to dhoman/chrono-claw that referenced this pull request Mar 11, 2026

docs: add openclaw#42173 to CHANGELOG — strip leaked model control to…

8f912b2

…kens (openclaw#42216) Thanks @imwyvern.

ahelpercn pushed a commit to ahelpercn/openclaw that referenced this pull request Mar 12, 2026

docs: add openclaw#42173 to CHANGELOG — strip leaked model control to…

68a44fa

…kens (openclaw#42216) Thanks @imwyvern.

Ruijie-Ysp pushed a commit to Ruijie-Ysp/clawdbot that referenced this pull request Mar 12, 2026

docs: add openclaw#42173 to CHANGELOG — strip leaked model control to…

11869b9

…kens (openclaw#42216) Thanks @imwyvern.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: strip leaked model control tokens from user-facing text#42173

fix: strip leaked model control tokens from user-facing text#42173
odysseus0 merged 1 commit intoopenclaw:mainfrom
odysseus0:fix/glm-special-token-sanitize

odysseus0 commented Mar 10, 2026

Uh oh!

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

chatgpt-codex-connector bot Mar 10, 2026

Uh oh!

greptile-apps bot commented Mar 10, 2026

Uh oh!

greptile-apps bot Mar 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

odysseus0 commented Mar 10, 2026

Summary

Changes

Test plan

Uh oh!

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Mar 10, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot commented Mar 10, 2026

Greptile Summary

Confidence Score: 4/5

Uh oh!

greptile-apps bot Mar 10, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant