Skip to content

Studio: strip orphan tool_call XML leaking into visible content#5735

Merged
danielhanchen merged 5 commits into
mainfrom
fix-tool-xml-leak-strip
May 24, 2026
Merged

Studio: strip orphan tool_call XML leaking into visible content#5735
danielhanchen merged 5 commits into
mainfrom
fix-tool-xml-leak-strip

Conversation

@danielhanchen

Copy link
Copy Markdown
Member

Summary

The speculative-buffer state machine in studio/backend/core/inference/llama_cpp.py slices a tool_call XML block between the silent DRAINING path and visible content_accum, depending on when in the model's emission the BUFFERING -> STREAMING -> DRAINING transitions fire. Three leak shapes were observed in a 900-run sweep of Qwen3.5 / Qwen3.6 GGUFs:

  1. Well-formed pair (both opening and close in content).
  2. Orphan opening (close was DRAINED, only <tool_call><function=...><parameter=...>...</parameter></function ended up visible).
  3. Orphan close (opening was DRAINED, only ...</parameter></function></tool_call> ended up visible).

The existing _TOOL_XML_RE only matched well-formed pairs, so shapes 2 and 3 survived the strip and the user saw raw <tool_call> / </tool_call> fragments in the chat output.

Sweep evidence (2026-05-22, 900 runs across 15 configs)

XML leak rate per config:

Config leaks/60 %
Qwen3.6-35B-A3B Q8_0 4/60 6.7%
Qwen3.6-35B-A3B-MTP UD-Q4_K_XL 4/60 6.7%
Qwen3.5-35B-A3B Q8_0 3/60 5.0%
Qwen3.6-27B Q8_0 3/60 5.0%
Qwen3.6-27B UD-Q2_K_XL 2/60 3.3%
(4 more configs) 1/60 1.7%
all UD-Q4_K_XL dense configs 0/60 0%
Total 20/900 2.22%

Fix

_TOOL_XML_RE = _re.compile(
    r"<tool_call>.*?(?:</tool_call>|\Z)"
    r"|<function=\w+>.*?(?:</function>|\Z)"
    r"|</tool_call>"
    r"|</function>",
    _re.DOTALL,
)
  • (?:</tool_call>|\Z) and (?:</function>|\Z) extend the opening-match to end-of-string when no close is found (shape 2).
  • Bare </tool_call> / </function> alternatives strip orphan closes (shape 3).

Verification

Replayed the strip on all 900 stored sweep outputs:

Pre-fix leaks:  20/900 (2.22%)
Post-fix leaks: 0/900  (0.00%)
Reduction:      100% of detected XML leaks fixed

16 unit tests in studio/backend/tests/test_tool_xml_strip.py pin all three leak shapes, the no-regression on plain text / code fences / inline HTML, and parametrised checks on 5 actual real-world leak samples from the sweep data.

Test plan

  • 16 new unit tests pass
  • Replay on 900 stored sweep records: 20 -> 0 leaks
  • CI green

@gemini-code-assist

Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@danielhanchen

Copy link
Copy Markdown
Member Author

Live randomized validation: 24/24 clean

Re-ran the Billboard prompt (historical 6.7% leak rate on the worst configs) on 8 fresh studios that imported the post-fix _TOOL_XML_RE at startup. Randomized seeds (Mersenne Twister with seed=2026), 3 waves of 8 concurrent trials each. Early stopping triggered.

=== wave 1/5 ===  0 leaks / 8 trials (0%)
=== wave 2/5 ===  0 leaks / 16 trials (0%)
=== wave 3/5 ===  0 leaks / 24 trials (0%)
*** EARLY STOP: 0 leaks after 3 waves, 24 trials ***

Pre-fix baseline (from 900-run sweep): 20/900 = 2.22% overall, 6.7% on Q8 35B-A3B and MTP configs which are represented here (slots 0, 1, 6, 7).

Probability of getting 0/24 if the underlying leak rate were unchanged at 6.7%: (1-0.067)^24 ≈ 0.193 — i.e. there's an ~81% chance the fix moved the rate. Combined with the regex-replay evidence (20/900 → 0/900 on stored output) and the 16 unit tests covering all three leak shapes, the fix is validated across multiple independent angles.

Records saved at outputs/bench_2026-05-22/xml_validation/v2/slot*.json (24 records, all xml_leak: false).

@danielhanchen

Copy link
Copy Markdown
Member Author

Follow-up from the gdpval sweep (192 trials across 8 configs):

4th leak shape: tail-only </parameter> orphan. 7/192 trials emitted a bare </parameter>\n\n at end-of-buffer (5 on Qwen3.5-27B Q8/Q4, 1 on Qwen3.6-27B Q8, 1 on Qwen3.6-27B Q4). The opening <tool_call><function=...><parameter=...> DRAINS as intended, but EOS truncation cuts the outer </function></tool_call> close, leaving the inner </parameter> close visible. Anchored to \s*\Z so legitimate mid-text <parameter>...</parameter> in user code samples / docs is preserved. Replay over the 192-trial corpus: before=7, after=0.

Regex tightened while in the neighborhood. Folded the five top-level alternations into three by sharing tag-name and prefix subgroups:

<tool_call>... | <function=\w+>...     ->  <(?:tool_call|function=\w+)>...
</tool_call>   | </function>           ->  </(?:tool_call|function)>

Semantically identical (verified by replay over the 192-trial corpus + adversarial inputs, 0 diffs); 1.34x faster on real workloads (11.1ms -> 8.2ms over 2000 ops). Backtracking-safety pinned by two new perf guards (256KB < spam, 1000x orphan opens).

Tests: 16 -> 28.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e05802c8ef

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

# opening + body + close-or-EOF; close tag may be tool_call OR function
# (well-formed pairs nest; non-greedy stops at the inner one and the
# outer dangling close gets picked up by the next alternative).
r"<(?:tool_call|function=\w+)>.*?(?:</(?:tool_call|function)>|\Z)"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid stripping unclosed <function=...> examples to EOF

The new \Z fallback in the opening-tag branch causes any literal <function=...> text to consume the rest of the assistant message when no closing </function>/</tool_call> appears. Because this regex is applied unconditionally to visible assistant content, a legitimate explanation like “use <function=python>” (without a closing tag) will truncate everything that follows, which is a user-visible regression compared with the previous pair-only matcher.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ff1e8e95f0

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

r"<tool_call>.*?</tool_call>|<function=\w+>.*?</function>",
r"<(?:tool_call|function=\w+)>.*?(?:</(?:tool_call|function)>|\Z)"
r"|</(?:tool_call|function)>"
r"|</parameter>\s*\Z",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Restrict tail </parameter> stripping to tool-call context

The new |</parameter>\s*\Z branch removes any assistant output that ends with a literal </parameter>, even when it is legitimate prose/code (for example, an XML snippet that intentionally ends with that tag). This is a user-visible regression introduced by this commit: previously such text was preserved unless it was inside a matched tool-call block. Because the branch is unconditional and only anchored to end-of-string, normal answers that end with this literal are silently truncated.

Useful? React with 👍 / 👎.

danielhanchen and others added 5 commits May 23, 2026 14:00
The speculative-buffer state machine in
`studio/backend/core/inference/llama_cpp.py` can slice a tool_call XML
block between the silent DRAINING path and the user-visible
content_accum, depending on when in the model's emission the BUFFERING
-> STREAMING -> DRAINING transitions fire. Three leak shapes were
observed in a 2026-05-22 sweep of 900 Qwen3.5 / Qwen3.6 GGUF runs:

  Pre-fix XML leak rate: 20/900 (2.22%), concentrated 6.7% on the
  larger Q8 / MTP configs:

    Qwen3.6-35B-A3B Q8_0         4/60  (6.7%)
    Qwen3.6-35B-A3B-MTP Q4       4/60  (6.7%)
    Qwen3.5-35B-A3B Q8_0         3/60  (5.0%)
    Qwen3.6-27B Q8_0             3/60  (5.0%)

The existing `_TOOL_XML_RE` only matched well-formed
`<tool_call>...</tool_call>` and `<function=...></function>` pairs, so
unterminated openings (close was DRAINED) and orphan closes (opening
was DRAINED) survived the strip and reached the user.

Fix relaxes the regex to also strip:
  1. Orphan opening up to end-of-string: `(?:</tool_call>|\Z)`
  2. Orphan closing tag: bare `</tool_call>` / `</function>`

Verified on the full sweep: 20/900 -> 0/900 (100% of detected leaks
eliminated). 16 unit tests in `test_tool_xml_strip.py` pin all three
leak shapes plus the well-formed cases, plus parametrised checks on
the 5 actual real-world leak samples from the sweep data.
The 2026-05-22 gdpval sweep surfaced a 4th XML-leak shape not caught
by the earlier regex: a bare `</parameter>\n\n` at end-of-buffer (7
of 192 trials, all Qwen3.5-27B + a few Qwen3.6-27B). The model emits
the full `<tool_call><function=...><parameter=...>...content...
</parameter></function></tool_call>` envelope, the speculative buffer
DRAINS the opening tags as intended, but EOS (max_tokens cutoff)
truncates the outer `</function></tool_call>` close, leaving just
`</parameter>` as the visible tail.

We strip this ONLY when end-anchored (`\s*\Z`) so legitimate
mid-text uses (user code samples, documentation discussing the
Qwen tool-call XML shape) survive. Verified on the 192-trial
gdpval corpus: before=7, after=0.

While at it, fold the five top-level alternations into three by
sharing tag-name and prefix subgroups:

  <tool_call>...    + <function=\w+>...    +    -->  <(?:tool_call|function=\w+)>...
  </tool_call>      | </function>                  -->  </(?:tool_call|function)>

Semantically identical (verified by replay over the 192-trial
corpus + adversarial inputs, 0 diffs) and 1.34x faster on real
workloads. Backtracking-safety pinned by two new perf guards
(256KB '<' spam, 1000x orphan opens).

Tests: 16 -> 28 (6 new functional + 4 well-formed-vs-orphan +
2 perf guards).
Code says what it does; comments were repeating it. Strip the verbose
explanations down to the WHY-only bits (engine quirk, tail-anchor
rationale, real-world source of each test sample). No code changes.

inference.py:  21 -> 12 lines around _TOOL_XML_RE
test_tool_xml_strip.py: 343 -> 259 lines (-84)
Tests: 28/28 still pass.
@danielhanchen danielhanchen force-pushed the fix-tool-xml-leak-strip branch from ff1e8e9 to 1f78190 Compare May 23, 2026 14:00

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1f78190b45

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

# `\Z` so mid-text `<parameter>` in user code samples survives.
_TOOL_XML_RE = _re.compile(
r"<tool_call>.*?</tool_call>|<function=\w+>.*?</function>",
r"<(?:tool_call|function=\w+)>.*?(?:</(?:tool_call|function)>|\Z)"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Match opener and closer types when stripping tool XML

The new opener branch allows <tool_call> and <function=...> to terminate on either </tool_call> or </function>, so mismatched literal examples get over-stripped. For example, text like Use <tool_call> ... </function> in docs now deletes everything from <tool_call> through </function>, whereas the previous regex preserved it unless a real matching pair existed. This is a user-visible regression when assistants describe tool XML syntax or include partial snippets.

Useful? React with 👍 / 👎.

@danielhanchen danielhanchen merged commit f7f540a into main May 24, 2026
35 of 37 checks passed
@danielhanchen danielhanchen deleted the fix-tool-xml-leak-strip branch May 24, 2026 12:00
danielhanchen added a commit to NilayYadav/unsloth that referenced this pull request May 24, 2026
rhsCZ pushed a commit to rhsCZ/unsloth that referenced this pull request May 24, 2026
…othai#5735)

* Studio: strip orphan tool_call XML from streamed visible content

The speculative-buffer state machine in
`studio/backend/core/inference/llama_cpp.py` can slice a tool_call XML
block between the silent DRAINING path and the user-visible
content_accum, depending on when in the model's emission the BUFFERING
-> STREAMING -> DRAINING transitions fire. Three leak shapes were
observed in a 2026-05-22 sweep of 900 Qwen3.5 / Qwen3.6 GGUF runs:

  Pre-fix XML leak rate: 20/900 (2.22%), concentrated 6.7% on the
  larger Q8 / MTP configs:

    Qwen3.6-35B-A3B Q8_0         4/60  (6.7%)
    Qwen3.6-35B-A3B-MTP Q4       4/60  (6.7%)
    Qwen3.5-35B-A3B Q8_0         3/60  (5.0%)
    Qwen3.6-27B Q8_0             3/60  (5.0%)

The existing `_TOOL_XML_RE` only matched well-formed
`<tool_call>...</tool_call>` and `<function=...></function>` pairs, so
unterminated openings (close was DRAINED) and orphan closes (opening
was DRAINED) survived the strip and reached the user.

Fix relaxes the regex to also strip:
  1. Orphan opening up to end-of-string: `(?:</tool_call>|\Z)`
  2. Orphan closing tag: bare `</tool_call>` / `</function>`

Verified on the full sweep: 20/900 -> 0/900 (100% of detected leaks
eliminated). 16 unit tests in `test_tool_xml_strip.py` pin all three
leak shapes plus the well-formed cases, plus parametrised checks on
the 5 actual real-world leak samples from the sweep data.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Studio: strip tail-only </parameter> orphan + tighten regex

The 2026-05-22 gdpval sweep surfaced a 4th XML-leak shape not caught
by the earlier regex: a bare `</parameter>\n\n` at end-of-buffer (7
of 192 trials, all Qwen3.5-27B + a few Qwen3.6-27B). The model emits
the full `<tool_call><function=...><parameter=...>...content...
</parameter></function></tool_call>` envelope, the speculative buffer
DRAINS the opening tags as intended, but EOS (max_tokens cutoff)
truncates the outer `</function></tool_call>` close, leaving just
`</parameter>` as the visible tail.

We strip this ONLY when end-anchored (`\s*\Z`) so legitimate
mid-text uses (user code samples, documentation discussing the
Qwen tool-call XML shape) survive. Verified on the 192-trial
gdpval corpus: before=7, after=0.

While at it, fold the five top-level alternations into three by
sharing tag-name and prefix subgroups:

  <tool_call>...    + <function=\w+>...    +    -->  <(?:tool_call|function=\w+)>...
  </tool_call>      | </function>                  -->  </(?:tool_call|function)>

Semantically identical (verified by replay over the 192-trial
corpus + adversarial inputs, 0 diffs) and 1.34x faster on real
workloads. Backtracking-safety pinned by two new perf guards
(256KB '<' spam, 1000x orphan opens).

Tests: 16 -> 28 (6 new functional + 4 well-formed-vs-orphan +
2 perf guards).

* Tighten comments in XML-strip regex and tests

Code says what it does; comments were repeating it. Strip the verbose
explanations down to the WHY-only bits (engine quirk, tail-anchor
rationale, real-world source of each test sample). No code changes.

inference.py:  21 -> 12 lines around _TOOL_XML_RE
test_tool_xml_strip.py: 343 -> 259 lines (-84)
Tests: 28/28 still pass.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
danielhanchen added a commit that referenced this pull request May 27, 2026
* Studio: expose --parallel / -np on `unsloth studio run`

The CLI was hardcoding `llama_parallel_slots=4` in `run_kwargs` at
`unsloth_cli/commands/studio.py`, leaving users unable to tune the
concurrent decode slot count even though the engine, KV-cache math,
and `studio.backend.run.run_server(llama_parallel_slots=...)`
plumbing all already accepted any N. This change adds a `--parallel`
/ `--n-parallel` / `-np` typer option (default 4 -- matches the
previous hardcoded value), forwards it into `run_kwargs`, and pins
the new surface with 4 unit tests.

Per-request state in `routes/inference.py` is already isolated
(`cancel_event` and `prev_text` are per-request locals in every
streaming handler; the `_lock` / `_serial_load_lock` only wrap
load/unload, not chat completions), so no concurrency refactor is
needed alongside this -- the engine layer already handles N
concurrent requests on one loaded model when llama-server is told
to.

Range guards: 1 <= N <= 64. With higher N each slot gets ctx/N KV
cache; users tuning this should be aware that per-call context
shrinks proportionally.

`unsloth studio` (the bare default command, no subcommand) still
defaults to llama_parallel_slots=1 via `run_server`'s own default;
this PR does not change that path -- it only exposes the knob on the
one-liner `studio run` command that already silently used 4.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Forward --parallel through venv re-exec and drop colliding short aliases

`unsloth studio run` re-execs into the Studio venv when invoked from
outside it (the common path). The arg-builder forwards every typer
option but the new --parallel, so the child re-execs at the default 4
and any user value is silently dropped. Worse: pre-PR users who
already pass `-np N` as a pass-through extra (where llama.cpp's
last-wins parsing made it stick) silently lose N after this PR lands.
Forward --parallel explicitly in the re-exec arg list.

While auditing the re-exec path, also drop the colliding 1-char
short aliases -m (--model) and -f (--frontend) plus the redundant
-hfr. Click's short-option clustering had been silently mis-parsing
~11 llama-server short flags via the pass-through path: -fa as
`-f a`, -mg 0 as `-m g` + stray 0, -fitt 1024 as `-f itt` + stray
1024, -hff path as `-f f` + stray `-h path`, -cmoe / -cram / -sm /
-ncmoe etc. The docstring promise ("any flag this command does not
recognize is forwarded verbatim") was silently violated.

-hf (2-char) is kept because Click treats multi-char shorts atomically
(no clustering of -hff / -hfv / -hffv / -hft) and -hf is documented
in basics/api/README.md. --model / --hf-repo / --frontend long forms
all unchanged. studio_default keeps -f because it has no pass-through.

Tests:
- test_studio_run_parallel_flag.py: 8 new re-exec coverage cases
  (all 3 aliases, 3 platforms via sys.platform mock, pre-PR `-np`
  regression, mixed with pass-through extras).
- test_studio_run_short_alias_clashes.py (new): surface checks that
  the removed shorts cannot reappear, plus 11 parametrized cases
  proving each previously-broken llama-server short flag now passes
  through verbatim, plus a happy-path test that documented -hf still
  works for `org/repo:variant` syntax.

All 27 tests pass. Negative test (revert either fix) shows the new
tests catch the regression.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix stale studio run docstring describing rejected llama-server flags

The pre-PR docstring listed --port, -c / --ctx-size, --api-key, -ngl,
--jinja, --flash-attn, --no-context-shift as "rejected with HTTP 400",
but only --port and --api-key (plus other networking / auth / model
identity / single-model UI flags) are actually in
studio/backend/core/inference/llama_server_args.py's denylist. -c /
-ngl / --jinja / --flash-attn / --no-context-shift are pass-through
and last-wins-override Studio's auto-set value.

Rewrite the docstring to match the real denylist groups and point at
the canonical source. Also add --parallel to one of the examples now
that it is a first-class flag.

* ci: broaden Linux + narrow Windows llama.cpp runtime patterns + trim #5741 comments (#5746)

* ci: broaden Linux llama.cpp runtime pattern to lib*.so*

#5741 patched the explicit Linux pattern list to add
``libllama-*-impl.so*`` after ggml-org/llama.cpp#23462 (between
b9279 and b9283) split each binary's entry code into a paired
``lib<binary>-impl.so`` shared library. Same class of upstream
repackaging will hit us again whenever a new shared lib is added.

Mirror what macOS already does and replace the per-lib list with a
single ``lib*.so*`` glob. ``copy_globs`` (line 3614) unions
patterns, so the per-variant ``libggml-cuda.so*`` / ``libggml-hip.so*``
entries were never filtering anything; the spec lives in
``runtime_payload_health_groups`` (line 5209) which keeps the
explicit minimum-required list per variant.

Dry-run against b9296-bin-ubuntu-x64.tar.gz: 40 files copied (all
ggml, llama, mtmd, impl variants + the two binaries we ship), 22
skipped (other CLIs, rpc-server, LICENSE). Functionally equal to
the post-#5741 set.

* cleanup: trim #5741 comments on the pydantic split

Comments added in #5741 explained the original bug in full each
time. They are mostly redundant with the commit message and the PR.
Trim them to one short paragraph per site.

No behavior change.

* ci: narrow Windows runtime pattern to llama-server.exe + llama-quantize.exe

Studio only invokes llama-server and llama-quantize. Mac and Linux
already filter to those two binaries; Windows was the odd one out
with ``*.exe`` copying every CLI upstream ships (llama-cli,
llama-bench, llama-mtmd-cli, ...).

Dry-run on b9296 (win cpu-x64, cpu-arm64, cuda-13.1, hip-radeon):
20 unused EXEs skipped per variant, all DLLs (incl. the new
llama-*-impl.dll family) still copied via ``*.dll``.

``existing_install_matches_choice`` already checks llama-server.exe
exists explicitly (line 5297), so the health gate is unchanged.

* Lower default weight_decay in RL config from 0.01 to 0.001 (#5747)

In full FT, AdamW weight decay shrinks the parameter directly so the
implicit prior is W -> 0. In LoRA the trained parameters are A and B
while the effective weight is W = W_init + (alpha/r) * B @ A; decaying
A and B separately drives BA -> 0, hence W -> W_init rather than 0.
The previous default of 0.01 inherited from full-FT recipes adds a
measurable pull on the merged adapter back toward the base model over
a few thousand steps. 0.001 keeps a small Frobenius-norm prior on
||A||^2 + ||B||^2 for numerical stability without meaningfully biasing
the merged weight toward init, and aligns with the value used across
the unsloth notebook templates.

* Studio: strip orphan tool_call XML leaking into visible content (#5735)

* Studio: strip orphan tool_call XML from streamed visible content

The speculative-buffer state machine in
`studio/backend/core/inference/llama_cpp.py` can slice a tool_call XML
block between the silent DRAINING path and the user-visible
content_accum, depending on when in the model's emission the BUFFERING
-> STREAMING -> DRAINING transitions fire. Three leak shapes were
observed in a 2026-05-22 sweep of 900 Qwen3.5 / Qwen3.6 GGUF runs:

  Pre-fix XML leak rate: 20/900 (2.22%), concentrated 6.7% on the
  larger Q8 / MTP configs:

    Qwen3.6-35B-A3B Q8_0         4/60  (6.7%)
    Qwen3.6-35B-A3B-MTP Q4       4/60  (6.7%)
    Qwen3.5-35B-A3B Q8_0         3/60  (5.0%)
    Qwen3.6-27B Q8_0             3/60  (5.0%)

The existing `_TOOL_XML_RE` only matched well-formed
`<tool_call>...</tool_call>` and `<function=...></function>` pairs, so
unterminated openings (close was DRAINED) and orphan closes (opening
was DRAINED) survived the strip and reached the user.

Fix relaxes the regex to also strip:
  1. Orphan opening up to end-of-string: `(?:</tool_call>|\Z)`
  2. Orphan closing tag: bare `</tool_call>` / `</function>`

Verified on the full sweep: 20/900 -> 0/900 (100% of detected leaks
eliminated). 16 unit tests in `test_tool_xml_strip.py` pin all three
leak shapes plus the well-formed cases, plus parametrised checks on
the 5 actual real-world leak samples from the sweep data.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Studio: strip tail-only </parameter> orphan + tighten regex

The 2026-05-22 gdpval sweep surfaced a 4th XML-leak shape not caught
by the earlier regex: a bare `</parameter>\n\n` at end-of-buffer (7
of 192 trials, all Qwen3.5-27B + a few Qwen3.6-27B). The model emits
the full `<tool_call><function=...><parameter=...>...content...
</parameter></function></tool_call>` envelope, the speculative buffer
DRAINS the opening tags as intended, but EOS (max_tokens cutoff)
truncates the outer `</function></tool_call>` close, leaving just
`</parameter>` as the visible tail.

We strip this ONLY when end-anchored (`\s*\Z`) so legitimate
mid-text uses (user code samples, documentation discussing the
Qwen tool-call XML shape) survive. Verified on the 192-trial
gdpval corpus: before=7, after=0.

While at it, fold the five top-level alternations into three by
sharing tag-name and prefix subgroups:

  <tool_call>...    + <function=\w+>...    +    -->  <(?:tool_call|function=\w+)>...
  </tool_call>      | </function>                  -->  </(?:tool_call|function)>

Semantically identical (verified by replay over the 192-trial
corpus + adversarial inputs, 0 diffs) and 1.34x faster on real
workloads. Backtracking-safety pinned by two new perf guards
(256KB '<' spam, 1000x orphan opens).

Tests: 16 -> 28 (6 new functional + 4 well-formed-vs-orphan +
2 perf guards).

* Tighten comments in XML-strip regex and tests

Code says what it does; comments were repeating it. Strip the verbose
explanations down to the WHY-only bits (engine quirk, tail-anchor
rationale, real-world source of each test sample). No code changes.

inference.py:  21 -> 12 lines around _TOOL_XML_RE
test_tool_xml_strip.py: 343 -> 259 lines (-84)
Tests: 28/28 still pass.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Address review: deny pass-through --parallel, preserve legacy short aliases, fix test harness

Round 1 review fixes for #5737:

1. Deny --parallel / --n-parallel / -np in the pass-through validator.
   Without this, `unsloth studio run --model X --parallel 8 -- --parallel
   999` would last-win-override the running llama-server slot count while
   Studio's app.state.llama_parallel_slots and KV-cache fitting stay at
   the typer value (8), so the resource plan and the running process
   disagree. Also bypasses the typer 1..64 range guard. Reject so the
   only path is the first-class typer flag.

2. Backwards-compat shim for -m / -hfr / -f. Dropping the short aliases
   from typer broke any script using `unsloth studio run -m X` or
   `-hfr Y` or `-f dist`. Add _consume_legacy_short_aliases which pops
   EXACT whole-token matches (or `-x=value` inline form) from ctx.args
   into the corresponding typer parameter. Clustered tokens (`-fa`,
   `-mg`, `-fitt`, ...) are left in the pass-through tail unchanged.
   --model becomes Optional with an explicit missing-required check
   after the preprocessor so legacy `-m X` still satisfies the
   "must specify a model" requirement.

3. Drop mix_stderr from CliRunner. Typer 0.25.1 / Click 8.4.1 removed
   the kwarg; the test harness raised TypeError before exercising the
   PR behaviour. Tests run cleanly on current and older Typer/Click.

4. Correct the -np regression test docstring. Pre-PR `-np 8` was
   clustered by Click as `-p 8` (port=8) + stray `-n`, silently
   breaking the port binding -- not "passed through as 8 slots". The
   post-PR assertion (child gets --parallel 8) is unchanged.

5. Update studio run docstring listing rejected flags so it now
   correctly includes --parallel / -np / --n-parallel.

New tests:
- test_llama_server_args.py: parametrized denylist coverage for
  --parallel / --n-parallel / -np including equals-form, including
  out-of-range bypass attempts (999, 0). is_managed_flag flips True.
- test_studio_run_short_alias_clashes.py: legacy -m / -hfr / -f
  promote to typer params; --model X + -m Y conflict errors; clustered
  -mg / -fa / -fitt still pass through (the original bug fix holds).

132 tests pass (98 backend + 34 cli).

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Extend legacy-alias shim tests for repo:variant, inline value form, and missing model

Three additional edge cases for the -m / -hfr / -f preprocessor:
- `-m unsloth/foo:UD-Q4_K_XL` round-trips through both the preprocessor
  and _split_repo_variant so the child sees --model + --gguf-variant.
- `-m=foo` inline value form is promoted just like `-m foo`.
- Missing --model after the preprocessor raises typer.Exit(2) cleanly
  (replacing typer's pre-PR required-flag enforcement now that --model
  is Optional to allow the legacy promotion path).

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Scrub .github/workflows for staging push (matches staging base)

* Fix studio CLI argv handling and pass-through docstring drift

- studio/backend/core/inference/llama_server_args.py: drop the stale
  ``-np``/``--parallel`` entry from the docstring's pass-through tunable
  list. These flags moved into _DENYLIST_GROUPS so the docstring now
  contradicts the validator and would mislead future maintainers
  debugging the ValueError from validate_extra_args(["--parallel","8"]).
  The deleted wording was introduced by dbea77e ("Studio: forward
  llama-server args from `unsloth studio run`, activate `unsloth run`,
  and allow passing model:quant to load models") when --parallel was
  still a documented pass-through; the same commit's "quant" reference
  is about the model:quant syntax, unrelated to the parallel slot
  wording being deleted here.

- unsloth_cli/commands/studio.py: add _expand_attached_np_short next to
  _consume_legacy_short_aliases. Both work around Click's short-option
  clustering for this command -- the legacy preprocessor for `-m` / `-f`
  / `-hfr` and this one for the attached `-np<N>` form. Click clusters
  `-np8` as `-n -p 8` because `-p` is the typer short for `--port`,
  silently setting port=8 and dropping the parallel value; rewriting the
  attached form into separated `-np <N>` in sys.argv before Click
  parses preserves the user's value. Space/equals forms (`-np 8`,
  `-np=8`) already work and are left alone.

- unsloth_cli/__init__.py: import _expand_attached_np_short from the
  studio command and run it only when argv[0] looks like the unsloth
  console-script or workspace cli.py, so importing this module from a
  notebook or pytest run does not mutate the caller's argv.

* Tighten the -np canonicaliser comments

Drop the helper's co-location sentence (location is self-evident from
grep) and shorten the entry-gate rationale to one short sentence
covering the why.

* Sync .github/workflows with upstream author branch

* Sync .github/workflows with upstream author branch

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Bump install.sh / install.ps1 pin to unsloth>=2026.5.7 (#5753)

PyPI release unsloth 2026.5.7 is now live. Bumps the pinned floor in
install.sh and install.ps1 from unsloth>=2026.5.6 to unsloth>=2026.5.7
so fresh installs resolve to the new wheel.

Tagged on main as v0.1.416-beta.

* Catch attached `-np<N>` form in backend pass-through validator

The CLI-side `_expand_attached_np_short` rewrites `-np8` to `-np 8`
before Click parses, but HTTP /load `llama_extra_args=["-np8"]` goes
straight to `validate_extra_args` which only matched the exact token.
Reproducer: `validate_extra_args(["-np8"])` previously returned
`["-np8"]` instead of raising; once forwarded to llama-server it
last-win-overrode Studio's slot count while
`app.state.llama_parallel_slots` stayed at the typer value.

Normalise `-np<digits>` to `-np` in `_flag_name` so the denylist
catches the attached form alongside `-np`, `-np=8`, `--parallel`,
`--parallel=8`, and `--n-parallel`. Tests parametrize the new form
including out-of-range values.

* Restore _consume_legacy_short_aliases unit tests + _expand_attached_np_short tests

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Restore .github/workflows from origin/main

Earlier merge from claude_review's staging-scrub commits accidentally
deleted production CI workflows. Restore them to main's state.

* Scrub .github/workflows for staging push (matches staging base)

* Sync .github/workflows with upstream author branch

* Round 5+6: broaden -np gate to exact basenames + runtime parallel test

Reviewer-flagged improvements squashed into one commit so the auto-push
review bot doesn't keep stomping the branch:

- unsloth_cli/__init__.py: exact-basename match instead of
  endswith('cli.py'). Covers unsloth, unsloth.exe, unsloth-cli,
  unsloth-cli.exe, cli.py, unsloth-cli.py. A third-party mycli.py that
  happens to import unsloth_cli no longer has its argv mutated.

- unsloth_cli/tests/test_studio_run_parallel_flag.py: parametrised
  runtime test (N in {1, 4, 8, 64}) that fakes the in-venv path and
  asserts run_server is invoked with llama_parallel_slots=N.
  Complements the existing source-text check so refactors that preserve
  runtime semantics don't trip a false failure.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Round 7: respect '--' end-of-options and reject flag-as-value

Round 7 reviewer flagged three legitimate edge cases:

- _expand_attached_np_short rewrote post-'--' tokens. Convention: '--'
  ends option processing; payload after it is raw. Stop the loop there.

- _consume_legacy_short_aliases promoted post-'--' legacy aliases for
  the same reason. Treat post-'--' tail as raw.

- Legacy '-m -fa' silently consumed '-fa' as the model name, hiding
  the real CLI shape error. Reject any next-token that starts with '-'
  (except the lone '-' stdin/path sentinel) with a clear BadParameter.

Also expanded the missing-model error string to mention the still-
supported legacy '-m' / '-hfr' aliases so users hitting that diagnostic
on legacy scripts get the right migration hint.

Added four regression tests covering each new behaviour.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Round 8: soften flag-as-value to long-form only + normalise is_managed_flag

Round 8 reviewer flagged two cleanups:

- _consume_legacy_short_aliases rejected any next token starting with
  '-' as a flag, which would break legitimate values like '-foo'
  (path or model name with leading dash). Narrow the rejection to
  '--long' tokens only; '-x' short forms still pass through.

- is_managed_flag did raw _DENYLIST membership while validate_extra_args
  goes through _flag_name first, so '-np8' / '--parallel=8' /
  '--port=9000' classified as not-managed by the helper but rejected
  by the validator. Route is_managed_flag through _flag_name so the
  two helpers agree on every form callers might use.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Round 9: also catch -np-1 / -np+1 signed attached forms in denylist

Round 9 reviewer noticed _flag_name normalised -np<digits> but missed
signed variants -np-1 and -np+1, so validate_extra_args waved them
through while rejecting --parallel -1. llama.cpp would error out on
negative slot counts anyway, but the validator should classify every
form of the managed flag identically so the boundary is consistent.

* Round 10: signed -np in CLI canonicaliser + reject empty inline aliases

Round 10 reviewer flagged two real issues:

- _expand_attached_np_short rewrote only -np<digits>; signed forms
  -np-1 / -np+1 fell through. Backend _flag_name already classifies
  them as managed, so the CLI rewriter must too -- otherwise Click
  clusters -np-1 into -n -p -1 (port=-1) and never reaches the
  backend validator at all.

- -m= / -hfr= / -f= empty inline forms were accepted and produced
  --model '' / --frontend '' (then Path('') silently became '.') on
  re-exec. Reject empty inline values at the preprocessor with a
  clear BadParameter so the malformed input fails fast.

Both behaviours pinned with parametrised regression tests.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Expose --parallel on plain `unsloth studio` for API-path parity

The PR added --parallel to `unsloth studio run` but the plain
`unsloth studio` callback (used for API-only / bare-server launches)
still hardcoded llama_parallel_slots to its run_server default. With
--parallel now denied as a llama_extra_args pass-through, that flow
had no first-class way to raise concurrency.

- unsloth_cli/commands/studio.py: add --parallel / --n-parallel typer
  Option (default 4, range 1..64) to studio_default, forward through
  the venv re-exec, and pass llama_parallel_slots= to run_server in
  the in-venv path.
- studio/backend/run.py: argparse --parallel / --n-parallel with the
  same range guard so the spawned child accepts the forwarded flag.
- unsloth_cli/tests/test_studio_run_parallel_flag.py: test pins the
  new option presence, aliases, default and range guards.

* Round 12: narrow entry-point gate, preserve pre-PR plain-studio default, drop brittle source-text test

Three Opus subagent reviewers (security / backcompat / code-quality)
flagged the same handful of real issues. Consensus fixes:

- unsloth_cli/__init__.py: narrow the -np canonicaliser gate to just
  {unsloth, unsloth.exe} (the only pyproject-declared console_script).
  The previous cli.py / unsloth-cli.py entries would silently rewrite
  sys.argv for any third-party myproj/cli.py that happens to import
  unsloth_cli. Dev users running python cli.py ... -np N still work
  via the space form, which parses without the rewrite.

- unsloth_cli/commands/studio.py + studio/backend/run.py: restore the
  pre-PR llama_parallel_slots default of 1 on plain unsloth studio and
  python studio/backend/run.py. unsloth studio run keeps its
  hardcoded-pre-PR default of 4. Without this, my earlier API-path
  parity commit silently dropped per-call context to ctx/4 for the
  plain-studio flow.

- unsloth_cli/tests/test_studio_run_parallel_flag.py: drop the brittle
  source-text grep test (test_run_kwargs_use_parallel_value). The
  parametrised runtime test test_in_venv_path_passes_parallel_to_run_server
  already pins the same intent against actual behaviour.

- unsloth_cli/tests/test_studio_run_short_alias_clashes.py: pin the
  narrow entry-point gate with a parametrised negative test covering
  seven third-party argv[0] basenames (cli.py, /path/myproj/cli.py,
  pytest, unsloth-cli, etc.). Re-broadening the gate now trips a
  test instead of silently mutating an unrelated CLI's argv.

* Round 13: shared parallel constants, denylist invariant test, defence-in-depth

Three Opus subagent reviewers (adversarial-user / maintenance /
cross-file consistency) flagged a consistent set of cleanups; folded
into one commit to avoid the pre-commit.ci force-push race.

unsloth_cli/commands/studio.py:
- Extract _PARALLEL_MIN / _PARALLEL_MAX / _PARALLEL_DEFAULT_RUN /
  _PARALLEL_DEFAULT_PLAIN module-level constants and use them in both
  typer Options (plain studio_default = 1, studio run = 4).
- _expand_attached_np_short now rewrites -np<junk> when the suffix
  starts with a digit (or signed digit) so '-np8x' surfaces as a
  clean '-np takes an int' typer error instead of a baffling
  '--port invalid' complaint after Click clusters '-n -p 8x'.
- Re-exec forwarding emits --load-in-4bit / --no-load-in-4bit
  explicitly in both directions; previously the True default relied
  on both layers sharing the same default forever.
- run() docstring now explicitly says --parallel / -np pass-through
  via llama_extra_args is denied (use the typer flag above).

studio/backend/run.py:
- Mirror the parallel constants and route the argparse default,
  range check, and error message through them. Help text mentions
  the asymmetry with 'unsloth studio run' so direct-launch dev users
  aren't confused by Default 1 in isolation.

studio/backend/core/inference/llama_server_args.py:
- _flag_name strips surrounding whitespace before denylist lookup so
  a caller can't slip a managed flag past the boundary with a
  trailing space (the trimmed form is what downstream parsers see).

Tests:
- New typer-aliases-subset-of-denylist invariant: every alias the
  typer Option claims as --parallel on run() MUST be in the backend
  parallel denylist group. Catches the failure mode where someone
  adds a new alias and forgets the boundary.
- Extended denylist parametrize to cover ~14 previously untested
  aliases (-mu, -dr, -hfv/-hfrv/-hffv family, -mmu, full --ui group,
  --models-preset / --models-autoload / --no-models-autoload).
- Whitespace-padded denylist rejection (' --parallel', '-np ', etc).
- --load-in-4bit re-exec test pinning both polarities + default.
- -np<junk> argv rewriter regression tests.
- Cross-reference headers between the two test files.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix: repair mlx studio base export save_method (#5727)

* Round 14: align backend -np recogniser with CLI rewriter + reject parent --parallel

Round 14 (reviewer.py --parallel 20 with gpt-5.3-codex-spark) flagged
two real P1s and a stale-rebase warning. All three addressed.

- studio/backend/core/inference/llama_server_args.py: widen
  _flag_name so -np<digit-prefix> with trailing junk (-np8x,
  -np-1foo, -np+1bar, -np9zzz) classifies as managed flag -np,
  matching the CLI _expand_attached_np_short rewriter. Without this,
  POST /api/inference/load with llama_extra_args=['-np8x'] slipped
  past the boundary while the CLI canonicalised the same form. The
  two sides now agree on every digit-prefix form.

- unsloth_cli/commands/studio.py: reject --parallel on the
  studio group when a subcommand is invoked. Pre-PR the studio
  callback had no --parallel; my Round 12 addition made
  'unsloth studio --parallel 8 run ...' silently drop the 8
  because typer doesn't propagate parent options into subcommand
  kwargs. Now errors with exit 2 and a message pointing the
  operator at the correct invocation
  ('unsloth studio run --parallel 8 ...').

- Picked up origin/main via merge (parent commit 0caf0526): the
  pre-flight stale-rebase detector found 2 lines on main in
  studio/backend/core/export/export.py missing from PR HEAD.
  Merged cleanly with no conflicts.

Tests:
- Parametrised denylist coverage for -np<digit-prefix>+junk forms.
- New runtime test confirms exit 2 + helpful error when the group
  --parallel is supplied alongside an invoked subcommand.
- Test that the default group --parallel value still lets a
  subcommand resolve (no false-positive rejection).

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Studio: tighten code comments across --parallel PR

Comment-only pass over the seven PR-touched files; trim verbose
docstrings, collapse multi-line section dividers, and drop
redundant prose that the code already conveys. No behaviour change.

* Studio: trim remaining verbose docstrings missed in last pass

Shorten the test_studio_run_parallel_flag.py module docstring and
the `Re-exec arg-builder coverage` block. No behaviour change.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Studio: second comment-tightening pass across PR-touched code

Trim docstrings and inline comments in studio.py, run.py,
llama_server_args.py, and unsloth_cli/__init__.py. No behaviour change;
all 215 tests still pass.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Studio: deny --embedding / --rerank / --tools pass-through

`--embedding` and `--rerank` flip llama-server into single-endpoint
mode, which breaks Studio's /v1/chat/completions hop. llama-server's
own `--tools` flag silently stacks on top of Studio's tool policy
resolved by `--enable-tools` / `--disable-tools`.

Add all three (plus the `--embeddings` / `--reranking` plural aliases)
to the boundary denylist so HTTP /load and pass-through extras both
reject them cleanly instead of silently desyncing the server surface.

Test added to the existing `test_denylist_rejects_all_aliases`
parametrize. 220 tests pass.

* Studio: make PR-touched tests robust to minimal envs + Windows

Two cross-OS CI findings:

1. `test_typer_parallel_aliases_are_subset_of_backend_denylist` was
   doing `from core.inference.llama_server_args import _DENYLIST_GROUPS`
   which triggers `core/inference/__init__.py` and pulls in the full
   backend chain (fastapi / structlog / loggers / utils.hardware).
   The invariant only needs the constants tuple, so load the module
   directly via `importlib.util.spec_from_file_location` -- the test
   now runs with just typer + pytest installed.

2. `test_legacy_frontend_alias_still_promotes_to_frontend` asserted
   the literal string `"/tmp/dist"` after the value round-trips through
   `Path()`. On Windows `str(Path("/tmp/dist"))` is `"\tmp\dist"`, so
   the assertion tripped on the same logical path. Compare via
   `Path(x) == Path("/tmp/dist")` so the test passes on every OS.

Both surfaced by the staging-4 cross-OS CI; no production-code change.
220 tests still pass locally.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Studio: load llama_server_args.py directly in its unit tests

Same fix as the previous CLI-test commit: import the module via
`importlib.util.spec_from_file_location` instead of
`from core.inference.llama_server_args import ...`, so the test no
longer needs the full backend chain (fastapi / structlog / loggers /
utils.hardware) installed via `core/inference/__init__.py`.

The boundary validator is intentionally dependency-free; its unit
tests should reflect that.

* Fix test_main_composer_has_dir_auto anchor after PR #5784

PR #5784 ("Improve image generation UI") rewrote the message-input
textarea's static `aria-label="Message input"` into a JSX conditional
`aria-label={overlay ? "Image edit instructions" : "Message input"}`
but did not update the RTL bidi-attribute regression test, leaving
the literal-string `find('aria-label="Message input"')` anchor with
no match. The `Repo tests (CPU)` job has been red on main since.

Anchor on the inner `"Message input"` string literal instead -- it
survives both spellings and still pins the same textarea element so
the `dir="auto"` assertion has the right block to inspect.

Verified by re-running the exact CI command:
  954 passed, 3 skipped, 23 deselected (was 948 passed, 1 failed).

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Long Yixing <longyixing331@gmail.com>
rhsCZ pushed a commit to rhsCZ/unsloth that referenced this pull request May 27, 2026
…k-glm-kimi

Resolves three conflicts against the updated 5620 base (which itself
merged main after main moved on with PRs unslothai#5735 / unslothai#5775 / unslothai#5803 etc.
touching the same routes/inference.py and tool_call_parser.py surface):

* studio/backend/core/inference/tool_call_parser.py
  ``_TOOL_CLOSED_PATS``: kept 5624's full set (Mistral pre-v11 array,
  Mistral v11+ name{json}, DeepSeek envelope, Kimi section) on top of
  the 3-pattern base. New 5620 base reverted to the 3 base patterns
  because main never carried the tool-format extensions.

* studio/backend/routes/inference.py
  Merged the two regex bodies: kept 5624's elaborate python_tag
  ``(?:[^<]|<(?!\|))*`` clause and the new-family closed-pair
  patterns (DeepSeek envelope, Kimi section). DROPPED the inlined
  Mistral patterns in favour of the base's ``_strip_tool_xml`` helper
  which delegates Mistral handling to the parser module's
  ``_strip_mistral_closed_calls`` -- the non-greedy ``\{.*?\}`` form
  truncates at the first ``}`` of a nested JSON arg, so balanced
  brace/bracket scanning is correct here. Also kept the base's
  orphan-close and tail-only ``</parameter>`` patterns from the
  speculative-buffer split boundary work. Net: 9 call sites continue
  to use ``_strip_tool_xml(...)`` (Mistral-safe).

* studio/backend/tests/test_safetensors_tool_loop.py
  ``TestRoutesPythonTagStrip``: kept the base's wording for the
  section header (the two were near-identical) and switched the helper
  ``_strip`` back to ``_strip_tool_xml`` since the helper is restored.

Also retargeted ``test_pr5624_regressions.py``'s routes-layer strip
tests to ``_strip_tool_xml`` for consistency with the restored helper.

Tests:
  pytest studio/backend/tests/test_safetensors_tool_loop.py
         studio/backend/tests/test_safetensors_capability_advertise.py
         studio/backend/tests/test_pr5624_regressions.py -q
  -> 170 passed in 1.93s

  pytest studio/backend/tests/ -q -k 'not gpu and not llama_cpp_integration'
  -> 2034 passed, 15 failed (pre-existing on the 5620 base; same
     set as before the merge: test_training_worker_flash_attn,
     test_desktop_auth, test_studio_api integration shims).
rsd-darshan pushed a commit to rsd-darshan/unsloth that referenced this pull request Jun 3, 2026
…othai#5735)

* Studio: strip orphan tool_call XML from streamed visible content

The speculative-buffer state machine in
`studio/backend/core/inference/llama_cpp.py` can slice a tool_call XML
block between the silent DRAINING path and the user-visible
content_accum, depending on when in the model's emission the BUFFERING
-> STREAMING -> DRAINING transitions fire. Three leak shapes were
observed in a 2026-05-22 sweep of 900 Qwen3.5 / Qwen3.6 GGUF runs:

  Pre-fix XML leak rate: 20/900 (2.22%), concentrated 6.7% on the
  larger Q8 / MTP configs:

    Qwen3.6-35B-A3B Q8_0         4/60  (6.7%)
    Qwen3.6-35B-A3B-MTP Q4       4/60  (6.7%)
    Qwen3.5-35B-A3B Q8_0         3/60  (5.0%)
    Qwen3.6-27B Q8_0             3/60  (5.0%)

The existing `_TOOL_XML_RE` only matched well-formed
`<tool_call>...</tool_call>` and `<function=...></function>` pairs, so
unterminated openings (close was DRAINED) and orphan closes (opening
was DRAINED) survived the strip and reached the user.

Fix relaxes the regex to also strip:
  1. Orphan opening up to end-of-string: `(?:</tool_call>|\Z)`
  2. Orphan closing tag: bare `</tool_call>` / `</function>`

Verified on the full sweep: 20/900 -> 0/900 (100% of detected leaks
eliminated). 16 unit tests in `test_tool_xml_strip.py` pin all three
leak shapes plus the well-formed cases, plus parametrised checks on
the 5 actual real-world leak samples from the sweep data.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Studio: strip tail-only </parameter> orphan + tighten regex

The 2026-05-22 gdpval sweep surfaced a 4th XML-leak shape not caught
by the earlier regex: a bare `</parameter>\n\n` at end-of-buffer (7
of 192 trials, all Qwen3.5-27B + a few Qwen3.6-27B). The model emits
the full `<tool_call><function=...><parameter=...>...content...
</parameter></function></tool_call>` envelope, the speculative buffer
DRAINS the opening tags as intended, but EOS (max_tokens cutoff)
truncates the outer `</function></tool_call>` close, leaving just
`</parameter>` as the visible tail.

We strip this ONLY when end-anchored (`\s*\Z`) so legitimate
mid-text uses (user code samples, documentation discussing the
Qwen tool-call XML shape) survive. Verified on the 192-trial
gdpval corpus: before=7, after=0.

While at it, fold the five top-level alternations into three by
sharing tag-name and prefix subgroups:

  <tool_call>...    + <function=\w+>...    +    -->  <(?:tool_call|function=\w+)>...
  </tool_call>      | </function>                  -->  </(?:tool_call|function)>

Semantically identical (verified by replay over the 192-trial
corpus + adversarial inputs, 0 diffs) and 1.34x faster on real
workloads. Backtracking-safety pinned by two new perf guards
(256KB '<' spam, 1000x orphan opens).

Tests: 16 -> 28 (6 new functional + 4 well-formed-vs-orphan +
2 perf guards).

* Tighten comments in XML-strip regex and tests

Code says what it does; comments were repeating it. Strip the verbose
explanations down to the WHY-only bits (engine quirk, tail-anchor
rationale, real-world source of each test sample). No code changes.

inference.py:  21 -> 12 lines around _TOOL_XML_RE
test_tool_xml_strip.py: 343 -> 259 lines (-84)
Tests: 28/28 still pass.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant