Skip to content

Add Mistral Guidance#37081

Closed
juliendenize wants to merge 6 commits into
vllm-project:mainfrom
juliendenize:fix_mistral_parsing
Closed

Add Mistral Guidance#37081
juliendenize wants to merge 6 commits into
vllm-project:mainfrom
juliendenize:fix_mistral_parsing

Conversation

@juliendenize

@juliendenize juliendenize commented Mar 14, 2026

Copy link
Copy Markdown
Contributor

Purpose

Fix Mistral tool calling and reasoning parsing by introducing Lark grammar-based structured output for Mistral models. This ensures correct tool call parsing for both streaming and non-streaming modes across all tool_choice values (auto, required, named, none) and all Mistral tokenizer versions.

Problems on main:

  • Streaming tool calls are broken for post-v15 Mistral tokenizers: [TOOL_CALLS] tokens leak into content instead of being parsed as tool calls.
  • tool_choice="required" and named tool choice produce unparseable arguments (e.g., [TOOL_CALLS]get_current_weather{...} embedded in the arguments string).
  • tool_choice="none" leaks raw [TOOL_CALLS] special tokens into user-visible content.
  • Reasoning content ([THINK]...[/THINK]) is not separated from content for pre-v15 models and not properly controlled via reasoning_effort for v15+ models.

This PR fixes by:

  • Introduce MistralGrammarFactory that generates Lark grammars from Jinja templates, selecting the appropriate grammar variant. This involves adding Lark grammar and support for Mistral Tokenizer to llguidace.
  • Update the OpenAI chat serving layer to use the Mistral grammar path for both streaming and non-streaming responses.

Test Plan

Unit tests (in this repo)

# Mistral tool parser tests (grammar factory, lark converter, adjust_request, streaming)
pytest tests/tool_parsers/test_mistral_tool_parser.py -v -s

# Mistral reasoning parser tests (is_reasoning_end with prefilled prompts)
pytest tests/reasoning/test_mistral_reasoning_parser.py -v -s

# Guidance backend lark grammar test
pytest tests/v1/structured_output/test_backend_guidance.py -v -s -k test_backend_guidance_lark_grammar

End-to-end tests (external repo)

Scripts and results are available at: https://github.com/juliendenize/vllm-test-tool-and-reasoning-parsing

Post-v15 tokenizer: 48 tests (8 scenarios × 2 stream modes × 3 reasoning_effort values)

python test_vllm_tools_post_v15.py --base-url http://localhost:8000/v1

Pre-v15 tokenizer: 16 tests (8 scenarios × 2 stream modes)

python test_vllm_tools_pre_v15.py --base-url http://localhost:8000/v1

Test Result

End-to-end results comparison

Model / Tokenizer Branch Passed Failed Total
Post-v15 This PR 48 0 48
Post-v15 main 26 22 48
Pre-v15 This PR 16 0 16
Pre-v15 main 12 4 16

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a robust, grammar-based approach using Lark to fix several long-standing issues with Mistral tool calling and reasoning parsing. The changes are comprehensive, covering both streaming and non-streaming modes, various tool choice options, and different tokenizer versions. The introduction of MistralGrammarFactory and the associated Jinja templates for grammar generation is a solid design choice for managing the complexity. The test suite has been significantly expanded, which is excellent for ensuring the correctness of these critical changes. I've found one critical issue in the streaming logic that could lead to corrupted output, for which I've provided a specific comment and suggestion.

Comment on lines +585 to +586
if self.bot_token not in delta_text:
return DeltaMessage(content=delta_text)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

This condition can lead to corrupted output during streaming. If the bot_token (e.g., "[TOOL_CALLS]") is split across multiple streaming deltas, delta_text may not contain the full token. In this scenario, the current logic incorrectly returns a partial token string as content, which is incorrect.

The correct behavior should be to wait for more tokens by returning None, especially since the calling function has already confirmed that the bot_token exists in the cumulative current_text. This ensures that partial tool call markers are not leaked as content.

                return None

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

special tokens are not split so don't think it's true

Signed-off-by: juliendenize <julien.denize@mistral.ai>
@mergify

mergify Bot commented Mar 14, 2026

Copy link
Copy Markdown
Contributor

Hi @juliendenize, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Signed-off-by: juliendenize <julien.denize@mistral.ai>
@mergify

mergify Bot commented Mar 15, 2026

Copy link
Copy Markdown
Contributor

Hi @juliendenize, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Signed-off-by: juliendenize <julien.denize@mistral.ai>
@mergify

mergify Bot commented Mar 15, 2026

Copy link
Copy Markdown
Contributor

Hi @juliendenize, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Signed-off-by: juliendenize <julien.denize@mistral.ai>
prompt_tokens = mistral_tokenizer.tokenizer.encode(
"Just a regular prompt.", bos=False, eos=False
)
assert parser.is_reasoning_end(prompt_tokens) is False

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert parser.is_reasoning_end(prompt_tokens) is False
assert not parser.is_reasoning_end(prompt_tokens)

+ mistral_tokenizer.tokenizer.encode(prompt_text_after, bos=False, eos=False)
)

assert parser.is_reasoning_end(prompt_tokens) is True

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert parser.is_reasoning_end(prompt_tokens) is True
assert parser.is_reasoning_end(prompt_tokens)

self.tool_parser, MistralToolParser
)
if _is_mistral_tool_parser and self.reasoning_parser_cls is not None:
MistralToolParser._has_reasoning_parser = True

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
MistralToolParser._has_reasoning_parser = True

why are we setting a bool of the MistralToolParser here? could we just do this directly in the get_tool_parser function?

)
harmony_tools_streamed[i] |= tools_streamed_flag
elif use_mistral_grammar and reasoning_parser:
assert tool_parser is not None

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move all this new code into a chat_completion/mistral_grammar.py file?

and tokenizer.version >= 15
and request.reasoning_effort in (None, "none")
)
if not skip_reasoning:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think this is a lot of extra code here.

Couldn't we just set:

self.reasoning_parser = None

if skip_reasoning is False much earlier above so that the code should stay exactly the same here?

request, tokenizer, self.tool_parser
)
tool_call_items: list[ToolCall]
if use_mistral_grammar:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we maybe also import this code from another function?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this one is about Mistral API, not related with grammar no?

Comment on lines +1122 to 1129
use_mistral_grammar = (
isinstance(request, ChatCompletionRequest)
and tokenizer is not None
and is_mistral_lark_grammar_active(
request,
tokenizer,
tool_parser_cls, # type: ignore[arg-type]
)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
use_mistral_grammar = (
isinstance(request, ChatCompletionRequest)
and tokenizer is not None
and is_mistral_lark_grammar_active(
request,
tokenizer,
tool_parser_cls, # type: ignore[arg-type]
)
use_mistral_grammar = isinstance(tokenizer, MistralTokenizer) and tokenizer.is_grammar_active(request)

can't we inject more logic through the MistralTokenizer here?


if not request.tools:
# Sanitize tool_choice.
request.tool_choice = "none"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure about this?

@sfeng33 sfeng33 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work on the correctness fixes! One note on the performance side:

The Lark grammar is applied for all tool_choice modes, including "none" and "auto". This means every Mistral tool-calling request pays guided decoding overhead (grammar compilation per request + bitmask computation per token), even when the grammar is essentially unconstrained (e.g. tool_choice="none" produces body: content which matches nearly everything).

For "none" specifically, the only real effect is preventing the <TOOL_CALLS> special token from being emitted — which could be achieved by just masking that single token ID without involving the grammar engine.

seen_special_tokens: set[str] = set()
for i in range(self._tokenizer.n_words):
# Convert square brackets to angle brackets for special tokens,
# since llg only recognizes the latter.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think now llg recognizes as well token ids but maybe keeping it this way is more modular here?

Signed-off-by: juliendenize <julien.denize@mistral.ai>
@juliendenize

Copy link
Copy Markdown
Contributor Author

I pushed a commit to enforce tool_choice="auto" for now and we inject grammar. This will be used for a special docker image as this PR entangles too many changes and require deeper design thoughts.
It would be better to break it in multiple smaller pr that target specific parts (grammar, parser, ...)

Signed-off-by: juliendenize <julien.denize@mistral.ai>
@mergify

mergify Bot commented Mar 17, 2026

Copy link
Copy Markdown
Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @juliendenize.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Mar 17, 2026
)


BASE_LARK_GRAMMAR = r"""start: body

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

grammar definition

@mergify mergify Bot added the tool-calling label Mar 20, 2026
TrevorS added a commit to TrevorS/vllm that referenced this pull request Mar 25, 2026
When Mistral Lark grammar constrains tool call generation (PR vllm-project#37081),
the streaming tool parser's streamed_args_for_tool list is never
populated — the grammar path handles token generation differently from
the incremental [TOOL_CALLS] parsing path.

At finish_reason=tool_calls, the code accesses
streamed_args_for_tool[index] unconditionally, causing:

  IndexError: list index out of range

The grammar path already streams tool call arguments correctly via delta
chunks — the remaining-args diff at finish time is unnecessary.

Fix: bounds-check streamed_args_for_tool and skip the remaining-args
computation when the list is empty (grammar/structured output path).

Tested: short and long streaming tool calls produce valid JSON.
Affects: Any model using Mistral Lark grammar tool parsing with streaming.
Discovered: DGX Spark with Mistral Small 4 + Open WebUI, March 2026.
@androiddrew

Copy link
Copy Markdown

@juliendenize Is this still on going?

@juliendenize

Copy link
Copy Markdown
Contributor Author

@androiddrew yeah but this PR is on stale in favor of #38150 and #39217. Should be done soon !

@mergify mergify Bot added the mistral Related to Mistral models label Apr 13, 2026
@juliendenize

Copy link
Copy Markdown
Contributor Author

Closed in favor of #38150 and #39217

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: Done
Status: Done

Development

Successfully merging this pull request may close these issues.

5 participants