llama : add token matching support to llama-grammar#17816
aldehir merged 6 commits into ggml-org:master from
Conversation
Very interesting. I'll see what I can build upon that.
ggerganov left a comment
Feel free to merge (use squash-merge).
One question: did you test whether there are memory or performance problems with very long sequences, like {2000}? I remember there were some issues around that recently.
I didn't have any noticeable problems at 2000, and I didn't increase the limit in this PR. I do think we could increase it; from what I recall, the issue arose mostly when the count overflowed to MAX_UINT. I can test it more thoroughly.
No discernible memory or performance impact (beyond circumventing the grammar sampling shortcut) at 8k repetitions. We can keep the 2k max until there is a desire to go higher.
* llama : add token support to llama-grammar
* fix inverse token comment
* refactor trigger_patterns to replay tokens instead of the entire string
* add token documentation
* fix test-llama-grammar
* improve test cases for tokens
Implementation of idea by @ngxson: #17750 (comment)
cc: @pwilkin @aviallon
Problem
The `llama-grammar` implementation doesn't have a way to accept tokens directly, which creates a few problems:

* There is no way to distinguish between a special token (e.g. `<|end|>`) and the tokenized form `<|`, `end`, `|>` that may occur in content.
* Grammars must resort to patterns like `( [^<] | "<" [^|] | "<|" [^e] | ... | "<|end|" [^>] )*` to match chunks of characters that don't accumulate to the desired delimiter (`<|end|>`).

Proposed Solution
Borrowing some ideas from llguidance, you can define a token by id, `<[id]>`, or by raw token text, `<token>`, if the token text is encased in `<` / `>`. I'm leaving out support for token id ranges/alternates since I don't see an immediate need for it. You can negate by prefixing the token with `!`, e.g. `!<|end|>`.

Example (gpt-oss)
By token id:
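A hypothetical sketch of what the by-id form could look like, using the syntax described above (the token id value is illustrative, not gpt-oss's actual id):

```
# match free-form text, then require the model's end token by id
# (the id value here is illustrative)
root ::= [^<]* <[200007]>

# equivalent raw-text form, possible because the token text is
# encased in < / >:
# root ::= [^<]* <|end|>
```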
That's not very readable, but it is useful for tokens not wrapped in `<` / `>`. If they are, you can use the token text directly in the grammar.

Use Case: Reasoning Budget Enforcement
Assuming the model's vocab has unique tokens for its thinking tags, adopting a reasoning budget is fairly trivial via grammar:
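A minimal sketch of such a budget grammar, assuming the vocab has dedicated `<think>` / `</think>` tokens (the tag names, the 500-token budget, and the use of `.` as an any-character match are illustrative assumptions, not taken from this PR):

```
# cap the thinking section at 500 tokens, then force the closing tag;
# !</think> matches any single token other than </think>
root ::= <think> ( !</think> ){0,500} </think> .*
```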
Notes:

* `gpt-oss` may be a poor example since it has `reasoning_effort`, but the budget approach works pretty well.

To Do

* Refactor `llama-grammar` `trigger_patterns` to collect tokens and replay them after a successful trigger. Support partial token matches by feeding only the matched piece to the grammar.
* Add token documentation under `grammars/`

AI Disclosure: LLM was used to help understand the grammar code, assist in writing documentation and test cases, and review implementations. All output generated by an LLM has been reviewed.
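The trigger-replay idea from the to-do list can be sketched as follows. This is a Python illustration with hypothetical names, not llama.cpp's actual C++ code: tokens are buffered until a trigger pattern matches, then replayed into the grammar starting from the matched piece, including a partial token when the match begins mid-token.

```python
import re

class LazyGrammar:
    """Illustrative sketch: buffer token pieces until a trigger pattern
    matches, then replay them through the grammar. All names here are
    hypothetical; the real implementation lives in llama.cpp's C++ code."""

    def __init__(self, trigger_pattern: str):
        self.trigger = re.compile(trigger_pattern)
        self.buffer: list[str] = []    # token pieces seen before the trigger
        self.active = False
        self.replayed: list[str] = []  # stand-in for the grammar's state

    def _feed_grammar(self, piece: str) -> None:
        # stand-in for advancing the real grammar state machine
        self.replayed.append(piece)

    def accept(self, piece: str) -> None:
        if self.active:
            self._feed_grammar(piece)
            return
        self.buffer.append(piece)
        text = "".join(self.buffer)
        m = self.trigger.search(text)
        if m is None:
            return
        self.active = True
        # replay tokens from the start of the match; if the match begins
        # mid-token, feed only the matched part of that token
        start, pos = m.start(), 0
        for tok in self.buffer:
            end = pos + len(tok)
            if end > start:
                self._feed_grammar(tok[max(start - pos, 0):])
            pos = end
        self.buffer.clear()

g = LazyGrammar(r"<think>")
for t in ["Hello ", "<th", "ink>", "budget"]:
    g.accept(t)
# only the trigger match and everything after it reach the grammar
```

The design point this mirrors is the PR's refactor note: rather than re-feeding the entire accumulated string on trigger, only the buffered tokens (trimmed at the match boundary) are replayed.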