llama : add token matching support to llama-grammar#17816
aldehir merged 6 commits into ggml-org:master from
Conversation
Very interesting. I'll see what I can build upon that.
ggerganov left a comment
Feel free to merge (use squash-merge).
One question: did you test whether there are memory or performance problems with very long sequences, like {2000}? I remember there were some issues around that recently.
I didn't have any noticeable problems at 2000, and I didn't increase the limit in this PR. I do think we could increase it; from what I recall, the issue arose mostly when the count overflowed to MAX_UINT. I can test it more thoroughly.
No discernible memory or performance impact (beyond circumventing the grammar sampling shortcut) at 8k repetitions. We can keep the 2k max until there is a desire to go higher.
* llama : add token support to llama-grammar
* fix inverse token comment
* refactor trigger_patterns to replay tokens instead of the entire string
* add token documentation
* fix test-llama-grammar
* improve test cases for tokens
Implementation of idea by @ngxson: #17750 (comment)
cc: @pwilkin @aviallon
Problem
The `llama-grammar` implementation doesn't have a way to accept tokens directly, which creates a few problems:

* There is no way to distinguish between a special token (e.g. `<|end|>`) and the tokenized form `<|`, `end`, `|>` that may occur in content.
* Grammars must resort to patterns like `( [^<] | "<" [^|] | "<|" [^e] | ... | "<|end|" [^>] )*` to match chunks of characters that don't accumulate to the desired delimiter (`<|end|>`).

Proposed Solution
Borrowing some ideas from llguidance, you can define a token by id, `<[id]>`, or by raw token text, `<token>`, if the token text is encased in `<` / `>`. I'm leaving out support for token id ranges/alternates since I don't see an immediate need for it. You can negate by prefixing the token with `!`, e.g. `!<|end|>`.

Example (gpt-oss)
By token id:
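A hypothetical sketch of what the by-id form could look like, using the syntax described above (the token id value is illustrative, not gpt-oss's actual id):

```
# match free-form text, then require the model's end token by id
# (the id value here is illustrative)
root ::= [^<]* <[200007]>

# equivalent raw-text form, possible because the token text is
# encased in < / >:
# root ::= [^<]* <|end|>
```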
That's not very readable, but it is useful for tokens not wrapped in `<` / `>`. If they are, you can use the token text directly in the grammar.

Use Case: Reasoning Budget Enforcement
Assuming the model's vocab has unique tokens for its thinking tags, adopting a reasoning budget is fairly trivial via grammar:
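A minimal sketch of such a budget grammar, assuming the vocab has dedicated `<think>` / `</think>` tokens (the tag names, the 500-token budget, and the use of `.` as an any-character match are illustrative assumptions, not taken from this PR):

```
# cap the thinking section at 500 tokens, then force the closing tag;
# !</think> matches any single token other than </think>
root ::= <think> ( !</think> ){0,500} </think> .*
```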
Notes:

* `gpt-oss` may be a poor example since it has `reasoning_effort`, but the budget approach works pretty well.

To Do

* Refactor `llama-grammar` `trigger_patterns` to collect tokens and replay them after a successful trigger. Support partial token matches by feeding only the matched piece to the grammar.
* Add token documentation under `grammars/`

AI Disclosure: LLM was used to help understand the grammar code, assist in writing documentation and test cases, and review implementations. All output generated by an LLM has been reviewed.
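The trigger-replay idea from the to-do list can be sketched as follows. This is a Python illustration with hypothetical names, not llama.cpp's actual C++ code: tokens are buffered until a trigger pattern matches, then replayed into the grammar starting from the matched piece, including a partial token when the match begins mid-token.

```python
import re

class LazyGrammar:
    """Illustrative sketch: buffer token pieces until a trigger pattern
    matches, then replay them through the grammar. All names here are
    hypothetical; the real implementation lives in llama.cpp's C++ code."""

    def __init__(self, trigger_pattern: str):
        self.trigger = re.compile(trigger_pattern)
        self.buffer: list[str] = []    # token pieces seen before the trigger
        self.active = False
        self.replayed: list[str] = []  # stand-in for the grammar's state

    def _feed_grammar(self, piece: str) -> None:
        # stand-in for advancing the real grammar state machine
        self.replayed.append(piece)

    def accept(self, piece: str) -> None:
        if self.active:
            self._feed_grammar(piece)
            return
        self.buffer.append(piece)
        text = "".join(self.buffer)
        m = self.trigger.search(text)
        if m is None:
            return
        self.active = True
        # replay tokens from the start of the match; if the match begins
        # mid-token, feed only the matched part of that token
        start, pos = m.start(), 0
        for tok in self.buffer:
            end = pos + len(tok)
            if end > start:
                self._feed_grammar(tok[max(start - pos, 0):])
            pos = end
        self.buffer.clear()

g = LazyGrammar(r"<think>")
for t in ["Hello ", "<th", "ink>", "budget"]:
    g.accept(t)
# only the trigger match and everything after it reach the grammar
```

The design point this mirrors is the PR's refactor note: rather than re-feeding the entire accumulated string on trigger, only the buffered tokens (trimmed at the match boundary) are replayed.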