What happened
When an OpenAI-compatible /v1/chat/completions request includes tools with many optional parameters, the GBNF grammar fails to compile and the server returns a 500 error. The grammar compiler generates (param_1 | ... | param_N){0,N} rules for optional parameters. With N=48 or N=94, the rule expansion exceeds MAX_REPETITION_THRESHOLD (2000), grammar compilation fails silently, and the PEG parser then fails on the unconstrained model output at the end of streaming:
data: {"error":{"code":500,"message":"Failed to parse input at pos 138: <tool_call>\n<function=write>\n..."}}
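As a rough illustration of why these tool sizes cross the limit, the sketch below estimates the expansion of a (param_1 | ... | param_N){0,N} rule. It assumes the check compares "elements being repeated" times "maximum repetitions" against the threshold, which is how the log message reads; the actual accounting inside the grammar parser may differ, and estimated_expansion is a made-up helper name.

```python
# Rough estimate only. Assumption: the limit compares
# (number of repeated elements) x (max repetitions) against the threshold,
# as the server log message suggests. The real check may count differently.
MAX_REPETITION_THRESHOLD = 2000  # value quoted in this report


def estimated_expansion(n_optional: int) -> int:
    """Estimated size of (p_1 | ... | p_N){0,N} under the assumption above."""
    return n_optional * n_optional


for n in (10, 48, 94):
    size = estimated_expansion(n)
    verdict = "exceeds" if size > MAX_REPETITION_THRESHOLD else "fits within"
    print(f"N={n}: ~{size} {verdict} {MAX_REPETITION_THRESHOLD}")
```

Under that assumption, 48 x 48 = 2304 and 94 x 94 = 8836, both above 2000, while a tool with around ten optional parameters stays far below it.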
How I found this
I was running PinchBench (an AI agent benchmark suite) against a llama.cpp-served Qwen3.5-27B, using OpenClaw as the agent runtime. OpenClaw's standard toolset — the same set every OpenClaw user gets — includes tools like browser (48 params) and message (94 params). Every benchmark task failed with a 500 error before the agent could act.
This is not a pathological personal setup. PinchBench runs against the agent's default tools; any llama.cpp user running an OpenClaw or similar agent benchmark will hit this.
Reproduction
Send a streaming chat completion request with tools that have many optional parameters. A single tool with 48 parameters, 47 of them optional, is enough to trigger it. The request works on builds before PR #18604.
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "test",
"stream": true,
"messages": [{"role": "user", "content": "Write hello to /tmp/test.txt"}],
"tools": [{
"type": "function",
"function": {
"name": "browser",
"description": "Browser automation",
"parameters": {
"type": "object",
"properties": {
"action": {"type": "string"},
"url": {"type": "string"},
"selector": {"type": "string"},
"text": {"type": "string"},
"timeout": {"type": "number"},
"visible": {"type": "boolean"},
"screenshot": {"type": "boolean"},
"full_page": {"type": "boolean"},
"wait_for": {"type": "string"},
"scroll_x": {"type": "number"},
"scroll_y": {"type": "number"},
"key": {"type": "string"},
"modifiers": {"type": "string"},
"button": {"type": "string"},
"x": {"type": "number"},
"y": {"type": "number"},
"width": {"type": "number"},
"height": {"type": "number"},
"format": {"type": "string"},
"quality": {"type": "number"},
"clip": {"type": "boolean"},
"delay": {"type": "number"},
"navigation_timeout": {"type": "number"},
"wait_until": {"type": "string"},
"extra_http_headers": {"type": "string"},
"ignore_https_errors": {"type": "boolean"},
"java_script_enabled": {"type": "boolean"},
"bypass_csp": {"type": "boolean"},
"user_agent": {"type": "string"},
"viewport_width": {"type": "number"},
"viewport_height": {"type": "number"},
"device_scale_factor": {"type": "number"},
"is_mobile": {"type": "boolean"},
"has_touch": {"type": "boolean"},
"color_scheme": {"type": "string"},
"reduced_motion": {"type": "string"},
"forced_colors": {"type": "string"},
"accept_downloads": {"type": "boolean"},
"record_video": {"type": "boolean"},
"record_video_dir": {"type": "string"},
"record_video_size": {"type": "string"},
"proxy_server": {"type": "string"},
"proxy_bypass": {"type": "string"},
"proxy_username": {"type": "string"},
"proxy_password": {"type": "string"},
"storage_state": {"type": "string"},
"geolocation_latitude": {"type": "number"},
"geolocation_longitude": {"type": "number"}
},
"required": ["action"]
}
}
},
{
"type": "function",
"function": {
"name": "write",
"description": "Write file",
"parameters": {
"type": "object",
"properties": {
"path": {"type": "string"},
"content": {"type": "string"}
},
"required": ["path", "content"]
}
}
}]
}'
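To try other parameter counts without editing the JSON by hand, the request body can be generated. This is a minimal sketch: build_request and the opt_N parameter names are placeholders invented for illustration, and the tool shape simply mirrors the request above (one required parameter plus a configurable number of optional ones).

```python
import json


def build_request(n_optional: int) -> dict:
    """Build a streaming chat request whose first tool has one required
    parameter ("action") plus n_optional optional string parameters."""
    properties = {"action": {"type": "string"}}
    for i in range(n_optional):
        # Placeholder names; only the number of optional parameters matters.
        properties[f"opt_{i}"] = {"type": "string"}

    big_tool = {
        "type": "function",
        "function": {
            "name": "browser",
            "description": "Browser automation",
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": ["action"],
            },
        },
    }
    write_tool = {
        "type": "function",
        "function": {
            "name": "write",
            "description": "Write file",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string"},
                    "content": {"type": "string"},
                },
                "required": ["path", "content"],
            },
        },
    }
    return {
        "model": "test",
        "stream": True,
        "messages": [{"role": "user", "content": "Write hello to /tmp/test.txt"}],
        "tools": [big_tool, write_tool],
    }


if __name__ == "__main__":
    # 47 optional + 1 required = 48 parameters, matching the browser tool above.
    print(json.dumps(build_request(47)))
```

Assuming the script is saved as make_request.py (a hypothetical filename), its output can be piped into the same endpoint with curl's -d @- option, which reads the request body from standard input.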
Server log output:
parse: error parsing grammar: number of rules that are going to be repeated
multiplied by the new repetition exceeds sane defaults, please reduce the
number of repetitions or rule complexity
failed to parse grammar
This is followed by a 500 error in the response when the PEG parser fails on the final (non-partial) parse.
Failure mode
The failure mode is particularly bad for streaming:
- Grammar compilation fails without aborting the request (the error is logged, but generation proceeds)
- The model generates unconstrained output (native XML tool-call format)
- Partial PEG parsing during streaming detects the tool call start and sends a tool_calls delta with arguments: "{"
- The final (non-partial) PEG parse fails on the unconstrained output
- The server returns a 500 error mid-stream
The client receives an inconsistent response: a partial tool call followed by an error.
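Until this is fixed, a client can at least avoid acting on the half-delivered tool call. The sketch below is one possible defensive reader, not part of llama.cpp or OpenClaw: read_tool_call_stream is a hypothetical helper, the delta layout follows the OpenAI streaming format, and the error event follows the shape shown at the top of this report (with the message abbreviated in the sample).

```python
import json


def read_tool_call_stream(lines):
    """Accumulate tool-call argument fragments from SSE "data:" lines.

    Returns (arguments_by_index, error). If the server emits an error
    event, fragments collected so far are discarded, since they belong
    to a tool call that never completed.
    """
    arguments = {}
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        event = json.loads(payload)
        if "error" in event:
            return {}, event["error"]
        for choice in event.get("choices", []):
            delta = choice.get("delta") or {}
            for call in delta.get("tool_calls") or []:
                index = call.get("index", 0)
                fragment = (call.get("function") or {}).get("arguments") or ""
                arguments[index] = arguments.get(index, "") + fragment
    return arguments, None


# Sample stream mirroring the failure described above (message abbreviated).
sample = [
    'data: {"choices":[{"delta":{"tool_calls":[{"index":0,'
    '"function":{"name":"write","arguments":"{"}}]}}]}',
    'data: {"error":{"code":500,"message":"Failed to parse input at pos 138"}}',
]
args, err = read_tool_call_stream(sample)
```

With the sample above, the partial "{" fragment is dropped and the error object is returned to the caller, so the agent can retry or report the failure instead of executing an incomplete call.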
Prior report
This was reported in #17473. The assignee @pwilkin said they would add a configuration parameter, but the issue was auto-closed as stale before that happened.
Proposed fix
Make MAX_REPETITION_THRESHOLD configurable via a server parameter (e.g. --grammar-max-repetitions), or raise the default. The current value of 2000 is too low for standard tool-calling use cases.
For context, raising the threshold to 100,000 on a private server had no measurable impact:
- Grammar compilation: <2 ms for 24 tools (including one with 94 optional params)
- Per-token sampling: 34.12 ms/token with grammar vs 34.67 ms/token without
- Memory: ~4 MB additional (negligible)
The DoS protection from PR #18604 (stack overflow via nested repetitions, hangs via unbounded expansion) could be preserved with a higher default or a separate, tighter limit on nesting depth rather than total rule count.
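To make the last suggestion concrete, here is a toy model of checking nesting depth separately from total expansion. Everything in it is hypothetical: the Repeat node, the MAX_NESTING_DEPTH value, and the helper functions are illustrative only and do not correspond to llama.cpp's grammar data structures; the 100,000 figure is the value tried above.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Repeat:
    """Toy repetition node: body{0,max_times}."""
    body: List["Node"]
    max_times: int


Node = Union[str, Repeat]  # a str stands for a terminal or rule reference

MAX_NESTING_DEPTH = 4          # hypothetical limit on nested repetitions
MAX_TOTAL_EXPANSION = 100_000  # the higher total tried in this report


def depth(nodes: List[Node]) -> int:
    """Deepest chain of repetitions nested inside one another."""
    return max((1 + depth(n.body) for n in nodes if isinstance(n, Repeat)), default=0)


def expansion(nodes: List[Node]) -> int:
    """Number of elements after unrolling every repetition to its maximum."""
    total = 0
    for n in nodes:
        total += n.max_times * expansion(n.body) if isinstance(n, Repeat) else 1
    return total


def is_acceptable(nodes: List[Node]) -> bool:
    return depth(nodes) <= MAX_NESTING_DEPTH and expansion(nodes) <= MAX_TOTAL_EXPANSION


# A flat rule like the one generated for a 94-parameter tool: wide but shallow.
flat = [Repeat(body=[f"param_{i}" for i in range(94)], max_times=94)]

# A small but deeply nested rule, the shape a nesting limit is meant to catch.
nested: List[Node] = ["a"]
for _ in range(6):
    nested = [Repeat(body=nested, max_times=2)]
```

In this model the flat rule passes (depth 1, 8836 elements) while the nested rule is rejected on depth alone, even though it unrolls to only 64 elements.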
Environment
- Version: b8468 (commit 3306dba)
- Model: Qwen3.5-27B (unsloth/Qwen3.5-27B-GGUF:UD-Q5_K_XL)
- Chat format: peg-native
- Client: OpenClaw 2026.3.13, running PinchBench agent benchmark