Misc. bug: JSON Schema to GBNF grammar fails with tools that use PCRE shorthands #22314

@deiteris

Name and Version

.\llama-server.exe --version
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 16375 MiB):
Device 0: NVIDIA GeForce RTX 4070 Ti SUPER, compute capability 8.9, VMM: yes, VRAM: 16375 MiB
version: 8893 (79ef1c0)
built with MSVC 19.44.35222.0 for x64

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

llama-server

Command line

Problem description & steps to reproduce

As the title says, tool calls fail when a tool's JSON Schema contains PCRE shorthands, because the JSON Schema to GBNF conversion does not support them. My use case is a custom MCP server that converts OpenAPI endpoints to tools using the TypeScript MCP SDK, so the API definitions may contain offending regexps. I observed two issues:

  1. Regexps containing \d, \w, and \s, which often appear in API definitions, make the grammar parser fail.
  2. I'm not sure where exactly, but something in either the harness (GitHub Copilot) or the TypeScript MCP SDK seems to insert word-boundary \b shorthands into complex regexps. These also show up in the attached error log.
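
To find which tool schemas carry these shorthands before they reach the server, one option is to walk each schema and report the `pattern` values that contain them. The sketch below is only a debugging aid under stated assumptions: `find_shorthand_patterns` and the sample schema are hypothetical names made up for illustration, and the scan only looks for the four shorthands named above.

```python
import re
from typing import Any, Iterator, List, Tuple

# Matches \d, \w, \s or \b preceded by an even number of backslashes,
# so an escaped backslash followed by "d" (\\d) is not reported.
_SHORTHAND = re.compile(r"(?<!\\)(?:\\\\)*\\([dwsb])")


def find_shorthand_patterns(
    schema: Any, path: str = "$"
) -> Iterator[Tuple[str, str, List[str]]]:
    """Yield (json_path, pattern, shorthands) for each offending `pattern`."""
    if isinstance(schema, dict):
        for key, value in schema.items():
            child = f"{path}.{key}"
            if key == "pattern" and isinstance(value, str):
                hits = sorted({"\\" + m for m in _SHORTHAND.findall(value)})
                if hits:
                    yield child, value, hits
            else:
                # Covers nested schemas and a property that is itself named "pattern".
                yield from find_shorthand_patterns(value, child)
    elif isinstance(schema, list):
        for idx, item in enumerate(schema):
            yield from find_shorthand_patterns(item, f"{path}[{idx}]")


# Hypothetical tool schema, not taken from the failing MCP server.
sample_schema = {
    "type": "object",
    "properties": {
        "version": {"type": "string", "pattern": r"^v[\d]+\.[\d]+$"},
        "name": {"type": "string", "pattern": "^[a-z]+$"},
    },
}

for json_path, pattern, shorthands in find_shorthand_patterns(sample_schema):
    print(json_path, pattern, shorthands)
```

Running this over the tool list returned by the MCP server would show whether the \b shorthands are already present in the schemas or are added later by the harness.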

I made the following fix, which expands the few shorthands I use and skips the word boundary, since it cannot be expressed in GBNF. It works for me, but the list of shorthands is not exhaustive and was AI-generated, and I'm not confident in the fix because I haven't properly familiarized myself with the conversion code: deiteris@3dec7a7
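
For illustration, the idea behind the fix can be sketched in Python. This is not the code from the linked commit and not llama.cpp's converter: `expand_pcre_shorthands` and its mapping table are hypothetical, the expansions are ASCII-only by assumption, and negated shorthands such as \D are passed through unchanged.

```python
# Hypothetical mapping from shorthand letter to the body of a character class.
# ASCII-only definitions are an assumption of this sketch.
SHORTHAND_CLASSES = {
    "d": "0-9",
    "w": "a-zA-Z0-9_",
    "s": " \\t\\n\\r",
}


def expand_pcre_shorthands(pattern: str) -> str:
    """Rewrite \\d, \\w and \\s as explicit classes and drop \\b outside classes."""
    out = []
    in_class = False  # naive tracking; a literal "]" placed first in a class is not handled
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if nxt in SHORTHAND_CLASSES:
                body = SHORTHAND_CLASSES[nxt]
                # Inside [...] only the body is inserted, so [\d] becomes [0-9].
                out.append(body if in_class else f"[{body}]")
            elif nxt == "b" and not in_class:
                pass  # a word boundary is zero-width, so it is simply skipped
            else:
                out.append(ch + nxt)  # any other escape is kept as written
            i += 2
            continue
        if ch == "[" and not in_class:
            in_class = True
        elif ch == "]" and in_class:
            in_class = False
        out.append(ch)
        i += 1
    return "".join(out)


print(expand_pcre_shorthands(r"[\d]+\.[\d]+"))
print(expand_pcre_shorthands(r"\b-\w{4}"))
```

Note that skipping \b loosens the constraint: the resulting grammar can accept strings that the original regexp would reject.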

First Bad Commit

No response

Relevant log output

Logs
.\llama-server.exe -c 85000 -n 16384 -m C:\Temp\Qwen3.6-27B-UD-IQ3_XXS.gguf -ngl 99 -fa on --host 127.0.0.1 -a qwen3.5-27b --port 8080 --jinja -ctk q8_0 -ctv q8_0 --temperature 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --no-webui -np 1 --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 12 --draft-max 48 --chat-template-kwargs '{\"preserve_thinking\": true}'
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 16375 MiB):
  Device 0: NVIDIA GeForce RTX 4070 Ti SUPER, compute capability 8.9, VMM: yes, VRAM: 16375 MiB
build_info: b8917-c3ef7631c
system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
init: using 15 threads for HTTP server
Web UI is disabled
start: binding port with default address family
main: loading model
srv    load_model: loading model 'C:\Temp\Qwen3.6-27B-UD-IQ3_XXS.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
common_params_fit_impl: getting device memory data for initial parameters:
common_memory_breakdown_print: | memory breakdown [MiB]        | total    free     self   model   context   compute       unaccounted |
common_memory_breakdown_print: |   - CUDA0 (RTX 4070 Ti SUPER) | 16375 = 14935 + (14382 = 10907 +    2980 +     495) + 17592186031473 |
common_memory_breakdown_print: |   - Host                      |                    707 =   520 +       0 +     186                   |
common_params_fit_impl: projected to use 14382 MiB of device memory vs. 14935 MiB of free device memory
common_params_fit_impl: cannot meet free memory target of 1024 MiB, need to reduce device memory by 471 MiB
common_params_fit_impl: context size set by user to 85000 -> no change
common_fit_params: failed to fit params to free device memory: n_gpu_layers already set by user to 99, abort
common_fit_params: fitting params to free memory took 0.40 seconds
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4070 Ti SUPER) (0000:01:00.0) - 15085 MiB free
llama_model_loader: loaded meta data with 51 key-value pairs and 851 tensors from C:\Temp\Qwen3.6-27B-UD-IQ3_XXS.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen35
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                     general.sampling.top_k i32              = 20
llama_model_loader: - kv   3:                     general.sampling.top_p f32              = 0.950000
llama_model_loader: - kv   4:                      general.sampling.temp f32              = 1.000000
llama_model_loader: - kv   5:                               general.name str              = Qwen3.6-27B
llama_model_loader: - kv   6:                           general.basename str              = Qwen3.6-27B
llama_model_loader: - kv   7:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   8:                         general.size_label str              = 27B
llama_model_loader: - kv   9:                            general.license str              = apache-2.0
llama_model_loader: - kv  10:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3.6-2...
llama_model_loader: - kv  11:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv  12:                   general.base_model.count u32              = 1
llama_model_loader: - kv  13:                  general.base_model.0.name str              = Qwen3.6 27B
llama_model_loader: - kv  14:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  15:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3.6-27B
llama_model_loader: - kv  16:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
llama_model_loader: - kv  17:                         qwen35.block_count u32              = 64
llama_model_loader: - kv  18:                      qwen35.context_length u32              = 262144
llama_model_loader: - kv  19:                    qwen35.embedding_length u32              = 5120
llama_model_loader: - kv  20:                 qwen35.feed_forward_length u32              = 17408
llama_model_loader: - kv  21:                qwen35.attention.head_count u32              = 24
llama_model_loader: - kv  22:             qwen35.attention.head_count_kv u32              = 4
llama_model_loader: - kv  23:             qwen35.rope.dimension_sections arr[i32,4]       = [11, 11, 10, 0]
llama_model_loader: - kv  24:                      qwen35.rope.freq_base f32              = 10000000.000000
llama_model_loader: - kv  25:    qwen35.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  26:                qwen35.attention.key_length u32              = 256
llama_model_loader: - kv  27:              qwen35.attention.value_length u32              = 256
llama_model_loader: - kv  28:                     qwen35.ssm.conv_kernel u32              = 4
llama_model_loader: - kv  29:                      qwen35.ssm.state_size u32              = 128
llama_model_loader: - kv  30:                     qwen35.ssm.group_count u32              = 16
llama_model_loader: - kv  31:                  qwen35.ssm.time_step_rank u32              = 48
llama_model_loader: - kv  32:                      qwen35.ssm.inner_size u32              = 6144
llama_model_loader: - kv  33:             qwen35.full_attention_interval u32              = 4
llama_model_loader: - kv  34:                qwen35.rope.dimension_count u32              = 64
llama_model_loader: - kv  35:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  36:                         tokenizer.ggml.pre str              = qwen35
llama_model_loader: - kv  37:                      tokenizer.ggml.tokens arr[str,248320]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  38:                  tokenizer.ggml.token_type arr[i32,248320]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  39:                      tokenizer.ggml.merges arr[str,247587]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  40:                tokenizer.ggml.eos_token_id u32              = 248046
llama_model_loader: - kv  41:            tokenizer.ggml.padding_token_id u32              = 248055
llama_model_loader: - kv  42:                tokenizer.ggml.bos_token_id u32              = 248044
llama_model_loader: - kv  43:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  44:                    tokenizer.chat_template str              = {%- set image_count = namespace(value...
llama_model_loader: - kv  45:               general.quantization_version u32              = 2
llama_model_loader: - kv  46:                          general.file_type u32              = 23
llama_model_loader: - kv  47:                      quantize.imatrix.file str              = Qwen3.6-27B-GGUF/imatrix_unsloth.gguf
llama_model_loader: - kv  48:                   quantize.imatrix.dataset str              = unsloth_calibration_Qwen3.6-27B.txt
llama_model_loader: - kv  49:             quantize.imatrix.entries_count u32              = 496
llama_model_loader: - kv  50:              quantize.imatrix.chunks_count u32              = 76
llama_model_loader: - type  f32:  449 tensors
llama_model_loader: - type q3_K:    1 tensors
llama_model_loader: - type q4_K:   48 tensors
llama_model_loader: - type q5_K:    1 tensors
llama_model_loader: - type q6_K:   48 tensors
llama_model_loader: - type iq2_xs:    2 tensors
llama_model_loader: - type iq3_xxs:  254 tensors
llama_model_loader: - type iq3_s:   19 tensors
llama_model_loader: - type iq2_s:   24 tensors
llama_model_loader: - type iq4_xs:    5 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = IQ3_XXS - 3.0625 bpw
print_info: file size   = 11.16 GiB (3.56 BPW)
load: 0 unused tokens
load: printing all EOG tokens:
load:   - 248044 ('<|endoftext|>')
load:   - 248046 ('<|im_end|>')
load:   - 248063 ('<|fim_pad|>')
load:   - 248064 ('<|repo_name|>')
load:   - 248065 ('<|file_sep|>')
load: special tokens cache size = 33
load: token to piece cache size = 1.7581 MB
print_info: arch                  = qwen35
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 262144
print_info: n_embd                = 5120
print_info: n_embd_inp            = 5120
print_info: n_layer               = 64
print_info: n_head                = 24
print_info: n_head_kv             = 4
print_info: n_rot                 = 64
print_info: n_swa                 = 0
print_info: is_swa_any            = 0
print_info: n_embd_head_k         = 256
print_info: n_embd_head_v         = 256
print_info: n_gqa                 = 6
print_info: n_embd_k_gqa          = 1024
print_info: n_embd_v_gqa          = 1024
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-06
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 0.0e+00
print_info: n_ff                  = 17408
print_info: n_expert              = 0
print_info: n_expert_used         = 0
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 1
print_info: pooling type          = -1
print_info: rope type             = 40
print_info: rope scaling          = linear
print_info: freq_base_train       = 10000000.0
print_info: freq_scale_train      = 1
print_info: n_ctx_orig_yarn       = 262144
print_info: rope_yarn_log_mul     = 0.0000
print_info: rope_finetuned        = unknown
print_info: mrope sections        = [11, 11, 10, 0]
print_info: ssm_d_conv            = 4
print_info: ssm_d_inner           = 6144
print_info: ssm_d_state           = 128
print_info: ssm_dt_rank           = 48
print_info: ssm_n_group           = 16
print_info: ssm_dt_b_c_rms        = 0
print_info: model type            = 27B
print_info: model params          = 26.90 B
print_info: general.name          = Qwen3.6-27B
print_info: vocab type            = BPE
print_info: n_vocab               = 248320
print_info: n_merges              = 247587
print_info: BOS token             = 248044 '<|endoftext|>'
print_info: EOS token             = 248046 '<|im_end|>'
print_info: EOT token             = 248046 '<|im_end|>'
print_info: PAD token             = 248055 '<|vision_pad|>'
print_info: LF token              = 198 'Ċ'
print_info: FIM PRE token         = 248060 '<|fim_prefix|>'
print_info: FIM SUF token         = 248062 '<|fim_suffix|>'
print_info: FIM MID token         = 248061 '<|fim_middle|>'
print_info: FIM PAD token         = 248063 '<|fim_pad|>'
print_info: FIM REP token         = 248064 '<|repo_name|>'
print_info: FIM SEP token         = 248065 '<|file_sep|>'
print_info: EOG token             = 248044 '<|endoftext|>'
print_info: EOG token             = 248046 '<|im_end|>'
print_info: EOG token             = 248063 '<|fim_pad|>'
print_info: EOG token             = 248064 '<|repo_name|>'
print_info: EOG token             = 248065 '<|file_sep|>'
print_info: max token length      = 256
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 63 repeating layers to GPU
load_tensors: offloaded 65/65 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   521.00 MiB
load_tensors:        CUDA0 model buffer size = 10907.64 MiB
...........................................................................................
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|im_end|> logit bias = -inf
common_init_result: added <|fim_pad|> logit bias = -inf
common_init_result: added <|repo_name|> logit bias = -inf
common_init_result: added <|file_sep|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 85248
llama_context: n_ctx_seq     = 85248
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 10000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (85248) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.95 MiB
llama_kv_cache:      CUDA0 KV buffer size =  2830.50 MiB
llama_kv_cache: size = 2830.50 MiB ( 85248 cells,  16 layers,  1/1 seqs), K (q8_0): 1415.25 MiB, V (q8_0): 1415.25 MiB
llama_kv_cache: attn_rot_k = 1, n_embd_head_k_all = 256
llama_kv_cache: attn_rot_v = 1, n_embd_head_k_all = 256
llama_memory_recurrent:      CUDA0 RS buffer size =   149.62 MiB
llama_memory_recurrent: size =  149.62 MiB (     1 cells,  64 layers,  1 seqs), R (f32):    5.62 MiB, S (f32):  144.00 MiB
sched_reserve: reserving ...
sched_reserve: resolving fused Gated Delta Net support:
sched_reserve: fused Gated Delta Net (autoregressive) enabled
sched_reserve: fused Gated Delta Net (chunked) enabled
sched_reserve:      CUDA0 compute buffer size =   495.00 MiB
sched_reserve:  CUDA_Host compute buffer size =   186.79 MiB
sched_reserve: graph nodes  = 3849
sched_reserve: graph splits = 2
sched_reserve: reserve took 16.73 ms, sched copies = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv    load_model: initializing slots, n_slots = 1
common_context_can_seq_rm: the target context does not support partial sequence removal
srv    load_model: speculative decoding will use checkpoints
common_speculative_init: initialized ngram_mod with n=24, size=4194304 (16.000 MB)
slot   load_model: id  0 | task -1 | speculative decoding context initialized
slot   load_model: id  0 | task -1 | new slot, n_ctx = 85248
srv    load_model: prompt cache is enabled, size limit: 8192 MiB
srv    load_model: use `--cache-ram 0` to disable the prompt cache
srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
srv          init: init: --cache-idle-slots requires --kv-unified, disabling
init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
<think>

</think>

Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
<think>
'
srv          init: init: chat template, thinking = 1
main: model loaded
main: server is listening on http://127.0.0.1:8080
main: starting the main loop...
srv  update_slots: all slots are idle
srv  log_server_r: done request: GET /api/version 127.0.0.1 200
srv  log_server_r: done request: GET /api/tags 127.0.0.1 200
srv  log_server_r: done request: POST /api/show 127.0.0.1 200
srv  params_from_: Chat format: peg-native
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
srv  get_availabl: updating prompt cache
srv          load:  - looking for better prompt, base f_keep = -1.000, sim = 0.000
srv        update:  - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 85248 tokens, 8589934592 est)
srv  get_availabl: prompt cache update took 0.01 ms
parse: error parsing grammar: unknown escape at \d]+ "." [\d]+) ("~" ([a-z] [a-z0-9_]* "." [a-z] [a-z0-9_]* "." [a-z_] [a-z0-9_.]* ".v" [\d]+ "." [\d]+))* ("~" tool-mcp-spec-server-m-add-block-arg-body-schema-type-1{8,8} "\b-" tool-mcp-spec-server-m-add-block-arg-body-schema-type-1{4,4} "\b-" tool-mcp-spec-server-m-add-block-arg-body-schema-type-1{4,4} "\b-" tool-mcp-spec-server-m-add-block-arg-body-schema-type-1{4,4} "\b-" tool-mcp-spec-server-m-add-block-arg-body-schema-type-1{12,12})?) "\"" space
