Skip to content

Eval bug: spec-type ngram-mod crash with Qwen3Next #19267

@MaxKruse

Description

@MaxKruse

Name and Version

llama-server --version
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from C:\tools\llamacpp\ggml-cuda.dll
load_backend: loaded RPC backend from C:\tools\llamacpp\ggml-rpc.dll
load_backend: loaded CPU backend from C:\tools\llamacpp\ggml-cpu-zen4.dll
version: 7907 (59377a6)
built with Clang 19.1.5 for Windows x86_64

Operating systems

Windows

GGML backends

CUDA

Hardware

Ryzen 7950x, 64GB ram. RTX 4070 12GB.

Models

qwen3-next-80b-a3b-instruct q4 k xl

Problem description & steps to reproduce

Running with new ngram self-spec decode:

llama-server -m "C:\Users\maxkr.lmstudio\models\unsloth\Qwen3-Next-80B-A3B-Instruct-GGUF\Qwen3-Next-80B-A3B-Instruct-UD-Q4_K_XL.gguf" --ctx-size $(1024*60) --top-k 20 --temp 0.7 --batch-size 512 --parallel 1 --threads 12 --flash-attn on -ctvd q4_0 -ctkd q4_0 --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64 --fit on

And using the WebUI to do these queries:

write a small quicksort in python.
[LLM Output]
In the same codeblock, add a bubblesort as well.
[LLM Output starts, then crashes]

The same does not happen with GLM4.7-Flash at all, so i presume it has to do with the model architecture somehow.

First Bad Commit

No response

Relevant log output

Logs
srv          init: init: chat template, thinking = 0
main: model loaded
main: server is listening on http://127.0.0.1:8080
main: starting the main loop...
srv  update_slots: all slots are idle
srv  params_from_: Chat format: Hermes 2 Pro
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  0 | task 0 | processing task, is_child = 0
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 61440, n_keep = 0, task.n_tokens = 16
slot update_slots: id  0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 16, batch.n_tokens = 16, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_tokens = 16, batch.n_tokens = 16
slot init_sampler: id  0 | task 0 | init sampler, took 0.00 ms, tokens: text = 16, total = 16
slot print_timing: id  0 | task 0 |
prompt eval time =     177.66 ms /    16 tokens (   11.10 ms per token,    90.06 tokens per second)
       eval time =   10918.42 ms /   289 tokens (   37.78 ms per token,    26.47 tokens per second)
      total time =   11096.08 ms /   305 tokens
statistics ngram_mod: #calls = 288, #gen drafts = 0, #acc drafts = 0, #gen tokens = 0, #acc tokens = 0, dur(b,g,a) = 0.002, 0.498, 0.000 ms
slot      release: id  0 | task 0 | stop processing: n_tokens = 304, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
srv  log_server_r: done request: GET / 127.0.0.1 200
srv  params_from_: Chat format: Hermes 2 Pro
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.930 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  0 | task 290 | processing task, is_child = 0
slot update_slots: id  0 | task 290 | new prompt, n_ctx_slot = 61440, n_keep = 0, task.n_tokens = 327
slot update_slots: id  0 | task 290 | n_tokens = 304, memory_seq_rm [304, end)
slot update_slots: id  0 | task 290 | prompt processing progress, n_tokens = 327, batch.n_tokens = 23, progress = 1.000000
slot update_slots: id  0 | task 290 | prompt done, n_tokens = 327, batch.n_tokens = 23
slot init_sampler: id  0 | task 290 | init sampler, took 0.03 ms, tokens: text = 327, total = 327
slot update_slots: id  0 | task 290 | created context checkpoint 1 of 8 (pos_min = 303, pos_max = 303, size = 75.376 MiB)
begin: ngram_mod occupancy = 303/4194304 (0.00)
accept: accepted 64 tokens from 64 drafted tokens
accept: accepted 5 tokens from 64 drafted tokens
init: the tokens of sequence 0 in the input batch have inconsistent sequence positions:
 - the last position stored in the memory module of the context (i.e. the KV cache) for sequence 0 is X = 634
 - the tokens for sequence 0 in the input batch have a starting position of Y = 576
 it is required that the sequence positions remain consecutive: Y = X + 1
decode: failed to initialize batch
llama_decode: failed to decode, ret = -1
srv  update_slots: Invalid input batch. i = 0, n_batch = 512, ret = -1
srv    send_error: task id = 290, error: Invalid input batch.
slot      release: id  0 | task 290 | stop processing: n_tokens = 577, truncated = 0
slot prompt_clear: id  0 | task -1 | clearing prompt with 577 tokens
srv  update_slots: all slots are idle
srv          stop: cancel task, id_task = 290
srv  update_slots: all slots are idle
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200

Logs with $env:LLAMA_TRACE=1

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions