srv init: init: chat template, thinking = 0
main: model loaded
main: server is listening on http://127.0.0.1:8080
main: starting the main loop...
srv update_slots: all slots are idle
srv params_from_: Chat format: Hermes 2 Pro
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id 0 | task 0 | processing task, is_child = 0
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 61440, n_keep = 0, task.n_tokens = 16
slot update_slots: id 0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 16, batch.n_tokens = 16, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_tokens = 16, batch.n_tokens = 16
slot init_sampler: id 0 | task 0 | init sampler, took 0.00 ms, tokens: text = 16, total = 16
slot print_timing: id 0 | task 0 |
prompt eval time = 177.66 ms / 16 tokens ( 11.10 ms per token, 90.06 tokens per second)
eval time = 10918.42 ms / 289 tokens ( 37.78 ms per token, 26.47 tokens per second)
total time = 11096.08 ms / 305 tokens
statistics ngram_mod: #calls = 288, #gen drafts = 0, #acc drafts = 0, #gen tokens = 0, #acc tokens = 0, dur(b,g,a) = 0.002, 0.498, 0.000 ms
slot release: id 0 | task 0 | stop processing: n_tokens = 304, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
srv log_server_r: done request: GET / 127.0.0.1 200
srv params_from_: Chat format: Hermes 2 Pro
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.930 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id 0 | task 290 | processing task, is_child = 0
slot update_slots: id 0 | task 290 | new prompt, n_ctx_slot = 61440, n_keep = 0, task.n_tokens = 327
slot update_slots: id 0 | task 290 | n_tokens = 304, memory_seq_rm [304, end)
slot update_slots: id 0 | task 290 | prompt processing progress, n_tokens = 327, batch.n_tokens = 23, progress = 1.000000
slot update_slots: id 0 | task 290 | prompt done, n_tokens = 327, batch.n_tokens = 23
slot init_sampler: id 0 | task 290 | init sampler, took 0.03 ms, tokens: text = 327, total = 327
slot update_slots: id 0 | task 290 | created context checkpoint 1 of 8 (pos_min = 303, pos_max = 303, size = 75.376 MiB)
begin: ngram_mod occupancy = 303/4194304 (0.00)
accept: accepted 64 tokens from 64 drafted tokens
accept: accepted 5 tokens from 64 drafted tokens
init: the tokens of sequence 0 in the input batch have inconsistent sequence positions:
- the last position stored in the memory module of the context (i.e. the KV cache) for sequence 0 is X = 634
- the tokens for sequence 0 in the input batch have a starting position of Y = 576
it is required that the sequence positions remain consecutive: Y = X + 1
decode: failed to initialize batch
llama_decode: failed to decode, ret = -1
srv update_slots: Invalid input batch. i = 0, n_batch = 512, ret = -1
srv send_error: task id = 290, error: Invalid input batch.
slot release: id 0 | task 290 | stop processing: n_tokens = 577, truncated = 0
slot prompt_clear: id 0 | task -1 | clearing prompt with 577 tokens
srv update_slots: all slots are idle
srv stop: cancel task, id_task = 290
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
Name and Version
llama-server --version
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from C:\tools\llamacpp\ggml-cuda.dll
load_backend: loaded RPC backend from C:\tools\llamacpp\ggml-rpc.dll
load_backend: loaded CPU backend from C:\tools\llamacpp\ggml-cpu-zen4.dll
version: 7907 (59377a6)
built with Clang 19.1.5 for Windows x86_64
Operating systems
Windows
GGML backends
CUDA
Hardware
Ryzen 9 7950X, 64 GB RAM, RTX 4070 12 GB.
Models
Qwen3-Next-80B-A3B-Instruct (Q4_K_XL)
Problem description & steps to reproduce
Running with the new ngram self-speculative decoding:
And using the WebUI to run these queries:
The same failure does not occur with GLM4.7-Flash at all, so I presume it is related to the model architecture.
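For reference, the consistency rule quoted in the error message (Y = X + 1) can be sketched as below. This is a minimal illustration in Python, not llama.cpp's actual code: the function names are hypothetical, and the numbers are copied from the log (X = 634, Y = 576, and the preceding "accepted 5 tokens from 64 drafted tokens" line).

```python
def positions_consistent(last_cached_pos: int, batch_start_pos: int) -> bool:
    """Hypothetical helper: True when the batch starts right after the cache (Y == X + 1)."""
    return batch_start_pos == last_cached_pos + 1


def stale_positions(last_cached_pos: int, batch_start_pos: int) -> int:
    """Hypothetical helper: count cached positions at or beyond the batch start."""
    return max(0, last_cached_pos - batch_start_pos + 1)


# Values taken from the log output in this report.
X = 634          # last position stored in the KV cache for sequence 0
Y = 576          # starting position of the incoming batch
drafted = 64     # tokens drafted in the last speculative step
accepted = 5     # tokens accepted in the last speculative step

print(positions_consistent(X, Y))   # False: 576 != 634 + 1
print(stale_positions(X, Y))        # 59 cached positions overlap the new batch
print(drafted - accepted)           # 59 rejected draft tokens
```

The 59 leftover positions equal the 64 - 5 = 59 rejected draft tokens. That would be consistent with the rejected drafts not having been removed from the cache before the next batch was submitted, but the log alone does not confirm it.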
First Bad Commit
No response
Relevant log output
Logs
Logs with $env:LLAMA_TRACE=1