0.34.574.824 I srv params_from_: Chat format: peg-native
0.34.610.181 I srv prompt_get_n: message_spans: last user message: byte_pos=196871, media=0, n_before_user=80246
0.34.610.296 I slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
0.34.610.298 I srv get_availabl: updating prompt cache
0.34.610.304 I srv load: - looking for better prompt, base f_keep = -1.000, sim = 0.000
0.34.610.309 I srv update: - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 129024 tokens, 8589934592 est)
0.34.610.310 I srv get_availabl: prompt cache update took 0.01 ms
0.34.614.156 I reasoning-budget: activated, budget=16384 tokens
0.34.614.191 I slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
0.34.614.202 I slot launch_slot_: id 0 | task -1 | sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 129024
top_k = 20, top_p = 0.950, min_p = 0.000, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.600
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000, adaptive_target = -1.000, adaptive_decay = 0.900
0.34.614.203 I slot launch_slot_: id 0 | task 0 | processing task, is_child = 0
0.34.614.241 I slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 129024, n_keep = 0, task.n_tokens = 80282
0.34.614.530 I slot update_slots: id 0 | task 0 | cached n_tokens = 0, memory_seq_rm [0, end)
0.43.180.227 I sched_reserve: reserving ...
0.43.425.854 I sched_reserve: Vulkan1 compute buffer size = 3427.02 MiB
0.43.425.857 I sched_reserve: Vulkan_Host compute buffer size = 2098.08 MiB
0.43.425.858 I sched_reserve: graph nodes = 57
0.43.425.858 I sched_reserve: graph splits = 2
0.43.425.859 I sched_reserve: reserve took 245.62 ms, sched copies = 4
0.43.517.478 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 2048, progress = 0.03, t = 8.90 s / 230.03 tokens per second
0.43.517.483 I slot update_slots: id 0 | task 0 | cached n_tokens = 2048, memory_seq_rm [2048, end)
0.46.372.351 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 4096, progress = 0.05, t = 11.76 s / 348.35 tokens per second
0.46.372.356 I slot update_slots: id 0 | task 0 | cached n_tokens = 4096, memory_seq_rm [4096, end)
0.49.291.505 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 6144, progress = 0.08, t = 14.68 s / 418.61 tokens per second
0.49.291.509 I slot update_slots: id 0 | task 0 | cached n_tokens = 6144, memory_seq_rm [6144, end)
0.52.296.178 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 8192, progress = 0.10, t = 17.68 s / 463.30 tokens per second
0.52.296.182 I slot update_slots: id 0 | task 0 | cached n_tokens = 8192, memory_seq_rm [8192, end)
0.55.363.366 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 10240, progress = 0.13, t = 20.75 s / 493.51 tokens per second
0.55.363.369 I slot update_slots: id 0 | task 0 | cached n_tokens = 10240, memory_seq_rm [10240, end)
0.58.517.159 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 12288, progress = 0.15, t = 23.90 s / 514.08 tokens per second
0.58.517.163 I slot update_slots: id 0 | task 0 | cached n_tokens = 12288, memory_seq_rm [12288, end)
1.01.742.768 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 14336, progress = 0.18, t = 27.13 s / 528.45 tokens per second
1.01.742.772 I slot update_slots: id 0 | task 0 | cached n_tokens = 14336, memory_seq_rm [14336, end)
1.05.050.816 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 16384, progress = 0.20, t = 30.44 s / 538.30 tokens per second
1.05.050.819 I slot update_slots: id 0 | task 0 | cached n_tokens = 16384, memory_seq_rm [16384, end)
1.08.428.764 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 18432, progress = 0.23, t = 33.81 s / 545.09 tokens per second
1.08.428.767 I slot update_slots: id 0 | task 0 | cached n_tokens = 18432, memory_seq_rm [18432, end)
1.11.878.858 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 20480, progress = 0.26, t = 37.26 s / 549.58 tokens per second
1.11.878.861 I slot update_slots: id 0 | task 0 | cached n_tokens = 20480, memory_seq_rm [20480, end)
1.15.327.630 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 22528, progress = 0.28, t = 40.71 s / 553.33 tokens per second
1.15.327.634 I slot update_slots: id 0 | task 0 | cached n_tokens = 22528, memory_seq_rm [22528, end)
1.18.936.925 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 24576, progress = 0.31, t = 44.32 s / 554.48 tokens per second
1.18.936.928 I slot update_slots: id 0 | task 0 | cached n_tokens = 24576, memory_seq_rm [24576, end)
1.22.621.785 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 26624, progress = 0.33, t = 48.01 s / 554.58 tokens per second
1.22.621.788 I slot update_slots: id 0 | task 0 | cached n_tokens = 26624, memory_seq_rm [26624, end)
1.26.371.638 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 28672, progress = 0.36, t = 51.76 s / 553.97 tokens per second
1.26.371.641 I slot update_slots: id 0 | task 0 | cached n_tokens = 28672, memory_seq_rm [28672, end)
1.30.189.644 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 30720, progress = 0.38, t = 55.58 s / 552.76 tokens per second
1.30.189.648 I slot update_slots: id 0 | task 0 | cached n_tokens = 30720, memory_seq_rm [30720, end)
1.34.101.764 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 32768, progress = 0.41, t = 59.49 s / 550.84 tokens per second
1.34.101.768 I slot update_slots: id 0 | task 0 | cached n_tokens = 32768, memory_seq_rm [32768, end)
1.38.086.343 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 34816, progress = 0.43, t = 63.47 s / 548.52 tokens per second
1.38.086.346 I slot update_slots: id 0 | task 0 | cached n_tokens = 34816, memory_seq_rm [34816, end)
1.42.160.153 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 36864, progress = 0.46, t = 67.55 s / 545.76 tokens per second
1.42.160.157 I slot update_slots: id 0 | task 0 | cached n_tokens = 36864, memory_seq_rm [36864, end)
1.46.320.262 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 38912, progress = 0.48, t = 71.71 s / 542.66 tokens per second
1.46.320.265 I slot update_slots: id 0 | task 0 | cached n_tokens = 38912, memory_seq_rm [38912, end)
1.50.558.073 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 40960, progress = 0.51, t = 75.94 s / 539.35 tokens per second
1.50.558.077 I slot update_slots: id 0 | task 0 | cached n_tokens = 40960, memory_seq_rm [40960, end)
1.54.872.295 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 43008, progress = 0.54, t = 80.26 s / 535.87 tokens per second
1.54.872.298 I slot update_slots: id 0 | task 0 | cached n_tokens = 43008, memory_seq_rm [43008, end)
1.59.269.809 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 45056, progress = 0.56, t = 84.66 s / 532.23 tokens per second
1.59.269.813 I slot update_slots: id 0 | task 0 | cached n_tokens = 45056, memory_seq_rm [45056, end)
2.03.746.864 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 47104, progress = 0.59, t = 89.13 s / 528.47 tokens per second
2.03.746.867 I slot update_slots: id 0 | task 0 | cached n_tokens = 47104, memory_seq_rm [47104, end)
2.08.283.632 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 49152, progress = 0.61, t = 93.67 s / 524.74 tokens per second
2.08.283.635 I slot update_slots: id 0 | task 0 | cached n_tokens = 49152, memory_seq_rm [49152, end)
2.12.921.468 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 51200, progress = 0.64, t = 98.31 s / 520.82 tokens per second
2.12.921.471 I slot update_slots: id 0 | task 0 | cached n_tokens = 51200, memory_seq_rm [51200, end)
2.17.655.361 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 53248, progress = 0.66, t = 103.04 s / 516.76 tokens per second
2.17.655.364 I slot update_slots: id 0 | task 0 | cached n_tokens = 53248, memory_seq_rm [53248, end)
2.22.471.788 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 55296, progress = 0.69, t = 107.86 s / 512.68 tokens per second
2.22.471.792 I slot update_slots: id 0 | task 0 | cached n_tokens = 55296, memory_seq_rm [55296, end)
2.27.379.986 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 57344, progress = 0.71, t = 112.77 s / 508.52 tokens per second
2.27.379.989 I slot update_slots: id 0 | task 0 | cached n_tokens = 57344, memory_seq_rm [57344, end)
2.32.361.550 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 59392, progress = 0.74, t = 117.75 s / 504.40 tokens per second
2.32.361.553 I slot update_slots: id 0 | task 0 | cached n_tokens = 59392, memory_seq_rm [59392, end)
2.34.636.656 W srv next: stopping wait for next result due to should_stop condition (adjust the --timeout argument if needed)
2.34.636.660 W srv next: ref: https://github.com/ggml-org/llama.cpp/pull/22907
2.34.636.970 W srv stop: cancel task, id_task = 0
2.34.637.094 I srv log_server_r: done request: POST /v1/chat/completions 10.89.0.4 200
2.37.410.778 I slot release: id 0 | task 0 | stop processing: n_tokens = 61440, truncated = 0
2.37.410.789 I srv update_slots: all slots are idle
2.37.672.230 I srv params_from_: Chat format: peg-native
2.37.712.150 I srv prompt_get_n: message_spans: last user message: byte_pos=196871, media=0, n_before_user=80246
2.37.712.295 I slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.765 (> 0.100 thold), f_keep = 1.000
2.37.713.060 I reasoning-budget: activated, budget=16384 tokens
2.37.713.125 I slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
2.37.713.135 I slot launch_slot_: id 0 | task -1 | sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 129024
top_k = 20, top_p = 0.950, min_p = 0.000, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.600
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000, adaptive_target = -1.000, adaptive_decay = 0.900
2.37.713.139 I slot launch_slot_: id 0 | task 32 | processing task, is_child = 0
2.37.713.146 I slot update_slots: id 0 | task 32 | new prompt, n_ctx_slot = 129024, n_keep = 0, task.n_tokens = 80282
2.37.713.180 W slot update_slots: id 0 | task 32 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
2.37.713.182 I slot update_slots: id 0 | task 32 | cached n_tokens = 0, memory_seq_rm [0, end)
2.40.612.613 I slot update_slots: id 0 | task 32 | cached n_tokens = 2048, memory_seq_rm [2048, end)
2.43.548.226 I slot print_timing: id 0 | task 32 | prompt processing, n_tokens = 4096, progress = 0.05, t = 5.84 s / 701.96 tokens per second
2.43.548.230 I slot update_slots: id 0 | task 32 | cached n_tokens = 4096, memory_seq_rm [4096, end)
2.46.584.217 I slot print_timing: id 0 | task 32 | prompt processing, n_tokens = 6144, progress = 0.08, t = 8.87 s / 692.59 tokens per second
2.46.584.221 I slot update_slots: id 0 | task 32 | cached n_tokens = 6144, memory_seq_rm [6144, end)
Name and Version
llama-server
version: 9354 (9777256)
container vulkan-full
Operating systems
Linux
GGML backends
Vulkan
Hardware
RTX 3090 + RX 7900
Models
unsloth/Qwen3.6-27B-MTP-GGUF:Q8_0
Problem description & steps to reproduce
Cannot load a large context: the request times out at around 2 minutes despite an explicit --timeout 600. The KV cache does not seem to be kept for the chunks that were already processed. I get stop: cancel task, id_task = 0, then prompt processing starts again from the beginning and times out again, in a loop.
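To pin down the "around 2 min" figure, here is a rough check of the log timestamps. It assumes the leading field is minutes.seconds.milliseconds.microseconds (my reading of the format; it agrees with the t = 8.90 s reported by the first print_timing line). The helper name log_seconds is made up for this sketch.

```python
import re

# Assumption: the leading log field is minutes.seconds.milliseconds.microseconds.
TS = re.compile(r"^(\d+)\.(\d{2})\.(\d{3})\.(\d{3})\b")

def log_seconds(line: str) -> float:
    """Convert the leading timestamp of a log line to seconds."""
    m = TS.match(line)
    if m is None:
        raise ValueError(f"no timestamp in: {line!r}")
    minutes, seconds, millis, micros = map(int, m.groups())
    return minutes * 60 + seconds + millis / 1e3 + micros / 1e6

# Two lines copied (shortened) from the log of the first request.
start = log_seconds("0.34.610.296 I slot get_availabl: id 0 | task -1 | selected slot by LRU")
cancel = log_seconds("2.34.636.656 W srv next: stopping wait for next result due to should_stop condition")

elapsed = cancel - start
print(f"{elapsed:.3f} s between slot selection and cancel")
```

With these two lines the gap comes out at roughly 120 s, far below the 600 s passed to --timeout. That may mean the cutoff is coming from somewhere other than that flag, but I have not confirmed where.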
LLAMA_ARG_CACHE_RAM=8192
LLAMA_ARG_CTX_CHECKPOINTS=32
--hf-repo unsloth/Qwen3.6-27B-MTP-GGUF:Q8_0 --temp 0.6 --top-p 0.95 --min-p 0.00 --repeat-penalty 1 --reasoning-budget 16384 --no-mmproj --kv-unified --parallel 1 --main-gpu 1 --split-mode layer -lv 4 --spec-type draft-mtp --spec-draft-n-max 2 --timeout 600
First Bad Commit
Not sure. I only started hitting 2+ minute load times after MTP was introduced, although my guess is that MTP is relevant only because it slowed down prompt processing.
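For a sense of how long the full prompt would need at the observed speed, here is a back-of-the-envelope estimate from the last two print_timing lines of task 0. It assumes the remaining chunks run at the rate of the most recent 2048-token chunk. Since per-chunk time keeps growing in the log, this is likely an optimistic lower bound rather than a real prediction.

```python
# (n_tokens, t in seconds) copied from the last two print_timing lines of task 0.
prev_tokens, prev_t = 57344, 112.77
last_tokens, last_t = 59392, 117.75
total_tokens = 80282  # task.n_tokens from the "new prompt" line

# Throughput of the most recent chunk only, not the cumulative average.
chunk_rate = (last_tokens - prev_tokens) / (last_t - prev_t)

# Assumption: the rest of the prompt is processed at that same rate.
remaining = total_tokens - last_tokens
eta = last_t + remaining / chunk_rate

print(f"last chunk: {chunk_rate:.1f} tokens/s")
print(f"estimated total prompt processing time: {eta:.1f} s")
```

Under that assumption the whole 80282-token prompt needs somewhat under 170 s, so it cannot finish inside a window of about 2 minutes at this speed.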
Relevant log output
Logs