...
0.00.285.991 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.285.993 I device_info:
0.00.349.851 I - CUDA0 : NVIDIA GeForce RTX 5080 (15842 MiB, 15559 MiB free)
0.00.407.570 I - CUDA1 : NVIDIA GeForce RTX 5070 Ti (15842 MiB, 15597 MiB free)
0.00.407.576 I - CPU : AMD Ryzen 7 9700X 8-Core Processor (60909 MiB, 60909 MiB free)
0.00.407.609 I system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CUDA : ARCHS = 1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | BLACKWELL_NATIVE_FP4 = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
0.00.407.630 I srv init: running without SSL
0.00.407.640 I srv init: using 15 threads for HTTP server
0.00.407.688 W srv llama_server: -----------------
0.00.407.689 W srv llama_server: Built-in tools are enabled, do not expose server to untrusted environments
0.00.407.689 W srv llama_server: This feature is EXPERIMENTAL and may be changed in the future
0.00.407.689 W srv llama_server: -----------------
0.00.407.690 I srv start: binding port with default address family
0.00.408.819 I srv llama_server: loading model
0.00.408.822 I srv load_model: loading model '/home/user/.lmstudio/models/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-GGUF/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4.gguf'
0.00.711.442 I srv load_model: [spec] estimated memory usage of MTP context is 1736.27 MiB
0.00.711.457 I common_init_result: fitting params to device memory ...
0.00.711.458 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.00.711.505 W common_fit_params: failed to fit params to free device memory: llama_params_fit is not implemented for SPLIT_MODE_TENSOR, abort
0.10.007.277 W NCCL not compiled in; falling back to internal AllReduce. Recompile with -DGGML_CUDA_NCCL=ON for best multi-GPU performance.
0.10.099.684 I srv load_model: creating MTP draft context against the target model '/home/user/.lmstudio/models/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-GGUF/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4.gguf'
0.10.099.702 W NCCL not compiled in; falling back to internal AllReduce. Recompile with -DGGML_CUDA_NCCL=ON for best multi-GPU performance.
0.10.165.502 I srv load_model: initializing slots, n_slots = 1
0.10.614.464 I common_context_can_seq_rm: the context supports bounded partial sequence removal
0.10.622.978 I common_speculative_impl_draft_mtp: adding speculative implementation 'draft-mtp'
0.10.622.982 I common_speculative_impl_draft_mtp: - n_max=3, n_min=0, p_min=0.00, n_embd=5120, backend_sampling=1
0.10.622.987 I common_speculative_impl_draft_mtp: - gpu_layers=-1, cache_k=q8_0, cache_v=q8_0, ctx_tgt=yes, ctx_dft=yes, devices=[default]
0.10.623.035 W set_sampler: backend sampling not supported with SPLIT_MODE_TENSOR; using CPU
0.10.623.036 W common_speculative_impl_draft_mtp: backend offload failed for seq_id=0; using CPU sampler
0.10.623.046 I srv load_model: speculative decoding context initialized
0.10.623.047 I slot load_model: id 0 | task -1 | new slot, n_ctx = 262144
0.10.623.082 I srv load_model: prompt cache is enabled, size limit: 24576 MiB
0.10.623.082 I srv load_model: use `--cache-ram 0` to disable the prompt cache
0.10.623.082 I srv load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
0.10.623.082 I srv load_model: context checkpoints enabled, max = 64, min spacing = 256
0.10.623.092 I srv init: idle slots will be saved to prompt cache and cleared upon starting a new task
0.10.633.134 I init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
<think>
</think>
'
0.10.641.111 I srv init: init: chat template, thinking = 0
0.10.641.121 I srv llama_server: model loaded
0.10.641.124 I srv llama_server: server is listening on http://0.0.0.0:1234
0.10.641.127 I srv update_slots: all slots are idle
0.23.525.715 I srv params_from_: Chat format: peg-native
0.23.527.110 I slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
0.23.527.111 I srv get_availabl: updating prompt cache
0.23.527.115 I srv load: - looking for better prompt, base f_keep = -1.000, sim = 0.000
0.23.527.117 I srv update: - cache state: 0 prompts, 0.000 MiB (limits: 24576.000 MiB, 262144 tokens, 25769803776 est)
0.23.527.118 I srv get_availabl: prompt cache update took 0.01 ms
0.23.527.272 I reasoning-budget: activated, budget=2147483647 tokens
0.23.527.273 I reasoning-budget: deactivated (natural end)
0.23.527.279 I slot launch_slot_: id 0 | task 0 | processing task, is_child = 0
0.52.994.289 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 1118, progress = 0.81, t = 29.47 s / 37.94 tokens per second
0.59.254.081 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 1359, progress = 0.99, t = 35.73 s / 38.04 tokens per second
0.59.316.435 I slot create_check: id 0 | task 0 | created context checkpoint 1 of 64 (pos_min = 1358, pos_max = 1358, n_tokens = 1359, size = 152.472 MiB)
0.59.745.652 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 1374, progress = 1.00, t = 36.22 s / 37.94 tokens per second
1.04.254.825 I slot print_timing: id 0 | task 0 | n_decoded = 103, tg = 23.61 t/s
1.07.282.544 I slot print_timing: id 0 | task 0 | n_decoded = 189, tg = 25.57 t/s
1.10.382.480 I slot print_timing: id 0 | task 0 | n_decoded = 273, tg = 26.02 t/s
1.13.488.735 I slot print_timing: id 0 | task 0 | n_decoded = 361, tg = 26.55 t/s
1.16.622.245 I slot print_timing: id 0 | task 0 | n_decoded = 442, tg = 26.42 t/s
1.19.719.839 I slot print_timing: id 0 | task 0 | n_decoded = 524, tg = 26.43 t/s
1.22.798.764 I slot print_timing: id 0 | task 0 | n_decoded = 603, tg = 26.32 t/s
1.25.921.971 I slot print_timing: id 0 | task 0 | n_decoded = 690, tg = 26.51 t/s
1.26.877.254 I slot print_timing: id 0 | task 0 | prompt eval time = 36364.25 ms / 1378 tokens ( 26.39 ms per token, 37.89 tokens per second)
1.26.877.256 I slot print_timing: id 0 | task 0 | eval time = 26985.58 ms / 715 tokens ( 37.74 ms per token, 26.50 tokens per second)
1.26.877.256 I slot print_timing: id 0 | task 0 | total time = 63349.83 ms / 2093 tokens
1.26.877.257 I slot print_timing: id 0 | task 0 | graphs reused = 195
1.26.877.257 I slot print_timing: id 0 | task 0 | draft acceptance = 0.86767 ( 518 accepted / 597 generated)
1.26.877.275 I statistics draft-mtp: #calls(b,g,a) = 1 199 199, #gen drafts = 199, #acc drafts = 187, #gen tokens = 597, #acc tokens = 518, dur(b,g,a) = 0.000, 1558.203, 0.150 ms
1.26.877.324 I slot release: id 0 | task 0 | stop processing: n_tokens = 2095, truncated = 0
1.26.877.333 I srv update_slots: all slots are idle
1.27.031.203 I srv params_from_: Chat format: peg-native
1.27.033.237 I slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.646 (> 0.100 thold), f_keep = 0.656
1.27.033.533 I reasoning-budget: activated, budget=2147483647 tokens
1.27.033.534 I reasoning-budget: deactivated (natural end)
1.27.033.561 I slot launch_slot_: id 0 | task 204 | processing task, is_child = 0
1.27.033.571 I slot update_slots: id 0 | task 204 | Checking checkpoint with [1358, 1358] against 1374...
1.27.052.733 W slot update_slots: id 0 | task 204 | restored context checkpoint (pos_min = 1358, pos_max = 1358, n_tokens = 1359, n_past = 1359, size = 152.472 MiB)
1.40.104.459 I slot print_timing: id 0 | task 204 | prompt processing, n_tokens = 507, progress = 0.88, t = 13.07 s / 38.79 tokens per second
1.45.910.243 I slot print_timing: id 0 | task 204 | prompt processing, n_tokens = 732, progress = 0.98, t = 18.88 s / 38.78 tokens per second
1.45.953.370 I slot create_check: id 0 | task 204 | created context checkpoint 2 of 64 (pos_min = 2090, pos_max = 2090, n_tokens = 2091, size = 154.005 MiB)
1.46.839.558 I slot print_timing: id 0 | task 204 | prompt processing, n_tokens = 763, progress = 1.00, t = 19.81 s / 38.52 tokens per second
1.51.005.318 I slot print_timing: id 0 | task 204 | n_decoded = 101, tg = 25.07 t/s
1.54.151.455 I slot print_timing: id 0 | task 204 | n_decoded = 179, tg = 24.95 t/s
1.57.216.463 I slot print_timing: id 0 | task 204 | n_decoded = 236, tg = 23.05 t/s
1.59.043.207 I slot print_timing: id 0 | task 204 | prompt eval time = 19943.65 ms / 767 tokens ( 26.00 ms per token, 38.46 tokens per second)
1.59.043.209 I slot print_timing: id 0 | task 204 | eval time = 12065.86 ms / 273 tokens ( 44.20 ms per token, 22.63 tokens per second)
1.59.043.209 I slot print_timing: id 0 | task 204 | total time = 32009.51 ms / 1040 tokens
1.59.043.210 I slot print_timing: id 0 | task 204 | graphs reused = 281
1.59.043.211 I slot print_timing: id 0 | task 204 | draft acceptance = 0.70833 ( 187 accepted / 264 generated)
1.59.043.222 I statistics draft-mtp: #calls(b,g,a) = 2 287 287, #gen drafts = 287, #acc drafts = 266, #gen tokens = 861, #acc tokens = 705, dur(b,g,a) = 0.001, 2247.510, 0.222 ms
1.59.043.321 I slot release: id 0 | task 204 | stop processing: n_tokens = 2401, truncated = 0
1.59.043.352 I srv update_slots: all slots are idle
0.00.143.332 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.143.334 I device_info:
0.00.199.987 I - CUDA0 : NVIDIA GeForce RTX 5080 (15842 MiB, 15559 MiB free)
0.00.256.270 I - CUDA1 : NVIDIA GeForce RTX 5070 Ti (15842 MiB, 15597 MiB free)
0.00.256.277 I - CPU : AMD Ryzen 7 9700X 8-Core Processor (60909 MiB, 60909 MiB free)
0.00.256.313 I system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CUDA : ARCHS = 1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | BLACKWELL_NATIVE_FP4 = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
0.00.256.348 I srv init: running without SSL
0.00.256.360 I srv init: using 15 threads for HTTP server
0.00.256.418 W srv llama_server: -----------------
0.00.256.419 W srv llama_server: Built-in tools are enabled, do not expose server to untrusted environments
0.00.256.419 W srv llama_server: This feature is EXPERIMENTAL and may be changed in the future
0.00.256.419 W srv llama_server: -----------------
0.00.256.421 I srv start: binding port with default address family
0.00.257.560 I srv llama_server: loading model
0.00.257.565 I srv load_model: loading model '/home/user/.lmstudio/models/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-GGUF/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4.gguf'
0.00.521.093 I srv load_model: [spec] estimated memory usage of MTP context is 1736.27 MiB
0.00.521.103 I common_init_result: fitting params to device memory ...
0.00.521.103 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.00.521.155 W common_fit_params: failed to fit params to free device memory: llama_params_fit is not implemented for SPLIT_MODE_TENSOR, abort
0.05.635.746 W NCCL not compiled in; falling back to internal AllReduce. Recompile with -DGGML_CUDA_NCCL=ON for best multi-GPU performance.
0.05.714.889 I srv load_model: creating MTP draft context against the target model '/home/user/.lmstudio/models/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-GGUF/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4.gguf'
0.05.714.923 W NCCL not compiled in; falling back to internal AllReduce. Recompile with -DGGML_CUDA_NCCL=ON for best multi-GPU performance.
0.05.777.238 I srv load_model: initializing slots, n_slots = 1
0.06.164.159 I common_context_can_seq_rm: the context supports bounded partial sequence removal
0.06.171.963 I common_speculative_impl_draft_mtp: adding speculative implementation 'draft-mtp'
0.06.171.966 I common_speculative_impl_draft_mtp: - n_max=3, n_min=0, p_min=0.00, n_embd=5120, backend_sampling=1
0.06.171.967 I common_speculative_impl_draft_mtp: - gpu_layers=-1, cache_k=q8_0, cache_v=q8_0, ctx_tgt=yes, ctx_dft=yes, devices=[default]
0.06.172.012 W set_sampler: backend sampling not supported with SPLIT_MODE_TENSOR; using CPU
0.06.172.013 W common_speculative_impl_draft_mtp: backend offload failed for seq_id=0; using CPU sampler
0.06.172.021 I srv load_model: speculative decoding context initialized
0.06.172.022 I slot load_model: id 0 | task -1 | new slot, n_ctx = 262144
0.06.172.055 I srv load_model: prompt cache is enabled, size limit: 24576 MiB
0.06.172.056 I srv load_model: use `--cache-ram 0` to disable the prompt cache
0.06.172.056 I srv load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
0.06.172.056 I srv load_model: context checkpoints enabled, max = 64, min spacing = 256
0.06.172.065 I srv init: idle slots will be saved to prompt cache and cleared upon starting a new task
0.06.182.012 I init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
<think>
</think>
'
0.06.189.551 I srv init: init: chat template, thinking = 0
0.06.189.559 I srv llama_server: model loaded
0.06.189.562 I srv llama_server: server is listening on http://0.0.0.0:1234
0.06.189.564 I srv update_slots: all slots are idle
0.10.013.583 I srv params_from_: Chat format: peg-native
0.10.015.670 I slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
0.10.015.671 I srv get_availabl: updating prompt cache
0.10.015.674 I srv load: - looking for better prompt, base f_keep = -1.000, sim = 0.000
0.10.015.678 I srv update: - cache state: 0 prompts, 0.000 MiB (limits: 24576.000 MiB, 262144 tokens, 25769803776 est)
0.10.015.679 I srv get_availabl: prompt cache update took 0.01 ms
0.10.015.859 I reasoning-budget: activated, budget=2147483647 tokens
0.10.015.860 I reasoning-budget: deactivated (natural end)
0.10.015.868 I slot launch_slot_: id 0 | task 0 | processing task, is_child = 0
0.10.829.185 I slot create_check: id 0 | task 0 | created context checkpoint 1 of 64 (pos_min = 1358, pos_max = 1358, n_tokens = 1359, size = 152.472 MiB)
0.11.859.962 I slot print_timing: id 0 | task 0 | n_decoded = 102, tg = 107.32 t/s
0.14.886.615 I slot print_timing: id 0 | task 0 | n_decoded = 485, tg = 121.95 t/s
0.17.887.046 I slot print_timing: id 0 | task 0 | n_decoded = 873, tg = 125.12 t/s
0.20.914.095 I slot print_timing: id 0 | task 0 | n_decoded = 1254, tg = 125.34 t/s
0.22.296.425 I slot print_timing: id 0 | task 0 | prompt eval time = 893.52 ms / 1378 tokens ( 0.65 ms per token, 1542.21 tokens per second)
0.22.296.428 I slot print_timing: id 0 | task 0 | eval time = 11386.87 ms / 1437 tokens ( 7.92 ms per token, 126.20 tokens per second)
0.22.296.428 I slot print_timing: id 0 | task 0 | total time = 12280.40 ms / 2815 tokens
0.22.296.428 I slot print_timing: id 0 | task 0 | graphs reused = 394
0.22.296.429 I slot print_timing: id 0 | task 0 | draft acceptance = 0.86500 ( 1038 accepted / 1200 generated)
0.22.296.446 I statistics draft-mtp: #calls(b,g,a) = 1 400 400, #gen drafts = 400, #acc drafts = 374, #gen tokens = 1200, #acc tokens = 1038, dur(b,g,a) = 0.000, 3104.165, 0.283 ms
0.22.296.498 I slot release: id 0 | task 0 | stop processing: n_tokens = 2816, truncated = 0
0.22.296.512 I srv update_slots: all slots are idle
0.22.447.274 I srv params_from_: Chat format: peg-native
0.22.449.980 I slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.483 (> 0.100 thold), f_keep = 0.488
0.22.449.981 I srv get_availabl: updating prompt cache
0.22.450.208 W srv prompt_save: - saving prompt with length 2816, total state size = 205.078 MiB (draft: 5.898 MiB)
0.22.532.601 I srv load: - looking for better prompt, base f_keep = 0.488, sim = 0.483
0.22.532.606 I srv update: - cache state: 1 prompts, 357.550 MiB (limits: 24576.000 MiB, 262144 tokens, 262144 est)
0.22.532.607 I srv update: - prompt 0x5f518e659df0: 2816 tokens, checkpoints: 1, 357.550 MiB
0.22.532.608 I srv get_availabl: prompt cache update took 82.63 ms
0.22.532.794 I reasoning-budget: activated, budget=2147483647 tokens
0.22.532.795 I reasoning-budget: deactivated (natural end)
0.22.532.822 I slot launch_slot_: id 0 | task 405 | processing task, is_child = 0
0.22.532.830 I slot update_slots: id 0 | task 405 | Checking checkpoint with [1358, 1358] against 1374...
0.22.552.712 W slot update_slots: id 0 | task 405 | restored context checkpoint (pos_min = 1358, pos_max = 1358, n_tokens = 1359, n_past = 1359, size = 152.472 MiB)
0.23.274.662 I slot create_check: id 0 | task 405 | created context checkpoint 2 of 64 (pos_min = 2811, pos_max = 2811, n_tokens = 2812, size = 155.515 MiB)
0.24.248.792 I slot print_timing: id 0 | task 405 | n_decoded = 103, tg = 116.01 t/s
0.26.183.467 I slot print_timing: id 0 | task 405 | prompt eval time = 827.99 ms / 1488 tokens ( 0.56 ms per token, 1797.12 tokens per second)
0.26.183.469 I slot print_timing: id 0 | task 405 | eval time = 2822.53 ms / 300 tokens ( 9.41 ms per token, 106.29 tokens per second)
0.26.183.469 I slot print_timing: id 0 | task 405 | total time = 3650.52 ms / 1788 tokens
0.26.183.470 I slot print_timing: id 0 | task 405 | graphs reused = 491
0.26.183.470 I slot print_timing: id 0 | task 405 | draft acceptance = 0.67340 ( 200 accepted / 297 generated)
0.26.183.482 I statistics draft-mtp: #calls(b,g,a) = 2 499 499, #gen drafts = 499, #acc drafts = 457, #gen tokens = 1497, #acc tokens = 1238, dur(b,g,a) = 0.001, 3876.570, 0.350 ms
0.26.183.625 I slot release: id 0 | task 405 | stop processing: n_tokens = 3146, truncated = 0
0.26.183.657 I srv update_slots: all slots are idle
Name and Version
Broken:
Working (last good):
Hardware
AMD Ryzen 7 9700X, 64 GB RAM
NVIDIA RTX 5080 + RTX 5070 Ti, dual GPU with tensor split
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Model
Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-GGUF
https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-GGUF
Command line
Problem description & steps to reproduce
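Performance dropped 4~5x after PR #24060 was merged (2026-06-05). The model still runs correctly, but it is extremely slow with tensor split on dual GPUs: in the attached logs, generation falls from about 126 t/s to about 26 t/s, and prompt processing falls from about 1542 t/s to about 38 t/s.
Built from source with GNU 14.2.0, CUDA 13.1.
Tested with a simple prompt: "create a simple c++ application".
Both GPUs are fully loaded and the CPU is idle, so the bottleneck appears to be GPU-side.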
The regression is specific to --split-mode tensor. The issue does not appear without tensor split (not tested extensively, but single-GPU runs are unaffected).
First Bad Commit
I bisected between 46fa662 (working) and 59917d3 (broken), which are only 4 commits apart, and built and tested each one. The regression is introduced by the hparams refactor.
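For anyone repeating the bisect, the good/bad decision for each build can be made from the server log instead of by eye. The sketch below is only an illustration: the 60 t/s cutoff is an assumption chosen to sit between the two generation rates seen in this report, and the function name `verdict` is hypothetical. The exit codes follow the `git bisect run` convention (0 = good, 1 = bad, 125 = skip).

```python
import re

# Summary line format as printed by llama-server in the logs of this report, e.g.
# "... | eval time = 26985.58 ms / 715 tokens ( 37.74 ms per token, ...)"
EVAL_RE = re.compile(r"\|\s*eval time\s*=\s*([\d.]+) ms\s*/\s*(\d+) tokens")

GOOD, BAD, SKIP = 0, 1, 125  # exit codes understood by `git bisect run`


def verdict(log_text, min_tps=60.0):
    """Classify one build from its server log; min_tps is an assumed cutoff."""
    match = EVAL_RE.search(log_text)
    if match is None:
        # No timing line: the build or the request failed, so skip this commit.
        return SKIP
    ms, tokens = float(match.group(1)), int(match.group(2))
    tokens_per_second = tokens / (ms / 1000.0)
    return GOOD if tokens_per_second >= min_tps else BAD
```

A wrapper script that builds the checked-out commit, sends one request, saves the log, and exits with `verdict(log)` could then be handed to `git bisect run`. The build and launch commands are left out here because the exact command line is not given in this report.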
Relevant log output
Logs
Broken build (7acb4e8)
Working build (3ecfb15)
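The size of the regression can be read directly from the `print_timing` summary lines in the two logs. As a rough sketch (the helper names `throughput` and `slowdown` are made up for this example, and the regex assumes the line format shown in this report), the following recomputes tokens per second from the raw millisecond and token counts and compares the first request of each build:

```python
import re

# Matches llama-server summary lines such as
# "... | eval time = 26985.58 ms / 715 tokens ( 37.74 ms per token, ...)"
TIMING_RE = re.compile(
    r"\|\s*(prompt eval|eval) time\s*=\s*([\d.]+) ms\s*/\s*(\d+) tokens"
)


def throughput(log_text):
    """Recompute tokens per second for each summary line, grouped by phase."""
    rates = {"prompt eval": [], "eval": []}
    for phase, ms, tokens in TIMING_RE.findall(log_text):
        rates[phase].append(int(tokens) / (float(ms) / 1000.0))
    return rates


def slowdown(working_log, broken_log):
    """Ratio of working to broken throughput for the first request in each log."""
    good, bad = throughput(working_log), throughput(broken_log)
    return {phase: good[phase][0] / bad[phase][0] for phase in good}


# First-request summary lines copied from the two logs in this report.
BROKEN = """\
1.26.877.254 I slot print_timing: id 0 | task 0 | prompt eval time = 36364.25 ms / 1378 tokens ( 26.39 ms per token, 37.89 tokens per second)
1.26.877.256 I slot print_timing: id 0 | task 0 | eval time = 26985.58 ms / 715 tokens ( 37.74 ms per token, 26.50 tokens per second)
"""
WORKING = """\
0.22.296.425 I slot print_timing: id 0 | task 0 | prompt eval time = 893.52 ms / 1378 tokens ( 0.65 ms per token, 1542.21 tokens per second)
0.22.296.428 I slot print_timing: id 0 | task 0 | eval time = 11386.87 ms / 1437 tokens ( 7.92 ms per token, 126.20 tokens per second)
"""

for phase, ratio in slowdown(WORKING, BROKEN).items():
    print(f"{phase}: {ratio:.1f}x slower in the broken build")
```

Both phases are reported separately because they regress by very different amounts in these logs: the prompt was the same 1378 tokens in both runs, so the prompt-processing ratio is a like-for-like comparison, while the generation ratio compares runs that produced different numbers of tokens.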