Misc. bug: ~4-5x performance regression with NVFP4 + tensor split after hparams refactor (#24060) #24182

@Evergreen-sml

Name and Version

Broken:

version: 1124 (7acb4e8c)
built with GNU 14.2.0 for Linux x86_64

Working (last good):

version: 1123 (3ecfb150)
built with GNU 14.2.0 for Linux x86_64

Hardware

AMD Ryzen 7 9700X, 64 GB RAM
NVIDIA RTX 5080 + RTX 5070 Ti, dual GPU with tensor split

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Model

Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-GGUF
https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-GGUF

Command line

llama serve \
  -m ~/models/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-GGUF/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4.gguf \
  --host 0.0.0.0 --port 1234 \
  --jinja --keep -1 --no-mmap --no-mmproj --kv-unified \
  --chat-template-kwargs '{"preserve_thinking": false}' \
  --reasoning off --no-warmup --no-context-shift --cont-batching \
  --ctx-checkpoints 64 --threads 8 -c 262144 \
  -ctk q8_0 -ctv q8_0 -ctkd q8_0 -ctvd q8_0 -np 1 \
  -fa on -fit on --temp 0.6 --min-p 0.0 --top-k 20 --top-p 0.95 \
  --repeat-penalty 1.0 --presence-penalty 0.0 --tools all --n-predict -1 \
  --sampling-seq ekpmt --spec-type draft-mtp --spec-draft-n-max 3 \
  --draft-p-min 0.0 --prio-draft 2 --prio-batch-draft 2 \
  -b 2048 -ub 256 --cache-ram 24576 -mg 0 \
  --split-mode tensor -ts 1,1

Problem description & steps to reproduce

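Token generation throughput dropped about 4-5x after PR #24060 was merged (2026-06-05). The model still produces correct output, but it runs extremely slowly with tensor split on dual GPUs. In the logs below, generation falls from about 126 t/s to about 26 t/s, and prompt processing falls even further, from about 1542 t/s to about 38 t/s.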

Built from source with GNU 14.2.0 and CUDA 13.1, using these CMake options:

  -DCMAKE_C_COMPILER=gcc-14 \
  -DCMAKE_CXX_COMPILER=g++-14 \
  -DBUILD_SHARED_LIBS=OFF \
  -DGGML_CUDA=ON \
  -DGGML_CUDA_NCCL=OFF \
  -DCMAKE_CUDA_ARCHITECTURES="120" \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_NATIVE=ON \
  -DGGML_CUDA_F16=ON \
  -DLLAMA_BUILD_APP=ON

Tested with a simple prompt: "create a simple C++ application".
Both GPUs are fully loaded and the CPU is idle, so the bottleneck appears to be GPU-side.

The regression is specific to --split-mode tensor. It does not appear without tensor split (not tested extensively, but single-GPU runs are unaffected).

First Bad Commit

I bisected between 46fa662 (working) and 59917d3 (broken) - only 4 commits apart. Built and tested each one. The regression is introduced by the hparams refactor.
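
The search described above can be sketched as a binary search over an ordered commit list. This is only an illustration in Python: the intermediate commit labels, the throughput figures, and the 60 t/s threshold are hypothetical placeholders, not values from this report. In practice `git bisect` would drive the same search against real builds.

```python
def first_bad(commits, is_bad):
    """Return the first commit for which is_bad() is True.

    commits: ordered oldest to newest; commits[0] is known good and
             commits[-1] is known bad.
    is_bad:  callable that tests one commit and returns True if it
             shows the regression.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


# Endpoints are the two commits named in this report. The entries in between
# are HYPOTHETICAL placeholders; the real intermediate hashes are not listed.
commits = ["46fa662", "placeholder-1", "placeholder-2", "placeholder-3", "59917d3"]

# HYPOTHETICAL generation throughput (t/s) per build, for illustration only.
measured_tps = {
    "46fa662": 125.0,
    "placeholder-1": 125.0,
    "placeholder-2": 26.0,
    "placeholder-3": 26.0,
    "59917d3": 26.0,
}

THRESHOLD_TPS = 60.0  # assumed cut-off between "fast" and "slow" builds


def is_bad(commit):
    # A real predicate would build the commit and run a benchmark.
    return measured_tps[commit] < THRESHOLD_TPS


culprit = first_bad(commits, is_bad)
print("first bad commit:", culprit)
```

With only a handful of commits between the endpoints, building and testing each one, as was done here, costs about the same as a binary search.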

Relevant log output

Broken build (7acb4e8)

...
0.00.285.991 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.285.993 I device_info:
0.00.349.851 I   - CUDA0   : NVIDIA GeForce RTX 5080 (15842 MiB, 15559 MiB free)
0.00.407.570 I   - CUDA1   : NVIDIA GeForce RTX 5070 Ti (15842 MiB, 15597 MiB free)
0.00.407.576 I   - CPU     : AMD Ryzen 7 9700X 8-Core Processor (60909 MiB, 60909 MiB free)
0.00.407.609 I system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CUDA : ARCHS = 1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | BLACKWELL_NATIVE_FP4 = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
0.00.407.630 I srv          init: running without SSL
0.00.407.640 I srv          init: using 15 threads for HTTP server
0.00.407.688 W srv  llama_server: -----------------
0.00.407.689 W srv  llama_server: Built-in tools are enabled, do not expose server to untrusted environments
0.00.407.689 W srv  llama_server: This feature is EXPERIMENTAL and may be changed in the future
0.00.407.689 W srv  llama_server: -----------------
0.00.407.690 I srv         start: binding port with default address family
0.00.408.819 I srv  llama_server: loading model
0.00.408.822 I srv    load_model: loading model '/home/user/.lmstudio/models/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-GGUF/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4.gguf'
0.00.711.442 I srv    load_model: [spec] estimated memory usage of MTP context is 1736.27 MiB
0.00.711.457 I common_init_result: fitting params to device memory ...
0.00.711.458 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.00.711.505 W common_fit_params: failed to fit params to free device memory: llama_params_fit is not implemented for SPLIT_MODE_TENSOR, abort
0.10.007.277 W NCCL not compiled in; falling back to internal AllReduce.  Recompile with -DGGML_CUDA_NCCL=ON for best multi-GPU performance.
0.10.099.684 I srv    load_model: creating MTP draft context against the target model '/home/user/.lmstudio/models/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-GGUF/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4.gguf'
0.10.099.702 W NCCL not compiled in; falling back to internal AllReduce.  Recompile with -DGGML_CUDA_NCCL=ON for best multi-GPU performance.
0.10.165.502 I srv    load_model: initializing slots, n_slots = 1
0.10.614.464 I common_context_can_seq_rm: the context supports bounded partial sequence removal
0.10.622.978 I common_speculative_impl_draft_mtp: adding speculative implementation 'draft-mtp'
0.10.622.982 I common_speculative_impl_draft_mtp: - n_max=3, n_min=0, p_min=0.00, n_embd=5120, backend_sampling=1
0.10.622.987 I common_speculative_impl_draft_mtp: - gpu_layers=-1, cache_k=q8_0, cache_v=q8_0, ctx_tgt=yes, ctx_dft=yes, devices=[default]
0.10.623.035 W set_sampler: backend sampling not supported with SPLIT_MODE_TENSOR; using CPU
0.10.623.036 W common_speculative_impl_draft_mtp: backend offload failed for seq_id=0; using CPU sampler
0.10.623.046 I srv    load_model: speculative decoding context initialized
0.10.623.047 I slot   load_model: id  0 | task -1 | new slot, n_ctx = 262144
0.10.623.082 I srv    load_model: prompt cache is enabled, size limit: 24576 MiB
0.10.623.082 I srv    load_model: use `--cache-ram 0` to disable the prompt cache
0.10.623.082 I srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
0.10.623.082 I srv    load_model: context checkpoints enabled, max = 64, min spacing = 256
0.10.623.092 I srv          init: idle slots will be saved to prompt cache and cleared upon starting a new task
0.10.633.134 I init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
<think>

</think>

'
0.10.641.111 I srv          init: init: chat template, thinking = 0
0.10.641.121 I srv  llama_server: model loaded
0.10.641.124 I srv  llama_server: server is listening on http://0.0.0.0:1234
0.10.641.127 I srv  update_slots: all slots are idle
0.23.525.715 I srv  params_from_: Chat format: peg-native
0.23.527.110 I slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
0.23.527.111 I srv  get_availabl: updating prompt cache
0.23.527.115 I srv          load:  - looking for better prompt, base f_keep = -1.000, sim = 0.000
0.23.527.117 I srv        update:  - cache state: 0 prompts, 0.000 MiB (limits: 24576.000 MiB, 262144 tokens, 25769803776 est)
0.23.527.118 I srv  get_availabl: prompt cache update took 0.01 ms
0.23.527.272 I reasoning-budget: activated, budget=2147483647 tokens
0.23.527.273 I reasoning-budget: deactivated (natural end)
0.23.527.279 I slot launch_slot_: id  0 | task 0 | processing task, is_child = 0
0.52.994.289 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =   1118, progress = 0.81, t =  29.47 s / 37.94 tokens per second
0.59.254.081 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =   1359, progress = 0.99, t =  35.73 s / 38.04 tokens per second
0.59.316.435 I slot create_check: id  0 | task 0 | created context checkpoint 1 of 64 (pos_min = 1358, pos_max = 1358, n_tokens = 1359, size = 152.472 MiB)
0.59.745.652 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =   1374, progress = 1.00, t =  36.22 s / 37.94 tokens per second
1.04.254.825 I slot print_timing: id  0 | task 0 | n_decoded =    103, tg =  23.61 t/s
1.07.282.544 I slot print_timing: id  0 | task 0 | n_decoded =    189, tg =  25.57 t/s
1.10.382.480 I slot print_timing: id  0 | task 0 | n_decoded =    273, tg =  26.02 t/s
1.13.488.735 I slot print_timing: id  0 | task 0 | n_decoded =    361, tg =  26.55 t/s
1.16.622.245 I slot print_timing: id  0 | task 0 | n_decoded =    442, tg =  26.42 t/s
1.19.719.839 I slot print_timing: id  0 | task 0 | n_decoded =    524, tg =  26.43 t/s
1.22.798.764 I slot print_timing: id  0 | task 0 | n_decoded =    603, tg =  26.32 t/s
1.25.921.971 I slot print_timing: id  0 | task 0 | n_decoded =    690, tg =  26.51 t/s
1.26.877.254 I slot print_timing: id  0 | task 0 | prompt eval time =   36364.25 ms /  1378 tokens (   26.39 ms per token,    37.89 tokens per second)
1.26.877.256 I slot print_timing: id  0 | task 0 |        eval time =   26985.58 ms /   715 tokens (   37.74 ms per token,    26.50 tokens per second)
1.26.877.256 I slot print_timing: id  0 | task 0 |       total time =   63349.83 ms /  2093 tokens
1.26.877.257 I slot print_timing: id  0 | task 0 |    graphs reused =        195
1.26.877.257 I slot print_timing: id  0 | task 0 | draft acceptance = 0.86767 (  518 accepted /   597 generated)
1.26.877.275 I statistics        draft-mtp: #calls(b,g,a) =    1    199    199, #gen drafts =    199, #acc drafts =   187, #gen tokens =    597, #acc tokens =   518, dur(b,g,a) = 0.000, 1558.203, 0.150 ms
1.26.877.324 I slot      release: id  0 | task 0 | stop processing: n_tokens = 2095, truncated = 0
1.26.877.333 I srv  update_slots: all slots are idle
1.27.031.203 I srv  params_from_: Chat format: peg-native
1.27.033.237 I slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.646 (> 0.100 thold), f_keep = 0.656
1.27.033.533 I reasoning-budget: activated, budget=2147483647 tokens
1.27.033.534 I reasoning-budget: deactivated (natural end)
1.27.033.561 I slot launch_slot_: id  0 | task 204 | processing task, is_child = 0
1.27.033.571 I slot update_slots: id  0 | task 204 | Checking checkpoint with [1358, 1358] against 1374...
1.27.052.733 W slot update_slots: id  0 | task 204 | restored context checkpoint (pos_min = 1358, pos_max = 1358, n_tokens = 1359, n_past = 1359, size = 152.472 MiB)
1.40.104.459 I slot print_timing: id  0 | task 204 | prompt processing, n_tokens =    507, progress = 0.88, t =  13.07 s / 38.79 tokens per second
1.45.910.243 I slot print_timing: id  0 | task 204 | prompt processing, n_tokens =    732, progress = 0.98, t =  18.88 s / 38.78 tokens per second
1.45.953.370 I slot create_check: id  0 | task 204 | created context checkpoint 2 of 64 (pos_min = 2090, pos_max = 2090, n_tokens = 2091, size = 154.005 MiB)
1.46.839.558 I slot print_timing: id  0 | task 204 | prompt processing, n_tokens =    763, progress = 1.00, t =  19.81 s / 38.52 tokens per second
1.51.005.318 I slot print_timing: id  0 | task 204 | n_decoded =    101, tg =  25.07 t/s
1.54.151.455 I slot print_timing: id  0 | task 204 | n_decoded =    179, tg =  24.95 t/s
1.57.216.463 I slot print_timing: id  0 | task 204 | n_decoded =    236, tg =  23.05 t/s
1.59.043.207 I slot print_timing: id  0 | task 204 | prompt eval time =   19943.65 ms /   767 tokens (   26.00 ms per token,    38.46 tokens per second)
1.59.043.209 I slot print_timing: id  0 | task 204 |        eval time =   12065.86 ms /   273 tokens (   44.20 ms per token,    22.63 tokens per second)
1.59.043.209 I slot print_timing: id  0 | task 204 |       total time =   32009.51 ms /  1040 tokens
1.59.043.210 I slot print_timing: id  0 | task 204 |    graphs reused =        281
1.59.043.211 I slot print_timing: id  0 | task 204 | draft acceptance = 0.70833 (  187 accepted /   264 generated)
1.59.043.222 I statistics        draft-mtp: #calls(b,g,a) =    2    287    287, #gen drafts =    287, #acc drafts =   266, #gen tokens =    861, #acc tokens =   705, dur(b,g,a) = 0.001, 2247.510, 0.222 ms
1.59.043.321 I slot      release: id  0 | task 204 | stop processing: n_tokens = 2401, truncated = 0
1.59.043.352 I srv  update_slots: all slots are idle

Working build (3ecfb15)

0.00.143.332 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.143.334 I device_info:
0.00.199.987 I   - CUDA0   : NVIDIA GeForce RTX 5080 (15842 MiB, 15559 MiB free)
0.00.256.270 I   - CUDA1   : NVIDIA GeForce RTX 5070 Ti (15842 MiB, 15597 MiB free)
0.00.256.277 I   - CPU     : AMD Ryzen 7 9700X 8-Core Processor (60909 MiB, 60909 MiB free)
0.00.256.313 I system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CUDA : ARCHS = 1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | BLACKWELL_NATIVE_FP4 = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
0.00.256.348 I srv          init: running without SSL
0.00.256.360 I srv          init: using 15 threads for HTTP server
0.00.256.418 W srv  llama_server: -----------------
0.00.256.419 W srv  llama_server: Built-in tools are enabled, do not expose server to untrusted environments
0.00.256.419 W srv  llama_server: This feature is EXPERIMENTAL and may be changed in the future
0.00.256.419 W srv  llama_server: -----------------
0.00.256.421 I srv         start: binding port with default address family
0.00.257.560 I srv  llama_server: loading model
0.00.257.565 I srv    load_model: loading model '/home/user/.lmstudio/models/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-GGUF/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4.gguf'
0.00.521.093 I srv    load_model: [spec] estimated memory usage of MTP context is 1736.27 MiB
0.00.521.103 I common_init_result: fitting params to device memory ...
0.00.521.103 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.00.521.155 W common_fit_params: failed to fit params to free device memory: llama_params_fit is not implemented for SPLIT_MODE_TENSOR, abort
0.05.635.746 W NCCL not compiled in; falling back to internal AllReduce.  Recompile with -DGGML_CUDA_NCCL=ON for best multi-GPU performance.
0.05.714.889 I srv    load_model: creating MTP draft context against the target model '/home/user/.lmstudio/models/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-GGUF/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4.gguf'
0.05.714.923 W NCCL not compiled in; falling back to internal AllReduce.  Recompile with -DGGML_CUDA_NCCL=ON for best multi-GPU performance.
0.05.777.238 I srv    load_model: initializing slots, n_slots = 1
0.06.164.159 I common_context_can_seq_rm: the context supports bounded partial sequence removal
0.06.171.963 I common_speculative_impl_draft_mtp: adding speculative implementation 'draft-mtp'
0.06.171.966 I common_speculative_impl_draft_mtp: - n_max=3, n_min=0, p_min=0.00, n_embd=5120, backend_sampling=1
0.06.171.967 I common_speculative_impl_draft_mtp: - gpu_layers=-1, cache_k=q8_0, cache_v=q8_0, ctx_tgt=yes, ctx_dft=yes, devices=[default]
0.06.172.012 W set_sampler: backend sampling not supported with SPLIT_MODE_TENSOR; using CPU
0.06.172.013 W common_speculative_impl_draft_mtp: backend offload failed for seq_id=0; using CPU sampler
0.06.172.021 I srv    load_model: speculative decoding context initialized
0.06.172.022 I slot   load_model: id  0 | task -1 | new slot, n_ctx = 262144
0.06.172.055 I srv    load_model: prompt cache is enabled, size limit: 24576 MiB
0.06.172.056 I srv    load_model: use `--cache-ram 0` to disable the prompt cache
0.06.172.056 I srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
0.06.172.056 I srv    load_model: context checkpoints enabled, max = 64, min spacing = 256
0.06.172.065 I srv          init: idle slots will be saved to prompt cache and cleared upon starting a new task
0.06.182.012 I init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
<think>

</think>

'
0.06.189.551 I srv          init: init: chat template, thinking = 0
0.06.189.559 I srv  llama_server: model loaded
0.06.189.562 I srv  llama_server: server is listening on http://0.0.0.0:1234
0.06.189.564 I srv  update_slots: all slots are idle
0.10.013.583 I srv  params_from_: Chat format: peg-native
0.10.015.670 I slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
0.10.015.671 I srv  get_availabl: updating prompt cache
0.10.015.674 I srv          load:  - looking for better prompt, base f_keep = -1.000, sim = 0.000
0.10.015.678 I srv        update:  - cache state: 0 prompts, 0.000 MiB (limits: 24576.000 MiB, 262144 tokens, 25769803776 est)
0.10.015.679 I srv  get_availabl: prompt cache update took 0.01 ms
0.10.015.859 I reasoning-budget: activated, budget=2147483647 tokens
0.10.015.860 I reasoning-budget: deactivated (natural end)
0.10.015.868 I slot launch_slot_: id  0 | task 0 | processing task, is_child = 0
0.10.829.185 I slot create_check: id  0 | task 0 | created context checkpoint 1 of 64 (pos_min = 1358, pos_max = 1358, n_tokens = 1359, size = 152.472 MiB)
0.11.859.962 I slot print_timing: id  0 | task 0 | n_decoded =    102, tg = 107.32 t/s
0.14.886.615 I slot print_timing: id  0 | task 0 | n_decoded =    485, tg = 121.95 t/s
0.17.887.046 I slot print_timing: id  0 | task 0 | n_decoded =    873, tg = 125.12 t/s
0.20.914.095 I slot print_timing: id  0 | task 0 | n_decoded =   1254, tg = 125.34 t/s
0.22.296.425 I slot print_timing: id  0 | task 0 | prompt eval time =     893.52 ms /  1378 tokens (    0.65 ms per token,  1542.21 tokens per second)
0.22.296.428 I slot print_timing: id  0 | task 0 |        eval time =   11386.87 ms /  1437 tokens (    7.92 ms per token,   126.20 tokens per second)
0.22.296.428 I slot print_timing: id  0 | task 0 |       total time =   12280.40 ms /  2815 tokens
0.22.296.428 I slot print_timing: id  0 | task 0 |    graphs reused =        394
0.22.296.429 I slot print_timing: id  0 | task 0 | draft acceptance = 0.86500 ( 1038 accepted /  1200 generated)
0.22.296.446 I statistics        draft-mtp: #calls(b,g,a) =    1    400    400, #gen drafts =    400, #acc drafts =   374, #gen tokens =   1200, #acc tokens =  1038, dur(b,g,a) = 0.000, 3104.165, 0.283 ms
0.22.296.498 I slot      release: id  0 | task 0 | stop processing: n_tokens = 2816, truncated = 0
0.22.296.512 I srv  update_slots: all slots are idle
0.22.447.274 I srv  params_from_: Chat format: peg-native
0.22.449.980 I slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.483 (> 0.100 thold), f_keep = 0.488
0.22.449.981 I srv  get_availabl: updating prompt cache
0.22.450.208 W srv   prompt_save:  - saving prompt with length 2816, total state size = 205.078 MiB (draft: 5.898 MiB)
0.22.532.601 I srv          load:  - looking for better prompt, base f_keep = 0.488, sim = 0.483
0.22.532.606 I srv        update:  - cache state: 1 prompts, 357.550 MiB (limits: 24576.000 MiB, 262144 tokens, 262144 est)
0.22.532.607 I srv        update:    - prompt 0x5f518e659df0:    2816 tokens, checkpoints:  1,   357.550 MiB
0.22.532.608 I srv  get_availabl: prompt cache update took 82.63 ms
0.22.532.794 I reasoning-budget: activated, budget=2147483647 tokens
0.22.532.795 I reasoning-budget: deactivated (natural end)
0.22.532.822 I slot launch_slot_: id  0 | task 405 | processing task, is_child = 0
0.22.532.830 I slot update_slots: id  0 | task 405 | Checking checkpoint with [1358, 1358] against 1374...
0.22.552.712 W slot update_slots: id  0 | task 405 | restored context checkpoint (pos_min = 1358, pos_max = 1358, n_tokens = 1359, n_past = 1359, size = 152.472 MiB)
0.23.274.662 I slot create_check: id  0 | task 405 | created context checkpoint 2 of 64 (pos_min = 2811, pos_max = 2811, n_tokens = 2812, size = 155.515 MiB)
0.24.248.792 I slot print_timing: id  0 | task 405 | n_decoded =    103, tg = 116.01 t/s
0.26.183.467 I slot print_timing: id  0 | task 405 | prompt eval time =     827.99 ms /  1488 tokens (    0.56 ms per token,  1797.12 tokens per second)
0.26.183.469 I slot print_timing: id  0 | task 405 |        eval time =    2822.53 ms /   300 tokens (    9.41 ms per token,   106.29 tokens per second)
0.26.183.469 I slot print_timing: id  0 | task 405 |       total time =    3650.52 ms /  1788 tokens
0.26.183.470 I slot print_timing: id  0 | task 405 |    graphs reused =        491
0.26.183.470 I slot print_timing: id  0 | task 405 | draft acceptance = 0.67340 (  200 accepted /   297 generated)
0.26.183.482 I statistics        draft-mtp: #calls(b,g,a) =    2    499    499, #gen drafts =    499, #acc drafts =   457, #gen tokens =   1497, #acc tokens =  1238, dur(b,g,a) = 0.001, 3876.570, 0.350 ms
0.26.183.625 I slot      release: id  0 | task 405 | stop processing: n_tokens = 3146, truncated = 0
0.26.183.657 I srv  update_slots: all slots are idle
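
To put numbers on the regression, the `print_timing` summary lines for task 0 in the two logs above can be compared directly. The following is a minimal Python sketch, assuming the log format shown in this report. The helper names `parse_rates` and `slowdown` are made up for illustration and are not part of llama.cpp.

```python
import re

# Matches print_timing summary lines such as:
# "... | prompt eval time = 893.52 ms / 1378 tokens ( 0.65 ms per token, 1542.21 tokens per second)"
TIMING_RE = re.compile(
    r"\|\s+(prompt eval|eval) time\s*=\s*[\d.]+ ms\s*/\s*\d+ tokens"
    r"\s*\(\s*[\d.]+ ms per token,\s*([\d.]+) tokens per second\)"
)

# Task 0 summary lines copied from the broken build log.
BROKEN_LOG = """
1.26.877.254 I slot print_timing: id  0 | task 0 | prompt eval time =   36364.25 ms /  1378 tokens (   26.39 ms per token,    37.89 tokens per second)
1.26.877.256 I slot print_timing: id  0 | task 0 |        eval time =   26985.58 ms /   715 tokens (   37.74 ms per token,    26.50 tokens per second)
"""

# Task 0 summary lines copied from the working build log.
WORKING_LOG = """
0.22.296.425 I slot print_timing: id  0 | task 0 | prompt eval time =     893.52 ms /  1378 tokens (    0.65 ms per token,  1542.21 tokens per second)
0.22.296.428 I slot print_timing: id  0 | task 0 |        eval time =   11386.87 ms /  1437 tokens (    7.92 ms per token,   126.20 tokens per second)
"""


def parse_rates(log_text):
    """Collect tokens-per-second values, grouped by timing kind."""
    rates = {"prompt eval": [], "eval": []}
    for kind, tps in TIMING_RE.findall(log_text):
        rates[kind].append(float(tps))
    return rates


def slowdown(good, bad):
    """Ratio of working to broken throughput, using the first entry of each kind."""
    return {kind: good[kind][0] / bad[kind][0] for kind in good}


good = parse_rates(WORKING_LOG)
bad = parse_rates(BROKEN_LOG)
for kind, ratio in slowdown(good, bad).items():
    print(f"{kind}: {ratio:.1f}x slower")
```

On these excerpts the generation ratio comes out at roughly 4.8x, matching the title, while the prompt processing ratio is roughly 40x. Both requests processed the same 1378 prompt tokens, so the prompt figures are directly comparable.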
