Eval bug: Major performance drop since b7406 #18258

@Kaspur2012

Description

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 2070 SUPER, compute capability 7.5, VMM: yes
load_backend: loaded CUDA backend from D:\llamacpp\llama-b7495-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from D:\llamacpp\llama-b7495-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from D:\llamacpp\llama-b7495-bin-win-cuda-12.4-x64\ggml-cpu-haswell.dll
version: 7495 (4117ae5)
built with Clang 19.1.5 for Windows x86_64

Operating systems

Windows

GGML backends

CUDA

Hardware

Ryzen 7 3700x
RTX 3090

Models

Fully offloaded to the RTX 3090:

Qwen3-VL-32B-Thinking.Q4_K_M_TEXT_DRAFT:

  • llama-server.exe -m D:/lm_studio/mradermacher/Qwen3-VL-32B-Thinking-GGUF/Qwen3-VL-32B-Thinking.Q4_K_M.gguf --jinja -c 20000 -ngl 999 -fa on --temp 1.0 --top-k 40 --top-p 1.0 --min-p 0.05 --repeat-penalty 1.0 --no-mmap --cache-type-k q8_0 --cache-type-v q8_0 --no-warmup --main-gpu 0 -md D:/lm_studio/lmstudio-community/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf --cache-type-k-draft q8_0 --cache_type-v-draft q8_0 --device-draft CUDA0 --split-mode none

Magistral-Small-2509-UD-Q5_K_XL:

  • llama-server.exe -m D:\lm_studio\unsloth\Magistral-Small-2509-GGUF\Magistral-Small-2509-UD-Q5_K_XL.gguf --jinja -c 85000 -ngl 999 -fa on --temp 0.7 --top-k 40 --top-p 0.95 --min-p 0.01 --repeat-penalty 1.1 --no-mmap --split-mode none --main-gpu 0 --cache-type-k q8_0 --cache-type-v q8_0 --no-warmup

I saw a similar result with Mistral-Small-3.2-24B-Instruct-2506-UD-Q5_K_XL, but I did not record the t/s for it.

Problem description & steps to reproduce

Lately I have seen a major performance drop on these 2 models: Qwen3-VL-32B-Thinking.Q4_K_M_TEXT_DRAFT and Magistral-Small-2509-UD-Q5_K_XL.

I downloaded a number of binary builds and located where the drop in performance started.

Here is what I found:

b7406 :

  • Qwen3-VL-32B-Thinking.Q4_K_M_TEXT_DRAFT - 45 t/s
  • Magistral-Small-2509-UD-Q5_K_XL - 45 t/s

b7410 :

  • Qwen3-VL-32B-Thinking.Q4_K_M_TEXT_DRAFT - 10 t/s
  • Magistral-Small-2509-UD-Q5_K_XL - 20 t/s

b7495:

  • Qwen3-VL-32B-Thinking.Q4_K_M_TEXT_DRAFT - failed to load, log below
  • Magistral-Small-2509-UD-Q5_K_XL - 20 t/s
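
To put the numbers above in relative terms, here is a small sketch that turns the reported t/s readings into a slowdown factor and a percentage drop. The dictionary layout and the `slowdown` helper are made up for illustration; only the t/s values come from the list above.

```python
# Tokens per second reported above: (b7406 reading, b7410 reading)
results = {
    "Qwen3-VL-32B-Thinking.Q4_K_M_TEXT_DRAFT": (45.0, 10.0),
    "Magistral-Small-2509-UD-Q5_K_XL": (45.0, 20.0),
}


def slowdown(before: float, after: float) -> tuple[float, float]:
    """Return (slowdown factor, percentage drop) between two t/s readings."""
    return before / after, (1.0 - after / before) * 100.0


for model, (before, after) in results.items():
    factor, drop = slowdown(before, after)
    print(f"{model}: {factor:.2f}x slower ({drop:.0f}% drop)")
```

The setup with the draft model loses noticeably more than the setup without one, which may help narrow down which change is responsible.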

First Bad Commit

b7410 is the first build I tested that shows the performance drop; b7406 was still fine.

I am not sure when the loading problem started for Qwen3-VL-32B-Thinking.Q4_K_M.
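
The remaining builds between a known-good and a known-bad tag could be narrowed down with a plain bisection. The sketch below assumes a hypothetical `is_bad(build)` check (for example, running the same prompt and comparing t/s by hand), that a binary exists for each build number tried, and that the regression persists once it appears. None of these names come from llama.cpp.

```python
from typing import Callable


def first_bad_build(good: int, bad: int, is_bad: Callable[[int], bool]) -> int:
    """Return the lowest build number in (good, bad] for which is_bad() is True.

    Assumes `good` is known good, `bad` is known bad, and the regression
    stays present in every build after it first appears.
    """
    if good >= bad:
        raise ValueError("good must be lower than bad")
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(mid):
            bad = mid   # regression already present at mid
        else:
            good = mid  # mid is still fine, look later
    return bad


# Stand-in check with a made-up threshold, purely to exercise the function.
def fake_is_bad(build: int) -> bool:
    return build >= 7408


print(first_bad_build(7406, 7410, fake_is_bad))
```

The same approach could be applied separately to the load failure, using "draft model fails to load" as the check, between b7410 and b7495.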

Relevant log output

Log for Qwen3-VL-32B-Thinking.Q4_K_M with the draft model, which fails to load on b7495:

Working Dir: D:/llamacpp/llama-b7495-bin-win-cuda-12.4-x64
Executing Command: llama-server.exe -m D:/lm_studio/mradermacher/Qwen3-VL-32B-Thinking-GGUF/Qwen3-VL-32B-Thinking.Q4_K_M.gguf --jinja -c 20000 -ngl 999 -fa on --temp 1.0 --top-k 40 --top-p 1.0 --min-p 0.05 --repeat-penalty 1.0 --no-mmap --cache-type-k q8_0 --cache-type-v q8_0 --no-warmup --main-gpu 0 -md D:/lm_studio/lmstudio-community/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf --cache-type-k-draft q8_0 --cache_type-v-draft q8_0 --device-draft CUDA0 --split-mode none

================================================================================

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 2070 SUPER, compute capability 7.5, VMM: yes
load_backend: loaded CUDA backend from D:\llamacpp\llama-b7495-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from D:\llamacpp\llama-b7495-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from D:\llamacpp\llama-b7495-bin-win-cuda-12.4-x64\ggml-cpu-haswell.dll
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build: 7495 (4117ae555) with Clang 19.1.5 for Windows x86_64
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16

system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CUDA : ARCHS = 500,610,700,750,800,860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

init: using 15 threads for HTTP server
start: binding port with default address family
main: loading model
srv    load_model: loading model 'D:/lm_studio/mradermacher/Qwen3-VL-32B-Thinking-GGUF/Qwen3-VL-32B-Thinking.Q4_K_M.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on

llama_params_fit_impl: projected to use 21416 MiB of device memory vs. 24575 MiB of free device memory
llama_params_fit_impl: will leave 1887 >= 1024 MiB of free device memory, no changes needed
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.46 seconds
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:0b:00.0) - 23304 MiB free
llama_model_loader: loaded meta data with 39 key-value pairs and 707 tensors from D:/lm_studio/mradermacher/Qwen3-VL-32B-Thinking-GGUF/Qwen3-VL-32B-Thinking.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 32B Thinking
llama_model_loader: - kv   3:                           general.finetune str              = Thinking
llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
llama_model_loader: - kv   5:                         general.size_label str              = 32B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                               general.tags arr[str,1]       = ["image-text-to-text"]
llama_model_loader: - kv   8:                        qwen3vl.block_count u32              = 64
llama_model_loader: - kv   9:                     qwen3vl.context_length u32              = 262144
llama_model_loader: - kv  10:                   qwen3vl.embedding_length u32              = 5120
llama_model_loader: - kv  11:                qwen3vl.feed_forward_length u32              = 25600
llama_model_loader: - kv  12:               qwen3vl.attention.head_count u32              = 64
llama_model_loader: - kv  13:            qwen3vl.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                     qwen3vl.rope.freq_base f32              = 5000000.000000
llama_model_loader: - kv  15:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  16:               qwen3vl.attention.key_length u32              = 128
llama_model_loader: - kv  17:             qwen3vl.attention.value_length u32              = 128
llama_model_loader: - kv  18:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
llama_model_loader: - kv  19:                 qwen3vl.n_deepstack_layers u32              = 3
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...

llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  25:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  26:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  27:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  28:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  29:                    tokenizer.chat_template str              = {%- set image_count = namespace(value...
llama_model_loader: - kv  30:               general.quantization_version u32              = 2
llama_model_loader: - kv  31:                          general.file_type u32              = 15
llama_model_loader: - kv  32:                                general.url str              = https://huggingface.co/mradermacher/Q...
llama_model_loader: - kv  33:              mradermacher.quantize_version str              = 2
llama_model_loader: - kv  34:                  mradermacher.quantized_by str              = mradermacher
llama_model_loader: - kv  35:                  mradermacher.quantized_at str              = 2025-10-30T18:34:42+01:00
llama_model_loader: - kv  36:                  mradermacher.quantized_on str              = nico1
llama_model_loader: - kv  37:                         general.source.url str              = https://huggingface.co/Qwen/Qwen3-VL-...
llama_model_loader: - kv  38:                  mradermacher.convert_type str              = hf
llama_model_loader: - type  f32:  257 tensors
llama_model_loader: - type q4_K:  385 tensors
llama_model_loader: - type q6_K:   65 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 18.40 GiB (4.82 BPW) 

load: printing all EOG tokens:
load:   - 151643 ('<|endoftext|>')
load:   - 151645 ('<|im_end|>')
load:   - 151662 ('<|fim_pad|>')
load:   - 151663 ('<|repo_name|>')
load:   - 151664 ('<|file_sep|>')
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3vl
print_info: vocab_only       = 0
print_info: no_alloc         = 0
print_info: n_ctx_train      = 262144
print_info: n_embd           = 5120
print_info: n_embd_inp       = 20480
print_info: n_layer          = 64
print_info: n_head           = 64
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 25600
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 40
print_info: rope scaling     = linear
print_info: freq_base_train  = 5000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 262144
print_info: rope_yarn_log_mul= 0.0000
print_info: rope_finetuned   = unknown
print_info: mrope sections   = [24, 20, 20, 0]
print_info: model type       = 32B
print_info: model params     = 32.76 B
print_info: general.name     = Qwen3 VL 32B Thinking
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 63 repeating layers to GPU
load_tensors: offloaded 65/65 layers to GPU
load_tensors:          CPU model buffer size =   417.30 MiB
load_tensors:        CUDA0 model buffer size = 18423.65 MiB

.................................................................................................
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|im_end|> logit bias = -inf
common_init_result: added <|fim_pad|> logit bias = -inf
common_init_result: added <|repo_name|> logit bias = -inf
common_init_result: added <|file_sep|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max     = 4
llama_context: n_ctx         = 20224
llama_context: n_ctx_seq     = 20224
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = true
llama_context: freq_base     = 5000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (20224) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     2.32 MiB
llama_kv_cache:      CUDA0 KV buffer size =  2686.00 MiB
llama_kv_cache: size = 2686.00 MiB ( 20224 cells,  64 layers,  4/1 seqs), K (q8_0): 1343.00 MiB, V (q8_0): 1343.00 MiB

llama_context:      CUDA0 compute buffer size =   306.75 MiB
llama_context:  CUDA_Host compute buffer size =    49.52 MiB
llama_context: graph nodes  = 2247
llama_context: graph splits = 2
srv    load_model: loading draft model 'D:/lm_studio/lmstudio-community/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on

llama_params_fit_impl: projected to use 2078 MiB of device memory vs. 24575 MiB of free device memory
llama_params_fit_impl: cannot fulfill margin of 1024 MiB, need to reduce device memory by 1216 MiB
llama_params_fit_impl: context size set by user to 20224 -> no change

llama_params_fit_impl: filling dense layers back-to-front:

llama_params_fit_impl:   - CUDA0 (NVIDIA GeForce RTX 3090):  8 layers,    861 MiB used,   1024 MiB free
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took -3.75 seconds
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:0b:00.0) - 1886 MiB free
llama_model_loader: loaded meta data with 27 key-value pairs and 311 tensors from D:/lm_studio/lmstudio-community/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 0.6B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 0.6B
llama_model_loader: - kv   5:                          qwen3.block_count u32              = 28
llama_model_loader: - kv   6:                       qwen3.context_length u32              = 32768
llama_model_loader: - kv   7:                     qwen3.embedding_length u32              = 1024
llama_model_loader: - kv   8:                  qwen3.feed_forward_length u32              = 3072
llama_model_loader: - kv   9:                 qwen3.attention.head_count u32              = 16
llama_model_loader: - kv  10:              qwen3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  11:                       qwen3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  12:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  13:                 qwen3.attention.key_length u32              = 128
llama_model_loader: - kv  14:               qwen3.attention.value_length u32              = 128
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...

llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - kv  26:                          general.file_type u32              = 7
llama_model_loader: - type  f32:  113 tensors
llama_model_loader: - type q8_0:  198 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 761.80 MiB (8.50 BPW) 
load: printing all EOG tokens:
load:   - 151643 ('<|endoftext|>')
load:   - 151645 ('<|im_end|>')
load:   - 151662 ('<|fim_pad|>')
load:   - 151663 ('<|repo_name|>')
load:   - 151664 ('<|file_sep|>')
load: special tokens cache size = 26

load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3
print_info: vocab_only       = 0
print_info: no_alloc         = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 1024
print_info: n_embd_inp       = 1024
print_info: n_layer          = 28
print_info: n_head           = 16
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 2
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 3072
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = -1
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_yarn_log_mul= 0.0000
print_info: rope_finetuned   = unknown
print_info: model type       = 0.6B
print_info: model params     = 751.63 M
print_info: general.name     = Qwen3 0.6B
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 7 repeating layers to GPU
load_tensors: offloaded 8/29 layers to GPU
load_tensors:        CUDA0 model buffer size =   269.28 MiB
load_tensors:    CUDA_Host model buffer size =   492.52 MiB

................................................
.............
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|im_end|> logit bias = -inf
common_init_result: added <|fim_pad|> logit bias = -inf
common_init_result: added <|repo_name|> logit bias = -inf
common_init_result: added <|file_sep|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 20224
llama_context: n_ctx_seq     = 20224
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = true
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (20224) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.58 MiB
llama_kv_cache:        CPU KV buffer size =   881.34 MiB

llama_kv_cache:      CUDA0 KV buffer size =   293.78 MiB
llama_kv_cache: size = 1175.12 MiB ( 20224 cells,  28 layers,  1/1 seqs), K (q8_0):  587.56 MiB, V (q8_0):  587.56 MiB

llama_context:      CUDA0 compute buffer size =   298.75 MiB
llama_context:  CUDA_Host compute buffer size =    43.51 MiB
llama_context: graph nodes  = 987
llama_context: graph splits = 275 (with bs=512), 2 (with bs=1)
srv    load_model: initializing slots, n_slots = 4
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 20224
llama_context: n_ctx_seq     = 20224
llama_context: n_batch       = 20224
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = true
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (20224) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.58 MiB
llama_kv_cache:        CPU KV buffer size =   881.34 MiB

llama_kv_cache:      CUDA0 KV buffer size =   293.78 MiB

llama_kv_cache: size = 1175.12 MiB ( 20224 cells,  28 layers,  1/1 seqs), K (q8_0):  587.56 MiB, V (q8_0):  587.56 MiB
llama_context:      CUDA0 compute buffer size =   298.75 MiB
llama_context:  CUDA_Host compute buffer size =    43.51 MiB
llama_context: graph nodes  = 987
llama_context: graph splits = 275 (with bs=512), 2 (with bs=1)
slot   load_model: id  0 | task -1 | new slot, n_ctx = 20224
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 20224
llama_context: n_ctx_seq     = 20224
llama_context: n_batch       = 20224
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = true
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (20224) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.58 MiB
llama_kv_cache:        CPU KV buffer size =   881.34 MiB

llama_kv_cache:      CUDA0 KV buffer size =   293.78 MiB
llama_kv_cache: size = 1175.12 MiB ( 20224 cells,  28 layers,  1/1 seqs), K (q8_0):  587.56 MiB, V (q8_0):  587.56 MiB

llama_context:      CUDA0 compute buffer size =   298.75 MiB
llama_context:  CUDA_Host compute buffer size =    43.51 MiB
llama_context: graph nodes  = 987
llama_context: graph splits = 275 (with bs=512), 2 (with bs=1)
slot   load_model: id  1 | task -1 | new slot, n_ctx = 20224
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 20224
llama_context: n_ctx_seq     = 20224
llama_context: n_batch       = 20224
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = true
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (20224) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.58 MiB
llama_kv_cache:        CPU KV buffer size =   881.34 MiB

llama_kv_cache:      CUDA0 KV buffer size =   293.78 MiB
llama_kv_cache: size = 1175.12 MiB ( 20224 cells,  28 layers,  1/1 seqs), K (q8_0):  587.56 MiB, V (q8_0):  587.56 MiB
llama_context:      CUDA0 compute buffer size =   298.75 MiB
llama_context:  CUDA_Host compute buffer size =    43.51 MiB
llama_context: graph nodes  = 987
llama_context: graph splits = 275 (with bs=512), 2 (with bs=1)
slot   load_model: id  2 | task -1 | new slot, n_ctx = 20224
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 20224
llama_context: n_ctx_seq     = 20224
llama_context: n_batch       = 20224
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = true
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (20224) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.58 MiB
llama_kv_cache:        CPU KV buffer size =   881.34 MiB

llama_kv_cache:      CUDA0 KV buffer size =   293.78 MiB
llama_kv_cache: size = 1175.12 MiB ( 20224 cells,  28 layers,  1/1 seqs), K (q8_0):  587.56 MiB, V (q8_0):  587.56 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 298.75 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n_impl: failed to allocate CUDA0 buffer of size 313262080
graph_reserve: failed to allocate compute buffers
llama_init_from_model: failed to initialize the context: failed to allocate compute pp buffers
srv    load_model: failed to create draft context
srv    operator(): operator(): cleaning up before exit...
main: exiting due to model loading error
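
A rough tally of the CUDA0 figures in the b7495 log above, offered as a sketch rather than a diagnosis. It assumes the buffer sizes printed in the log are the only allocations that matter, and the constant and function names are made up for illustration. The fit step budgets 861 MiB for the draft model, which matches the draft weights plus one context, while the log shows a fifth draft context being constructed when cudaMalloc fails:

```python
# All figures are MiB on CUDA0, copied from the b7495 log above.
FREE_BEFORE_DRAFT = 1886.0   # "using device CUDA0 ... - 1886 MiB free" (draft load)
FIT_BUDGET = 861.0           # "8 layers, 861 MiB used" from llama_params_fit_impl
DRAFT_WEIGHTS = 269.28       # CUDA0 model buffer of the draft model
DRAFT_KV = 293.78            # CUDA0 KV buffer per draft context
DRAFT_COMPUTE = 298.75       # CUDA0 compute buffer per draft context
CONTEXTS_ATTEMPTED = 5       # "constructing llama_context" lines after "loading draft model"


def draft_usage(n_contexts: int) -> float:
    """CUDA0 memory for the draft weights plus n_contexts draft contexts."""
    return DRAFT_WEIGHTS + n_contexts * (DRAFT_KV + DRAFT_COMPUTE)


one = draft_usage(1)
all_contexts = draft_usage(CONTEXTS_ATTEMPTED)
print(f"one context: {one:.2f} MiB (fit budget: {FIT_BUDGET:.0f} MiB)")
print(f"{CONTEXTS_ATTEMPTED} contexts:  {all_contexts:.2f} MiB "
      f"(free before draft load: {FREE_BEFORE_DRAFT:.0f} MiB)")
```

If this reading is right, the draft contexts together would need roughly 3232 MiB on CUDA0, well above what the fit step planned for. Whether that is the actual cause of the failure is not confirmed here.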

Log for Qwen3-VL-32B-Thinking.Q4_K_M with the draft model, which loads successfully on b7406:

Working Dir: D:/llamacpp/llama-b7406-bin-win-cuda-12.4-x64
Executing Command: llama-server.exe -m D:/lm_studio/mradermacher/Qwen3-VL-32B-Thinking-GGUF/Qwen3-VL-32B-Thinking.Q4_K_M.gguf --jinja -c 20000 -ngl 999 -fa on --temp 1.0 --top-k 40 --top-p 1.0 --min-p 0.05 --repeat-penalty 1.0 --no-mmap --cache-type-k q8_0 --cache-type-v q8_0 --no-warmup --main-gpu 0 -md D:/lm_studio/lmstudio-community/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf --cache-type-k-draft q8_0 --cache_type-v-draft q8_0 --device-draft CUDA0 --split-mode none

================================================================================

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 2070 SUPER, compute capability 7.5, VMM: yes
load_backend: loaded CUDA backend from D:\llamacpp\llama-b7406-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from D:\llamacpp\llama-b7406-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from D:\llamacpp\llama-b7406-bin-win-cuda-12.4-x64\ggml-cpu-haswell.dll
build: 7406 (4aced7a63) with Clang 19.1.5 for Windows x86_64
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16

system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CUDA : ARCHS = 500,610,700,750,800,860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

init: using 15 threads for HTTP server
start: binding port with default address family
main: loading model
srv    load_model: loading model 'D:/lm_studio/mradermacher/Qwen3-VL-32B-Thinking-GGUF/Qwen3-VL-32B-Thinking.Q4_K_M.gguf'

llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:0b:00.0) - 23304 MiB free
llama_model_loader: loaded meta data with 39 key-value pairs and 707 tensors from D:/lm_studio/mradermacher/Qwen3-VL-32B-Thinking-GGUF/Qwen3-VL-32B-Thinking.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3vl
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 VL 32B Thinking
llama_model_loader: - kv   3:                           general.finetune str              = Thinking
llama_model_loader: - kv   4:                           general.basename str              = Qwen3-VL
llama_model_loader: - kv   5:                         general.size_label str              = 32B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                               general.tags arr[str,1]       = ["image-text-to-text"]
llama_model_loader: - kv   8:                        qwen3vl.block_count u32              = 64
llama_model_loader: - kv   9:                     qwen3vl.context_length u32              = 262144
llama_model_loader: - kv  10:                   qwen3vl.embedding_length u32              = 5120
llama_model_loader: - kv  11:                qwen3vl.feed_forward_length u32              = 25600
llama_model_loader: - kv  12:               qwen3vl.attention.head_count u32              = 64
llama_model_loader: - kv  13:            qwen3vl.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                     qwen3vl.rope.freq_base f32              = 5000000.000000
llama_model_loader: - kv  15:   qwen3vl.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  16:               qwen3vl.attention.key_length u32              = 128
llama_model_loader: - kv  17:             qwen3vl.attention.value_length u32              = 128
llama_model_loader: - kv  18:            qwen3vl.rope.dimension_sections arr[i32,4]       = [24, 20, 20, 0]
llama_model_loader: - kv  19:                 qwen3vl.n_deepstack_layers u32              = 3
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...

llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  25:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  26:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  27:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  28:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  29:                    tokenizer.chat_template str              = {%- set image_count = namespace(value...
llama_model_loader: - kv  30:               general.quantization_version u32              = 2
llama_model_loader: - kv  31:                          general.file_type u32              = 15
llama_model_loader: - kv  32:                                general.url str              = https://huggingface.co/mradermacher/Q...
llama_model_loader: - kv  33:              mradermacher.quantize_version str              = 2
llama_model_loader: - kv  34:                  mradermacher.quantized_by str              = mradermacher
llama_model_loader: - kv  35:                  mradermacher.quantized_at str              = 2025-10-30T18:34:42+01:00
llama_model_loader: - kv  36:                  mradermacher.quantized_on str              = nico1
llama_model_loader: - kv  37:                         general.source.url str              = https://huggingface.co/Qwen/Qwen3-VL-...
llama_model_loader: - kv  38:                  mradermacher.convert_type str              = hf
llama_model_loader: - type  f32:  257 tensors
llama_model_loader: - type q4_K:  385 tensors
llama_model_loader: - type q6_K:   65 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 18.40 GiB (4.82 BPW) 
load: printing all EOG tokens:
load:   - 151643 ('<|endoftext|>')
load:   - 151645 ('<|im_end|>')
load:   - 151662 ('<|fim_pad|>')
load:   - 151663 ('<|repo_name|>')
load:   - 151664 ('<|file_sep|>')

load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3vl
print_info: vocab_only       = 0
print_info: n_ctx_train      = 262144
print_info: n_embd           = 5120
print_info: n_embd_inp       = 20480
print_info: n_layer          = 64
print_info: n_head           = 64
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 25600
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 40
print_info: rope scaling     = linear
print_info: freq_base_train  = 5000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 262144
print_info: rope_yarn_log_mul= 0.0000
print_info: rope_finetuned   = unknown
print_info: mrope sections   = [24, 20, 20, 0]
print_info: model type       = 32B
print_info: model params     = 32.76 B
print_info: general.name     = Qwen3 VL 32B Thinking
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)

load_tensors: offloading 64 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 65/65 layers to GPU
load_tensors:          CPU model buffer size =   417.30 MiB
load_tensors:        CUDA0 model buffer size = 18423.65 MiB

.................................................................................................
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|im_end|> logit bias = -inf
common_init_result: added <|fim_pad|> logit bias = -inf
common_init_result: added <|repo_name|> logit bias = -inf
common_init_result: added <|file_sep|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 20224
llama_context: n_ctx_seq     = 20224
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 5000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (20224) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.58 MiB
llama_kv_cache:      CUDA0 KV buffer size =  2686.00 MiB

llama_kv_cache: size = 2686.00 MiB ( 20224 cells,  64 layers,  1/1 seqs), K (q8_0): 1343.00 MiB, V (q8_0): 1343.00 MiB
llama_context:      CUDA0 compute buffer size =   306.75 MiB
llama_context:  CUDA_Host compute buffer size =    49.52 MiB
llama_context: graph nodes  = 2247
llama_context: graph splits = 2
srv    load_model: loading draft model 'D:/lm_studio/lmstudio-community/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:0b:00.0) - 1886 MiB free
llama_model_loader: loaded meta data with 27 key-value pairs and 311 tensors from D:/lm_studio/lmstudio-community/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 0.6B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 0.6B
llama_model_loader: - kv   5:                          qwen3.block_count u32              = 28
llama_model_loader: - kv   6:                       qwen3.context_length u32              = 32768
llama_model_loader: - kv   7:                     qwen3.embedding_length u32              = 1024
llama_model_loader: - kv   8:                  qwen3.feed_forward_length u32              = 3072
llama_model_loader: - kv   9:                 qwen3.attention.head_count u32              = 16
llama_model_loader: - kv  10:              qwen3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  11:                       qwen3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  12:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  13:                 qwen3.attention.key_length u32              = 128
llama_model_loader: - kv  14:               qwen3.attention.value_length u32              = 128
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2

llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - kv  26:                          general.file_type u32              = 7
llama_model_loader: - type  f32:  113 tensors
llama_model_loader: - type q8_0:  198 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 761.80 MiB (8.50 BPW) 

load: printing all EOG tokens:
load:   - 151643 ('<|endoftext|>')
load:   - 151645 ('<|im_end|>')
load:   - 151662 ('<|fim_pad|>')
load:   - 151663 ('<|repo_name|>')
load:   - 151664 ('<|file_sep|>')
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 1024
print_info: n_embd_inp       = 1024
print_info: n_layer          = 28
print_info: n_head           = 16
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 2
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 3072
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = -1
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_yarn_log_mul= 0.0000
print_info: rope_finetuned   = unknown
print_info: model type       = 0.6B
print_info: model params     = 751.63 M
print_info: general.name     = Qwen3 0.6B
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)

load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors:        CUDA0 model buffer size =   604.15 MiB
load_tensors:    CUDA_Host model buffer size =   157.65 MiB
.............................................................
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|im_end|> logit bias = -inf
common_init_result: added <|fim_pad|> logit bias = -inf
common_init_result: added <|repo_name|> logit bias = -inf
common_init_result: added <|file_sep|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 20224
llama_context: n_ctx_seq     = 20224
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (20224) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.58 MiB
llama_kv_cache:      CUDA0 KV buffer size =  1175.12 MiB
llama_kv_cache: size = 1175.12 MiB ( 20224 cells,  28 layers,  1/1 seqs), K (q8_0):  587.56 MiB, V (q8_0):  587.56 MiB
llama_context:      CUDA0 compute buffer size =   298.75 MiB
llama_context:  CUDA_Host compute buffer size =    41.51 MiB
llama_context: graph nodes  = 987
llama_context: graph splits = 2

srv          init: initializing slots, n_slots = 1
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 20224
llama_context: n_ctx_seq     = 20224
llama_context: n_batch       = 20224
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (20224) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.58 MiB
llama_kv_cache:      CUDA0 KV buffer size =  1175.12 MiB
llama_kv_cache: size = 1175.12 MiB ( 20224 cells,  28 layers,  1/1 seqs), K (q8_0):  587.56 MiB, V (q8_0):  587.56 MiB
llama_context:      CUDA0 compute buffer size =   298.75 MiB
llama_context:  CUDA_Host compute buffer size =    41.51 MiB
llama_context: graph nodes  = 987
llama_context: graph splits = 2
slot         init: id  0 | task -1 | new slot, n_ctx = 20224
srv          init: prompt cache is enabled, size limit: 8192 MiB
srv          init: use `--cache-ram 0` to disable the prompt cache
srv          init: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
srv          init: thinking = 1
init: chat template, chat_template: {%- set image_count = namespace(value=0) %}
{%- set video_count = namespace(value=0) %}
{%- macro render_content(content, do_vision_count) %}
    {%- if content is string %}
        {{- content }}
    {%- else %}
        {%- for item in content %}
            {%- if 'image' in item or 'image_url' in item or item.type == 'image' %}
                {%- if do_vision_count %}
                    {%- set image_count.value = image_count.value + 1 %}
                {%- endif %}
                {%- if add_vision_id %}Picture {{ image_count.value }}: {% endif -%}
                <|vision_start|><|image_pad|><|vision_end|>
            {%- elif 'video' in item or item.type == 'video' %}
                {%- if do_vision_count %}
                    {%- set video_count.value = video_count.value + 1 %}
                {%- endif %}
                {%- if add_vision_id %}Video {{ video_count.value }}: {% endif -%}
                <|vision_start|><|video_pad|><|vision_end|>
            {%- elif 'text' in item %}
                {{- item.text }}
            {%- endif %}
        {%- endfor %}
    {%- endif %}
{%- endmacro %}
{%- if tools %}
    {{- '<|im_start|>system\n' }}
    {%- if messages[0].role == 'system' %}
        {{- render_content(messages[0].content, false) + '\n\n' }}
    {%- endif %}
    {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
    {%- for tool in tools %}
        {{- "\n" }}
        {{- tool | tojson }}
    {%- endfor %}
    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
    {%- if messages[0].role == 'system' %}
        {{- '<|im_start|>system\n' + render_content(messages[0].content, false) + '<|im_end|>\n' }}
    {%- endif %}
{%- endif %}
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for message in messages[::-1] %}
    {%- set index = (messages|length - 1) - loop.index0 %}
    {%- if ns.multi_step_tool and message.role == "user" %}
        {%- set content = render_content(message.content, false) %}
        {%- if not(content.startswith('<tool_response>') and content.endswith('</tool_response>')) %}
            {%- set ns.multi_step_tool = false %}
            {%- set ns.last_query_index = index %}
        {%- endif %}
    {%- endif %}
{%- endfor %}
{%- for message in messages %}
    {%- set content = render_content(message.content, True) %}
    {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
        {{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
    {%- elif message.role == "assistant" %}
        {%- set reasoning_content = '' %}
        {%- if message.reasoning_content is string %}
            {%- set reasoning_content = message.reasoning_content %}
        {%- else %}
            {%- if '</think>' in content %}
                {%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
                {%- set content = content.split('</think>')[-1].lstrip('\n') %}
            {%- endif %}
        {%- endif %}
        {%- if loop.index0 > ns.last_query_index %}
            {%- if loop.last or (not loop.last and reasoning_content) %}
                {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
            {%- else %}
                {{- '<|im_start|>' + message.role + '\n' + content }}
            {%- endif %}
        {%- else %}
            {{- '<|im_start|>' + message.role + '\n' + content }}
        {%- endif %}
        {%- if message.tool_calls %}
            {%- for tool_call in message.tool_calls %}
                {%- if (loop.first and content) or (not loop.first) %}
                    {{- '\n' }}
                {%- endif %}
                {%- if tool_call.function %}
                    {%- set tool_call = tool_call.function %}
                {%- endif %}
                {{- '<tool_call>\n{"name": "' }}
                {{- tool_call.name }}
                {{- '", "arguments": ' }}
                {%- if tool_call.arguments is string %}
                    {{- tool_call.arguments }}
                {%- else %}
                    {{- tool_call.arguments | tojson }}
                {%- endif %}
                {{- '}\n</tool_call>' }}
            {%- endfor %}
        {%- endif %}
        {{- '<|im_end|>\n' }}
    {%- elif message.role == "tool" %}
        {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
            {{- '<|im_start|>user' }}
        {%- endif %}
        {{- '\n<tool_response>\n' }}
        {{- content }}
        {{- '\n</tool_response>' }}
        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
            {{- '<|im_end|>\n' }}
        {%- endif %}
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n<think>\n' }}
{%- endif %}
, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
<think>
'
main: model loaded
main: server is listening on http://127.0.0.1:8080
main: starting the main loop...
srv  update_slots: all slots are idle


[INFO] Model is fully loaded.
[DIAGNOSTICS] Detected 'model loaded' string.
srv  params_from_: Chat format: Hermes 2 Pro
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 20224, n_keep = 0, task.n_tokens = 3692
slot update_slots: id  0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 2048, batch.n_tokens = 2048, progress = 0.554713

slot update_slots: id  0 | task 0 | n_tokens = 2048, memory_seq_rm [2048, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 3692, batch.n_tokens = 1644, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_tokens = 3692, batch.n_tokens = 1644

slot print_timing: id  0 | task 0 | 
prompt eval time =    3342.05 ms /  3692 tokens (    0.91 ms per token,  1104.71 tokens per second)
       eval time =   91573.07 ms /  4522 tokens (   20.25 ms per token,    49.38 tokens per second)
      total time =   94915.12 ms /  8214 tokens
draft acceptance rate = 0.71409 ( 3574 accepted /  5005 generated)
slot      release: id  0 | task 0 | stop processing: n_tokens = 8213, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
'''
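
To compare builds without reading each log by eye, the timing lines above can be pulled out with a small script. This is only a sketch: the function name `summarize` and the regular expressions are my own and assume the `print_timing` layout shown in this log, which may differ in other builds.

```python
import re

# Assumed layout, copied from the print_timing lines in this log:
#   prompt eval time =    3342.05 ms /  3692 tokens (...)
#          eval time =   91573.07 ms /  4522 tokens (...)
TIMING_RE = re.compile(
    r"^\s*(prompt eval|eval) time\s*=\s*([\d.]+) ms\s*/\s*(\d+) tokens",
    re.M,
)
# Assumed layout:
#   draft acceptance rate = 0.71409 ( 3574 accepted /  5005 generated)
DRAFT_RE = re.compile(
    r"draft acceptance rate\s*=\s*[\d.]+\s*"
    r"\(\s*(\d+) accepted\s*/\s*(\d+) generated\)"
)


def summarize(log_text):
    """Recompute tokens/second and draft acceptance from one server log."""
    result = {}
    for kind, ms, tokens in TIMING_RE.findall(log_text):
        key = "prompt_tps" if kind == "prompt eval" else "gen_tps"
        # Tokens divided by elapsed seconds; a later match overwrites an
        # earlier one, so only the last request in the log is reported.
        result[key] = int(tokens) / (float(ms) / 1000.0)
    match = DRAFT_RE.search(log_text)
    if match:
        accepted, generated = int(match.group(1)), int(match.group(2))
        result["draft_acceptance"] = accepted / generated
    return result


if __name__ == "__main__":
    import sys

    # Usage: python summarize_log.py b7406.log b7495.log
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as fh:
            print(path, summarize(fh.read()))
```

Running it over one log per build gives the prompt and generation t/s side by side, which makes it easier to see at which build the numbers change.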

Labels: CUDA, Nvidia GPU, bug-unconfirmed, performance, regression, stale, windows
