
Eval bug: ggml_backend_cuda_buffer_type_alloc_buffer: allocating 144817.04 MiB on device 0: cudaMalloc failed: out of memory alloc_tensor_range: failed to allocate CUDA0 buffer of size 151851674624 #20431

@Nekotekina

Name and Version

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24126 MiB):
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24126 MiB (23863 MiB free)
version: 8284 (00de615)
built with GNU 13.3.0 for Linux x86_64

Operating systems

Linux

GGML backends

CUDA

Hardware

Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24126 MiB (23863 MiB free)

Models

Qwen3.5-397B-A17B-Heretic-Q4_K_S.gguf

Problem description & steps to reproduce

~/github/llama.cpp/build/bin/llama-server -m ~/Downloads/LLM-Models/Qwen3.5-397B-A17B-Heretic-Q4_K_S.gguf --mmproj ~/Downloads/LLM-Models/Qwen3.5-397B/mmproj-BF16.gguf -cmoe --jinja --threads 22 --n-gpu-layers 99 --ctx-size 262144 -kvu -fa on --no-warmup --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 -ub 1024 --reasoning-budget 0 --swa-full -ot ffn_exps=CPU

First Bad Commit

No response

Relevant log output

@:~/github/llama.cpp$ ~/github/llama.cpp/build/bin/llama-server -m ~/Downloads/LLM-Models/Qwen3.5-397B-A17B-Heretic-Q4_K_S.gguf --mmproj ~/Downloads/LLM-Models/Qwen3.5-397B/mmproj-BF16.gguf -cmoe --jinja --threads 22 --n-gpu-layers 99 --ctx-size 262144 -kvu -fa on --no-warmup --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 -ub 1024 --reasoning-budget 0 --swa-full -ot ffn_exps=CPU
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24126 MiB):
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24126 MiB (23863 MiB free)
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build: 8284 (00de61534) with GNU 13.3.0 for Linux x86_64
system info: n_threads = 22, n_threads_batch = 22, total_threads = 24

system_info: n_threads = 22 (n_threads_batch = 22) / 24 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

init: using 23 threads for HTTP server
start: binding port with default address family
main: loading model
srv    load_model: loading model '/home/fbx/Downloads/LLM-Models/Qwen3.5-397B-A17B-Heretic-Q4_K_S.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: projected to use 155990 MiB of device memory vs. 23117 MiB of free device memory
llama_params_fit_impl: cannot meet free memory target of 1024 MiB, need to reduce device memory by 133897 MiB
llama_params_fit_impl: context size set by user to 262144 -> no change
llama_params_fit: failed to fit params to free device memory: n_gpu_layers already set by user to 99, abort
llama_params_fit: fitting params to free memory took 0.56 seconds
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:17:00.0) - 23863 MiB free
llama_model_loader: loaded meta data with 43 key-value pairs and 1038 tensors from /home/fbx/Downloads/LLM-Models/Qwen3.5-397B-A17B-Heretic-Q4_K_S.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen35moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                     general.sampling.top_k i32              = 20
llama_model_loader: - kv   3:                     general.sampling.top_p f32              = 0.950000
llama_model_loader: - kv   4:                      general.sampling.temp f32              = 0.600000
llama_model_loader: - kv   5:                               general.name str              = Qwen3.5 397B A17B Heretic
llama_model_loader: - kv   6:                           general.finetune str              = heretic
llama_model_loader: - kv   7:                           general.basename str              = Qwen3.5
llama_model_loader: - kv   8:                         general.size_label str              = 397B-A17B
llama_model_loader: - kv   9:                            general.license str              = apache-2.0
llama_model_loader: - kv  10:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3.5-3...
llama_model_loader: - kv  11:                               general.tags arr[str,5]       = ["heretic", "uncensored", "decensored...
llama_model_loader: - kv  12:                      qwen35moe.block_count u32              = 60
llama_model_loader: - kv  13:                   qwen35moe.context_length u32              = 262144
llama_model_loader: - kv  14:                 qwen35moe.embedding_length u32              = 4096
llama_model_loader: - kv  15:             qwen35moe.attention.head_count u32              = 32
llama_model_loader: - kv  16:          qwen35moe.attention.head_count_kv u32              = 2
llama_model_loader: - kv  17:          qwen35moe.rope.dimension_sections arr[i32,4]       = [11, 11, 10, 0]
llama_model_loader: - kv  18:                   qwen35moe.rope.freq_base f32              = 10000000.000000
llama_model_loader: - kv  19: qwen35moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  20:                     qwen35moe.expert_count u32              = 512
llama_model_loader: - kv  21:                qwen35moe.expert_used_count u32              = 10
llama_model_loader: - kv  22:             qwen35moe.attention.key_length u32              = 256
llama_model_loader: - kv  23:           qwen35moe.attention.value_length u32              = 256
llama_model_loader: - kv  24:       qwen35moe.expert_feed_forward_length u32              = 1024
llama_model_loader: - kv  25: qwen35moe.expert_shared_feed_forward_length u32              = 1024
llama_model_loader: - kv  26:                  qwen35moe.ssm.conv_kernel u32              = 4
llama_model_loader: - kv  27:                   qwen35moe.ssm.state_size u32              = 128
llama_model_loader: - kv  28:                  qwen35moe.ssm.group_count u32              = 16
llama_model_loader: - kv  29:               qwen35moe.ssm.time_step_rank u32              = 64
llama_model_loader: - kv  30:                   qwen35moe.ssm.inner_size u32              = 8192
llama_model_loader: - kv  31:          qwen35moe.full_attention_interval u32              = 4
llama_model_loader: - kv  32:             qwen35moe.rope.dimension_count u32              = 64
llama_model_loader: - kv  33:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  34:                         tokenizer.ggml.pre str              = qwen35
llama_model_loader: - kv  35:                      tokenizer.ggml.tokens arr[str,248320]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  36:                  tokenizer.ggml.token_type arr[i32,248320]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  37:                      tokenizer.ggml.merges arr[str,247587]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  38:                tokenizer.ggml.eos_token_id u32              = 248046
llama_model_loader: - kv  39:            tokenizer.ggml.padding_token_id u32              = 248044
llama_model_loader: - kv  40:                    tokenizer.chat_template str              = {%- set image_count = namespace(value...
llama_model_loader: - kv  41:               general.quantization_version u32              = 2
llama_model_loader: - kv  42:                          general.file_type u32              = 14
llama_model_loader: - type  f32:  541 tensors
llama_model_loader: - type q8_0:  142 tensors
llama_model_loader: - type q4_K:  167 tensors
llama_model_loader: - type q5_K:  165 tensors
llama_model_loader: - type q6_K:   23 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Small
print_info: file size   = 216.96 GiB (4.70 BPW) 
load: 0 unused tokens
load: printing all EOG tokens:
load:   - 248044 ('<|endoftext|>')
load:   - 248046 ('<|im_end|>')
load:   - 248063 ('<|fim_pad|>')
load:   - 248064 ('<|repo_name|>')
load:   - 248065 ('<|file_sep|>')
load: special tokens cache size = 33
load: token to piece cache size = 1.7581 MB
print_info: arch                  = qwen35moe
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 262144
print_info: n_embd                = 4096
print_info: n_embd_inp            = 4096
print_info: n_layer               = 60
print_info: n_head                = 32
print_info: n_head_kv             = 2
print_info: n_rot                 = 64
print_info: n_swa                 = 0
print_info: is_swa_any            = 0
print_info: n_embd_head_k         = 256
print_info: n_embd_head_v         = 256
print_info: n_gqa                 = 16
print_info: n_embd_k_gqa          = 512
print_info: n_embd_v_gqa          = 512
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-06
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 0.0e+00
print_info: n_ff                  = 0
print_info: n_expert              = 512
print_info: n_expert_used         = 10
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 1
print_info: pooling type          = 0
print_info: rope type             = 40
print_info: rope scaling          = linear
print_info: freq_base_train       = 10000000.0
print_info: freq_scale_train      = 1
print_info: n_ctx_orig_yarn       = 262144
print_info: rope_yarn_log_mul     = 0.0000
print_info: rope_finetuned        = unknown
print_info: mrope sections        = [11, 11, 10, 0]
print_info: ssm_d_conv            = 4
print_info: ssm_d_inner           = 8192
print_info: ssm_d_state           = 128
print_info: ssm_dt_rank           = 64
print_info: ssm_n_group           = 16
print_info: ssm_dt_b_c_rms        = 0
print_info: model type            = 397B.A17B
print_info: model params          = 396.35 B
print_info: general.name          = Qwen3.5 397B A17B Heretic
print_info: vocab type            = BPE
print_info: n_vocab               = 248320
print_info: n_merges              = 247587
print_info: BOS token             = 11 ','
print_info: EOS token             = 248046 '<|im_end|>'
print_info: EOT token             = 248046 '<|im_end|>'
print_info: PAD token             = 248044 '<|endoftext|>'
print_info: LF token              = 198 'Ċ'
print_info: FIM PRE token         = 248060 '<|fim_prefix|>'
print_info: FIM SUF token         = 248062 '<|fim_suffix|>'
print_info: FIM MID token         = 248061 '<|fim_middle|>'
print_info: FIM PAD token         = 248063 '<|fim_pad|>'
print_info: FIM REP token         = 248064 '<|repo_name|>'
print_info: FIM SEP token         = 248065 '<|file_sep|>'
print_info: EOG token             = 248044 '<|endoftext|>'
print_info: EOG token             = 248046 '<|im_end|>'
print_info: EOG token             = 248063 '<|fim_pad|>'
print_info: EOG token             = 248064 '<|repo_name|>'
print_info: EOG token             = 248065 '<|file_sep|>'
print_info: max token length      = 256
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 144817.04 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 151851674624
llama_model_load: error loading model: unable to allocate CUDA0 buffer
...
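
As a sanity check on the figures in the log above, the short sketch below recomputes the reduction reported by llama_params_fit_impl and converts the failed allocation size from bytes to MiB. The numbers are copied from the log. The formula (projected use minus free memory, plus the free-memory target) is an assumption inferred from those three logged values, not taken from the llama.cpp source, and the function name is hypothetical.

```python
# Sanity check of the memory figures printed in the log.
# All constants are copied from the log output; nothing is measured here.

MIB = 1024 * 1024

projected_mib = 155990           # "projected to use 155990 MiB of device memory"
free_mib = 23117                 # "vs. 23117 MiB of free device memory"
target_free_mib = 1024           # "cannot meet free memory target of 1024 MiB"
reported_reduction_mib = 133897  # "need to reduce device memory by 133897 MiB"

failed_alloc_bytes = 151851674624  # "failed to allocate CUDA0 buffer of size ..."
reported_alloc_mib = 144817.04     # "allocating 144817.04 MiB on device 0"


def required_reduction(projected: int, free: int, target: int) -> int:
    """Hypothetical helper. ASSUMPTION: the fitter wants `target` MiB left
    free, so the usable budget is free - target and the shortfall is the
    projected use minus that budget."""
    return projected - (free - target)


reduction = required_reduction(projected_mib, free_mib, target_free_mib)
alloc_mib = failed_alloc_bytes / MIB

print("reduction needed (MiB):", reduction)
print("failed allocation (MiB):", round(alloc_mib, 2))
print("matches log:", reduction == reported_reduction_mib,
      round(alloc_mib, 2) == reported_alloc_mib)
```

Both values agree with the log under that assumption, so the reported sizes are internally consistent: the single CUDA0 buffer requested at load time is several times larger than the 23117 MiB that the log reports as free on the device.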
