
Eval bug: Core dump with GLM-4.7 Q4 with CUDA when using turbo4 (no problem with turbo3) #28

@dan-and

Description


Name and Version

Git commit b90b5e0

build/bin/llama-server --webui-mcp-proxy --alias llamacpp-model -m ../models/GLM-4.7-Flash-Q4_K_M.gguf --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --kv-unified -ctk turbo4 -ctv turbo4 --ctx-size 202752 --no-mmap --jinja -np 1 --host 0.0.0.0 --port 18080 --alias llamacpp-model -ngl 99

ggml_cuda_init: found 4 CUDA devices (Total VRAM: 80219 MiB):
Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes, VRAM: 20054 MiB
Device 1: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes, VRAM: 20054 MiB
Device 2: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes, VRAM: 20054 MiB
Device 3: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes, VRAM: 20054 MiB
register_backend: registered backend CUDA (4 devices)
register_device: registered device CUDA0 (NVIDIA GeForce RTX 3080)
register_device: registered device CUDA1 (NVIDIA GeForce RTX 3080)
register_device: registered device CUDA2 (NVIDIA GeForce RTX 3080)
register_device: registered device CUDA3 (NVIDIA GeForce RTX 3080)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD Ryzen 5 5500GT with Radeon Graphics)
DEPRECATED: argument '--alias' specified multiple times, use comma-separated values instead (only last value will be used)
build: 8653 (b90b5e0) with GNU 13.3.0 for Linux x86_64 (debug)
system info: n_threads = 6, n_threads_batch = 6, total_threads = 12

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

Running without SSL
init: using 11 threads for HTTP server
srv main: -----------------
srv main: CORS proxy is enabled, do not expose server to untrusted environments
srv main: This feature is EXPERIMENTAL and may be removed or changed in future versions
srv main: -----------------
start: binding port with default address family
main: loading model
srv load_model: loading model '../models/GLM-4.7-Flash-Q4_K_M.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_init_from_model: K cache type turbo4 with block size 128 does not divide n_embd_head_k=576
llama_params_fit: encountered an error while trying to fit params to free device memory: failed to create llama_context from model
llama_params_fit: fitting params to free memory took 1.71 seconds
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3080) (0000:15:00.0) - 19830 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3080) (0000:21:00.0) - 19830 MiB free
llama_model_load_from_file_impl: using device CUDA2 (NVIDIA GeForce RTX 3080) (0000:22:00.0) - 19830 MiB free
llama_model_load_from_file_impl: using device CUDA3 (NVIDIA GeForce RTX 3080) (0000:29:00.0) - 19830 MiB free
llama_model_loader: loaded meta data with 59 key-value pairs and 844 tensors from ../models/GLM-4.7-Flash-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = deepseek2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.sampling.temp f32 = 1.000000
llama_model_loader: - kv 3: general.name str = Glm-4.7-Flash
llama_model_loader: - kv 4: general.basename str = Glm-4.7-Flash
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 64x2.6B
llama_model_loader: - kv 7: general.license str = mit
llama_model_loader: - kv 8: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 9: general.base_model.count u32 = 1
llama_model_loader: - kv 10: general.base_model.0.name str = GLM 4.7 Flash
llama_model_loader: - kv 11: general.base_model.0.organization str = Zai Org
llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/zai-org/GLM-4....
llama_model_loader: - kv 13: general.tags arr[str,2] = ["unsloth", "text-generation"]
llama_model_loader: - kv 14: general.languages arr[str,2] = ["en", "zh"]
llama_model_loader: - kv 15: deepseek2.block_count u32 = 47
llama_model_loader: - kv 16: deepseek2.context_length u32 = 202752
llama_model_loader: - kv 17: deepseek2.embedding_length u32 = 2048
llama_model_loader: - kv 18: deepseek2.feed_forward_length u32 = 10240
llama_model_loader: - kv 19: deepseek2.attention.head_count u32 = 20
llama_model_loader: - kv 20: deepseek2.attention.head_count_kv u32 = 1
llama_model_loader: - kv 21: deepseek2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 22: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 23: deepseek2.expert_used_count u32 = 4
llama_model_loader: - kv 24: deepseek2.expert_group_count u32 = 1
llama_model_loader: - kv 25: deepseek2.expert_group_used_count u32 = 1
llama_model_loader: - kv 26: deepseek2.expert_gating_func u32 = 2
llama_model_loader: - kv 27: deepseek2.leading_dense_block_count u32 = 1
llama_model_loader: - kv 28: deepseek2.vocab_size u32 = 154880
llama_model_loader: - kv 29: deepseek2.attention.q_lora_rank u32 = 768
llama_model_loader: - kv 30: deepseek2.attention.kv_lora_rank u32 = 512
llama_model_loader: - kv 31: deepseek2.attention.key_length u32 = 576
llama_model_loader: - kv 32: deepseek2.attention.value_length u32 = 512
llama_model_loader: - kv 33: deepseek2.attention.key_length_mla u32 = 256
llama_model_loader: - kv 34: deepseek2.attention.value_length_mla u32 = 256
llama_model_loader: - kv 35: deepseek2.expert_feed_forward_length u32 = 1536
llama_model_loader: - kv 36: deepseek2.expert_count u32 = 64
llama_model_loader: - kv 37: deepseek2.expert_shared_count u32 = 1
llama_model_loader: - kv 38: deepseek2.expert_weights_scale f32 = 1.800000
llama_model_loader: - kv 39: deepseek2.expert_weights_norm bool = true
llama_model_loader: - kv 40: deepseek2.rope.dimension_count u32 = 64
llama_model_loader: - kv 41: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 42: tokenizer.ggml.pre str = glm4
llama_model_loader: - kv 43: tokenizer.ggml.tokens arr[str,154880] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 44: tokenizer.ggml.token_type arr[i32,154880] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 45: tokenizer.ggml.merges arr[str,321649] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 46: tokenizer.ggml.eos_token_id u32 = 154820
llama_model_loader: - kv 47: tokenizer.ggml.padding_token_id u32 = 154821
llama_model_loader: - kv 48: tokenizer.ggml.bos_token_id u32 = 154822
llama_model_loader: - kv 49: tokenizer.ggml.eot_token_id u32 = 154827
llama_model_loader: - kv 50: tokenizer.ggml.unknown_token_id u32 = 154820
llama_model_loader: - kv 51: tokenizer.ggml.eom_token_id u32 = 154829
llama_model_loader: - kv 52: tokenizer.chat_template str = [gMASK]\n{%- if tools -%}\n<|syste...
llama_model_loader: - kv 53: general.quantization_version u32 = 2
llama_model_loader: - kv 54: general.file_type u32 = 15
llama_model_loader: - kv 55: quantize.imatrix.file str = GLM-4.7-Flash-GGUF/imatrix_unsloth.gguf
llama_model_loader: - kv 56: quantize.imatrix.dataset str = unsloth_calibration_GLM-4.7-Flash.txt
llama_model_loader: - kv 57: quantize.imatrix.entries_count u32 = 607
llama_model_loader: - kv 58: quantize.imatrix.chunks_count u32 = 85
llama_model_loader: - type f32: 281 tensors
llama_model_loader: - type q8_0: 141 tensors
llama_model_loader: - type q4_K: 260 tensors
llama_model_loader: - type q5_K: 92 tensors
llama_model_loader: - type q6_K: 70 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 17.05 GiB (4.89 BPW)
load: 0 unused tokens
load: special_eot_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special_eom_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 154820 ('<|endoftext|>')
load: - 154827 ('<|user|>')
load: - 154829 ('<|observation|>')
load: special tokens cache size = 36
load: token to piece cache size = 0.9811 MB
print_info: arch = deepseek2
print_info: vocab_only = 0
print_info: no_alloc = 0
print_info: n_ctx_train = 202752
print_info: n_embd = 2048
print_info: n_embd_inp = 2048
print_info: n_layer = 47
print_info: n_head = 20
print_info: n_head_kv = 1
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 576
print_info: n_embd_head_v = 512
print_info: n_gqa = 20
print_info: n_embd_k_gqa = 576
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 10240
print_info: n_expert = 64
print_info: n_expert_groups = 1
print_info: n_group_used = 1
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 202752
print_info: rope_yarn_log_mul = 0.0000
print_info: rope_finetuned = unknown
print_info: model type = 30B.A3B
print_info: model params = 29.94 B
print_info: general.name = Glm-4.7-Flash
print_info: n_layer_dense_lead = 1
print_info: n_lora_q = 768
print_info: n_lora_kv = 512
print_info: n_embd_head_k_mla = 256
print_info: n_embd_head_v_mla = 256
print_info: n_ff_exp = 1536
print_info: n_expert_shared = 1
print_info: expert_weights_scale = 1.8
print_info: expert_weights_norm = 1
print_info: expert_gating_func = sigmoid
print_info: vocab type = BPE
print_info: n_vocab = 154880
print_info: n_merges = 321649
print_info: BOS token = 154822 '[gMASK]'
print_info: EOS token = 154820 '<|endoftext|>'
print_info: EOT token = 154827 '<|user|>'
print_info: EOM token = 154829 '<|observation|>'
print_info: UNK token = 154820 '<|endoftext|>'
print_info: PAD token = 154821 '[MASK]'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 154838 '<|code_prefix|>'
print_info: FIM SUF token = 154840 '<|code_suffix|>'
print_info: FIM MID token = 154839 '<|code_middle|>'
print_info: EOG token = 154820 '<|endoftext|>'
print_info: EOG token = 154827 '<|user|>'
print_info: EOG token = 154829 '<|observation|>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 46 repeating layers to GPU
load_tensors: offloaded 48/48 layers to GPU
load_tensors: CPU model buffer size = 170.16 MiB
load_tensors: CUDA0 model buffer size = 4151.17 MiB
load_tensors: CUDA1 model buffer size = 4344.35 MiB
load_tensors: CUDA2 model buffer size = 4344.35 MiB
load_tensors: CUDA3 model buffer size = 4444.97 MiB
....................................................................................................
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|user|> logit bias = -inf
common_init_result: added <|observation|> logit bias = -inf
Segmentation fault (core dumped)

Operating systems

Linux

GGML backends

CUDA

Hardware

Ryzen 5500 , 4x RTX 3080 20GB

Models

https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF

Problem description & steps to reproduce

Loading the model works, but it core dumps on the first call:

What your log is actually showing

This is not a classic “analyze this core file” situation; it is a startup failure with a clear first error, then a segfault at the end.

  1. Real incompatibility (now fixed in tree)

K cache type turbo4 with block size 128 does not divide n_embd_head_k=576

• turbo4 uses a 128-wide quantization block (QK_TURBO4 in ggml).
• GLM-4.7 / DeepSeek2 here reports n_embd_head_k = 576, and 576 % 128 ≠ 0, so the early guard in llama_init_from_model rejected the configuration.
• The KV cache already zero-pads turbo types to the next multiple of 128 (576 → 640) in llama-kv-cache.cpp, and the graph pads Q to match K for attention. The init check was therefore stricter than the real layout and wrongly blocked a supported case.

That is why llama_params_fit blew up with failed to create llama_context from model: it probes memory by calling llama_init_from_model, which hit that error.

  2. Segfault after weight load

After the tensors load and the EOG logit biases are printed, the next heavy step is building the context (KV cache, scheduler, graphs). A segfault there often means undefined behavior in the CUDA/flash-attention path or an inconsistent code path; cleaning up the bogus "can't init" failure makes the pipeline sane again.
I aligned the flash-attention and quantized-K/V divisibility checks with the same padding rule as the KV cache, so turbo2/3/4 with non-128-aligned head dimensions are validated against the padded head size (640 for K in your case), not the raw 576.

────────────────────────────────────────

Code change: src/llama-context.cpp — the K/V checks now apply the same turbo 128-alignment padding as llama_kv_cache (and skip V padding when is_mla(), matching the cache). The project builds successfully after the change.
Rebuild llama-server and rerun your command; if anything still crashes, capture a backtrace with `gdb --args build/bin/llama-server ...`, then `run`, then `thread apply all bt`, and we can tie it to a specific kernel or code path.

First Bad Commit

No response

