./bin/llama-server -m ~/.cache/llama.cpp/unsloth_GLM-4.6V-GGUF_UD-Q4_K_XL_GLM-4.6V-UD-Q4_K_XL-00001-of-00002.gguf -ctk q8_0 -ctv q8_0
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build: 7979 (820ebfa6f) with GNU 15.2.1 for Linux x86_64
system info: n_threads = 16, n_threads_batch = 16, total_threads = 32
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
Running without SSL
init: using 31 threads for HTTP server
start: binding port with default address family
main: loading model
srv load_model: loading model '/home/wim/.cache/llama.cpp/unsloth_GLM-4.6V-GGUF_UD-Q4_K_XL_GLM-4.6V-UD-Q4_K_XL-00001-of-00002.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: projected to use 75037 MiB of device memory vs. 120451 MiB of free device memory
llama_params_fit_impl: will leave 45414 >= 1024 MiB of free device memory, no changes needed
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.25 seconds
llama_model_load_from_file_impl: using device ROCm0 (Radeon 8060S Graphics) (0000:c3:00.0) - 120454 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 62 key-value pairs and 780 tensors from /home/wim/.cache/llama.cpp/unsloth_GLM-4.6V-GGUF_UD-Q4_K_XL_GLM-4.6V-UD-Q4_K_XL-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = glm4moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.sampling.top_k i32 = 2
llama_model_loader: - kv 3: general.sampling.top_p f32 = 0.600000
llama_model_loader: - kv 4: general.sampling.temp f32 = 0.800000
llama_model_loader: - kv 5: general.name str = Glm-4.6V
llama_model_loader: - kv 6: general.finetune str = 4.6V
llama_model_loader: - kv 7: general.basename str = Glm-4.6V
llama_model_loader: - kv 8: general.quantized_by str = Unsloth
llama_model_loader: - kv 9: general.size_label str = 128x8.0B
llama_model_loader: - kv 10: general.license str = mit
llama_model_loader: - kv 11: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 12: general.base_model.count u32 = 1
llama_model_loader: - kv 13: general.base_model.0.name str = GLM 4.6V
llama_model_loader: - kv 14: general.base_model.0.organization str = Zai Org
llama_model_loader: - kv 15: general.base_model.0.repo_url str = https://huggingface.co/zai-org/GLM-4.6V
llama_model_loader: - kv 16: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
llama_model_loader: - kv 17: general.languages arr[str,2] = ["zh", "en"]
llama_model_loader: - kv 18: glm4moe.block_count u32 = 46
llama_model_loader: - kv 19: glm4moe.context_length u32 = 131072
llama_model_loader: - kv 20: glm4moe.embedding_length u32 = 4096
llama_model_loader: - kv 21: glm4moe.feed_forward_length u32 = 10944
llama_model_loader: - kv 22: glm4moe.attention.head_count u32 = 96
llama_model_loader: - kv 23: glm4moe.attention.head_count_kv u32 = 8
llama_model_loader: - kv 24: glm4moe.rope.dimension_sections arr[i32,4] = [8, 12, 12, 0]
llama_model_loader: - kv 25: glm4moe.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 26: glm4moe.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 27: glm4moe.expert_used_count u32 = 8
llama_model_loader: - kv 28: glm4moe.expert_group_count u32 = 1
llama_model_loader: - kv 29: glm4moe.expert_group_used_count u32 = 1
llama_model_loader: - kv 30: glm4moe.attention.key_length u32 = 128
llama_model_loader: - kv 31: glm4moe.attention.value_length u32 = 128
llama_model_loader: - kv 32: glm4moe.rope.dimension_count u32 = 64
llama_model_loader: - kv 33: glm4moe.expert_count u32 = 128
llama_model_loader: - kv 34: glm4moe.expert_feed_forward_length u32 = 1408
llama_model_loader: - kv 35: glm4moe.expert_shared_count u32 = 1
llama_model_loader: - kv 36: glm4moe.leading_dense_block_count u32 = 1
llama_model_loader: - kv 37: glm4moe.expert_gating_func u32 = 2
llama_model_loader: - kv 38: glm4moe.expert_weights_scale f32 = 1.000000
llama_model_loader: - kv 39: glm4moe.expert_weights_norm bool = true
llama_model_loader: - kv 40: glm4moe.nextn_predict_layers u32 = 0
llama_model_loader: - kv 41: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 42: tokenizer.ggml.pre str = glm4
llama_model_loader: - kv 43: tokenizer.ggml.tokens arr[str,151552] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 44: tokenizer.ggml.token_type arr[i32,151552] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 45: tokenizer.ggml.merges arr[str,318088] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 46: tokenizer.ggml.eos_token_id u32 = 151329
llama_model_loader: - kv 47: tokenizer.ggml.padding_token_id u32 = 151330
llama_model_loader: - kv 48: tokenizer.ggml.bos_token_id u32 = 151331
llama_model_loader: - kv 49: tokenizer.ggml.eot_token_id u32 = 151336
llama_model_loader: - kv 50: tokenizer.ggml.unknown_token_id u32 = 151329
llama_model_loader: - kv 51: tokenizer.ggml.eom_token_id u32 = 151338
llama_model_loader: - kv 52: tokenizer.chat_template str = {# Unsloth template fixes #}\n[gMASK]<...
llama_model_loader: - kv 53: general.quantization_version u32 = 2
llama_model_loader: - kv 54: general.file_type u32 = 15
llama_model_loader: - kv 55: quantize.imatrix.file str = GLM-4.6V-GGUF/imatrix_unsloth.gguf
llama_model_loader: - kv 56: quantize.imatrix.dataset str = unsloth_calibration_GLM-4.6V.txt
llama_model_loader: - kv 57: quantize.imatrix.entries_count u32 = 502
llama_model_loader: - kv 58: quantize.imatrix.chunks_count u32 = 90
llama_model_loader: - kv 59: split.no u16 = 0
llama_model_loader: - kv 60: split.tensors.count i32 = 780
llama_model_loader: - kv 61: split.count u16 = 2
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q5_0: 33 tensors
llama_model_loader: - type q5_1: 13 tensors
llama_model_loader: - type q8_0: 45 tensors
llama_model_loader: - type q4_K: 334 tensors
llama_model_loader: - type q5_K: 23 tensors
llama_model_loader: - type q6_K: 11 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 60.95 GiB (4.90 BPW)
load: 0 unused tokens
load: special_eot_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special_eom_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 151329 ('<|endoftext|>')
load: - 151336 ('<|user|>')
load: - 151338 ('<|observation|>')
load: special tokens cache size = 36
load: token to piece cache size = 0.9713 MB
print_info: arch = glm4moe
print_info: vocab_only = 0
print_info: no_alloc = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 4096
print_info: n_embd_inp = 4096
print_info: n_layer = 46
print_info: n_head = 96
print_info: n_head_kv = 8
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 12
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 10944
print_info: n_expert = 128
print_info: n_expert_used = 8
print_info: n_expert_groups = 1
print_info: n_group_used = 1
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 8
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_yarn_log_mul = 0.0000
print_info: rope_finetuned = unknown
print_info: mrope sections = [8, 12, 12, 0]
print_info: model type = ?B
print_info: model params = 106.85 B
print_info: general.name = Glm-4.6V
print_info: vocab type = BPE
print_info: n_vocab = 151552
print_info: n_merges = 318088
print_info: BOS token = 151331 '[gMASK]'
print_info: EOS token = 151329 '<|endoftext|>'
print_info: EOT token = 151336 '<|user|>'
print_info: EOM token = 151338 '<|observation|>'
print_info: UNK token = 151329 '<|endoftext|>'
print_info: PAD token = 151330 '[MASK]'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151347 '<|code_prefix|>'
print_info: FIM SUF token = 151349 '<|code_suffix|>'
print_info: FIM MID token = 151348 '<|code_middle|>'
print_info: EOG token = 151329 '<|endoftext|>'
print_info: EOG token = 151336 '<|user|>'
print_info: EOG token = 151338 '<|observation|>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 45 repeating layers to GPU
load_tensors: offloaded 47/47 layers to GPU
load_tensors: CPU_Mapped model buffer size = 333.00 MiB
load_tensors: ROCm0 model buffer size = 62077.27 MiB
.............................................................................................
Name and Version
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
./bin/llama-server -m ~/.cache/llama.cpp/bartowski_stepfun-ai_Step-3.5-Flash-GGUF_stepfun-ai_Step-3.5-Flash-Q3_K_S_stepfun-ai_Step-3.5-Flash-Q3_K_S-00001-of-00003.gguf -ctk q8_0 -ctv q8_0
Problem description & steps to reproduce
Loading large models with the ROCm backend hangs on my device. The model starts loading into memory (I can see memory usage grow), but loading slows down as more memory is allocated and eventually appears to hang, or at least slows to the point where I cannot observe any progress in a reasonable time. This bug only seems to affect larger models.
The following models exhibit this behaviour:
Loading does not finish, and the process appears to use between 1.7 and 2.0 CPU cores once stuck.
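To put a number on the CPU usage of a stuck process, one option is to sample its accumulated CPU time from /proc over a short interval. The following is a minimal sketch, assuming a Linux system with /proc mounted; cpu_ticks and cores_used are hypothetical helper names, not part of llama.cpp.

```shell
#!/bin/sh
# Estimate how many CPU cores a process is using, averaged over an interval,
# by sampling utime + stime from /proc/<pid>/stat (Linux only).

cpu_ticks() {
    # Drop everything up to the last ") " so a process name containing
    # spaces cannot shift the fields; utime and stime are then fields 12 and 13.
    sed 's/^.*) //' "/proc/$1/stat" | awk '{ print $12 + $13 }'
}

cores_used() {
    pid=$1
    interval=${2:-5}              # seconds between the two samples
    hz=$(getconf CLK_TCK)         # clock ticks per second
    before=$(cpu_ticks "$pid")
    sleep "$interval"
    after=$(cpu_ticks "$pid")
    # ticks consumed / ticks available on one core during the interval
    awk -v a="$before" -v b="$after" -v hz="$hz" -v s="$interval" \
        'BEGIN { printf "%.2f\n", (b - a) / (hz * s) }'
}
```

For example, `cores_used "$(pgrep -n llama-server)" 5` would report the average over five seconds; a value between 1.70 and 2.00 would match the observation here.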
The following models load and run correctly on ROCm:
All of these models load and run correctly using the Vulkan backend.
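Because a hang and a very slow load are hard to tell apart by eye, a bounded wait on the server's readiness can make the ROCm and Vulkan comparison repeatable. This is a minimal sketch, assuming llama-server listens on its default port 8080 and that its /health endpoint only returns success once the model has finished loading; wait_until and server_ready are hypothetical helper names.

```shell
#!/bin/sh
# Poll a command once per second until it succeeds or a timeout expires.

wait_until() {  # usage: wait_until <timeout_seconds> <command> [args...]
    deadline=$(( $(date +%s) + $1 ))
    shift
    while [ "$(date +%s)" -lt "$deadline" ]; do
        # Succeed as soon as the probe command exits with status 0.
        "$@" >/dev/null 2>&1 && return 0
        sleep 1
    done
    return 1    # timed out without the probe ever succeeding
}

server_ready() {  # usage: server_ready [port]
    # Assumption: /health answers with an HTTP error while the model is
    # still loading; curl -f turns that into a non-zero exit status.
    curl -sf "http://127.0.0.1:${1:-8080}/health"
}
```

With the server started in another terminal, `wait_until 600 server_ready 8080 && echo loaded || echo "not ready after 600 s"` gives a yes/no answer per model and backend instead of relying on watching the progress dots.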
Additional info
I also tried some models with the CUDA backend and ZLUDA, which exhibited the same behaviour: smaller models load correctly, but loading larger models slows down until it seems to hang entirely.
I was curious and decided to dump the backtrace of llama-server once it hung while loading; perhaps this is useful to anyone trying to figure this out:
System information:
First Bad Commit
This is the first time in months that I have used llama.cpp with ROCm on this platform.
Relevant log output
Logs