Misc. bug: larger model loading on ROCm hangs #19482

@de-wim

Name and Version

$ ./bin/llama-cli --version
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
version: 7979 (820ebfa6f)
built with GNU 15.2.1 for Linux x86_64

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Command line

./bin/llama-server -m ~/.cache/llama.cpp/bartowski_stepfun-ai_Step-3.5-Flash-GGUF_stepfun-ai_Step-3.5-Flash-Q3_K_S_stepfun-ai_Step-3.5-Flash-Q3_K_S-00001-of-00003.gguf -ctk q8_0 -ctv q8_0

Problem description & steps to reproduce

Loading large models with the ROCm backend hangs on my device. The model loads into memory (I can see memory usage grow), but loading slows down as more memory is allocated and eventually appears to hang, or at least slows to the point where I cannot observe any progress in a reasonable time. This bug only seems to affect larger models.

The following models exhibit this behaviour:

  • MiniMax M2.1 (mradermacher/MiniMax-M2.1-i1-GGUF:IQ3_XXS, ~82GB in weights)
  • Step 3.5 flash (bartowski/stepfun-ai_Step-3.5-Flash-GGUF:Q3_K_S, ~80GB)
  • PrimeIntellect INTELLECT-3 (bartowski/PrimeIntellect_INTELLECT-3-GGUF:Q5_K_S, ~71GB)
  • Qwen3-Next-80B-A3B-Instruct (Qwen/Qwen3-Next-80B-A3B-Instruct-GGUF:Q6_K, ~61GB)
  • GLM-4.6V (unsloth/GLM-4.6V-GGUF:Q4_K_XL, ~61GB)

Loading does not finish, and once stuck the process seems to use between 1.7 and 2.0 CPU cores.

The following models load and run correctly on ROCm:

  • gpt-oss-120b (mradermacher/gpt-oss-120b-Derestricted-i1-GGUF, ~60GB - slows down a lot near the end)
  • GLM 4.7 Flash (unsloth/GLM-4.7-Flash-GGUF:Q8_K_XL, ~33GB - loads quickly)

All of these models load and run correctly using the Vulkan backend.
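
The two lists above suggest a size boundary somewhere around 60 GB. A minimal Python sketch that tabulates the approximate weight sizes quoted in this report (the figures are the rounded ones from the lists, not measured values, and the variable names are only illustrative):

```python
# Approximate weight sizes in GB, copied from the lists in this report.
hangs = {
    "MiniMax-M2.1 IQ3_XXS": 82,
    "Step-3.5-Flash Q3_K_S": 80,
    "INTELLECT-3 Q5_K_S": 71,
    "Qwen3-Next-80B-A3B-Instruct Q6_K": 61,
    "GLM-4.6V Q4_K_XL": 61,
}
loads = {
    "gpt-oss-120b": 60,
    "GLM-4.7-Flash Q8_K_XL": 33,
}

# The boundary, if there is one, lies between these two values.
largest_ok = max(loads.values())
smallest_bad = min(hangs.values())

print(f"largest model that loads:  ~{largest_ok} GB")
print(f"smallest model that hangs: ~{smallest_bad} GB")
```

With these figures the boundary falls between roughly 60 and 61 GB. The sizes are rounded, so the real cutoff may differ, and the one model near it that does load (gpt-oss-120b) already slows down a lot near the end.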

Additional info

I also tried some models with the CUDA backend and ZLUDA, which exhibited the same behaviour: smaller models load correctly, but loading larger models slows down until it seems to hang entirely.

I was curious and decided to dump the backtrace of llama-server once it hung while loading. Perhaps this is useful to anyone trying to figure this out:

#0  0x00007f760fe6f24e in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1
#1  0x00007f760fe6f3a0 in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1
#2  0x00007f760fe72241 in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1
#3  0x00007f7640a948cb in ?? () from /opt/rocm/lib/libamdhip64.so.7
#4  0x00007f7640a95fb8 in ?? () from /opt/rocm/lib/libamdhip64.so.7
#5  0x00007f7640ad51ed in ?? () from /opt/rocm/lib/libamdhip64.so.7
#6  0x00007f7640ad5db1 in ?? () from /opt/rocm/lib/libamdhip64.so.7
#7  0x00007f7640ad6242 in ?? () from /opt/rocm/lib/libamdhip64.so.7
#8  0x00007f7640a99208 in ?? () from /opt/rocm/lib/libamdhip64.so.7
#9  0x00007f7640a65cfd in ?? () from /opt/rocm/lib/libamdhip64.so.7
#10 0x00007f76407de48c in ?? () from /opt/rocm/lib/libamdhip64.so.7
#11 0x00007f76407def9f in ?? () from /opt/rocm/lib/libamdhip64.so.7
#12 0x00007f76408262b1 in ?? () from /opt/rocm/lib/libamdhip64.so.7
#13 0x00007f76563c9bb0 in ggml_backend_cuda_buffer_set_tensor (buffer=<optimized out>, tensor=0x563e01a8f550, data=0x7f61fb8aee80, offset=<optimized out>, size=648806400)
    at /home/wim/src/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:631
#14 0x00007f7656d1dcb1 in llama_model_loader::load_all_data (this=this@entry=0x7fff29f75c50, ctx=<optimized out>, bufs=..., lmlocks=<optimized out>, progress_callback=<optimized out>, 
    progress_callback_user_data=<optimized out>) at /home/wim/src/llama.cpp/src/llama-model-loader.cpp:1121
#15 0x00007f7656d73466 in llama_model::load_tensors (this=this@entry=0x563dfffd95f0, ml=...) at /home/wim/src/llama.cpp/src/llama-model.cpp:7382
#16 0x00007f7656c91ce4 in llama_model_load (fname="/home/wim/.cache/llama.cpp/bartowski_stepfun-ai_Step-3.5-Flash-GGUF_stepfun-ai_Step-3.5-Flash-Q3_K_S_stepfun-ai_Step-3.5-Flash-Q3_K_S-00001-of-00003.gguf", 
    splits=std::vector of length 3, capacity 4 = {...}, model=..., params=...) at /home/wim/src/llama.cpp/src/llama.cpp:871
#17 llama_model_load_from_file_impl (path_model="/home/wim/.cache/llama.cpp/bartowski_stepfun-ai_Step-3.5-Flash-GGUF_stepfun-ai_Step-3.5-Flash-Q3_K_S_stepfun-ai_Step-3.5-Flash-Q3_K_S-00001-of-00003.gguf", 
    splits=std::vector of length 3, capacity 4 = {...}, params=...) at /home/wim/src/llama.cpp/src/llama.cpp:1006
#18 0x00007f7656c9265b in llama_model_load_from_file (
    path_model=0x563dffdcae10 "/home/wim/.cache/llama.cpp/bartowski_stepfun-ai_Step-3.5-Flash-GGUF_stepfun-ai_Step-3.5-Flash-Q3_K_S_stepfun-ai_Step-3.5-Flash-Q3_K_S-00001-of-00003.gguf", params=...)
    at /usr/include/c++/15.2.1/bits/basic_string.tcc:248
#19 0x0000563dd9d99d28 in common_init_result::common_init_result (this=0x563dffdd0c20, params=...) at /usr/include/c++/15.2.1/bits/basic_string.h:238
#20 0x0000563dd9d9c116 in common_init_from_params (params=...) at /home/wim/src/llama.cpp/common/common.cpp:1215
#21 0x0000563dd9c8fdf7 in server_context_impl::load_model (this=0x563dffdf6970, params=...) at /home/wim/src/llama.cpp/tools/server/server-context.cpp:625
#22 0x0000563dd9c673a8 in server_context::load_model (this=this@entry=0x7fff29f78218, params=...) at /home/wim/src/llama.cpp/tools/server/server-context.cpp:2862
#23 0x0000563dd9be4556 in main (argc=<optimized out>, argv=0x7fff29f7e808) at /home/wim/src/llama.cpp/tools/server/server.cpp:248
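
For anyone who wants to capture a comparable trace from their own hung process, one possible way is to attach gdb in batch mode. The sketch below is an assumption about a convenient workflow, not necessarily how the trace above was obtained; the helper names `build_gdb_command` and `dump_backtrace` are hypothetical, and gdb must be installed.

```python
import subprocess


def build_gdb_command(pid: int) -> list[str]:
    """Return a gdb invocation that prints backtraces of all threads, then exits."""
    # -p attaches to a running process, -batch exits after the commands run,
    # -ex executes a single gdb command.
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def dump_backtrace(pid: int) -> str:
    """Attach to the given process, capture the backtraces, and detach."""
    result = subprocess.run(build_gdb_command(pid), capture_output=True, text=True)
    return result.stdout
```

Calling `dump_backtrace(pid)` with the PID of the stuck llama-server returns the trace as text. Attaching to a process may require elevated privileges depending on the system's ptrace settings.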

System information:

  • ROCm version 7.2.0
  • Linux version 6.18.9

First Bad Commit

This is the first time in months that I have used llama.cpp with ROCm on this platform, so I cannot identify a first bad commit.

Relevant log output

./bin/llama-server -m ~/.cache/llama.cpp/unsloth_GLM-4.6V-GGUF_UD-Q4_K_XL_GLM-4.6V-UD-Q4_K_XL-00001-of-00002.gguf -ctk q8_0 -ctv q8_0
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build: 7979 (820ebfa6f) with GNU 15.2.1 for Linux x86_64
system info: n_threads = 16, n_threads_batch = 16, total_threads = 32

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

Running without SSL
init: using 31 threads for HTTP server
start: binding port with default address family
main: loading model
srv    load_model: loading model '/home/wim/.cache/llama.cpp/unsloth_GLM-4.6V-GGUF_UD-Q4_K_XL_GLM-4.6V-UD-Q4_K_XL-00001-of-00002.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: projected to use 75037 MiB of device memory vs. 120451 MiB of free device memory
llama_params_fit_impl: will leave 45414 >= 1024 MiB of free device memory, no changes needed
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.25 seconds
llama_model_load_from_file_impl: using device ROCm0 (Radeon 8060S Graphics) (0000:c3:00.0) - 120454 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 62 key-value pairs and 780 tensors from /home/wim/.cache/llama.cpp/unsloth_GLM-4.6V-GGUF_UD-Q4_K_XL_GLM-4.6V-UD-Q4_K_XL-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = glm4moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                     general.sampling.top_k i32              = 2
llama_model_loader: - kv   3:                     general.sampling.top_p f32              = 0.600000
llama_model_loader: - kv   4:                      general.sampling.temp f32              = 0.800000
llama_model_loader: - kv   5:                               general.name str              = Glm-4.6V
llama_model_loader: - kv   6:                           general.finetune str              = 4.6V
llama_model_loader: - kv   7:                           general.basename str              = Glm-4.6V
llama_model_loader: - kv   8:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   9:                         general.size_label str              = 128x8.0B
llama_model_loader: - kv  10:                            general.license str              = mit
llama_model_loader: - kv  11:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv  12:                   general.base_model.count u32              = 1
llama_model_loader: - kv  13:                  general.base_model.0.name str              = GLM 4.6V
llama_model_loader: - kv  14:          general.base_model.0.organization str              = Zai Org
llama_model_loader: - kv  15:              general.base_model.0.repo_url str              = https://huggingface.co/zai-org/GLM-4.6V
llama_model_loader: - kv  16:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
llama_model_loader: - kv  17:                          general.languages arr[str,2]       = ["zh", "en"]
llama_model_loader: - kv  18:                        glm4moe.block_count u32              = 46
llama_model_loader: - kv  19:                     glm4moe.context_length u32              = 131072
llama_model_loader: - kv  20:                   glm4moe.embedding_length u32              = 4096
llama_model_loader: - kv  21:                glm4moe.feed_forward_length u32              = 10944
llama_model_loader: - kv  22:               glm4moe.attention.head_count u32              = 96
llama_model_loader: - kv  23:            glm4moe.attention.head_count_kv u32              = 8
llama_model_loader: - kv  24:            glm4moe.rope.dimension_sections arr[i32,4]       = [8, 12, 12, 0]
llama_model_loader: - kv  25:                     glm4moe.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  26:   glm4moe.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  27:                  glm4moe.expert_used_count u32              = 8
llama_model_loader: - kv  28:                 glm4moe.expert_group_count u32              = 1
llama_model_loader: - kv  29:            glm4moe.expert_group_used_count u32              = 1
llama_model_loader: - kv  30:               glm4moe.attention.key_length u32              = 128
llama_model_loader: - kv  31:             glm4moe.attention.value_length u32              = 128
llama_model_loader: - kv  32:               glm4moe.rope.dimension_count u32              = 64
llama_model_loader: - kv  33:                       glm4moe.expert_count u32              = 128
llama_model_loader: - kv  34:         glm4moe.expert_feed_forward_length u32              = 1408
llama_model_loader: - kv  35:                glm4moe.expert_shared_count u32              = 1
llama_model_loader: - kv  36:          glm4moe.leading_dense_block_count u32              = 1
llama_model_loader: - kv  37:                 glm4moe.expert_gating_func u32              = 2
llama_model_loader: - kv  38:               glm4moe.expert_weights_scale f32              = 1.000000
llama_model_loader: - kv  39:                glm4moe.expert_weights_norm bool             = true
llama_model_loader: - kv  40:               glm4moe.nextn_predict_layers u32              = 0
llama_model_loader: - kv  41:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  42:                         tokenizer.ggml.pre str              = glm4
llama_model_loader: - kv  43:                      tokenizer.ggml.tokens arr[str,151552]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  44:                  tokenizer.ggml.token_type arr[i32,151552]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  45:                      tokenizer.ggml.merges arr[str,318088]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  46:                tokenizer.ggml.eos_token_id u32              = 151329
llama_model_loader: - kv  47:            tokenizer.ggml.padding_token_id u32              = 151330
llama_model_loader: - kv  48:                tokenizer.ggml.bos_token_id u32              = 151331
llama_model_loader: - kv  49:                tokenizer.ggml.eot_token_id u32              = 151336
llama_model_loader: - kv  50:            tokenizer.ggml.unknown_token_id u32              = 151329
llama_model_loader: - kv  51:                tokenizer.ggml.eom_token_id u32              = 151338
llama_model_loader: - kv  52:                    tokenizer.chat_template str              = {# Unsloth template fixes #}\n[gMASK]<...
llama_model_loader: - kv  53:               general.quantization_version u32              = 2
llama_model_loader: - kv  54:                          general.file_type u32              = 15
llama_model_loader: - kv  55:                      quantize.imatrix.file str              = GLM-4.6V-GGUF/imatrix_unsloth.gguf
llama_model_loader: - kv  56:                   quantize.imatrix.dataset str              = unsloth_calibration_GLM-4.6V.txt
llama_model_loader: - kv  57:             quantize.imatrix.entries_count u32              = 502
llama_model_loader: - kv  58:              quantize.imatrix.chunks_count u32              = 90
llama_model_loader: - kv  59:                                   split.no u16              = 0
llama_model_loader: - kv  60:                        split.tensors.count i32              = 780
llama_model_loader: - kv  61:                                split.count u16              = 2
llama_model_loader: - type  f32:  321 tensors
llama_model_loader: - type q5_0:   33 tensors
llama_model_loader: - type q5_1:   13 tensors
llama_model_loader: - type q8_0:   45 tensors
llama_model_loader: - type q4_K:  334 tensors
llama_model_loader: - type q5_K:   23 tensors
llama_model_loader: - type q6_K:   11 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 60.95 GiB (4.90 BPW) 
load: 0 unused tokens
load: special_eot_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special_eom_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 151329 ('<|endoftext|>')
load:   - 151336 ('<|user|>')
load:   - 151338 ('<|observation|>')
load: special tokens cache size = 36
load: token to piece cache size = 0.9713 MB
print_info: arch                  = glm4moe
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 131072
print_info: n_embd                = 4096
print_info: n_embd_inp            = 4096
print_info: n_layer               = 46
print_info: n_head                = 96
print_info: n_head_kv             = 8
print_info: n_rot                 = 64
print_info: n_swa                 = 0
print_info: is_swa_any            = 0
print_info: n_embd_head_k         = 128
print_info: n_embd_head_v         = 128
print_info: n_gqa                 = 12
print_info: n_embd_k_gqa          = 1024
print_info: n_embd_v_gqa          = 1024
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-05
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 0.0e+00
print_info: n_ff                  = 10944
print_info: n_expert              = 128
print_info: n_expert_used         = 8
print_info: n_expert_groups       = 1
print_info: n_group_used          = 1
print_info: causal attn           = 1
print_info: pooling type          = 0
print_info: rope type             = 8
print_info: rope scaling          = linear
print_info: freq_base_train       = 500000.0
print_info: freq_scale_train      = 1
print_info: n_ctx_orig_yarn       = 131072
print_info: rope_yarn_log_mul     = 0.0000
print_info: rope_finetuned        = unknown
print_info: mrope sections        = [8, 12, 12, 0]
print_info: model type            = ?B
print_info: model params          = 106.85 B
print_info: general.name          = Glm-4.6V
print_info: vocab type            = BPE
print_info: n_vocab               = 151552
print_info: n_merges              = 318088
print_info: BOS token             = 151331 '[gMASK]'
print_info: EOS token             = 151329 '<|endoftext|>'
print_info: EOT token             = 151336 '<|user|>'
print_info: EOM token             = 151338 '<|observation|>'
print_info: UNK token             = 151329 '<|endoftext|>'
print_info: PAD token             = 151330 '[MASK]'
print_info: LF token              = 198 'Ċ'
print_info: FIM PRE token         = 151347 '<|code_prefix|>'
print_info: FIM SUF token         = 151349 '<|code_suffix|>'
print_info: FIM MID token         = 151348 '<|code_middle|>'
print_info: EOG token             = 151329 '<|endoftext|>'
print_info: EOG token             = 151336 '<|user|>'
print_info: EOG token             = 151338 '<|observation|>'
print_info: max token length      = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 45 repeating layers to GPU
load_tensors: offloaded 47/47 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   333.00 MiB
load_tensors:        ROCm0 model buffer size = 62077.27 MiB
.............................................................................................
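
The fit lines in the log above can be checked with a few lines of arithmetic. This is only a sketch using numbers copied from the `llama_params_fit_impl` and `load_tensors` lines; it reflects llama.cpp's own estimate, not actual allocations on the device.

```python
# Figures in MiB, copied from the log output above.
projected_mib = 75037           # "projected to use 75037 MiB of device memory"
free_mib = 120451               # "vs. 120451 MiB of free device memory"
rocm_model_buffer_mib = 62077.27  # "ROCm0 model buffer size = 62077.27 MiB"

# Headroom that the fit step expects to remain after loading.
headroom_mib = free_mib - projected_mib

print(f"headroom after projected use: {headroom_mib} MiB")
print(f"projected share of free memory: {projected_mib / free_mib:.1%}")
print(f"model buffer size: {rocm_model_buffer_mib / 1024:.2f} GiB")
```

The computed headroom matches the 45414 MiB reported in the log, so by this estimate the model fits with room to spare. That suggests the hang is probably not a plain out-of-memory condition, though this arithmetic alone cannot confirm that.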

Labels: RoCM, bug
