Name and Version
Docker Container, Unmodified, b9049.
Operating systems
Linux
GGML backends
Vulkan
Hardware
2x Intel Arc Pro B50, 1x AMD Radeon RX 6900 XT
Models
Qwen3.6-27B:
[qwen3-6-27b]
main-gpu = 3
parallel = 2
batch-size = 16384
ubatch-size = 512
hf = unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL
ctx-size = 262144
seed = 3407
temp = 0.7
top-p = 0.95
top-k = 20
min-p = 0.0
presence-penalty = 0.0
repeat-penalty = 1.0
flash-attn = true
split-mode = tensor
tensor-split = 1,1,1
jinja = true
chat-template-kwargs = {"enable_thinking":true}
reasoning-budget = 40000
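A quick sanity check on this preset, as a sketch only. It assumes that `main-gpu` is a zero-based device index and that, with a non-unified KV cache, the context is divided evenly between the `parallel` slots. The helper name `check_preset` is hypothetical and not part of llama.cpp. Whether `main-gpu` has any effect with `split-mode = tensor` is not established here.

```python
import configparser

# Subset of the preset above, copied verbatim.
PRESET = """
[qwen3-6-27b]
main-gpu = 3
parallel = 2
ctx-size = 262144
split-mode = tensor
tensor-split = 1,1,1
"""


def check_preset(text, section):
    """Derive a few values from the preset (hypothetical helper)."""
    cfg = configparser.ConfigParser(interpolation=None)
    cfg.read_string(text)
    p = cfg[section]
    # One tensor-split entry per device.
    n_devices = len(p["tensor-split"].split(","))
    main_gpu = p.getint("main-gpu")
    return {
        "n_devices": n_devices,
        # Assumption: device indices are zero-based (0 .. n_devices - 1).
        "main_gpu_in_range": 0 <= main_gpu < n_devices,
        # Assumption: context is split evenly across parallel slots.
        "ctx_per_seq": p.getint("ctx-size") // p.getint("parallel"),
    }


if __name__ == "__main__":
    for key, value in check_preset(PRESET, "qwen3-6-27b").items():
        print(f"{key}: {value}")
```

Under these assumptions, `main-gpu = 3` falls outside the range 0..2 for three devices, and the per-sequence context of 131072 matches the `n_ctx_seq` value reported in the log.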
Problem description & steps to reproduce
The server crashes while loading the model, during the warm-up run, with a failed assertion on the tensor split axis in ggml-backend-meta.cpp. The full log is below:
Full Log Details
load_backend: loaded Vulkan backend from /app/libggml-vulkan.so
load_backend: loaded CPU backend from /app/libggml-cpu-haswell.so
build_info: b9049-2496f9c14
system_info: n_threads = 24 (n_threads_batch = 24) / 24 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
Running without SSL
init: using 23 threads for HTTP server
srv load_models: Loaded 8 cached model presets
srv load_models: Loaded 0 local model presets from /models
srv load_models: Loaded 8 custom model presets from /app/models.ini
srv operator(): Available models (16) (*: custom preset)
srv operator(): DevQuasar/Qwen.Qwen3-VL-Embedding-2B-GGUF:Q8_0
srv operator(): * gemma-4-26b-a4b
srv operator(): * qwen3-5-2b
srv operator(): * qwen3-5-9b
srv operator(): * qwen3-5-9b-thinking
srv operator(): * qwen3-6-27b
srv operator(): * qwen3-6-35B-A3B
srv operator(): * qwen3-6-35B-A3B-thinking
srv operator(): * qwen3-vl-embedding-2b
srv operator(): unsloth/Qwen3.5-4B-GGUF:Q8_0
srv operator(): unsloth/Qwen3.5-9B-GGUF:Q4_K_M
srv operator(): unsloth/Qwen3.6-27B-GGUF:Q4_K_XL
srv operator(): unsloth/Qwen3.6-27B-GGUF:Q6_K
srv operator(): unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_M
srv operator(): unsloth/Qwen3.6-35B-A3B-GGUF:Q6_K
srv operator(): unsloth/gemma-4-26B-A4B-it-GGUF:Q4_K_XL
main: starting router server, no model will be loaded in this process
start: binding port with default address family
main: router server is listening on http://0.0.0.0:8080
main: NOTE: router mode is experimental
main: it is not recommended to use this mode in untrusted environments
srv ensure_model: model name=qwen3-6-27b is not loaded, loading...
srv load: spawning server instance with name=qwen3-6-27b on port 55079
srv load: spawning server instance with args:
srv load: /app/llama-server
srv load: --chat-template-kwargs
srv load: {"enable_thinking":true}
srv load: --host
srv load: 127.0.0.1
srv load: --jinja
srv load: --metrics
srv load: --min-p
srv load: 0.0
srv load: --no-mmap
srv load: --offline
srv load: --port
srv load: 55079
srv load: --presence-penalty
srv load: 0.0
srv load: --prio
srv load: 2
srv load: --reasoning-budget
srv load: 40000
srv load: --repeat-penalty
srv load: 1.0
srv load: --sleep-idle-seconds
srv load: 6000
srv load: --temperature
srv load: 0.7
srv load: --top-k
srv load: 20
srv load: --top-p
srv load: 0.95
srv load: --alias
srv load: qwen3-6-27b
srv load: --batch-size
srv load: 16384
srv load: --ctx-size
srv load: 262144
srv load: --flash-attn
srv load: true
srv load: --hf-repo
srv load: unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL
srv load: --main-gpu
srv load: 3
srv load: --n-gpu-layers
srv load: 99
srv load: --parallel
srv load: 2
srv load: --seed
srv load: 3407
srv load: --split-mode
srv load: tensor
srv load: --ubatch-size
srv load: 512
srv ensure_model: waiting until model name=qwen3-6-27b is fully loaded...
[55079] warn: LLAMA_ARG_HOST environment variable is set, but will be overwritten by command line argument --host
[55079] Setting 'enable_thinking' via --chat-template-kwargs is deprecated. Use --reasoning on / --reasoning off instead.
[55079] warning: llama.cpp was compiled without support for GPU offload. Setting the main GPU has no effect.
[55079] warning: no usable GPU found, --gpu-layers option will be ignored
[55079] warning: one possible reason is that llama.cpp was compiled without GPU support
[55079] warning: consult docs/build.md for compilation instructions
[55079] warning: llama.cpp was compiled without support for GPU offload. Setting the split mode has no effect.
[55079] migrate_old_cache_to_hf_cache: skipping migration in offline mode (will run when online)
[55079] common_download_file_single: required file is not available in cache (offline mode): /root/.cache/llama.cpp/unsloth_Qwen3.6-27B-GGUF_preset.ini
[55079] no remote preset found, skipping
[55079] load_backend: loaded Vulkan backend from /app/libggml-vulkan.so
[55079] load_backend: loaded CPU backend from /app/libggml-cpu-haswell.so
[55079] build_info: b9049-2496f9c14
[55079] system_info: n_threads = 24 (n_threads_batch = 24) / 24 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
[55079] Running without SSL
[55079] init: using 23 threads for HTTP server
[55079] start: binding port with default address family
[55079] main: loading model
[55079] srv load_model: loading model '/root/.cache/huggingface/hub/models--unsloth--Qwen3.6-27B-GGUF/snapshots/82d411acf4a06cfb8d9b073a5211bf410bfc29bf/Qwen3.6-27B-UD-Q4_K_XL.gguf'
[55079] common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
[55079] common_fit_params: failed to fit params to free device memory: llama_params_fit is not implemented for SPLIT_MODE_TENSOR, abort
[55079] common_fit_params: fitting params to free memory took 0.00 seconds
[55079] llama_model_loader: loaded meta data with 51 key-value pairs and 851 tensors from /root/.cache/huggingface/hub/models--unsloth--Qwen3.6-27B-GGUF/snapshots/82d411acf4a06cfb8d9b073a5211bf410bfc29bf/Qwen3.6-27B-UD-Q4_K_XL.gguf (version GGUF V3 (latest))
[55079] llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
[55079] llama_model_loader: - kv 0: general.architecture str = qwen35
[55079] llama_model_loader: - kv 1: general.type str = model
[55079] llama_model_loader: - kv 2: general.sampling.top_k i32 = 20
[55079] llama_model_loader: - kv 3: general.sampling.top_p f32 = 0.950000
[55079] llama_model_loader: - kv 4: general.sampling.temp f32 = 1.000000
[55079] llama_model_loader: - kv 5: general.name str = Qwen3.6-27B
[55079] llama_model_loader: - kv 6: general.basename str = Qwen3.6-27B
[55079] llama_model_loader: - kv 7: general.quantized_by str = Unsloth
[55079] llama_model_loader: - kv 8: general.size_label str = 27B
[55079] llama_model_loader: - kv 9: general.license str = apache-2.0
[55079] llama_model_loader: - kv 10: general.license.link str = https://huggingface.co/Qwen/Qwen3.6-2...
[55079] llama_model_loader: - kv 11: general.repo_url str = https://huggingface.co/unsloth
[55079] llama_model_loader: - kv 12: general.base_model.count u32 = 1
[55079] llama_model_loader: - kv 13: general.base_model.0.name str = Qwen3.6 27B
[55079] llama_model_loader: - kv 14: general.base_model.0.organization str = Qwen
[55079] llama_model_loader: - kv 15: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3.6-27B
[55079] llama_model_loader: - kv 16: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
[55079] llama_model_loader: - kv 17: qwen35.block_count u32 = 64
[55079] llama_model_loader: - kv 18: qwen35.context_length u32 = 262144
[55079] llama_model_loader: - kv 19: qwen35.embedding_length u32 = 5120
[55079] llama_model_loader: - kv 20: qwen35.feed_forward_length u32 = 17408
[55079] llama_model_loader: - kv 21: qwen35.attention.head_count u32 = 24
[55079] llama_model_loader: - kv 22: qwen35.attention.head_count_kv u32 = 4
[55079] llama_model_loader: - kv 23: qwen35.rope.dimension_sections arr[i32,4] = [11, 11, 10, 0]
[55079] llama_model_loader: - kv 24: qwen35.rope.freq_base f32 = 10000000.000000
[55079] llama_model_loader: - kv 25: qwen35.attention.layer_norm_rms_epsilon f32 = 0.000001
[55079] llama_model_loader: - kv 26: qwen35.attention.key_length u32 = 256
[55079] llama_model_loader: - kv 27: qwen35.attention.value_length u32 = 256
[55079] llama_model_loader: - kv 28: qwen35.ssm.conv_kernel u32 = 4
[55079] llama_model_loader: - kv 29: qwen35.ssm.state_size u32 = 128
[55079] llama_model_loader: - kv 30: qwen35.ssm.group_count u32 = 16
[55079] llama_model_loader: - kv 31: qwen35.ssm.time_step_rank u32 = 48
[55079] llama_model_loader: - kv 32: qwen35.ssm.inner_size u32 = 6144
[55079] llama_model_loader: - kv 33: qwen35.full_attention_interval u32 = 4
[55079] llama_model_loader: - kv 34: qwen35.rope.dimension_count u32 = 64
[55079] llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
[55079] llama_model_loader: - kv 36: tokenizer.ggml.pre str = qwen35
[55079] llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,248320] = ["!", """, "#", "$", "%", "&", "'", ...
[55079] llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,248320] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[55079] llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,247587] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
[55079] llama_model_loader: - kv 40: tokenizer.ggml.eos_token_id u32 = 248046
[55079] llama_model_loader: - kv 41: tokenizer.ggml.padding_token_id u32 = 248055
[55079] llama_model_loader: - kv 42: tokenizer.ggml.bos_token_id u32 = 248044
[55079] llama_model_loader: - kv 43: tokenizer.ggml.add_bos_token bool = false
[55079] llama_model_loader: - kv 44: tokenizer.chat_template str = {%- set image_count = namespace(value...
[55079] llama_model_loader: - kv 45: general.quantization_version u32 = 2
[55079] llama_model_loader: - kv 46: general.file_type u32 = 15
[55079] llama_model_loader: - kv 47: quantize.imatrix.file str = Qwen3.6-27B-GGUF/imatrix_unsloth.gguf
[55079] llama_model_loader: - kv 48: quantize.imatrix.dataset str = unsloth_calibration_Qwen3.6-27B.txt
[55079] llama_model_loader: - kv 49: quantize.imatrix.entries_count u32 = 496
[55079] llama_model_loader: - kv 50: quantize.imatrix.chunks_count u32 = 76
[55079] llama_model_loader: - type f32: 449 tensors
[55079] llama_model_loader: - type q8_0: 48 tensors
[55079] llama_model_loader: - type q4_K: 207 tensors
[55079] llama_model_loader: - type q5_K: 70 tensors
[55079] llama_model_loader: - type q6_K: 65 tensors
[55079] llama_model_loader: - type iq4_xs: 12 tensors
[55079] print_info: file format = GGUF V3 (latest)
[55079] print_info: file type = Q4_K - Medium
[55079] print_info: file size = 16.39 GiB (5.24 BPW)
[55079] llama_prepare_model_devices: skipping CPU (AMD Ryzen 9 5900X 12-Core Processor) for tensor parallelism
[55079] llama_prepare_model_devices: creating a Meta device for tensor parallelism from 3 devices:
[55079] llama_prepare_model_devices: - device 0: Vulkan0 (AMD Radeon RX 6900 XT (RADV NAVI21))
[55079] llama_prepare_model_devices: - device 1: Vulkan1 (Intel(R) Arc(tm) Pro B50 Graphics (BMG G21))
[55079] llama_prepare_model_devices: - device 2: Vulkan2 (Intel(R) Arc(tm) Pro B50 Graphics (BMG G21))
[55079] llama_prepare_model_devices: using device Meta() (Meta()) (unknown id) - 44659 MiB free
[55079] load: 0 unused tokens
[55079] load: printing all EOG tokens:
[55079] load: - 248044 ('<|endoftext|>')
[55079] load: - 248046 ('<|im_end|>')
[55079] load: - 248063 ('<|fim_pad|>')
[55079] load: - 248064 ('<|repo_name|>')
[55079] load: - 248065 ('<|file_sep|>')
[55079] load: special tokens cache size = 33
[55079] load: token to piece cache size = 1.7581 MB
[55079] print_info: arch = qwen35
[55079] print_info: vocab_only = 0
[55079] print_info: no_alloc = 0
[55079] print_info: n_ctx_train = 262144
[55079] print_info: n_embd = 5120
[55079] print_info: n_embd_inp = 5120
[55079] print_info: n_layer = 64
[55079] print_info: n_head = 24
[55079] print_info: n_head_kv = 4
[55079] print_info: n_rot = 64
[55079] print_info: n_swa = 0
[55079] print_info: is_swa_any = 0
[55079] print_info: n_embd_head_k = 256
[55079] print_info: n_embd_head_v = 256
[55079] print_info: n_gqa = 6
[55079] print_info: n_embd_k_gqa = 1024
[55079] print_info: n_embd_v_gqa = 1024
[55079] print_info: f_norm_eps = 0.0e+00
[55079] print_info: f_norm_rms_eps = 1.0e-06
[55079] print_info: f_clamp_kqv = 0.0e+00
[55079] print_info: f_max_alibi_bias = 0.0e+00
[55079] print_info: f_logit_scale = 0.0e+00
[55079] print_info: f_attn_scale = 0.0e+00
[55079] print_info: n_ff = 17408
[55079] print_info: n_expert = 0
[55079] print_info: n_expert_used = 0
[55079] print_info: n_expert_groups = 0
[55079] print_info: n_group_used = 0
[55079] print_info: causal attn = 1
[55079] print_info: pooling type = -1
[55079] print_info: rope type = 40
[55079] print_info: rope scaling = linear
[55079] print_info: freq_base_train = 10000000.0
[55079] print_info: freq_scale_train = 1
[55079] print_info: n_ctx_orig_yarn = 262144
[55079] print_info: rope_yarn_log_mul = 0.0000
[55079] print_info: rope_finetuned = unknown
[55079] print_info: mrope sections = [11, 11, 10, 0]
[55079] print_info: ssm_d_conv = 4
[55079] print_info: ssm_d_inner = 6144
[55079] print_info: ssm_d_state = 128
[55079] print_info: ssm_dt_rank = 48
[55079] print_info: ssm_n_group = 16
[55079] print_info: ssm_dt_b_c_rms = 0
[55079] print_info: model type = 27B
[55079] print_info: model params = 26.90 B
[55079] print_info: general.name = Qwen3.6-27B
[55079] print_info: vocab type = BPE
[55079] print_info: n_vocab = 248320
[55079] print_info: n_merges = 247587
[55079] print_info: BOS token = 248044 '<|endoftext|>'
[55079] print_info: EOS token = 248046 '<|im_end|>'
[55079] print_info: EOT token = 248046 '<|im_end|>'
[55079] print_info: PAD token = 248055 '<|vision_pad|>'
[55079] print_info: LF token = 198 'Ċ'
[55079] print_info: FIM PRE token = 248060 '<|fim_prefix|>'
[55079] print_info: FIM SUF token = 248062 '<|fim_suffix|>'
[55079] print_info: FIM MID token = 248061 '<|fim_middle|>'
[55079] print_info: FIM PAD token = 248063 '<|fim_pad|>'
[55079] print_info: FIM REP token = 248064 '<|repo_name|>'
[55079] print_info: FIM SEP token = 248065 '<|file_sep|>'
[55079] print_info: EOG token = 248044 '<|endoftext|>'
[55079] print_info: EOG token = 248046 '<|im_end|>'
[55079] print_info: EOG token = 248063 '<|fim_pad|>'
[55079] print_info: EOG token = 248064 '<|repo_name|>'
[55079] print_info: EOG token = 248065 '<|file_sep|>'
[55079] print_info: max token length = 256
[55079] load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
[55079] load_tensors: offloading output layer to GPU
[55079] load_tensors: offloading 63 repeating layers to GPU
[55079] load_tensors: offloaded 65/65 layers to GPU
[55079] load_tensors: Meta() model buffer size = 5389.43 MiB
[55079] load_tensors: Vulkan_Host model buffer size = 682.03 MiB
[55079] ............................................................................................
[55079] common_init_result: added <|endoftext|> logit bias = -inf
[55079] common_init_result: added <|im_end|> logit bias = -inf
[55079] common_init_result: added <|fim_pad|> logit bias = -inf
[55079] common_init_result: added <|repo_name|> logit bias = -inf
[55079] common_init_result: added <|file_sep|> logit bias = -inf
[55079] llama_context: constructing llama_context
[55079] llama_context: n_seq_max = 2
[55079] llama_context: n_ctx = 262144
[55079] llama_context: n_ctx_seq = 131072
[55079] llama_context: n_batch = 16384
[55079] llama_context: n_ubatch = 512
[55079] llama_context: causal_attn = 1
[55079] llama_context: flash_attn = enabled
[55079] llama_context: kv_unified = false
[55079] llama_context: freq_base = 10000000.0
[55079] llama_context: freq_scale = 1
[55079] llama_context: n_ctx_seq (131072) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
[55079] llama_context: Vulkan_Host output buffer size = 1.89 MiB
[55079] llama_kv_cache: Meta() KV buffer size = 5632.00 MiB
[55079] llama_kv_cache: size = 16384.00 MiB (131072 cells, 16 layers, 2/2 seqs), K (f16): 8192.00 MiB, V (f16): 8192.00 MiB
[55079] llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 256
[55079] llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 256
[55079] llama_memory_recurrent: Meta() RS buffer size = 99.75 MiB
[55079] llama_memory_recurrent: size = 299.25 MiB ( 2 cells, 64 layers, 2 seqs), R (f32): 11.25 MiB, S (f32): 288.00 MiB
[55079] sched_reserve: reserving ...
[55079] sched_reserve: resolving fused Gated Delta Net support:
[55079] sched_reserve: fused Gated Delta Net (autoregressive) enabled
[55079] sched_reserve: fused Gated Delta Net (chunked) enabled
[55079] sched_reserve: Meta() compute buffer size = 495.00 MiB
[55079] sched_reserve: Vulkan_Host compute buffer size = 276.03 MiB
[55079] sched_reserve: graph nodes = 3689
[55079] sched_reserve: graph splits = 2
[55079] sched_reserve: reserve took 20.77 ms, sched copies = 1
[55079] common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[55079] /app/ggml/src/ggml-backend-meta.cpp:1013: GGML_ASSERT(split_state.ne[j] * tensor->src[i]->ne[src_ss[i].axis] == sum * tensor->ne[split_state.axis]) failed
[55079] libggml-base.so.0(+0x1a7b6) [0x7f7cedf847b6]
[55079] libggml-base.so.0(ggml_print_backtrace+0x20d) [0x7f7cedf84c3d]
[55079] libggml-base.so.0(ggml_abort+0x166) [0x7f7cedf84e26]
[55079] libggml-base.so.0(+0x3fbe4) [0x7f7cedfa9be4]
[55079] libggml-base.so.0(+0x3ff32) [0x7f7cedfa9f32]
[55079] libggml-base.so.0(+0x430bc) [0x7f7cedfad0bc]
[55079] libggml-base.so.0(ggml_gallocr_alloc_graph+0x4a5) [0x7f7cedf9b1d5]
[55079] libggml-base.so.0(ggml_backend_sched_alloc_graph+0x111) [0x7f7cedfa1751]
[55079] libllama.so.0(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0xe7) [0x7f7cee103e67]
[55079] libllama.so.0(_ZN13llama_context6decodeERK11llama_batch+0x36f) [0x7f7cee10aa2f]
[55079] libllama.so.0(llama_decode+0x12) [0x7f7cee10c5a2]
[55079] libllama-common.so.0(_Z23common_init_from_paramsR13common_params+0x33d) [0x7f7cee649bcd]
[55079] /app/llama-server(+0x11c92e) [0x55f959d3792e]
[55079] /app/llama-server(+0x6d6d1) [0x55f959c886d1]
[55079] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x2a601) [0x7f7ced97a601]
[55079] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x88) [0x7f7ced97a718]
[55079] /app/llama-server(+0x6e265) [0x55f959c89265]
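The failed assertion compares split sizes along an axis, so one thing worth checking is which model dimensions cannot be divided evenly across the three devices. The sketch below is only a hypothesis aid: the dimensions are copied from the `print_info` lines in the log, and the idea that the Meta backend needs these particular dimensions to split evenly is an assumption, not something the log confirms. The function name `uneven_dims` is made up for illustration.

```python
# Dimensions copied from the print_info lines in the log above.
DIMS = {
    "n_head": 24,
    "n_head_kv": 4,
    "ssm_n_group": 16,
    "ssm_dt_rank": 48,
    "n_embd": 5120,
    "n_ff": 17408,
    "ssm_d_inner": 6144,
}

N_DEVICES = 3  # Vulkan0, Vulkan1, Vulkan2 in the log


def uneven_dims(dims, n_devices):
    """Return names of dimensions that n_devices does not divide evenly."""
    return [name for name, value in dims.items() if value % n_devices != 0]


if __name__ == "__main__":
    for name in uneven_dims(DIMS, N_DEVICES):
        print(f"{name} = {DIMS[name]} is not divisible by {N_DEVICES}")
```

With three devices, `n_head_kv`, `ssm_n_group`, `n_embd` and `n_ff` do not divide evenly, while with two devices every listed dimension does. If uneven splits are involved, repeating the load with only two devices (for example the two B50 cards) could help narrow this down.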
First Bad Commit
Not sure
Relevant log output
Logs
[55079] llama_context: freq_base = 10000000.0
[55079] llama_context: freq_scale = 1
[55079] llama_context: n_ctx_seq (131072) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
[55079] llama_context: Vulkan_Host output buffer size = 1.89 MiB
[55079] llama_kv_cache: Meta() KV buffer size = 5632.00 MiB
[55079] llama_kv_cache: size = 16384.00 MiB (131072 cells, 16 layers, 2/2 seqs), K (f16): 8192.00 MiB, V (f16): 8192.00 MiB
[55079] llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 256
[55079] llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 256
[55079] llama_memory_recurrent: Meta() RS buffer size = 99.75 MiB
[55079] llama_memory_recurrent: size = 299.25 MiB ( 2 cells, 64 layers, 2 seqs), R (f32): 11.25 MiB, S (f32): 288.00 MiB
[55079] sched_reserve: reserving ...
[55079] sched_reserve: resolving fused Gated Delta Net support:
[55079] sched_reserve: fused Gated Delta Net (autoregressive) enabled
[55079] sched_reserve: fused Gated Delta Net (chunked) enabled
[55079] sched_reserve: Meta() compute buffer size = 495.00 MiB
[55079] sched_reserve: Vulkan_Host compute buffer size = 276.03 MiB
[55079] sched_reserve: graph nodes = 3689
[55079] sched_reserve: graph splits = 2
[55079] sched_reserve: reserve took 20.77 ms, sched copies = 1
[55079] common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[55079] /app/ggml/src/ggml-backend-meta.cpp:1013: GGML_ASSERT(split_state.ne[j] * tensor->src[i]->ne[src_ss[i].axis] == sum * tensor->ne[split_state.axis]) failed
[55079] libggml-base.so.0(+0x1a7b6) [0x7f7cedf847b6]
[55079] libggml-base.so.0(ggml_print_backtrace+0x20d) [0x7f7cedf84c3d]
[55079] libggml-base.so.0(ggml_abort+0x166) [0x7f7cedf84e26]
[55079] libggml-base.so.0(+0x3fbe4) [0x7f7cedfa9be4]
[55079] libggml-base.so.0(+0x3ff32) [0x7f7cedfa9f32]
[55079] libggml-base.so.0(+0x430bc) [0x7f7cedfad0bc]
[55079] libggml-base.so.0(ggml_gallocr_alloc_graph+0x4a5) [0x7f7cedf9b1d5]
[55079] libggml-base.so.0(ggml_backend_sched_alloc_graph+0x111) [0x7f7cedfa1751]
[55079] libllama.so.0(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0xe7) [0x7f7cee103e67]
[55079] libllama.so.0(_ZN13llama_context6decodeERK11llama_batch+0x36f) [0x7f7cee10aa2f]
[55079] libllama.so.0(llama_decode+0x12) [0x7f7cee10c5a2]
[55079] libllama-common.so.0(_Z23common_init_from_paramsR13common_params+0x33d) [0x7f7cee649bcd]
[55079] /app/llama-server(+0x11c92e) [0x55f959d3792e]
[55079] /app/llama-server(+0x6d6d1) [0x55f959c886d1]
[55079] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x2a601) [0x7f7ced97a601]
[55079] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x88) [0x7f7ced97a718]
[55079] /app/llama-server(+0x6e265) [0x55f959c89265]
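A quick sanity check of the numbers in the log above, as a rough sketch rather than a diagnosis. It recomputes the K cache size from the logged dimensions and lists which logged dimensions divide evenly across the 3 devices implied by `tensor-split = 1,1,1`. That uneven divisibility (for example `n_head_kv = 4` across 3 GPUs) is what trips the assert in `ggml-backend-meta.cpp` is only my assumption; I have not verified it against the source. The helper names (`kv_size_mib`, `divisible`) are made up for this illustration.

```python
# Values copied from the print_info / llama_context / llama_kv_cache lines above.
dims = {
    "n_embd": 5120,
    "n_head": 24,
    "n_head_kv": 4,
    "n_embd_k_gqa": 1024,
    "n_ff": 17408,
    "ssm_d_inner": 6144,
    "ssm_n_group": 16,
}
n_ctx_seq = 131072    # cells per sequence
n_seqs = 2            # parallel = 2
n_kv_layers = 16      # "16 layers" in the llama_kv_cache line
bytes_f16 = 2         # K and V are logged as f16
n_devices = 3         # assumption: one shard per entry in tensor-split = 1,1,1


def kv_size_mib(cells, n_embd_gqa, elem_bytes, layers, seqs):
    """Size in MiB of one of the K or V caches (hypothetical helper)."""
    return cells * n_embd_gqa * elem_bytes * layers * seqs // 2**20


k_mib = kv_size_mib(n_ctx_seq, dims["n_embd_k_gqa"], bytes_f16, n_kv_layers, n_seqs)
print(f"K cache: {k_mib} MiB")

# Which logged dimensions split evenly across the devices?
divisible = {name: value % n_devices == 0 for name, value in dims.items()}
for name, ok in divisible.items():
    print(f"{name} = {dims[name]}: divisible by {n_devices}: {ok}")
```

The recomputed K cache size matches the logged `K (f16): 8192.00 MiB`. Of the dimensions listed, only `n_head` and `ssm_d_inner` are multiples of 3, which may or may not be relevant to the failing check.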
First Bad Commit
Not sure
Relevant log output
Logs