PS D:\LLAMA> .\llama.cpp.sycl\llama-server.exe -m .\llama.cpp.sycl\Models\HY-MT1.5-1.8B-Q8_0.gguf -c 4096
load_backend: loaded RPC backend from D:\LLAMA\llama.cpp.sycl\ggml-rpc.dll
load_backend: loaded SYCL backend from D:\LLAMA\llama.cpp.sycl\ggml-sycl.dll
load_backend: loaded CPU backend from D:\LLAMA\llama.cpp.sycl\ggml-cpu-haswell.dll
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build_info: b8816-3f7c29d31
system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
Running without SSL
init: using 11 threads for HTTP server
start: binding port with default address family
main: loading model
srv load_model: loading model '.\llama.cpp.sycl\Models\HY-MT1.5-1.8B-Q8_0.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: projected to use 2311 MiB of device memory vs. 8286 MiB of free device memory
llama_params_fit_impl: will leave 5974 >= 1024 MiB of free device memory, no changes needed
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.80 seconds
llama_model_load_from_file_impl: using device SYCL0 (Intel(R) Arc(TM) B570 Graphics) (unknown id) - 8286 MiB free
llama_model_loader: loaded meta data with 34 key-value pairs and 354 tensors from .\llama.cpp.sycl\Models\HY-MT1.5-1.8B-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = hunyuan-dense
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.sampling.top_k i32 = 20
llama_model_loader: - kv 3: general.sampling.top_p f32 = 0.800000
llama_model_loader: - kv 4: general.sampling.temp f32 = 0.700000
llama_model_loader: - kv 5: general.name str = HY MT 1.8B 1222
llama_model_loader: - kv 6: general.version str = 1222
llama_model_loader: - kv 7: general.basename str = HY-MT
llama_model_loader: - kv 8: general.size_label str = 1.8B
llama_model_loader: - kv 9: hunyuan-dense.block_count u32 = 32
llama_model_loader: - kv 10: hunyuan-dense.context_length u32 = 262144
llama_model_loader: - kv 11: hunyuan-dense.embedding_length u32 = 2048
llama_model_loader: - kv 12: hunyuan-dense.feed_forward_length u32 = 6144
llama_model_loader: - kv 13: hunyuan-dense.attention.head_count u32 = 16
llama_model_loader: - kv 14: hunyuan-dense.attention.head_count_kv u32 = 4
llama_model_loader: - kv 15: hunyuan-dense.rope.freq_base f32 = 11158840.000000
llama_model_loader: - kv 16: hunyuan-dense.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 17: hunyuan-dense.attention.key_length u32 = 128
llama_model_loader: - kv 18: hunyuan-dense.attention.value_length u32 = 128
llama_model_loader: - kv 19: hunyuan-dense.rope.scaling.type str = none
llama_model_loader: - kv 20: hunyuan-dense.rope.scaling.factor f32 = 1.000000
llama_model_loader: - kv 21: hunyuan-dense.rope.scaling.original_context_length u32 = 262144
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = hunyuan-dense
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,120818] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,120818] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,119758] = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 120000
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 120020
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 120002
llama_model_loader: - kv 30: tokenizer.ggml.seperator_token_id u32 = 120007
llama_model_loader: - kv 31: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv 32: general.quantization_version u32 = 2
llama_model_loader: - kv 33: general.file_type u32 = 7
llama_model_loader: - type f32: 129 tensors
llama_model_loader: - type q8_0: 225 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 1.77 GiB (8.50 BPW)
load: 0 unused tokens
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 120020 ('<|hy_place▁holder▁no▁2|>')
load: special tokens cache size = 818
load: token to piece cache size = 0.8089 MB
print_info: arch = hunyuan-dense
print_info: vocab_only = 0
print_info: no_alloc = 0
print_info: n_ctx_train = 262144
print_info: n_embd = 2048
print_info: n_embd_inp = 2048
print_info: n_layer = 32
print_info: n_head = 16
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 6144
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 2
print_info: rope scaling = none
print_info: freq_base_train = 11158840.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 262144
print_info: rope_yarn_log_mul = 0.0000
print_info: rope_finetuned = unknown
print_info: model type = 1.8B
print_info: model params = 1.79 B
print_info: general.name = HY MT 1.8B 1222
print_info: vocab type = BPE
print_info: n_vocab = 120818
print_info: n_merges = 119758
print_info: BOS token = 120000 '<|hy_begin▁of▁sentence|>'
print_info: EOS token = 120020 '<|hy_place▁holder▁no▁2|>'
print_info: SEP token = 120007 '<|hy_Assistant|>'
print_info: PAD token = 120002 '<|hy_▁pad▁|>'
print_info: LF token = 185 'Ċ'
print_info: EOG token = 120020 '<|hy_place▁holder▁no▁2|>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 31 repeating layers to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors: CPU_Mapped model buffer size = 250.72 MiB
load_tensors: SYCL0 model buffer size = 1815.26 MiB
..............................................................................
common_init_result: added <|hy_place▁holder▁no▁2|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max = 4
llama_context: n_ctx = 4096
llama_context: n_ctx_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = true
llama_context: freq_base = 11158840.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (4096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: SYCL_Host output buffer size = 1.84 MiB
llama_kv_cache: SYCL0 KV buffer size = 256.00 MiB
llama_kv_cache: size = 256.00 MiB ( 4096 cells, 32 layers, 4/1 seqs), K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 128
llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 128
sched_reserve: reserving ...
sched_reserve: Flash Attention was auto, set to enabled
sched_reserve: resolving fused Gated Delta Net support:
sched_reserve: fused Gated Delta Net (autoregressive) enabled
sched_reserve: fused Gated Delta Net (chunked) enabled
sched_reserve: SYCL0 compute buffer size = 239.97 MiB
sched_reserve: SYCL_Host compute buffer size = 16.01 MiB
sched_reserve: graph nodes = 1127
sched_reserve: graph splits = 2
sched_reserve: reserve took 10.41 ms, sched copies = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
D:\a\llama.cpp\llama.cpp\ggml\src\ggml-sycl\mmvq.cpp:687: GGML_ASSERT(block_num_y % num_subgroups == 0) failed
PS D:\LLAMA>
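The figures in the log above suggest the crash is not a memory shortfall. As a quick sanity check, the SYCL0 buffers reported during loading can be summed and compared with the projected and free device memory, and the KV cache size can be recomputed from the context parameters. This is only arithmetic on values copied from the log; the variable names are illustrative and not part of llama.cpp.

```python
# Values copied from the log above (MiB); names are illustrative only.
sycl0_buffers_mib = {
    "model": 1815.26,    # load_tensors: SYCL0 model buffer size
    "kv_cache": 256.00,  # llama_kv_cache: SYCL0 KV buffer size
    "compute": 239.97,   # sched_reserve: SYCL0 compute buffer size
}
projected_mib = 2311     # llama_params_fit_impl: projected device memory
free_mib = 8286          # llama_params_fit_impl: free device memory

total_mib = sum(sycl0_buffers_mib.values())
headroom_mib = free_mib - total_mib

# Recompute the K cache: cells * layers * n_embd_k_gqa * 2 bytes (f16).
n_ctx, n_layer, n_embd_k_gqa, f16_bytes = 4096, 32, 512, 2
k_cache_mib = n_ctx * n_layer * n_embd_k_gqa * f16_bytes / (1024 * 1024)

print(f"sum of SYCL0 buffers:     {total_mib:.2f} MiB")
print(f"projected by fit step:    {projected_mib} MiB")
print(f"headroom vs. free memory: {headroom_mib:.2f} MiB")
print(f"K cache (V is the same):  {k_cache_mib:.2f} MiB")
```

The sum matches the projected figure to within rounding and leaves several GiB free, so the assertion that follows the warmup message is unlikely to be caused by running out of device memory.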
Name and Version
PS D:\LLAMA> .\llama.cpp.sycl\llama-cli.exe --version
load_backend: loaded RPC backend from D:\LLAMA\llama.cpp.sycl\ggml-rpc.dll
load_backend: loaded SYCL backend from D:\LLAMA\llama.cpp.sycl\ggml-sycl.dll
load_backend: loaded CPU backend from D:\LLAMA\llama.cpp.sycl\ggml-cpu-haswell.dll
version: 8816 (3f7c29d)
built with Clang 19.1.5 for Windows x86_64
Operating systems
Windows
GGML backends
SYCL
Hardware
Ryzen 5 5500 + Intel Arc B570
Models
HY-MT1.5-1.8B-Q8_0.gguf, but the crash also happens with other models from the same family
Problem description & steps to reproduce
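llama-server crashes while loading this family of models: the warmup run aborts with GGML_ASSERT(block_num_y % num_subgroups == 0) at ggml-sycl\mmvq.cpp:687. I also tried models that are not from Tencent, with the same result. To reproduce, run the llama-server command shown at the top of the log with the SYCL backend. The model works on Vulkan, but current Intel drivers crash when trying to run inference with the Vulkan backend (IGCIT/Intel-GPU-Community-Issue-Tracker-IGCIT#1330 (comment)), so I'm stuck with SYCL.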
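The failed assertion requires `block_num_y` to be a multiple of `num_subgroups`. The log does not print either value, so the following is a sketch under two assumptions: that `block_num_y` equals the row count of the weight matrix being multiplied, and that `num_subgroups` is 16. Neither is confirmed by this report, and both should be checked against `mmvq.cpp` at the reported build. Under those assumptions, the script checks the tensor dimensions visible in the log and flags any that would trip the assertion.

```python
# ASSUMPTION: num_subgroups = 16 and block_num_y == number of weight rows.
# Both are guesses for illustration; verify against mmvq.cpp at b8816.
ASSUMED_NUM_SUBGROUPS = 16

# Dimensions copied from the print_info lines in the log.
dims = {
    "n_embd": 2048,
    "n_ff": 6144,
    "n_embd_k_gqa": 512,
    "n_vocab": 120818,
}


def violates_assert(rows: int, num_subgroups: int = ASSUMED_NUM_SUBGROUPS) -> bool:
    """Return True if rows % num_subgroups != 0, i.e. the assertion would fail."""
    return rows % num_subgroups != 0


for name, rows in dims.items():
    status = "would fail" if violates_assert(rows) else "ok"
    remainder = rows % ASSUMED_NUM_SUBGROUPS
    print(f"{name:>14} = {rows:>6}  remainder {remainder:>2}  {status}")
```

With these assumed values, only `n_vocab` (120818, remainder 2) is not a multiple of 16, which would point at the output projection. That is a hypothesis to confirm, not a diagnosis.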
First Bad Commit
No response
Relevant log output
Logs