Bug: Issue when using iGPU (SYCL backend) #18296

@LIJUCHACKO

Description

Build details
Version b7501, compiled with the SYCL backend.

Hardware: Intel Arc iGPU (Intel Core Ultra 7 155H CPU).

OS is Ubuntu 24.04.3 LTS

The model fails to load onto the GPU when a context length is passed as an argument (--ctx-size 5000 in the command below). Without the context length parameter, the GGUF model loads and runs, but the default context length is then very large (about 40,000 tokens).

Here is the command line output:

../llama-server --model /media/DATA2/LLM_models/Qwen3-8B/Qwen3-8B-Q6_K.gguf -ngl 99 --no-mmap --reasoning-format none --ctx-size 5000 --jinja
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build: 0 (unknown) with IntelLLVM 2025.3.1 for Linux x86_64
system info: n_threads = 6, n_threads_batch = 6, total_threads = 22

system_info: n_threads = 6 (n_threads_batch = 6) / 22 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

init: using 21 threads for HTTP server
start: binding port with default address family
main: loading model
srv load_model: loading model '/media/DATA2/LLM_models/Qwen3-8B/Qwen3-8B-Q6_K.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llama_params_fit_impl: projected to use 7013 MiB of device memory vs. 59341 MiB of free device memory
llama_params_fit_impl: will leave 52327 >= 1024 MiB of free device memory, no changes needed
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.15 seconds
llama_model_load_from_file_impl: using device SYCL0 (Intel(R) Arc(TM) Graphics) (unknown id) - 59341 MiB free
llama_model_loader: loaded meta data with 28 key-value pairs and 399 tensors from /media/DATA2/LLM_models/Qwen3-8B/Qwen3-8B-Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3 8B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Qwen3
llama_model_loader: - kv 5: general.size_label str = 8B
llama_model_loader: - kv 6: qwen3.block_count u32 = 36
llama_model_loader: - kv 7: qwen3.context_length u32 = 40960
llama_model_loader: - kv 8: qwen3.embedding_length u32 = 4096
llama_model_loader: - kv 9: qwen3.feed_forward_length u32 = 12288
llama_model_loader: - kv 10: qwen3.attention.head_count u32 = 32
llama_model_loader: - kv 11: qwen3.attention.head_count_kv u32 = 8
llama_model_loader: - kv 12: qwen3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 13: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: qwen3.attention.key_length u32 = 128
llama_model_loader: - kv 15: qwen3.attention.value_length u32 = 128
llama_model_loader: - kv 16: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 17: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 18: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 20: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 22: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 23: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 25: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - kv 27: general.file_type u32 = 18
llama_model_loader: - type f32: 145 tensors
llama_model_loader: - type q6_K: 254 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q6_K
print_info: file size = 6.26 GiB (6.56 BPW)
load: printing all EOG tokens:
load: - 151643 ('<|endoftext|>')
load: - 151645 ('<|im_end|>')
load: - 151662 ('<|fim_pad|>')
load: - 151663 ('<|repo_name|>')
load: - 151664 ('<|file_sep|>')
load: special tokens cache size = 26
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3
print_info: vocab_only = 0
print_info: no_alloc = 0
print_info: n_ctx_train = 40960
print_info: n_embd = 4096
print_info: n_embd_inp = 4096
print_info: n_layer = 36
print_info: n_head = 32
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 12288
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 40960
print_info: rope_yarn_log_mul= 0.0000
print_info: rope_finetuned = unknown
print_info: model type = 8B
print_info: model params = 8.19 B
print_info: general.name = Qwen3 8B Instruct
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
load_tensors: offloading output layer to GPU
load_tensors: offloading 35 repeating layers to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors: CPU model buffer size = 486.86 MiB
load_tensors: SYCL0 model buffer size = 5921.78 MiB
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
.llama_model_load: error loading model: read error: Bad address
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/media/DATA2/LLM_models/Qwen3-8B/Qwen3-8B-Q6_K.gguf'
srv load_model: failed to load model, '/media/DATA2/LLM_models/Qwen3-8B/Qwen3-8B-Q6_K.gguf'
srv operator(): operator(): cleaning up before exit...
main: exiting due to model loading error
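
To put the two context sizes in perspective, here is a rough estimate of the KV cache size, using n_layer, n_embd_k_gqa and n_embd_v_gqa from the print_info lines in the log. It is only a sketch: the 2 bytes per element is an assumption (an f16 KV cache), and the server may round the requested context size, so the real allocation can differ.

```python
# Rough KV cache size estimate for the model in the log above.
# Values copied from the print_info lines:
N_LAYER = 36
N_EMBD_K_GQA = 1024
N_EMBD_V_GQA = 1024

# Assumption (not shown in the log): K and V are stored as f16, 2 bytes each.
BYTES_PER_ELEMENT = 2


def kv_cache_mib(n_ctx: int) -> float:
    """Return the estimated KV cache size in MiB for n_ctx tokens."""
    bytes_per_token = N_LAYER * (N_EMBD_K_GQA + N_EMBD_V_GQA) * BYTES_PER_ELEMENT
    return n_ctx * bytes_per_token / (1024 * 1024)


if __name__ == "__main__":
    # 5000 is the value passed with --ctx-size, 40960 is n_ctx_train.
    for n_ctx in (5000, 40960):
        print(f"n_ctx = {n_ctx}: about {kv_cache_mib(n_ctx):.0f} MiB")
```

Under these assumptions the cache is roughly 703 MiB at 5000 tokens and 5760 MiB at 40960 tokens, so the smaller context should need less device memory than the default that does load.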
