SYCL op GET_ROWS unsupported - expected Arc Pro B60 performance #23670

cwriter · 2026-05-25T13:52:29Z

cwriter
May 25, 2026

Hi

I'm not sure if this is issue-worthy. I'm trying to run Qwen3.6 27B as follows:

llama.cpp b9294
1x Intel Arc Pro B60
Q4_K_S quantization, Q8 KV quant
131072 (half) context size

Run command:

export GGML_BACKEND=SYCL
GPU_BACKEND="SYCL0"
MODEL="unsloth/Qwen3.6-27B-MTP-GGUF:Q4_K_S"
MODEL_ARGS="--spec-type draft-mtp --spec-draft-n-max 2"
PARALLEL="1"
UBATCH_SIZE="1024"
BATCH_SIZE="1024"
FLASH_ATTENTION="-fa on -ctk q8_0 -ctv q8_0 -ctkd q8_0 -ctvd q8_0"
CTX_SIZE="131072"
NLAYERS=99

llama-server \
  --host 0.0.0.0 \
  --mmap \
  --no-mmproj-offload $FLASH_ATTENTION \
  --cache-ram 0 \
  --poll 0 \
  --fit off \
  -t 16 \
  -tb 16 \
  -hf $MODEL $MODEL_ARGS \
  --device $GPU_BACKEND \
  -np $PARALLEL \
  -ngl $NLAYERS \
  -c $CTX_SIZE \
  --checkpoint-every-n-tokens -1 \
  --ctx-checkpoints 0 \
  --batch-size $BATCH_SIZE \
  --ubatch-size $UBATCH_SIZE \
  --no-kv-unified \
  --temp 0.6 \
  --top-p 0.95 \
  --min-p 0.0 \
  --top-k 20 \
  --presence_penalty 0.0 \
  --repeat_penalty 1.0 \
  --jinja

These settings just barely fit, but I get the following message on startup:
W llama_sampler_backend_support: device 'SYCL0' does not have support for op GET_ROWS needed for sampler 'top-k'
I also see one CPU core locked at 100% load, presumably due to the offload. Is this expected?

The prompt-processing (PP) rate is between 100 and 400 t/s, and output is at 9-16 t/s in this setup. Even though GPU utilization is also at 99%, I'm wondering whether the missing GET_ROWS support and the resulting CPU offload are a bottleneck. Unfortunately, when I tried the Vulkan backend, PP performance never exceeded 200 t/s.

This is kind of painful with OpenCode, which likes huge contexts. Is it the quantization that triggers the offload to a single CPU core? Is there an option to parallelize the top-k sampling on the CPU, or is this a driver issue? And would adding a second B60 help, given that at least part of the workload is being moved to the CPU?
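To put the PP numbers in perspective, here is a back-of-the-envelope sketch of how long a cold, full-context prompt would take at the rates I measured. It assumes the whole 131072-token context is processed in one go with no prompt-cache reuse; `prefill_seconds` is just a hypothetical helper name, and integer division is used because only the order of magnitude matters (roughly 5 to 22 minutes).

```shell
#!/usr/bin/env bash
# Rough prefill-time estimate: prompt tokens divided by prompt-processing rate.
# The token count and rates come from the post; prefill_seconds is a hypothetical helper.
prefill_seconds() {
  local tokens="$1" rate="$2"
  # Integer division is enough for a back-of-the-envelope figure.
  echo $(( tokens / rate ))
}

CTX_SIZE=131072
for rate in 100 400; do
  secs="$(prefill_seconds "$CTX_SIZE" "$rate")"
  echo "${rate} t/s -> ${secs} s (~$(( secs / 60 )) min) to process ${CTX_SIZE} tokens"
done
```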

Thanks :)

Startup log below:

0.00.665.972 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.665.975 I device_info:
0.00.792.148 I   - SYCL0   : Intel(R) Arc(TM) Pro B60 Graphics (24480 MiB, 24405 MiB free)
0.00.792.898 I   - Vulkan0 : Intel(R) Arc(tm) Pro B60 Graphics (BMG G21) (24480 MiB, 21964 MiB free)
0.00.792.901 I   - BLAS    : OpenBLAS (0 MiB, 0 MiB free)
0.00.792.913 I   - CPU     : AMD EPYC 7443 24-Core Processor (515620 MiB, 515620 MiB free)
0.00.793.259 I system_info: n_threads = 16 (n_threads_batch = 16) / 16 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 |>
0.00.793.301 I srv          init: running without SSL
0.00.793.453 I srv          init: using 15 threads for HTTP server
0.00.793.558 I srv         start: binding port with default address family
0.00.794.697 I srv  llama_server: loading model
0.00.794.702 I srv    load_model: loading model '/root/.cache/huggingface/hub/models--unsloth--Qwen3.6-27B-MTP-GGUF/snapshots/b3a58239d8d40b953>
0.09.665.553 W llama_context: n_ctx_seq (131072) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
0.09.721.775 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.10.029.522 I srv    load_model: creating MTP draft context against the target model '/root/.cache/huggingface/hub/
0.10.029.546 W llama_context: n_ctx_seq (131072) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
0.10.060.240 W load_hparams: Qwen-VL models require at minimum 1024 image tokens to function correctly on grounding tasks
0.10.060.243 W load_hparams: if you encounter problems with accuracy, try adding --image-min-tokens 1024
0.10.060.243 W load_hparams: more info: https://github.com/ggml-org/llama.cpp/issues/16842
0.10.527.813 I srv    load_model: loaded multimodal model, '/root/.cache/huggingface/hub/models--unsloth--Qwen3.6-27B-MTP-GGUF/snapshots/b3a582>
0.10.527.824 I srv    load_model: initializing slots, n_slots = 1
0.10.640.509 I common_context_can_seq_rm: the context supports bounded partial sequence removal
0.10.646.290 I common_speculative_impl_draft_mtp: adding speculative implementation 'draft-mtp'
0.10.646.293 I common_speculative_impl_draft_mtp: - n_max=2, n_min=0, p_min=0.00, n_embd=5120, backend_sampling=1
0.10.646.294 I common_speculative_impl_draft_mtp: - gpu_layers=-1, cache_k=q8_0, cache_v=q8_0, ctx_tgt=yes, ctx_dft=yes, devices=[default]
0.10.646.353 W llama_sampler_backend_support: device 'SYCL0' does not have support for op GET_ROWS needed for sampler 'top-k'
0.10.646.366 I srv    load_model: speculative decoding context initialized
0.10.646.367 I slot   load_model: id  0 | task -1 | new slot, n_ctx = 131072
0.10.646.386 I srv    load_model: prompt cache is disabled - use `--cache-ram N` to enable it
0.10.646.386 I srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
0.10.646.404 W srv          init: --cache-idle-slots requires --kv-unified, disabling
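For anyone who wants to check their own startup log for the same warning, a minimal sketch follows. It assumes the warning keeps the exact wording shown in the log above; `unsupported_sampler_ops` is a hypothetical helper name, not part of llama.cpp. It prints one "device op sampler" triple per matching line and reads either the files given as arguments or standard input.

```shell
#!/usr/bin/env bash
# List (device, op, sampler) triples from llama_sampler_backend_support warnings.
# unsupported_sampler_ops is a hypothetical helper; the pattern mirrors the
# warning text as it appears in the startup log above.
unsupported_sampler_ops() {
  sed -n "s/.*llama_sampler_backend_support: device '\([^']*\)' does not have support for op \([A-Z_0-9]*\) needed for sampler '\([^']*\)'.*/\1 \2 \3/p" "$@"
}

# Example input: two lines copied from the startup log; only the warning matches.
unsupported_sampler_ops <<'EOF'
0.10.646.353 W llama_sampler_backend_support: device 'SYCL0' does not have support for op GET_ROWS needed for sampler 'top-k'
0.10.646.366 I srv    load_model: speculative decoding context initialized
EOF
```

With a saved log, the same helper can be called as `unsupported_sampler_ops server.log`, where `server.log` is a placeholder file name.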

Answered by arthw


arthw · 2026-05-26T04:28:35Z

arthw
May 26, 2026
Collaborator

@cwriter
This happens because GET_ROWS does not support the q4_K data type in your case.
We will add support for it.

Thank you for reporting!
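A possible way to check whether a given build still lacks the op is to probe the backend directly. This is only a sketch under assumptions: it assumes the `test-backend-ops` tool from the llama.cpp build is on the PATH and that its `support` mode and `-o`/`-b` options behave as in recent builds. `probe_get_rows` is a hypothetical wrapper name, and the device name `SYCL0` is taken from the log above.

```shell
#!/usr/bin/env bash
# Probe a backend for GET_ROWS support using llama.cpp's test-backend-ops tool.
# Assumptions: the tool is on PATH and accepts "support -o <op> -b <backend>".
# probe_get_rows is a hypothetical wrapper, not part of llama.cpp.
probe_get_rows() {
  local bin="${1:-test-backend-ops}" dev="${2:-SYCL0}"
  if ! command -v "$bin" >/dev/null 2>&1; then
    # Skip gracefully when the tool was not built or is not on PATH.
    echo "skip: $bin not found in PATH"
    return 0
  fi
  "$bin" support -o GET_ROWS -b "$dev"
}

probe_get_rows
```

The output should be inspected for the q4_K rows in particular, since that is the data type named above.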

0 replies

arthw · 2026-05-26T10:12:42Z

arthw
May 26, 2026
Collaborator

@cwriter
Fixed by PR: #23710.

Thank you!

2 replies

cwriter May 26, 2026
Author

Wow, that was extremely quick! Thank you very much!

arthw May 27, 2026
Collaborator

The CI has been very busy recently, so the PR may take a while to be merged.

If you run into more problems, please report them as issues!

Thank you!
