Bug report: GGML_ASSERT(i01 >= 0 && i01 < ne01) crash in get_rows / mtmd_helper_decode_image_chunk when using MTP + MoE model + vision (Qwen3.6-35B-A3B)
Name and Version
# built from ggml-org/llama.cpp master (May 2026), CUDA SM86, WSL2 Ubuntu
./build/bin/llama-server --version
Operating systems
Linux (WSL2 on Windows 11)
GGML backends
CUDA
Hardware
- NVIDIA GeForce RTX 3080 12 GB (11103 MiB free at server start — not OOM)
- Intel i9-12900K
- CUDA 13, WSL2 Ubuntu
Models
Problem description & steps to reproduce
Server starts normally. Sending the first image via the chat API causes a crash inside the vision decode path.
The processing image... log line appears, then ~16 find_slot: non-consecutive token position warnings fire, then the process aborts:
/ggml/src/ggml-cpu/ops.cpp:4745: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
ggml_compute_forward_get_rows
mtmd_helper_decode_image_chunk
mtmd_helper_eval_chunk_single
server_tokens::process_chunk
server_context_impl::update_slots
Flags used (crash):
./build/bin/llama-server \
-m models/Qwen3.6-35B-A3B-MTP-UD-Q4_K_M.gguf \
--mmproj models/mmproj-BF16.gguf \
--image-min-tokens 1024 \
--fit on \
--ctx-size 98304 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--no-mmap \
--threads 8 \
--parallel 1 \
--spec-type draft-mtp --spec-draft-n-max 3 --draft-p-min 0.0 \
--host 0.0.0.0
Workaround: Remove --spec-type draft-mtp --spec-draft-n-max 3 --draft-p-min 0.0. Without MTP, images process correctly with the same model and mmproj.
Things already ruled out
| Hypothesis |
Result |
| OOM / VRAM pressure |
No — 11103 MiB free, mmproj only 1134 MiB |
Multi-slot KV reuse (--parallel 4 default) |
No — crash persists with --parallel 1 |
| q8_0 KV cache corrupting image embeddings |
No — crash persists with KV at default f16 |
--image-min-tokens 1024 missing |
No — crash persists even with this flag |
| MTP active with MoE model |
Yes — removing --spec-type draft-mtp fixes it |
Not reproducible with dense model
A colleague runs Qwen3.6-27B-A3B (dense, not MoE) with --spec-type draft-mtp, --cache-type-k q8_0, and --mmproj simultaneously on native Linux (RTX 4090) without any crash. The bug appears specific to the MoE variant.
Relevant log output
0.00.278.337 I - CUDA0 : NVIDIA GeForce RTX 3080 (12287 MiB, 11103 MiB free)
...
0.24.044.881 I srv load_model: loaded multimodal model, 'models/mmproj-BF16.gguf'
0.24.044.948 I srv load_model: initializing slots, n_slots = 1
...
1.50.450.029 I srv process_chun: processing image...
1.59.025.234 W find_slot: non-consecutive token position 7 after 6 for sequence 0 with 512 new tokens
[... ~16 more find_slot warnings ...]
2.05.173.708 I srv process_chun: image processed in 14724 ms
2.05.173.733 I srv process_chun: processing image...
/mnt/c/.../ggml/src/ggml-cpu/ops.cpp:4745: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
...
libmtmd.so.0(mtmd_helper_decode_image_chunk+0x77c)
libmtmd.so.0(mtmd_helper_eval_chunk_single+0x164)
libllama-server-impl.so(_ZNK13server_tokens13process_chunkE...)
libllama-server-impl.so(_ZN19server_context_impl12update_slotsEv...)
Aborted (core dumped)
Bug report:
GGML_ASSERT(i01 >= 0 && i01 < ne01)crash inget_rows/mtmd_helper_decode_image_chunkwhen using MTP + MoE model + vision (Qwen3.6-35B-A3B)Name and Version
Operating systems
Linux (WSL2 on Windows 11)
GGML backends
CUDA
Hardware
Models
unsloth/Qwen3.6-35B-A3B-MTP-GGUF→Qwen3.6-35B-A3B-MTP-UD-Q4_K_M.gguf(MoE, 35B total / ~3B active)mmproj-BF16.gguf(Qwen-VL vision projector, same repo)Problem description & steps to reproduce
Server starts normally. Sending the first image via the chat API causes a crash inside the vision decode path.
The
processing image...log line appears, then ~16find_slot: non-consecutive token positionwarnings fire, then the process aborts:Flags used (crash):
Workaround: Remove
--spec-type draft-mtp --spec-draft-n-max 3 --draft-p-min 0.0. Without MTP, images process correctly with the same model and mmproj.Things already ruled out
--parallel 4default)--parallel 1--image-min-tokens 1024missing--spec-type draft-mtpfixes itNot reproducible with dense model
A colleague runs Qwen3.6-27B-A3B (dense, not MoE) with
--spec-type draft-mtp,--cache-type-k q8_0, and--mmprojsimultaneously on native Linux (RTX 4090) without any crash. The bug appears specific to the MoE variant.Relevant log output