GGML_ASSERT(i01 >= 0 && i01 < ne01) crash in get_rows / mtmd_helper_decode_image_chunk when using MTP + MoE model + vision (Qwen3.6-35B-A3B)

## Bug report: `GGML_ASSERT(i01 >= 0 && i01 < ne01)` crash in `get_rows` / `mtmd_helper_decode_image_chunk` when using MTP + MoE model + vision (Qwen3.6-35B-A3B)

### Name and Version

```
# built from ggml-org/llama.cpp master (May 2026), CUDA SM86, WSL2 Ubuntu
./build/bin/llama-server --version
```

### Operating systems

Linux (WSL2 on Windows 11)

### GGML backends

CUDA

### Hardware

- NVIDIA GeForce RTX 3080 12 GB (11103 MiB free at server start — not OOM)
- Intel i9-12900K
- CUDA 13, WSL2 Ubuntu

### Models

- [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) → `Qwen3.6-35B-A3B-MTP-UD-Q4_K_M.gguf` (**MoE**, 35B total / ~3B active)
- `mmproj-BF16.gguf` (Qwen-VL vision projector, same repo)

### Problem description & steps to reproduce

Server starts normally. Sending the **first image** via the chat API causes a crash inside the vision decode path.

The `processing image...` log line appears, followed by ~16 `find_slot: non-consecutive token position` warnings. The log then reports `image processed in 14724 ms`, prints a second `processing image...` line, and the process aborts:

```
/ggml/src/ggml-cpu/ops.cpp:4745: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
ggml_compute_forward_get_rows
mtmd_helper_decode_image_chunk
mtmd_helper_eval_chunk_single
server_tokens::process_chunk
server_context_impl::update_slots
```

**Flags used (crash):**

```bash
./build/bin/llama-server \
  -m models/Qwen3.6-35B-A3B-MTP-UD-Q4_K_M.gguf \
  --mmproj models/mmproj-BF16.gguf \
  --image-min-tokens 1024 \
  --fit on \
  --ctx-size 98304 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --no-mmap \
  --threads 8 \
  --parallel 1 \
  --spec-type draft-mtp --spec-draft-n-max 3 --draft-p-min 0.0 \
  --host 0.0.0.0
```

**Workaround:** Remove `--spec-type draft-mtp --spec-draft-n-max 3 --draft-p-min 0.0`. Without MTP, images process correctly with the same model and mmproj.
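For an A/B comparison, both runs can be generated from one helper so that they differ only in the MTP flags. This is a minimal sketch that uses only the flags listed in this report; the `server_cmd` function and its `0`/`1` argument are hypothetical names introduced here, not part of llama.cpp.

```sh
#!/bin/sh
# Prints the llama-server command line from this report,
# with (1) or without (0) the MTP speculative-decoding flags.
# server_cmd is a hypothetical helper for illustration only.
server_cmd() {
  use_mtp="${1:-0}"
  cmd="./build/bin/llama-server"
  cmd="$cmd -m models/Qwen3.6-35B-A3B-MTP-UD-Q4_K_M.gguf"
  cmd="$cmd --mmproj models/mmproj-BF16.gguf"
  cmd="$cmd --image-min-tokens 1024"
  cmd="$cmd --fit on"
  cmd="$cmd --ctx-size 98304"
  cmd="$cmd --cache-type-k q8_0"
  cmd="$cmd --cache-type-v q8_0"
  cmd="$cmd --no-mmap"
  cmd="$cmd --threads 8"
  cmd="$cmd --parallel 1"
  cmd="$cmd --host 0.0.0.0"
  # Leaving these three flags out is the workaround described above.
  if [ "$use_mtp" = "1" ]; then
    cmd="$cmd --spec-type draft-mtp --spec-draft-n-max 3 --draft-p-min 0.0"
  fi
  printf '%s\n' "$cmd"
}

# Print both variants side by side; nothing is launched here.
echo "workaround: $(server_cmd 0)"
echo "crashing:   $(server_cmd 1)"
```

To actually start one variant, the printed line can be pasted into a shell or run with `eval "$(server_cmd 0)"`; this is safe here only because none of the paths contain spaces.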

### Things already ruled out

| Hypothesis | Result |
|---|---|
| OOM / VRAM pressure | No — 11103 MiB free, mmproj only 1134 MiB |
| Multi-slot KV reuse (`--parallel 4` default) | No — crash persists with `--parallel 1` |
| q8_0 KV cache corrupting image embeddings | No — crash persists with KV at default f16 |
| `--image-min-tokens 1024` missing | No — crash persists even with this flag |
| **MTP active with MoE model** | **Yes — removing `--spec-type draft-mtp` fixes it** |

### Not reproducible with dense model

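A colleague runs **Qwen3.6-27B-A3B** (dense, not MoE) with `--spec-type draft-mtp`, `--cache-type-k q8_0`, and `--mmproj` simultaneously on native Linux (RTX 4090) without any crash. This suggests the bug is specific to the **MoE variant**, although that setup also differs in OS (native Linux rather than WSL2) and GPU, so those factors are not yet ruled out.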

### Relevant log output

```
0.00.278.337 I   - CUDA0   : NVIDIA GeForce RTX 3080 (12287 MiB, 11103 MiB free)
...
0.24.044.881 I srv    load_model: loaded multimodal model, 'models/mmproj-BF16.gguf'
0.24.044.948 I srv    load_model: initializing slots, n_slots = 1
...
1.50.450.029 I srv  process_chun: processing image...
1.59.025.234 W find_slot: non-consecutive token position 7 after 6 for sequence 0 with 512 new tokens
[... ~16 more find_slot warnings ...]
2.05.173.708 I srv  process_chun: image processed in 14724 ms
2.05.173.733 I srv  process_chun: processing image...
/mnt/c/.../ggml/src/ggml-cpu/ops.cpp:4745: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
...
libmtmd.so.0(mtmd_helper_decode_image_chunk+0x77c)
libmtmd.so.0(mtmd_helper_eval_chunk_single+0x164)
libllama-server-impl.so(_ZNK13server_tokens13process_chunkE...)
libllama-server-impl.so(_ZN19server_context_impl12update_slotsEv...)
Aborted (core dumped)
```
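When comparing logs from runs with and without MTP, the three signals above (image chunks started, `find_slot` warnings, and the `get_rows` assert) can be pulled out with a few `grep` calls. The following is a rough sketch; `summarize_log` is a hypothetical helper, and the log file path passed to it is whatever file the server output was redirected to.

```sh
#!/bin/sh
# Summarize a llama-server log: count image chunks and find_slot warnings,
# and report whether the get_rows assert fired.
# summarize_log is a hypothetical helper, not part of llama.cpp.
summarize_log() {
  log="$1"
  # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
  chunks=$(grep -cF 'processing image' "$log" || true)
  warnings=$(grep -cF 'find_slot: non-consecutive token position' "$log" || true)
  if grep -qF 'GGML_ASSERT(i01 >= 0 && i01 < ne01) failed' "$log"; then
    asserted=yes
  else
    asserted=no
  fi
  printf 'image_chunks=%s find_slot_warnings=%s get_rows_assert=%s\n' \
    "$chunks" "$warnings" "$asserted"
}

# Usage: pass the log file as the first argument to this script.
if [ $# -gt 0 ]; then
  summarize_log "$1"
fi
```

Running it against a log from each configuration gives a one-line summary per run, which makes it easier to see whether the warnings appear only when MTP is enabled.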



GGML_ASSERT(i01 >= 0 && i01 < ne01) crash in get_rows / mtmd_helper_decode_image_chunk when using MTP + MoE model + vision (Qwen3.6-35B-A3B) #23585
