Skip to content

GGML_ASSERT(i01 >= 0 && i01 < ne01) crash in get_rows / mtmd_helper_decode_image_chunk when using MTP + MoE model + vision (Qwen3.6-35B-A3B) #23585

@1191577

Description

@1191577

Bug report: GGML_ASSERT(i01 >= 0 && i01 < ne01) crash in get_rows / mtmd_helper_decode_image_chunk when using MTP + MoE model + vision (Qwen3.6-35B-A3B)

Name and Version

# built from ggml-org/llama.cpp master (May 2026), CUDA SM86, WSL2 Ubuntu
./build/bin/llama-server --version

Operating systems

Linux (WSL2 on Windows 11)

GGML backends

CUDA

Hardware

  • NVIDIA GeForce RTX 3080 12 GB (11103 MiB free at server start — not OOM)
  • Intel i9-12900K
  • CUDA 13, WSL2 Ubuntu

Models

Problem description & steps to reproduce

Server starts normally. Sending the first image via the chat API causes a crash inside the vision decode path.

The processing image... log line appears, then ~16 find_slot: non-consecutive token position warnings fire, then the process aborts:

/ggml/src/ggml-cpu/ops.cpp:4745: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
ggml_compute_forward_get_rows
mtmd_helper_decode_image_chunk
mtmd_helper_eval_chunk_single
server_tokens::process_chunk
server_context_impl::update_slots

Flags used (crash):

./build/bin/llama-server \
  -m models/Qwen3.6-35B-A3B-MTP-UD-Q4_K_M.gguf \
  --mmproj models/mmproj-BF16.gguf \
  --image-min-tokens 1024 \
  --fit on \
  --ctx-size 98304 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --no-mmap \
  --threads 8 \
  --parallel 1 \
  --spec-type draft-mtp --spec-draft-n-max 3 --draft-p-min 0.0 \
  --host 0.0.0.0

Workaround: Remove --spec-type draft-mtp --spec-draft-n-max 3 --draft-p-min 0.0. Without MTP, images process correctly with the same model and mmproj.

Things already ruled out

Hypothesis Result
OOM / VRAM pressure No — 11103 MiB free, mmproj only 1134 MiB
Multi-slot KV reuse (--parallel 4 default) No — crash persists with --parallel 1
q8_0 KV cache corrupting image embeddings No — crash persists with KV at default f16
--image-min-tokens 1024 missing No — crash persists even with this flag
MTP active with MoE model Yes — removing --spec-type draft-mtp fixes it

Not reproducible with dense model

A colleague runs Qwen3.6-27B-A3B (dense, not MoE) with --spec-type draft-mtp, --cache-type-k q8_0, and --mmproj simultaneously on native Linux (RTX 4090) without any crash. The bug appears specific to the MoE variant.

Relevant log output

0.00.278.337 I   - CUDA0   : NVIDIA GeForce RTX 3080 (12287 MiB, 11103 MiB free)
...
0.24.044.881 I srv    load_model: loaded multimodal model, 'models/mmproj-BF16.gguf'
0.24.044.948 I srv    load_model: initializing slots, n_slots = 1
...
1.50.450.029 I srv  process_chun: processing image...
1.59.025.234 W find_slot: non-consecutive token position 7 after 6 for sequence 0 with 512 new tokens
[... ~16 more find_slot warnings ...]
2.05.173.708 I srv  process_chun: image processed in 14724 ms
2.05.173.733 I srv  process_chun: processing image...
/mnt/c/.../ggml/src/ggml-cpu/ops.cpp:4745: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
...
libmtmd.so.0(mtmd_helper_decode_image_chunk+0x77c)
libmtmd.so.0(mtmd_helper_eval_chunk_single+0x164)
libllama-server-impl.so(_ZNK13server_tokens13process_chunkE...)
libllama-server-impl.so(_ZN19server_context_impl12update_slotsEv...)
Aborted (core dumped)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions