forked from ggml-org/llama.cpp
Segfault: ggml_can_mul_mat assertion with MXFP4 weights + turbo KV #42
Bug Report
Priority: P1
Reporter: @NigelTufnel12345 (via X)
Status: Confirmed crash, awaiting logs
Environment
- Hardware: M3 Ultra, 512GB RAM
- Model: GPT-OSS-120B, MXFP4 quantization
- Config: -ctk q8_0 -ctv turbo4 -fa on
- Branch: feature/turboquant-kv-cache
Symptom
Immediate segfault on startup. The backtrace points to:
ggml.c:3246: GGML_ASSERT(ggml_can_mul_mat(a, b)) failed
Model runs fine with regular q8_0 KV (no turbo). Crash is specific to the turbo code path.
Probable Cause
The ggml_can_mul_mat assertion indicates a tensor shape mismatch. MXFP4 is a newer weight quantization format, and the turbo KV insertion or flash-attention path likely has an unhandled shape interaction with MXFP4 weight tensors.
Possible areas to investigate:
- Turbo SET_ROWS kernel shape assumptions with MXFP4 input tensors
- Flash attention tile/vec kernel dispatch with non-standard weight types
- head_dim detection or zero-padding logic with MXFP4 metadata
Logs Needed
@NigelTufnel12345 — please upload:
- Full backtrace (the complete crash output, not just the assertion line)
- Model startup log (the lines showing n_embd, n_head, n_embd_head_k, and any turbo-related log lines)
- Exact command used to launch the server/CLI
Workaround
Use -ctk q8_0 -ctv q8_0 (no turbo compression) until this is resolved.
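For reference, a full invocation with the workaround applied might look like the following (the model path is a placeholder, not from this report; only the -ctk/-ctv/-fa flags come from the config above):

```shell
# Workaround invocation: hypothetical model path; q8_0 KV on both K and V,
# flash attention left on as in the original config.
./llama-server -m models/gpt-oss-120b-mxfp4.gguf \
    -ctk q8_0 -ctv q8_0 -fa on
```
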