llama : fix pooling assertion crash in chunked GDN detection path#20468

Merged
ggerganov merged 2 commits into ggml-org:master from ZeroV0LT:fix/pooling-gdn-chunked-crash on Mar 13, 2026

Conversation

@ZeroV0LT
Contributor

Summary

Fix ggml_mul_mat assertion failure (ggml_can_mul_mat(a, b)) when using --pooling mean with embedding models (e.g. bge-m3).

Regression introduced by #20340 (commit d28961d, first affected release: b8278). Same class of bug as #12517, fixed by #12545.

Root cause

The chunked fused GDN detection in sched_reserve() calls:

graph_reserve(16*n_seqs, n_seqs, n_outputs, mctx.get(), true);

where n_outputs = n_seqs. This builds the full computation graph including build_pooling(). For embedding models with mean pooling:

  • build_inp_mean() creates a tensor with shape [n_tokens = 16*n_seqs, n_seqs_unq]
  • res->t_embd is reduced to [n_outputs = n_seqs, ...] via ggml_get_rows(cur, inp_out_ids)
  • ggml_mul_mat(transpose(t_embd), inp_mean) asserts because n_seqs != 16*n_seqs

The other detection calls (Flash Attention, GDN AR) pass n_tokens = 1, which triggers the rounding guard (n_tokens % n_seqs != 0, so n_outputs = max(n_outputs, n_tokens)), and therefore don't crash. The chunked path with n_tokens = 16*n_seqs divides evenly, bypasses the guard, and hits the mismatch.

Fix

Pass n_tokens as n_outputs in the chunked GDN graph reservation, matching the pattern used by the pp/tg worst-case reservations:

const uint32_t n_tokens_ch = 16*n_seqs;
auto * gf = graph_reserve(n_tokens_ch, n_seqs, n_tokens_ch, mctx.get(), true);

Also adds test_embedding_pooling_mean and test_embedding_pooling_mean_multiple to the server test suite — this pooling mode was previously untested.

Reproduction

llama-server -m bge-m3-Q8_0.gguf --pooling mean --embedding -c 8192 --parallel 2
# GGML_ASSERT(ggml_can_mul_mat(a, b)) in build_pooling
  • Crashes on: b8278+ (master as of 2026-03-12)
  • Works on: b8155 and earlier
  • --pooling cls and --pooling last are not affected (use ggml_get_rows, not ggml_mul_mat)

Testing

Tested on AMD gfx1151 (ROCm 7.11) and CPU-only builds:

| Model | Architecture | Use case | Unpatched | Patched |
| --- | --- | --- | --- | --- |
| bge-m3 FP16 | BERT | --pooling mean --embedding | CRASH | OK |
| bge-m3 FP16 | BERT | --pooling cls --embedding | OK | OK |
| bge-m3 FP16 | BERT | --pooling last --embedding | OK | OK |
| Qwen3.5-4B Q4_K_M | GDN | Chat completion | OK (not affected) | OK |

Existing server test suite: 21/21 passed. 2 new mean pooling tests added and passing locally.

AI was used in an assistive capacity (analysis and drafting). All code was manually reviewed and tested.

Domenico Crupi added 2 commits March 12, 2026 18:33
The chunked fused Gated Delta Net detection in sched_reserve() calls
graph_reserve(16*n_seqs, n_seqs, n_outputs, ...) where n_outputs = n_seqs.
This creates a dimension mismatch in build_pooling() for embedding models
with mean/rank pooling: build_inp_mean() creates a tensor with shape
[n_tokens=16*n_seqs, ...] while t_embd is reduced to [n_outputs=n_seqs, ...]
via out_ids, causing ggml_mul_mat to assert on ggml_can_mul_mat(a, b).

Fix: pass n_tokens as n_outputs in the chunked GDN graph reservation,
matching the pattern used by the pp/tg worst-case reservations.

Regression introduced by ggml-org#20340 (d28961d).
Same class of bug as ggml-org#12517, fixed by ggml-org#12545.
Add test_embedding_pooling_mean and test_embedding_pooling_mean_multiple
to cover the --pooling mean codepath, which was previously untested.

These tests would have caught the regression introduced by ggml-org#20340 where
build_pooling() crashes with a ggml_mul_mat assertion due to mismatched
dimensions in the chunked GDN detection path.
@ZeroV0LT ZeroV0LT requested a review from ggerganov as a code owner March 12, 2026 19:37
@github-actions github-actions bot added examples python python script changes server labels Mar 12, 2026
@ggerganov ggerganov linked an issue Mar 13, 2026 that may be closed by this pull request
@ggerganov ggerganov merged commit f17b3be into ggml-org:master Mar 13, 2026
77 of 79 checks passed


Development

Successfully merging this pull request may close these issues:

  • Eval bug: the fused GDN reserve should skip for embedding/pooling models
  • Eval bug: Embedding models crash either upon loading or during use
