llama : fix pooling assertion crash in chunked GDN detection path#20468

Merged
ggerganov merged 2 commits into ggml-org:master from ZeroV0LT:fix/pooling-gdn-chunked-crash on Mar 13, 2026

Conversation

@ZeroV0LT
Contributor

Summary

Fix ggml_mul_mat assertion failure (ggml_can_mul_mat(a, b)) when using --pooling mean with embedding models (e.g. bge-m3).

Regression introduced by #20340 (commit d28961d, first affected release: b8278). Same class of bug as #12517, fixed by #12545.

Root cause

The chunked fused GDN detection in sched_reserve() calls:

graph_reserve(16*n_seqs, n_seqs, n_outputs, mctx.get(), true);

where n_outputs = n_seqs. This builds the full computation graph including build_pooling(). For embedding models with mean pooling:

  • build_inp_mean() creates a tensor with shape [n_tokens = 16*n_seqs, n_seqs_unq]
  • res->t_embd is reduced to [n_outputs = n_seqs, ...] via ggml_get_rows(cur, inp_out_ids)
  • ggml_mul_mat(transpose(t_embd), inp_mean) asserts because n_seqs != 16*n_seqs

The other detection calls (Flash Attention, GDN AR) pass n_tokens = 1, which triggers the rounding guard (n_tokens % n_seqs != 0, so n_outputs = max(n_outputs, n_tokens)), and therefore don't crash. The chunked path with n_tokens = 16*n_seqs divides evenly, bypasses the guard, and hits the mismatch.

Fix

Pass n_tokens as n_outputs in the chunked GDN graph reservation, matching the pattern used by the pp/tg worst-case reservations:

const uint32_t n_tokens_ch = 16*n_seqs;
auto * gf = graph_reserve(n_tokens_ch, n_seqs, n_tokens_ch, mctx.get(), true);

Also adds test_embedding_pooling_mean and test_embedding_pooling_mean_multiple to the server test suite — this pooling mode was previously untested.

Reproduction

llama-server -m bge-m3-Q8_0.gguf --pooling mean --embedding -c 8192 --parallel 2
# GGML_ASSERT(ggml_can_mul_mat(a, b)) in build_pooling
  • Crashes on: b8278+ (master as of 2026-03-12)
  • Works on: b8155 and earlier
  • --pooling cls and --pooling last are not affected (use ggml_get_rows, not ggml_mul_mat)

Testing

Tested on AMD gfx1151 (ROCm 7.11) and CPU-only builds:

| Model | Architecture | Use case | Unpatched | Patched |
| --- | --- | --- | --- | --- |
| bge-m3 FP16 | BERT | --pooling mean --embedding | CRASH | OK |
| bge-m3 FP16 | BERT | --pooling cls --embedding | OK | OK |
| bge-m3 FP16 | BERT | --pooling last --embedding | OK | OK |
| Qwen3.5-4B Q4_K_M | GDN | Chat completion | OK (not affected) | OK |

Existing server test suite: 21/21 passed. 2 new mean pooling tests added and passing locally.

AI was used in an assistive capacity (analysis and drafting). All code was manually reviewed and tested.

Domenico Crupi added 2 commits March 12, 2026 18:33
The chunked fused Gated Delta Net detection in sched_reserve() calls
graph_reserve(16*n_seqs, n_seqs, n_outputs, ...) where n_outputs = n_seqs.
This creates a dimension mismatch in build_pooling() for embedding models
with mean/rank pooling: build_inp_mean() creates a tensor with shape
[n_tokens=16*n_seqs, ...] while t_embd is reduced to [n_outputs=n_seqs, ...]
via out_ids, causing ggml_mul_mat to assert on ggml_can_mul_mat(a, b).

Fix: pass n_tokens as n_outputs in the chunked GDN graph reservation,
matching the pattern used by the pp/tg worst-case reservations.

Regression introduced by ggml-org#20340 (d28961d).
Same class of bug as ggml-org#12517, fixed by ggml-org#12545.
Add test_embedding_pooling_mean and test_embedding_pooling_mean_multiple
to cover the --pooling mean codepath, which was previously untested.

These tests would have caught the regression introduced by ggml-org#20340 where
build_pooling() crashes with a ggml_mul_mat assertion due to mismatched
dimensions in the chunked GDN detection path.
@ZeroV0LT ZeroV0LT requested a review from ggerganov as a code owner March 12, 2026 19:37
@github-actions github-actions bot added examples python python script changes server labels Mar 12, 2026
@ggerganov ggerganov linked an issue Mar 13, 2026 that may be closed by this pull request
@ggerganov ggerganov merged commit f17b3be into ggml-org:master Mar 13, 2026
77 of 79 checks passed


Development

Successfully merging this pull request may close these issues:

  • Eval bug: the fused GDN reserve should skip for embedding/pooling models
  • Eval bug: Embedding models crash either upon loading or during use
