llama : fix pooling assertion crash in chunked GDN detection path #20468
Merged
ggerganov merged 2 commits into ggml-org:master on Mar 13, 2026
Conversation
added 2 commits
March 12, 2026 18:33
The chunked fused Gated Delta Net detection in sched_reserve() calls graph_reserve(16*n_seqs, n_seqs, n_outputs, ...) where n_outputs = n_seqs. This creates a dimension mismatch in build_pooling() for embedding models with mean/rank pooling: build_inp_mean() creates a tensor with shape [n_tokens=16*n_seqs, ...], while t_embd is reduced to [n_outputs=n_seqs, ...] via out_ids, causing ggml_mul_mat to assert on ggml_can_mul_mat(a, b).

Fix: pass n_tokens as n_outputs in the chunked GDN graph reservation, matching the pattern used by the pp/tg worst-case reservations.

Regression introduced by ggml-org#20340 (d28961d). Same class of bug as ggml-org#12517, fixed by ggml-org#12545.
Add test_embedding_pooling_mean and test_embedding_pooling_mean_multiple to cover the --pooling mean codepath, which was previously untested. These tests would have caught the regression introduced by ggml-org#20340 where build_pooling() crashes with a ggml_mul_mat assertion due to mismatched dimensions in the chunked GDN detection path.
ggerganov approved these changes on Mar 13, 2026
Summary
Fix `ggml_mul_mat` assertion failure (`ggml_can_mul_mat(a, b)`) when using `--pooling mean` with embedding models (e.g. bge-m3).

Regression introduced by #20340 (commit d28961d, first affected release: b8278). Same class of bug as #12517, fixed by #12545.
Root cause
The chunked fused GDN detection in `sched_reserve()` calls `graph_reserve(16*n_seqs, n_seqs, n_outputs, ...)` where `n_outputs = n_seqs`. This builds the full computation graph, including `build_pooling()`. For embedding models with mean pooling:

- `build_inp_mean()` creates a tensor with shape `[n_tokens = 16*n_seqs, n_seqs_unq]`
- `res->t_embd` is reduced to `[n_outputs = n_seqs, ...]` via `ggml_get_rows(cur, inp_out_ids)`
- `ggml_mul_mat(transpose(t_embd), inp_mean)` asserts because `n_seqs != 16*n_seqs`

The other detection calls (Flash Attention, GDN AR) pass `n_tokens = 1`, which triggers the rounding guard (`n_tokens % n_seqs != 0` → `n_outputs = max(n_outputs, n_tokens)`), so they don't crash. The chunked path with `n_tokens = 16*n_seqs` passes the guard evenly and hits the mismatch.
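To make the shape mismatch concrete, here is a minimal standalone sketch (not code from this PR, and not how the reservation path is actually wired); the tensor names mirror `build_pooling()` and the sizes are illustrative:

```cpp
// Hypothetical repro of the assert: build the two pooling inputs with the
// shapes described above and feed them to ggml_mul_mat, which requires
// a->ne[0] == b->ne[0].
#include "ggml.h"

int main() {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ nullptr,
        /*.no_alloc   =*/ true,   // graph reservation only needs shapes
    };
    struct ggml_context * ctx = ggml_init(params);

    const int64_t n_embd    = 1024;
    const int64_t n_seqs    = 2;
    const int64_t n_tokens  = 16*n_seqs; // chunked GDN reservation batch
    const int64_t n_outputs = n_seqs;    // what the buggy reservation passed

    // t_embd after ggml_get_rows(cur, inp_out_ids): [n_embd, n_outputs]
    ggml_tensor * t_embd   = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, n_outputs);
    // tensor created by build_inp_mean():           [n_tokens, n_seqs_unq]
    ggml_tensor * inp_mean = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_tokens, n_seqs);

    // mean pooling: transpose(t_embd) has ne[0] = n_outputs = 2, inp_mean has
    // ne[0] = n_tokens = 32 -> GGML_ASSERT(ggml_can_mul_mat(a, b)) fires here
    ggml_mul_mat(ctx, ggml_cont(ctx, ggml_transpose(ctx, t_embd)), inp_mean);

    ggml_free(ctx);
    return 0;
}
```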
Fix

Pass `n_tokens` as `n_outputs` in the chunked GDN graph reservation, matching the pattern used by the pp/tg worst-case reservations (before/after sketch below).
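The following is an illustrative before/after of the reservation call, not the verbatim diff; the trailing `graph_reserve()` arguments are elided here just as in the commit message:

```cpp
// Sketch only: the chunked GDN detection reservation in sched_reserve().

// before (regression from #20340): n_outputs = n_seqs
//     graph_reserve(16*n_seqs, n_seqs, n_seqs, ...);

// after (this PR): n_outputs = n_tokens = 16*n_seqs, matching the
// pp/tg worst-case reservations
//     graph_reserve(16*n_seqs, n_seqs, 16*n_seqs, ...);
```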
Also adds `test_embedding_pooling_mean` and `test_embedding_pooling_mean_multiple` to the server test suite; this pooling mode was previously untested.
Reproduction

```
llama-server -m bge-m3-Q8_0.gguf --pooling mean --embedding -c 8192 --parallel 2
# GGML_ASSERT(ggml_can_mul_mat(a, b)) in build_pooling
```

`--pooling cls` and `--pooling last` are not affected (they use `ggml_get_rows`, not `ggml_mul_mat`).
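As a sketch of why those modes are safe (this fragment reuses the illustrative variables from the snippet in the Root cause section and is an assumption, not the real `build_pooling()` code): row gathering places no constraint tying `n_outputs` to `n_tokens`.

```cpp
// CLS/last pooling only gathers rows of t_embd by index, so there is no
// requirement that n_outputs match n_tokens and no assert during reservation.
ggml_tensor * inp_cls = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, n_seqs); // one row index per sequence
ggml_tensor * pooled  = ggml_get_rows(ctx, t_embd, inp_cls);            // [n_embd, n_seqs]
```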
Testing

Tested on AMD gfx1151 (ROCm 7.11) and CPU-only builds:

- `--pooling mean --embedding`
- `--pooling cls --embedding`
- `--pooling last --embedding`

Existing server test suite: 21/21 passed. 2 new mean pooling tests added and passing locally.
AI was used in an assistive capacity (analysis and drafting). All code was manually reviewed and tested.