ggml : fix SSM_SCAN for n_groups > 1#15625
Conversation
gabe-l-hart
left a comment
There was a problem hiding this comment.
The changes look good to me, but I don't have them fully verified yet. When merged into my NemotronH branch, the results are much better than previously, but they still don't match the transformers CUDA output. This is likely somewhere else in the model implementation unrelated to this bug, though. I have run Mamba-Codestral-7B with the fix and it shows good quality output.
gabe-l-hart
left a comment
There was a problem hiding this comment.
I now have this working for NemotronH. Without this fix, the results are still garbage, but with this fix, I get coherent results that match (with --temp 0) across CPU, Metal, and CUDA. I've also verified that results on Mamba-Codestral-7B match across CPU and Metal and show good quality, and for granite-4.0-tiny-preview (to sanity check a n_groups == 1 model).
It's good to ship in my book!
|
@compilade Assuming you don't have any further testing you want to do, let's merge this one! |
…nemotron-nano-15409 * origin/master: ggml : fix SSM_SCAN for n_groups > 1 (ggml-org#15625) kv-cache : fix find_slot to not search for continuous slot (ggml-org#15638) model : jina-embeddings-v3 support (ggml-org#13693)
…upport * origin/master: ggml : fix SSM_SCAN for n_groups > 1 (ggml-org#15625) kv-cache : fix find_slot to not search for continuous slot (ggml-org#15638) model : jina-embeddings-v3 support (ggml-org#13693) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This fixes a problem noticed by @gabe-l-hart in #15507 (comment).
The upstream implementation of the SSM scan repeats the grouped parts of B and C like
repeat_interleavebehaves instead of likerepeat.Since most Mamba2 models use
n_groups == 1, and that Mamba-Codestral-7B-v0.1 (which usesn_groups == 8) had a non-extreme perplexity, this was not really noticed until #15507.On CPU, this reduces the perplexity of a
Q8_0Mamba-Codestral-7B-v0.1 on the first 10 chunks ofwiki.test.rawquite a lot:Before:
After:
To be clear, there's no need for reconversion of the affected models because this is purely a problem in the
SSM_SCANoperation.However, if
imatrixwas used, it would make sense to recompute that.TODO:
Make sure to read the contributing guidelines before submitting a PR