ggml-cpu: handle 3d tensors in repack mat_mul #17241
max-krasnyansky merged 7 commits into ggml-org:master
Conversation
@max-krasnyansky can you please give this PR a shot and let me know if the perf is fixed? I've simplified the chunking a lot (essentially left it as it is for 2d tensors and "iterate" over planes).
Yep, looks great! Thanks for the quick follow-up. I'm marking it as ready to merge and approving. BTW, if you have some more time/energy, it'd be great to add chunking to the repacked mul_mat_id for the MoE models.
Thanks @max-krasnyansky. I'll be collaborating every now and then, but I have a couple of implementations of the repacked q4_K to address first. Not sure if you are able to merge; if not, I'll just ping gerganov once CI passes. Thanks again! Edit: Not sure if by marking it "ready to merge" you meant to merge once CI passed.
I meant switching from "Draft" to "Ready" :) |
* ggml-cpu: handle 3d tensors in repack mul_mat
* Removed unnecessary branch, removed need for <algorithm>
* Fixed dst_ptr pointer in chunk + clang-format
* GGML_ASSERT to check wdata within bounds
* Accidental ggml.h inclusion
* Improved GGML_ASSERT on wdata boundaries
* Address performance regression in Qwen and llama.cpp due to chunking
This is a continuation of #17030 after a performance regression was reported.
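The approach discussed above (keeping the existing 2d chunking and iterating over planes for the 3d case) can be sketched as follows. This is a hypothetical standalone illustration, not the actual ggml repack code; all names, sizes, and the naive kernel are invented for the example:

```c
#include <assert.h>

/* Toy dimensions for the sketch: dst is M x N per plane, src0 is M x K,
 * src1 is K x N, with PLANES independent 2d slices (the "3d" batch dim). */
enum { M = 4, N = 3, K = 2, PLANES = 2, CHUNK = 2 };

/* Naive 2d matmul kernel restricted to dst rows [ir0, ir1) — this stands
 * in for the existing 2d chunked path, which is left unchanged. */
static void mul_mat_2d_chunk(const float *a, const float *b, float *dst,
                             int ir0, int ir1) {
    for (int i = ir0; i < ir1; i++) {
        for (int j = 0; j < N; j++) {
            float s = 0.0f;
            for (int k = 0; k < K; k++) {
                s += a[i*K + k] * b[k*N + j];
            }
            dst[i*N + j] = s;
        }
    }
}

/* 3d case: run the unchanged 2d chunking once per plane, offsetting the
 * base pointers by the plane stride instead of changing the chunk logic. */
static void mul_mat_3d(const float *a, const float *b, float *dst) {
    for (int p = 0; p < PLANES; p++) {
        const float *ap  = a   + p*M*K;
        const float *bp  = b   + p*K*N;
        float       *dp  = dst + p*M*N;
        for (int ir0 = 0; ir0 < M; ir0 += CHUNK) {
            int ir1 = ir0 + CHUNK < M ? ir0 + CHUNK : M;
            mul_mat_2d_chunk(ap, bp, dp, ir0, ir1);
        }
    }
}
```

In the real code the per-plane offsets would come from the tensor's strides rather than dense `M*K`-style products, but the structure (outer plane loop, unmodified inner 2d chunk loop) is the idea the comment describes.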
Perplexity Comparison (Repack vs Non-Repack)

Command:

Llama-bench (M4 Max)

build: c77bafd (6967) THIS PR
build: 2776db6 (7047) MASTER