c0a69df to 22ee634
|
@ggerganov after #16649 and this PR, TG for gpt-oss models should increase by ~9-10% |
|
Curious, but how much does this increase the binary size of the CUDA backend? |
It increases by about 20% (from 30M to 36M on my machine) |
JohannesGaessler
left a comment
I'll do performance testing on either Friday or Saturday when (hopefully) I'll finally be able to get the RTX 5090 that NVIDIA sent me to work.
|
Regarding binary size: when I compile the CUDA backend with In any case, for MMVF we can shave off a bit independently of this PR by only compiling it for cases not covered by MMF. |
|
Since the main use is ncols=1, I am also okay with just doing fusion for that case. |
|
I think that would also be fine. Matrix multiplications with small batch sizes > 1 are relevant for batched inference throughput and speculative decoding, but we can always revisit those cases later. |
a6e0d34 to 9b95697
|
Simplified the code to fuse only when ncols_dst = 1; binary size and compilation time should now be mostly unaffected by this change |
6614a9b to 65a098f
|
When I tested performance:
On the P40 the fused MMVQ kernel does not seem to be consistently faster so I would suggest enabling fusion of that kernel only for Volta and newer. |
|
Thanks for testing! |
|
This might need to be disabled for compute capability 8.7 specifically, in addition to Pascal and older devices; right now I'm seeing a 10% performance loss on a Jetson AGX Orin. Benchmark results: #16815 |
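The gating discussed in the comments above (fuse only for ncols_dst = 1, only on Volta and newer, and possibly not on compute capability 8.7) could be sketched as a small host-side predicate. This is a hypothetical illustration, not code from the PR: the names `CC_VOLTA`, `CC_ORIN`, and `should_fuse` are invented here, and the encoding of compute capability as 100*major + 10*minor is an assumption of the sketch. The P40 observation above concerned the fused MMVQ kernel specifically, so a real check might differ per kernel.

```cpp
// Hypothetical constants; compute capability is assumed to be encoded
// as 100 * major + 10 * minor (e.g. 8.7 -> 870).
constexpr int CC_VOLTA = 700; // Volta is compute capability 7.0
constexpr int CC_ORIN  = 870; // Jetson AGX Orin reports 8.7

// Returns true when the fused GEMV path would be taken under the rules
// proposed in the thread. Hypothetical helper, not the PR's actual check.
constexpr bool should_fuse(int cc, int ncols_dst) {
    return ncols_dst == 1      // fusion is limited to the single-column case
        && cc >= CC_VOLTA      // Pascal and older are excluded
        && cc != CC_ORIN;      // 8.7 excluded pending the reported regression
}
```

With this encoding, a P40 (6.1) and an AGX Orin (8.7) would fall back to the unfused path, while an RTX 4090 (8.9) would use fusion.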
…is resolved revert ggml-org#16715 (+2 squashed commit) Squashed commit: [289af2ee2] Revert "Hide latency of bias and gate-loading (ggml-org#16847)" This reverts commit 8b11dee. [a3e5c1e95] Revert "CUDA: add unused vars to mmvf and mmvq (ggml-org#16807)" This reverts commit 463bbf2.
This is a follow-up to #16630. This PR adds the ability to fuse the following common GEMV operations:
It uses a template bool to determine whether we are in the fusion path, then does runtime checks for which fusion path to take. This PR also splits up mmvq (by type) and mmvf (by ncols-dst), as their compile times were becoming large after this change. This change helps TG (which is IO bound) for almost all classes of models. Apart from adding tests to test-backend-ops, I also spot-checked perplexity on a couple of models and it is unchanged by this change.
Tested on 6x 4090
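As a rough CPU-side illustration of the "template bool plus runtime checks" pattern described above, here is a minimal sketch. It is not the PR's kernel: `FusionArgs`, `gemv`, and `silu` are hypothetical names, and the choice of bias addition and a SiLU-gated product as the fused operations is an assumption made for the example. The point is only that the unfused instantiation compiles without any fusion branches, while the fused instantiation picks its path at runtime from which pointers are set.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical bundle of optional fused operands (null means "not used").
struct FusionArgs {
    const float *bias      = nullptr; // added to the main projection
    const float *gate_w    = nullptr; // second weight matrix, row-major
    const float *gate_bias = nullptr; // added to the gate projection
};

inline float silu(float v) { return v / (1.0f + std::exp(-v)); }

// has_fusion is resolved at compile time; which fused operands are present
// is checked at runtime, and only inside the has_fusion instantiation.
template <bool has_fusion>
std::vector<float> gemv(const std::vector<float> &w, const std::vector<float> &x,
                        std::size_t nrows, const FusionArgs &f = {}) {
    const std::size_t ncols = x.size();
    assert(w.size() == nrows * ncols);
    std::vector<float> dst(nrows);
    for (std::size_t r = 0; r < nrows; ++r) {
        float acc = 0.0f, acc_gate = 0.0f;
        for (std::size_t c = 0; c < ncols; ++c) {
            acc += w[r * ncols + c] * x[c];
            if constexpr (has_fusion) {
                // x is read once and reused for the gate projection.
                if (f.gate_w) acc_gate += f.gate_w[r * ncols + c] * x[c];
            }
        }
        if constexpr (has_fusion) {
            if (f.bias) acc += f.bias[r];
            if (f.gate_w) {
                if (f.gate_bias) acc_gate += f.gate_bias[r];
                acc = silu(acc_gate) * acc; // assumed gating op for this sketch
            }
        }
        dst[r] = acc;
    }
    return dst;
}
```

Calling `gemv<false>(w, x, nrows)` gives the plain matrix-vector product, while `gemv<true>(w, x, nrows, args)` folds in whichever optional operands `args` provides, so the intermediate results never need a separate pass over memory.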