Conversation
This PR adds fused QKV multiplication

I pulled this change but I don't think I'm seeing any change in the output from ggml_backend_sched_print_assignments, so I don't think it's kicking in. I'm using Qwen_Qwen3-30B-A3B-Q2_K.gguf. Is there anything I need to do to enable it?

@jeffbolznv I only tested Q4_0.

I grabbed the Q4_0 model and I do see the combined weights with it. I think the Q2_K model may not have had matching types for the Q, K and V weights. For the Vulkan backend, I see maybe a couple percent gain for pp512 and maybe less than a percent for tg128 (not sure if that's just noise). I think Vulkan already gets most of the benefit of this from reordering and scheduling optimizations in the backend, and those optimizations work even with mismatching types. I haven't enabled all of those optimizations for pp (I initially saw some slowdowns and didn't pursue it further).

@jeffbolznv currently the CUDA backend does not do scheduling the way Vulkan does. From what I understand we would need to create separate streams in CUDA graphs, and we currently only use 1 stream (unless doing …). Nevertheless, this PR would be useful everywhere we don't have graph optimizations yet, although a couple percent of PP is also nothing to turn down :)
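
To make the two strategies concrete, here is a minimal PyTorch sketch (not the ggml/CUDA implementation; the shapes and sizes are made up for illustration) contrasting three matmuls overlapped on separate streams with a single fused GEMM over concatenated weights:

```python
import torch

d_model, n_tokens = 4096, 512  # illustrative sizes, not from the PR
x  = torch.randn(n_tokens, d_model, device="cuda")
wq = torch.randn(d_model, d_model, device="cuda")
wk = torch.randn(d_model, d_model, device="cuda")
wv = torch.randn(d_model, d_model, device="cuda")
torch.cuda.synchronize()  # make sure inputs are ready before forking streams

# Variant 1: three GEMMs overlapped on separate streams, i.e. what a
# backend scheduler can do without touching the weights.
streams = [torch.cuda.Stream() for _ in range(3)]
outs = [None, None, None]
for i, (s, w) in enumerate(zip(streams, (wq, wk, wv))):
    with torch.cuda.stream(s):
        outs[i] = x @ w
torch.cuda.synchronize()

# Variant 2: one fused GEMM over the concatenated weight, i.e. what this
# PR achieves at the weight level, followed by a cheap split.
w_qkv = torch.cat((wq, wk, wv), dim=1)  # [d_model, 3*d_model]
qkv = x @ w_qkv                         # single kernel launch
q, k, v = qkv.split(d_model, dim=1)
torch.cuda.synchronize()
```

The fused variant replaces stream-level parallelism with one larger kernel launch, which is why it should help most on backends that only run a single stream.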

On M2 Ultra I don't see much difference from this change (if anything, it's slightly worse): `./scripts/compare-commits.sh master pr/16813 llama-bench -m ./models/qwen3-30b-a3b-coder/ggml-model-q4_0.gguf -m ./models/qwen3-8b-base/ggml-model-f16.gguf -t 1 -fa 1 -b 16384 -ub 2048 -p 512,2048 -n 64 -mmp 0`
The Metal backend does have the graph optimization for making the Q, K and V multiplications run concurrently, so it's expected not to see gains here.

This seems like the better approach.

Actually, from what I see, a lot of quants use different tensor types for the Q, K and V weights. So it might not be that beneficial to do this anyway.
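
One way to check this for a given model is a small script over gguf-py's GGUFReader, reporting layers whose Q, K and V projections use different tensor types. The blk.N.attn_{q,k,v}.weight naming follows llama.cpp's conventions; the file path is a placeholder:

```python
from collections import defaultdict
from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("model.gguf")  # placeholder path

# Collect the quantization type of each attention projection, per layer.
by_layer = defaultdict(dict)
for t in reader.tensors:
    parts = t.name.split(".")
    if (len(parts) == 4 and parts[0] == "blk"
            and parts[2] in ("attn_q", "attn_k", "attn_v")
            and parts[3] == "weight"):
        by_layer[int(parts[1])][parts[2]] = t.tensor_type

# A layer is only fusable if all three projections share a type.
for layer, types in sorted(by_layer.items()):
    if len(set(types.values())) > 1:
        print(f"blk.{layer}: mismatched QKV types: {types}")
```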

This is a draft PR for comments about QKV fusion. Performance-wise, I checked CUDA and it seems to give roughly a 4-5% gain (in both PP and TG) for Qwen3 models.

This tries to merge the weights together to form a single GEMM, rather than 3 separate ones. The main drawback of this approach is that there might be other tensors in between the Q, K, V weights (like LoRAs), which would break mmap (as @slaren pointed out). However, doing this in the backend is a much more involved change which might not be worth the benefit. For now I'm thinking the best way forward is `convert_hf_to_gguf.py`: a `--fuse-qkv-weights` flag can be provided to take this path. IMO the QKV weights should always live together in the GGUF file, without exception, as they're naturally used together.
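
For illustration, here is a hypothetical numpy sketch of the transformation such a `--fuse-qkv-weights` step could apply (this is not the actual `convert_hf_to_gguf.py` change): the three projection weights are concatenated row-wise so a single GEMM produces Q, K and V together.

```python
import numpy as np

d_model = 8  # toy size for the example
rng = np.random.default_rng(0)
wq = rng.standard_normal((d_model, d_model), dtype=np.float32)
wk = rng.standard_normal((d_model, d_model), dtype=np.float32)
wv = rng.standard_normal((d_model, d_model), dtype=np.float32)

# Fuse: one [3*d_model, d_model] tensor, stored contiguously in the GGUF,
# so a single mat_mul yields Q, K and V in one pass over x.
w_qkv = np.concatenate((wq, wk, wv), axis=0)

x = rng.standard_normal((4, d_model), dtype=np.float32)  # 4 tokens
qkv = x @ w_qkv.T                    # single GEMM
q, k, v = np.split(qkv, 3, axis=1)   # cheap split of the result

# Sanity check: identical to the three separate GEMMs.
assert np.allclose(q, x @ wq.T)
assert np.allclose(k, x @ wk.T)
assert np.allclose(v, x @ wv.T)
```

The LoRA/mmap caveat above is exactly about this layout: fusing assumes the three tensors can be laid out back to back in the file.
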
Some performance numbers on a 4090:
Without fusion
With fusion