ggml-webgpu: updated matrix-vector multiplication #21738
I've also been working on optimizing the k-quants mat-vec path.
Hey @yomaytk that's awesome! I was going to start looking at that this week as well. @neha-ha might have some numbers, but she's also developing on an older Intel Mac, so they might not be representative of more modern hardware. I was thinking of rewriting the mat-vec shader so that it matches some of the other llama.cpp backends more closely. The general idea is to have one workgroup work on a set of rows of the matrix, and have each thread tile parts of the vector in registers, creating a set of per-thread local accumulation values. Then, do either a subgroup or a workgroup-memory reduction per output, depending on what the device supports. I was thinking this would be simpler than the current approach of tiling in shared memory and doing a more complex splitting of a workgroup across rows, while hopefully still matching or beating its performance. Not sure what optimizations specifically you're working on for k-quants right now, but does this make sense to you? And do you think your optimizations would fit in with or complement these changes?
@reeselevine Your approach makes sense to me: having each thread tile the vector in registers is simpler and could improve performance. I'm curious to see the results. I've been working on porting the current Q6_K logic to the other k-quant types, but since you're already working on a full rewrite, I'm happy to leave the mat-vec work to you and @neha-ha, and focus on other WebGPU tasks :)
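For readers following the thread, the scheme discussed above (one workgroup per set of rows, per-thread register accumulators, then a per-output reduction) can be sketched on the CPU. This is only an illustrative emulation in Python, not the shader from this PR: `WG_SIZE`, `ROWS_PER_WG`, `tree_reduce`, and `matvec` are hypothetical names, the sizes are arbitrary, and the tree reduction stands in for whichever subgroup or workgroup-memory reduction the device supports.

```python
# Hypothetical CPU emulation of the proposed mat-vec layout.
# All names and sizes here are assumptions for illustration only.

WG_SIZE = 4       # threads per workgroup (assumed; must be a power of two here)
ROWS_PER_WG = 2   # output rows handled by one workgroup (assumed)


def tree_reduce(partials):
    """Emulate a workgroup-memory tree reduction over per-thread partial sums."""
    vals = list(partials)
    stride = len(vals) // 2
    while stride > 0:
        for t in range(stride):
            vals[t] += vals[t + stride]
        stride //= 2
    return vals[0]


def matvec(matrix, vec):
    """Compute matrix @ vec using the workgroup/thread decomposition above."""
    n_rows, n_cols = len(matrix), len(vec)
    out = [0.0] * n_rows
    # Each iteration of this loop plays the role of one workgroup.
    for wg_start in range(0, n_rows, ROWS_PER_WG):
        rows = range(wg_start, min(wg_start + ROWS_PER_WG, n_rows))
        # acc[t][i]: thread t's private accumulator for the i-th row of this workgroup.
        acc = [[0.0] * len(rows) for _ in range(WG_SIZE)]
        for t in range(WG_SIZE):
            # Thread t strides across the columns of the vector.
            for c in range(t, n_cols, WG_SIZE):
                x = vec[c]  # loaded once ("in a register") and reused for every row
                for i, r in enumerate(rows):
                    acc[t][i] += matrix[r][c] * x
        # One reduction per output row combines the per-thread partial sums.
        for i, r in enumerate(rows):
            out[r] = tree_reduce([acc[t][i] for t in range(WG_SIZE)])
    return out
```

The point of the layout is that each vector element is read once per thread and reused across all rows owned by the workgroup, so the only cross-thread communication is the final reduction.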
* merged properly, but slow q3_k and q5_k with u32 indexing
* Start on new mat-vec
* New format float paths working
* Working q4_0
* Work on remaining legacy q-types
* port k-quants to new matvec
* remove old shader
* Remove old constants, format
* remove accidental file

---------

Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>


Overview
Improved performance of the matrix-vector multiplication kernel.
Requirements