ggml-webgpu: updated matrix-vector multiplication #21738

Merged
reeselevine merged 10 commits into ggml-org:master from reeselevine:k_quant_speedup
Apr 20, 2026

Conversation

neha-ha commented Apr 10, 2026

Contributor

Overview

Improved performance of the matrix-vector multiplication kernel.

Requirements

neha-ha requested review from a team and ggerganov as code owners April 10, 2026 17:31
github-actions bot added the ggml and WebGPU labels Apr 10, 2026
yomaytk commented Apr 13, 2026

Contributor

I've also been working on optimizing the path of k-quants mat-vec.
Could you share some benchmarks on the changes, if possible?

@reeselevine

Contributor

Hey @yomaytk that's awesome! I was going to start looking at that this week as well. @neha-ha might have some numbers but she's also developing on an older Intel Mac, so they might not be representative of more modern hardware.

I was thinking of doing a rewrite of the mat-vec shader so that it matches some of the other llama.cpp backends more closely. The general idea is to have one workgroup work on a set of rows of the matrix, and have each thread tile parts of the vector in registers, creating a set of per-thread local accumulation values. Then, do either a subgroup or a workgroup memory reduction per output, depending on what the device supports.

I was thinking this would be simpler than the current approach of tiling in shared memory and doing a more complex splitting of a workgroup across rows, while hopefully still maintaining/beating its performance. Not sure what optimizations specifically you're working on for k-quants right now, but does this make sense to you? And do you think your optimizations would fit in/complement these changes?
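
To make the idea above concrete, here is a rough CPU model of it in Python. This is only a sketch and not the shader from this PR: the names `tree_reduce` and `matvec_rows_per_workgroup` and the parameters `rows_per_wg` and `wg_size` are made up for illustration, the "threads" are ordinary loops, and the pairwise reduction merely stands in for whichever subgroup or workgroup-memory reduction a device would actually use.

```python
def tree_reduce(values):
    """Pairwise reduction, standing in for a workgroup-memory reduction.

    Assumes the number of values is a power of two, like a typical
    workgroup size.
    """
    vals = list(values)
    n = len(vals)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    stride = n // 2
    while stride > 0:
        for i in range(stride):
            vals[i] += vals[i + stride]
        stride //= 2
    return vals[0]


def matvec_rows_per_workgroup(matrix, vector, rows_per_wg=4, wg_size=8):
    """Compute matrix @ vector the way the comment describes it.

    Each "workgroup" owns a block of rows. Each "thread" walks a strided
    slice of the vector and keeps one private accumulator per row in the
    block. The per-thread accumulators are then reduced once per output.
    """
    m = len(matrix)
    k = len(vector)
    out = [0.0] * m
    for row0 in range(0, m, rows_per_wg):  # one iteration = one workgroup
        rows = list(range(row0, min(row0 + rows_per_wg, m)))
        # acc[t][r] is thread t's private accumulator for row r of the block
        acc = [[0.0] * len(rows) for _ in range(wg_size)]
        for t in range(wg_size):
            for col in range(t, k, wg_size):
                x = vector[col]  # loaded once, reused for every row
                for r, row in enumerate(rows):
                    acc[t][r] += matrix[row][col] * x
        # one reduction per output value
        for r, row in enumerate(rows):
            out[row] = tree_reduce(acc[t][r] for t in range(wg_size))
    return out
```

The point of the layout is that each vector element is read once per thread and reused across all rows in the block, so the only cross-thread communication is the final reduction.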

yomaytk commented Apr 13, 2026

Contributor

@reeselevine Your approach makes sense to me — each thread tiling the vector in registers is simpler and could improve performance. I'm curious to see the results. I've been working on porting the current Q6_K logic to other k-quants types, but since you're already working on a full rewrite, I'm happy to leave the mat-vec work to you and @neha-ha, and focus on other WebGPU tasks :)

@reeselevine

Contributor

Some graphs with performance comparisons based on this branch vs. current master using test-backend-ops perf and this variant for different quantization types:

MUL_MAT(type_a=<TYPE>,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1)
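
For a sense of scale, the work and weight traffic implied by this shape (m=4096, n=1, k=14336) can be estimated with a few lines of Python. This is a back-of-the-envelope sketch, not something taken from the PR: the helper name `matvec_cost` is hypothetical, FLOPs are counted as one multiply plus one add per weight, and the block sizes in `BLOCK_LAYOUT` are the ggml layouts as I understand them, so please check them against the ggml headers before relying on the numbers.

```python
# (bytes per block, weights per block) for a few ggml types.
# Assumed values; verify against the ggml type definitions.
BLOCK_LAYOUT = {
    "f32": (4, 1),
    "f16": (2, 1),
    "q4_0": (18, 32),
    "q8_0": (34, 32),
    "q4_k": (144, 256),
    "q6_k": (210, 256),
}


def matvec_cost(type_a, m=4096, k=14336):
    """Return (flops, weight_bytes) for an m x k matrix times a k-vector."""
    block_bytes, block_weights = BLOCK_LAYOUT[type_a]
    weights = m * k
    assert weights % block_weights == 0, "shape must be a whole number of blocks"
    flops = 2 * weights  # one multiply and one add per weight
    weight_bytes = weights // block_weights * block_bytes
    return flops, weight_bytes


if __name__ == "__main__":
    for name in BLOCK_LAYOUT:
        flops, nbytes = matvec_cost(name)
        print(f"{name:5s} flops={flops:>12,d} weight_bytes={nbytes:>12,d}")
```

Every weight is read exactly once and used for only two floating-point operations, so how quickly the quantized blocks can be streamed and unpacked is likely to matter more here than raw arithmetic throughput.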

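Apple M3:
[image: performance graph, this branch vs. current master]

NVIDIA RTX 4070:
[image: performance graph, this branch vs. current master]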

reeselevine requested a review from CISC April 17, 2026 18:59
reeselevine added the merge ready label Apr 17, 2026
Comment thread src/ggml-webgpu.cpp Outdated
reeselevine merged commit a6cc43c into ggml-org:master Apr 20, 2026
50 of 51 checks passed
ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request Apr 21, 2026
* merged properly, but slow q3_k and q5_k with u32 indexing

* Start on new mat-vec

* New format float paths working

* Working q4_0

* Work on remaining legacy q-types

* port k-quants to new matvec

* remove old shader

* Remove old constants, format

* remove accidental file

---------

Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Apr 23, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026
my-other-github-account pushed a commit to my-other-github-account/llama.cpp that referenced this pull request May 15, 2026
fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026