ggml-webgpu: updated matrix-vector multiplication by neha-ha · Pull Request #21738 · ggml-org/llama.cpp

neha-ha · 2026-04-10T17:31:15Z

Overview

Improved performance of the matrix-vector multiplication kernel.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure:

yomaytk · 2026-04-13T02:58:26Z

I've also been working on optimizing the k-quants mat-vec path.
Could you share some benchmarks on the changes, if possible?

reeselevine · 2026-04-13T04:49:03Z

Hey @yomaytk that's awesome! I was going to start looking at that this week as well. @neha-ha might have some numbers but she's also developing on an older Intel Mac, so they might not be representative of more modern hardware.

I was thinking of doing a rewrite of the mat-vec shader so that it matches some of the other llama.cpp backends more closely. The general idea is to have one workgroup work on a set of rows of the matrix, and have each thread tile parts of the vector in registers, creating a set of per-thread local accumulation values. Then, do either a subgroup or a workgroup memory reduction per output, depending on what the device supports.

I was thinking this would be simpler than the current approach of tiling in shared memory and doing a more complex splitting of a workgroup across rows, while hopefully still maintaining/beating its performance. Not sure what optimizations specifically you're working on for k-quants right now, but does this make sense to you? And do you think your optimizations would fit in/complement these changes?
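The rewrite idea described above can be sketched on the CPU. This is only an illustrative simulation in Python, not the WGSL shader from this PR: the function name `matvec_workgroup` and the parameters `rows_per_workgroup`, `threads_per_workgroup`, and `tile` are hypothetical, and the final `sum` stands in for whichever subgroup or workgroup-memory reduction a device supports.

```python
def matvec_workgroup(matrix, vector, rows_per_workgroup=4,
                     threads_per_workgroup=8, tile=4):
    """CPU simulation of a row-blocked mat-vec with per-thread accumulators.

    All names and default sizes here are illustrative assumptions.
    """
    n_rows = len(matrix)
    k = len(vector)
    out = [0.0] * n_rows

    # Each "workgroup" owns a contiguous set of output rows.
    for wg_start in range(0, n_rows, rows_per_workgroup):
        rows = range(wg_start, min(wg_start + rows_per_workgroup, n_rows))

        # partial[t][r] is thread t's local accumulator for row r.
        partial = [[0.0] * len(rows) for _ in range(threads_per_workgroup)]

        for t in range(threads_per_workgroup):
            # Thread t walks the vector in strided tiles: t, t + T, t + 2T, ...
            stride = threads_per_workgroup * tile
            for base in range(t * tile, k, stride):
                # The vector tile is loaded once (think "registers") and
                # reused for every row the workgroup owns.
                vec_tile = vector[base:base + tile]
                for r_idx, row in enumerate(rows):
                    row_vals = matrix[row]
                    for j, v in enumerate(vec_tile):
                        partial[t][r_idx] += row_vals[base + j] * v

        # Per-output reduction across threads (stand-in for a subgroup
        # or workgroup-memory reduction on a real device).
        for r_idx, row in enumerate(rows):
            out[row] = sum(partial[t][r_idx]
                           for t in range(threads_per_workgroup))
    return out
```

The point of the layout is that each vector tile is read once per thread and reused across all rows of the workgroup, so the only cross-thread communication is the final reduction.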

yomaytk · 2026-04-13T07:08:09Z

@reeselevine Your approach makes sense to me — each thread tiling the vector in registers is simpler and could improve performance. I'm curious to see the results. I've been working on porting the current Q6_K logic to other k-quants types, but since you're already working on a full rewrite, I'm happy to leave the mat-vec work to you and @neha-ha, and focus on other WebGPU tasks :)

reeselevine · 2026-04-17T17:39:12Z

Some graphs with performance comparisons based on this branch vs. current master using test-backend-ops perf and this variant for different quantization types:

MUL_MAT(type_a=<TYPE>,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1)

Apple M3: [performance graph, image not preserved]

NVIDIA RTX 4070: [performance graph, image not preserved]
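For a sense of scale of the benchmarked shape, here is a rough back-of-the-envelope sketch. It assumes `m` is the number of matrix rows and `k` the reduction length, counts one multiply and one add per weight, and uses block layouts (bytes per block, weights per block) as I understand ggml's quantization formats; treat both the helper `matvec_cost` and the table as illustrative assumptions rather than anything measured in this PR.

```python
# Shape taken from the benchmarked variant: m=4096, n=1, k=14336.
M, N, K = 4096, 1, 14336

# (bytes per block, weights per block); assumed from ggml's block layouts.
BLOCK_LAYOUT = {
    "f32": (4, 1),
    "f16": (2, 1),
    "q4_0": (18, 32),
    "q8_0": (34, 32),
    "q4_K": (144, 256),
    "q6_K": (210, 256),
}


def matvec_cost(type_a):
    """Return (flops, weight_bytes, flops_per_weight_byte) for one mat-vec."""
    bytes_per_block, weights_per_block = BLOCK_LAYOUT[type_a]
    flops = 2 * M * N * K  # one multiply and one add per weight
    weight_bytes = M * K // weights_per_block * bytes_per_block
    return flops, weight_bytes, flops / weight_bytes


if __name__ == "__main__":
    for name in BLOCK_LAYOUT:
        flops, weight_bytes, intensity = matvec_cost(name)
        print(f"{name}: {flops} FLOPs, {weight_bytes} weight bytes, "
              f"{intensity:.2f} FLOPs/byte")
```

Under these assumptions every weight is read exactly once per mat-vec, so arithmetic intensity stays low for all types, which is why the way a kernel streams and dequantizes the weights tends to dominate its performance.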

* merged properly, but slow q3_k and q5_k with u32 indexing
* Start on new mat-vec
* New format float paths working
* Working q4_0
* Work on remaining legacy q-types
* port k-quants to new matvec
* remove old shader
* Remove old constants, format
* remove accidental file

---------

Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

merged properly, but slow q3_k and q5_k with u32 indexing

3c36b55

neha-ha requested review from a team and ggerganov as code owners April 10, 2026 17:31

github-actions Bot added the ggml (changes relating to the ggml tensor library for machine learning) and WebGPU labels Apr 10, 2026

reeselevine added 7 commits April 14, 2026 08:25

Start on new mat-vec

3c9e474

New format float paths working

0bcf75c

Working q4_0

01bd912

Work on remaining legacy q-types

f839c10

port k-quants to new matvec

ba96122

remove old shader

b4b6ffc

Merge remote-tracking branch 'upstream/master' into k_quant_speedup

83a0d38

reeselevine force-pushed the k_quant_speedup branch from 4125941 to 83a0d38 April 17, 2026 17:11

Remove old constants, format

ca49e73

reeselevine approved these changes Apr 17, 2026

reeselevine requested a review from CISC April 17, 2026 18:59

reeselevine added the merge ready label (a maintainer can use this label to indicate that they consider the changes final and ready to merge) Apr 17, 2026

Constannnnnt mentioned this pull request Apr 17, 2026

ggml-webgpu(shader): support conv2d kernels. #21964

Merged

CISC approved these changes Apr 17, 2026

Comment thread src/ggml-webgpu.cpp Outdated

remove accidental file

b92011e

reeselevine approved these changes Apr 19, 2026

ggerganov approved these changes Apr 20, 2026

reeselevine merged commit a6cc43c into ggml-org:master Apr 20, 2026
50 of 51 checks passed

reeselevine merged 10 commits into ggml-org:master from reeselevine:k_quant_speedup

5 participants
