ggml webgpu: faster matrix multiplication/matrix-vector multiplication by reeselevine · Pull Request #17031 · ggml-org/llama.cpp

reeselevine · 2025-11-05T17:02:24Z

Adds the following:

- Two matrix multiplication implementations: one using register tiling and the other using "subgroup matrices" (WebGPU's feature that allows access to tensor cores/optimized subgroup (warp) routines on devices that have them). Subgroup matrices are currently experimental, and on devices where they're not supported, the code falls back to the register tiling approach.
- A somewhat faster matrix-vector multiplication (it still needs some work, but it's a decent start I think).
- Support for f32/f16/q4_0 in this code, set up in a way that I think will make integrating other quantization types easier.
- An update to the Dawn version the WebGPU backend is built against.
- A move to a new format for pipeline initialization, with the eventual goal of making initialization lazy/smarter so we don't carry around a ton of compiled shaders that are never used in the browser.
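For readers unfamiliar with register tiling, here is a minimal CPU-side sketch of the idea in C++. It is not the WGSL shader from this PR: the tile sizes, the function name `matmul_register_tiled`, and the row-major layout are all assumptions made for illustration. Each tile of the output is accumulated in a small local array (standing in for registers), so every value loaded from A or B is reused across the whole tile.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile sizes: each "thread" would own a TILE_M x TILE_N block of C.
constexpr std::size_t TILE_M = 4;
constexpr std::size_t TILE_N = 4;

// Computes C (M x N) = A (M x K) * B (K x N), all row-major.
// Illustrative only; the PR's actual kernels are WGSL shaders.
std::vector<float> matmul_register_tiled(const std::vector<float>& a,
                                         const std::vector<float>& b,
                                         std::size_t M, std::size_t K, std::size_t N) {
    assert(a.size() == M * K && b.size() == K * N);
    std::vector<float> c(M * N, 0.0f);
    for (std::size_t i0 = 0; i0 < M; i0 += TILE_M) {
        for (std::size_t j0 = 0; j0 < N; j0 += TILE_N) {
            // Accumulators small enough to stay in registers on a GPU.
            float acc[TILE_M][TILE_N] = {};
            for (std::size_t k = 0; k < K; ++k) {
                // Load a column fragment of A and a row fragment of B once
                // (zero-padded at the matrix edges)...
                float a_frag[TILE_M];
                float b_frag[TILE_N];
                for (std::size_t ti = 0; ti < TILE_M; ++ti) {
                    a_frag[ti] = (i0 + ti < M) ? a[(i0 + ti) * K + k] : 0.0f;
                }
                for (std::size_t tj = 0; tj < TILE_N; ++tj) {
                    b_frag[tj] = (j0 + tj < N) ? b[k * N + (j0 + tj)] : 0.0f;
                }
                // ...then reuse each loaded value across the whole tile.
                for (std::size_t ti = 0; ti < TILE_M; ++ti) {
                    for (std::size_t tj = 0; tj < TILE_N; ++tj) {
                        acc[ti][tj] += a_frag[ti] * b_frag[tj];
                    }
                }
            }
            // Write the finished tile back, skipping out-of-range elements.
            for (std::size_t ti = 0; ti < TILE_M && i0 + ti < M; ++ti) {
                for (std::size_t tj = 0; tj < TILE_N && j0 + tj < N; ++tj) {
                    c[(i0 + ti) * N + (j0 + tj)] = acc[ti][tj];
                }
            }
        }
    }
    return c;
}
```

Calling `matmul_register_tiled(a, b, M, K, N)` with row-major inputs returns the row-major product. On a GPU the two outer loops would roughly correspond to the grid of threads, with each thread owning one tile.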

Some preliminary performance numbers on my M3:

Llama-3.2-1B-Instruct-F16

WebGPU:

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 1B F16                   |   2.30 GiB |     1.24 B | WebGPU     |  99 |           pp512 |       1014.17 ± 9.38 |
| llama 1B F16                   |   2.30 GiB |     1.24 B | WebGPU     |  99 |           tg128 |         28.71 ± 0.19 |

Metal:

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 1B F16                   |   2.30 GiB |     1.24 B | Metal      |  99 |           pp512 |       1368.47 ± 0.95 |
| llama 1B F16                   |   2.30 GiB |     1.24 B | Metal      |  99 |           tg128 |         35.99 ± 0.78 |

Llama-3.2-1B-Instruct-Q4_0

WebGPU:

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 1B Q4_0                  | 729.75 MiB |     1.24 B | WebGPU     |  99 |           pp512 |        960.52 ± 6.05 |
| llama 1B Q4_0                  | 729.75 MiB |     1.24 B | WebGPU     |  99 |           tg128 |         41.76 ± 0.62 |

Metal:

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 1B Q4_0                  | 729.75 MiB |     1.24 B | Metal      |  99 |           pp512 |       1346.68 ± 1.21 |
| llama 1B Q4_0                  | 729.75 MiB |     1.24 B | Metal      |  99 |           tg128 |        103.92 ± 0.37 |
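
To make the tables easier to compare, the WebGPU-to-Metal ratio can be computed directly from the mean t/s values above. This is plain arithmetic on the reported numbers, not an additional measurement, and the helper name `relative_throughput` is hypothetical. By this calculation WebGPU reaches roughly 74% (pp512) and 80% (tg128) of Metal on F16, and roughly 71% (pp512) and 40% (tg128) on Q4_0.

```cpp
#include <cassert>

// Ratio of WebGPU throughput to Metal throughput (both in tokens/s).
// Hypothetical helper for reading the tables; not part of the PR.
double relative_throughput(double webgpu_tps, double metal_tps) {
    assert(metal_tps > 0.0);
    return webgpu_tps / metal_tps;
}
```

For example, `relative_throughput(41.76, 103.92)` gives the Q4_0 tg128 ratio, which is the largest remaining gap in these numbers.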

Add fast matrix and matrix/vector multiplication.

CISC

Who can/should review the webgpu part?

.github/workflows/build.yml

reeselevine · 2025-11-05T22:34:00Z

> Who can/should review the webgpu part?

@CISC perhaps the other person in this repository who has the most context on web stuff in general is @ngxson, but I'm not sure how busy they are. Otherwise, it would be great to get one or two more serious collaborators on the WebGPU backend, but I'm not sure who that would be at the moment. I think we're getting closer to a demo in the browser running llama.cpp with WebGPU integration, so if that is publicized a bit when it happens, maybe it'll lead to some more interest in helping out.

CISC · 2025-11-06T08:27:11Z

> Who can/should review the webgpu part?

> @CISC perhaps the other person in this repository who has the most context on web stuff in general is @ngxson, but I'm not sure how busy they are. Otherwise, it would be great to get one or two more serious collaborators on the WebGPU backend, but I'm not sure who that would be at the moment.

Yes, it's always a little tricky when only one person knows a part of the codebase.

> I think we're getting closer to a demo in the browser running llama.cpp with WebGPU integration, so if that is publicized a bit when it happens, maybe it'll lead to some more interest in helping out.

For sure, drumming up some publicity here and on the usual channels like LocalLlama etc when the time comes is a given. @ggerganov make a mental note. :)

ngxson

I think having a second contributor dedicated to WebGPU would be nice. Personally, I know more about web development in general and am not particularly good at WebGPU stuff.

Re. the introduction of ggml_webgpu_process_shader_repls in this PR: it's probably not necessary, as shaders/kernels are compiled statically into different versions on other backends too. These shaders are often small, so I think we should keep compiling them statically for simplicity.

ggml/src/ggml-webgpu/ggml-webgpu.cpp

reeselevine · 2025-11-07T03:26:08Z

> Re. the introduction of ggml_webgpu_process_shader_repls in this PR, probably it's not necessary as shaders/kernels are compiled statically to different version on other backends too. These shaders are often small, so I think we should keep compiling them statically for simplification.

The reason I added this is that some values, e.g., WEBGPU_MUL_MAT_SUBGROUP_MATRIX_M, cannot be override constants due to some constraints in WGSL compilers (which I think could be improved). So to avoid defining these values in two places and having to manually make sure they're in sync, this function guarantees they are.
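
As a rough illustration of the idea (not the actual `ggml_webgpu_process_shader_repls` implementation, whose signature isn't shown in this thread), a map-based substitution pass over the shader source could look like the sketch below. The helper name `apply_shader_repls` and the `{{...}}` placeholder syntax are assumptions; the point is only that the C++ side holds the single definition of each value and writes it into the shader text before compilation.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper: replaces every occurrence of each placeholder key in
// the shader source with its value, so constants are defined in one place.
std::string apply_shader_repls(std::string src,
                               const std::map<std::string, std::string>& repls) {
    for (const auto& kv : repls) {
        const std::string& key = kv.first;
        const std::string& value = kv.second;
        if (key.empty()) {
            continue;  // an empty key would match everywhere
        }
        std::size_t pos = 0;
        while ((pos = src.find(key, pos)) != std::string::npos) {
            src.replace(pos, key.size(), value);
            pos += value.size();  // continue after the inserted text
        }
    }
    return src;
}
```

With a map such as `{{"{{WEBGPU_MUL_MAT_SUBGROUP_MATRIX_M}}", "8"}}` (the value 8 is made up for the example), every use of the placeholder in the shader text is replaced by the same number the host code uses.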

I think there's a bigger question around how/when to generate shaders, which I'm open to ideas/feedback on. Probably it makes sense to move away from the Python script at some point, which I made just for early speed/iteration, and use a C++ solution, like what Vulkan does.

reeselevine · 2025-11-08T03:27:12Z

I also just want to drop a quick acknowledgement here to @SharmaRithik, @xuyanwen2012, @Ant-28, and @tyler-utah, who helped write and design the infrastructure around making these shaders possible.


Faster tensors (#8)

c6bc125

Add fast matrix and matrix/vector multiplication.

reeselevine requested a review from CISC as a code owner November 5, 2025 17:02

github-actions bot added the python (python script changes), devops (improvements to build systems and github actions), and ggml (changes relating to the ggml tensor library for machine learning) labels Nov 5, 2025

CISC reviewed Nov 5, 2025

.github/workflows/build.yml (resolved)

ngxson approved these changes Nov 6, 2025

ggml/src/ggml-webgpu/ggml-webgpu.cpp (outdated, resolved)

Use map for shader replacements instead of pair of strings

7c2b2ef

reeselevine merged commit 647b960 into ggml-org:master Nov 8, 2025
65 of 70 checks passed

reeselevine merged 2 commits into ggml-org:master from reeselevine:master
