ggml webgpu: faster normal quant and some k-quant matrix operations, better shader parameter handling by reeselevine · Pull Request #20173 · ggml-org/llama.cpp

reeselevine · 2026-03-06T18:25:02Z

TLDR: The WebGPU implementation is now pretty fast for some quantization types, and hopefully it is more stable and works on a decent number of devices. Try this code out in wllama here: https://reeselevine.github.io/wllama/. At least on my machine, it mostly seems to outperform WebLLM and ONNX Runtime Web (through transformers.js) on roughly equivalent models. If there are any issues, let's try to fix them!

* Adds faster matrix-matrix and matrix-vector multiplication for all normal (q4-q8) quantization types.
* Adds faster q6_k matrix-vector dequantization and decently faster matrix-matrix dequantization for all q_k types.
* Pretty major improvement in shader parameter handling: we can use queue.WriteBuffer directly, which avoids an extra buffer for copying parameters from host to device and removes blocking operations from the encode path.
* Moves to a single error buffer for set_rows that is checked at the end of each graph_compute, instead of a separate error buffer for each set_rows operation.
* Better separation of the GPU profiling timestamp buffers. This can be improved further, but I'm leaving it for now.
* Some other minor formatting cleanups.
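To illustrate the single-error-buffer idea for set_rows, here is a minimal CPU-only sketch. It is not the backend's code: `ErrorBuffer`, the simulated `set_rows` and `graph_compute` functions, and the choice of "out-of-range row index" as the error condition are all assumptions made for illustration. The point is only the shape of the pattern: every set_rows operation writes to one shared flag, and the flag is inspected once after the whole graph has run.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// Hypothetical stand-in for one device-side error flag shared by all set_rows ops.
struct ErrorBuffer {
    uint32_t flag = 0;  // 0 = ok, nonzero = at least one set_rows reported a problem
};

// Simulated set_rows: copies src row i into dst row idx[i].
// Assumption: a bad index is recorded in the shared flag and the write is skipped.
void set_rows(Matrix & dst, const Matrix & src,
              const std::vector<int64_t> & idx, ErrorBuffer & err) {
    for (std::size_t i = 0; i < idx.size() && i < src.size(); ++i) {
        if (idx[i] < 0 || static_cast<std::size_t>(idx[i]) >= dst.size()) {
            err.flag = 1;  // record the error, do not stop the graph
            continue;
        }
        dst[static_cast<std::size_t>(idx[i])] = src[i];
    }
}

// Simulated graph_compute: runs every set_rows op against the same error
// buffer, then checks that buffer exactly once at the end.
bool graph_compute(Matrix & dst, const Matrix & src,
                   const std::vector<std::vector<int64_t>> & ops) {
    ErrorBuffer err;  // one buffer for the whole graph, not one per op
    for (const auto & idx : ops) {
        set_rows(dst, src, idx, err);
    }
    return err.flag == 0;
}
```

With a 4-row destination and a 2-row source, a graph whose index lists are `{0, 3}` and `{1, 2}` returns `true`, while a graph containing an index such as `9` returns `false` after all operations have run. The trade-off of this design is that the check no longer tells you which set_rows operation failed, only that one did.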

K quant speedup (#20)

* Basic JIT compilation for mul_mat, get_rows, and scale (#17)
* scale jit working
* preliminary working jit for getrows and mulmat, needs refining
* simplified mul_mat preprocessing switch statement
* get_rows fixes, mul_mat refinement
* formatted + last edits
* removed some extraneous prints
* fixed get_rows, fixed workgroup dispatch in mul_mat. no gibberish
* small fix
* some changes, working
* get_rows and mul_mat jit fixed and working
* Update formatting
* formatting
* Add header

---------

Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Start work on all-encompassing shader library
* refactor argmax, set_rows
* Refactor all but flashattention, mat mul
* no gibberish, all k quants added, merged
* vec memory fix
* q6_k matching metal on my machine, tests passing
* Set tile size for q6_k separately
* Separate out fast shaders

---------

Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com>

nikhilJain17

The logic changes around future submissions and waiting lgtm!

reeselevine and others added 5 commits March 3, 2026 11:46

Move towards writeBuffer for params

3a0d3e1

Move away from multiple buffers for set_rows errors, remove host buffer for parameter buffers, minor cleanups

efab3df

Merge remote-tracking branch 'upstream/master'

d77731c

Remove extra file

02cac09

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Mar 6, 2026

CISC approved these changes Mar 6, 2026

Review thread on ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_decls.tmpl (outdated, resolved)

yomaytk mentioned this pull request Mar 8, 2026

ggml-webgpu: Add supports for GGML_OP_REPEAT #20230

Merged

Formatting

1dbdc5b

nikhilJain17 approved these changes Mar 9, 2026

reeselevine merged commit aa2d278 into ggml-org:master Mar 10, 2026
71 of 78 checks passed

loci-dev mentioned this pull request Mar 11, 2026

UPSTREAM PR #20230: ggml-webgpu: Add supports for GGML_OP_REPEAT auroralabs-loci/llama.cpp#1240

Open

reeselevine merged 6 commits into ggml-org:master from reeselevine:master
