ggml webgpu: faster normal quant and some k-quant matrix operations, better shader parameter handling#20173

Merged
reeselevine merged 6 commits into ggml-org:master from reeselevine:master
Mar 10, 2026
Conversation

@reeselevine (Contributor)

TLDR: The WebGPU implementation is pretty fast now for some quantization types, and it should be more stable and work on a wider range of devices. You can try it out in wllama here: https://reeselevine.github.io/wllama/. At least on my machine, it mostly outperforms WebLLM and ONNX Runtime Web (via transformers.js) on roughly equivalent models. If you run into any issues, let's try to fix them!

  • Adds faster matrix-matrix and matrix-vector multiplication for all normal (q4-q8) quantization types.
  • Adds faster q6_k matrix-vector dequantization, and decently faster matrix-matrix dequantization for all k-quant types.
  • Pretty major improvement in shader parameter handling. I realized we can just use queue.WriteBuffer instead of keeping an extra staging buffer to copy parameters from host to device, which removes blocking operations from the encode path.
  • Moved to a single error buffer for set_rows, checked once at the end of each graph_compute, instead of a separate error buffer per set_rows operation.
  • Better separation of the GPU profiling timestamp buffers. This can be improved further, but leaving it for now.
  • Some other minor formatting cleanups.
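The queue.WriteBuffer change above can be sketched roughly as follows. This is a hypothetical illustration using Dawn's C++ WebGPU bindings (webgpu_cpp.h); `ShaderParams` and `upload_params` are made-up names for illustration, not the actual ggml-webgpu code.

```cpp
// Hypothetical sketch of the parameter-upload pattern; not the actual
// ggml-webgpu implementation. Assumes Dawn's C++ bindings (webgpu_cpp.h).
#include <webgpu/webgpu_cpp.h>
#include <cstdint>

// Example uniform block for a shader (illustrative fields only).
struct ShaderParams {
    uint32_t ne0, ne1, ne2, ne3;
};

// Old approach: memcpy params into a mappable host staging buffer, then
// record a CopyBufferToBuffer in the command encoder; the map/unmap cycle
// introduces blocking work on the encode path.
//
// New approach: Queue::WriteBuffer copies the data internally and returns
// immediately, so the upload is scheduled by the implementation and nothing
// blocks while the compute graph is being encoded.
void upload_params(wgpu::Queue queue, wgpu::Buffer params_buf,
                   const ShaderParams & p) {
    // params_buf must be created with BufferUsage::Uniform | BufferUsage::CopyDst.
    queue.WriteBuffer(params_buf, /*bufferOffset=*/0, &p, sizeof(p));
}
```

The design point is that WriteBuffer moves the host-to-device copy off the encode path, so command encoding never waits on a buffer map.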

reeselevine and others added 5 commits March 3, 2026 11:46
* Basic JIT compilation for mul_mat, get_rows, and scale (#17)

* scale jit working

* preliminary working jit for getrows and mulmat, needs refining

* simplified mul_mat preprocessing switch statement

* get_rows fixes, mul_mat refinement

* formatted + last edits

* removed some extraneous prints

* fixed get_rows, fixed workgroup dispatch in mul_mat. no gibberish

* small fix

* some changes, working

* get_rows and mul_mat jit fixed and working

* Update formatting

* formatting

* Add header

---------

Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Start work on all-encompassing shader library

* refactor argmax, set_rows

* Refactor all but flashattention, mat mul

* no gibberish, all k quants added, merged

* vec memory fix

* q6_k matching metal on my machine, tests passing

* Set tile size for q6_k separately

* Separate out fast shaders

---------

Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com>
@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Mar 6, 2026
@nikhilJain17 (Contributor) left a comment
The logic changes around future submissions and waiting lgtm!

@reeselevine reeselevine merged commit aa2d278 into ggml-org:master Mar 10, 2026
71 of 78 checks passed
ProgenyAlpha pushed a commit to ProgenyAlpha/llama.cpp that referenced this pull request Mar 12, 2026
…better shader parameter handling (ggml-org#20173)

* K quant speedup (ggml-org#20)

* Move towards writeBuffer for params

* Move away from multiple buffers for set_rows errors, remove host buffer for parameter buffers, minor cleanups

* Remove extra file

* Formatting

---------

Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com>

3 participants