ggml webgpu: faster matrix multiplication/matrix-vector multiplication#17031
reeselevine merged 2 commits into ggml-org:master
Add fast matrix and matrix/vector multiplication.
CISC left a comment:
Who can/should review the webgpu part?
@CISC perhaps the other person in this repository with the most context on web stuff in general is @ngxson, but I'm not sure how busy they are. Otherwise, it would be great to get one or two more serious collaborators on the WebGPU backend, but I'm not sure who that would be at the moment. I think we're getting closer to a browser demo running llama.cpp with WebGPU integration, so if that gets publicized a bit when it happens, maybe it'll lead to more interest in helping out.
Yes, it's always a little tricky when only one person has knowledge of a piece of the codebase.
For sure, drumming up some publicity here and on the usual channels like LocalLlama etc. when the time comes is a given. @ggerganov make a mental note. :)
I think having a second contributor dedicated to WebGPU would be nice. Personally, I know more about web development in general and am not particularly good at WebGPU stuff.
Re. the introduction of ggml_webgpu_process_shader_repls in this PR: it's probably not necessary, as shaders/kernels are compiled statically into different versions on other backends too. These shaders are often small, so I think we should keep compiling them statically for simplicity.
The reason I added this is that some values, e.g., I think there's a bigger question around how/when to generate shaders, which I'm open to ideas/feedback on. Probably it makes sense to move away from the Python script at some point, which I made just for early speed/iteration, and use a C++ solution, like what Vulkan does.
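For readers following the discussion, here is a minimal sketch of what map-based shader replacement at runtime could look like. It is not the PR's code: the helper name `apply_shader_repls` and the `{{NAME}}` placeholder syntax are hypothetical, chosen only to illustrate substituting values into shader source text before compilation.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper, NOT the PR's ggml_webgpu_process_shader_repls.
// Replaces every occurrence of each placeholder key in the shader source
// with its mapped value and returns the resulting source string.
static std::string apply_shader_repls(std::string src,
                                      const std::map<std::string, std::string> & repls) {
    for (const auto & kv : repls) {
        const std::string & key   = kv.first;
        const std::string & value = kv.second;
        if (key.empty()) {
            continue; // an empty key would match at every position
        }
        std::size_t pos = 0;
        while ((pos = src.find(key, pos)) != std::string::npos) {
            src.replace(pos, key.size(), value);
            pos += value.size(); // continue after the inserted text
        }
    }
    return src;
}
```

With a map such as `{{"{{TYPE}}", "f16"}, {"{{WG_SIZE}}", "64"}}`, a template line like `var<workgroup> t: array<{{TYPE}}, {{WG_SIZE}}>;` would become `var<workgroup> t: array<f16, 64>;`. A map keeps each placeholder paired with its value in one container, which is one plausible reading of the "map instead of pair of strings" commit.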
I also just want to drop a quick acknowledgement here to @SharmaRithik, @xuyanwen2012, @Ant-28, and @tyler-utah, who helped write and design the infrastructure around making these shaders possible.
Commits (ggml-org#17031):
* Faster tensors (ggml-org#8): Add fast matrix and matrix/vector multiplication.
* Use map for shader replacements instead of pair of strings
Adds the following
Some preliminary performance numbers on my M3:
Llama-3.2-1B-Instruct-F16
WebGPU:
Metal:
Llama-3.2-1B-Instruct-Q4_0
WebGPU:
Metal: