[webgpu] Implement SubGroupMatrix based MatMulNBits for Metal #23729
The ORT web pipeline compiles the WebGPU EP with Emscripten, which fails with: … Possibly the header file that ships with Emscripten doesn't know that feature name yet. Maybe use …
Done!
/azp run Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline,CoreML CI Pipeline,Linux DNNL CI Pipeline,Linux MIGraphX CI Pipeline,Linux ROCm CI Pipeline
Azure Pipelines successfully started running 7 pipeline(s).
### Description

Recent progress on the SubGroupMatrix prototype in Dawn (https://issues.chromium.org/issues/348702031) exposes SIMD-Group Matrix Functions to WebGPU. This shader implements MatMulNBits using that primitive.

In terms of LLM inference speed, prefill for Phi 3.5 with a 1K-token prompt improves by roughly 2.7x: 5.4 s, down from about 14.5 s.

With Changes
```
./model_benchmark -i ~/Phi-3.5-mini-instruct-onnx-web -l 1000
Batch size: 1, prompt tokens: 1001, tokens to generate: 128
Prompt processing (time to first token):
	avg (us):       5.42498e+06    <<< SubGroupMatrix 5.4s
	avg (tokens/s): 184.517
	p50 (us):       5.41982e+06
	stddev (us):    12023.8
	n:              5 * 1001 token(s)
Token generation:
	avg (us):       91138.5
	avg (tokens/s): 10.9723
	p50 (us):       89488.5
	stddev (us):    35136.2
	n:              635 * 1 token(s)
```

Baseline
```
./model_benchmark -i ~/Phi-3.5-mini-instruct-onnx-web -l 1000
Batch size: 1, prompt tokens: 1001, tokens to generate: 128
Prompt processing (time to first token):
	avg (us):       1.45507e+07    <<< Baseline 14.5s
	avg (tokens/s): 68.7938
	p50 (us):       1.45413e+07
	stddev (us):    22208.9
	n:              5 * 1001 token(s)
Token generation:
	avg (us):       94109.8
	avg (tokens/s): 10.6259
	p50 (us):       89660
	stddev (us):    61579
	n:              635 * 1 token(s)
```
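For readers unfamiliar with the operator, here is a minimal scalar sketch of the computation that a 4-bit MatMulNBits performs, which the shader in this PR accelerates with subgroup matrix tiles. It is not the PR's code. The function name `MatMulNBitsReference`, the packing order (two weights per byte, low nibble first), the per-block scale layout, and the fixed zero point of 8 are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference for a 4-bit block-quantized matmul:
//   Y[M, N] = A[M, K] * dequant(B)[N, K]^T
// Assumed layout (illustrative, not taken from the PR):
//   - b_packed stores two 4-bit weights per byte, low nibble first,
//     row-major over [N, K / 2]; K must be even.
//   - scales holds one float per (row n, block of block_size consecutive k).
//   - the zero point is fixed at 8.
std::vector<float> MatMulNBitsReference(const std::vector<float>& a,
                                        const std::vector<uint8_t>& b_packed,
                                        const std::vector<float>& scales,
                                        size_t M, size_t N, size_t K,
                                        size_t block_size) {
  const size_t blocks_per_row = K / block_size;
  std::vector<float> y(M * N, 0.0f);
  for (size_t m = 0; m < M; ++m) {
    for (size_t n = 0; n < N; ++n) {
      float acc = 0.0f;
      for (size_t k = 0; k < K; ++k) {
        // Unpack the 4-bit weight for element (n, k).
        const uint8_t byte = b_packed[(n * K + k) / 2];
        const int q = (k % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
        // Dequantize with the scale of the block that contains k.
        const float scale = scales[n * blocks_per_row + k / block_size];
        acc += a[m * K + k] * (static_cast<float>(q - 8) * scale);
      }
      y[m * N + n] = acc;
    }
  }
  return y;
}
```

A GPU version would typically dequantize a tile of B once and feed it to a hardware matrix multiply-accumulate, which is where the prefill gain reported above would come from; the scalar loop is only useful as a correctness reference.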