@sushraja-msft (Contributor) commented Jan 14, 2025

Description

This change implements accuracy level 4 (quantize A to int8) matmul for the WebGPU EP. The matmul kernel uses DP4A for the inner product; to keep the DP4A units fed, a co-operative matrix multiplication is implemented that preloads the row/col tiles into local variables before the multiply.

Credits to @qjia7 for help with the quantizer shader.
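For readers unfamiliar with DP4A on WebGPU: the core primitive is WGSL's `dot4I8Packed`, which multiply-accumulates four packed int8 pairs in a single call. The sketch below illustrates the quantize-then-DP4A idea only; the bindings, tile sizes, scale handling, and names are hypothetical and do not match the shipped kernel (which also handles the n-bit packed B weights and uses the co-operative tile preload described above).

```
// Minimal sketch (illustrative only): quantize A to packed int8, then
// accumulate with DP4A. Names, bindings, and sizes are hypothetical.
requires packed_4x8_integer_dot_product;

@group(0) @binding(0) var<storage, read> a_packed : array<u32>;  // A as packed int8x4
@group(0) @binding(1) var<storage, read> b_packed : array<u32>;  // B as packed int8x4
@group(0) @binding(2) var<storage, read_write> output : array<f32>;

// Quantize 4 f32 values of A to int8 and pack them into one u32.
// (In the PR, quantizing A happens in a separate quantizer shader pass.)
fn quantize4(v : vec4<f32>, scale : f32) -> u32 {
  return pack4xI8Clamp(vec4<i32>(round(v / scale)));
}

const K_WORDS : u32 = 256u;  // hypothetical K = 1024, i.e. 4 int8s per u32
const N : u32 = 64u;         // hypothetical output width

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
  var acc : i32 = 0;
  // Each dot4I8Packed call is a 4-wide int8 multiply-accumulate, a single
  // instruction on DP4A-capable hardware.
  for (var k = 0u; k < K_WORDS; k = k + 1u) {
    acc += dot4I8Packed(a_packed[gid.y * K_WORDS + k],
                        b_packed[gid.x * K_WORDS + k]);
  }
  // The real kernel applies the A and B dequantization scales here.
  output[gid.y * N + gid.x] = f32(acc);
}
```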

Performance metrics on an Intel ADL/TGL GPU:

```
PS C:\onnxruntime> C:\model_benchmark\model_benchmark.exe -i C:\Phi-3.5-mini-instruct-onnx-web\Phi-3.5-mini-instruct-onnx-web -l 500
Batch size: 1, prompt tokens: 501, tokens to generate: 128
Prompt processing (time to first token):
        avg (us):       2.76762e+06
        **avg (tokens/s): 181.022**   <<< Prefill speed
        p50 (us):       2.74843e+06
        stddev (us):    41756.4
        n:              5 * 501 token(s)
Token generation:
        avg (us):       81500.7
        avg (tokens/s): 12.2698
        p50 (us):       81104.1
        stddev (us):    2961.31
        n:              635 * 1 token(s)
Token sampling:
        avg (us):       13.1836
        avg (tokens/s): 75851.9
        p50 (us):       12
        stddev (us):    6.47085
        n:              640 * 1 token(s)
E2E generation (entire generation loop):
        avg (ms):       13120
        p50 (ms):       13081.6
        stddev (ms):    114.689
        n:              5
Peak working set size (bytes): 5467533312
WebGPU device lost (2): Device was destroyed.
```

This kernel is 2.10x faster than its F16 counterpart for a 500-token prefill; the previous prefill record was 86 tokens/s.

To support devices with subgroup sizes of 8 or 32, a no-subgroup version of the same shader is included. It is slower than the subgroup version on ADL.
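In rough outline, the two variants differ in how the preloaded tile of A is shared between invocations: the subgroup version exchanges it with subgroup intrinsics, while the fallback stages it in workgroup memory behind a barrier. A hypothetical sketch of the contrast (not the shipped shader):

```
// Illustrative contrast only; names and sizes are hypothetical.
enable subgroups;

var<workgroup> tile_a : array<u32, 64>;

@compute @workgroup_size(64)
fn main(@builtin(local_invocation_id) lid : vec3<u32>) {
  let a_word = lid.x;  // stand-in for a packed int8x4 word of A loaded by this lane

  // Subgroup path: lane 0's word becomes visible to every lane in the
  // subgroup without a round trip through shared memory.
  let from_subgroup = subgroupBroadcast(a_word, 0u);

  // No-subgroup path: stage the word in workgroup memory and barrier.
  // Works for any subgroup size (8/16/32) at some extra latency.
  tile_a[lid.x] = a_word;
  workgroupBarrier();
  let from_workgroup = tile_a[0u];
}
```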

```
PS C:\onnxruntime> C:\model_benchmark\model_benchmark.exe -i C:\Phi-3.5-mini-instruct-onnx-web\Phi-3.5-mini-instruct-onnx-web -l 500
Batch size: 1, prompt tokens: 501, tokens to generate: 128
Prompt processing (time to first token):
        avg (us):       4.11989e+06
        avg (tokens/s): 121.605
        p50 (us):       4.11847e+06
        stddev (us):    2147.48
        n:              5 * 501 token(s)
Token generation:
        avg (us):       81174.9
        avg (tokens/s): 12.3191
        p50 (us):       81301.1
        stddev (us):    2177.2
        n:              635 * 1 token(s)
Token sampling:
        avg (us):       14.7998
        avg (tokens/s): 67568.3
        p50 (us):       12.3
        stddev (us):    11.5481
        n:              640 * 1 token(s)
E2E generation (entire generation loop):
        avg (ms):       14431.1
        p50 (ms):       14433.8
        stddev (ms):    5.02473
        n:              5
Peak working set size (bytes): 5466480640
WebGPU device lost (2): Device was destroyed.
```

@sushraja-msft changed the title Dp4MatMulNBits low accuracy matmul for WebGPU EP → WIP: Dp4MatMulNBits low accuracy matmul for WebGPU EP on Jan 14, 2025
@guschmue added the ep:WebGPU (ort-web webgpu provider) label on Jan 16, 2025
@sushraja-msft changed the title WIP: Dp4MatMulNBits low accuracy matmul for WebGPU EP → WIP: Dp4MatMulNBits accuracy level 4 matmul for WebGPU EP on Jan 17, 2025
@sushraja-msft force-pushed the user/sushraja/dp4_matmul branch from 0dd9e67 to 73ee5d1 on January 17, 2025 20:23
@guschmue (Contributor) commented:
perf and accuracy look good on Xe, NV and Metal

@guschmue merged commit 58c29d3 into main on Jan 21, 2025 (98 checks passed)
@guschmue deleted the user/sushraja/dp4_matmul branch on January 21, 2025 23:46
ashrit-ms pushed a commit that referenced this pull request Jan 23, 2025
ashrit-ms added a commit that referenced this pull request Jan 23, 2025
### Description
This PR is to update the win-ort-main branch to the tip main branch as
of 2025-01-23.

### PR List
ddf0d37 [QNN EP] Add LoggingManager::HasDefaultLogger() to provider bridge API (#23467)
05fbbdf [QNN EP] Make QNN EP a shared library (#23120)
1336566 Add custom vcpkg ports (#23456)
2e1173c Update the compile flags for vcpkg packages (#23455)
1f628a9 [Mobile] Add BrowserStack Android MAUI Test (#23383)
009cae0 [js/webgpu] Optimize ConvTranspose (Continue) (#23429)
04a4a69 Use onnx_protobuf.h to suppress some GCC warnings (#23453)
2e3b62b Suppress some strict-aliasing related warnings in WebGPU EP (#23454)
b708f9b Bump ruff from 0.9.1 to 0.9.2 (#23427)
c0afc66 [WebNN] Remove workarounds for TFLite backend (#23406)
8a821ff Bump vite from 6.0.7 to 6.0.11 in /js/web/test/e2e/exports/testcases/vite-default (#23446)
220c1a2 Make ORT and Dawn use the same protobuf/abseil source code (#23447)
b7b5792 Change MacOS-13 to ubuntu on for android-java-api-aar-test.yml. (#23444)
19d0d2a WIP: Dp4MatMulNBits accuracy level 4 matmul for WebGPU EP (#23365)
95b8eff [QNN EP]: Clean up QNN logging resources if an error occurs during initialization (#23435)
626134c Bump clang-format from 19.1.6 to 19.1.7 (#23428)
0cf9753 Fix eigen external deps (#23439)
f9440ae Moving RN_CI Android Testing to Linux (#23422)
1aa5902 [QNN EP] workaround for QNN validation bug for Tanh with uint16 quantized output (#23432)
7f5582a Seperate RN andriod and IOS into 2 separated Stages. (#23400)
73deac2 Implement some missing element wise Add/Sub/Mul/Div/Neg operations for CPU and CUDA EPs (#23090)
949fe42 Upgrade Java version from react-native/android to Java 17 (#23066)
0892c23 Update Qnn SDK default version to 2.30 (#23411)
94c099b Fix type cast build error (#23423)
d633e57 [WebNN EP] Fix AddInitializersToSkip issues (#23354)
e988ef0 [QNN EP] Fix regression for MatMul with two quantized/dynamic uint16 inputs (#23419)
7538795 Update onnxruntime binary size checks ci pipeline's docker image (#23405)
6c5ea41 Revert "[QNN EP] Clean up correctly from a partial setup (#23320)" (#23420)
e866804 Enable comprehension simplification in ruff rules (#23414)
0a5f1f3 bugfix: string_view of invalid memory (#23417)
4cc38e0 fix crash when first input of BatchNormalization is 1-D (#23387)
0334414 Target py310 and modernize codebase with ruff (#23401)
87341ac [QNN EP] Fix segfault when unregistering HTP shared memory handles (#23402)

### Motivation and Context
This update includes the change to make QNN-EP a shared library.

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: Peishen Yan <peishen.yan@intel.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Hector Li <hecli@microsoft.com>
Co-authored-by: Jian Chen <cjian@microsoft.com>
Co-authored-by: Alexis Tsogias <1114095+Zyrin@users.noreply.github.com>
Co-authored-by: junchao-zhao <68935141+junchao-loongson@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: sushraja-msft <44513542+sushraja-msft@users.noreply.github.com>
Co-authored-by: Wanming Lin <wanming.lin@intel.com>
Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com>
Co-authored-by: Caroline Zhu <wolfivyaura@gmail.com>
@sushraja-msft changed the title WIP: Dp4MatMulNBits accuracy level 4 matmul for WebGPU EP → Dp4MatMulNBits accuracy level 4 matmul for WebGPU EP on Feb 6, 2025
@sushraja-msft changed the title Dp4MatMulNBits accuracy level 4 matmul for WebGPU EP → DP4MatMulNBits accuracy level 4 matmul for WebGPU EP on Feb 6, 2025
guschmue pushed a commit that referenced this pull request Mar 6, 2025
ashrit-ms pushed a commit that referenced this pull request Mar 17, 2025