@sushraja-msft (Contributor) commented Jan 14, 2025

Description

This change implements accuracy level 4 (quantize A to int8) matmul for the WebGPU EP. The matmul kernel uses DP4A for the inner product; to keep the DP4A units fed, a co-operative matrix multiplication is implemented that preloads the row/col tiles into local variables before the multiply.

Credits to @qjia7 for help with the quantizer shader.
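For readers unfamiliar with DP4A on WebGPU: the core primitive is WGSL's `dot4I8Packed`, which multiply-accumulates four packed int8 pairs in a single call. The sketch below illustrates the quantize-then-DP4A idea only; the bindings, tile sizes, scale handling, and names are hypothetical and do not match the shipped kernel (which also handles the n-bit packed B weights and uses the co-operative tile preload described above).

```
// Minimal sketch (illustrative only): quantize A to packed int8, then
// accumulate with DP4A. Names, bindings, and sizes are hypothetical.
requires packed_4x8_integer_dot_product;

@group(0) @binding(0) var<storage, read> a_packed : array<u32>;  // A as packed int8x4
@group(0) @binding(1) var<storage, read> b_packed : array<u32>;  // B as packed int8x4
@group(0) @binding(2) var<storage, read_write> output : array<f32>;

// Quantize 4 f32 values of A to int8 and pack them into one u32.
// (In the PR, quantizing A happens in a separate quantizer shader pass.)
fn quantize4(v : vec4<f32>, scale : f32) -> u32 {
  return pack4xI8Clamp(vec4<i32>(round(v / scale)));
}

const K_WORDS : u32 = 256u;  // hypothetical K = 1024, i.e. 4 int8s per u32
const N : u32 = 64u;         // hypothetical output width

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
  var acc : i32 = 0;
  // Each dot4I8Packed call is a 4-wide int8 multiply-accumulate, a single
  // instruction on DP4A-capable hardware.
  for (var k = 0u; k < K_WORDS; k = k + 1u) {
    acc += dot4I8Packed(a_packed[gid.y * K_WORDS + k],
                        b_packed[gid.x * K_WORDS + k]);
  }
  // The real kernel applies the A and B dequantization scales here.
  output[gid.y * N + gid.x] = f32(acc);
}
```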

Performance metrics on an Intel ADL/TGL GPU:

```
PS C:\onnxruntime> C:\model_benchmark\model_benchmark.exe -i C:\Phi-3.5-mini-instruct-onnx-web\Phi-3.5-mini-instruct-onnx-web -l 500
Batch size: 1, prompt tokens: 501, tokens to generate: 128
Prompt processing (time to first token):
        avg (us):       2.76762e+06
        **avg (tokens/s): 181.022**   <<< Prefill speed
        p50 (us):       2.74843e+06
        stddev (us):    41756.4
        n:              5 * 501 token(s)
Token generation:
        avg (us):       81500.7
        avg (tokens/s): 12.2698
        p50 (us):       81104.1
        stddev (us):    2961.31
        n:              635 * 1 token(s)
Token sampling:
        avg (us):       13.1836
        avg (tokens/s): 75851.9
        p50 (us):       12
        stddev (us):    6.47085
        n:              640 * 1 token(s)
E2E generation (entire generation loop):
        avg (ms):       13120
        p50 (ms):       13081.6
        stddev (ms):    114.689
        n:              5
Peak working set size (bytes): 5467533312
WebGPU device lost (2): Device was destroyed.
```

This kernel is 2.10x faster than its F16 counterpart for a 500-token prefill; the previous prefill record was 86 tokens/s.

To support devices with subgroup sizes of 8 or 32, a no-subgroup version of the same shader is included. It is slower than the subgroup version on ADL.
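In rough outline, the two variants differ in how the preloaded tile of A is shared between invocations: the subgroup version exchanges it with subgroup intrinsics, while the fallback stages it in workgroup memory behind a barrier. A hypothetical sketch of the contrast (not the shipped shader):

```
// Illustrative contrast only; names and sizes are hypothetical.
enable subgroups;

var<workgroup> tile_a : array<u32, 64>;

@compute @workgroup_size(64)
fn main(@builtin(local_invocation_id) lid : vec3<u32>) {
  let a_word = lid.x;  // stand-in for a packed int8x4 word of A loaded by this lane

  // Subgroup path: lane 0's word becomes visible to every lane in the
  // subgroup without a round trip through shared memory.
  let from_subgroup = subgroupBroadcast(a_word, 0u);

  // No-subgroup path: stage the word in workgroup memory and barrier.
  // Works for any subgroup size (8/16/32) at some extra latency.
  tile_a[lid.x] = a_word;
  workgroupBarrier();
  let from_workgroup = tile_a[0u];
}
```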

```
PS C:\onnxruntime> C:\model_benchmark\model_benchmark.exe -i C:\Phi-3.5-mini-instruct-onnx-web\Phi-3.5-mini-instruct-onnx-web -l 500
Batch size: 1, prompt tokens: 501, tokens to generate: 128
Prompt processing (time to first token):
        avg (us):       4.11989e+06
        avg (tokens/s): 121.605
        p50 (us):       4.11847e+06
        stddev (us):    2147.48
        n:              5 * 501 token(s)
Token generation:
        avg (us):       81174.9
        avg (tokens/s): 12.3191
        p50 (us):       81301.1
        stddev (us):    2177.2
        n:              635 * 1 token(s)
Token sampling:
        avg (us):       14.7998
        avg (tokens/s): 67568.3
        p50 (us):       12.3
        stddev (us):    11.5481
        n:              640 * 1 token(s)
E2E generation (entire generation loop):
        avg (ms):       14431.1
        p50 (ms):       14433.8
        stddev (ms):    5.02473
        n:              5
Peak working set size (bytes): 5466480640
WebGPU device lost (2): Device was destroyed.
```

@sushraja-msft changed the title Dp4MatMulNBits low accuracy matmul for WebGPU EP → WIP: Dp4MatMulNBits low accuracy matmul for WebGPU EP on Jan 14, 2025
@guschmue added the ep:WebGPU (ort-web webgpu provider) label on Jan 16, 2025
@sushraja-msft changed the title WIP: Dp4MatMulNBits low accuracy matmul for WebGPU EP → WIP: Dp4MatMulNBits accuracy level 4 matmul for WebGPU EP on Jan 17, 2025
@sushraja-msft force-pushed the user/sushraja/dp4_matmul branch from 0dd9e67 to 73ee5d1 on January 17, 2025 20:23
@guschmue (Contributor) commented:
perf and accuracy look good on Xe, NV and Metal

@guschmue merged commit 58c29d3 into main on Jan 21, 2025 (98 checks passed)
@guschmue deleted the user/sushraja/dp4_matmul branch on January 21, 2025 23:46
ashrit-ms pushed a commit that referenced this pull request Jan 23, 2025
ashrit-ms added a commit that referenced this pull request Jan 23, 2025
### Description
This PR is to update the win-ort-main branch to the tip main branch as
of 2025-01-23.

### PR List
ddf0d37 [QNN EP] Add LoggingManager::HasDefaultLogger() to provider bridge API (#23467)
05fbbdf [QNN EP] Make QNN EP a shared library (#23120)
1336566 Add custom vcpkg ports (#23456)
2e1173c Update the compile flags for vcpkg packages (#23455)
1f628a9 [Mobile] Add BrowserStack Android MAUI Test (#23383)
009cae0 [js/webgpu] Optimize ConvTranspose (Continue) (#23429)
04a4a69 Use onnx_protobuf.h to suppress some GCC warnings (#23453)
2e3b62b Suppress some strict-aliasing related warnings in WebGPU EP (#23454)
b708f9b Bump ruff from 0.9.1 to 0.9.2 (#23427)
c0afc66 [WebNN] Remove workarounds for TFLite backend (#23406)
8a821ff Bump vite from 6.0.7 to 6.0.11 in /js/web/test/e2e/exports/testcases/vite-default (#23446)
220c1a2 Make ORT and Dawn use the same protobuf/abseil source code (#23447)
b7b5792 Change MacOS-13 to ubuntu on for android-java-api-aar-test.yml. (#23444)
19d0d2a WIP: Dp4MatMulNBits accuracy level 4 matmul for WebGPU EP (#23365)
95b8eff [QNN EP]: Clean up QNN logging resources if an error occurs during initialization (#23435)
626134c Bump clang-format from 19.1.6 to 19.1.7 (#23428)
0cf9753 Fix eigen external deps (#23439)
f9440ae Moving RN_CI Android Testing to Linux (#23422)
1aa5902 [QNN EP] workaround for QNN validation bug for Tanh with uint16 quantized output (#23432)
7f5582a Seperate RN andriod and IOS into 2 separated Stages. (#23400)
73deac2 Implement some missing element wise Add/Sub/Mul/Div/Neg operations for CPU and CUDA EPs (#23090)
949fe42 Upgrade Java version from react-native/android to Java 17 (#23066)
0892c23 Update Qnn SDK default version to 2.30 (#23411)
94c099b Fix type cast build error (#23423)
d633e57 [WebNN EP] Fix AddInitializersToSkip issues (#23354)
e988ef0 [QNN EP] Fix regression for MatMul with two quantized/dynamic uint16 inputs (#23419)
7538795 Update onnxruntime binary size checks ci pipeline's docker image (#23405)
6c5ea41 Revert "[QNN EP] Clean up correctly from a partial setup (#23320)" (#23420)
e866804 Enable comprehension simplification in ruff rules (#23414)
0a5f1f3 bugfix: string_view of invalid memory (#23417)
4cc38e0 fix crash when first input of BatchNormalization is 1-D (#23387)
0334414 Target py310 and modernize codebase with ruff (#23401)
87341ac [QNN EP] Fix segfault when unregistering HTP shared memory handles (#23402)

### Motivation and Context
This update includes the change to make QNN-EP a shared library.

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: Peishen Yan <peishen.yan@intel.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Hector Li <hecli@microsoft.com>
Co-authored-by: Jian Chen <cjian@microsoft.com>
Co-authored-by: Alexis Tsogias <1114095+Zyrin@users.noreply.github.com>
Co-authored-by: junchao-zhao <68935141+junchao-loongson@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: sushraja-msft <44513542+sushraja-msft@users.noreply.github.com>
Co-authored-by: Wanming Lin <wanming.lin@intel.com>
Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com>
Co-authored-by: Caroline Zhu <wolfivyaura@gmail.com>
@sushraja-msft changed the title WIP: Dp4MatMulNBits accuracy level 4 matmul for WebGPU EP → Dp4MatMulNBits accuracy level 4 matmul for WebGPU EP on Feb 6, 2025
@sushraja-msft changed the title Dp4MatMulNBits accuracy level 4 matmul for WebGPU EP → DP4MatMulNBits accuracy level 4 matmul for WebGPU EP on Feb 6, 2025
guschmue pushed a commit that referenced this pull request Mar 6, 2025
ashrit-ms pushed a commit that referenced this pull request Mar 17, 2025