[webgpu] Implement SubGroupMatrix based MatMulNBits for Metal #23729

sushraja-msft · 2025-02-17T22:43:17Z

Description

Recent progress on the SubGroupMatrix prototype in Dawn (https://issues.chromium.org/issues/348702031) exposes SIMD-group matrix functions to WebGPU. This PR adds a shader that implements MatMulNBits using that primitive.
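
For readers unfamiliar with the operator, here is a scalar reference sketch of what a 4-bit MatMulNBits computes. It is not the shader from this PR. It assumes a simplified layout: each of the N weight rows holds K 4-bit values packed two per byte (low nibble first), there is one float scale per block of `block_size` values, and the zero point is fixed at 8. The function name and parameters are illustrative only.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference for Y[M x N] = A[M x K] * dequant(B)^T.
// Assumed layout (illustrative): B is N rows of K 4-bit values, two per byte,
// low nibble first; K is even and a multiple of block_size.
std::vector<float> MatMulNBitsReference(const std::vector<float>& a,
                                        const std::vector<uint8_t>& b_packed,
                                        const std::vector<float>& scales,
                                        size_t M, size_t N, size_t K,
                                        size_t block_size) {
  const size_t blocks_per_row = K / block_size;
  std::vector<float> y(M * N, 0.0f);
  for (size_t m = 0; m < M; ++m) {
    for (size_t n = 0; n < N; ++n) {
      float acc = 0.0f;
      for (size_t k = 0; k < K; ++k) {
        // Fetch the byte holding element (n, k) and pick its nibble.
        const uint8_t byte = b_packed[(n * K + k) / 2];
        const int q = (k % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
        // Dequantize with the block's scale and the assumed zero point of 8.
        const float scale = scales[n * blocks_per_row + k / block_size];
        acc += a[m * K + k] * (static_cast<float>(q - 8) * scale);
      }
      y[m * N + n] = acc;
    }
  }
  return y;
}
```

The shader in this PR targets the same operator but uses the subgroup matrix primitive instead of a scalar inner loop.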

Observed perf gain, in terms of LLM inference speed: prefill for Phi 3.5 with a 1K-token prompt is about 2.7x faster, 5.4 s down from 14.5 s.

With Changes

./model_benchmark -i ~/Phi-3.5-mini-instruct-onnx-web -l 1000
Batch size: 1, prompt tokens: 1001, tokens to generate: 128
Prompt processing (time to first token):
	avg (us):       5.42498e+06                    <<< SubGroupMatrix 5.4s
	avg (tokens/s): 184.517
	p50 (us):       5.41982e+06
	stddev (us):    12023.8
	n:              5 * 1001 token(s)
Token generation:
	avg (us):       91138.5
	avg (tokens/s): 10.9723
	p50 (us):       89488.5
	stddev (us):    35136.2
	n:              635 * 1 token(s)

Baseline

./model_benchmark -i ~/Phi-3.5-mini-instruct-onnx-web -l 1000
Batch size: 1, prompt tokens: 1001, tokens to generate: 128
Prompt processing (time to first token):
	avg (us):       1.45507e+07                     <<< Baseline 14.5s
	avg (tokens/s): 68.7938
	p50 (us):       1.45413e+07
	stddev (us):    22208.9
	n:              5 * 1001 token(s)
Token generation:
	avg (us):       94109.8
	avg (tokens/s): 10.6259
	p50 (us):       89660
	stddev (us):    61579
	n:              635 * 1 token(s)

github-actions

You can commit the suggested changes from lintrunner.

onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits.cc

onnxruntime/contrib_ops/webgpu/quantization/subgroup_matrix_matmul_nbits.cc

onnxruntime/contrib_ops/webgpu/quantization/subgroup_matrix_matmul_nbits.h

onnxruntime/core/providers/webgpu/webgpu_context.h

github-actions

You can commit the suggested changes from lintrunner.

onnxruntime/core/providers/coreml/model/model.mm

guschmue · 2025-02-20T17:27:27Z

The ORT web pipeline compiles the WebGPU EP with emscripten, which fails with:
shader_helper.cc:355:45: error: no member named 'ChromiumExperimentalSubgroupMatrix' in 'wgpu::FeatureName'
355 | if (device_.HasFeature(wgpu::FeatureName::ChromiumExperimentalSubgroupMatrix))

Possibly the header file that ships with emscripten doesn't know that feature name yet.

Maybe use
#if !defined(__wasm__)
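
A minimal sketch of such a guard, assuming the intended macro is `__wasm__`, which Clang predefines when targeting WebAssembly. The `FeatureName` enum and `Device` struct below are stand-ins so the snippet compiles without Dawn's headers; in the real code they would be `wgpu::FeatureName` and the device held by the shader helper. `SupportsSubgroupMatrix` is a hypothetical name.

```cpp
#include <set>

// Stand-in for wgpu::FeatureName (illustration only).
enum class FeatureName {
  Subgroups,
#if !defined(__wasm__)
  // Mirrors the situation in the error above: the value exists only in
  // headers that already know about the experimental feature.
  ChromiumExperimentalSubgroupMatrix,
#endif
};

// Stand-in for the device object, exposing only HasFeature().
struct Device {
  std::set<FeatureName> features;
  bool HasFeature(FeatureName f) const { return features.count(f) != 0; }
};

// Returns true when the subgroup-matrix path may be used. On WebAssembly
// builds the enum value is never named, so the code still compiles.
bool SupportsSubgroupMatrix(const Device& device) {
#if !defined(__wasm__)
  return device.HasFeature(FeatureName::ChromiumExperimentalSubgroupMatrix);
#else
  (void)device;
  return false;
#endif
}
```

With this shape, the WebAssembly build simply reports the feature as unavailable and falls back to the existing path.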

sushraja-msft · 2025-02-20T22:07:08Z

Done!

fs-eire · 2025-02-21T03:04:05Z

/azp run Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline,CoreML CI Pipeline,Linux DNNL CI Pipeline,Linux MIGraphX CI Pipeline,Linux ROCm CI Pipeline

azure-pipelines · 2025-02-21T03:04:32Z

Azure Pipelines successfully started running 7 pipeline(s).

github-actions bot reviewed Feb 17, 2025

github-advanced-security bot found potential problems Feb 17, 2025

onnxruntime/contrib_ops/webgpu/quantization/subgroup_matrix_matmul_nbits.cc (fixed)

onnxruntime/contrib_ops/webgpu/quantization/subgroup_matrix_matmul_nbits.h (fixed)

sushraja-msft marked this pull request as ready for review February 19, 2025 00:30

sushraja-msft assigned guschmue and qjia7 Feb 19, 2025

qjia7 reviewed Feb 19, 2025

onnxruntime/contrib_ops/webgpu/quantization/subgroup_matrix_matmul_nbits.cc (resolved)

sushraja-msft added 13 commits February 19, 2025 12:29

Upgrade Dawn

7e2b8db

Code Builds

91cda62

Runs but garbage outputs

9f838ff

Add restriction to use subgroup matmul for prefill only

aaeb58b

First round bug fixes

2d406da

Remove safeMatrixStore

a3102cd

Add FP32 support

2866c26

Add support for compute precision

acc3ab6

Restrict to accuracy level 4

2ad480c

Lintrunner

8d7ae12

Revert change to .gitignore

c5856df

Fix mac build break

aa6e8c0

add comments

09e30be

sushraja-msft force-pushed the user/sushraja/subgroupMatrix branch from e90b823 to 09e30be Compare February 19, 2025 20:37

guschmue added the ep:WebGPU ort-web webgpu provider label Feb 19, 2025

github-actions bot reviewed Feb 19, 2025

onnxruntime/core/providers/coreml/model/model.mm (outdated, resolved)

onnxruntime/core/providers/coreml/model/model.mm (outdated, resolved)

sushraja-msft requested a review from guschmue February 19, 2025 23:31

Update cmake for dawn

f7ddbb0

sushraja-msft force-pushed the user/sushraja/subgroupMatrix branch from 92db1cf to f7ddbb0 Compare February 20, 2025 00:32

Fix WASM builds

81c0262

guschmue approved these changes Feb 21, 2025

guschmue merged commit 8eb5513 into main Feb 21, 2025
96 of 98 checks passed

guschmue deleted the user/sushraja/subgroupMatrix branch February 21, 2025 17:23

dneto0 mentioned this pull request Mar 28, 2025

Subgroup matrix gpuweb/gpuweb#4195

Open
