@sushraja-msft sushraja-msft commented Feb 17, 2025

Description

Recent progress on the SubGroupMatrix prototype in Dawn (https://issues.chromium.org/issues/348702031) exposes SIMD-group matrix functions to WebGPU. This PR adds a shader that implements MatMulNBits using that primitive.
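For readers unfamiliar with the operator, here is a minimal CPU sketch of what a 4-bit block-quantized matmul computes. It is not the shader from this PR. The function name `MatMul4BitsReference`, the packing order (two weights per byte, low nibble first), the row-major layouts, and the fixed zero point of 8 are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified CPU reference for a 4-bit block-quantized matmul (illustrative only).
// Assumed layout, not taken from the PR:
//   a:        M x K floats, row-major
//   b_packed: N x (K/2) bytes, two 4-bit weights per byte, low nibble first
//   scales:   N x (K/block_size) floats, one scale per block of weights
//   zero point: fixed at 8 for every block
std::vector<float> MatMul4BitsReference(const std::vector<float>& a,
                                        const std::vector<std::uint8_t>& b_packed,
                                        const std::vector<float>& scales,
                                        std::size_t M, std::size_t N, std::size_t K,
                                        std::size_t block_size) {
  assert(block_size > 0 && K % 2 == 0 && K % block_size == 0);
  assert(a.size() == M * K);
  assert(b_packed.size() == N * (K / 2));
  assert(scales.size() == N * (K / block_size));

  const std::size_t blocks_per_row = K / block_size;
  std::vector<float> y(M * N, 0.0f);
  for (std::size_t m = 0; m < M; ++m) {
    for (std::size_t n = 0; n < N; ++n) {
      float acc = 0.0f;
      for (std::size_t k = 0; k < K; ++k) {
        // Unpack one 4-bit weight from the byte that holds elements k and k^1.
        const std::uint8_t byte = b_packed[n * (K / 2) + k / 2];
        const int q = (k % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
        // Dequantize with the scale of the block this element belongs to.
        const float scale = scales[n * blocks_per_row + k / block_size];
        const float w = static_cast<float>(q - 8) * scale;
        acc += a[m * K + k] * w;
      }
      y[m * N + n] = acc;  // Y = A * dequant(B)^T
    }
  }
  return y;
}
```

A subgroup-matrix shader would compute the same result, but with the inner products carried out on small matrix tiles by the hardware rather than one scalar at a time.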

Observed perf gains in LLM inference speed: prefill for Phi 3.5 with a 1K-token prompt is about 2.7x faster, 5.4 s versus 14.6 s for the baseline. Token generation speed is essentially unchanged.

With Changes

./model_benchmark -i ~/Phi-3.5-mini-instruct-onnx-web -l 1000
Batch size: 1, prompt tokens: 1001, tokens to generate: 128
Prompt processing (time to first token):
	avg (us):       5.42498e+06                    <<< SubGroupMatrix 5.4s
	avg (tokens/s): 184.517
	p50 (us):       5.41982e+06
	stddev (us):    12023.8
	n:              5 * 1001 token(s)
Token generation:
	avg (us):       91138.5
	avg (tokens/s): 10.9723
	p50 (us):       89488.5
	stddev (us):    35136.2
	n:              635 * 1 token(s)

Baseline

./model_benchmark -i ~/Phi-3.5-mini-instruct-onnx-web -l 1000
Batch size: 1, prompt tokens: 1001, tokens to generate: 128
Prompt processing (time to first token):
	avg (us):       1.45507e+07                     <<< Baseline 14.5s
	avg (tokens/s): 68.7938
	p50 (us):       1.45413e+07
	stddev (us):    22208.9
	n:              5 * 1001 token(s)
Token generation:
	avg (us):       94109.8
	avg (tokens/s): 10.6259
	p50 (us):       89660
	stddev (us):    61579
	n:              635 * 1 token(s)
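The headline numbers can be rechecked from the two benchmark runs above. The sketch below shows the arithmetic; the helper names `TokensPerSecond` and `Speedup` are hypothetical and are not part of `model_benchmark`.

```cpp
#include <cassert>

// Throughput for a run that processed `tokens` tokens in `micros` microseconds.
double TokensPerSecond(double tokens, double micros) {
  assert(micros > 0.0);
  return tokens / (micros * 1e-6);
}

// Ratio of baseline time to new time; values above 1 mean the new run is faster.
double Speedup(double baseline_micros, double new_micros) {
  assert(new_micros > 0.0);
  return baseline_micros / new_micros;
}
```

With the reported averages, `Speedup(1.45507e7, 5.42498e6)` is about 2.68, and `TokensPerSecond(1001, 5.42498e6)` is about 184.5, matching the `avg (tokens/s)` line of the SubGroupMatrix run.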

@github-actions github-actions bot left a comment

You can commit the suggested changes from lintrunner.

@sushraja-msft sushraja-msft marked this pull request as ready for review February 19, 2025 00:30
@sushraja-msft sushraja-msft force-pushed the user/sushraja/subgroupMatrix branch from e90b823 to 09e30be Compare February 19, 2025 20:37
@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Feb 19, 2025
@sushraja-msft sushraja-msft force-pushed the user/sushraja/subgroupMatrix branch from 92db1cf to f7ddbb0 Compare February 20, 2025 00:32

guschmue commented Feb 20, 2025

The ort-web pipeline compiles the WebGPU EP with Emscripten, which fails with:
shader_helper.cc:355:45: error: no member named 'ChromiumExperimentalSubgroupMatrix' in 'wgpu::FeatureName'
355 | if (device_.HasFeature(wgpu::FeatureName::ChromiumExperimentalSubgroupMatrix))

Possibly the header file that ships with Emscripten doesn't know that feature name yet.

Maybe use
#if !defined(__wasm__)
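A minimal sketch of the suggested guard, assuming the intended macro is `__wasm__`, which Clang predefines for WebAssembly targets. `FeatureName` and `FakeDevice` here are hypothetical stand-ins so the example compiles without Dawn or Emscripten headers; `CanUseSubgroupMatrix` is likewise an illustrative name, not the code that was merged.

```cpp
#include <set>

// Stand-in for wgpu::FeatureName (hypothetical). The experimental member is
// declared only for non-WebAssembly builds, mimicking a header that lacks it.
enum class FeatureName {
  ShaderF16,
#if !defined(__wasm__)
  ChromiumExperimentalSubgroupMatrix,
#endif
};

// Hypothetical minimal device that just records which features it has.
struct FakeDevice {
  std::set<FeatureName> features;
  bool HasFeature(FeatureName name) const { return features.count(name) != 0; }
};

// The enum member is referenced only inside the guard, so a WebAssembly build
// never sees the unknown name and simply reports the feature as unavailable.
bool CanUseSubgroupMatrix(const FakeDevice& device) {
#if !defined(__wasm__)
  return device.HasFeature(FeatureName::ChromiumExperimentalSubgroupMatrix);
#else
  (void)device;
  return false;
#endif
}
```

The design choice is to fall back to the existing code path on the web build rather than fail compilation, at the cost of not using the new shader there until the headers catch up.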

sushraja-msft (Contributor, Author) commented

Done!


fs-eire commented Feb 21, 2025

/azp run Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline,CoreML CI Pipeline,Linux DNNL CI Pipeline,Linux MIGraphX CI Pipeline,Linux ROCm CI Pipeline

azure-pipelines commented

Azure Pipelines successfully started running 7 pipeline(s).

@guschmue guschmue merged commit 8eb5513 into main Feb 21, 2025
96 of 98 checks passed
@guschmue guschmue deleted the user/sushraja/subgroupMatrix branch February 21, 2025 17:23
guschmue pushed a commit that referenced this pull request Mar 6, 2025
ashrit-ms pushed a commit that referenced this pull request Mar 17, 2025