
Conversation

xhcao (Contributor) commented Mar 25, 2025

If batch_size and sequence_length are both 1, split the hidden_size dimension to improve parallelism.

Description

Motivation and Context

If batch_size and sequence_length are both 1, split the hidden_size dimension to improve parallelism.
xhcao force-pushed the skip-norm-layer branch from fc3af9b to 32dd4cd on March 25, 2025 08:50
xhcao (Contributor, Author) commented Mar 25, 2025

The SkipLayerNormalization operator in phi3.5 has two outputs, output and input_skip_bias_sum, both of shape [batch_size, sequence_length, hidden_size]. During the decoding stage, batch_size and sequence_length are always 1, so the output shapes are [1, 1, 3072] and only one workgroup is dispatched, which does not use GPU resources well.
For this case, the PR: 1. splits the hidden dimension across additional workgroups, which adds some work overall but reduces the average workload of each workgroup; 2. handles output and input_skip_bias_sum in separate workgroups. With both changes, there are 12 workgroups in total for the shape [1, 1, 3072] (see the sketch below).
Captured with the Intel GPA tool, the kernel time drops from ~20 us to ~10 us.
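For illustration only, a minimal C++-style sketch of the dispatch split described above; the function name, struct, and 512-element chunk size are assumptions for this example, not the names or values used in the actual kernel:

```cpp
// Hypothetical sketch of the dispatch logic described above (names and the
// chunk size are illustrative assumptions, not taken from the real shader).
#include <cstdint>

struct DispatchInfo {
  uint32_t workgroups;          // total workgroups to dispatch
  uint32_t elements_per_group;  // hidden elements handled by each workgroup
};

DispatchInfo ComputeSkipLayerNormDispatch(uint32_t batch_size,
                                          uint32_t sequence_length,
                                          uint32_t hidden_size,
                                          bool has_input_skip_bias_sum) {
  if (batch_size == 1 && sequence_length == 1) {
    // Decoding stage: a single row would otherwise map to one workgroup,
    // so split the hidden dimension into chunks. With an assumed chunk of
    // 512 elements, hidden_size = 3072 yields 6 workgroups per output.
    constexpr uint32_t kChunk = 512;
    uint32_t groups_per_output = (hidden_size + kChunk - 1) / kChunk;
    // Produce `output` and `input_skip_bias_sum` in separate workgroups,
    // doubling the dispatch (6 * 2 = 12 for shape [1, 1, 3072]).
    uint32_t outputs = has_input_skip_bias_sum ? 2 : 1;
    return {groups_per_output * outputs, kChunk};
  }
  // Otherwise keep one workgroup per (batch, sequence) row.
  return {batch_size * sequence_length, hidden_size};
}
```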
[Screenshot: SkipNL-Before (Intel GPA capture before the change)]
[Screenshot: SkipNL-After (Intel GPA capture after the change)]
@jchen10 @hujiajie PTAL, thanks

guschmue added the ep:WebGPU (ort-web webgpu provider) label on Mar 25, 2025
guschmue (Contributor) commented:

/azp run ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,Linux Android Emulator QNN CI Pipeline

guschmue (Contributor) commented:

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

guschmue (Contributor) commented:

/azp run Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline,Big Models

azure-pipelines commented:

Azure Pipelines successfully started running 2 pipeline(s).

guschmue (Contributor) commented:

/azp run Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI

azure-pipelines commented:

Azure Pipelines successfully started running 3 pipeline(s).

azure-pipelines commented:

Azure Pipelines successfully started running 2 pipeline(s).

azure-pipelines commented:

Azure Pipelines successfully started running 7 pipeline(s).

guschmue (Contributor) commented Apr 1, 2025

lgtm.
CI pipelines changed - can you merge with main?

xhcao (Contributor, Author) commented Apr 1, 2025

> lgtm. CI pipelines changed - can you merge with main?

Updated

guschmue (Contributor) commented Apr 8, 2025

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline

azure-pipelines commented:

Azure Pipelines successfully started running 5 pipeline(s).

guschmue merged commit 0acb048 into microsoft:main on Apr 8, 2025
60 of 69 checks passed
zhaoxul-qti pushed a commit to CodeLinaro/onnxruntime that referenced this pull request Apr 17, 2025
If batch_size and sequence_length are both 1, split the hidden_size dimension to improve parallelism.
