
Conversation

@liqunfu
Contributor

@liqunfu liqunfu commented Feb 14, 2025

Description

Credit to chethanpk, who provided the RoPE embedding implementation in a patch. The patch is in the first commit of this PR.

I have been confirming the perf improvement of this code change. My analysis is based on phi-3-mini-4k-instruct-int4-int8-blklen32.
The benchmark from onnxruntime-genai does not show a clear improvement. This is because GQA takes only a small portion of the whole model (<10%), and RoPE takes only a small portion of GQA (12%). The following is the profile with and without AVX2.

We see the cost of RoPE drop from 82.42 to 18.86, so I still recommend merging this PR.

With AVX2 RoPE:
Name: GroupQueryAttention_rotary, Mean Duration: 18.86, Percentage: 3.16%

Plain C++ RoPE:
Name: GroupQueryAttention_rotary, Mean Duration: 82.42, Percentage: 12.20%

mlas benchmark:

| dim | interleaved | baseline | new |
|-|-|-|-|
| 128 | false | 735 | 18.1 |
| 256 | false | 1470 | 31.7 |
| 512 | false | 2938 | 59.2 |
| 1024 | false | 5876 | 81.5 |
| 128 | true | 368 | 23.1 |
| 256 | true | 735 | 34.3 |
| 512 | true | 1470 | 62.0 |
| 1024 | true | 2937 | 125 |
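For context, the computation the kernel vectorizes can be sketched in scalar form for the non-interleaved layout, assuming precomputed sin/cos tables of length dim/2 (the function name and signature below are illustrative, not the MLAS API):

```cpp
#include <cstddef>

// Scalar reference for rotary position embedding, non-interleaved layout:
// the first half of `input` holds x_0..x_{d/2-1}, the second half the rest.
// Each pair (x_i, x_{i+d/2}) is rotated by the angle whose sin/cos are
// sin_data[i] / cos_data[i]:
//   out_i       = x_i * cos_i - x_{i+d/2} * sin_i
//   out_{i+d/2} = x_i * sin_i + x_{i+d/2} * cos_i
void RopeScalarRef(const float* input, const float* sin_data,
                   const float* cos_data, size_t dim, float* output) {
  const size_t half = dim / 2;
  for (size_t i = 0; i < half; ++i) {
    float a = input[i];
    float b = input[i + half];
    output[i] = a * cos_data[i] - b * sin_data[i];
    output[i + half] = a * sin_data[i] + b * cos_data[i];
  }
}
```

The AVX2 kernel computes the same rotation eight floats at a time.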

Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
@liqunfu liqunfu requested a review from a team as a code owner February 14, 2025 00:04
@github-actions github-actions bot left a comment

You can commit the suggested changes from lintrunner.


Signed-off-by: liqunfu <liqun.fu@microsoft.com>

Signed-off-by: liqunfu <liqun.fu@microsoft.com>
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
Signed-off-by: liqunfu <liqun.fu@microsoft.com>

Signed-off-by: liqunfu <liqun.fu@microsoft.com>
@yihonglyu
Contributor

> Motivation and Context: improve GQA performance on Intel x64.
>
> with avx2 RoPE: Name: GroupQueryAttention_rotary, Mean Duration: 18.86, Percentage: 3.16%
> plain c++ RoPE: Name: GroupQueryAttention_rotary, Mean Duration: 82.42, Percentage: 12.20%

What is the unit of Mean Duration? Is it in milliseconds (ms)?

…rios, clear interleaved attribute

Signed-off-by: liqunfu <liqun.fu@microsoft.com>

@liqunfu
Contributor Author

liqunfu commented Feb 19, 2025

> What is the unit of Mean Duration? Is it in milliseconds (ms)?

In microseconds:

`long long dur = TimeDiffMicroSeconds(start_time);`

The results were from another PR that I am working on, but it cannot be mixed with this one due to the timing of the release.
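For reference, a minimal stand-in for the helper above, sketched with std::chrono (the actual onnxruntime test helper may differ in details):

```cpp
#include <chrono>

// Minimal equivalent of the TimeDiffMicroSeconds helper mentioned above:
// returns the elapsed time since `start_time` in microseconds.
using Clock = std::chrono::high_resolution_clock;

long long TimeDiffMicroSeconds(Clock::time_point start_time) {
  auto end_time = Clock::now();
  return std::chrono::duration_cast<std::chrono::microseconds>(
             end_time - start_time)
      .count();
}
```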

@liqunfu
Contributor Author

liqunfu commented Feb 19, 2025

> there should be unit tests to check the correctness

Added an MLAS test for RoPE. It only runs in x64 builds because that is the only machine I have. Tests for other scenarios can be enabled later.
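One useful property for such a test: RoPE applies a rotation to each pair of elements, so it must preserve the 2-norm of each pair whenever sin²+cos²=1. A minimal self-contained check of this property (illustrative, not the actual MLAS test):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Apply a scalar RoPE rotation (non-interleaved layout) with a single angle
// `theta` and verify that each rotated pair keeps its squared norm:
// a rotation must not change vector length.
bool RopePreservesPairNorms(const std::vector<float>& input, float theta) {
  const size_t half = input.size() / 2;
  const float s = std::sin(theta);
  const float c = std::cos(theta);
  for (size_t i = 0; i < half; ++i) {
    float a = input[i], b = input[i + half];
    float oa = a * c - b * s;
    float ob = a * s + b * c;
    float before = a * a + b * b;
    float after = oa * oa + ob * ob;
    if (std::fabs(before - after) > 1e-4f) return false;
  }
  return true;
}
```

A vectorized kernel can additionally be compared element-wise against a scalar reference over random inputs.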

Signed-off-by: liqunfu <liqun.fu@microsoft.com>
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
@yihonglyu yihonglyu left a comment

Could you add a microbenchmark under onnxruntime\test\mlas\bench or another suitable location and collect the performance metrics with and without the patch?
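For reference, the shape of such a measurement can be sketched with a plain timing loop (the actual harness under onnxruntime\test\mlas\bench is built on a benchmark framework; the names below are illustrative):

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Time `iters` runs of a RoPE-like kernel over a `dim`-sized buffer and
// return the mean duration per call in microseconds. `Kernel` is any
// callable with the signature
// (const float*, const float*, const float*, size_t, float*).
template <typename Kernel>
double BenchRope(Kernel kernel, size_t dim, int iters) {
  std::vector<float> input(dim, 1.0f), sin_data(dim / 2, 0.5f),
      cos_data(dim / 2, 0.5f), output(dim);
  auto start = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < iters; ++i) {
    kernel(input.data(), sin_data.data(), cos_data.data(), dim, output.data());
  }
  auto end = std::chrono::high_resolution_clock::now();
  double total_us =
      std::chrono::duration<double, std::micro>(end - start).count();
  return total_us / iters;
}
```

The baseline and patched kernels can then be run over the same dims (128..1024, interleaved and not) to produce a table like the one in the description.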

Signed-off-by: liqunfu <liqun.fu@microsoft.com>

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@liqunfu
Contributor Author

liqunfu commented Feb 20, 2025

> Could you add a microbenchmark under onnxruntime\test\mlas\bench or another suitable location and collect the performance metrics with and without the patch?

Added an mlas_rope benchmark.

Signed-off-by: liqunfu <liqun.fu@microsoft.com>
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
@liqunfu liqunfu merged commit af04b20 into main Feb 20, 2025
95 of 97 checks passed
@liqunfu liqunfu deleted the liqun/Intel-ROPE-kernel-to-use-AVX2 branch February 20, 2025 20:08
guschmue pushed a commit that referenced this pull request Mar 6, 2025
ashrit-ms pushed a commit that referenced this pull request Mar 17, 2025
tianleiwu added a commit that referenced this pull request Oct 24, 2025
This fixes an issue with the _mm256_maskload_ps intrinsic used in the remainder-handling logic introduced in #23694.

The core of the problem is that _mm256_maskload_ps (and its store equivalent) can read beyond the masked elements. Even if the mask correctly specifies that you only want to load, for example, 3 floats, the intrinsic may still read the full 32 bytes (8 floats) from the provided memory address.

The invalid access occurs when one of the buffers (input, sin_data, or cos_data) ends near the boundary of a memory page and the part of the 32-byte read that you don't care about (i.e., the masked-off part) falls on an unmapped page. This causes a segmentation fault (invalid access).

The solution: use a scalar remainder loop.

The simplest, safest, and most robust fix is to replace the masked AVX remainder logic with a plain scalar loop. This is the exact strategy already used by the RopeKernel_Avx2_fp16_Impl functions, which are safe from this bug.

The performance impact of this change is negligible, as this loop only processes the final 1-15 elements.
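The pattern of the fix can be sketched as follows (illustrative, not the exact MLAS code): process full 8-float chunks in the main loop, then fall back to a scalar loop for the trailing elements instead of a masked load/store.

```cpp
#include <cstddef>

// Sketch of the fixed remainder handling for a non-interleaved RoPE kernel.
// The 8-wide body is written in scalar form here for portability; in the
// real kernel it is AVX2 intrinsics (_mm256_loadu_ps / _mm256_storeu_ps),
// which only ever touch full in-bounds 32-byte spans.
void RopeKernelWithScalarRemainder(const float* input, const float* sin_data,
                                   const float* cos_data, size_t half_dim,
                                   float* output) {
  size_t i = 0;
  // Main loop over full 8-element chunks.
  for (; i + 8 <= half_dim; i += 8) {
    for (size_t j = i; j < i + 8; ++j) {
      float a = input[j], b = input[j + half_dim];
      output[j] = a * cos_data[j] - b * sin_data[j];
      output[j + half_dim] = a * sin_data[j] + b * cos_data[j];
    }
  }
  // Scalar remainder: handles the last few elements one at a time, so no
  // read can extend past the end of input, sin_data, or cos_data, no matter
  // how close the buffers end to a page boundary.
  for (; i < half_dim; ++i) {
    float a = input[i], b = input[i + half_dim];
    output[i] = a * cos_data[i] - b * sin_data[i];
    output[i + half_dim] = a * sin_data[i] + b * cos_data[i];
  }
}
```

Unlike _mm256_maskload_ps, the scalar tail accesses only the bytes it actually needs.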

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
naomiOvad pushed a commit to naomiOvad/onnxruntime that referenced this pull request Nov 2, 2025