
[DML EP] Support partial rotary embedding#22417

Merged
PatriceVignola merged 3 commits into main from
user/pavignol/support-partial-rotary-embedding
Oct 16, 2024

Conversation

@PatriceVignola
Contributor

@PatriceVignola PatriceVignola commented Oct 12, 2024

Description

This adds support for partial RotaryEmbedding to DML. Essentially, partial RotaryEmbedding consists of running the rotary embedding calculation on a subregion of the input tensor as if its head size were `rotary_embedding_dim`, while leaving the second part of the tensor (i.e. `head_size - rotary_embedding_dim`) alone.

To achieve this, we follow these steps:

  1. Split the tensor into 2 parts
  2. Run the rotary embedding algorithm on the first part, just like we were doing before on the entire tensor
  3. Join the 2 parts back together

Since we're leaving the second part intact, the RotaryEmbedding fusion will still be done within DML. Also, the concat at the end is essentially free because DML optimizes it out and directly allocates the result of RotaryEmbedding in the right place. The only overhead here is the splitting of the tensor at the beginning, which we should eventually make part of the RotaryEmbedding fusion within DML.
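The three steps above can be sketched as a plain NumPy reference (a minimal illustration, not the DML implementation; the function name, the non-interleaved "half" rotation layout, and the tensor shapes are assumptions for the example):

```python
import numpy as np

def partial_rotary_embedding(x, cos, sin, rotary_dim):
    """Apply rotary embedding to the first `rotary_dim` channels of each head.

    x:        (batch, num_heads, seq_len, head_size)
    cos, sin: (seq_len, rotary_dim // 2) precomputed rotation tables
    """
    # 1. Split the tensor into the rotary part and the pass-through part.
    x_rot, x_pass = x[..., :rotary_dim], x[..., rotary_dim:]

    # 2. Run the usual rotary embedding on the first part only,
    #    exactly as if its head size were rotary_dim.
    x1 = x_rot[..., : rotary_dim // 2]
    x2 = x_rot[..., rotary_dim // 2 :]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x2 * cos + x1 * sin], axis=-1)

    # 3. Join the two parts back together; the pass-through
    #    channels come out unchanged.
    return np.concatenate([rotated, x_pass], axis=-1)
```

In DML terms, step 1 is the extra split mentioned above, and step 3 is the concat that gets optimized away by writing the rotated result directly into place.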

Motivation and Context

This fix allows us to correctly run models that have a `partial_rotary_factor` setting on Hugging Face, including Nvidia's Nemotron: https://huggingface.co/nvidia/Nemotron-Mini-4B-Instruct
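For context, `partial_rotary_factor` is the fraction of each head that gets rotated; the rest of the head passes through untouched. A hedged sketch of the relationship (the helper name and the example numbers are illustrative, not taken from any specific model's config):

```python
def rotary_embedding_dim(head_size: int, partial_rotary_factor: float) -> int:
    """Number of channels per head that the rotary embedding touches.

    Follows the Hugging Face config convention of expressing the rotary
    region as a fraction of head_size; the exact rounding a given
    exporter uses may differ.
    """
    return int(head_size * partial_rotary_factor)

# Example: a head size of 128 with partial_rotary_factor = 0.5 rotates
# the first 64 channels and leaves the remaining 64 alone.
```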

Contributor

@github-actions github-actions Bot left a comment


You can commit the suggested changes from lintrunner.

Comment thread onnxruntime/test/contrib_ops/rotary_embedding_op_test.cc
BLSharda added a commit to BLSharda/onnxruntime-genai that referenced this pull request Oct 12, 2024
Patrice has raised a PR that fixes RotaryEmbedding for the partial factor in ORT, so the workaround is no longer needed.
microsoft/onnxruntime#22417
@sumitsays
Contributor

If possible, can we have the ASCII DML graph? It would be helpful for understanding the implementation.


Refers to: onnxruntime/core/providers/dml/DmlExecutionProvider/src/Operators/DmlOperatorRotaryEmbedding.cpp:27 in 7370243.

sumitsays
sumitsays previously approved these changes Oct 15, 2024
Contributor

@sumitsays sumitsays left a comment


:shipit:

@sumitsays sumitsays dismissed their stale review October 15, 2024 23:51

revoking review

@PatriceVignola PatriceVignola dismissed github-actions[bot]’s stale review October 16, 2024 01:38

It would change the existing style of a test that we didn't add. All we did was re-enable a DML test.

@PatriceVignola PatriceVignola merged commit f610605 into main Oct 16, 2024
@PatriceVignola PatriceVignola deleted the user/pavignol/support-partial-rotary-embedding branch October 16, 2024 20:28
BLSharda added a commit to BLSharda/onnxruntime-genai that referenced this pull request Oct 17, 2024
Patrice has raised a PR that fixes RotaryEmbedding for the partial factor in ORT, so the workaround is no longer needed.
microsoft/onnxruntime#22417
guschmue pushed a commit that referenced this pull request Oct 18, 2024
apsonawane pushed a commit that referenced this pull request Oct 21, 2024
@sophies927 sophies927 added the cherry-picked Cherry-picked for a cherrypicks branch label Oct 22, 2024
@snnn
Contributor

snnn commented Sep 5, 2025

This PR has been cherry-picked into the rel-1.20.0 branch in PR #22526. Removing the release:1.20.0 label.
