
Minor improvements to token_type_ids extension for PA#34661

Merged
p-wysocki merged 38 commits into openvinotoolkit:master from p-wysocki:attn_fixes
Apr 3, 2026

Conversation

@p-wysocki
Contributor

Details:

Tickets:

  • N/A

Signed-off-by: p-wysocki <przemyslaw.wysocki@intel.com>
…nto attn_idea_2

…into attn_idea_2

@p-wysocki p-wysocki requested a review from a team as a code owner March 12, 2026 11:34
@github-actions github-actions bot added the category: Core (OpenVINO Core, aka ngraph) and category: transformations (OpenVINO Runtime library - Transformations) labels Mar 12, 2026
// Shared flag to track whether the model is Gemma3, set when any layer matches
// the gptoss_gemma3 sliding window pattern. Combined with the token_type_ids check,
// this uniquely identifies Gemma3 (gpt-oss shares the pattern but lacks token_type_ids).
auto is_gptoss_gemma3 = std::make_shared<bool>(false);
Contributor

Can we define this variable inside the callback?

Contributor

Agree, it does look strange that it has to be defined outside.

Contributor Author

Gemma3 has a repeating sequence of attention layers: 5x sliding window attention, 1x full attention. The pattern we currently have detects sliding window attention, but token_type_ids has to be passed to the full attention layers as well.

has_token_type_ids is defined outside of the callback as a shared_ptr because it has to stay consistent across all lambda callbacks - since the lambdas capture by value (=), each callback gets its own copy of the shared pointer to the same object. Without it, token_type_ids would be routed only to the sliding window PAs, and not to the full attention PAs.

Technically we could detect the full attention pattern to avoid the gpt-oss/gemma3 mixup and do the same trick, but then the first 5x sliding window attentions would not receive the token_type_ids input, because only the first full attention layer (6th in line) would set the variable to true.

Summing up, it may not be as clean as I'd like it to be, but it works. If you insist that this piece of code will cause issues, I can keep looking for a universal pattern which would:

  1. separate gpt-oss and gemma3
  2. work for both sliding window and full attention layers
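
The shared-state trick described above can be sketched in plain C++ (a minimal, hypothetical mock of the matcher callbacks, not OpenVINO code): a bool behind a shared_ptr is captured by value into several callbacks, and a write made by one callback is visible to the others because every copy of the shared_ptr points at the same object.

```cpp
#include <functional>
#include <memory>
#include <vector>

// Returns true iff a write to the shared flag made inside one callback
// is observed by a later callback - the persistence the comment relies on.
bool flag_persists_across_callbacks() {
    // Defined outside the callbacks, as in the transformation.
    auto has_token_type_ids = std::make_shared<bool>(false);

    std::vector<std::function<void()>> callbacks;
    // Capture by value (=): each lambda copies the shared_ptr,
    // but all copies refer to the same underlying bool.
    callbacks.push_back([=] { *has_token_type_ids = true; });  // e.g. sliding window layer match
    bool observed = false;
    callbacks.push_back([=, &observed] { observed = *has_token_type_ids; });  // e.g. full attention layer

    for (auto& cb : callbacks) cb();
    return observed;
}
```

If the flag were a plain bool captured by value, each callback would get an independent copy and the second callback would still see false.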

Contributor

Yeah, it's a little dirty solution, but if there's no other option, I believe we can live with it.

I can keep looking for a universal pattern

Any ideas of what this could be?

Contributor Author

@p-wysocki p-wysocki Apr 1, 2026

I'll be improving the GenAI integration next week, I'll give finding a better pattern another go, as I'll be modifying this file anyway. If I find it, the whole thing will be solved gracefully. For now IMO the PR can be merged, as overall it's a net positive change over master.

sliding_window = std::make_shared<v1::Subtract>(v0::Constant::create(element::i32, Shape{}, {2}), offset);
} else if (pattern_map.count(gptoss_gemma3_offset)) {
*is_gptoss_gemma3 = true;
is_gemma3 = optional_model_wide_params.count("token_type_ids");
Contributor

In fact, any model with token_type_ids and a matching sliding window pattern will set this is_gemma3 flag to true, so why not simply name this variable has_token_type_ids?
Or set has_sliding_window here instead, and use it below.
Also, is_gemma3 will currently be false for the causal mask case (no sliding window) within the same model.

Contributor Author

I renamed the variable, and regarding the no sliding window case, the explanation is provided in #34661 (comment).

Comment on lines 760 to 763
if (is_gemma3) {
pa_arguments.insert(pa_arguments.begin() + 25, handle_gemma3_token_type_ids(optional_model_wide_params));
} else {
pa_arguments.insert(pa_arguments.begin() + 25, v0::Constant::create(element::i32, Shape{0}, {}));
Contributor

The variable naming is tied to gemma3, but it can be generic for any model where both has_token_type_ids and has_sliding_window are true.
It is currently applied for the sliding_window case only, but as a next step it could be extended to the causal case as well; then this if/else would reduce to a single case:

pa_arguments.insert(pa_arguments.begin() + 25, handle_token_type_ids(optional_model_wide_params));

Suggested change
-if (is_gemma3) {
-    pa_arguments.insert(pa_arguments.begin() + 25, handle_gemma3_token_type_ids(optional_model_wide_params));
-} else {
-    pa_arguments.insert(pa_arguments.begin() + 25, v0::Constant::create(element::i32, Shape{0}, {}));
+if (has_sliding_window) {
+    pa_arguments.insert(pa_arguments.begin() + 25, handle_token_type_ids(optional_model_wide_params));
+} else {
+    pa_arguments.insert(pa_arguments.begin() + 25, v0::Constant::create(element::i32, Shape{0}, {}));

Contributor Author

I changed the variable name. token_type_ids currently works for the causal case as well, see #34661 (comment).

…into attn_fixes

Signed-off-by: p-wysocki <przemyslaw.wysocki@intel.com>
Copilot AI left a comment

Pull request overview

This PR refines Gemma3 token_type_ids handling for the SDPA→PagedAttention transformation and strengthens PagedAttentionExtension type-propagation coverage around the newly-supported token_type_ids ranks.

Changes:

  • Add type-prop tests validating token_type_ids acceptance for rank-1/rank-2, dynamic shape, and invalid type/rank cases.
  • Simplify token_type_ids retrieval/conversion in the Gemma3 path by assuming presence when the Gemma3 condition is met and avoiding an internal fallback.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
src/core/tests/type_prop/paged_attention.cpp Adds dedicated type-prop tests for token_type_ids rank/type validation.
src/common/transformations/src/transformations/sdpa_to_paged_attention/state_management_pattern.cpp Adjusts Gemma3 detection flagging and streamlines token_type_ids handling (convert-to-i32 when needed).

Comment on lines +438 to +442
// Set to true once a sliding_attention layer matching the gptoss_gemma3 pattern is found
// alongside a token_type_ids model input - the combination that uniquely identifies Gemma3
// since pattern for full attention mask in Gemma3 is different than sliding window
// it has to be persistent in the callback, so shared_ptr is used
auto has_token_type_ids = std::make_shared<bool>(false);
Copilot AI Mar 31, 2026

[LOW] has_token_type_ids is used as a persisted “Gemma3 detected / enable token_type_ids wiring” flag (it’s only updated when the gptoss_gemma3 sliding-window pattern matches), so the name is misleading—there are cases where the model may have a token_type_ids input but this flag stays false until that pattern is seen. Consider renaming it to something like is_gemma3 / enable_gemma3_token_type_ids to reflect the actual semantics and reduce the chance of future misuse.

Copilot generated this review using guidance from repository custom instructions.
Comment on lines +201 to +207
 static std::shared_ptr<ov::Node> handle_gemma3_token_type_ids(
     const std::map<std::string, std::shared_ptr<v0::Parameter>>& optional_model_wide_params) {
-    if (optional_model_wide_params.find("token_type_ids") != optional_model_wide_params.end()) {
-        auto param = optional_model_wide_params.at("token_type_ids");
-        if (param->get_element_type() != ov::element::i32) {
-            return std::make_shared<v0::Convert>(param, ov::element::i32);
-        }
-        return param;
+    auto param = optional_model_wide_params.at("token_type_ids");
+    if (param->get_element_type() != ov::element::i32) {
+        return std::make_shared<v0::Convert>(param, ov::element::i32);
+    }
-    return v0::Constant::create(ov::element::i32, ov::Shape{0}, {});
+    return param;
Contributor

Now this helper looks unsafe, as it can be used without the pre-check that the token_type_ids input exists. Also, it just adds a Convert if the type is not aligned, which is done for other inputs as well, offsets for example. It's not unique to Gemma, so maybe this handle_gemma3_token_type_ids helper can just be skipped/removed in this PR, and as a separate contribution a common apply_convert helper can be added and reused for the other inputs as well.

auto offset = pattern_map.at(phi3_offset).get_node_shared_ptr();
if (offset->get_element_type() != element::i32) {
offset = std::make_shared<v0::Convert>(offset, element::i32);
}
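
A common helper along these lines - the name apply_convert is the reviewer's suggestion, and the tiny Node stand-in below is purely hypothetical, used instead of the real ov::Node/v0::Convert types to keep the sketch self-contained - would wrap a node in a Convert only when its element type differs from the target:

```cpp
#include <memory>
#include <string>

// Minimal stand-in for an OpenVINO node: an element type plus,
// for a Convert wrapper, a pointer to its input node.
struct Node {
    std::string element_type;
    std::shared_ptr<Node> input;  // non-null only for a Convert wrapper
};

// Stand-in for std::make_shared<v0::Convert>(n, target).
std::shared_ptr<Node> make_convert(std::shared_ptr<Node> n, const std::string& target) {
    return std::make_shared<Node>(Node{target, n});
}

// The proposed generic helper: insert a Convert only when the types
// differ, mirroring the inline pattern used for offsets above.
std::shared_ptr<Node> apply_convert(std::shared_ptr<Node> n, const std::string& target) {
    if (n->element_type != target) {
        return make_convert(n, target);
    }
    return n;  // already the target type: pass through unchanged
}
```

With such a helper, the offset snippet above and the token_type_ids handling would both collapse to a single apply_convert call.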

Contributor Author

The helper has been inlined; the PR now uses just has_token_type_ids, and the convert/insertion is done inline instead of being handled by a util.

 OPENVINO_ASSERT(pa_arguments.size() == 25);

-if (*is_gptoss_gemma3) {
+if (*has_token_type_ids) {
Contributor

This handle_gemma3_token_type_ids brings more confusion than clarity; it just converts the type, without any model-specific logic. So I would recommend renaming the helper to something more generic like "apply_convert" and reusing it along the transformation, or for now just putting the Convert insertion here explicitly, as is done for other cases like offsets:

Suggested change
-if (*has_token_type_ids) {
+if (*has_token_type_ids) {
+    std::shared_ptr<ov::Node> token_type_ids = optional_model_wide_params.at("token_type_ids");
+    if (token_type_ids->get_element_type() != ov::element::i32) {
+        token_type_ids = std::make_shared<v0::Convert>(token_type_ids, ov::element::i32);
+    }
+    pa_arguments.insert(pa_arguments.begin() + 25, token_type_ids);

Contributor Author

Applied, the logic has been simplified.

Signed-off-by: p-wysocki <przemyslaw.wysocki@intel.com>
@p-wysocki p-wysocki requested a review from mitruska April 1, 2026 12:40
@mlukasze mlukasze enabled auto-merge April 1, 2026 19:08
@mlukasze mlukasze added this pull request to the merge queue Apr 3, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 3, 2026
@p-wysocki p-wysocki added this pull request to the merge queue Apr 3, 2026
Merged via the queue into openvinotoolkit:master with commit b286a61 Apr 3, 2026
226 of 228 checks passed
@p-wysocki p-wysocki deleted the attn_fixes branch April 3, 2026 16:08