
[Attention] Refactor CUDA attention backend selection logic #24794

Merged
mgoin merged 121 commits into vllm-project:main from MatthewBonanni:backend_selection_refactor on Nov 11, 2025
Conversation

@MatthewBonanni
Collaborator

@MatthewBonanni MatthewBonanni commented Sep 13, 2025

Purpose

CudaPlatformBase.get_attention_backend_cls has gotten complex and messy over time. This PR cleans up the logic (without changing the behavior) and standardizes the interface.
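For illustration only, here is a minimal sketch of the kind of priority-ordered, standardized selection flow a refactor like this aims for. The names `AttentionSelectorConfig`, `select_backend`, and `_supports`, as well as the specific support checks, are hypothetical and are not taken from this PR's diff.

```python
# Hypothetical sketch of a standardized backend-selection flow; names and
# support checks are illustrative, not the actual implementation in this PR.
from dataclasses import dataclass
from enum import Enum, auto


class AttentionBackendEnum(Enum):
    FLASHINFER = auto()
    FLASH_ATTN = auto()
    TRITON_ATTN = auto()


@dataclass
class AttentionSelectorConfig:
    head_size: int
    dtype: str
    use_mla: bool
    compute_capability: tuple[int, int]


def select_backend(cfg: AttentionSelectorConfig) -> AttentionBackendEnum:
    """Walk an ordered priority list and return the first supported backend."""
    priority = [
        AttentionBackendEnum.FLASHINFER,
        AttentionBackendEnum.FLASH_ATTN,
        AttentionBackendEnum.TRITON_ATTN,
    ]
    for backend in priority:
        if _supports(backend, cfg):
            return backend
    raise ValueError("No attention backend supports the current configuration")


def _supports(backend: AttentionBackendEnum, cfg: AttentionSelectorConfig) -> bool:
    # Placeholder checks standing in for per-backend validation
    # (head size, dtype, compute capability, MLA support, etc.).
    if backend is AttentionBackendEnum.FLASHINFER:
        return cfg.compute_capability >= (8, 0)
    if backend is AttentionBackendEnum.FLASH_ATTN:
        return cfg.head_size % 8 == 0
    return True
```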

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added the rocm, speculative-decoding, v1, and tpu labels Sep 13, 2025
@mergify

mergify bot commented Sep 13, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @MatthewBonanni.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 13, 2025
@MatthewBonanni MatthewBonanni changed the title [Attention] Refactor CUDA attention backend selection logic [WIP][Attention] Refactor CUDA attention backend selection logic Sep 13, 2025
@MatthewBonanni MatthewBonanni changed the title [WIP][Attention] Refactor CUDA attention backend selection logic [Attention] Refactor CUDA attention backend selection logic Sep 16, 2025
@MatthewBonanni MatthewBonanni marked this pull request as ready for review September 16, 2025 13:22
@mergify mergify bot removed the needs-rebase label Sep 16, 2025
Collaborator

@LucasWilkinson LucasWilkinson left a comment


Left a few comments; we should figure out who the owner of the plugin mechanism is and how to notify downstream HW plugins, since I think this will affect them pretty dramatically.

@MatthewBonanni
Collaborator Author

@LucasWilkinson thanks for your review! I've already notified @ILikeIneine, but I'm not sure whether there's anyone else we should reach out to.

@ILikeIneine
Contributor

@MatthewBonanni Hi, will this refactor be able to make it into v0.11.1?

@MatthewBonanni
Collaborator Author

@ILikeIneine we were planning on waiting until after v0.11.1; we don't want to risk further delaying the release, and since this changes the platform interface, it might be better for it to be part of v0.12.

@LucasWilkinson
Collaborator

@MatthewBonanni how hard would it be to keep backwards compatibility between _Backend and AttentionBackendEnum for one version, with a deprecation warning?

@LucasWilkinson
Collaborator

With #26487 potentially in the pipeline, what do we think about having a get_mla_attn_backend_cls instead of is_mla? @Yikun?
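For context, a rough sketch of the two interface shapes being discussed is below. The method names mirror the comment above, but the signatures and backend paths are illustrative only, not the actual vLLM platform API.

```python
# Hedged sketch contrasting an is_mla flag vs. a dedicated MLA hook;
# signatures and backend paths here are hypothetical.

class PlatformWithFlag:
    # Flag-style: one hook that branches on use_mla internally.
    @classmethod
    def get_attn_backend_cls(cls, head_size: int, dtype: str, use_mla: bool) -> str:
        if use_mla:
            return "vllm.attention.backends.mla.MLABackend"  # hypothetical path
        return "vllm.attention.backends.flash_attn.FlashAttentionBackend"  # hypothetical path


class PlatformWithDedicatedHook:
    # Hook-style: a separate method so MLA selection can evolve independently.
    @classmethod
    def get_attn_backend_cls(cls, head_size: int, dtype: str) -> str:
        return "vllm.attention.backends.flash_attn.FlashAttentionBackend"  # hypothetical path

    @classmethod
    def get_mla_attn_backend_cls(cls, head_size: int, dtype: str) -> str:
        return "vllm.attention.backends.mla.MLABackend"  # hypothetical path
```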

@MatthewBonanni
Collaborator Author

> @MatthewBonanni how hard would it be to keep backwards compatibility between _Backend and AttentionBackendEnum for one version, with a deprecation warning?

@LucasWilkinson done in d0f4698

@NickLucche
Collaborator

Discussed offline. Thanks for the work, @MatthewBonanni!

```python
        return AttentionBackendEnum[name]


class _Backend(metaclass=_BackendMeta):
```
Contributor


Nice change
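As a minimal sketch of what a compatibility shim along these lines could look like, assuming _BackendMeta redirects legacy _Backend member access to AttentionBackendEnum with a deprecation warning; this is illustrative and not the exact code from d0f4698:

```python
# Illustrative shim, assuming the metaclass forwards old _Backend attribute
# access to AttentionBackendEnum and warns; not the exact code from d0f4698.
import warnings
from enum import Enum


class AttentionBackendEnum(Enum):
    FLASH_ATTN = "FLASH_ATTN"
    FLASHINFER = "FLASHINFER"


class _BackendMeta(type):
    def __getattr__(cls, name: str):
        warnings.warn(
            "_Backend is deprecated; use AttentionBackendEnum instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return AttentionBackendEnum[name]


class _Backend(metaclass=_BackendMeta):
    """Deprecated alias kept around for one release for backwards compatibility."""


# Old code keeps working but now emits a DeprecationWarning:
backend = _Backend.FLASH_ATTN
assert backend is AttentionBackendEnum.FLASH_ATTN
```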

Member

@mgoin mgoin left a comment


Release has been cut, let's go for it on main

@hmellor
Member

hmellor commented Nov 11, 2025

The merge commit of this PR failed pre-commit because the base of the branch was out of date.


Labels

ci/build, deepseek, documentation, frontend, gpt-oss, kv-connector, multi-modality, new-model, nvidia, performance, qwen, ready, rocm, speculative-decoding, structured-output, tpu, v1

Projects

Status: Done

10 participants