[Bugfix] Add Multiple of 16 block_size to triton fallback on rocm Attention to support qwen3_5 by JartX · Pull Request #35923 · vllm-project/vllm

JartX · 2026-03-03T22:11:10Z

This PR adds multiple of 16 to the list of supported kernel block sizes in RocmAttentionBackend

When running Qwen3.5 models using the ROCM_ATTN backend, the model produces broken, nonsensical outputs (e.g., repeating exclamation marks like !!!!!!!!!!). This happens because Qwen3.5 utilizes a non-standard block size of 1056. Since this size was not explicitly permitted, the model failed to correctly route the value_cache through the optimized Triton kernel fallback (triton_reshape_and_cache_flash).

Signed-off-by: JartX <sagformas@epdcenter.es>

gemini-code-assist

Code Review

This pull request correctly adds support for the Qwen3.5 model on ROCm by including its non-standard block size of 1056 in the RocmAttentionBackend. This change is a simple and effective fix for the reported issue, allowing the model to use the appropriate Triton kernel fallback. The implementation is correct and I have no further suggestions for improvement.

Signed-off-by: JartX <sagformas@epdcenter.es>

mergify · 2026-03-03T22:38:01Z

Hi @JartX, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?

mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Signed-off-by: JartX <sagformas@epdcenter.es>

mergify · 2026-03-03T22:41:05Z

Documentation preview: https://vllm--35923.org.readthedocs.build/en/35923/

Signed-off-by: JartX <sagformas@epdcenter.es>

JartX · 2026-03-05T17:45:33Z

/gemini review

gemini-code-assist

Code Review

This pull request addresses a bug where Qwen3.5 models produced incorrect outputs on the ROCm backend. The fix correctly identifies that the non-standard block size was the issue and generalizes the supported block sizes for the ROCM_ATTN backend to any multiple of 16. This is a good change that improves robustness for future models. The corresponding documentation has also been updated. However, I've found a critical issue related to this change that could cause failures for other models.

gemini-code-assist · 2026-03-05T17:47:38Z

vllm/v1/attention/backends/rocm_attn.py

    def get_supported_kernel_block_sizes() -> list[int | MultipleOf]:
-        # ROCM paged attention kernel only supports block sizes 16 and 32
+        # ROCM paged attention native C++ kernel only supports block sizes 16 and 32
        # due to shared memory (LDS) constraints on AMD GPUs.
        # See csrc/rocm/attention.cu CALL_CUSTOM_LAUNCHER_BLK macro.
-
-        # However, The limitations in [16, 32] are reasonable for a native C++ kernel,
-        # but vLLM should allow support for non-standard sizes via the Triton path,
-        # as addressed in this PR: https://github.com/vllm-project/vllm/pull/31380,
-        # where the Triton kernel under rocm_atten does not support inference
-        # for a non-standard qwen3-next model with a block_size of 544.
-        # We have fixed the Triton kernel so that the standard model uses the original
-        # bit-addressing logic, while the non-standard model
-        # uses our optimized kernel logic.
-        return [16, 32, 544]
+        # However, vLLM allows support for any multiple of 16 via the Triton path.
+        # As addressed in PR: https://github.com/vllm-project/vllm/pull/31380,
+        # non-standard models (like qwen3-next with block_size 544, or qwen3_5
+        # with 784 and 1056) are dynamically routed to our optimized Triton kernel
+        # in `do_kv_cache_update`.
+        return [MultipleOf(16)]


While this change to allow any block size that is a multiple of 16 is correct for supporting models like Qwen3.5, it introduces a potential failure for other models.

The dispatch logic in do_kv_cache_update (lines 450-480) uses is_pow2 to decide whether to use the native C++ kernel or the Triton fallback. The native C++ kernel, as noted in the comments and confirmed in csrc/rocm/attention.cu, only supports block sizes of 16 and 32.

With this PR, a model using a block size that is a power of two but not 16 or 32 (e.g., 64) will be incorrectly routed to the native C++ kernel, which will then raise an error.

To fix this, the condition in do_kv_cache_update should be changed from if is_pow2: to if block_size in (16, 32):. This will ensure that only the explicitly supported block sizes are routed to the native kernel, and all others (including other powers of two) use the Triton fallback.

mergify · 2026-03-06T05:28:59Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @JartX.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

JartX · 2026-03-06T10:57:30Z

/gemini review

JartX · 2026-03-08T00:00:53Z

@AndreasKaratzas all test passed :)

AndreasKaratzas · 2026-03-08T00:03:18Z

That's great :) Unfortunately, even though my tag says "member" my approval won't turn your PR green (I only have read permissions 😅). I have forwarded your PR to the right channels.

JartX · 2026-03-08T00:16:32Z

@AndreasKaratzas many thanks ! Hahah :)

JartX · 2026-03-08T13:14:07Z

@tjtanaa Please check this out when you can :)

mergify · 2026-03-09T03:29:55Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @JartX.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

…attn

tjtanaa

LGTM

Signed-off-by: JartX <sagformas@epdcenter.es>

…ention to support qwen3_5 (vllm-project#35923) Signed-off-by: JartX <sagformas@epdcenter.es> Co-authored-by: akaratza <akaratza@amd.com> Co-authored-by: TJian <tunjian.tan@embeddedllm.com>

…ention to support qwen3_5 (vllm-project#35923) Signed-off-by: JartX <sagformas@epdcenter.es> Co-authored-by: akaratza <akaratza@amd.com> Co-authored-by: TJian <tunjian.tan@embeddedllm.com> Signed-off-by: Vinay Damodaran <vrdn@hey.com>

…ention to support qwen3_5 (vllm-project#35923) Signed-off-by: JartX <sagformas@epdcenter.es> Co-authored-by: akaratza <akaratza@amd.com> Co-authored-by: TJian <tunjian.tan@embeddedllm.com> Signed-off-by: EricccYang <yangyang4991@gmail.com>

New bugfix eval instances curated from sglang, vllm, and aiter repos: - aiter-mxfp4-rounding-fix (ROCm/aiter#2249) - sglang-json-nonfinite-fix (sgl-project/sglang#20714) - vllm-corrupt-image-400 (vllm-project/vllm#38253) - vllm-mxfp4-moe-fallback (vllm-project/vllm#35893) - vllm-rocm-attn-blocksize-qwen35 (vllm-project/vllm#35923) - vllm-rocm-cross-attn-dispatch (vllm-project/vllm#38450) - vllm-rocm-fused-moe-fix (vllm-project/vllm#36100) - vllm-rocm-lru-cache-fix (vllm-project/vllm#37547) - vllm-rocm-nonpow2-blocksize (vllm-project/vllm#31380) All instances validated end-to-end: test FAILS without fix, PASSES with fix. vllm instances use rocm/vllm-dev base image.

…ention to support qwen3_5 (vllm-project#35923) Signed-off-by: JartX <sagformas@epdcenter.es> Co-authored-by: akaratza <akaratza@amd.com> Co-authored-by: TJian <tunjian.tan@embeddedllm.com>

@alex

* feat(evals): add optimization instances, fix bugs, update model config - Add aiter-sigmoid-fastmath eval instance (Hard, aiter PR #1879) - Add aiter-mla-reduce-optimize eval instance (Very Hard, aiter PR #1896) - Fix curate_eval.py EVALS_DIR path bug (was evals/evals/instances) - Add AMDPILOT_MODEL_URL env var support in curate_eval.py, run_issue.py - Update all task.yaml model endpoints from hardcoded internal IP to localhost default (override via AMDPILOT_MODEL_URL env var) - Update README with correct instance count and harness classifications Made-with: Cursor * refactor(config): remove dev-specific model name, GPU from task.yaml - Remove hardcoded model: "qwen-3.5" from all 9 task.yaml files (model now comes from AMDPILOT_MODEL env var) - Remove hardcoded gpu: "0" from all task.yaml container sections (GPU now comes from AMDPILOT_GPU env var or --gpu CLI flag) - Fix classify_pr/classify_issue returning "fix" instead of "bugfix" - Remove model name from generated task.yaml templates Made-with: Cursor * fix(eval): remove data leak from sglang-fused-moe-fix test harness Replaced source-level inspection (checking for _is_cuda guards around get_global_server_args calls) with pure behavioral tests: module imports without NameError, function is accessible, and fresh subprocess import succeeds. The old harness explicitly told the agent what fix to apply. Made-with: Cursor * fix(evals): fix data leak, JIT rebuild, and test harness bugs in optimization evals - All 4 optimization Dockerfiles: delete pre-compiled JIT .so modules to force rebuild from reverted source (fixes incorrect cached binaries) - All 4 optimization Dockerfiles: squash git history after source revert to prevent agent from viewing optimized solution via git diff/log - aiter-mla-reduce-optimize: rewrite performance test to use persistent-mode MLA decode which actually exercises the reduce.cu kernel (old test measured PyTorch SDPA which is unrelated to the reduce kernel) - aiter-sigmoid-fastmath: fix test to use aiter.ops.aiter_operator.sigmoid instead of non-existent generic activation() function Made-with: Cursor * fix(evals): adjust test thresholds and fix edge cases - sigmoid: lower threshold to 18us (baseline ~22us, optimized ~15us), fix 1D tensor edge case, relax silu_and_mul tolerance - mla-reduce: adjust threshold to 25us (baseline ~28us, optimized ~23us) - all Dockerfiles: add git remote removal for complete data leak prevention Made-with: Cursor * fix(evals): adjust moe-align threshold from 140us to 175us The PR's optimization achieves ~170us for E=64 (our test config), making the original 140us target unreachable. The new 175us threshold requires meaningful optimization (~10% over baseline ~192us) while being achievable. Made-with: Cursor * feat(scripts): use platform auto-detection for base image in run_issue.py Replace hardcoded BASE_IMAGES dict with auto-detection from amdpilot.orchestrator.platform. The issue resolver now automatically selects the correct rocm/sgl-dev image for the host GPU and ROCm version. Made-with: Cursor * fix(evals): add JIT cache deletion to all aiter task descriptions The agent was modifying kernel .cu files but changes had no effect because AITER's JIT-compiled .so modules were not being rebuilt. Updated all 4 aiter optimization task descriptions to include the correct rebuild command with explicit JIT cache deletion. Root cause of MOE align failure: agent tried 19+ optimization strategies but none took effect because the old cached binary was always used. Made-with: Cursor * feat(eval): add sglang-glm5-optimize instance for MI355X Adapted from xsun_wip branch glm5-optimize job for 8× MI355X GPUs. Includes Dockerfile with sglang config patch for glm_moe_dsa model type, pre-built benchmark script, and focused task description. * fix(eval): use sglang 20260311 base image for GLM-5 AMD support The 20260311 image auto-selects tilelang NSA decode backend on AMD GPUs, fixing the flash_attn_with_kvcache NameError. Upgrade transformers in Dockerfile for native glm_moe_dsa config loading. Remove manual patch. * feat(eval): strengthen glm5 task to require source-level optimizations Expanded task description with concrete kernel-level targets: attention backend tuning, MoE dispatch profiling, all-reduce analysis, CUDA graph capture checks. Config-only tuning is explicitly marked as insufficient. * feat(sglang-kimi-moe-tune): add Tier 0 profiling requirement to test harness (#2) Restructured scoring from 10+10+80 to 15+10+10+65: - Tier 0 (15 pts): Profiling evidence -- checks for rocprof output files (results.stats.csv) and profiling references in optimization_state.json - Tier 3: Reduced from 80 to 65 pts, capped at 50% without profiling Without profiling, max achievable score drops from 100 to ~52.5. This forces the agent to actually run rocprof/torch.profiler instead of skipping straight to blind config tuning. Co-authored-by: jhinpan <311651cb+jhinpan@users.noreply.github.com> * fix(sglang-glm5-optimize): improve task description with backend detection guidance * fix(sglang-glm5-optimize): add benchmark timeout guidance for long-loading models * refactor(glm5): simplify task description, remove overfitting Remove step-by-step optimization walkthrough and leaked targets/baselines. Keep only environment description, benchmark instructions, and rules. Optimization knowledge belongs in agent skills, targets in supervisor stages. * feat(glm5): add fast benchmark script and enable thinking - Add bench_glm5_fast.sh: uses --disable-cuda-graph for faster iteration during profiling. Uses `| tee` instead of $() for real-time output streaming when run in background. - Enable kimi_cli.thinking: true — required for Qwen3.5 thinking model to properly return content via reasoning_content field. - Add fast_profile_command to task.yaml pointing to the fast script. * fix(glm5): simplify task spec — remove unnecessary fast benchmark Remove bench_glm5_fast.sh and fast_profile_command. The agent can use --disable-cuda-graph on its own if needed. Simplify task_description.md to match the original xsun_wip style. Keep thinking:true (required for Qwen3.5 on sglang). * fix(glm5): correct benchmark timing and add backend info Benchmark takes ~25 min (not 50-60 min). Document that bench_one_batch supports backend selection flags (--attention-backend, etc.) so agents know backends can be configured. * fix(glm5): source bench_config.env for reproducible verification The benchmark script now sources /workspace/bench_config.env if present. This lets the agent persist environment variables (e.g. backend selection) in a file that both the agent's run and the orchestrator's verification run will use, ensuring consistent results. * fix(glm5): update benchmark time estimate for local NVMe With model weights on local NVMe instead of NFS, first benchmark run takes ~5 minutes (was ~25 minutes on NFS). Update the executor guidance accordingly. * chore(glm5): document local NVMe volume path in task.yaml * feat(evals): add sglang-qwen-vl-optimize task instance Qwen3-VL serving throughput optimization on MI355X. SGLang has a 33% regression vs vLLM (1235 vs 1648 tok/s). Self-contained bench_serving benchmark, PYTHONPATH fix for sglang.benchmark.datasets, fork cloned to /workspace/sglang-fork/ to avoid namespace shadowing. * fix(evals): add timeouts to bench_qwen_vl.sh to prevent hangs bench_serving blocks indefinitely when the sglang server enters a stuck graceful-shutdown state (common with aiter backend on VL models). Three fixes: - Wrap both warmup and benchmark bench_serving calls with `timeout` (default 900s, configurable via BENCH_SERVING_TIMEOUT) - Use kill -9 in cleanup instead of SIGTERM (hung servers ignore SIGTERM) - Kill ALL sglang child processes (scheduler, detokenizer) and free the port with fuser on cleanup, not just the launch_server parent - Bump recommended agent timeout from 1200 to 2400s in task description * fix(evals): lock triton backend in bench_qwen_vl.sh, reframe task as regression fix The previous benchmark allowed the agent to override ATTENTION_BACKEND via bench_config.env, enabling a bypass (switch to aiter = 2000 tok/s) instead of fixing the actual triton regression (1235 tok/s). - Hardcode ATTENTION_BACKEND="triton" in the benchmark script - Remove ATTENTION_BACKEND from bench_config.env support - Rewrite task description: fix must be source-level changes to the triton attention path, not a backend switch - Update investigation areas to focus on triton kernel tuning, CUDA graph interaction, and VL-specific decode inefficiencies * add sglang-kimi-k25-optimize eval instance * update kimi-k25 task: switch executor to Kimi-K2.5, remove result leakage - Update model_endpoint to moonshotai/Kimi-K2.5 - Rewrite task_description.md: remove all prior-run result leakage, make optimization approach fully flexible (no backend restrictions), update delivery branch to v3 * fix eval tasks: remove deprecated instance, fix test harness - Remove aiter-moe-align-optimize instance (deprecated) - Fix vllm-ck-mxfp4-moe test harness * feat: LLM-powered SFT data curation pipeline Three-phase pipeline that internalizes nudge agent signals into executor trajectories for on-policy SFT training: Phase 1 (regex): Structurally identify and remove _steer tool calls + nudge tool results from the JSONL trajectory. Phase 2 (Claude opus): For each nudge, call Claude via AMD Gateway to rewrite the executor's thinking so it reads as independent reasoning. The LLM sees the full nudge content, prior context, and executor response as plain text (~3K chars per call). Never sees JSONL directly. Phase 3 (Claude opus validation): Final check that zero nudge traces remain. Only flags nudge-specific references — supervisor hints from retry_with_hints are preserved as legitimate inter-trial context. Tested on kimi-k25-optimize (25 nudges, 9 trials) and glm5-optimize (16 nudges, 2 trials). Zero nudge traces in curated output. * feat(evals): add 9 validated eval instances from merged PRs New bugfix eval instances curated from sglang, vllm, and aiter repos: - aiter-mxfp4-rounding-fix (ROCm/aiter#2249) - sglang-json-nonfinite-fix (sgl-project/sglang#20714) - vllm-corrupt-image-400 (vllm-project/vllm#38253) - vllm-mxfp4-moe-fallback (vllm-project/vllm#35893) - vllm-rocm-attn-blocksize-qwen35 (vllm-project/vllm#35923) - vllm-rocm-cross-attn-dispatch (vllm-project/vllm#38450) - vllm-rocm-fused-moe-fix (vllm-project/vllm#36100) - vllm-rocm-lru-cache-fix (vllm-project/vllm#37547) - vllm-rocm-nonpow2-blocksize (vllm-project/vllm#31380) All instances validated end-to-end: test FAILS without fix, PASSES with fix. vllm instances use rocm/vllm-dev base image. * Add 14 new AMD GPU eval instances, fix 7 existing, drop 2 broken New instances (14): - aiter: asm-pa-headsize-fix, lru-cache-pollution, nonpow2-blocksize-crash, splitk-buffer-fix - sglang: cutedsl-lazy-import, fp8-w8a8-gfx950-tune, kscale-vscale-fix, mla-ps-kernel-guard, shuffle-weight-attrs - vllm: aiter-import-fix, cache-stride-fix, mla-last-page-len, mla-nhead-fix, quark-dtype-fix, spec-decode-dispatch Fixed existing (7): - sglang-qwen35-rope-fix, sglang-rotary-crash: Dockerfile + test harness improvements - vllm-ck-mxfp4-moe, vllm-encoder-rocm: base image + metadata fixes - vllm-mxfp4-moe-fallback, vllm-rocm-attn-blocksize-qwen35, vllm-rocm-nonpow2-blocksize: test harness rewrites Dropped (2): - sglang-fused-moe-fix: bug untestable in container (triton deps fail before buggy line) - sglang-kimi-moe-tune: optimization, not a bug fix All 28 remaining instances validated: score < 100 without fix, score = 100 with fix. * Add 5 validated vLLM ROCm eval instances (follow-up) premature-cuda-init, dynamo-arch-crash, cache-blocksize-backend, aiter-headsize-fallback, slidingwin-cudagraph-fix — all validated pre-fix <100%, post-fix 100%. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Fix data leak: clean up FETCH_HEAD and git history in all Dockerfiles The merge commit (containing the fix diff) was accessible via FETCH_HEAD and git reflog after checkout. An agent could trivially cat .git/FETCH_HEAD then git show to see the solution. Fix: remove FETCH_HEAD, delete origin remote, expire reflog, and gc prune after checkout in all 33 affected Dockerfiles. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Fix splitk harness buffer formula to match aiter _moe_sorting_impl (#6) * Fix splitk harness buffer formula to match aiter _moe_sorting_impl The test harness check 6a used a ceil-based formula (ceil(token_num*topk/block_m)*block_m) to compute sorted_token_ids length, but the actual aiter kernel uses a different formula from _moe_sorting_impl: token_num*topk + num_experts*block_m - topk. These formulas diverge on many parameter combinations (e.g., DeepSeek V3 decode: tn=1, tk=8, bm=4, ne=8 gives ceil=8 vs aiter=32), making the harness unreliable. Changes: - Align check 6 sorted_len formula with _moe_sorting_impl - Add check 7: diverging formula regression (DeepSeek decode params) where ceil formula incorrectly reports no overflow but actual formula correctly detects it (32 > 8) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Check 7: assert actual formula divergence (ceil vs aiter) Per review: Check 7 previously only computed the aiter formula, so it would pass even if the harness reverted to ceil. Now computes both paths and asserts the divergence explicitly: - ceil_overflow=False (ceil(8/4)*4 = 8, no overflow) - aiter_overflow=True (8 + 8*4 - 8 = 32, overflow detected) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * Add aiter-blockscale-stream-fix eval instance FP8 GEMM kernel does not respect caller HIP stream context, breaking non-default stream callers. Source inspection + runtime harness validates the fix (aiter PR #2520). * fix: strip solution leakage from 33 eval instances Remove Fix: docstrings, Affected Files sections, and inline file path references that disclosed solutions to the agent. Instances now describe symptoms only, requiring independent diagnosis. Changes across 33 instances (23 LEAK + 10 BORDERLINE): - 18 harness docstrings: stripped Fix: paragraphs - 26 task descriptions: removed Affected Files sections - 10 task descriptions: replaced inline file paths with generic refs - 7 instances: individual task description and docstring rewrites * fix(eval): harden sglang-speculative-decode-fix harness for intermittent bug - Increase prompts from 8 to 20 to reduce false positive probability (from ~2.3% to <10^-8 per round with 62.5% per-prompt coherent rate) - Require 2 consecutive passing rounds for score 100.0 - Remove root-cause hint from task description (purely behavioral now) Previous false positive: agent scored 100.0 with zero code changes on unmodified codebase because 8/8 prompts happened to pass by chance. * fix(eval): restart server between rounds for independent evidence Address review feedback from @alex: - Each round now starts a fresh server and shuts it down afterward, ensuring rounds are independent (different GPU init, stream state) - Remove "non-speculative inference under TP=2" clue from task description — not source-grounded in the original issue report --------- Co-authored-by: Jinn <47354855+jhinpan@users.noreply.github.com> Co-authored-by: jhinpan <311651cb+jhinpan@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

add 1056 block_size to triton fallback

90e602f

Signed-off-by: JartX <sagformas@epdcenter.es>

JartX requested review from gshtras and tjtanaa as code owners March 3, 2026 22:11

mergify bot added qwen Related to Qwen models rocm Related to AMD ROCm v1 bug Something isn't working labels Mar 3, 2026

github-project-automation bot added this to AMD Mar 3, 2026

github-project-automation bot moved this to Todo in AMD Mar 3, 2026

gemini-code-assist bot reviewed Mar 3, 2026

View reviewed changes

precommit

2dbcd4b

Signed-off-by: JartX <sagformas@epdcenter.es>

precommit doc

bd4b501

Signed-off-by: JartX <sagformas@epdcenter.es>

mergify bot added the documentation Improvements or additions to documentation label Mar 3, 2026

Clarify comment on non-standard model block sizes

6212617

Signed-off-by: JartX <sagformas@epdcenter.es>

JartX force-pushed the bugfix/qwen35_rocm_attn branch from c07fc35 to 6212617 Compare March 4, 2026 10:28

JartX added 2 commits March 4, 2026 12:09

qwen3.5 27b

9c161ae

Signed-off-by: JartX <sagformas@epdcenter.es>

qwen3.5 27b

5848396

Signed-off-by: JartX <sagformas@epdcenter.es>

JartX changed the title ~~[Bugfix] Add 1056 block_size to triton fallback on rocm Attention to support qwen3_5~~ [Bugfix] Add 784,1056 block_size to triton fallback on rocm Attention to support qwen3_5 Mar 4, 2026

JartX changed the title ~~[Bugfix] Add 784,1056 block_size to triton fallback on rocm Attention to support qwen3_5~~ [Bugfix] Add 784 and 1056 block_size to triton fallback on rocm Attention to support qwen3_5 Mar 4, 2026

allow multiple of 16 via triton path

1bbc515

Signed-off-by: JartX <sagformas@epdcenter.es>

JartX changed the title ~~[Bugfix] Add 784 and 1056 block_size to triton fallback on rocm Attention to support qwen3_5~~ [Bugfix] Add Multiple of 16 block_size to triton fallback on rocm Attention to support qwen3_5 Mar 5, 2026

gemini-code-assist bot reviewed Mar 5, 2026

View reviewed changes

mergify bot added the needs-rebase label Mar 6, 2026

Merge branch 'main' into bugfix/qwen35_rocm_attn

c12fe6f

mergify bot removed the needs-rebase label Mar 6, 2026

JartX force-pushed the bugfix/qwen35_rocm_attn branch from 4f16fcd to cd8be20 Compare March 7, 2026 20:02

Merge branch 'main' into bugfix/qwen35_rocm_attn

bffc779

mergify bot added the needs-rebase label Mar 9, 2026

Merge remote-tracking branch 'upstream/main' into bugfix/qwen35_rocm_…

3e3f037

…attn

mergify bot removed the needs-rebase label Mar 9, 2026

JartX added 2 commits March 9, 2026 18:44

Merge branch 'main' into bugfix/qwen35_rocm_attn

c832a9a

Merge branch 'main' into bugfix/qwen35_rocm_attn

d987032

tjtanaa approved these changes Mar 10, 2026

View reviewed changes

Merge branch 'main' into bugfix/qwen35_rocm_attn

95ad792

tjtanaa enabled auto-merge (squash) March 10, 2026 13:33

remove redundant code

cc9db38

Signed-off-by: JartX <sagformas@epdcenter.es>

auto-merge was automatically disabled March 10, 2026 22:02
Head branch was pushed to by a user without write access

AndreasKaratzas mentioned this pull request Mar 10, 2026

[CI Failure][ROCm]: CrossLayer KV layout Distributed NixlConnector PD accuracy tests (4 GPUs) #35132

Closed

3 tasks

Merge branch 'main' into bugfix/qwen35_rocm_attn

eb6d6f5

DarkLight1337 enabled auto-merge (squash) March 11, 2026 04:26

DarkLight1337 merged commit a40ee48 into vllm-project:main Mar 11, 2026
53 checks passed

github-project-automation bot moved this from Todo to Done in AMD Mar 11, 2026

Uh oh!

Conversation

JartX commented Mar 3, 2026 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

mergify bot commented Mar 3, 2026

Uh oh!

mergify bot commented Mar 3, 2026

Uh oh!

JartX commented Mar 5, 2026

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

mergify bot commented Mar 6, 2026

Uh oh!

JartX commented Mar 6, 2026

Uh oh!

JartX commented Mar 8, 2026

Uh oh!

AndreasKaratzas commented Mar 8, 2026

Uh oh!

JartX commented Mar 8, 2026

Uh oh!

JartX commented Mar 8, 2026

Uh oh!

mergify bot commented Mar 9, 2026

Uh oh!

tjtanaa left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

JartX commented Mar 3, 2026 •

edited by github-actions bot

Loading