[ROCm][CI] Fix cross-attention dispatch for encoder-decoder models #38450

Merged
vllm-bot merged 2 commits into vllm-project:main from ROCm:akaratza_fix_cross_attention
Mar 29, 2026

Conversation

@AndreasKaratzas
Collaborator

Cross-attention layers in encoder-decoder models (Whisper, BART, etc.) produce incorrect beam search results with the ROCM_ATTN and ROCM_AITER_FA backends. This PR removes ENCODER_DECODER from their supports_attn_type so that backend dispatch selects a working backend for cross-attention layers instead.

Motivation

The test test_whisper_beam_search_single_beam fails on ROCm because single-beam beam search does not match greedy decoding. The mismatch is caused by two backends computing cross-attention incorrectly when max_query_len > 1.

Greedy decoding is unaffected because max_query_len=1 skips the faulty code path in both backends.

Observed results

WER comparison across all ROCm backends on Whisper (openai/whisper-large-v3-turbo), transcribing mary_had_lamb audio. Greedy and beam search (n=1) should produce identical output:

| Backend | Greedy == Beam(n=1) |
|---|---|
| ROCM_ATTN | No |
| ROCM_AITER_FA | No |
| ROCM_AITER_UNIFIED_ATTN | Yes |
| TRITON_ATTN | Yes |

Example difference for ROCM_ATTN:

Greedy: "...its streets were quite as slow..."
Beam:   "...its streets were quite as snow..."

Technical explanation

Background: how cross-attention works in vLLM

In encoder-decoder models each decoder layer has two attention sublayers:

  1. Self-attention (AttentionType.DECODER): decoder tokens attend to previous decoder tokens. Q, K, V all come from the decoder hidden state.
  2. Cross-attention (AttentionType.ENCODER_DECODER): decoder tokens attend to the encoder output. Q comes from the decoder, K/V come from the encoder (see the sketch below).
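
A minimal PyTorch sketch of the two sublayers' Q/K/V sources (illustrative only; real models add projection layers, multi-head reshapes, and KV caching):

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes for illustration.
decoder_hidden = torch.randn(1, 4, 64)   # [batch, decoder_tokens, dim]
encoder_out = torch.randn(1, 1500, 64)   # [batch, encoder_tokens, dim], fixed per request

# 1. Self-attention (AttentionType.DECODER): Q, K, V all from the decoder state,
#    with a causal mask so each token only sees earlier decoder tokens.
self_attn = F.scaled_dot_product_attention(
    decoder_hidden, decoder_hidden, decoder_hidden, is_causal=True)

# 2. Cross-attention (AttentionType.ENCODER_DECODER): Q from the decoder,
#    K/V from the encoder output, with no causal mask.
cross_attn = F.scaled_dot_product_attention(
    decoder_hidden, encoder_out, encoder_out, is_causal=False)
```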

vLLM handles cross-attention through CrossAttentionImpl (vllm/model_executor/layers/attention/cross_attention.py), which wraps the selected backend. On the first decoder step it calls do_kv_cache_update to write the encoder K/V into the paged cache. On subsequent steps the cache already contains the full encoder output.
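
A hedged sketch of that first-step flow (simplified; the names are approximate, not the exact CrossAttentionImpl code):

```python
# Simplified shape of the cross-attention wrapper described above.
def cross_attention_forward(impl, query, key, value, kv_cache, attn_metadata):
    if key is not None and value is not None:
        # First decoder step: write the encoder K/V into the paged cache once.
        impl.do_kv_cache_update(key, value, kv_cache, attn_metadata)
    # The wrapped backend still receives the raw key/value tensors; whether it
    # uses them or reads only from kv_cache is exactly where the buggy and
    # working backends described below diverge.
    return impl.forward(query, key, value, kv_cache, attn_metadata)
```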

The backend dispatch (vllm/platforms/rocm.py) selects a backend per attention type. Different layers in the same model can use different backends. This is the mechanism this PR relies on.
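
A hedged sketch of what such per-attention-type dispatch looks like (illustrative; the actual selection logic in vllm/platforms/rocm.py may differ):

```python
# Illustrative dispatch: walk the backend priority list and pick the first
# backend whose supports_attn_type includes the layer's attention type.
def select_backend(attn_type, backend_priority):
    skipped = []
    for backend in backend_priority:
        if attn_type in backend.supports_attn_type:
            return backend, skipped  # `skipped` is what the new fallback logging reports
        skipped.append(backend)
    raise RuntimeError(f"no backend supports attention type {attn_type}")
```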

ROCM_ATTN: prefill kernel assumes self-attention

RocmAttentionImpl.forward calls chunked_prefill_paged_decode, which has two phases:

chunked_prefill_paged_decode(query, key, value, key_cache, value_cache, ...)
    if max_query_len > 1:
        # prefill phase: attends over the raw key/value tensors plus the cached tokens
        context_attention_fwd(q=query, k=key, v=value, k_cache=key_cache, v_cache=value_cache, ...)
    # decode phase: reads K/V exclusively from the paged cache
    paged_decode(query, key_cache, value_cache, ...)

context_attention_fwd is a Triton prefill kernel that assumes self-attention semantics. It treats the passed key/value tensors as new tokens being appended to the sequence, and reads earlier tokens from the cache. It effectively computes attention over [cached_kv..., new_kv].

For decoder self-attention this is correct: new K/V are projections of new decoder tokens that have not yet been cached.

For cross-attention this is wrong: the key/value passed to forward are encoder outputs, which CrossAttentionImpl has already written to the cache. The prefill kernel sees the encoder tokens twice -- once from the raw tensors, once from the cache -- double-counting some and missing others depending on the slot mapping.

When max_query_len=1 (greedy decode), the prefill kernel is skipped entirely and only paged_decode runs. paged_decode reads exclusively from the cache, which is correct. This is why greedy works but beam search does not.

ROCM_AITER_FA: prefill uses wrong sequence boundaries

AiterFlashAttentionImpl.forward has a separate prefill path that calls flash_attn_varlen_func:

rocm_aiter_ops.flash_attn_varlen_func(
    q=prefill_query,
    k=prefill_key,         # encoder K/V
    v=prefill_value,       # encoder K/V
    cu_seqlens_q=...,      # decoder query boundaries
    cu_seqlens_k=attn_metadata.prefill_metadata.query_start_loc,  # BUG
    causal=True,           # BUG
)

Two problems:

  1. cu_seqlens_k is set to query_start_loc, which contains decoder query cumulative lengths. For cross-attention, the K/V come from the encoder and have different sequence lengths. This tells the kernel the wrong number of key tokens per sequence.

  2. causal=True applies a lower-triangular mask, which is wrong for cross-attention where every decoder token should attend to every encoder token (a corrected call is sketched below for reference).
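
This PR does not fix this call path; it routes cross-attention away from it. For reference, a correct cross-attention call would roughly need encoder K/V boundaries and no causal mask (hedged sketch; encoder_seq_start_loc is a hypothetical name for the encoder cumulative lengths):

```python
rocm_aiter_ops.flash_attn_varlen_func(
    q=prefill_query,
    k=prefill_key,                       # encoder K/V
    v=prefill_value,                     # encoder K/V
    cu_seqlens_q=...,                    # decoder query boundaries
    cu_seqlens_k=encoder_seq_start_loc,  # hypothetical: encoder K/V cumulative lengths
    causal=False,                        # every decoder token attends to every encoder token
)
```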

Why TRITON_ATTN and ROCM_AITER_UNIFIED_ATTN work

Both backends call unified_attention(k=key_cache, v=value_cache, ...), passing the paged cache tensors directly as K/V arguments. The raw key/value from the forward call are never passed to the attention kernel. Since the cache was correctly populated by CrossAttentionImpl.do_kv_cache_update, the attention computation reads the right data.

unified_attention does pass causal=True, but with paged attention the causal offset is context_len + query_pos. For cross-attention context_len is large (encoder_seq_len - decoder_prompt_len), so the mask is effectively open for all practical decoder prompt sizes.
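
A small worked example of that offset (assumed numbers; Whisper's encoder output is on the order of 1500 positions):

```python
encoder_seq_len = 1500     # assumed encoder output length
decoder_prompt_len = 4     # assumed decoder prompt length during prefill
context_len = encoder_seq_len - decoder_prompt_len  # 1496

for q_pos in range(decoder_prompt_len):
    visible_keys = context_len + q_pos + 1  # causal offset: context_len + query_pos
    print(q_pos, visible_keys)              # 1497, 1498, 1499, 1500 of 1500 encoder positions
```

So the causal mask hides at most decoder_prompt_len - 1 encoder positions for the earliest query token, which is negligible in practice.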


Changes

vllm/v1/attention/backends/rocm_attn.py

Removed AttentionType.ENCODER_DECODER from supports_attn_type. When the dispatch runs for a cross-attention layer, ROCM_ATTN is now skipped and the next valid backend is selected.
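
The shape of the change, as a hedged sketch (assuming supports_attn_type is a class-level collection; the exact definition in rocm_attn.py may differ):

```python
from vllm.attention import AttentionType

# Before (sketch): the backend advertised cross-attention support.
supports_attn_type = (AttentionType.DECODER, AttentionType.ENCODER,
                      AttentionType.ENCODER_ONLY, AttentionType.ENCODER_DECODER)

# After (sketch): ENCODER_DECODER removed, so dispatch skips this backend for
# cross-attention layers and falls through to the next valid one.
supports_attn_type = (AttentionType.DECODER, AttentionType.ENCODER,
                      AttentionType.ENCODER_ONLY)
```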

vllm/v1/attention/backends/rocm_aiter_fa.py

Same change. Removed AttentionType.ENCODER_DECODER from supports_attn_type.

vllm/platforms/rocm.py

Improved backend selection logging:

  • Primary backend is logged once with info_once.
  • When a backend is selected via fallback (some higher-priority backends were skipped), a separate line is logged showing the attention type and which backends were skipped. This makes it clear when cross-attention layers use a different backend than decoder self-attention (sketched below).
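
Roughly what the added logging looks like (hedged sketch; message wording is illustrative, not the exact rocm.py strings, and selected_backend/skipped/attn_type come from the dispatch sketch earlier):

```python
from vllm.logger import init_logger

logger = init_logger(__name__)

# One-time message for the primary backend selection.
logger.info_once(f"Using attention backend {selected_backend.name}.")

# Separate message whenever higher-priority backends were skipped for a given attention type.
if skipped:
    logger.info(
        f"Attention type {attn_type}: selected {selected_backend.name} "
        f"after skipping {[b.name for b in skipped]}.")
```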

Test plan

  • pytest -s -v tests/entrypoints/openai/speech_to_text/test_transcription_validation_whisper.py
  • pytest -s -v tests/v1/attention

cc @kenroche

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
@mergify

mergify bot commented Mar 29, 2026

Documentation preview: https://vllm--38450.org.readthedocs.build/en/38450/

@mergify mergify bot added the documentation (Improvements or additions to documentation), rocm (Related to AMD ROCm), and v1 labels Mar 29, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Mar 29, 2026
@AndreasKaratzas
Collaborator Author

This PR is motivated by: #38321 (comment)
cc @gshtras

PR #38321 will likely be closed once this one is merged.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request restricts the supported attention types for ROCm backends by removing ENCODER_DECODER support from ROCM_AITER_FA and ROCM_ATTN due to specific implementation limitations in cross-attention. It also enhances the Whisper transcription validation tests to explicitly cover multiple ROCm attention backends and improves backend selection logging. A review comment correctly identifies that the documentation table for ROCM_ATTN in attention_backends.md needs to be updated to reflect these changes and maintain consistency with the codebase.

Diff context from the backend support table in attention_backends.md; the two ROCM_AITER_FA rows show the entry before and after this PR's change:

| `FLASH_ATTN_DIFFKV` | | fp16, bf16 | `auto` | Any | Any | ❌ | ❌ | ✅ | Decoder | Any |
| `FLEX_ATTENTION` | | fp16, bf16, fp32 | `auto`, `float16`, `bfloat16` | Any | Any | ❌ | ✅ | ❌ | Decoder, Encoder Only | Any |
| `ROCM_AITER_FA` | | fp16, bf16 | `auto`, `float16`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | 16, 32 | 64, 128, 256 | ❌ | ❌ | ❌ | Decoder, Enc-Dec | N/A |
| `ROCM_AITER_FA` | | fp16, bf16 | `auto`, `float16`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | 16, 32 | 64, 128, 256 | ❌ | ❌ | ❌ | Decoder | N/A |
Contributor


Severity: high

While you've correctly updated the supported attention types for ROCM_AITER_FA, the documentation for ROCM_ATTN on line 178 is now inconsistent with the code changes. This PR removes ENCODER_DECODER support from ROCM_ATTN in vllm/v1/attention/backends/rocm_attn.py, but this table still lists All for its supported attention types. Please update it to reflect that ENCODER_DECODER is no longer supported (e.g., by listing Decoder, Encoder, Encoder Only).

Collaborator Author


This was a statically set value in the auto-generator. I modified it so that it collects the possible attention types instead.

Signed-off-by: Andreas Karatzas <akaratza@amd.com>

@claude claude bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@AndreasKaratzas AndreasKaratzas added the ready (ONLY add when PR is ready to merge/full CI is needed) label Mar 29, 2026
@DarkLight1337 DarkLight1337 enabled auto-merge (squash) March 29, 2026 03:56
@DarkLight1337 DarkLight1337 disabled auto-merge March 29, 2026 03:56
@vllm-bot vllm-bot merged commit 43cc513 into vllm-project:main Mar 29, 2026
59 of 62 checks passed
@github-project-automation github-project-automation bot moved this from Todo to Done in AMD Mar 29, 2026
@AndreasKaratzas AndreasKaratzas deleted the akaratza_fix_cross_attention branch March 29, 2026 05:08

Labels

documentation (Improvements or additions to documentation), ready (ONLY add when PR is ready to merge/full CI is needed), rocm (Related to AMD ROCm), v1

Projects

Status: Done


3 participants