Merged
Conversation
zhuohan123 approved these changes on Apr 2, 2023
Comment on lines +35 to +50
-            )[0]
-            # FIXME(woosuk): Unnecessary copy. Optimize this.
-            output.copy_(out, non_blocking=True)
+            # Directly call FlashAttention's internal function to avoid allocating
+            # a new tensor for the output.
+            _flash_attn_forward(
+                query,
+                key,
+                value,
+                output,
+                cumulative_prompt_lens,
+                cumulative_prompt_lens,
+                max_prompt_len,
+                max_prompt_len,
+                dropout_p=0.0,
+                softmax_scale=self.scale,
+                causal=True,
+                return_softmax=False,
+            )
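For context, a minimal standalone sketch of how this call site is driven. The flash_attn.flash_attn_interface import path, the shapes, and the dtype are assumptions based on the flash-attn v1 interface rather than taken from the PR, and the argument order simply mirrors the diff above; it needs a CUDA device to run.

    import torch
    # Assumed import path for the internal forward function (flash-attn v1).
    from flash_attn.flash_attn_interface import _flash_attn_forward

    num_tokens, num_heads, head_size = 8, 12, 64

    query = torch.randn(num_tokens, num_heads, head_size,
                        dtype=torch.float16, device="cuda")
    key = torch.randn_like(query)
    value = torch.randn_like(query)
    # Preallocated buffer: the kernel writes the attention output here
    # directly, so no fresh tensor is allocated and no copy-back is needed.
    output = torch.empty_like(query)

    # One prompt of num_tokens tokens; cumulative lengths mark sequence starts.
    cumulative_prompt_lens = torch.tensor([0, num_tokens],
                                          dtype=torch.int32, device="cuda")
    max_prompt_len = num_tokens

    _flash_attn_forward(
        query, key, value, output,
        cumulative_prompt_lens, cumulative_prompt_lens,
        max_prompt_len, max_prompt_len,
        dropout_p=0.0,
        softmax_scale=head_size ** -0.5,
        causal=True,
        return_softmax=False,
    )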
Member
Just curious, so FlashAttention natively supports non-contiguous QKV tensors?
Collaborator (Author)
Yes. It actually requires a QKV tensor of shape [num_tokens, 3, num_heads, head_size]. Previously, we inserted torch.stack to meet this shape requirement, and this PR eliminates that inefficiency.
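To illustrate, a minimal standalone sketch (variable names are illustrative, not from the PR) of how three non-contiguous views into one packed QKV buffer replace the torch.stack copy:

    import torch

    num_tokens, num_heads, head_size = 8, 12, 64

    # One packed allocation with the shape FlashAttention expects.
    qkv = torch.empty(num_tokens, 3, num_heads, head_size)

    # query/key/value are strided views into qkv -- no torch.stack, no copies.
    query, key, value = qkv.unbind(dim=1)
    assert not query.is_contiguous()           # the views are non-contiguous
    assert query.data_ptr() == qkv.data_ptr()  # and share qkv's storage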
Member
Speed before this PR on 1 A100:
After:
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request on Feb 13, 2024
luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request on Apr 17, 2024
Produce artifacts for bare metal installation in Dockerfile.openvino
tdg5 pushed a commit to tdg5/vllm that referenced this pull request on Apr 25, 2024
Fix logging lint errors
fxmarty pushed a commit to fxmarty/vllm-public that referenced this pull request on May 31, 2024
…factor Dockerfile improvements: multistage
tianyil1 pushed a commit to tianyil1/vllm that referenced this pull request on Jun 5, 2024
zyongye pushed a commit to zyongye/vllm that referenced this pull request on Aug 5, 2025
zyongye pushed a commit to zyongye/vllm that referenced this pull request on Aug 6, 2025
irenemizus pushed a commit to axeltec-software/vllm that referenced this pull request on Sep 28, 2025
heheda12345 pushed a commit to heheda12345/vllm that referenced this pull request on Sep 29, 2025
…integration [Feature] DeepGEMM integration
yma11 pushed a commit to yma11/vllm that referenced this pull request on Nov 12, 2025
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
GuoRen868 pushed a commit to GuoRen868/vllm that referenced this pull request on Dec 26, 2025
AFD adaptation for MTP and quantization + bug fix
tjtanaa pushed a commit to tjtanaa/vllm that referenced this pull request on Jan 29, 2026
…t-processor [Engine] Refactor output processing for multimodal capabilities in vLLM-omni
yuezhu1 pushed a commit to yuezhu1/vllm that referenced this pull request on Mar 25, 2026
yuankaichen-amd added a commit to yuankaichen-amd/vllm that referenced this pull request on Mar 30, 2026
Should be merged after #15.
The changes in this PR eliminate the need for redundant data movements such as torch.cat, torch.stack, and torch.contiguous, which were previously used to align input and output shapes. The PR modifies existing kernels and adds new kernels that accommodate non-contiguous tensors, making these data-movement operators unnecessary.
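For intuition on the cost being removed, a small standalone PyTorch sketch (not from the PR): stride-changing views are free, while contiguous/cat/stack each materialize a full copy of the data.

    import torch

    x = torch.randn(1024, 12, 64)

    # A transpose only changes strides: no data movement.
    y = x.transpose(0, 1)
    assert y.data_ptr() == x.data_ptr()   # same storage
    assert not y.is_contiguous()

    # .contiguous() (like torch.cat / torch.stack) allocates and copies,
    # adding memory traffic on every call.
    z = y.contiguous()
    assert z.data_ptr() != x.data_ptr()   # new storage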