Merged
Conversation
zhuohan123 approved these changes on Apr 2, 2023
Comment on lines +35 to +50
-            )[0]
-            # FIXME(woosuk): Unnecessary copy. Optimize this.
-            output.copy_(out, non_blocking=True)
+            # Directly call FlashAttention's internal function to avoid allocating
+            # a new tensor for the output.
+            _flash_attn_forward(
+                query,
+                key,
+                value,
+                output,
+                cumulative_prompt_lens,
+                cumulative_prompt_lens,
+                max_prompt_len,
+                max_prompt_len,
+                dropout_p=0.0,
+                softmax_scale=self.scale,
+                causal=True,
+                return_softmax=False,
+            )
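For context, a minimal standalone sketch of how this call site is driven. The flash_attn.flash_attn_interface import path, the shapes, and the dtype are assumptions based on the flash-attn v1 interface rather than taken from the PR, and the argument order simply mirrors the diff above; it needs a CUDA device to run.

    import torch
    # Assumed import path for the internal forward function (flash-attn v1).
    from flash_attn.flash_attn_interface import _flash_attn_forward

    num_tokens, num_heads, head_size = 8, 12, 64

    query = torch.randn(num_tokens, num_heads, head_size,
                        dtype=torch.float16, device="cuda")
    key = torch.randn_like(query)
    value = torch.randn_like(query)
    # Preallocated buffer: the kernel writes the attention output here
    # directly, so no fresh tensor is allocated and no copy-back is needed.
    output = torch.empty_like(query)

    # One prompt of num_tokens tokens; cumulative lengths mark sequence starts.
    cumulative_prompt_lens = torch.tensor([0, num_tokens],
                                          dtype=torch.int32, device="cuda")
    max_prompt_len = num_tokens

    _flash_attn_forward(
        query, key, value, output,
        cumulative_prompt_lens, cumulative_prompt_lens,
        max_prompt_len, max_prompt_len,
        dropout_p=0.0,
        softmax_scale=head_size ** -0.5,
        causal=True,
        return_softmax=False,
    )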
Member
Just curious, so FlashAttention natively supports non-contiguous QKV tensors?
Collaborator (Author)
Yes. It actually requires a QKV tensor of shape [num_tokens, 3, num_heads, head_size]. Previously, we inserted torch.stack to meet this shape requirement, and this PR eliminates that inefficiency.
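To illustrate, a minimal standalone sketch (variable names are illustrative, not from the PR) of how three non-contiguous views into one packed QKV buffer replace the torch.stack copy:

    import torch

    num_tokens, num_heads, head_size = 8, 12, 64

    # One packed allocation with the shape FlashAttention expects.
    qkv = torch.empty(num_tokens, 3, num_heads, head_size)

    # query/key/value are strided views into qkv -- no torch.stack, no copies.
    query, key, value = qkv.unbind(dim=1)
    assert not query.is_contiguous()           # the views are non-contiguous
    assert query.data_ptr() == qkv.data_ptr()  # and share qkv's storage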
Member
Speed before this PR on 1 A100:
After:
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request on Feb 13, 2024
luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request on Apr 17, 2024
Produce artifacts for bare metal installation in Dockerfile.openvino
tdg5 pushed a commit to tdg5/vllm that referenced this pull request on Apr 25, 2024
Fix logging lint errors
fxmarty pushed a commit to fxmarty/vllm-public that referenced this pull request on May 31, 2024
…factor Dockerfile improvements: multistage
tianyil1 pushed a commit to tianyil1/vllm that referenced this pull request on Jun 5, 2024
zyongye pushed a commit to zyongye/vllm that referenced this pull request on Aug 5, 2025
zyongye pushed a commit to zyongye/vllm that referenced this pull request on Aug 6, 2025
irenemizus pushed a commit to axeltec-software/vllm that referenced this pull request on Sep 28, 2025
heheda12345 pushed a commit to heheda12345/vllm that referenced this pull request on Sep 29, 2025
…integration [Feature] DeepGEMM integration
yma11 pushed a commit to yma11/vllm that referenced this pull request on Nov 12, 2025
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
GuoRen868 pushed a commit to GuoRen868/vllm that referenced this pull request on Dec 26, 2025
AFD adaptation for MTP and quantization + bug fix
tjtanaa pushed a commit to tjtanaa/vllm that referenced this pull request on Jan 29, 2026
…t-processor [Engine] Refactor output processing for multimodal capabilities in vLLM-omni
yuezhu1 pushed a commit to yuezhu1/vllm that referenced this pull request on Mar 25, 2026
yuankaichen-amd added a commit to yuankaichen-amd/vllm that referenced this pull request on Mar 30, 2026
Should be merged after #15.
The changes in this PR eliminate the need for redundant data movements such as torch.cat, torch.stack, and torch.contiguous, which were previously used to align input and output shapes. The PR modifies existing kernels and adds new kernels that accommodate non-contiguous tensors, making these data-movement operators unnecessary.
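For intuition on the cost being removed, a small standalone PyTorch sketch (not from the PR): stride-changing views are free, while contiguous/cat/stack each materialize a full copy of the data.

    import torch

    x = torch.randn(1024, 12, 64)

    # A transpose only changes strides: no data movement.
    y = x.transpose(0, 1)
    assert y.data_ptr() == x.data_ptr()   # same storage
    assert not y.is_contiguous()

    # .contiguous() (like torch.cat / torch.stack) allocates and copies,
    # adding memory traffic on every call.
    z = y.contiguous()
    assert z.data_ptr() != x.data_ptr()   # new storage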