
Implement custom kernel for LLaMA rotary embedding #14

Merged
WoosukKwon merged 9 commits into main from rotary-embedding
Mar 30, 2023
Conversation

@WoosukKwon
Collaborator

WoosukKwon commented Mar 30, 2023

This PR implements a custom CUDA kernel for rotary embedding, which is used in LLaMA. The kernel is responsible for the entire process of applying rotary embedding to query and key, and is thus much more efficient than the PyTorch implementation.
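
The rotary-embedding math that the fused kernel applies can be illustrated with a minimal NumPy sketch. This is not the kernel itself, only the reference computation: adjacent dimension pairs of each head vector are rotated by a position-dependent angle. The pairing convention (adjacent even/odd pairs) and function name here are illustrative; the actual CUDA kernel's memory layout may differ.

```python
import numpy as np

def apply_rotary_embedding(x, positions, base=10000.0):
    """Reference rotary embedding (illustrative, not the CUDA kernel).

    x: [num_tokens, head_dim] query or key vectors.
    positions: [num_tokens] token positions.
    """
    dim = x.shape[-1]
    # Per-pair rotation frequencies: theta_i = base^(-2i/dim)
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)        # [dim/2]
    angles = positions[:, None] * inv_freq[None, :]         # [num_tokens, dim/2]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    # 2D rotation of each (x1, x2) pair by its angle
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

A plain PyTorch version launches several elementwise ops (cos/sin, slicing, multiplies, adds) over query and key separately; the custom kernel performs the whole transformation for both tensors in one pass, which is where the speedup comes from.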

Tested models:

  • LLaMA-7B
  • LLaMA-13B

Tested GPUs:

  • A100

@WoosukKwon WoosukKwon requested a review from zhuohan123 March 30, 2023 10:29
@WoosukKwon WoosukKwon changed the title Add custom kernel for rotary embedding Implement custom kernel for LLaMA rotary embedding Mar 30, 2023
Member

@zhuohan123 zhuohan123 left a comment

LGTM!

@WoosukKwon WoosukKwon merged commit 88c0268 into main Mar 30, 2023
@WoosukKwon WoosukKwon deleted the rotary-embedding branch March 30, 2023 18:04
bigPYJ1151 added a commit to bigPYJ1151/vllm that referenced this pull request Sep 12, 2023
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request Mar 25, 2024
mzusman pushed a commit to mzusman/vllm that referenced this pull request May 6, 2024
* remove JambaConfig and use official one from transformers

* changes in Jamba modeling file to align with official HF format
fxmarty pushed a commit to fxmarty/vllm-public that referenced this pull request May 31, 2024
enable fused topK_softmax kernel for hip path
ykim362 pushed a commit to ykim362/vllm that referenced this pull request Jun 17, 2024
yukavio pushed a commit to yukavio/vllm that referenced this pull request Jul 3, 2024
Summary:
Add benchmarking scripts and utils.
Things to note:
- All files are stored in the `neuralmagic` folder.
- neuralmagic/benchmarks/scripts/*: actual benchmarking scripts that interact with the vllm engine.
- neuralmagic/benchmarks/configs/*: JSON config files that define which benchmark commands to run.
- neuralmagic/benchmarks/run_*.py: scripts that consume a config file and run the benchmark scripts.
- neuralmagic/tools: add tools.

Testing:
Local testing

---------

Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: rsnm2 <rshaw@neuralmagic.com>
@alixiaodi alixiaodi mentioned this pull request Aug 2, 2024
wuhuikx pushed a commit to wuhuikx/vllm that referenced this pull request Mar 27, 2025
A follow-up fix for the [MRotaryEmbedding
change](vllm-project@bf3b79e#diff-6bc44986c91bf0876240dec03d56c748403691c7fcd90f7a22e7affff7b033ecR839)

Signed-off-by: z00897138 <zhaorifa@huawei.com>
Co-authored-by: z00897138 <zhaorifa@huawei.com>
heheda12345 pushed a commit to heheda12345/vllm that referenced this pull request Sep 29, 2025
yma11 pushed a commit to yma11/vllm that referenced this pull request Nov 14, 2025
…pSeek-v2 (vllm-project#28101) (vllm-project#14)

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
iwooook pushed a commit to moreh-dev/vllm that referenced this pull request Nov 29, 2025
… update perf measurement to decode multiple tokens

Signed-off-by: Salar Hosseini <skhorasgani@tenstorrent.com>
tjtanaa pushed a commit to tjtanaa/vllm that referenced this pull request Jan 29, 2026
[Model] Add end2end example and documentation for qwen2.5-omni
yuezhu1 pushed a commit to yuezhu1/vllm that referenced this pull request Mar 30, 2026
…-project#7)

Adds five new fields to LoRAConfig in vllm/config/lora.py to support
runtime dynamic resizing of GPU LoRA adapter slots:

  - min_loras (int, ge=1): floor for dynamic slot shrinking
  - dynamic_lora_slots (bool): enables automatic watermark-driven scaling
  - lora_mem_high_watermark (float, 0<x<1): scale-down threshold
  - lora_mem_low_watermark (float, 0<x<1): scale-up threshold
  - lora_slot_resize_cooldown_s (float, ge=0): anti-thrash cooldown

Cross-field validation added to _validate_lora_config():
  - min_loras <= max_loras (when dynamic_lora_slots=True)
  - lora_mem_low_watermark < lora_mem_high_watermark (when dynamic=True)

Field-level bounds (ge/gt/lt) enforced by Pydantic at construction time.

dynamic_lora_slots added to compute_hash() as it affects the CudaGraph
specialization path (disables LoRA cudagraph when True, see issue vllm-project#14).

All new fields default to safe values so existing configs are unaffected
when dynamic_lora_slots=False (the default).

Includes 16 unit tests in tests/lora/test_lora_config_dynamic.py
covering defaults, valid configs, all validation error paths, and
compute_hash() behavior.

Closes vllm-project#7
Closes vllm-project#18

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Chen Wang <Chen.Wang1@ibm.com>
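
The field bounds and cross-field validation described in the commit above can be sketched as follows. This is a hypothetical stand-in using a plain dataclass; the real LoRAConfig in vllm/config/lora.py is Pydantic-based, so its ge/gt/lt bounds are enforced declaratively rather than in `__post_init__`. Field names follow the commit message; defaults here are illustrative.

```python
from dataclasses import dataclass

@dataclass
class LoRAConfigSketch:
    # Illustrative defaults; not taken from the actual vLLM config.
    max_loras: int = 1
    min_loras: int = 1
    dynamic_lora_slots: bool = False
    lora_mem_high_watermark: float = 0.9
    lora_mem_low_watermark: float = 0.5
    lora_slot_resize_cooldown_s: float = 0.0

    def __post_init__(self):
        # Field-level bounds (Pydantic ge/gt/lt in the real config).
        if self.min_loras < 1:
            raise ValueError("min_loras must be >= 1")
        if not 0 < self.lora_mem_high_watermark < 1:
            raise ValueError("lora_mem_high_watermark must be in (0, 1)")
        if not 0 < self.lora_mem_low_watermark < 1:
            raise ValueError("lora_mem_low_watermark must be in (0, 1)")
        if self.lora_slot_resize_cooldown_s < 0:
            raise ValueError("lora_slot_resize_cooldown_s must be >= 0")
        # Cross-field checks, applied only when dynamic slots are enabled.
        if self.dynamic_lora_slots:
            if self.min_loras > self.max_loras:
                raise ValueError("min_loras must be <= max_loras")
            if self.lora_mem_low_watermark >= self.lora_mem_high_watermark:
                raise ValueError("low watermark must be < high watermark")
```

Note that `min_loras <= max_loras` is deliberately not checked when `dynamic_lora_slots` is False, matching the commit's statement that existing configs are unaffected by the defaults.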
yuezhu1 pushed a commit to yuezhu1/vllm that referenced this pull request Mar 30, 2026
…e_hash

- Clarify min_loras docstring: <= max_loras is only enforced when
  dynamic_lora_slots=True, not unconditionally.
- Clarify dynamic_lora_slots docstring: remove reference to unimplemented
  POST /v1/scale_max_loras endpoint; note operator-triggered scaling will
  be handled via plugin (issue vllm-project#16).
- Fix compute_hash() comment with TODO(vllm-project#14) reference.
- Add specialize_active_lora to compute_hash() factors — it controls which
  CUDA graphs are captured and must be part of the computation graph hash.
- Add test_compute_hash_differs_with_specialize_active_lora to cover above.

Co-authored-by: Claude
Signed-off-by: Chen Wang <Chen.Wang1@ibm.com>
Damon-Salvetore pushed a commit to Damon-Salvetore/vllm that referenced this pull request Mar 31, 2026
…-section-12

Expand README Section 12 experimental results and add Section 13 Algorithmic Efficiency analysis


2 participants