Conversation
Contributor
Author
|
Implementation is done; needs testing (will do it on Thursday). The memory-saving strategy is orthogonal to this kernel, so I would not include it in this PR. |
Collaborator
|
Hey @suquark, thanks for the PR! A quick question: have you also measured the performance difference between the two kernels before and after the optimization? |
Contributor
Author
|
See the PR comment for the optimized kernel performance comparison. |
Memcpy kernel for flash attention
The performance is pretty good (theoretical optimal throughput is 1.6 TB/s for the A100-40GB), considering that the memory layout is not ideal.
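As a minimal sketch of how a number like this is judged against the 1.6 TB/s peak: a copy kernel reads every byte once and writes it once, so effective bandwidth is twice the payload size divided by elapsed time. The function and figures below are illustrative, not measurements from this PR.

```python
# Hedged sketch: effective-bandwidth bookkeeping for a memcpy-style kernel,
# compared against the A100-40GB theoretical peak of ~1.6 TB/s.
def effective_bandwidth_tb_s(num_bytes_copied: int, elapsed_s: float) -> float:
    # A copy touches each byte twice: one read from the source buffer
    # and one write to the destination buffer.
    return 2 * num_bytes_copied / elapsed_s / 1e12

# Hypothetical example: copying 8 GiB of KV-cache data in 12 ms.
bw = effective_bandwidth_tb_s(8 * 1024**3, 12e-3)
print(f"{bw:.2f} TB/s of the ~1.6 TB/s peak")  # -> 1.43 TB/s
```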
Result for the unoptimized kernel:
The optimized kernel works much better for smaller numbers of tokens (+20% speedup).
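For readers unfamiliar with what a "memcpy kernel" does in this context, here is a plain-Python sketch of the operation: copying blocks of a paged KV cache according to a source-to-destination mapping. All names (`copy_blocks`, `block_mapping`) are illustrative, not vLLM's actual API, and the real kernel performs these copies in parallel on the GPU.

```python
# Hedged sketch (not the actual CUDA kernel): the logical effect of a
# memcpy kernel over a paged KV cache. Each cache is a list of fixed-size
# blocks; block_mapping gives (src, dst) block indices to copy.
def copy_blocks(key_cache, value_cache, block_mapping):
    for src, dst in block_mapping:
        key_cache[dst] = list(key_cache[src])      # copy, don't alias
        value_cache[dst] = list(value_cache[src])

key_cache = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
value_cache = [list(b) for b in key_cache]
copy_blocks(key_cache, value_cache, [(0, 3)])  # duplicate block 0 into slot 3
```

On the GPU, the performance of this operation is dominated by how well the per-block copies coalesce into wide memory transactions, which is why the non-ideal memory layout mentioned above matters.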