Add CUDA graph-based all reduce launcher by WoosukKwon · Pull Request #26 · vllm-project/vllm

WoosukKwon · 2023-04-05T02:45:37Z

Related to #22

This PR uses a CUDA graph to reduce the CPU overhead of launching the NCCL all-reduce operation.
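
The idea can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code merged in this PR: the names `GraphAllReduceSketch`, `capture_sizes`, `bucket_for`, and the `step` parameter are hypothetical, and the sketch assumes an already initialized NCCL process group and a PyTorch/NCCL build that supports capturing collectives in a CUDA graph. Because a captured graph has fixed shapes and buffer addresses, the sketch captures one all-reduce per token-count bucket on slices of a single static buffer and replays the matching graph at launch time. It also takes `dtype` as a constructor argument rather than hardcoding it.

```python
import bisect

try:
    import torch
    import torch.distributed as dist
except ImportError:  # lets the size helpers below run without PyTorch
    torch = dist = None


def capture_sizes(max_num_tokens: int, step: int = 8) -> list[int]:
    """Token counts to capture: multiples of `step`, plus max_num_tokens itself."""
    sizes = list(range(step, max_num_tokens + 1, step))
    if not sizes or sizes[-1] != max_num_tokens:
        sizes.append(max_num_tokens)
    return sizes


def bucket_for(num_tokens: int, sizes: list[int]) -> int:
    """Smallest captured size that can hold `num_tokens` rows."""
    i = bisect.bisect_left(sizes, num_tokens)
    if i == len(sizes):
        raise ValueError(f"{num_tokens} tokens exceed the largest captured size")
    return sizes[i]


class GraphAllReduceSketch:
    """Hypothetical launcher: one captured all-reduce graph per bucket size."""

    def __init__(self, max_num_tokens, hidden_size, dtype=None, group=None, step=8):
        if torch is None or not torch.cuda.is_available():
            raise RuntimeError("this sketch needs PyTorch with CUDA")
        self.group = group  # None means the default process group
        self.sizes = capture_sizes(max_num_tokens, step)
        # Static buffer: captured graphs record its address, so it is
        # allocated once and never reallocated.
        self.buffer = torch.zeros(
            (max_num_tokens, hidden_size),
            dtype=dtype if dtype is not None else torch.half,
            device="cuda",
        )
        self.graphs = {n: self._capture(n) for n in self.sizes}

    def _capture(self, num_tokens):
        view = self.buffer[:num_tokens]
        # Warm up on a side stream before capture.
        side = torch.cuda.Stream()
        side.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(side):
            dist.all_reduce(view, group=self.group)
        torch.cuda.current_stream().wait_stream(side)
        # Record the all-reduce once; replay() re-issues it later.
        graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(graph):
            dist.all_reduce(view, group=self.group)
        return graph

    def launch(self, x):
        """All-reduce `x` (shape [num_tokens, hidden_size]) across ranks.

        Every rank must call this with the same num_tokens. The result is a
        view into the static buffer and is overwritten by the next launch.
        """
        num_tokens = x.shape[0]
        self.buffer[:num_tokens].copy_(x)
        self.graphs[bucket_for(num_tokens, self.sizes)].replay()
        return self.buffer[:num_tokens]
```

Bucketing is one possible design choice: it trades a few wasted rows per reduction for fewer captured graphs. Rows between `num_tokens` and the bucket size hold stale data and are reduced along with the rest, but they are never returned to the caller.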

zhuohan123

LGTM!

zhuohan123 · 2023-04-05T17:22:53Z

cacheflow/parallel_utils/parallel_state.py

+        self.group = get_tensor_model_parallel_group()
+        self.buffer = torch.empty(
+            size=(max_num_tokens, hidden_size),
+            dtype=torch.half, # FIXME: hardcoded dtype


Add a dtype argument for this class?

Install and configure use of the NCCL version recommended by vLLM via the [vllm-nccl](https://github.com/vllm-project/vllm-nccl) package. The install is a little wonky... but this set of changes should work.

Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>

* add tool server Signed-off-by: Chen Zhang <zhangch99@outlook.com> * add back demo tool server Signed-off-by: Chen Zhang <zhangch99@outlook.com> * update Signed-off-by: Chen Zhang <zhangch99@outlook.com> * update Signed-off-by: Chen Zhang <zhangch99@outlook.com> * update disallow cases Signed-off-by: Chen Zhang <zhangch99@outlook.com> * fix Signed-off-by: Chen Zhang <zhangch99@outlook.com> * fix some type Signed-off-by: Chen Zhang <zhangch99@outlook.com> * fix some type Signed-off-by: Chen Zhang <zhangch99@outlook.com> * fix some type Signed-off-by: Chen Zhang <zhangch99@outlook.com> * fix some type Signed-off-by: Chen Zhang <zhangch99@outlook.com> * fix some type Signed-off-by: Chen Zhang <zhangch99@outlook.com> * fix some type Signed-off-by: Chen Zhang <zhangch99@outlook.com> * fix some type Signed-off-by: Chen Zhang <zhangch99@outlook.com> * fix some type Signed-off-by: Chen Zhang <zhangch99@outlook.com> * fix some type Signed-off-by: Chen Zhang <zhangch99@outlook.com> * fix some type Signed-off-by: Chen Zhang <zhangch99@outlook.com> * fix some type Signed-off-by: Chen Zhang <zhangch99@outlook.com> --------- Signed-off-by: Chen Zhang <zhangch99@outlook.com>

…oject#26) * indexer medatata to separate prefill and decode * deep_gemm prefill kernel * decode kernel, can run for single batch * bug fixing insert decode k into kv before gemm * don't use tilelang quant function * faster non-looping torch for kv cache insertion * add chunked prefill impl * change quant kernel back to tilelang for promotion * fix format (vllm-project#31) Signed-off-by: Chen Zhang <zhangch99@outlook.com> * update unit tests * Fp8 indexer prefill (vllm-project#33) * init Signed-off-by: Chen Zhang <zhangch99@outlook.com> * can run --------- Signed-off-by: Chen Zhang <zhangch99@outlook.com> * remove debug comment Signed-off-by: Chen Zhang <zhangch99@outlook.com> * cleanup * further cleanup --------- Signed-off-by: Chen Zhang <zhangch99@outlook.com> Co-authored-by: mgoin <mgoin64@gmail.com> Co-authored-by: Chen Zhang <zhangch99@outlook.com>

New Industry Use Cases (vllm-project#21-30):
- vllm-project#21 Game Development: AI game testing + balance tuning
- vllm-project#22 Construction: Vision AI safety inspection
- vllm-project#23 Agriculture/Smart Farm: Crop monitoring + pest detection
- vllm-project#24 Government/Public: Document automation + citizen services
- vllm-project#25 Energy/Utilities: Grid monitoring + anomaly detection
- vllm-project#26 Environment/Sustainability: Carbon tracking + ESG reporting
- vllm-project#27 Fashion/Apparel: Trend analysis + inventory optimization
- vllm-project#28 Sports/Fitness: Performance analytics + tactical analysis
- vllm-project#29 Automotive/Mobility: Autonomous driving simulation
- vllm-project#30 Space/Aerospace: Satellite image analysis

Advanced Architecture Patterns:
1. Event-Driven Pattern: Webhook → Event Bus → Agent triggers
2. Streaming Pattern: Large dataset processing with chunking
3. Batch Processing Pattern: Celery-based parallel processing
4. Circuit Breaker Pattern: Fault tolerance + auto recovery
5. CQRS + Event Sourcing: Command/Query separation
6. Saga Pattern: Distributed transaction management

Guide now covers:
- 30+ industry-specific MCP implementations
- 6 production-ready architecture patterns
- Real-world scalability solutions
- Enterprise integration strategies
- Total: 8,672 lines (from 7,249)

- Remove unsafe del+reassign pattern in FusedMoEWithLoRA.reallocate_lora_weights; replace with direct assignments to avoid transient broken object state
- Add 10 unit tests for FusedMoEWithLoRA.reallocate_lora_weights covering: shape after grow/shrink, weight preservation, zero-init of new slots, adapter_enabled length (new_slots+1), max_loras sync, flat lora_a/b list lengths

Co-authored-by: Claude
Signed-off-by: yuezhu1

…ject#26)

With eagle3 speculative decoding (num_speculative_tokens=4), the target model verification pass has batch size 5 (1 original + 4 speculative tokens). The skinny GEMM kernels only supported N<=4, causing fallback to a slow dequant + F.linear path that regressed decode throughput by 28% on Qwen3-30B-A3B-AWQ with eagle3.

Add case 5 to all skinny GEMM kernel switch statements (fp16, int4, int4_g, int8) and raise SKINNY_GEMM_MAX_N from 4 to 5 so the fast HIP kernel path is used during speculative verification.

Benchmark (Strix Halo, Qwen3-30B-A3B-AWQ-4bit, eagle3 specdec):
- Before fix: 81.3 tokens/s decode
- After fix: 115.7 tokens/s decode (+42%)
- Baseline: 113.3 tokens/s decode (pre-regression)

WoosukKwon added 4 commits April 5, 2023 01:07

Add -tp and -pp

0f86522

Add graph-based all reduce launcher

8f4c648

max_batch_size -> max_num_batched_tokens

8077445

max_batch_size -> max_num_batched_tokens

1cfdb00

WoosukKwon requested a review from zhuohan123 April 5, 2023 09:31

zhuohan123 approved these changes Apr 5, 2023

Address comments & Code cleaning

d406199

WoosukKwon merged commit 12659a0 into main Apr 5, 2023

WoosukKwon deleted the graph branch April 5, 2023 18:17

shanshanpt mentioned this pull request Nov 17, 2023

Run long conetxt error : CUDA error: an illegal memory access was encountered #1700

Closed

junior-zsy mentioned this pull request Nov 20, 2023

Error with 32k Long Text in chatglm2-6b-32k Model #1725

Closed

hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024

Add CUDA graph-based all reduce launcher (vllm-project#26)

6376304

slyalin pushed a commit to slyalin/vllm that referenced this pull request Apr 4, 2024

Merge pull request vllm-project#26 from ilya-lavrenov/disable-npu

818e384

Disable NPU merged to OV master recently

dtrifiro pushed a commit to dtrifiro/vllm that referenced this pull request May 21, 2024

Merge pull request vllm-project#26 from dtrifiro/bump-deps

255735f

deps: bump fastapi to >= 0.109.1

fxmarty pushed a commit to fxmarty/vllm-public that referenced this pull request May 31, 2024

Merge pull request vllm-project#26 from ROCm/cl/updates-pag-shomy

fa75cba

Update max_context_len for custom paged attention.

tianyil1 pushed a commit to tianyil1/vllm that referenced this pull request Jun 5, 2024

Merge pull request vllm-project#26 from HabanaAI/habana_main_rebase_c…

ae3d612

…c466a3 Rebase habana_main up to cc466a3

bigPYJ1151 pushed a commit to bigPYJ1151/vllm that referenced this pull request Jun 25, 2024

Merge pull request vllm-project#26 from intel-sandbox/jianan/enable_l…

dddd40f

…inear_fusion_and_prepack Enable linear fusion/prepack and MOE AWQ fusion

alixiaodi mentioned this pull request Aug 2, 2024

[Bug]: #7072

Closed

hao-cold mentioned this pull request May 13, 2025

[Bug]: CUDA error: an illegal instruction was encountered #18045

Closed

1 task

markmc mentioned this pull request May 21, 2025

[Bug][Failing Test]: Distributed Comm Ops - distributed/test_shm_broadcast.py #18492

Closed

1 task

zerosurplus mentioned this pull request Jun 16, 2025

[Bug]: torch.distributed.DistNetworkError: The client socket has timed out after 600000ms while trying to connect to (172.17.0.9, 46229). #19670

Open

1 task

xiaomofang mentioned this pull request Jul 31, 2025

[Bug]: There is an issue with speculative inference in Eagle mode, where the context length of vLLM inference is constrained by the draft model. #21986

Closed

1 task

noooop added a commit that referenced this pull request Oct 14, 2025

Revert "[issues template] Encourage the author implement their own id…

ed52f19

…eas (#26…" This reverts commit 88a4974.

Michel-debug mentioned this pull request Oct 23, 2025

[Bug]: qwen3-vl-2b after ms-swift fine-tuning lance errors #27405

Closed

1 task

inkcherry pushed a commit to inkcherry/vllm that referenced this pull request Nov 6, 2025

feat: Add mori availability check for all2all backend (vllm-project#26)

861d74c

HervorTao mentioned this pull request Feb 3, 2026

[Bug]: [CPU Backend] AttributeError: '_OpNamespace' '_C_utils' object has no attribute 'init_cpu_threads_env' #33675

Closed

1 task

LironKesem mentioned this pull request Mar 12, 2026

[Bug] DGX Spark (sm_121): CUTLASS can_implement() rejects sm_120f binaries #36835

Closed

1 task

Copilot AI mentioned this pull request Mar 20, 2026

Fix XPU segfault when tensor_parallel_size exceeds available devices hongbolv/vllm#5

Closed

yuezhu1 pushed a commit to yuezhu1/vllm that referenced this pull request Mar 30, 2026

Merge pull request vllm-project#26 from yuezhu1/issue-10-reallocate-l…

73b30dc

…ora-weights [Core] Add reallocate_lora_weights() to all LoRA layer types

WoosukKwon merged 5 commits into main from graph
