Support tensor parallel by zhuohan123 · Pull Request #2 · vllm-project/vllm

zhuohan123 · 2023-02-28T08:40:38Z

TODOs:

In another PR:

Merge QKV into one.
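
For context, "merging QKV" usually means replacing the three separate query, key, and value projections with a single projection whose weight is the three weights concatenated along the output dimension, then splitting the result. The sketch below is only an illustration of that idea with plain Python lists; the helper names (`matmul`, `concat_columns`) and the toy weights are hypothetical and are not taken from this PR.

```python
def matmul(x, w):
    """Multiply x (n x d) by w (d x m); both are lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def concat_columns(*mats):
    """Concatenate matrices with the same row count along the column axis."""
    return [sum((m[r] for m in mats), []) for r in range(len(mats[0]))]


# Toy input and projection weights (hidden size 2, integer values for exact comparison).
x = [[1, 2], [3, 4]]
w_q = [[1, 0], [0, 1]]
w_k = [[2, 0], [0, 2]]
w_v = [[0, 1], [1, 0]]

# Separate projections: three matrix multiplications.
q_sep, k_sep, v_sep = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)

# Merged projection: one multiplication with the concatenated weight, then a split.
w_qkv = concat_columns(w_q, w_k, w_v)
qkv = matmul(x, w_qkv)
q = [row[0:2] for row in qkv]
k = [row[2:4] for row in qkv]
v = [row[4:6] for row in qkv]

print(q == q_sep and k == k_sep and v == v_sep)
```

The merged form computes the same values with one larger multiplication instead of three smaller ones, which is typically the motivation for this kind of change.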

WoosukKwon

Fantastic! Left minor comments.

BTW, the sampling results were different when using TP:

Current master (python server.py --model facebook/opt-13b)

# GPU blocks: 1826, # CPU blocks: 3276
Seq 0: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of the'
Seq 1: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of several'
Seq 2: 'Ion Stoica is a professor of philosophy at the University of Bucharest, Romania. He is the author'
Seq 3: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of The'
Seq 4: "UC Berkeley is a very liberal school, but I don't think they would look at it."
Seq 5: 'UC Berkeley is about to get some more tree-hugging support from the University of Washington'
Seq 6: "UC Berkeley is the university of utah\nNot even close\nYeah I'd say it's"
Seq 7: 'The future of cloud computing is React\n\n6 Avril, 2016 | By Maxime Boklan\n\n'

4-way TP (python server.py --model facebook/opt-13b --tensor-parallel-size 4)

# GPU blocks: 4970, # CPU blocks: 3276
Seq 0: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of the'
Seq 1: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of several'
Seq 2: 'Ion Stoica is a professor of philosophy at the University of Bucharest, Romania. He is the author'
Seq 3: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of The'
Seq 4: "UC Berkeley is a very liberal school, but I don't think they would've been too much"
Seq 5: 'UC Berkeley is about to get some more visiting team fans in the seats, as the school�'
Seq 6: 'UC Berkeley is the university of weed.\n*school of vape\nNot everyone who vapes'
Seq 7: "The future of cloud computing is blazing bright\nIf there's a consensus in the tech world today, it's"

8-way TP (python server.py --model facebook/opt-13b --tensor-parallel-size 8)

# GPU blocks: 5464, # CPU blocks: 3276
Seq 0: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of the'
Seq 1: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of several'
Seq 2: 'Ion Stoica is a professor of philosophy at the University of Bucharest, Romania. He is the author'
Seq 3: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of The'
Seq 4: "UC Berkeley is a very liberal school, but I don't think they would put a limit."
Seq 5: 'UC Berkeley is about to get some more visiting team fans in the stands, as the school is'
Seq 6: 'UC Berkeley is the university of weed.\n*school of anarchy\nAll respect to the academics'
Seq 7: "The future of cloud computing is blazing bright\nIf there's a consensus in the tech world today, it's"

cacheflow/utils.py

cacheflow/models/memory_analyzer.py

cacheflow/models/model_utils.py

cacheflow/models/opt.py

server.py

cacheflow/models/opt.py

cacheflow/worker/controller.py

cacheflow/worker/worker.py

zhuohan123 · 2023-03-21T09:36:06Z

@WoosukKwon Thanks again for the review! All comments are resolved. Regarding the different sampling results, I think it's too hard to get the same sampling results across different tensor parallel configs. Adding more GPUs changes the model and the execution flow on each GPU, which can change the random process here and there. I can't keep the sampling results identical, and I don't think it's necessary to.
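
The thread does not pin down a specific cause. One plausible contributor, offered here as an assumption rather than something stated in the PR, is that splitting a reduction across GPUs changes the order of floating-point additions, and floating-point addition is not associative. The sketch below mimics that with plain Python floats; `left_to_right_sum` and `sharded_sum` are hypothetical helpers, not functions from this codebase.

```python
def left_to_right_sum(values):
    """Add floats strictly left to right, starting from 0.0."""
    total = 0.0
    for v in values:
        total += v
    return total


def sharded_sum(values, num_shards):
    """Sum values as if they were split evenly across num_shards workers.

    Each worker sums its own shard, then the partial sums are added together,
    which is roughly what an all-reduce over partial results does.
    """
    shard_size = len(values) // num_shards
    partials = [
        left_to_right_sum(values[i * shard_size:(i + 1) * shard_size])
        for i in range(num_shards)
    ]
    return left_to_right_sum(partials)


values = [0.0, 0.1, 0.2, 0.3]
single = sharded_sum(values, 1)   # one worker adds everything in order
two_way = sharded_sum(values, 2)  # two workers add halves, then combine

print(single == two_way)          # the two results differ in the last bits
print(abs(single - two_way))      # the gap is tiny but nonzero
```

A difference this small in a logit is usually harmless, but when two candidate tokens have nearly equal probability it can be enough to change which one is sampled, after which the sequences diverge.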

WoosukKwon

Thanks a lot @zhuohan123 for your huge effort! This is fantastic!

cacheflow/models/model_utils.py

zhuohan123 added 9 commits February 28, 2023 01:30

copy code from fairseq

e8d661c

remove files from fairscale

827f85f

copy files from megatron

76ed019

[WIP] add distributed init

55e5d86

Parallelize the Transformer layers

7100db2

Load weight on a single GPU

1e86393

support multi-gpu tensor parallelism

90970e1

support tensor parallelism on multiple gpus

88960f7

fix correctness

900eace

zhuohan123 changed the title ~~[WIP] Support tensor parallel~~ Support tensor parallel Mar 9, 2023

zhuohan123 added 6 commits March 17, 2023 14:05

Merge branch 'main' into tensor_parallel

6a6f7cc

fix merging errors

d5a70ab

add filelock

60bf11e

support parallel decoding

a7be5b8

update readme

538d067

remove unused files

893d4b3

zhuohan123 requested a review from WoosukKwon March 19, 2023 02:51

fix loading for large models

e0f9f48

WoosukKwon reviewed Mar 21, 2023

zhuohan123 added 5 commits March 21, 2023 03:24

Fix some smaller issues raised by Woosuk first.

6ef5111

Fix more review issues

6727083

remove duplicate set_seed

ddc1ab0

Support the case where embedding_size != hidden_size

1d532c5

Resolve comments on weight loading and device id comments.

64e3950

WoosukKwon approved these changes Mar 21, 2023

cacheflow/models/model_utils.py (resolved)

WoosukKwon merged commit 2f49f15 into main Mar 21, 2023

zhuohan123 deleted the tensor_parallel branch June 18, 2023 07:22

TheBloke mentioned this pull request Jul 20, 2023

Can't launch OpenAI API server on newly installed vLLM in Docker - fastchat not found #537

Closed

Danielkinz mentioned this pull request Aug 15, 2023

[Feature | CI] Added a github action to build wheels #746

Merged

gemini-code-assist bot mentioned this pull request Dec 30, 2025

[Bugfix] Fix weight_loader v1 block scale #31103

Merged

5 tasks

devbyteai mentioned this pull request Dec 30, 2025

fix(compile): apply partition wrapper when loading AOT cached functions #31536

Merged

hangy-amd referenced this pull request in hangy-amd/vllm Jan 4, 2026

Merge pull request #2 from hangy-amd/lhy/add_clamp

d91ff40

[fix] lhy/add clamp

bellkjtt mentioned this pull request Jan 26, 2026

fix: Add infinite loop detection for multimodal models (e.g., PaddleOCR-VL) #33068

Open

6 tasks

Lrcx mentioned this pull request Jan 29, 2026

[Bug]: Crash when using presence_penalty with Qwen3-VL in v0.11.0 #33338

Open

1 task

HervorTao mentioned this pull request Feb 3, 2026

[Bug]: [CPU Backend] AttributeError: '_OpNamespace' '_C_utils' object has no attribute 'init_cpu_threads_env' #33675

Closed

1 task

noooop mentioned this pull request Feb 9, 2026

[Usage]: Help: when deploying the qwen3-vl-Embedding model online with vLLM, why do the outputs differ from the results of offline transformer calls? vllm=0.14.0 #33167

Open

1 task

grimulkan mentioned this pull request Feb 16, 2026

[Bug]: vLLM does not support DeepSeek series on RTX PRO 6000/SM120 #26211

Open

1 task

elvircrn mentioned this pull request Feb 18, 2026

[Bugfix] Fix EPLB + NVFP4: make expanded activation scales contiguous #34646

Closed

4 tasks

Isotr0py mentioned this pull request Feb 25, 2026

[Bug]: Large Video Request cause vLLM Progress Core Dump #35285

Open

1 task

ChuanLi1101 mentioned this pull request Mar 6, 2026

[Refactor] Move FusedMoE hidden_size roundup to quant_method #34285

Merged

JGSweets mentioned this pull request Mar 9, 2026

[Bug]: CUDA error: an illegal memory access was encountered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. #28028

Open

1 task

Copilot AI mentioned this pull request Mar 10, 2026

fix(mooncake): address review comments on HBM leak fix machov/vllm#1

Merged

LironKesem mentioned this pull request Mar 12, 2026

[Bug] DGX Spark (sm_121): CUTLASS can_implement() rejects sm_120f binaries #36835

Closed

1 task

lavanyabollepalli mentioned this pull request Mar 12, 2026

[Bug]: GPU failure during repeated model loading when using --enable-prefix-caching with KV transfer (LMCacheConnectorV1) #36852

Open

1 task

Etelis mentioned this pull request Mar 16, 2026

[Frontend] Add FP8 output quantization support to FlashAttention backend #31636

Open

5 tasks

elvircrn mentioned this pull request Mar 16, 2026

[MoE/EPLB] Fix FlashInfer nvfp4 experts + EPLB correctness #37217

Merged

4 tasks

mahaocong90 mentioned this pull request Mar 17, 2026

[Bug]: QWEN 3.5-397B-A17B report "RPC call to sample_tokens timed out" #37250

Closed

1 task

watch-Ultra mentioned this pull request Mar 18, 2026

[Bug]: Error during inference and the model shut down. Deployed model: Qwen3.5-122B-A10B-FP8 #37392

Open

1 task

markmc mentioned this pull request Mar 19, 2026

[Frontend][Responses API] Fix arrival_time recording for TTFT on initial request #37498

Merged

RocketRider mentioned this pull request Mar 21, 2026

Mamba-2 Triton kernels crash with illegal instruction on SM121 (DGX Spark) without CUDA_LAUNCH_BLOCKING=1 #37431

Open

kimihailv mentioned this pull request Mar 27, 2026

[Bug]: IPC update_weights (checkpoint format): hot-swapped weights can diverge from cold load of target checkpoint #38374

Closed

1 task

This was referenced Apr 2, 2026

feat: add max_tokens_per_doc in rerank request (rebase of #33315) #38827

Open

feat: add max tokens per doc in rerank request #33315

Open

WoosukKwon merged 21 commits into main from tensor_parallel
