Fix a bug in tying OPT embeddings by WoosukKwon · Pull Request #1 · vllm-project/vllm

WoosukKwon · 2023-02-25T00:27:11Z

This PR fixes a bug in supporting OPT-350m/OPT-6.7b/OPT-13b and OPT-IML models.

The bug happened because our model code didn't include some methods that were required to tie the input and output embeddings.

add rope scaling as a cli arg so openai server can load rope scaled models

Fix key cache block shape.

Deterministic OpenVINO inference

merge code

BA-78554: Jurassic 2.5 * worked on jurasic2.5 configuration file, updated jurassic2_5 modeling file to support alternating experts/attn layers * finished working the forward pass of jurassic3.py * finished working the forward pass of jurassic3.py * finished working the forward pass of jurassic3.py * jurassic_3 modeling file works, uses dummy weights initialized by "dummy" flag. Tokenizer raises issues, for now copying the mixtral tokenizer * changed default tokenizer vocab values, loading of custom .pt weight files works. * removed notebook * merging master to jurassic-2.5 to reset head * Merge branch 'master' into jurassic-2.5 * align to master Approved-by: Tomer Asida Approved-by: Mor Zusman

Triton compilation fix

Group Gemm Version

fea: support rfork

Fix vllm-project#1: Runtime instruction tensor fill - Compile-time tensor is now a TEMPLATE (opcodes set, dimensions zero) - Runtime code fills M/N/K and A/B/C pointers from function arguments - DAG analysis resolves which params feed each GEMM's A/B/C inputs - Intermediate buffers allocated/freed for inter-action data flow Fix vllm-project#2: bar.sync deadlock - GEMM pipeline emits bar.sync 0 expecting all 640 CTA threads - Only 128 threads (4 consumer warps) run the GEMM handler - Post-process replaces bar.sync 0 with bar.sync 1, 128 (scoped barrier) Fix vllm-project#3: Shared memory collision (verified non-issue) - Static smem (inst_state, mbarrier arrays) is separate from dynamic_smem - GEMM cp.async buffers use dynamic_smem offset 0 — no collision - Added compile-time assertions that GEMM smem fits in dynamic region - Documented Phase 3 TODO for when pages carry inter-op data 314 tests pass (84 ferrite-macros + 218 ferrite-ptx + 12 ferrite-runtime). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Nick Mitchell <nickm@us.ibm.com>

…t#38472) Add 17 unit tests covering the KV cache memory-management fixes from the previous commit. Tests run without NCCL, ZMQ, or CUDA by constructing a minimal engine object via object.__new__ and stubbing the two compiled extensions (vllm._C, vllm._C_stable_libtorch) in sys.modules. Test classes ------------ TestRecvTensorPopsFromStore (4 tests) - entry_absent_after_recv_tensor (regression guard vllm-project#1) - buffer_size_decremented_after_recv_tensor - multiple_layers_all_popped - recv_store_does_not_grow_under_sustained_load (high-QPS OOM scenario) TestRecvTensorPoolBackedEntries (4 tests) - pool_free_called_immediately (regression guard vllm-project#2) - pool_entry_removed_from_recv_store - pool_free_not_called_for_none_tensor (edge case: listener OOM path) - pool_free_called_once_per_entry_not_twice TestGetFinishedStragglerCleanup (6 tests) - straggler_gpu_tensor_removed_and_buffer_decremented - straggler_pool_tensor_freed - multiple_straggler_layers_all_cleaned - only_finished_request_cleaned_not_others - noop_when_tensors_already_consumed - send/recv_tracking_dict_cleaned_by_get_finished TestGetFinishedNoCompileLayersParamUnused (1 test) - Verifies straggler cleanup works even with no_compile_layers={} (the old iteration approach would silently skip all cleanup here) TestPoolFreeCalledOnce (1 test) - combined_lifecycle_exactly_one_free: ensures pool.free is called exactly once across recv_tensor + get_finished, not twice. Reproduction of original bugs confirmed: - Old recv_tensor (read without pop) leaves entry in recv_store: CONFIRMED - Old recv_tensor (no pool.free) leaks pinned RAM: CONFIRMED All 17 tests pass against the fixed code. Co-authored-by: GitHub Copilot

…t#38472) Add 17 unit tests covering the KV cache memory-management fixes from the previous commit. Tests run without NCCL, ZMQ, or CUDA by constructing a minimal engine object via object.__new__ and stubbing the two compiled extensions (vllm._C, vllm._C_stable_libtorch) in sys.modules. Test classes ------------ TestRecvTensorPopsFromStore (4 tests) - entry_absent_after_recv_tensor (regression guard vllm-project#1) - buffer_size_decremented_after_recv_tensor - multiple_layers_all_popped - recv_store_does_not_grow_under_sustained_load (high-QPS OOM scenario) TestRecvTensorPoolBackedEntries (4 tests) - pool_free_called_immediately (regression guard vllm-project#2) - pool_entry_removed_from_recv_store - pool_free_not_called_for_none_tensor (edge case: listener OOM path) - pool_free_called_once_per_entry_not_twice TestGetFinishedStragglerCleanup (6 tests) - straggler_gpu_tensor_removed_and_buffer_decremented - straggler_pool_tensor_freed - multiple_straggler_layers_all_cleaned - only_finished_request_cleaned_not_others - noop_when_tensors_already_consumed - send/recv_tracking_dict_cleaned_by_get_finished TestGetFinishedNoCompileLayersParamUnused (1 test) - Verifies straggler cleanup works even with no_compile_layers={} (the old iteration approach would silently skip all cleanup here) TestPoolFreeCalledOnce (1 test) - combined_lifecycle_exactly_one_free: ensures pool.free is called exactly once across recv_tensor + get_finished, not twice. Reproduction of original bugs confirmed: - Old recv_tensor (read without pop) leaves entry in recv_store: CONFIRMED - Old recv_tensor (no pool.free) leaks pinned RAM: CONFIRMED All 17 tests pass against the fixed code. Co-authored-by: GitHub Copilot Signed-off-by: saif <contact@saifmb.com>

…n-files-overview Add Chinese documentation for vLLM framework structure and GEMM call chains

[WIP] Fix Mamba state contamination in KV cache block reuse

Rebase of PR vllm-project#33315 onto current main. Adds max_tokens_per_doc parameter to rerank requests, matching Cohere and Jina rerank APIs. Documents longer than this limit are truncated before scoring. Handles all three cross-encoder code paths: - Cross-encoder with sep token (tokenizer built-in truncation) - Chat template / Jinja path (text truncation before template) - Score template path (text truncation before template) Also supports offline usage via PoolingParams(extra_kwargs={"max_tokens_per_doc": N}). Addresses reviewer feedback from original PR: - Offline support via PoolingParams (noooop vllm-project#1) - Score template compatibility tests (noooop vllm-project#2) - Tests across BAAI/bge-reranker-base, BAAI/bge-reranker-v2-gemma, and Qwen/Qwen3-Reranker-0.6B Original PR: vllm-project#33315 Original author: hustxiayang Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Rebase of PR vllm-project#33315 onto current main. Adds max_tokens_per_doc parameter to rerank requests, matching Cohere and Jina rerank APIs. Documents longer than this limit are truncated before scoring. Handles all three cross-encoder code paths: - Cross-encoder with sep token (tokenizer built-in truncation) - Chat template / Jinja path (text truncation before template) - Score template path (text truncation before template) Also supports offline usage via PoolingParams(extra_kwargs={"max_tokens_per_doc": N}). Addresses reviewer feedback from original PR: - Offline support via PoolingParams (noooop vllm-project#1) - Score template compatibility tests (noooop vllm-project#2) - Tests across BAAI/bge-reranker-base, BAAI/bge-reranker-v2-gemma, and Qwen/Qwen3-Reranker-0.6B Original PR: vllm-project#33315 Original author: hustxiayang Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Jesus Federico <jefp@amazon.com>

Fix OPT errors

44735b4

WoosukKwon merged commit cbf8779 into main Feb 25, 2023

WoosukKwon deleted the fix-opt branch February 25, 2023 00:29

murongweibo mentioned this pull request Jul 11, 2023

NCCL Error 5: invalid usage #427

Closed

TheBloke mentioned this pull request Jul 20, 2023

Can't launch OpenAI API server on newly installed vLLM in Docker - fastchat not found #537

Closed

CZT0 referenced this pull request in semedia-tech/vllm Sep 11, 2023

#1 测试部署vllm

cc4f1ce

orangetin referenced this pull request in togethercomputer/vllm-ttgi Sep 14, 2023

Merge pull request #1 from winglian/longchat-args

b9012fb

add rope scaling as a cli arg so openai server can load rope scaled models

xiangyuT pushed a commit to xiangyuT/vllm that referenced this pull request Oct 18, 2023

Add function invoke call for underlying models (vllm-project#1)

9895bbd

bigPYJ1151 referenced this pull request in bigPYJ1151/vllm Oct 30, 2023

Merge pull request #1 from bigPYJ1151/fix_ans

b5e7066

Fix key cache block shape.

l1cacheDell pushed a commit to CaspianFang/vllm that referenced this pull request Nov 15, 2023

blora LlaMa support vllm-project#1

424df61

shanshanpt mentioned this pull request Nov 17, 2023

Run long conetxt error : CUDA error: an illegal memory access was encountered #1700

Closed

junior-zsy mentioned this pull request Nov 20, 2023

Error with 32k Long Text in chatglm2-6b-32k Model #1725

Closed

hongxiayang referenced this pull request in hongxiayang/vllm Feb 13, 2024

Fix a bug in tying OPT embeddings (#1)

2cb721d

kvikk mentioned this pull request Feb 15, 2024

ERROR: Could not build wheels for vllm, which is required to install pyproject.toml-based projects #2735

Closed

ilya-lavrenov referenced this pull request in ilya-lavrenov/vllm Feb 19, 2024

Merge pull request #1 from ilya-lavrenov/cpu-works

e3d65e0

Deterministic OpenVINO inference

daniel-geon-park added a commit to gmlwns2000/vllm-timber that referenced this pull request Apr 15, 2024

Merge pull request vllm-project#1 from DeepAuto-AI/geon-dev

d9d746e

merge code

afeldman-nm mentioned this pull request Apr 30, 2024

Adding support for encoder-decoder models, like T5 or BART #187

Closed

dlopes78 mentioned this pull request May 8, 2024

[Bug]: VLLM + tritonserver #4695

Closed

fmmoret mentioned this pull request May 8, 2024

[Bug]: Chunked prefill returning gibberish in some cases. #4697

Closed

Bellk17 added a commit to Bellk17/vllm that referenced this pull request May 10, 2024

Merge pull request vllm-project#1 from Bellk17/main

b36d574

Triton compilation fix

yuhuixu1993 mentioned this pull request Jun 2, 2024

[Bug]: loading squeezellm model #5190

Closed

afeldman-nm mentioned this pull request Jun 3, 2024

[Bug]: VLLM_ATTENTION_BACKEND set to ROCM_FLASH only in GHA environment, overriding automatic backend selection; this breaks other kernel unit tests. #5208

Closed

ykim362 referenced this pull request in ykim362/vllm Jun 17, 2024

Wenxh/fp8 on a100 v5 (#1)

aca4a33

Group Gemm Version

xiejibing mentioned this pull request Jun 24, 2024

[Bug]: vLLM 0.4.2 8xH100 init failed #5785

Closed

llmpros mentioned this pull request Jun 27, 2024

[Frontend]: Support base64 embedding #5935

Merged

Juelianqvq mentioned this pull request Jul 3, 2024

[Bug]: Flashinfer stuck with CUDA Graph #6086

Closed

oliver-li mentioned this pull request Jul 5, 2024

[Bug]: NCCL hangs and causes timeout #5484

Closed

This was referenced Jul 5, 2024

Support W4A8 quantization for vllm #5218

Merged

[Bug]: call for stack trace for "Watchdog caught collective operation timeout" #6042

Closed

janssen-llm mentioned this pull request Feb 25, 2026

[Bug] GLM-5-FP8 Crash: CUDAGraph Replay Segmentation Fault #35293

Closed

binbinzhm mentioned this pull request Feb 27, 2026

[RFC]: why block_hash maps not always a single KVCacheBlock. #35516

Open

1 task

JGSweets mentioned this pull request Mar 9, 2026

[Bug]: CUDA error: an illegal memory access was encountered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. #28028

Open

1 task

IWantFight pushed a commit to IWantFight/vllm that referenced this pull request Mar 10, 2026

Merge pull request vllm-project#1 from IWantFight/bf_ant_group

32c46f9

fea: support rfork

LironKesem mentioned this pull request Mar 12, 2026

[Bug] DGX Spark (sm_121): CUTLASS can_implement() rejects sm_120f binaries #36835

Closed

1 task

lavanyabollepalli mentioned this pull request Mar 12, 2026

[Bug]: GPU failure during repeated model loading when using --enable-prefix-caching with KV transfer (LMCacheConnectorV1) #36852

Open

1 task

haosdent mentioned this pull request Mar 15, 2026

[Bugfix] Disable cross-layer KV cache for MLA attention backends #37090

Merged

elvircrn mentioned this pull request Mar 16, 2026

[MoE/EPLB] Fix FlashInfer nvfp4 experts + EPLB correctness #37217

Merged

4 tasks

mahaocong90 mentioned this pull request Mar 17, 2026

[Bug]: QWEN 3.5-397B-A17B report "RPC call to sample_tokens timed out" #37250

Closed

1 task

watch-Ultra mentioned this pull request Mar 18, 2026

[Bug]:推理时报错，模型关闭了。部署的Qwen3.5-122B-A10B-FP8模型 #37392

Open

1 task

mgoin mentioned this pull request Mar 20, 2026

[Feature] Kvcache Int8 per-token scale on TRITON_ATTN continue of #34327 thanks EricccYang #36893

Closed

RocketRider mentioned this pull request Mar 21, 2026

Mamba-2 Triton kernels crash with illegal instruction on SM121 (DGX Spark) without CUDA_LAUNCH_BLOCKING=1 #37431

Open

e1n00r mentioned this pull request Mar 26, 2026

[RFC]: Incremental MoE Expert Offloading — GPU Cache + Async Pipeline #38256

Open

kimihailv mentioned this pull request Mar 27, 2026

[Bug]: IPC update_weights (checkpoint format): hot-swapped weights can diverge from cold load of target checkpoint #38374

Closed

1 task

lishunyang12 mentioned this pull request Mar 30, 2026

[Quantization] Add TurboQuant dynamic kv cache compression #38280

Closed

ljy11a mentioned this pull request Mar 30, 2026

fix: vgpu segfault before nccl init #38529

Open

Damon-Salvetore pushed a commit to Damon-Salvetore/vllm that referenced this pull request Mar 31, 2026

Merge pull request vllm-project#1 from bcacdwk/copilot/create-markdow…

e544f1f

…n-files-overview Add Chinese documentation for vLLM framework structure and GEMM call chains

anand-nv referenced this pull request in anand-nv/vllm Apr 1, 2026

Merge pull request vllm-project#1 from Slyne/slyne/alm_api_v3

70ddd7e

[WIP] Fix Mamba state contamination in KV cache block reuse

This was referenced Apr 2, 2026

feat: add max_tokens_per_doc in rerank request (rebase of #33315) #38827

Open

feat: add max tokens per doc in rerank request #33315

Open

varjoranta mentioned this pull request Apr 4, 2026

[Attention Backend] TurboQuant: 2-bit KV cache compression with 4x capacity #38479

Open

AlexanderValentini mentioned this pull request Apr 5, 2026

Qwen-3.5 9B often producing repetitive/garbled output with Intel Backend #38994

Open

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix a bug in tying OPT embeddings#1

Fix a bug in tying OPT embeddings#1
WoosukKwon merged 1 commit intomainfrom
fix-opt

WoosukKwon commented Feb 25, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

WoosukKwon commented Feb 25, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant