
Add cache watermark to avoid frequent cache eviction#11

Merged
WoosukKwon merged 5 commits into main from watermark
Mar 29, 2023
Conversation

@WoosukKwon
Collaborator

@WoosukKwon WoosukKwon commented Mar 27, 2023

This PR implements the watermark mechanism to prevent frequent preemption.

If we admit new sequences until the GPU KV cache becomes completely full, preemptions are highly likely in the next few steps. Instead, we can reserve a small portion of the cache and refrain from using the entire cache space when admitting new sequences. This helps us avoid these inefficiencies.
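The admission rule can be sketched as follows. This is a minimal illustrative version, not vLLM's actual block-manager API: the function name `can_allocate` and the parameter names are assumptions, and the 1% default watermark is only an example value.

```python
def can_allocate(num_required_blocks: int,
                 num_free_gpu_blocks: int,
                 num_total_gpu_blocks: int,
                 watermark: float = 0.01) -> bool:
    """Admit a new sequence only if, after allocating its KV cache
    blocks, at least a `watermark` fraction of all GPU blocks
    remains free as headroom for running sequences to grow."""
    watermark_blocks = int(watermark * num_total_gpu_blocks)
    return num_free_gpu_blocks - num_required_blocks >= watermark_blocks

# With 1000 total blocks and a 1% watermark, 10 blocks stay reserved:
print(can_allocate(85, 100, 1000))  # True:  100 - 85 = 15 >= 10
print(can_allocate(95, 100, 1000))  # False: 100 - 95 = 5  < 10
```

Without the watermark, the second request would be admitted, the cache would fill, and a running sequence would likely be preempted a few steps later when it needs another block.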

@WoosukKwon WoosukKwon requested a review from zhuohan123 March 28, 2023 08:16
@WoosukKwon WoosukKwon changed the title Add cache watermark to avoid frequent preemptions Add cache watermark to avoid frequent cache eviction Mar 29, 2023
@WoosukKwon
Collaborator Author

@zhuohan123 I'm merging this PR as it does not conflict with any other PR and it (slightly) improves the system performance.

@WoosukKwon WoosukKwon merged commit 64e0e38 into main Mar 29, 2023
@WoosukKwon WoosukKwon deleted the watermark branch March 29, 2023 23:38
bigPYJ1151 pushed a commit to bigPYJ1151/vllm that referenced this pull request Sep 12, 2023
* add pos_encoding impl

* add benchmark and add open mp parallel
xiangyuT pushed a commit to xiangyuT/vllm that referenced this pull request Oct 25, 2023
* Comments done above worker

* format

* fixed missing arguments

* fix

* format
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request Mar 25, 2024
ykim362 pushed a commit to ykim362/vllm that referenced this pull request Jun 17, 2024
@alixiaodi alixiaodi mentioned this pull request Aug 2, 2024
zeroorhero pushed a commit to zeroorhero/vllm that referenced this pull request Sep 23, 2024
Xaenalt pushed a commit to Xaenalt/vllm that referenced this pull request Jan 15, 2025
wuhuikx pushed a commit to wuhuikx/vllm that referenced this pull request Mar 27, 2025
### What this PR does / why we need it?
Add feature and model support matrix

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI test is enough

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
njhill pushed a commit to njhill/vllm that referenced this pull request May 10, 2025
dcmaddix pushed a commit to dcmaddix/vllm that referenced this pull request Oct 11, 2025
Pradyun92 pushed a commit to Pradyun92/vllm that referenced this pull request Oct 17, 2025
Issue: TPU sampler and Eagle code had two separate but related issues:
1. TPU sampler divides by zero for greedy requests (temperature=0.0)
2. Eagle code triggers mypy type errors due to missing None check

Root Cause:
- TPU sampler's apply_temperature() method lacks epsilon guard to prevent
  division by zero when temperature=0.0 (greedy sampling)
- Eagle's compute_probs_and_sample_next_token() uses temperature without
  asserting it's not None, causing mypy type errors

Impact:
- TPU: Division by zero produces NaN/Inf logits, breaking speculative
  decoding on TPU platforms for all models using Eagle/rejection sampling
- Eagle: mypy type checking failures prevent pre-commit hooks from passing

Fix:
1. TPU Sampler (vllm/v1/sample/tpu/sampler.py):
   - Add all_random parameter to apply_temperature() method
   - Add epsilon guard: if not all_random: temp = torch.where(temp < _SAMPLING_EPS, 1.0, temp)
   - Update call site to pass sampling_metadata.all_random

2. TPU Metadata (vllm/v1/sample/tpu/metadata.py):
   - Add all_random property to TPUSupportedSamplingMetadata
   - Populate all_random from input_batch in from_input_batch()

3. Eagle (vllm/v1/spec_decode/eagle.py):
   - Add assert sampling_metadata.temperature is not None after all_greedy early return
   - Matches sampler.py pattern (line 162) for type safety

Files Modified:
- vllm/v1/sample/tpu/sampler.py: Epsilon guard in apply_temperature()
- vllm/v1/sample/tpu/metadata.py: Added all_random property
- vllm/v1/spec_decode/eagle.py: Added temperature None assertion
- CLAUDE.md: Updated modification vllm-project#11 to document fixes

This addresses PR vllm-project#27077 reviewer feedback and resolves mypy type errors.

Signed-off-by: Pradyun Ramadorai <pradyunr@amazon.com>
yma11 added a commit to yma11/vllm that referenced this pull request Nov 14, 2025
Signed-off-by: Yan Ma <yan.ma@intel.com>
yma11 added a commit to yma11/vllm that referenced this pull request Nov 16, 2025
Signed-off-by: Yan Ma <yan.ma@intel.com>
dik654 pushed a commit to dik654/vllm-for-study that referenced this pull request Nov 18, 2025
…ections

Manufacturing enhancements:
- Add complete Vision Inspection MCP with Vision AI defect detection
- Add Manufacturing MES MCP with PostgreSQL integration
- Include detailed defect classification and statistics
- Add ROI analysis showing 78% cost reduction and 99.6% time savings

Healthcare enhancements:
- Enhance existing Medical OCR, Drug Interaction, and EHR MCPs
- Add ROI analysis showing 97.2% time reduction
- Include medical accident prevention benefits (annual savings of 500 million KRW)
- Demonstrate HIPAA-compliant prescription OCR workflow

Summary:
- Sections vllm-project#5-8: Fully detailed implementations (2,000+ lines each)
- Sections vllm-project#9-10: Enhanced with complete code + ROI
- Sections vllm-project#11-20+: Comprehensive summaries covering all major industries
- Total guide provides 20+ real-world MCP + Agent architecture patterns
chaojun-zhang pushed a commit to chaojun-zhang/vllm that referenced this pull request Nov 20, 2025
Signed-off-by: Yan Ma <yan.ma@intel.com>
GuoRen868 pushed a commit to GuoRen868/vllm that referenced this pull request Dec 26, 2025
[AFD][MTP] Adapt MTP layer for AFD mode, co-locate with attention, and fix several quantization bugs.
Contributor

@hsliuustc0106 hsliuustc0106 left a comment


Good, focused change. The watermark logic is sound and well-scoped. A couple of minor observations below.

yuezhu1 pushed a commit to yuezhu1/vllm that referenced this pull request Mar 30, 2026
- Add _lora_slots field on LoRAModelManager, decoupled from lora_config
  so dynamic scaling does not mutate the original config object
- Add _evict_adapters_to_fit() hook on base class (raises on overflow);
  LRUCacheLoRAModelManager overrides it with LRU eviction and rebuilds
  _active_adapters cache with new capacity (cachetools maxsize read-only)
- Implement resize_lora_slots() on base class: validates, evicts, calls
  reallocate_lora_weights() on all modules, empty_cache() once, resizes
  lora_index_to_id, updates _lora_slots
- Step 7 (re-load surviving adapters) intentionally omitted — weights are
  preserved via GPU-to-GPU copy in reallocate_lora_weights(); comment
  notes what to do if a remote weight store is introduced in future
- Add tests/lora/test_lora_model_manager_resize.py: 6 CPU-only unit tests
  covering validation, no-op, grow, LRU shrink, base-class overflow raise,
  and empty_cache() called exactly once

Closes vllm-project#11
Closes vllm-project#21

AI assistance was used; all changed lines reviewed by the submitter.

Co-authored-by: Claude
Signed-off-by: Yue Zhu <Yue.Zhu@ibm.com>
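The validate-then-evict-then-shrink flow described in the commit message above can be sketched in a few lines. This is a hypothetical simplification: the class and method names (`LoRASlotManager`, `resize_slots`) and the `OrderedDict`-based LRU are illustrative, not vLLM's real `LoRAModelManager` API, and weight reallocation is omitted.

```python
from collections import OrderedDict

class LoRASlotManager:
    def __init__(self, num_slots: int):
        self.num_slots = num_slots
        self._active = OrderedDict()  # adapter_id -> weights, LRU order

    def activate(self, adapter_id, weights):
        """Activate an adapter, evicting the least recently used one
        if all slots are occupied."""
        if adapter_id not in self._active and len(self._active) >= self.num_slots:
            self._active.popitem(last=False)  # evict LRU
        self._active[adapter_id] = weights
        self._active.move_to_end(adapter_id)  # mark as most recently used

    def resize_slots(self, new_num_slots: int):
        """Validate the new capacity, evict LRU adapters until the
        active set fits, then record the new slot count."""
        if new_num_slots < 1:
            raise ValueError("need at least one LoRA slot")
        while len(self._active) > new_num_slots:
            self._active.popitem(last=False)  # LRU eviction to fit
        self.num_slots = new_num_slots

mgr = LoRASlotManager(3)
for i in range(3):
    mgr.activate(i, object())
mgr.resize_slots(2)           # evicts adapter 0 (least recently used)
print(sorted(mgr._active))    # [1, 2]
```

The base-class behavior in the commit (raise on overflow instead of evicting) would replace the `while` loop with a capacity check that raises.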
Damon-Salvetore pushed a commit to Damon-Salvetore/vllm that referenced this pull request Mar 31, 2026
…-for-vllm

Replace vLLM README with comprehensive SlideSparse documentation for ICML submission
