
Implement LLaMA #9

Merged — zhuohan123 merged 16 commits into main from llama on Mar 30, 2023
Conversation

@WoosukKwon
Collaborator

@WoosukKwon WoosukKwon commented Mar 26, 2023

TODO:

  • Test against HF implementation
  • Add TP support (@zhuohan123)
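The first TODO item, testing against the HF implementation, amounts to a logit-parity check between the two model implementations. A minimal sketch of what such a comparison could look like, assuming hypothetical helper names (`max_abs_diff`, `check_parity` are illustrative, not from the PR):

```python
def max_abs_diff(ref_logits, test_logits):
    """Elementwise max absolute difference between two equal-length logit vectors."""
    assert len(ref_logits) == len(test_logits), "logit vectors must match in length"
    return max(abs(r - t) for r, t in zip(ref_logits, test_logits))

def check_parity(ref_logits, test_logits, atol=1e-3):
    """True when every logit agrees within atol (a loose, fp16-friendly bound)."""
    return max_abs_diff(ref_logits, test_logits) <= atol
```

In practice the reference vector would come from `transformers`' LLaMA forward pass and the test vector from the new implementation, compared position by position over a fixed prompt.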

@WoosukKwon
Collaborator Author

@zhuohan123 Please feel free to approve and merge this PR once you think it's ready.

@zhuohan123 zhuohan123 self-requested a review March 29, 2023 06:37
@zhuohan123 zhuohan123 merged commit 80a2f81 into main Mar 30, 2023
@WoosukKwon WoosukKwon deleted the llama branch April 12, 2023 03:12
v1nc3nt27 pushed a commit to v1nc3nt27/vllm that referenced this pull request Sep 12, 2023
Don't error if user doesn't have kernels installed
bigPYJ1151 pushed a commit to bigPYJ1151/vllm that referenced this pull request Dec 29, 2023
heheda12345 added a commit to heheda12345/vllm that referenced this pull request Sep 29, 2025
* code from ds

Signed-off-by: youkaichao <youkaichao@gmail.com>

* doc from ds

Signed-off-by: youkaichao <youkaichao@gmail.com>

* Fixes for support_materials/2-tilelang/

Signed-off-by: mgoin <mgoin64@gmail.com>

* Fix example 1

Signed-off-by: mgoin <mgoin64@gmail.com>

* Fix Einsum in deepgemm

* Fix `libc10.so` unimported error

* fix reference code

Signed-off-by: youkaichao <youkaichao@gmail.com>

* adding missing indexer args

* passing index args into the module

* init

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

* build indexer k cache metadata

* prefill indexer, but weight_proj will output -inf

* unquantized paged indexer, still have -inf issue

* remove support material

* adding topk_indices mask

* add weight scale

* unittest infrastructure and fix weight_proj, numeric error due to quantization

* varlen prefill passed

* paged prefill

* add indices mask

---------

Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
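The "adding topk_indices mask" and "add indices mask" steps in the commit message above boil down to keeping only the k highest-scoring positions. A plain-Python sketch of that selection (illustrative only — the real indexer operates on GPU tensors, and the function name is hypothetical):

```python
def topk_indices_mask(scores, k):
    """Boolean mask that is True at the k highest-scoring positions.

    Sort candidate indices by score (descending), keep the first k,
    and emit a per-position keep/drop mask.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:k])
    return [i in keep for i in range(len(scores))]
```

Masked-out positions would then be excluded from the indexer's attention computation rather than scored at `-inf` directly.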
isaick pushed a commit to isaick/vllm that referenced this pull request Oct 19, 2025
yma11 pushed a commit to yma11/vllm that referenced this pull request Nov 10, 2025
* add wf8af8 pass

Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com>

* remove redundant func

Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com>

* add env into vllm.envs

Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com>

---------

Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com>
dik654 pushed a commit to dik654/vllm-for-study that referenced this pull request Nov 18, 2025
…ections

Manufacturing enhancements:
- Add complete Vision Inspection MCP with Vision AI defect detection
- Add Manufacturing MES MCP with PostgreSQL integration
- Include detailed defect classification and statistics
- Add ROI analysis showing 78% cost reduction and 99.6% time savings

Healthcare enhancements:
- Enhance existing Medical OCR, Drug Interaction, and EHR MCPs
- Add ROI analysis showing 97.2% time reduction
- Include medical accident prevention benefits (≈500 million KRW annual savings)
- Demonstrate HIPAA-compliant prescription OCR workflow

Summary:
- Sections vllm-project#5-8: Fully detailed implementations (2,000+ lines each)
- Sections vllm-project#9-10: Enhanced with complete code + ROI
- Sections vllm-project#11-20+: Comprehensive summaries covering all major industries
- Total guide provides 20+ real-world MCP + Agent architecture patterns
chopper0126 pushed a commit to chopper0126/vllm that referenced this pull request Dec 12, 2025
prashanth058 pushed a commit to prashanth058/vllm that referenced this pull request Dec 12, 2025
sriumcp referenced this pull request in inference-sim/vllm Jan 26, 2026
Update plan document to account for completed work:
- Document PR #0 (EngineCoreEvent removal) as completed prerequisite
- Clarify that do_tracing() is current OTEL mechanism (not legacy)
- Update PR #9 to keep RequestJourneyEvent dataclass (needed for Prometheus)
- Fix terminology: 'legacy' = EngineCoreEvent (removed), 'current' = RequestJourneyEvent
- Add PR #0 to dependencies, timeline, and progress tracking sections

Key corrections:
- do_tracing() will NOT be removed (it's the current system)
- RequestJourneyEvent dataclass will NOT be removed (needed for metrics)
- Only buffering LOGIC will be removed in PR #9

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
sriumcp referenced this pull request in inference-sim/vllm Jan 26, 2026
/9) (#8)

* [Docs] Update journey tracing plan to reflect completed PR #0

Update plan document to account for completed work:
- Document PR #0 (EngineCoreEvent removal) as completed prerequisite
- Clarify that do_tracing() is current OTEL mechanism (not legacy)
- Update PR #9 to keep RequestJourneyEvent dataclass (needed for Prometheus)
- Fix terminology: 'legacy' = EngineCoreEvent (removed), 'current' = RequestJourneyEvent
- Add PR #0 to dependencies, timeline, and progress tracking sections

Key corrections:
- do_tracing() will NOT be removed (it's the current system)
- RequestJourneyEvent dataclass will NOT be removed (needed for metrics)
- Only buffering LOGIC will be removed in PR #9

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Feature] Initialize OTEL tracer in scheduler for journey tracing

Add tracer initialization in Scheduler.__init__() to support dual-stream
journey tracing architecture. This is the foundation for PR #2 which will
create and manage core spans.

Changes:
- Add defensive SpanAttributes import with None fallback
- Initialize tracer when enable_journey_tracing=True and endpoint configured
- Add try/except with warning log for graceful degradation
- Add otlp_traces_endpoint parameter to test utilities
- Add 4 comprehensive tests with proper mocking

Safety guarantees:
- Zero per-request state (tracer is class-level only)
- Zero overhead when disabled (boolean + endpoint guard)
- No spans created (initialization only)
- No cleanup needed (shared tracer instance)
- Backward compatible (all parameters optional)

Test results: All 85 tests passing (81 existing + 4 new)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
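The defensive tracer initialization described in the commit above — a boolean-plus-endpoint guard, then a try/except that logs a warning and degrades gracefully — might look roughly like this. The function name and the `tracer_factory` hook are hypothetical stand-ins for the real OTEL setup call:

```python
import logging

logger = logging.getLogger(__name__)

def init_journey_tracer(enable_journey_tracing, otlp_traces_endpoint, tracer_factory):
    """Return a tracer only when journey tracing is enabled AND an endpoint is set.

    When disabled, the only cost is this boolean/endpoint check (zero overhead).
    When initialization fails, log a warning and return None instead of raising,
    so the scheduler keeps running without tracing.
    """
    if not (enable_journey_tracing and otlp_traces_endpoint):
        return None
    try:
        return tracer_factory(otlp_traces_endpoint)
    except Exception as exc:
        logger.warning("Journey tracing disabled, tracer init failed: %s", exc)
        return None
```

Because the tracer is created once at scheduler construction, there is no per-request state to manage and no cleanup path needed.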
tjtanaa pushed a commit to tjtanaa/vllm that referenced this pull request Jan 29, 2026
Srinivasoo7 pushed a commit to Srinivasoo7/vllm that referenced this pull request Mar 4, 2026
…Manager

- Add store_threshold >= 2 validation in FilterReusedOffloadingManager
  constructor (mirrors the existing max_tracker_size >= 1 guard)
- Fix cpu.py gate from > 1 to >= 2; update comment to clarify that
  values < 2 disable filtering
- Add internal assertions to test_filter_reused_manager to verify
  tracker eviction and count reset (Comments vllm-project#8 and vllm-project#9)
- Remove tests/v1/kv_offload/__init__.py (not needed for pytest discovery)
- Remove accidentally tracked dev-workflow files (.patch, diff*.txt,
  error.txt, log files, mypy/test output files)

Signed-off-by: Srinivasoo7 <158864704+Srinivasoo7@users.noreply.github.com>
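The `store_threshold >= 2` constructor guard described in the commit above, mirroring the existing `max_tracker_size >= 1` check, could be sketched as follows (the class body is an illustrative skeleton, not the real offloading manager):

```python
class FilterReusedOffloadingManager:
    """Sketch of the validation described in the commit message:
    store_threshold values below 2 would disable filtering entirely,
    so the constructor rejects them up front."""

    def __init__(self, store_threshold: int, max_tracker_size: int):
        if store_threshold < 2:
            raise ValueError("store_threshold must be >= 2 (values < 2 disable filtering)")
        if max_tracker_size < 1:
            raise ValueError("max_tracker_size must be >= 1")
        self.store_threshold = store_threshold
        self.max_tracker_size = max_tracker_size
```

Failing fast in the constructor keeps the corresponding runtime gate (`>= 2` rather than `> 1`) from silently running with a configuration that filters nothing.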
danisereb pushed a commit to de-inf/vllm that referenced this pull request Apr 5, 2026
…-nongated-moe

[Bugfix] Fix BF16 trtllm-gen MoE weight corruption for non-gated models