
FastAPI-based working frontend#10

Merged
zhuohan123 merged 14 commits into main from real-frontend
Mar 29, 2023

Conversation

Member

@zhuohan123 zhuohan123 commented Mar 27, 2023

Add a FastAPI-based frontend to cacheflow while keeping the old script working.

Remaining TODOs:

  • Add a README for the FastAPI frontend.
  • Rename the old script.
  • Add a gradio demo web frontend.

@zhuohan123 zhuohan123 requested a review from WoosukKwon March 27, 2023 06:19
Collaborator

@WoosukKwon WoosukKwon left a comment


LGTM! Thanks for your effort.

@zhuohan123 zhuohan123 merged commit 721fa3d into main Mar 29, 2023
@zhuohan123 zhuohan123 deleted the real-frontend branch March 29, 2023 06:49
xiangyuT pushed a commit to xiangyuT/vllm that referenced this pull request Oct 25, 2023
* Add underlying functions

* tests done
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
slyalin pushed a commit to slyalin/vllm that referenced this pull request Mar 22, 2024
ykim362 pushed a commit to ykim362/vllm that referenced this pull request Jun 17, 2024
@alixiaodi alixiaodi mentioned this pull request Aug 2, 2024
zeroorhero pushed a commit to zeroorhero/vllm that referenced this pull request Sep 23, 2024
wuhuikx pushed a commit to wuhuikx/vllm that referenced this pull request Mar 27, 2025
### What this PR does / why we need it?
This PR adds Chinese documentation for vllm-ascend for Chinese-speaking developers.

### Does this PR introduce _any_ user-facing change?
Changes as follows:
- add README.zh.md
- add environment.zh.md
- add CONTRIBUTING.zh.md

### How was this patch tested?
By CI

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
juncgu pushed a commit to juncgu/vllm that referenced this pull request May 8, 2025
Move new GPUModelRunner methods out of `execute_model` method
zyongye pushed a commit to zyongye/vllm that referenced this pull request Aug 5, 2025
* hf format

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

* better qkv concat

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

---------

Signed-off-by: Chen Zhang <zhangch99@outlook.com>
zyongye pushed a commit to zyongye/vllm that referenced this pull request Aug 6, 2025
dik654 pushed a commit to dik654/vllm-for-study that referenced this pull request Nov 18, 2025
…ections

Manufacturing enhancements:
- Add complete Vision Inspection MCP with Vision AI defect detection
- Add Manufacturing MES MCP with PostgreSQL integration
- Include detailed defect classification and statistics
- Add ROI analysis showing 78% cost reduction and 99.6% time savings

Healthcare enhancements:
- Enhance existing Medical OCR, Drug Interaction, and EHR MCPs
- Add ROI analysis showing 97.2% time reduction
- Include medical accident prevention benefits (500 million KRW in annual savings)
- Demonstrate HIPAA-compliant prescription OCR workflow

Summary:
- Sections vllm-project#5-8: Fully detailed implementations (2,000+ lines each)
- Sections vllm-project#9-10: Enhanced with complete code + ROI
- Sections vllm-project#11-20+: Comprehensive summaries covering all major industries
- Total guide provides 20+ real-world MCP + Agent architecture patterns
chopper0126 pushed a commit to chopper0126/vllm that referenced this pull request Dec 12, 2025
prashanth058 pushed a commit to prashanth058/vllm that referenced this pull request Dec 12, 2025
eble-amd pushed a commit to eble-amd/vllm that referenced this pull request Mar 17, 2026
- Make w_dequant non-optional in W8A16 custom op since it is always
  pre-computed at weight-load time; remove dead inline dequant fallback.
- Add explicit TORCH_CHECK for unsupported group_size in the
  wvSplitK_int4g_hf_sweep dispatch instead of silent fallthrough.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
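The guard described above replaces a silent fallthrough with a loud failure on unsupported configurations. A Python sketch of the same pattern (the names and supported sizes are assumptions for illustration; the real change uses `TORCH_CHECK` in C++):

```python
# Illustrative dispatch guard: fail loudly on an unsupported group_size
# instead of silently falling through. Values are assumed for the example.
SUPPORTED_GROUP_SIZES = (32, 64, 128)

def dispatch_int4_kernel(group_size: int) -> str:
    # Validate before dispatch so misconfiguration surfaces immediately.
    if group_size not in SUPPORTED_GROUP_SIZES:
        raise ValueError(
            f"unsupported group_size {group_size}; "
            f"expected one of {SUPPORTED_GROUP_SIZES}"
        )
    # Select the kernel variant for the validated group size.
    return f"kernel_g{group_size}"
```

Failing at dispatch time keeps a bad config from producing silently wrong numerics further down the pipeline.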
yuezhu1 pushed a commit to yuezhu1/vllm that referenced this pull request Mar 25, 2026
…llm-project#10, closes vllm-project#20)

Implements reallocate_lora_weights(new_slots) so stacked GPU tensors can
be resized at runtime without restarting the server.

- BaseLayerWithLoRA: single implementation with _reallocate() helper that
  handles both tuple-of-tensors (linear layers) and plain-tensor
  (LogitsProcessorWithLoRA) storage via isinstance check. All linear layer
  subclasses inherit this for free.
- FusedMoEWithLoRA: override to reallocate the four w13/w2 weight tuples,
  resize adapter_enabled, rebuild the flat lora_a/b_stacked views list,
  and update max_loras. FusedMoE3DWithLoRA inherits this override.
- 22 CPU-only unit tests in tests/lora/test_reallocate_lora_weights.py
  covering shape after grow/shrink, weight preservation for surviving slots,
  zero-init of new slots, no-op before create_lora_weights, and no
  empty_cache() call inside the method.

Pre-commit: ruff-check, ruff-format, mypy-3.10 all pass.
Tests: 22/22 pass on CPU.

AI assistance was used (Claude Code). All changed lines reviewed by
@yuezhu1. This does not duplicate any existing upstream PR or issue.

Co-authored-by: Claude <noreply@anthropic.com>
Damon-Salvetore pushed a commit to Damon-Salvetore/vllm that referenced this pull request Mar 31, 2026
…t-linear-fp8

Add cuSPARSELt FP8 Linear method analysis to fp8_gemm_integration_analysis.md
danisereb pushed a commit to de-inf/vllm that referenced this pull request Apr 5, 2026
…dp-tcp-placement

Port multi-node DP fixes from upstream PR vllm-project#38630
