[Frontend][1/N] Improve all pooling task | Support FP16 Embedding Base64 (Still uses fp32 by default).#26414
Conversation
Signed-off-by: wang.yuqi <noooop@126.com>

Documentation preview: https://vllm--26414.org.readthedocs.build/en/26414/
examples/online_serving/pooling/openai_embedding_embed_dtype_client.py — are you OK with this API?

Yes, this PR can even use fp8. The small-scale test results are quite good; a more detailed test will be provided tomorrow.
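The API under discussion returns base64-encoded embeddings in a dtype other than fp32. Below is a minimal client-side sketch of decoding such a payload; the `decode_embedding` helper and the plain dtype-string convention are illustrative assumptions, not the PR's actual API surface:

```python
import base64
import numpy as np

def decode_embedding(b64_data: str, dtype: str = "float32") -> np.ndarray:
    """Decode a base64-encoded embedding payload into a NumPy vector.

    The client must know which dtype the server encoded with; fp32
    remains the default, while fp16 roughly halves the payload size.
    """
    raw = base64.b64decode(b64_data)
    return np.frombuffer(raw, dtype=np.dtype(dtype))

# Round-trip illustration with a synthetic fp16 embedding:
vec = np.random.rand(8).astype(np.float16)
encoded = base64.b64encode(vec.tobytes()).decode("utf-8")
decoded = decode_embedding(encoded, dtype="float16")
assert np.array_equal(vec, decoded)
```

The key point is that the wire format is just the raw array bytes, so the requested embed dtype must travel out-of-band (in the request) for the client to interpret the buffer correctly.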
@noooop Yes. There are also optional enhancements: binary protocols, such as Postgres's, always expect big-endian numbers; this is the de facto standard for almost all binary network protocols. Models, however, typically operate in little-endian format, so byte-order conversion is always necessary. Adding an endian parameter would also be useful.
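The byte-order point above can be shown in a few lines. This is a sketch of what a hypothetical `endian` parameter would control server-side (the parameter itself is only a suggestion in this thread, not something the PR implements):

```python
import numpy as np

# Models produce little-endian bytes on most hardware; binary protocols
# such as Postgres's expect big-endian ("network order") instead.
vec_le = np.arange(4, dtype="<f2")   # little-endian float16
vec_be = vec_le.astype(">f2")        # same values, big-endian byte layout

assert np.array_equal(vec_le, vec_be)        # numeric values are unchanged
assert vec_le.tobytes() != vec_be.tobytes()  # raw byte order differs
```

Because the numeric values are identical either way, the conversion only affects the serialized buffer, which is exactly why it matters for base64 payloads consumed by big-endian protocols.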
cc @DarkLight1337 @maxdebayser Ready for review
maxdebayser
left a comment
Awesome. I've left a few comments but this looks good to me.
Is there anything else that needs to be modified in this PR?
DarkLight1337
left a comment
Nope, this LGTM now. Thanks!
…e64 (Still uses fp32 by default). (vllm-project#26414) Signed-off-by: wang.yuqi <noooop@126.com> Co-authored-by: Maximilien de Bayser <maxdebayser@gmail.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Signed-off-by: 1994 <1994@users.noreply.github.com>
Improve all pooling task
These PRs mostly conflict with one another, so combining them into a series gives reviewers a better picture of what has changed and what still needs to be done afterward.
Purpose
FIX #26248
For the MTEB test, PTAL at #17175.
https://github.com/noooop/snippet/blob/main/benchmarks/test_mteb/test_embed_dtype.py
float32 ≈ float16 > bfloat16 > fp8_e4m3 >> fp8_e5m2
Even with fp8_e5m2, the gap is smaller than expected.
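The ranking above (float32 ≈ float16 > bfloat16) follows from mantissa width, which a quick round-trip experiment can illustrate. NumPy has no native bfloat16 or fp8, so this sketch emulates bfloat16 by truncating the low 16 bits of float32 (a rough round-to-zero stand-in, not the MTEB methodology used for the benchmark numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
ref = rng.standard_normal(1024).astype(np.float32)

def rel_error(approx: np.ndarray) -> float:
    """Mean relative round-trip error against the float32 reference."""
    return float(np.mean(np.abs(approx.astype(np.float32) - ref) / np.abs(ref)))

fp16 = ref.astype(np.float16)
# Emulate bfloat16 by zeroing the low 16 mantissa bits of each float32.
bf16 = (ref.view(np.uint32) & np.uint32(0xFFFF0000)).view(np.float32)

# float16 keeps 10 mantissa bits vs. bfloat16's 7, so for unit-scale
# embedding values it tracks float32 more closely.
assert rel_error(fp16) < rel_error(bf16)
```

The same mantissa-width argument predicts the further drop for fp8_e4m3 (3 mantissa bits) and fp8_e5m2 (2 mantissa bits), consistent with the benchmark ordering quoted above.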
Test Plan
tests/entrypoints/pooling/openai/test_embedding.py
tests/entrypoints/pooling/openai/test_pooling.py
Test Result
All tests pass.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.