
[Frontend][3/N] Improve all pooling task | Support binary embedding response#27066

Merged
DarkLight1337 merged 35 commits into vllm-project:main from noooop:binary_response
Oct 22, 2025
Conversation

@noooop noooop commented Oct 17, 2025

TL;DR

  • Support endianness: ["native", "big", "little"], native by default.
  • Support a bytes encoding_format, a very simple (but highly efficient) binary embedding response method:
    • When writing, first put the metadata into the response headers, then write all the binary data, in order, into the response body.
    • When reading, first read the headers, then slice the body at the given offsets into the corresponding tensors.
  • This API provides four benefits @uasan
  1. Significant reduction in response size
  2. No need for JSON.parse and Base64.decode
  3. Possibility of stream processing of the response
  4. Endianness customization
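As a sketch of the reading side described above. The header name, metadata layout, and dtype strings here are illustrative assumptions, not the PR's actual wire format:

```python
import json

import numpy as np


def decode_binary_embeddings(headers: dict, body: bytes) -> list[np.ndarray]:
    """Decode a hypothetical binary embedding response.

    Assumes the server put a JSON list in an "x-embedding-metadata"
    header (assumed name), one entry per embedding with its byte
    offset, byte length, and dtype string.
    """
    metadata = json.loads(headers["x-embedding-metadata"])
    embeddings = []
    for item in metadata:
        start = item["offset"]
        end = start + item["length"]
        # "<f4" means little-endian float32; a big-endian response
        # would advertise ">f4", so endianness travels with the data.
        arr = np.frombuffer(body[start:end], dtype=np.dtype(item["dtype"]))
        embeddings.append(arr)
    return embeddings
```

Because each embedding is located by offset and length, a client can in principle start slicing tensors out of the body as chunks arrive, which is what enables the stream-processing benefit above.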

Improve all pooling task

These PRs mostly conflict with each other, so combining them into a series better informs reviewers about what happened and what else remains to be done afterwards.

Purpose

The server compresses the response when the client sends the 'accept-encoding: zstd, gzip' header. Using response compression to transmit binary data lets us drop base64, which is very inefficient. I wouldn't have thought of this approach before seeing this issue.

Thanks @uasan for this cool idea
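To make the base64 inefficiency concrete: base64 inflates binary payloads by roughly 4/3 before any JSON quoting is added. A quick illustration with a single 1024-dimensional float32 embedding:

```python
import base64

import numpy as np

# One 1024-dimensional float32 embedding: 1024 * 4 bytes.
embedding = np.zeros(1024, dtype=np.float32)
raw = embedding.tobytes()
encoded = base64.b64encode(raw)

print(len(raw))      # 4096 bytes as raw binary
print(len(encoded))  # 5464 bytes once base64-encoded (~33% larger)
```

And unlike raw bytes, the base64 string must still be JSON-parsed and decoded on the client before it can be viewed as a tensor.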

Fix #27063

cc @christian-pinto @maxdebayser @DarkLight1337

Test Plan

tests/utils_/test_serial_utils.py
tests/entrypoints/pooling/openai/test_embedding.py
tests/entrypoints/pooling/openai/test_pooling.py

Test Result

pass


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@noooop noooop changed the title [Frontend][1/N] Improve all pooling task | Support binary Embedding response by response compression [Frontend][3/N] Improve all pooling task | Support binary Embedding response by response compression Oct 17, 2025
@mergify mergify Bot added the frontend label Oct 17, 2025
@noooop noooop changed the title [Frontend][3/N] Improve all pooling task | Support binary Embedding response by response compression [Frontend][3/N] Improve all pooling task | Support binary embedding response Oct 18, 2025
@noooop noooop changed the title [Frontend][3/N] Improve all pooling task | Support binary embedding response [Frontend][4/N] Improve all pooling task | Support binary embedding response Oct 18, 2025
mergify Bot commented Oct 20, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @noooop.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Oct 20, 2025
mergify Bot commented Oct 20, 2025

Documentation preview: https://vllm--27066.org.readthedocs.build/en/27066/

@mergify mergify Bot added the documentation Improvements or additions to documentation label Oct 20, 2025
@mergify mergify Bot removed the needs-rebase label Oct 20, 2025
noooop commented Oct 22, 2025

@DarkLight1337

Are there any more modifications needed for this PR?

@DarkLight1337 DarkLight1337 merged commit 1f633b8 into vllm-project:main Oct 22, 2025
51 checks passed
@noooop noooop deleted the binary_response branch October 22, 2025 10:40
@noooop noooop restored the binary_response branch October 22, 2025 11:04
@noooop noooop deleted the binary_response branch October 22, 2025 11:08
usberkeley pushed a commit to usberkeley/vllm that referenced this pull request Oct 23, 2025
…esponse (vllm-project#27066)

Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
…esponse (vllm-project#27066)

Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
ilmarkov pushed a commit to neuralmagic/vllm that referenced this pull request Nov 7, 2025
…esponse (vllm-project#27066)

Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
…esponse (vllm-project#27066)

Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
…esponse (vllm-project#27066)

Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Labels

documentation (Improvements or additions to documentation), frontend, ready (ONLY add when PR is ready to merge/full CI is needed)

Development

Successfully merging this pull request may close these issues.

[Feature]: Improvements to front-end embedding response

4 participants