docs : added a hardware constraint warning for low vram GPUs for preventing memory thrashing by sreyanshacharya · Pull Request #3802 · huggingface/sentence-transformers

sreyanshacharya · 2026-06-08T10:36:48Z

hey team!

while working with sentence-transformers in a local rag setup alongside quantized 4-bit llms on consumer hardware (specifically an rtx 4050 with 6gb vram), i noticed severe compute thrashing and inter-token latency degradation when the embedding model and llm shared cuda execution.

isolating the embedding model to the cpu completely stabilized generation latency with negligible impact on retrieval speed.

i noticed the efficiency page doesn't explicitly touch on low-vram/consumer gpu orchestration limits when sharing compute with a generator, so i added a small warning block under the pytorch section to save other local developers from hitting this specific kv-cache expansion trap.

let me know if any formatting or wording needs tweaking!

tomaarsen · 2026-06-09T12:03:21Z

Hello!

Thanks for the writeup, and for digging into the latency behavior on your setup.

I'm afraid I see a few issues with the block as written: the ``.. warning::`` content isn't indented under the directive, so it won't actually render inside the admonition, and recommending CPU as a general default sits a little awkwardly on this page, which is all about making the embedding model faster (CPU embedding is usually much slower than GPU). The "empirical benchmarks show..." framing is also a bit strong for a single-setup observation.

That said, the core takeaway is useful and I'm happy to include it. Could we trim it down to a brief 1-2 line note that just recommends keeping an eye on VRAM and latency when running a Sentence Transformers model and a generative model on the same hardware? Something like:

.. note::

   When running a Sentence Transformers model alongside a generative LLM on the same GPU, keep an eye on VRAM usage and generation latency, as the two can contend for memory and compute. For latency-sensitive local setups, moving small embedding models to the CPU (``device="cpu"``) can help.

And also probably move it to below the PyTorch GPU tabs (i.e. just above the ONNX header). If you're on board, I can make that change in this PR and merge it, or if you'd like to make the change yourself, feel free to push an update to this branch and I'll review it.

Tom Aarsen
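The suggested note can be illustrated with a minimal sketch of how a local setup might decide where to place a small embedding model. This is not part of the PR or of the Sentence Transformers API: the helper names (`choose_embedding_device`, `query_free_vram_bytes`, `load_embedder`) and the 0.5 GiB / 1 GiB thresholds are hypothetical assumptions chosen for illustration. Only `torch.cuda.is_available()`, `torch.cuda.mem_get_info()`, and the `device` argument of `SentenceTransformer` are existing calls.

```python
"""Sketch: pick a device for a small embedding model that shares a GPU with an LLM."""

GIB = 1024 ** 3


def choose_embedding_device(free_vram_bytes, embedder_bytes, headroom_bytes=GIB):
    """Return "cuda" only if the embedder fits and leaves headroom for the LLM.

    The size estimate and the headroom are illustrative assumptions, not measurements.
    """
    if free_vram_bytes is None:  # no CUDA device visible
        return "cpu"
    if free_vram_bytes - embedder_bytes >= headroom_bytes:
        return "cuda"
    return "cpu"


def query_free_vram_bytes():
    """Return free VRAM in bytes on the current CUDA device, or None without CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    return free_bytes


def load_embedder(model_name, device):
    """Load a Sentence Transformers model on the chosen device."""
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_name, device=device)


if __name__ == "__main__":
    # Assume the embedder needs roughly 0.5 GiB; adjust for the actual model.
    device = choose_embedding_device(query_free_vram_bytes(), embedder_bytes=GIB // 2)
    print(f"embedding device: {device}")
    # model = load_embedder("all-MiniLM-L6-v2", device)
```

Free VRAM is only a snapshot, and a generator's KV cache grows during decoding, so a check like this is a rough starting point. Measuring generation latency with the embedder on each device remains the more reliable guide.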

sreyanshacharya · 2026-06-10T11:28:57Z

hey Tom,

thanks for the great feedback! i completely agree with the restructuring. i've gone ahead and removed the warning block, converted it to the note format you suggested, and relocated it right above the ONNX header.

just pushed the updates to the branch—let me know if it looks good to go!

tomaarsen

Some minor comments, but it looks good!

Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>

Copilot

Pull request overview

Adds a short documentation note to the inference efficiency guide warning that, on low-VRAM/consumer GPUs, running an embedding model and a generative LLM on the same GPU can contend for VRAM/compute and degrade generation latency—suggesting CPU placement for embeddings in latency-sensitive local setups.

Changes:

- Added a PyTorch-section admonition about VRAM contention and latency when co-locating embedding + generative models on one GPU.
- Suggested mitigating latency thrashing by moving the embedding model to CPU.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

docs : added a hardware constraint warning for low vram GPUs for preventing memory thrashing

64e163e

docs: convert warning to note and relocate below pytorch tabs per review

e14f83a

tomaarsen approved these changes Jun 10, 2026

Comment thread docs/sentence_transformer/usage/efficiency.rst

Comment thread docs/sentence_transformer/usage/efficiency.rst Outdated

Apply suggestions from code review

419eabe

Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>

tomaarsen requested a review from Copilot June 10, 2026 11:43

Copilot started reviewing on behalf of tomaarsen June 10, 2026 11:43

Copilot AI reviewed Jun 10, 2026

Comment thread docs/sentence_transformer/usage/efficiency.rst Outdated

tomaarsen and others added 2 commits June 10, 2026 13:52

Specify device="cpu" location

4a44709

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Merge branch 'main' into docs/mobile-gpu-warning

686ddeb

tomaarsen enabled auto-merge (squash) June 11, 2026 06:54

tomaarsen disabled auto-merge June 11, 2026 08:17

tomaarsen merged commit 23cf15a into huggingface:main Jun 11, 2026
15 of 17 checks passed

sreyanshacharya deleted the docs/mobile-gpu-warning branch June 11, 2026 17:21

tomaarsen merged 5 commits into huggingface:main from sreyanshacharya:docs/mobile-gpu-warning

3 participants
