docs : added a hardware constraint warning for low vram GPUs for preventing memory thrashing #3802

Merged
tomaarsen merged 5 commits into huggingface:main from sreyanshacharya:docs/mobile-gpu-warning
Jun 11, 2026

@sreyanshacharya

Contributor

hey team!

while working with sentence-transformers in a local rag setup alongside quantized 4-bit llms on consumer hardware (specifically an rtx 4050 with 6gb vram), i noticed severe compute thrashing and inter-token latency degradation when the embedding model and llm shared cuda execution.

isolating the embedding model to the cpu completely stabilized generation latency with negligible impact on retrieval speed.

i noticed the efficiency page doesn't explicitly touch on low-vram/consumer gpu orchestration limits when sharing compute with a generator, so i added a small warning block under the pytorch section to save other local developers from hitting this specific kv-cache expansion trap.

let me know if any formatting or wording needs tweaking!

@tomaarsen

Member

Hello!

Thanks for the writeup, and for digging into the latency behavior on your setup.

I'm afraid I see a few issues with the block as written: the .. warning:: content isn't indented under the directive so it won't actually render inside the admonition, and recommending CPU as a general default sits a little awkwardly on this page, which is all about making the embedding model faster (CPU embedding is usually much slower than GPU). The "empirical benchmarks show..." framing is also a bit strong for a single-setup observation.

That said, the core takeaway is useful and I'm happy to include it. Could we trim it down to a brief 1-2 line note that just recommends keeping an eye on VRAM and latency when running a Sentence Transformers model and a generative model on the same hardware? Something like:

.. note::

   When running a Sentence Transformers model alongside a generative LLM on the same GPU, keep an eye on VRAM usage and generation latency, as the two can contend for memory and compute. For latency-sensitive local setups, moving small embedding models to the CPU (``device="cpu"``) can help.

And also probably move it to below the PyTorch GPU tabs (i.e. just above the ONNX header). If you're on board, I can make that change in this PR and merge it, or if you'd like to make the change yourself, feel free to push an update to this branch and I'll review it.
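
For reference, a minimal sketch of what the suggested note describes: check free VRAM once the generative model is loaded, and fall back to ``device="cpu"`` for the embedding model when memory is tight. The helper names, the 25% threshold, and the model name are illustrative assumptions, not part of this PR or of the Sentence Transformers documentation.

```python
def choose_embedding_device(free_bytes, total_bytes, min_free_fraction=0.25):
    """Return "cuda" if enough VRAM is free, otherwise "cpu".

    min_free_fraction is an illustrative threshold (assumption), not a
    recommended value; tune it for your own hardware and models.
    """
    if total_bytes <= 0:
        return "cpu"
    return "cuda" if free_bytes / total_bytes >= min_free_fraction else "cpu"


def load_embedder(model_name="sentence-transformers/all-MiniLM-L6-v2"):
    """Load an embedding model on the GPU only if VRAM headroom allows it.

    Hypothetical helper; call it after the generative LLM has been loaded
    so the free-memory reading reflects the LLM's footprint.
    """
    # Local imports keep choose_embedding_device usable without a GPU stack.
    import torch
    from sentence_transformers import SentenceTransformer

    if torch.cuda.is_available():
        # mem_get_info() returns (free, total) in bytes for the current device.
        free_bytes, total_bytes = torch.cuda.mem_get_info()
        device = choose_embedding_device(free_bytes, total_bytes)
    else:
        device = "cpu"
    return SentenceTransformer(model_name, device=device)
```

A one-time check like this does not account for the LLM's KV cache growing during generation, so it is worth watching generation latency as well, as the note says.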

  • Tom Aarsen

@sreyanshacharya

Contributor Author

hey Tom,

thanks for the great feedback! i completely agree with the restructuring. i've gone ahead and removed the warning block, converted it to the note format you suggested, and relocated it right above the ONNX header.

just pushed the updates to the branch—let me know if it looks good to go!

@tomaarsen tomaarsen left a comment

Member

Some minor comments, but it looks good!

Comment thread docs/sentence_transformer/usage/efficiency.rst
Comment thread docs/sentence_transformer/usage/efficiency.rst Outdated
Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>

Copilot AI left a comment

Contributor

Pull request overview

Adds a short documentation note to the inference efficiency guide warning that, on low-VRAM/consumer GPUs, running an embedding model and a generative LLM on the same GPU can contend for VRAM/compute and degrade generation latency—suggesting CPU placement for embeddings in latency-sensitive local setups.

Changes:

  • Added a PyTorch-section admonition about VRAM contention and latency when co-locating embedding + generative models on one GPU.
  • Suggested mitigating latency thrashing by moving the embedding model to CPU.

Comment thread docs/sentence_transformer/usage/efficiency.rst Outdated
tomaarsen and others added 2 commits June 10, 2026 13:52
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@tomaarsen tomaarsen enabled auto-merge (squash) June 11, 2026 06:54
@tomaarsen tomaarsen disabled auto-merge June 11, 2026 08:17
@tomaarsen tomaarsen merged commit 23cf15a into huggingface:main Jun 11, 2026
15 of 17 checks passed
@sreyanshacharya sreyanshacharya deleted the docs/mobile-gpu-warning branch June 11, 2026 17:21