docs : added a hardware constraint warning for low vram GPUs for preventing memory thrashing #3802
Hello! Thanks for the writeup, and for digging into the latency behavior on your setup. I'm afraid I see a few issues with the block as written: the

That said, the core takeaway is useful and I'm happy to include it. Could we trim it down to a brief 1-2 line note that just recommends keeping an eye on VRAM and latency when running a Sentence Transformers model and a generative model on the same hardware? Something like:

.. note::

   When running a Sentence Transformers model alongside a generative LLM on the same GPU, keep an eye on VRAM usage and generation latency, as the two can contend for memory and compute. For latency-sensitive local setups, moving small embedding models to the CPU (``device="cpu"``) can help.

And also probably move it to below the PyTorch GPU tabs (i.e. just above the ONNX header). If you're on board, I can make that change in this PR and merge it, or if you'd like to make the change yourself, feel free to push an update to this branch and I'll review it.
hey Tom, thanks for the great feedback! i completely agree with the restructuring. i've gone ahead and removed the warning block, converted it to the note format you suggested, and relocated it right above the ONNX header. just pushed the updates to the branch, let me know if it looks good to go!
tomaarsen
left a comment
Some minor comments, but it looks good!
Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
Pull request overview
Adds a short documentation note to the inference efficiency guide warning that, on low-VRAM/consumer GPUs, running an embedding model and a generative LLM on the same GPU can contend for VRAM/compute and degrade generation latency—suggesting CPU placement for embeddings in latency-sensitive local setups.
Changes:
- Added a PyTorch-section admonition about VRAM contention and latency when co-locating embedding + generative models on one GPU.
- Suggested mitigating latency thrashing by moving the embedding model to CPU.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
hey team!
while working with sentence-transformers in a local rag setup alongside quantized 4-bit llms on consumer hardware (specifically an rtx 4050 with 6gb vram), i noticed severe compute thrashing and inter-token latency degradation when the embedding model and llm shared cuda execution.
isolating the embedding model to the cpu completely stabilized generation latency with negligible impact on retrieval speed.
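for anyone who wants to try the same isolation, here is a minimal sketch of how the device choice could be made. the helper names (`pick_embedding_device`, `query_free_vram_bytes`, `load_embedding_model`), the 2 GiB threshold, and the `all-MiniLM-L6-v2` model name are illustrative assumptions, not something from this PR or the library docs. only `SentenceTransformer(..., device=...)` and `torch.cuda.mem_get_info()` are existing APIs. free VRAM is just a rough proxy here: the embedding model and the LLM can still contend for compute even when memory is available, so hard-coding `device="cpu"` is the simpler option.

```python
def pick_embedding_device(free_vram_bytes, min_free_gib=2.0):
    """Return "cuda" only when enough VRAM is free, otherwise "cpu".

    `min_free_gib` is an arbitrary illustrative threshold (an assumption),
    not a measured or recommended value.
    """
    if free_vram_bytes is None:
        # No usable GPU information: stay on the CPU.
        return "cpu"
    threshold_bytes = min_free_gib * 1024**3
    return "cuda" if free_vram_bytes >= threshold_bytes else "cpu"


def query_free_vram_bytes():
    """Return free bytes on the current CUDA device, or None if unavailable."""
    try:
        import torch  # imported lazily so the helper above works without torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    return free_bytes


def load_embedding_model(model_name="all-MiniLM-L6-v2"):
    """Load a Sentence Transformers model on the device chosen above.

    The default model name is only an example.
    """
    from sentence_transformers import SentenceTransformer

    device = pick_embedding_device(query_free_vram_bytes())
    return SentenceTransformer(model_name, device=device)
```

calling `load_embedding_model()` before starting the LLM would place the embedder on the GPU only if at least the threshold amount of VRAM is free at that moment. it does not react to memory the generator claims later, which is another reason pinning small embedding models to the CPU may be the safer default on low-VRAM cards.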
i noticed the efficiency page doesn't explicitly touch on low-vram/consumer gpu orchestration limits when sharing compute with a generator, so i added a small warning block under the pytorch section to save other local developers from hitting this specific kv-cache expansion trap.
let me know if any formatting or wording needs tweaking!