perf(vision): lazy-import torchvision to avoid reserving VRAM at startup (#29292) #38986
Open
rodboev wants to merge 1 commit into
Summary
After v0.14.0 added the `vision_analyze` pixel-through feature (#22955), both `hermes gateway` and `hermes agent` reserve ~7 GB of GPU VRAM each at startup, even when idle and not performing any vision tasks. On systems where a single GPU is shared between inference services and Hermes (the reporter's RTX PRO 6000 Blackwell 96 GB running sglang and ComfyUI), the combined ~14 GB reservation significantly reduces the VRAM available for productive workloads. Users who rely on external multimodal LLMs (GPT-4o, Claude) rather than local vision models get no benefit from the local vision code path, so for them the VRAM cost is pure waste.

The VRAM reservation happens because `torchvision` (installed in the hermes venv as a transitive dependency of optional ML skills like `optional-skills/mlops/clip`) initializes a CUDA context at import time. PyTorch's CUDA allocator then pre-allocates a large contiguous memory block (~7 GB on NVIDIA GPUs). While the current core hermes codebase does not import `torch` or `torchvision` at module level (PIL is already lazy-imported inside `_resize_image_for_vision`), the import can still be triggered indirectly: `torchvision` registers itself as a Pillow plugin via entry points, so a PIL import can pull it in; skill discovery can import skill modules that transitively depend on torch; and future code changes could accidentally add a top-level import.

The fix applies three defense layers. First, a documented contract in `tools/vision_tools.py` stating that torch and torchvision must never be imported at module level, with a reference to this issue, so future contributors know the constraint. Second, a `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` startup guard in `cli.py` and `gateway/run.py` that reduces VRAM pre-allocation from ~7 GB to ~500 MB if torch is imported as a side effect. This does not disable CUDA or prevent vision tools from using the GPU when explicitly invoked; it only switches the allocator to expandable segments instead of one large contiguous block. Third, a `"vision.torch"` entry in `tools/lazy_deps.py`, so that torchvision is available as an explicitly lazy-installable dependency rather than an implicit transitive one, together with a regression test asserting that importing `tools.vision_tools` does not pull `torch` or `torchvision` into `sys.modules`.

Fixes #29292
Changes
- `tools/vision_tools.py`: add a module-level comment documenting the no-torch-at-import contract with issue reference (+4 lines)
- `tools/lazy_deps.py`: add `"vision.torch": ("torchvision",)` to `LAZY_DEPS` (+2 lines)
- `cli.py`: set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` if not already set, before tool discovery (+5 lines)
- `gateway/run.py`: same startup guard as `cli.py` (+5 lines)
- `tests/tools/test_vision_tools.py`: `TestNoCudaInitAtImport` with subprocess-based clean-import guard and env var guard (+54 lines)

Validation
- `hermes gateway` startup with torchvision installed
- `hermes agent` startup with torchvision installed
- `nvidia-smi` after idle startup
- `import tools.vision_tools`

Test plan
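The subprocess-based clean-import guard in `TestNoCudaInitAtImport` is not reproduced in this description. As a rough sketch of the technique only, the check can be run in a fresh interpreter so that modules already loaded by the test runner do not mask a regression. The helper name `modules_loaded_by_import` and the `LOADED=` marker are hypothetical and are not taken from the PR.

```python
import subprocess
import sys


def modules_loaded_by_import(module_name, watched=("torch", "torchvision")):
    """Import `module_name` in a fresh interpreter and return the subset of
    `watched` module names that ended up in that interpreter's sys.modules.

    Hypothetical helper for illustration; not the PR's actual test code.
    """
    probe = (
        "import importlib, sys\n"
        f"importlib.import_module({module_name!r})\n"
        # Prefix the result so it can be told apart from any import-time output.
        f"print('LOADED=' + ','.join(m for m in {tuple(watched)!r} if m in sys.modules))\n"
    )
    result = subprocess.run(
        [sys.executable, "-c", probe],
        capture_output=True,
        text=True,
        check=True,  # raise if the import itself fails in the child process
    )
    for line in reversed(result.stdout.splitlines()):
        if line.startswith("LOADED="):
            names = line[len("LOADED="):]
            return set(names.split(",")) if names else set()
    raise RuntimeError("probe produced no LOADED= line")


# Inside the repository, a regression test along these lines might assert:
#     assert not modules_loaded_by_import("tools.vision_tools")
```

Running the import in a child process matters because `sys.modules` in the test process may already contain `torch` from an unrelated test, which would make an in-process check unreliable.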
- `pytest tests/tools/test_vision_tools.py -v --timeout=0`: 69 passed, 6 skipped (Pillow not installed)
- `pytest tests/tools/test_lazy_deps.py -v --timeout=0`: 61 passed

Not in scope
Completely removing torchvision from the venv is not feasible because optional ML skills depend on it. Setting `CUDA_VISIBLE_DEVICES=""` at startup would disable CUDA entirely and break users who run local vision models. The `expandable_segments:True` approach is the right tradeoff: it preserves full CUDA functionality while cutting the large upfront reservation from ~7 GB to ~500 MB.
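For illustration, a startup guard of the kind described for `cli.py` and `gateway/run.py` could look like the sketch below. It assumes the guard only needs to set the variable when the user has not set it, and that it runs before anything imports torch. The function name `ensure_expandable_segments` and the `env` parameter are hypothetical, added here so the behaviour can be exercised without touching the real process environment.

```python
import os


def ensure_expandable_segments(env=None):
    """Set PYTORCH_CUDA_ALLOC_CONF to expandable segments unless already set.

    Hypothetical helper for illustration; the PR's actual guard may differ.
    `env` defaults to os.environ and must be a mutable mapping.
    """
    env = os.environ if env is None else env
    # setdefault leaves any user-provided allocator configuration untouched.
    env.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
    return env["PYTORCH_CUDA_ALLOC_CONF"]


# A fresh environment receives the default; an existing setting is preserved.
fresh_env = {}
ensure_expandable_segments(fresh_env)

custom_env = {"PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:128"}
ensure_expandable_segments(custom_env)
```

Unlike `CUDA_VISIBLE_DEVICES=""`, this leaves the GPU visible to torch; it only changes how the allocator reserves memory once torch is loaded.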