perf(vision): lazy-import torchvision to avoid reserving VRAM at startup (#29292) by rodboev · Pull Request #38986 · NousResearch/hermes-agent

rodboev · 2026-06-04T12:21:19Z

Summary

After v0.14.0 added the vision_analyze pixel-through feature (#22955), both hermes gateway and hermes agent reserve ~7 GB of GPU VRAM each at startup, even when idle and not performing any vision tasks. On systems with a single GPU shared between inference services and Hermes (the reporter's RTX PRO 6000 Blackwell 96 GB running sglang and ComfyUI), the combined ~14 GB reservation significantly reduces available VRAM for productive workloads. Users who rely on external multimodal LLMs (GPT-4o, Claude) rather than local vision models get zero benefit from the local vision code path, making the VRAM cost purely wasteful.

The VRAM reservation happens because torchvision (installed in the hermes venv as a transitive dependency of optional ML skills such as optional-skills/mlops/clip) initializes a CUDA context at import time. PyTorch's CUDA caching allocator then pre-allocates a large contiguous memory block (~7 GB on NVIDIA GPUs). The core hermes codebase does not currently import torch or torchvision at module level (PIL is already lazy-imported inside _resize_image_for_vision), but the import can still be triggered indirectly in three ways: torchvision registers itself as a Pillow plugin via entry points, so a PIL import can pull it in; skill discovery can import skill modules that transitively depend on torch; and future code changes could accidentally add a top-level import.
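The function-level import pattern mentioned above can be sketched as follows. This is a minimal illustration, not the code in tools/vision_tools.py: the names `HEAVY_MODULE`, `is_loaded`, and `convert_pixel` are hypothetical, and the stdlib module `colorsys` stands in for a heavy dependency such as torchvision so the sketch runs anywhere.

```python
import importlib
import sys

# Stand-in for a heavy dependency (e.g. torchvision). A small stdlib module
# is used here purely so the sketch is runnable without extra packages.
HEAVY_MODULE = "colorsys"


def is_loaded(name: str) -> bool:
    """Report whether a module has already been imported in this process."""
    return name in sys.modules


def convert_pixel(r: float, g: float, b: float):
    """Import the 'heavy' module on first call, not when this file is imported."""
    heavy = importlib.import_module(HEAVY_MODULE)  # deferred import
    return heavy.rgb_to_hsv(r, g, b)
```

Importing a file written this way costs nothing until `convert_pixel` is actually called, which is the property the PR wants for torch and torchvision.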

The fix applies three defense layers. First, a documented contract in tools/vision_tools.py stating that torch and torchvision must never be imported at module level, with a reference to this issue, so future contributors know the constraint. Second, a PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True startup guard in cli.py and gateway/run.py that reduces VRAM pre-allocation from ~7 GB to ~500 MB if torch is imported as a side effect. This does not disable CUDA or prevent vision tools from using GPU when explicitly invoked; it only changes the allocator to use expandable segments instead of one large contiguous block. Third, a "vision.torch" entry in tools/lazy_deps.py so torchvision is available as an explicitly lazy-installable dependency rather than an implicit transitive, and a regression test that asserts importing tools.vision_tools does not pull torch or torchvision into sys.modules.
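The second layer, the startup guard, might look roughly like the sketch below. The function name `apply_allocator_guard` and its `environ` parameter are assumptions made for illustration; the PR only states that the variable is set in cli.py and gateway/run.py when it is not already present. The guard has to run before torch is imported for the setting to take effect.

```python
import os

ALLOC_CONF_KEY = "PYTORCH_CUDA_ALLOC_CONF"
ALLOC_CONF_DEFAULT = "expandable_segments:True"


def apply_allocator_guard(environ=os.environ) -> bool:
    """Set the allocator config only if the user has not chosen one.

    Returns True when the default was applied, False when an existing
    value was left untouched.
    """
    if ALLOC_CONF_KEY in environ:
        return False  # respect an explicit user setting
    environ[ALLOC_CONF_KEY] = ALLOC_CONF_DEFAULT
    return True
```

Leaving an existing value alone means users who already tune the allocator (for example with max_split_size_mb) keep their configuration.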

Fixes #29292

Changes

tools/vision_tools.py: add a module-level comment documenting the no-torch-at-import contract with issue reference (+4 lines)
tools/lazy_deps.py: add "vision.torch": ("torchvision",) to LAZY_DEPS (+2 lines)
cli.py: set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True if not already set, before tool discovery (+5 lines)
gateway/run.py: same startup guard as cli.py (+5 lines)
tests/tools/test_vision_tools.py: TestNoCudaInitAtImport with subprocess-based clean-import guard and env var guard (+54 lines)
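As a rough illustration of the tools/lazy_deps.py change, the sketch below shows how a registry entry of this shape could be checked without importing the package. Only the `"vision.torch": ("torchvision",)` entry comes from this PR; the helper `missing_packages` is hypothetical and is not claimed to match the real module's API.

```python
import importlib.util

# The entry added by this PR; the rest of the real registry is omitted.
LAZY_DEPS = {
    "vision.torch": ("torchvision",),
}


def missing_packages(feature: str, lazy_deps=LAZY_DEPS) -> list[str]:
    """Return the packages for a feature that cannot be found.

    importlib.util.find_spec locates a top-level module without
    executing it, so this check does not itself trigger CUDA init.
    """
    return [
        pkg for pkg in lazy_deps[feature]
        if importlib.util.find_spec(pkg) is None
    ]
```

A caller could use the returned list to decide whether to prompt for installation before the vision path is first used.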

Validation

| Scenario | Before | After |
| --- | --- | --- |
| `hermes gateway` startup with torchvision installed | ~7 GB VRAM reserved per process at import time | PYTORCH_CUDA_ALLOC_CONF limits pre-allocation to ~500 MB; torch not imported at startup |
| `hermes agent` startup with torchvision installed | ~7 GB VRAM reserved | Same reduction as above |
| `nvidia-smi` after idle startup | gateway + agent = ~14 GB | ~1 GB total (expandable segments mode) |
| vision_analyze tool invocation with local model | Works, uses CUDA | Works, uses CUDA (PYTORCH_CUDA_ALLOC_CONF only changes allocator strategy, not availability) |
| vision_analyze with external LLM (GPT-4o, Claude) | Works, never touches torch | Works, unchanged |
| System without GPU / without torch installed | No VRAM impact | No VRAM impact (env var is harmless when torch is absent) |
| `import tools.vision_tools` | torch/torchvision not in sys.modules (already the case) | torch/torchvision not in sys.modules (guarded by regression test) |
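The last row relies on a subprocess-based clean-import check. A minimal sketch of that idea follows; the helper name `modules_loaded_by_import` is an assumption and the actual TestNoCudaInitAtImport test may be structured differently. A fresh interpreter is used because the test process itself may already have torch loaded by other tests.

```python
import subprocess
import sys


def modules_loaded_by_import(module_name: str, watched: tuple[str, ...]) -> list[str]:
    """Import module_name in a fresh interpreter and return which of the
    watched modules ended up in that interpreter's sys.modules."""
    code = (
        "import importlib, sys\n"
        f"importlib.import_module({module_name!r})\n"
        f"print(','.join(m for m in {watched!r} if m in sys.modules))\n"
    )
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,  # fail loudly if the import itself errors
    )
    out = result.stdout.strip()
    return out.split(",") if out else []
```

In the project's test suite, the equivalent assertion would be that importing tools.vision_tools yields an empty list for ("torch", "torchvision").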

Test plan

pytest tests/tools/test_vision_tools.py -v --timeout=0 — 69 passed, 6 skipped (Pillow not installed)
pytest tests/tools/test_lazy_deps.py -v --timeout=0 — 61 passed

Not in scope

Completely removing torchvision from the venv is not feasible because optional ML skills depend on it. Adding CUDA_VISIBLE_DEVICES="" at startup would disable CUDA entirely and break users who run local vision models. The expandable_segments:True approach is the right tradeoff: it preserves full CUDA functionality while eliminating the large upfront reservation.

Commit f6e72ad: perf(vision): lazy-import torchvision to avoid reserving VRAM at startup (NousResearch#29292)

rodboev wants to merge 1 commit into NousResearch:main from rodboev:pr/vision-lazy-import-torch
