perf(vision): lazy-import torchvision to avoid reserving VRAM at startup (#29292) #38986

Open
rodboev wants to merge 1 commit into NousResearch:main from rodboev:pr/vision-lazy-import-torch

@rodboev rodboev commented Jun 4, 2026

Contributor

Summary

After v0.14.0 added the vision_analyze pixel-through feature (#22955), both hermes gateway and hermes agent reserve ~7 GB of GPU VRAM each at startup, even when idle and not performing any vision tasks. On systems with a single GPU shared between inference services and Hermes (the reporter's RTX PRO 6000 Blackwell 96 GB running sglang and ComfyUI), the combined ~14 GB reservation significantly reduces available VRAM for productive workloads. Users who rely on external multimodal LLMs (GPT-4o, Claude) rather than local vision models get zero benefit from the local vision code path, making the VRAM cost purely wasteful.

The VRAM reservation happens because torchvision (installed in the hermes venv as a transitive dependency of optional ML skills like optional-skills/mlops/clip) initializes a CUDA context at import time. PyTorch's CUDA caching allocator then pre-allocates a large contiguous memory block (~7 GB on NVIDIA GPUs). The current core hermes codebase does not import torch or torchvision at module level (PIL is already lazy-imported inside _resize_image_for_vision), but the import can still be triggered indirectly in three ways: torchvision registers itself as a Pillow plugin via entry points, so a PIL import can pull it in; skill discovery can import skill modules that transitively depend on torch; and future code changes could accidentally add a top-level import.
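The function-level lazy import mentioned above can be sketched as follows. This is a minimal illustration, not the code in tools/vision_tools.py: `lazy_module` and `get_torchvision` are hypothetical names, and the only point is that the heavy import runs on first use rather than at module load.

```python
import importlib


def lazy_module(name):
    """Return a zero-argument loader that imports `name` on first call only.

    Hypothetical helper for illustration; the PR itself does not add it.
    """
    cache = {}

    def load():
        # The import (and any side effects such as CUDA context creation)
        # happens here, not when lazy_module() is called.
        if "module" not in cache:
            cache["module"] = importlib.import_module(name)
        return cache["module"]

    return load


# Creating the loader imports nothing; torchvision would only be imported
# when get_torchvision() is actually called by a vision code path.
get_torchvision = lazy_module("torchvision")
```

A process that never calls `get_torchvision()` never pays the import cost, which is the property the module-level contract is meant to protect.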

The fix applies three defense layers. First, a documented contract in tools/vision_tools.py states that torch and torchvision must never be imported at module level, with a reference to this issue, so future contributors know the constraint. Second, a PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True startup guard in cli.py and gateway/run.py reduces VRAM pre-allocation from ~7 GB to ~500 MB if torch is imported as a side effect. This does not disable CUDA or prevent vision tools from using the GPU when explicitly invoked; it only changes the allocator to use expandable segments instead of one large contiguous block. Third, a "vision.torch" entry in tools/lazy_deps.py makes torchvision available as an explicitly lazy-installable dependency rather than an implicit transitive one, and a regression test asserts that importing tools.vision_tools does not pull torch or torchvision into sys.modules.
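The second layer, the startup guard, can be sketched roughly like this. The function name `apply_cuda_alloc_guard` is hypothetical and the actual lines in cli.py and gateway/run.py may differ; the sketch assumes the guard runs before anything imports torch, since the allocator setting only helps if it is in the environment by then.

```python
import os


def apply_cuda_alloc_guard(env=os.environ):
    """Set PYTORCH_CUDA_ALLOC_CONF unless the user already chose a value.

    Hypothetical helper; `env` is a parameter only so the behaviour can be
    checked against a plain dict instead of the real process environment.
    """
    # setdefault leaves an existing user-provided value untouched.
    env.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
    return env["PYTORCH_CUDA_ALLOC_CONF"]
```

Using `setdefault` rather than plain assignment matches the "if not already set" behaviour described in the change list, so a user who has tuned the allocator themselves keeps their configuration.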

Fixes #29292

Changes

  • tools/vision_tools.py: add a module-level comment documenting the no-torch-at-import contract with issue reference (+4 lines)
  • tools/lazy_deps.py: add "vision.torch": ("torchvision",) to LAZY_DEPS (+2 lines)
  • cli.py: set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True if not already set, before tool discovery (+5 lines)
  • gateway/run.py: same startup guard as cli.py (+5 lines)
  • tests/tools/test_vision_tools.py: TestNoCudaInitAtImport with subprocess-based clean-import guard and env var guard (+54 lines)
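A subprocess-based clean-import guard of the kind listed for tests/tools/test_vision_tools.py might look like the sketch below. It is not the PR's TestNoCudaInitAtImport code: `modules_loaded_by_import` is a hypothetical helper that imports a target module in a fresh interpreter and reports which watched modules ended up in sys.modules.

```python
import json
import subprocess
import sys


def modules_loaded_by_import(target, watched):
    """Import `target` in a fresh interpreter; return the watched modules it loaded.

    Hypothetical helper for illustration. A subprocess is used so that modules
    already imported by the test runner cannot mask a regression.
    """
    code = (
        "import importlib, json, sys\n"
        f"importlib.import_module({target!r})\n"
        f"print(json.dumps([m for m in {list(watched)!r} if m in sys.modules]))\n"
    )
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    # Use the last stdout line in case the imported module prints something itself.
    return set(json.loads(result.stdout.strip().splitlines()[-1]))
```

In a test, the assertion would then be along the lines of `modules_loaded_by_import("tools.vision_tools", ("torch", "torchvision")) == set()`, assuming the test runs from the repository root so that `tools.vision_tools` is importable.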

Validation

| Scenario | Before | After |
| --- | --- | --- |
| hermes gateway startup with torchvision installed | ~7 GB VRAM reserved per process at import time | PYTORCH_CUDA_ALLOC_CONF limits pre-allocation to ~500 MB; torch not imported at startup |
| hermes agent startup with torchvision installed | ~7 GB VRAM reserved | Same reduction as above |
| nvidia-smi after idle startup | gateway + agent = ~14 GB | ~1 GB total (expandable segments mode) |
| vision_analyze tool invocation with local model | Works, uses CUDA | Works, uses CUDA (PYTORCH_CUDA_ALLOC_CONF only changes allocator strategy, not availability) |
| vision_analyze with external LLM (GPT-4o, Claude) | Works, never touches torch | Works, unchanged |
| System without GPU / without torch installed | No VRAM impact | No VRAM impact (env var is harmless when torch is absent) |
| import tools.vision_tools | torch/torchvision not in sys.modules (already the case) | torch/torchvision not in sys.modules (guarded by regression test) |

Test plan

  • pytest tests/tools/test_vision_tools.py -v --timeout=0 — 69 passed, 6 skipped (Pillow not installed)
  • pytest tests/tools/test_lazy_deps.py -v --timeout=0 — 61 passed

Not in scope

Completely removing torchvision from the venv is not feasible because optional ML skills depend on it. Adding CUDA_VISIBLE_DEVICES="" at startup would disable CUDA entirely and break users who run local vision models. The expandable_segments:True approach is the right tradeoff: it preserves full CUDA functionality while eliminating the large upfront reservation.

@alt-glitch added labels on Jun 4, 2026: type/perf (Performance improvement or optimization); P2 (Medium: degraded but workaround exists); comp/agent (Core agent loop, run_agent.py, prompt builder); comp/cli (CLI entry point, hermes_cli/, setup wizard); comp/gateway (Gateway runner, session dispatch, delivery); tool/vision (Vision analysis and image generation)

Development

Successfully merging this pull request may close these issues.

[vision] torchvision import initializes CUDA context, causing gateway/agent to reserve ~7GB VRAM each
