Describe the bug
After upgrading to v0.14.0, the hermes gateway and hermes agent processes each reserve ~7GB of GPU VRAM at startup, even when idle and not performing any vision tasks. This is caused by torchvision being installed in the hermes venv: importing torchvision initializes a CUDA context that pre-allocates memory.
Impact
- On systems with limited GPU memory (e.g., a single GPU shared between inference services and Hermes), the combined ~14GB reservation (~7GB for each of the two processes) significantly reduces the VRAM available to other workloads (sglang, ComfyUI, etc.)
- Users who rely on external multimodal LLMs (GPT-4o, Claude, etc.) do not benefit from the local vision_analyze pixel-through feature, so for them this VRAM cost is pure waste
Root cause
v0.14.0 introduced vision_analyze pixel-through to vision-capable models (#22955), which added torchvision as a dependency. When torchvision is imported, it loads libcudart and initializes a CUDA context, which reserves ~7GB per process on NVIDIA GPUs.
Reproduction
- Upgrade to v0.14.0+
- Start the gateway: hermes gateway run
- Run: nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader
- Observe the gateway process reserving ~7GB VRAM
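To make the check in the last two steps repeatable, the nvidia-smi query can be wrapped in a small script. This is a minimal sketch, not part of Hermes: the helper names parse_compute_apps and query_compute_apps are hypothetical, and it assumes nvidia-smi prints one "pid, N MiB" row per compute process for this query.

```python
import subprocess

# Same query as in the reproduction steps.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,used_memory",
    "--format=csv,noheader",
]


def parse_compute_apps(csv_text: str) -> dict[int, int]:
    """Map PID -> used memory in MiB; rows that do not parse are skipped."""
    usage = {}
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2:
            continue
        pid_text, mem_text = parts
        mem_fields = mem_text.split()
        # Assumed row shape: "<pid>, <number> MiB"
        if (
            not pid_text.isdigit()
            or len(mem_fields) != 2
            or not mem_fields[0].isdigit()
            or mem_fields[1] != "MiB"
        ):
            continue
        usage[int(pid_text)] = int(mem_fields[0])
    return usage


def query_compute_apps() -> dict[int, int]:
    """Run nvidia-smi and return per-process VRAM usage (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)
```

Calling query_compute_apps() before and after starting the gateway, then comparing the entry for the gateway PID, should show the reservation described above.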
Environment
- Hermes Agent: v0.14.0
- GPU: NVIDIA RTX PRO 6000 Blackwell 96GB
- PyTorch: 2.11.0+cu130
- torchvision: 0.26.0
Suggested fix
- Lazy-import torchvision only when vision_analyze is actually called with a local vision model, not at gateway/agent startup
- Or add PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to reduce pre-allocation
- Or make torchvision an optional dependency that is only loaded when the active model supports local vision
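The lazy-import option could look roughly like the sketch below. It is an illustration only and does not reflect how Hermes is actually structured: lazy_module, is_loaded, and analyze_with_local_model are hypothetical names, and the handler body is a placeholder for the real vision_analyze logic.

```python
import importlib
import sys
from functools import lru_cache
from types import ModuleType


@lru_cache(maxsize=None)
def lazy_module(name: str) -> ModuleType:
    """Import `name` on first use and cache the module object."""
    return importlib.import_module(name)


def is_loaded(name: str) -> bool:
    """Report whether `name` has already been imported in this process."""
    return name in sys.modules


def analyze_with_local_model(image_path: str, backend: str = "torchvision") -> str:
    # Hypothetical handler: the heavy import happens here, on the first call,
    # rather than at gateway/agent startup.
    module = lazy_module(backend)
    return f"{module.__name__} ready for {image_path}"
```

With this pattern, a startup check such as is_loaded("torchvision") should stay False until the first local vision call. It only helps if no other module imported at startup pulls in torchvision transitively.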