Environment
- Hermes-Agent version: 0.9.0 (commit 722331a)
- Backend:
llama-server (llama.cpp) with mmproj — native multimodal capable
- Model: google/gemma-4-31B-it Q8_0 via custom provider
- Hardware: NVIDIA H100 80GB
- Python 3.11, Ubuntu
Summary
gateway/run.py::_enrich_message_with_vision() pre-processes every inbound image by asking the vision model for a full description, then injects that description as text into the main prompt. The default analysis prompt asks for thorough detail, which generates descriptions of ~2000 characters. On local models this causes two problems:
- Latency: generating a 2000-char description takes 35+ seconds on Gemma 4 31B @ Q8_0 (most of the 43.9s total response time for image requests).
- Prompt bloat: the long description gets injected before the user's original message, inflating context for the second (main) inference pass, which further slows things down.
Reproduction
- Set up Hermes-Agent with a local multimodal backend (e.g., llama-server with mmproj).
- Send an image (any real photo) to the agent via WeChat or any platform.
- Observe
vision_analyze_tool output length in logs/agent.log.
- Measure end-to-end response time.
Observed on a real photograph:
- Vision description output: 2072 characters
- Total response latency: 43.9 seconds (of which ~35s was vision pre-processing)
Root cause
Two defaults in the code:
1. gateway/run.py (around line 7315) — prompt is unbounded:
analysis_prompt = (
"Describe everything visible in this image in thorough detail. "
"Include any text, code, data, objects, people, layout, colors, "
"and any other notable visual information."
)
2. tools/vision_tools.py (around line 568) — max_tokens: 2000, which lets the model write very long descriptions.
Together these produce ~2000-char descriptions by default.
Suggested fix
Option A (minimal, recommended): Shorten the default prompt and cap tokens:
# gateway/run.py
analysis_prompt = (
"Concisely describe this image in 2-4 sentences "
"(~200 Chinese characters or ~150 English words). "
"Cover: main subject(s), key visible text/data/code, overall context. "
"If it's a chart or scientific figure, include axis labels, legend, "
"and key values. Skip decorative details."
)
# tools/vision_tools.py
max_tokens = 500 # was 2000
Results after fix (verified locally):
- Description length: ~500-800 chars (down from ~2072)
- Latency: ~15-20s (down from 43.9s)
- Main-prompt inference quality: unchanged
Option B (architectural, larger change): Add an auxiliary.vision.mode: native | preprocess config option. When the active provider supports multimodal natively (like llama-server with mmproj, Gemini, Claude), skip the pre-processing entirely and pass image content parts directly to the main inference call. This would eliminate one full inference pass and give an additional ~30-40% speedup on top of Option A.
Happy to open a PR for Option A. Option B is a larger refactor that probably deserves its own RFC.
Additional context
Related but separate issue: the default auxiliary.vision.timeout: 30s is also too short for local models on non-trivial images — anything above ~300 tokens in the description easily exceeds 30s on a 31B-class local model. A reasonable default for local multimodal backends would be 120-300s. Happy to file this as a second issue or roll it into the same PR.
Environment
llama-server(llama.cpp) with mmproj — native multimodal capableSummary
gateway/run.py::_enrich_message_with_vision()pre-processes every inbound image by asking the vision model for a full description, then injects that description as text into the main prompt. The default analysis prompt asks for thorough detail, which generates descriptions of ~2000 characters. On local models this causes two problems:Reproduction
vision_analyze_tooloutput length inlogs/agent.log.Observed on a real photograph:
Root cause
Two defaults in the code:
1.
gateway/run.py(around line 7315) — prompt is unbounded:2.
tools/vision_tools.py(around line 568) —max_tokens: 2000, which lets the model write very long descriptions.Together these produce ~2000-char descriptions by default.
Suggested fix
Option A (minimal, recommended): Shorten the default prompt and cap tokens:
Results after fix (verified locally):
Option B (architectural, larger change): Add an
auxiliary.vision.mode: native | preprocessconfig option. When the active provider supports multimodal natively (like llama-server with mmproj, Gemini, Claude), skip the pre-processing entirely and pass image content parts directly to the main inference call. This would eliminate one full inference pass and give an additional ~30-40% speedup on top of Option A.Happy to open a PR for Option A. Option B is a larger refactor that probably deserves its own RFC.
Additional context
Related but separate issue: the default
auxiliary.vision.timeout: 30sis also too short for local models on non-trivial images — anything above ~300 tokens in the description easily exceeds 30s on a 31B-class local model. A reasonable default for local multimodal backends would be 120-300s. Happy to file this as a second issue or roll it into the same PR.