Default vision pre-process prompt generates overly long descriptions (~2000 chars), significantly slowing down image-bearing requests on local models #10809

@funnybomb

Environment

  • Hermes-Agent version: 0.9.0 (commit 722331a)
  • Backend: llama-server (llama.cpp) with mmproj — native multimodal capable
  • Model: google/gemma-4-31B-it Q8_0 via custom provider
  • Hardware: NVIDIA H100 80GB
  • Python 3.11, Ubuntu

Summary

gateway/run.py::_enrich_message_with_vision() pre-processes every inbound image by asking the vision model for a full description, then injects that description as text into the main prompt. The default analysis prompt asks for thorough detail, which generates descriptions of ~2000 characters. On local models this causes two problems:

  1. Latency: generating a 2000-char description takes 35+ seconds on Gemma 4 31B @ Q8_0 (most of the 43.9s total response time for image requests).
  2. Prompt bloat: the long description gets injected before the user's original message, inflating context for the second (main) inference pass, which further slows things down.

Reproduction

  1. Set up Hermes-Agent with a local multimodal backend (e.g., llama-server with mmproj).
  2. Send an image (any real photo) to the agent via WeChat or any platform.
  3. Observe vision_analyze_tool output length in logs/agent.log.
  4. Measure end-to-end response time.

Observed on a real photograph:

  • Vision description output: 2072 characters
  • Total response latency: 43.9 seconds (of which ~35s was vision pre-processing)
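For anyone reproducing steps 3 and 4, a minimal timing sketch follows. It assumes you can wrap the vision call in a zero-argument callable; `time_description` and `summarize` are hypothetical helpers for illustration, not part of Hermes-Agent.

```python
import time
from typing import Callable, Tuple


def time_description(describe: Callable[[], str]) -> Tuple[str, float]:
    """Run a description callable and return (text, elapsed_seconds).

    `describe` is a stand-in for whatever triggers the vision pre-processing
    call in your setup (an assumption, not a Hermes-Agent API).
    """
    start = time.perf_counter()
    text = describe()
    elapsed = time.perf_counter() - start
    return text, elapsed


def summarize(text: str, elapsed: float) -> str:
    """Format the two numbers this issue reports: length and latency."""
    return f"{len(text)} chars in {elapsed:.1f}s"
```

Comparing the summary line before and after changing the prompt gives the same two metrics quoted above.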

Root cause

Two defaults in the code:

1. gateway/run.py (around line 7315), where the prompt places no limit on description length:

analysis_prompt = (
    "Describe everything visible in this image in thorough detail. "
    "Include any text, code, data, objects, people, layout, colors, "
    "and any other notable visual information."
)

2. tools/vision_tools.py (around line 568), where max_tokens: 2000 lets the model write very long descriptions.

Together these produce ~2000-char descriptions by default.

Suggested fix

Option A (minimal, recommended): Shorten the default prompt and cap tokens:

# gateway/run.py
analysis_prompt = (
    "Concisely describe this image in 2-4 sentences "
    "(~200 Chinese characters or ~150 English words). "
    "Cover: main subject(s), key visible text/data/code, overall context. "
    "If it's a chart or scientific figure, include axis labels, legend, "
    "and key values. Skip decorative details."
)

# tools/vision_tools.py
max_tokens = 500  # was 2000

Results after fix (verified locally):

  • Description length: ~500-800 chars (down from 2072)
  • Latency: ~15-20s (down from 43.9s)
  • Main-prompt inference quality: unchanged
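As a rough sanity check on these numbers, here is a back-of-envelope sketch. It assumes vision latency scales linearly with description length and that the non-vision remainder of the request stays constant; both are simplifications (image encoding and prompt processing are ignored), and the function name is hypothetical.

```python
# Figures taken from the "Observed" measurements in this issue.
OBSERVED_CHARS = 2072
OBSERVED_VISION_SECONDS = 35.0
OBSERVED_TOTAL_SECONDS = 43.9

# Derived quantities (assumption: linear scaling with output length).
CHARS_PER_SECOND = OBSERVED_CHARS / OBSERVED_VISION_SECONDS
NON_VISION_SECONDS = OBSERVED_TOTAL_SECONDS - OBSERVED_VISION_SECONDS


def estimate_total_seconds(description_chars: int) -> float:
    """Estimate end-to-end latency for a description of the given length."""
    vision_seconds = description_chars / CHARS_PER_SECOND
    return vision_seconds + NON_VISION_SECONDS


for chars in (500, 800):
    print(f"{chars} chars -> ~{estimate_total_seconds(chars):.0f}s total")
```

Under these assumptions the estimate lands at roughly 17s and 22s for 500 and 800 characters, in the same ballpark as the measured 15-20s. The estimate does not account for the shorter main prompt, which should help further.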

Option B (architectural, larger change): Add an auxiliary.vision.mode: native | preprocess config option. When the active provider supports multimodal natively (like llama-server with mmproj, Gemini, Claude), skip the pre-processing entirely and pass image content parts directly to the main inference call. This would eliminate one full inference pass and give an additional ~30-40% speedup on top of Option A.
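To make Option B concrete, here is a minimal sketch of the branching, not the actual gateway code. It assumes the backend accepts OpenAI-style chat content parts (text plus `image_url` with a base64 data URL). `build_user_content`, the `describe_image` callable, and the bracketed injection format are all hypothetical; only the `native` / `preprocess` mode names come from the proposal above.

```python
import base64
from typing import Any, Callable, Dict, List, Optional, Union

Content = Union[str, List[Dict[str, Any]]]


def build_user_content(
    user_text: str,
    image_bytes: bytes,
    mime_type: str,
    vision_mode: str,
    describe_image: Optional[Callable[[bytes], str]] = None,
) -> Content:
    """Build the user message content for the main inference call.

    native:     pass the image straight through as a content part,
                skipping the separate description pass.
    preprocess: call the vision model first and inject its text
                description ahead of the user's message.
    """
    if vision_mode == "native":
        encoded = base64.b64encode(image_bytes).decode("ascii")
        return [
            {"type": "text", "text": user_text},
            {
                "type": "image_url",
                "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
            },
        ]
    if vision_mode == "preprocess":
        if describe_image is None:
            raise ValueError("preprocess mode needs a describe_image callable")
        description = describe_image(image_bytes)
        # Hypothetical injection format; the real gateway may differ.
        return f"[Image description: {description}]\n\n{user_text}"
    raise ValueError(f"unknown vision mode: {vision_mode!r}")
```

Keeping `preprocess` as a fallback means text-only main models still work, while natively multimodal providers avoid the extra inference pass.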

Happy to open a PR for Option A. Option B is a larger refactor that probably deserves its own RFC.

Additional context

Related but separate issue: the default auxiliary.vision.timeout: 30s is also too short for local models on non-trivial images — anything above ~300 tokens in the description easily exceeds 30s on a 31B-class local model. A reasonable default for local multimodal backends would be 120-300s. Happy to file this as a second issue or roll it into the same PR.
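One possible shape for a backend-aware timeout default, as a sketch only. The function names and the loopback-host heuristic are assumptions for illustration; the 30s and 120s values are the current default and the lower end of the range suggested above.

```python
from typing import Optional
from urllib.parse import urlparse

REMOTE_TIMEOUT_SECONDS = 30.0   # current default, per this issue
LOCAL_TIMEOUT_SECONDS = 120.0   # lower end of the suggested 120-300s range

_LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def looks_local(base_url: str) -> bool:
    """Heuristic: treat loopback hosts as local backends (an assumption)."""
    return urlparse(base_url).hostname in _LOOPBACK_HOSTS


def resolve_vision_timeout(configured: Optional[float], base_url: str) -> float:
    """An explicit user setting wins; otherwise choose a default by backend."""
    if configured is not None:
        return float(configured)
    if looks_local(base_url):
        return LOCAL_TIMEOUT_SECONDS
    return REMOTE_TIMEOUT_SECONDS
```

A loopback check will miss local servers reached over a LAN address, so an explicit config override would still be needed in that case.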

Metadata

    Labels

      • P2: Medium, degraded but workaround exists
      • comp/gateway: Gateway runner, session dispatch, delivery
      • tool/vision: Vision analysis and image generation
      • type/perf: Performance improvement or optimization
