Default vision pre-process prompt generates overly long descriptions (~2000 chars), significantly slowing down image-bearing requests on local models

### Environment

- Hermes-Agent version: 0.9.0 (commit 722331a5)
- Backend: `llama-server` (llama.cpp) with mmproj — native multimodal capable
- Model: google/gemma-4-31B-it Q8_0 via custom provider
- Hardware: NVIDIA H100 80GB
- Python 3.11, Ubuntu

### Summary

`gateway/run.py::_enrich_message_with_vision()` pre-processes every inbound image by asking the vision model for a full description, then injects that description as text into the main prompt. The default analysis prompt asks for **thorough detail**, which generates descriptions of ~2000 characters. On local models this causes two problems:

1. **Latency**: generating a 2000-char description takes 35+ seconds on Gemma 4 31B @ Q8_0 (most of the 43.9s total response time for image requests).
2. **Prompt bloat**: the long description gets injected before the user's original message, inflating context for the second (main) inference pass, which further slows things down.

### Reproduction

1. Set up Hermes-Agent with a local multimodal backend (e.g., llama-server with mmproj).
2. Send an image (any real photo) to the agent via WeChat or any platform.
3. Observe `vision_analyze_tool` output length in `logs/agent.log`.
4. Measure end-to-end response time.

Observed on a real photograph:
- Vision description output: **2072 characters**
- Total response latency: **43.9 seconds** (of which ~35s was vision pre-processing)

### Root cause

Two defaults in the code:

**1. `gateway/run.py` (around line 7315)** — prompt is unbounded:

```python
analysis_prompt = (
    "Describe everything visible in this image in thorough detail. "
    "Include any text, code, data, objects, people, layout, colors, "
    "and any other notable visual information."
)
```

**2. `tools/vision_tools.py` (around line 568)** — `max_tokens: 2000`, which lets the model write very long descriptions.

Together these produce ~2000-char descriptions by default.

### Suggested fix

**Option A (minimal, recommended):** Shorten the default prompt and cap tokens:

```python
# gateway/run.py
analysis_prompt = (
    "Concisely describe this image in 2-4 sentences "
    "(~200 Chinese characters or ~150 English words). "
    "Cover: main subject(s), key visible text/data/code, overall context. "
    "If it's a chart or scientific figure, include axis labels, legend, "
    "and key values. Skip decorative details."
)

# tools/vision_tools.py
max_tokens = 500  # was 2000
```

**Results after fix (verified locally):**
- Description length: ~500-800 chars (down from ~2072)
- Latency: ~15-20s (down from 43.9s)
- Main-prompt inference quality: unchanged

**Option B (architectural, larger change):** Add an `auxiliary.vision.mode: native | preprocess` config option. When the active provider supports multimodal natively (like llama-server with mmproj, Gemini, Claude), skip the pre-processing entirely and pass image content parts directly to the main inference call. This would eliminate one full inference pass and give an additional ~30-40% speedup on top of Option A.

Happy to open a PR for Option A. Option B is a larger refactor that probably deserves its own RFC.

### Additional context

Related but separate issue: the default `auxiliary.vision.timeout: 30s` is also too short for local models on non-trivial images — anything above ~300 tokens in the description easily exceeds 30s on a 31B-class local model. A reasonable default for local multimodal backends would be **120-300s**. Happy to file this as a second issue or roll it into the same PR.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Default vision pre-process prompt generates overly long descriptions (~2000 chars), significantly slowing down image-bearing requests on local models #10809

Environment

Summary

Reproduction

Root cause

Suggested fix

Additional context

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Default vision pre-process prompt generates overly long descriptions (~2000 chars), significantly slowing down image-bearing requests on local models #10809

Description

Environment

Summary

Reproduction

Root cause

Suggested fix

Additional context

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions