ACP image content blocks dropped before API call (persist_user_message override clobbers multimodal content)

## Summary

Image content blocks sent via the ACP adapter (`session/prompt` with an `image` content block) **never reach the model**. The model behaves as if no image was attached. This happens for **every** provider/model, independent of `promptCapabilities.image` and `_model_supports_vision` — the image is dropped *before* the request payload is built.

Reproduced on `v0.16.0` (upstream `3edd09a`).

## Root cause

`AIAgent._apply_persist_user_message_override` (`run_agent.py`) rewrites the current-turn user message in place:

```python
def _apply_persist_user_message_override(self, messages):
    idx = getattr(self, "_persist_user_message_idx", None)
    override = getattr(self, "_persist_user_message_override", None)
    if override is None or idx is None:
        return
    if 0 <= idx < len(messages):
        msg = messages[idx]
        if isinstance(msg, dict) and msg.get("role") == "user":
            msg["content"] = override   # <-- clobbers multimodal content
```

The ACP adapter passes a **text-only** `persist_user_message` for multimodal prompts (`acp_adapter/server.py`):

```python
result = agent.run_conversation(
    user_message=user_content,                         # list: [{type:text}, {type:image_url}]
    ...
    persist_user_message=user_text or "[Image attachment]",   # plain string
)
```

`build_turn_context` runs a **crash-resilience persist of the inbound user turn before the first API call**, which calls `_apply_persist_user_message_override(messages)` on the **same** `messages` list the conversation loop later reads to build `api_messages`. So the multimodal `content` list (`[{type:"text"}, {type:"image_url"}]`) is overwritten with the plain `persist_user_message` string **before** the request is assembled. The `image_url` part is gone end-to-end.

Because the strip happens this early, `_model_supports_vision()` / `_prepare_messages_for_non_vision_model()` are never even reached (the message no longer has image parts), and a fully vision-capable model still sees text only.

### Trace evidence

Instrumenting the ACP prompt handler and `_prepare_messages_for_non_vision_model`:

- At `acp_adapter/server.py` after `_content_blocks_to_openai_user_content(prompt)`: `user_content_shape=['text', 'image_url']` ✅
- At `_prepare_messages_for_non_vision_model` entry (just before the API call): user message `content` = `str(len=135)` ❌ — exactly the text-only `persist_user_message`, image dropped.

A direct provider call with the same `image_url` data URL (bypassing Hermes) works fine on a vision model, confirming the loss is internal to Hermes.

## Minimal repro

Drive `hermes acp` over stdio (line-delimited JSON-RPC): `initialize` → `session/new` → `session/prompt` with:

```json
{
  "sessionId": "<id>",
  "prompt": [
    { "type": "text", "text": "Reply with ONLY the dominant color word of the image. If you received no image, reply IMAGE_NOT_RECEIVED." },
    { "type": "image", "data": "<base64 of a solid-color PNG>", "mimeType": "image/png" }
  ]
}
```

With any vision-capable model configured (e.g. `openrouter/google/gemini-2.5-flash-lite`), the model replies `IMAGE_NOT_RECEIVED` instead of the color.

## Fix

Don't clobber multimodal list content — the synthetic-prefix cleanup the override exists for only applies to text turns:

```python
        if 0 <= idx < len(messages):
            msg = messages[idx]
            if isinstance(msg, dict) and msg.get("role") == "user":
                if isinstance(msg.get("content"), list):
                    return
                msg["content"] = override
```

With this, the same repro returns the correct color. Text-only turns are unaffected.

### Suggested cleaner fix

The override mutating the **shared** `messages` list (used for both the API call and persistence) is the underlying smell. A more robust fix would apply the clean/redacted `persist_user_message` only to a copy used for transcript/DB persistence, leaving the API-bound messages untouched — and for multimodal turns, persist a redacted form (e.g. text + `[image]` placeholder) so base64 blobs don't bloat the session DB while the image still reaches the model.

## Environment

- Hermes Agent v0.16.0 (2026.6.5), upstream `3edd09a`
- ACP adapter path (`hermes acp` over stdio)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ACP image content blocks dropped before API call (persist_user_message override clobbers multimodal content) #44242

Summary

Root cause

Trace evidence

Minimal repro

Fix

Suggested cleaner fix

Environment

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

ACP image content blocks dropped before API call (persist_user_message override clobbers multimodal content) #44242

Description

Summary

Root cause

Trace evidence

Minimal repro

Fix

Suggested cleaner fix

Environment

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions