Summary
Image content blocks sent via the ACP adapter (session/prompt with an image content block) never reach the model. The model behaves as if no image was attached. This happens for every provider/model, independent of promptCapabilities.image and _model_supports_vision — the image is dropped before the request payload is built.
Reproduced on v0.16.0 (upstream 3edd09a).
Root cause
AIAgent._apply_persist_user_message_override (run_agent.py) rewrites the current-turn user message in place:
def _apply_persist_user_message_override(self, messages):
idx = getattr(self, "_persist_user_message_idx", None)
override = getattr(self, "_persist_user_message_override", None)
if override is None or idx is None:
return
if 0 <= idx < len(messages):
msg = messages[idx]
if isinstance(msg, dict) and msg.get("role") == "user":
msg["content"] = override # <-- clobbers multimodal content
The ACP adapter passes a text-only persist_user_message for multimodal prompts (acp_adapter/server.py):
result = agent.run_conversation(
user_message=user_content, # list: [{type:text}, {type:image_url}]
...
persist_user_message=user_text or "[Image attachment]", # plain string
)
build_turn_context runs a crash-resilience persist of the inbound user turn before the first API call, which calls _apply_persist_user_message_override(messages) on the same messages list the conversation loop later reads to build api_messages. So the multimodal content list ([{type:"text"}, {type:"image_url"}]) is overwritten with the plain persist_user_message string before the request is assembled. The image_url part is gone end-to-end.
Because the strip happens this early, _model_supports_vision() / _prepare_messages_for_non_vision_model() are never even reached (the message no longer has image parts), and a fully vision-capable model still sees text only.
Trace evidence
Instrumenting the ACP prompt handler and _prepare_messages_for_non_vision_model:
- At
acp_adapter/server.py after _content_blocks_to_openai_user_content(prompt): user_content_shape=['text', 'image_url'] ✅
- At
_prepare_messages_for_non_vision_model entry (just before the API call): user message content = str(len=135) ❌ — exactly the text-only persist_user_message, image dropped.
A direct provider call with the same image_url data URL (bypassing Hermes) works fine on a vision model, confirming the loss is internal to Hermes.
Minimal repro
Drive hermes acp over stdio (line-delimited JSON-RPC): initialize → session/new → session/prompt with:
{
"sessionId": "<id>",
"prompt": [
{ "type": "text", "text": "Reply with ONLY the dominant color word of the image. If you received no image, reply IMAGE_NOT_RECEIVED." },
{ "type": "image", "data": "<base64 of a solid-color PNG>", "mimeType": "image/png" }
]
}
With any vision-capable model configured (e.g. openrouter/google/gemini-2.5-flash-lite), the model replies IMAGE_NOT_RECEIVED instead of the color.
Fix
Don't clobber multimodal list content — the synthetic-prefix cleanup the override exists for only applies to text turns:
if 0 <= idx < len(messages):
msg = messages[idx]
if isinstance(msg, dict) and msg.get("role") == "user":
if isinstance(msg.get("content"), list):
return
msg["content"] = override
With this, the same repro returns the correct color. Text-only turns are unaffected.
Suggested cleaner fix
The override mutating the shared messages list (used for both the API call and persistence) is the underlying smell. A more robust fix would apply the clean/redacted persist_user_message only to a copy used for transcript/DB persistence, leaving the API-bound messages untouched — and for multimodal turns, persist a redacted form (e.g. text + [image] placeholder) so base64 blobs don't bloat the session DB while the image still reaches the model.
Environment
- Hermes Agent v0.16.0 (2026.6.5), upstream
3edd09a
- ACP adapter path (
hermes acp over stdio)
Summary
Image content blocks sent via the ACP adapter (
session/promptwith animagecontent block) never reach the model. The model behaves as if no image was attached. This happens for every provider/model, independent ofpromptCapabilities.imageand_model_supports_vision— the image is dropped before the request payload is built.Reproduced on
v0.16.0(upstream3edd09a).Root cause
AIAgent._apply_persist_user_message_override(run_agent.py) rewrites the current-turn user message in place:The ACP adapter passes a text-only
persist_user_messagefor multimodal prompts (acp_adapter/server.py):build_turn_contextruns a crash-resilience persist of the inbound user turn before the first API call, which calls_apply_persist_user_message_override(messages)on the samemessageslist the conversation loop later reads to buildapi_messages. So the multimodalcontentlist ([{type:"text"}, {type:"image_url"}]) is overwritten with the plainpersist_user_messagestring before the request is assembled. Theimage_urlpart is gone end-to-end.Because the strip happens this early,
_model_supports_vision()/_prepare_messages_for_non_vision_model()are never even reached (the message no longer has image parts), and a fully vision-capable model still sees text only.Trace evidence
Instrumenting the ACP prompt handler and
_prepare_messages_for_non_vision_model:acp_adapter/server.pyafter_content_blocks_to_openai_user_content(prompt):user_content_shape=['text', 'image_url']✅_prepare_messages_for_non_vision_modelentry (just before the API call): user messagecontent=str(len=135)❌ — exactly the text-onlypersist_user_message, image dropped.A direct provider call with the same
image_urldata URL (bypassing Hermes) works fine on a vision model, confirming the loss is internal to Hermes.Minimal repro
Drive
hermes acpover stdio (line-delimited JSON-RPC):initialize→session/new→session/promptwith:{ "sessionId": "<id>", "prompt": [ { "type": "text", "text": "Reply with ONLY the dominant color word of the image. If you received no image, reply IMAGE_NOT_RECEIVED." }, { "type": "image", "data": "<base64 of a solid-color PNG>", "mimeType": "image/png" } ] }With any vision-capable model configured (e.g.
openrouter/google/gemini-2.5-flash-lite), the model repliesIMAGE_NOT_RECEIVEDinstead of the color.Fix
Don't clobber multimodal list content — the synthetic-prefix cleanup the override exists for only applies to text turns:
With this, the same repro returns the correct color. Text-only turns are unaffected.
Suggested cleaner fix
The override mutating the shared
messageslist (used for both the API call and persistence) is the underlying smell. A more robust fix would apply the clean/redactedpersist_user_messageonly to a copy used for transcript/DB persistence, leaving the API-bound messages untouched — and for multimodal turns, persist a redacted form (e.g. text +[image]placeholder) so base64 blobs don't bloat the session DB while the image still reaches the model.Environment
3edd09ahermes acpover stdio)