feat: native multimodal vision routing for vision-capable models #8610
0xbyt4 wants to merge 15 commits into
Messages with content as a list of blocks (text + image_url/input_image) were stored via raw `content TEXT` insertion, which Python coerced to repr() and read back as an unparseable string. Session reload silently broke for any vision-capable model that received native image content.
Add a `content_blocks` TEXT column (schema v7) that stores the JSON serialization of the multimodal structure. The legacy `content` column keeps a flattened text representation so FTS5 search and any plain-string callers continue to work unchanged.
Round-trips both shapes losslessly:
- Plain string content stays in `content`, `content_blocks` is NULL
- List content gets searchable text in `content` AND JSON in `content_blocks`
Read paths (get_messages, get_messages_as_conversation) prefer `content_blocks` when present, fall back to `content` for legacy rows and the v6→v7 migration case.
Tests: 8 round-trip cases (text, image_url, input_image, audio, None, legacy plaintext, FTS5 searchability) + v6→v7 migration test.
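A minimal sketch of the two-column round trip described above, assuming a simplified `messages` table. The helper names (`flatten_text`, `encode_content`, `decode_content`, `round_trip`) are hypothetical and are not the functions in `hermes_state.py`; the real schema has more columns and an FTS5 index.

```python
import json
import sqlite3
from typing import Any, Optional, Tuple


def flatten_text(content: Any) -> str:
    """Join the text blocks of a content list into searchable plain text."""
    parts = []
    for block in content:
        if isinstance(block, dict) and block.get("type") in ("text", "input_text"):
            parts.append(str(block.get("text", "")))
    return "\n".join(parts)


def encode_content(content: Any) -> Tuple[Optional[str], Optional[str]]:
    """Return the (content, content_blocks) column values for one message."""
    if isinstance(content, list):
        # List content: flattened text for search, JSON for lossless reload.
        return flatten_text(content), json.dumps(content)
    # Plain string or None: stored as-is, content_blocks stays NULL.
    return content, None


def decode_content(content: Optional[str], content_blocks: Optional[str]) -> Any:
    """Prefer the JSON column; fall back to the legacy text column."""
    if content_blocks is not None:
        return json.loads(content_blocks)
    return content


def round_trip(content: Any) -> Any:
    """Write one message to an in-memory table and read it back."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE messages (id INTEGER PRIMARY KEY, content TEXT, content_blocks TEXT)"
    )
    conn.execute(
        "INSERT INTO messages (content, content_blocks) VALUES (?, ?)",
        encode_content(content),
    )
    row = conn.execute("SELECT content, content_blocks FROM messages").fetchone()
    conn.close()
    return decode_content(*row)
```

Keeping the flattened text in `content` means a legacy reader that only knows the old column still sees something sensible, while the JSON column carries the structure.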
Token estimators (estimate_messages_tokens_rough, estimate_request_tokens_rough) used `len(str(msg))` over the entire message dict. For multimodal content, str() expands base64 image data inline, so a 1MB image inflated the token estimate by ~250x (524K "tokens" for what is actually ~1500 tokens). The compressor and pre-flight context checks then either wiped the vision turn immediately or tripped false context-overflow rejections, making vision-capable models effectively unusable for any image bigger than a thumbnail.
Add `count_message_chars(msg)` that walks the message structure:
- Plain text content: counted directly
- text/input_text blocks: count text field only
- image_url/input_image/image blocks: fixed ~6000-char budget (≈1500 tokens, the average across Anthropic/OpenAI/Gemini)
- input_audio/audio blocks: fixed ~2000-char budget
- Tool calls: name + arguments
- Skip data: URLs in unknown block types
Use the helper in both estimators and in run_agent.py's pre-API request size logging hook so all paths report consistent numbers.
Tests: 11 multimodal cases (text-only, image_url, input_image, anthropic image type, audio, tool_calls, None content, request-level estimate, 2MB image stress test) + updated existing concrete-value tests to match the new accurate counting.
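The counting rules listed above could look roughly like this. It is a sketch, not the code in `agent/model_metadata.py`: the function name `count_message_chars_sketch` and the constant names are hypothetical, the budgets are the ~6000 and ~2000 figures from the commit message, and the per-message overhead that a later commit adds is left out.

```python
from typing import Any, Dict

# Fixed budgets taken from the commit message (assumed exact for this sketch).
IMAGE_CHAR_BUDGET = 6000
AUDIO_CHAR_BUDGET = 2000

TEXT_TYPES = {"text", "input_text"}
IMAGE_TYPES = {"image_url", "input_image", "image"}
AUDIO_TYPES = {"input_audio", "audio"}


def count_message_chars_sketch(msg: Dict[str, Any]) -> int:
    """Count characters in a chat message without expanding base64 payloads."""
    total = 0
    content = msg.get("content")
    if isinstance(content, str):
        total += len(content)
    elif isinstance(content, list):
        for block in content:
            if not isinstance(block, dict):
                total += len(str(block))
                continue
            block_type = block.get("type")
            if block_type in TEXT_TYPES:
                total += len(str(block.get("text", "")))
            elif block_type in IMAGE_TYPES:
                total += IMAGE_CHAR_BUDGET
            elif block_type in AUDIO_TYPES:
                total += AUDIO_CHAR_BUDGET
            else:
                # Unknown block: count string fields, skip inline data: URLs.
                for key, value in block.items():
                    if key == "type" or not isinstance(value, str):
                        continue
                    if not value.startswith("data:"):
                        total += len(value)
    # Tool calls contribute their function name and argument string.
    for call in msg.get("tool_calls") or []:
        function = call.get("function", {})
        total += len(str(function.get("name", "")))
        total += len(str(function.get("arguments", "")))
    return total
```

With a `chars / 4` rough estimator, a text-plus-image turn then lands near 1500 tokens regardless of how large the base64 payload is.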
The Codex Responses API path in _chat_messages_to_responses_input
collapsed message content via `str(content) if content is not None
else ""`. For text content this was a no-op, but for a multimodal
list it produced a Python repr (`"[{'type': 'image_url', ...}]"`)
which the API treated as opaque user text. Codex/OpenAI Responses
models with native vision never received the actual image — vision
was effectively dead on this code path.
Add `_chat_content_to_responses_content(content, role)` that walks
the structure and emits the right Responses input shape:
- Plain string → unchanged
- text/input_text/output_text blocks → input_text (or output_text
for assistant role)
- image_url/input_image blocks → input_image with URL string
- Anthropic-native image blocks (source.type=base64|url) →
input_image with derived data URL
Plus `_responses_content_is_empty()` to handle the
``has_codex_reasoning + empty content`` edge case for both string
and list shapes.
User and assistant message branches both use the helper now.
Tool result messages keep `str()` because the Responses API
``function_call_output`` shape requires output as a string —
multimodal tool results are out of scope here.
Tests: 11 conversion cases (string, None, text-only list, assistant
output_text, OpenAI image_url dict, image_url string, input_image
passthrough, anthropic base64 source, anthropic url source) + 2
end-to-end tests via _chat_messages_to_responses_input that verify
list content reaches the API as a list of input_image blocks
instead of a stringified repr.
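As an illustration of the conversion rules above, here is a sketch of a content walker. The function name mirrors the helper in the commit message but this is not its implementation; the `image/png` default media type and the handling of non-dict blocks are assumptions.

```python
from typing import Any, Dict, List, Union


def chat_content_to_responses_content(
    content: Any, role: str = "user"
) -> Union[str, List[Dict[str, Any]]]:
    """Convert chat-style content into Responses-style input items."""
    if content is None:
        return ""
    if isinstance(content, str):
        return content  # plain strings pass through unchanged

    text_type = "output_text" if role == "assistant" else "input_text"
    items: List[Dict[str, Any]] = []
    for block in content:
        if not isinstance(block, dict):
            # Assumption: stray non-dict entries are treated as text.
            items.append({"type": text_type, "text": str(block)})
            continue
        block_type = block.get("type")
        if block_type in ("text", "input_text", "output_text"):
            items.append({"type": text_type, "text": block.get("text", "")})
        elif block_type in ("image_url", "input_image"):
            url = block.get("image_url")
            if isinstance(url, dict):  # OpenAI chat shape: {"url": ...}
                url = url.get("url")
            if url:
                items.append({"type": "input_image", "image_url": url})
        elif block_type == "image":
            # Anthropic-native block: derive a URL from the source field.
            source = block.get("source") or {}
            url = None
            if source.get("type") == "base64":
                media_type = source.get("media_type", "image/png")
                url = f"data:{media_type};base64,{source.get('data', '')}"
            elif source.get("type") == "url":
                url = source.get("url")
            if url:
                items.append({"type": "input_image", "image_url": url})
    return items
```

The key property is that a list never reaches the API as a stringified repr: every image block becomes an `input_image` item carrying a URL string.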
run_conversation only accepted ``user_message: str``, which forced the gateway to flatten any image attachments to a text description via the vision_analyze tool, losing the actual pixels even when the target model had native vision support.
Widen the type to ``Union[str, List[Dict[str, Any]]]``. The list shape is the standard OpenAI multimodal content format (text + image_url blocks), which the chat_completions adapter passes through natively and the codex_responses adapter now converts via the helper added in the previous commit.
Required adjustments:
- _user_message_preview(): static helper that flattens str-or-list into a one-line preview (text + [image]/[audio] markers) for log lines and the "💬 Starting conversation" banner that previously did naked string slicing.
- _sanitize_user_message_in_place(): walks list content, sanitizes surrogates inside text blocks, leaves image/audio blocks untouched.
- original_user_message is now always coerced to a string before being passed to plugin hooks, _looks_like_codex_intermediate_ack, _save_trajectory, and the memory manager, all of which are string-only consumers that don't need (and would break on) raw multimodal.
The user_msg dict construction at the API boundary already worked for both shapes since it just wraps content as-is.
Anthropic native API path will still re-flatten via _prepare_anthropic_messages_for_api until the next commit makes it capability-aware. chat_completions and codex paths get native multimodal immediately.
Tests: 14 cases: preview helper (string, list, image-only, audio, anthropic image block, truncation), sanitize helper (surrogates in nested text fields, image fields untouched, no in-place mutation), end-to-end run_conversation with list content reaching the chat completions API as a list, plus a backwards-compat test that plain string content still produces string content.
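A sketch of what a str-or-list preview helper might do, in the spirit of `_user_message_preview()`. The name `user_message_preview`, the 80-character default, and the `...` truncation suffix are assumptions made for illustration.

```python
from typing import Any, Dict, List, Union

Message = Union[str, List[Dict[str, Any]]]


def user_message_preview(message: Message, limit: int = 80) -> str:
    """Flatten a string or content list into a one-line preview for logs."""
    if isinstance(message, str):
        text = message
    else:
        parts = []
        for block in message or []:
            block_type = block.get("type") if isinstance(block, dict) else None
            if block_type in ("text", "input_text"):
                parts.append(str(block.get("text", "")))
            elif block_type in ("image_url", "input_image", "image"):
                parts.append("[image]")  # marker instead of the payload
            elif block_type in ("input_audio", "audio"):
                parts.append("[audio]")
        text = " ".join(parts)
    # Collapse newlines and repeated whitespace so the preview stays on one line.
    text = " ".join(text.split())
    if len(text) <= limit:
        return text
    return text[:limit] + "..."
```

For example, `user_message_preview([{"type": "text", "text": "look at this"}, {"type": "image_url", "image_url": {"url": "data:..."}}])` yields `look at this [image]`, so log lines never contain base64 data.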
…apable models
_prepare_anthropic_messages_for_api unconditionally flattened image
content blocks via _preprocess_anthropic_content, which calls
vision_analyze_tool to produce a text description. This was a sane
fallback for older Anthropic models without image input, but for
Claude 3+, Opus 4.6, Sonnet 4.6, and every other vision-capable
Anthropic-compatible model the legacy path:
- Made an extra LLM call to a vision describer (slower, more cost)
- Lost pixel-level information (text descriptions are lossy)
- Made vision-capable models behave like blind ones for webdev,
pixel-perfect debugging, OCR, etc.
The anthropic_adapter at agent/anthropic_adapter.py:829 already
converts image_url/input_image content blocks into Anthropic's
native ``{"type": "image", "source": ...}`` format. Skip the legacy
preprocess when the model declares native vision support and let
the adapter do its job.
Add ``_model_supports_native_vision()`` that:
- Caches the lookup result per (provider, model) tuple
- Looks up via ``agent.models_dev.get_model_capabilities``
- Returns False on lookup failure (safe legacy fallback)
- Honors ``HERMES_FORCE_NATIVE_VISION=1`` env var for self-hosted
models that aren't catalogued in models.dev (e.g. vLLM serving
Llama 3.2 Vision)
Tests: 6 new cases — capability cache, unknown-model fallback, env
var override, lookup-exception safety, vision-capable passthrough
(verifies vision_analyze_tool is NOT called and image blocks are
preserved), non-vision flatten (verifies legacy path still works
when capability says False).
Existing TestAnthropicImageFallback tests still pass — the agent
fixture has an empty model name, so capability lookup returns None
and the legacy fallback path runs unchanged.
When a user attaches an image and the active model declares native vision support, build a typed content list (text + image_url blocks) and pass it through the agent unchanged so the model receives the actual pixels. Models without native vision keep the legacy vision_analyze fallback that flattens images to text descriptions.
This is the user-visible end of the native vision feature: bug fixes 1-3 made the persistence, token estimator, and codex API path multimodal-safe; feature commits 4-5 widened run_conversation and made the anthropic preprocess capability-aware. Now the gateway actually routes the image content the right way at the platform boundary.
Three new helpers on GatewayRunner:
- _message_preview_for_hook(message): flattens str-or-list to text for the agent:start hook payload, log lines, and the auto-title generator (which expects a plain string).
- _should_use_native_vision_for_source(source): resolves the active model+provider via _resolve_session_agent_runtime, looks up vision support via agent.models_dev.get_model_capabilities, and honors HERMES_FORCE_NATIVE_VISION=1 for self-hosted vision models not catalogued in models.dev.
- _build_native_vision_content(text, paths): reads each image into a base64 data URL via the existing tools.vision_tools helper and emits OpenAI-style image_url blocks. The user's caption becomes the first text block. Bad image paths are skipped with a warning rather than failing the whole request.
_prepare_inbound_message_text return type is widened to Optional[Any] (string OR list of content blocks). The two callers and _run_agent's ``message`` parameter all accept the wider type. The pending model-switch-note prepend at line 8049 is updated to prepend a text block when message is a list, so multimodal content isn't broken by string concatenation.
Tests: 15 cases — preview helper (5 shapes), capability resolver (force env var, vision-capable, non-vision, unknown model, empty model name, runtime resolution exception), content builder (text+ image, image-only, multiple images, unreadable image graceful skip).
The new gateway native vision routing auto-detects model capability via models.dev. Self-hosted vision models (vLLM serving Llama 3.2 Vision, etc.) and brand-new models not yet catalogued in models.dev need an opt-in escape hatch — add HERMES_FORCE_NATIVE_VISION to the env var reference table next to the AUXILIARY_VISION_* entries.
The CLI's chat() method always called _preprocess_images_with_vision when the user attached an image, which routes the image through the auxiliary vision model (Gemini Flash) to produce a text description and prepends that to the user's message. The actual pixels never reached the main model, even when running Claude Opus 4.6 or another vision-capable model that would have done a better job natively.
The vision.md user guide also incorrectly claimed images were sent "as base64-encoded content blocks, so any vision-capable model can process them". That was the intent but never the reality.
Mirror the gateway native vision routing here:
- Add _build_native_vision_content_cli(text, images): reads each attached image into a base64 data URL via the existing _image_to_base64_data_url helper and emits OpenAI-style image_url content blocks that the provider adapter converts to native form.
- chat() branches on agent._model_supports_native_vision() (the capability check added in commit 5). Native-capable models receive list content; everything else still gets the legacy text-flatten path so non-vision and unknown models keep working unchanged.
- Capability-check exceptions and missing agent both fall back to the legacy path safely.
Tests: 9 cases: content builder (text+image, image-only, multiple images, missing image graceful skip, all-bad fallback to placeholder string), and chat() routing (vision-capable uses native, non-vision uses legacy, capability exception falls back, no agent falls back).
The previous vision.md claimed images were sent as base64 content blocks "so any vision-capable model can process them". That was the intended behavior, but the actual code (both CLI and gateway) always routed images through the auxiliary vision model and prepended a text description. The docs were aspirational, not factual.
Now that the CLI and gateway both implement capability-aware routing, update the docs to accurately describe what happens:
- Add a "How It Works" enumeration of the two paths (native vision vs vision_analyze fallback) so users understand which model gets what shape.
- Add a "Messaging Platforms" section explaining that the same routing applies to Telegram, Discord, Matrix, etc.; webdev workflows now work end-to-end through messaging because the model receives actual pixels.
- Add a "Self-Hosted & Uncatalogued Vision Models" section documenting HERMES_FORCE_NATIVE_VISION=1 for vLLM-served Llama 3.2 Vision and other models that aren't yet in models.dev.
- Rewrite the "Supported Models" section to list models confirmed to use the native path and explain the legacy fallback for non-vision models, including the AUXILIARY_VISION_* override pointer.
…or providers
Two robustness gaps in the capability lookup pipeline became visible
once we tested against the Nous endpoint with claude-sonnet-4.6:
1. **Dot vs hyphen mismatch**. Anthropic's catalog stores
``claude-sonnet-4-6`` (hyphens) while OpenRouter's catalog stores
``anthropic/claude-sonnet-4.6`` (dots). _find_model_entry only did
exact and case-insensitive matching, so a dotted query against the
hyphenated catalog returned None even though the model is the same.
2. **Aggregator providers not in models.dev**. ``nous``, custom OpenAI-
compatible proxies, and similar aggregators aren't in
PROVIDER_TO_MODELS_DEV at all, so the lookup short-circuits to None.
But the model name typically carries an upstream vendor prefix
(``anthropic/claude-sonnet-4.6``) that points at a real catalogued
vendor — we just need to follow the slug.
Fixes:
- ``_find_model_entry``: add a third matching pass that normalizes
both dots and hyphens to underscores before comparing. ``4.6`` and
``4-6`` both become ``4_6`` and resolve to the same entry. Case-
insensitive too. Unrelated families still don't match because their
base names differ.
- ``_model_supports_native_vision`` (run_agent.py) and
``_should_use_native_vision_for_source`` (gateway/run.py): when the
direct ``(provider, model)`` lookup returns None and the model name
contains a slash, split on ``/`` and try the upstream vendor's
catalog. ``("nous", "anthropic/claude-sonnet-4.6")`` → fall back to
``("anthropic", "claude-sonnet-4.6")`` → matches via the new
normalization → returns supports_vision=True.
End-to-end verified against the Nous endpoint:
``_model_supports_native_vision()`` was returning False for
``nous + anthropic/claude-sonnet-4.6`` before this commit; now returns
True and the gateway routes images natively to Sonnet 4.6.
Tests: 4 dot/hyphen normalization cases (dotted query matches hyphen
catalog, hyphen query backwards-compat, uppercase + dot combined,
unrelated family stays unmatched), 3 prefix-fallback cases on AIAgent
(unmapped provider with vendor slug → True, no slash → False, unknown
vendor → False), 2 prefix-fallback cases on GatewayRunner (aggregator
with known vendor → True, unknown vendor → False).
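The three matching passes described for ``_find_model_entry`` can be sketched like this. The catalog is assumed to be a plain dict keyed by model id; the function names ``normalize_model_id`` and ``find_model_entry`` are hypothetical stand-ins.

```python
import re
from typing import Any, Dict, Optional


def normalize_model_id(model_id: str) -> str:
    """Lowercase and map dots and hyphens to underscores: 4.6 and 4-6 -> 4_6."""
    return re.sub(r"[.\-]", "_", model_id.lower())


def find_model_entry(catalog: Dict[str, Any], model: str) -> Optional[Any]:
    """Look up a model by exact id, then case-insensitively, then normalized."""
    # Pass 1: exact match.
    if model in catalog:
        return catalog[model]

    # Pass 2: case-insensitive match.
    lowered = model.lower()
    for key, entry in catalog.items():
        if key.lower() == lowered:
            return entry

    # Pass 3: dot/hyphen-insensitive match.
    normalized = normalize_model_id(model)
    for key, entry in catalog.items():
        if normalize_model_id(key) == normalized:
            return entry

    return None
```

Because the whole id is normalized and compared for equality, `claude-sonnet-4.6` resolves to `claude-sonnet-4-6` while `claude-opus-4.6` still misses a catalog that only lists Sonnet.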
Testing the previous capability fix against the Nous endpoint with
the full 9-model lineup revealed remaining false negatives:
moonshotai/kimi-k2.5 api=PASS hermes=NO
mistralai/mistral-small-2603 api=PASS hermes=NO
mistralai/mistral-small-3.2-24b-... api=PASS hermes=NO
z-ai/glm-4.5v api=PASS hermes=NO
Root cause: the vendor-prefix fallback from the previous commit split
``moonshotai/kimi-k2.5`` into ``("moonshotai", "kimi-k2.5")`` but the
moonshotai catalog in models.dev is empty (0 entries). Mistral's
slug uses ``mistralai/`` but the catalog ID is ``mistral`` with
different model version names. The upstream vendor fallback couldn't
recover any of these.
The OpenRouter catalog, on the other hand, has all four models
because Nous, custom proxies, and most aggregators share the
OpenRouter slug format: ``moonshotai/kimi-k2.5``,
``mistralai/mistral-small-2603``, ``z-ai/glm-4.5v``, etc. Querying
the OpenRouter catalog with the full slug is a strong catch-all.
Add a third fallback layer to both ``_model_supports_native_vision``
(run_agent.py) and ``_should_use_native_vision_for_source``
(gateway/run.py):
1. Direct (provider, model) — unchanged
2. Vendor prefix strip (vendor, vendor_model) — unchanged
3. OpenRouter aggregator: ("openrouter", model) when model has a
slash prefix AND provider is not already "openrouter" (avoids
redundant double-lookup for direct OR users)
After this commit, all 8 vision-capable models Nous actually serves
are correctly detected by Hermes: Claude Opus/Sonnet 4.6, GPT-5.4
Mini, Gemini 3 Flash, Kimi K2.5, Mistral Small (2603 + 3.2-24b),
and GLM-4.5v. The only remaining mismatch is
``mistralai/pixtral-large-2411`` because it's missing from the
models.dev OpenRouter catalog entirely — users hitting that model
can set HERMES_FORCE_NATIVE_VISION=1 as an escape hatch.
Tests: 3 new cases on AIAgent (openrouter catalog fallback for
unmapped aggregator, respect non-vision flag from OR catalog, skip
the redundant second lookup when provider is already openrouter),
1 new case on GatewayRunner.
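The three lookup layers can be written as an ordered candidate list. This is a sketch under assumptions: ``lookup`` stands in for the models.dev capability query and returns a dict with a ``supports_vision`` key or None; the function name is hypothetical.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Lookup = Callable[[str, str], Optional[Dict[str, Any]]]


def supports_vision_with_fallbacks(provider: str, model: str, lookup: Lookup) -> bool:
    """Resolve vision support via direct, vendor-prefix, then OpenRouter lookup."""
    # Layer 1: direct (provider, model).
    candidates: List[Tuple[str, str]] = [(provider, model)]

    if "/" in model:
        # Layer 2: strip the vendor prefix and ask the upstream vendor catalog.
        vendor, vendor_model = model.split("/", 1)
        candidates.append((vendor, vendor_model))
        # Layer 3: OpenRouter catalog with the full slug, skipped when the
        # provider is already openrouter to avoid a redundant second lookup.
        if provider != "openrouter":
            candidates.append(("openrouter", model))

    for candidate_provider, candidate_model in candidates:
        caps = lookup(candidate_provider, candidate_model)
        if caps is not None:
            # The first catalog that knows the model decides, including
            # an explicit non-vision flag.
            return bool(caps.get("supports_vision"))
    return False
```

Stopping at the first catalog hit is what lets an explicit `supports_vision=False` from the OpenRouter catalog win over the default, rather than being treated as "unknown".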
…d regression
Bug 2's original fix replaced ``len(str(msg))`` with a field-walking counter in count_message_chars. That fixed the ~100x overcount for messages carrying base64 image payloads but introduced a new problem: the new counter didn't include dict serialization overhead (braces, quoted field names, separators) that the legacy formula implicitly counted. For tool-heavy text conversations the new counter reported roughly 26% of the old value, a 74% undercount that would push preflight compression, budget pre-flight checks, and the context compressor to fire much later than they used to.
This is a real regression for non-vision users on smaller context windows (32K Kimi, local Gemma, etc.) because it means the estimator under-reports until the real context limit is almost hit.
Fix: add a per-message overhead constant (30 chars) to approximate the dict braces + ``'role':``/``'content':`` wrappers, plus a per-tool-call wrapper overhead (40 chars) for tool-calling turns. These constants were tuned by diffing the field-walking count against ``len(str(msg))`` for typical chat / tool / multi-turn conversations; the result now stays within ~10% of the legacy formula for text/tool cases (well inside the noise of a ``chars/4`` rough estimator) while keeping the massive 173x savings for base64-bearing multimodal turns.
Empirical comparison on a realistic 9-turn tool conversation:
  legacy len(str): 894 chars / 224 tokens
  before this fix: 274 chars / 69 tokens (31%, regression)
  after this fix: 809 chars / 203 tokens (91%, accurate)
And on a 1MB base64 image message:
  legacy len(str): 1,048,728 chars / 262,182 tokens (broken)
  after this fix: 6,056 chars / 1,514 tokens (correct)
Tests: updated 6 existing concrete-value expectations to match the new overhead constants, kept the upper-bound assertions on image messages (< 10,000 chars) that already had margin baked in.
Polish pass after self-review of the native vision work:
1. **Auto-resize large images before native send**. Both gateway and
CLI were calling ``_image_to_base64_data_url`` directly, which
encodes the image at its original resolution. A user dropping a
4K screenshot would get a 13MB base64 payload sent to the model,
costing ~2000+ tokens per image on OpenAI high-detail vs ~1500
expected. Switch to ``_resize_image_for_vision`` which is already
battle-tested (it's what the legacy ``_enrich_message_with_vision``
path uses) and auto-resizes to the standard ~5MB budget when the
encoded size would exceed it.
2. **Set ``detail: "auto"`` explicitly on image_url blocks**. Without
an explicit detail value, providers default to "high" for large
images which can double the token cost silently. Setting "auto"
lets the provider pick based on resolution and stays consistent
with what users see in typical OpenAI-compatible clients.
3. **Route logging for observability**. Both the gateway and CLI
decision points now emit a ``logger.info("[vision] route=... ...")``
line so operators can trace whether a given image-bearing turn
went through the native path or the legacy vision_analyze
fallback. Previously there was no way to debug "why did Claude
say it can't see my image?" without reading the source.
4. **Type annotation fix**. ``_build_native_vision_content`` was
declared as returning ``List[Dict[str, Any]]`` but the fallback
branch (all images failed to encode) returns the caption string.
Widen to ``Union[str, List[Dict[str, Any]]]`` so the annotation
matches reality; mypy / strict type-checkers would complain.
5. **Caption-only fallback returns a string, not a wrapped list**.
Previously if no images encoded successfully but the caption was
non-empty, the helper returned a list with a single text block.
That's semantically weird (a list with no image content has no
reason to be a list) and confused the edge-case test. Return the
caption string directly; let the regular text path handle it.
Tests: 2 updated skip-image tests to assert the new string fallback,
2 new tests covering the ``detail: "auto"`` contract on both the
gateway and CLI helpers.
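Putting points 2, 4, and 5 together, a content builder might look like the sketch below. It is not the gateway or CLI helper: the function names are hypothetical, the auto-resize step from point 1 is omitted because it depends on the project's ``_resize_image_for_vision`` helper, and the `image/png` default MIME type is an assumption.

```python
import base64
import logging
import mimetypes
from pathlib import Path
from typing import Any, Dict, List, Union

logger = logging.getLogger(__name__)


def image_to_data_url(path: Path) -> str:
    """Encode an image file as a base64 data URL (no resizing in this sketch)."""
    mime, _ = mimetypes.guess_type(path.name)
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime or 'image/png'};base64,{encoded}"


def build_native_vision_content(
    text: str, paths: List[str]
) -> Union[str, List[Dict[str, Any]]]:
    """Build OpenAI-style content blocks; fall back to the caption string."""
    blocks: List[Dict[str, Any]] = []
    if text:
        blocks.append({"type": "text", "text": text})  # caption goes first

    encoded_any = False
    for raw_path in paths:
        try:
            url = image_to_data_url(Path(raw_path))
        except OSError as exc:
            # Bad paths are skipped with a warning, not treated as fatal.
            logger.warning("[vision] skipping unreadable image %s: %s", raw_path, exc)
            continue
        blocks.append(
            {"type": "image_url", "image_url": {"url": url, "detail": "auto"}}
        )
        encoded_any = True

    if not encoded_any:
        return text  # no image survived: plain string, handled by the text path
    return blocks
```

Returning the bare caption when nothing encodes is why the return annotation has to be a `Union`: callers must be ready for either shape.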
…ient
The auxiliary client's _try_openrouter() helper read the OpenRouter API key from OPENROUTER_API_KEY but built the OpenAI client against the hardcoded https://openrouter.ai/api/v1 endpoint, ignoring any OPENROUTER_BASE_URL env override that the main agent path respects.
The visible failure mode is subtle and confusing:
1. User runs Hermes with OPENROUTER_API_KEY pointed at an alternate OR-compatible endpoint (Nous Portal, custom proxy) and sets OPENROUTER_BASE_URL accordingly.
2. Main agent works because the gateway runtime resolver respects the env override.
3. Vision auto-resolver tries OpenRouter, builds a client against the canonical openrouter.ai endpoint with the wrong API key.
4. OR returns 401, credential pool marks the entry exhausted.
5. ``check_vision_requirements()`` returns False on the next call.
6. ``vision_analyze`` is silently de-registered from the agent's toolset (it has a check_fn).
7. When the user later sends an image to a non-vision model, the gateway's legacy fallback path tells the agent to "use vision_analyze", but the tool isn't even in the agent's tool list. The agent improvises with browser_vision (Playwright missing) or read_file (binary), and a poorly-aligned model can then hallucinate a fabricated image description instead of admitting failure.
This was reproduced empirically: routing to Nous via OPENROUTER_BASE_URL and asking xiaomi/mimo-v2-pro about a manga panel resulted in a confident "Yemeksepeti food delivery bag" hallucination on one run and "Xiaomi smartphone interface" on the next, neither of which had any relationship to the actual image.
Fix: read OPENROUTER_BASE_URL from env when building the auxiliary OpenRouter client (same pattern the main agent path uses), falling back to the hardcoded default when unset. The pool path already honors per-entry base_url overrides; this change brings the env-var path in line.
Tests: 4 cases — env override applied, default fallback, no API key short-circuit, empty string treated as unset.
…l seeding
The previous commit fixed `_try_openrouter()` env-var path to honor
OPENROUTER_BASE_URL. But the credential pool path takes precedence,
and the openrouter-specific seeding branch in `_seed_from_env()` was
hardcoded to use the canonical openrouter.ai URL regardless of any
env override:
    if provider == "openrouter":
        token = os.getenv("OPENROUTER_API_KEY", "").strip()
        if token:
            ...
                "base_url": OPENROUTER_BASE_URL,  # ← hardcoded, ignored env
            ...
The generic seeding path further down DOES read the per-provider base
URL env var (via `pconfig.base_url_env_var`), but the openrouter
branch early-returns before reaching it.
Effect for users routing OPENROUTER_API_KEY through an alternate
OR-compatible endpoint (Nous Portal, custom proxy):
1. First gateway start: pool seeds an entry with the alternate
endpoint key + the canonical openrouter.ai URL.
2. Auxiliary vision call goes to openrouter.ai with the wrong key.
3. OpenRouter returns 401 ``Missing Authentication header``.
4. Pool entry persists with stale base_url even after env vars are
corrected — restarting Hermes doesn't help unless the user
manually deletes auth.json's openrouter pool section.
This was reproduced empirically on the Nous endpoint: with
OPENROUTER_API_KEY=sk-bpby... and OPENROUTER_BASE_URL pointed at Nous,
auxiliary vision_analyze calls returned 401 every time even though
the env vars were correct. Inspecting auth.json showed the pool
entry had base_url ``https://openrouter.ai/api/v1`` — the hardcoded
constant from this seeding function.
Fix: read OPENROUTER_BASE_URL the same way the generic path reads
``pconfig.base_url_env_var``, fall back to the hardcoded default
when unset, and strip trailing slashes.
Tests: 4 cases — env override applied (with and without trailing
slash), default fallback when env var unset, default fallback on
explicit empty string.
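The fix amounts to a small resolution rule, sketched here with a hypothetical function name and an injectable environment mapping for testing. The default URL is the canonical endpoint quoted in the commit message.

```python
import os
from typing import Mapping, Optional

DEFAULT_OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"


def resolve_openrouter_base_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the OpenRouter base URL, honoring OPENROUTER_BASE_URL when set."""
    source = os.environ if env is None else env
    # Treat a missing variable and an empty string the same way: unset.
    override = (source.get("OPENROUTER_BASE_URL") or "").strip().rstrip("/")
    return override or DEFAULT_OPENROUTER_BASE_URL
```

Stripping trailing slashes before storing the value keeps pool entries identical for `https://host/v1` and `https://host/v1/`, so the same endpoint is not seeded twice under different spellings.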
+1, this is the architecture I've been trying to achieve via aux-vision config and couldn't, because that subsystem has no fallback list. Native-vision-through-main-model is the right shape. One data point if useful for the test matrix:
Closing as superseded: native multimodal vision routing for vision-capable main models shipped in #16506 (commit ec671c4, "feat(image-input): native multimodal routing based on model vision capability"). It reaches the same architectural goal this PR proposed: capability-aware routing of inbound user-attached images as native
Summary
Hermes currently routes every image attachment through the auxiliary vision model (Gemini Flash by default) to produce a text description, regardless of whether the active model has native vision support. Claude Opus 4.6, GPT-5.4, Gemini 3 Flash, Xiaomi MiMo Omni and every other vision-capable model receives a lossy text summary instead of the actual pixels.
This PR makes the CLI, gateway, and agent loop capability-aware: when the active model declares native vision per models.dev, images flow through the API as typed content blocks (`image_url` / `input_image` / Anthropic `image`). Models without native vision keep the legacy `vision_analyze` fallback unchanged. Zero regressions for text-only or non-vision use cases.
Why This Matters
Real webdev workflow today:
- `vision_analyze` → Gemini Flash produces "I see a navbar at the top..."
After this PR:
- … (`text` + `image_url`)
Verified end-to-end against the Nous endpoint: models that previously saw text now read pixel-level content (see test matrix below).
Problem Scope
The naive "just add a content list at the gateway" approach doesn't work because the pipeline has five independent layers that each flatten multimodal content to text:
1. Persistence (`hermes_state.py`): SQLite `content TEXT` column stores raw `str(content)` → lists become Python repr blobs → session reload corrupted
2. Token estimation (`agent/model_metadata.py`): `len(str(msg))` counts base64 as text → 1MB image reports as ~262K tokens → compression wipes the image immediately
3. Codex Responses input (`run_agent.py`): `str(content)` coerces list to repr → API receives literal `"[{'type': 'image_url', ...}]"` as user text
4. `run_conversation` signature (`run_agent.py`): `user_message: str` → no way to pass multimodal through in the first place
5. Anthropic preprocess (`run_agent.py`): `_prepare_anthropic_messages_for_api` unconditionally converts image content blocks to text descriptions via `_describe_image_for_anthropic_fallback`, even when the model has native vision
Plus the gateway itself unconditionally called `_enrich_message_with_vision` for every image, and the CLI's `chat()` method did the same via `_preprocess_images_with_vision`.
Each layer needs its own targeted fix; bundling them would make the PR unreviewable. This PR is split into 12 logically-ordered commits: 5 bug fixes, 4 feature commits, 2 docs, 1 regression fix discovered during testing.
Commit Walkthrough
Bug fixes (layer-by-layer correctness)
1. `fix(hermes_state): persist multimodal content blocks for vision sessions`: Add `content_blocks TEXT` column; round-trip list content losslessly while keeping flattened searchable text in `content` for FTS5 and legacy rows
2. `fix(model_metadata): image-aware token estimation for multimodal turns`: Replace `len(str(msg))` with `count_message_chars()` that skips base64 payloads; applies ~1500-token budget per image instead of ~262K
3. `fix(run_agent): preserve multimodal content in codex responses input`: `_chat_content_to_responses_content()` walks content lists and emits `input_text` / `input_image` items in the shape the Codex Responses API expects
Feature commits (native vision path)
4. `feat(run_agent): accept multimodal list user_message in run_conversation`: Widen `user_message` to `Union[str, List[Dict[str, Any]]]`; `_user_message_preview()` + `_sanitize_user_message_in_place()` helpers; flatten to text for string-only downstream consumers (plugins, memory, trajectory)
5. `feat(run_agent): skip anthropic image-to-text preprocess for vision-capable models`: `_model_supports_native_vision()` with result caching and `HERMES_FORCE_NATIVE_VISION` escape hatch; `_prepare_anthropic_messages_for_api` skips flatten when capability says yes
6. `feat(gateway): native multimodal routing for vision-capable models`: `_prepare_inbound_message_text`; `_build_native_vision_content` reads local image files into base64 data URLs and emits OpenAI-style content blocks; both callers + `_run_agent` now accept `Union[str, list]`
8. `feat(cli): native multimodal routing for vision-capable models`: `chat()` method so `/paste` and drag-and-drop images get native vision too
Capability detection robustness (discovered during real-API testing)
10. `fix(native_vision): catalog matching for dotted versions and aggregator providers`: `_find_model_entry` normalizes dots/hyphens (`claude-sonnet-4.6` ↔ `claude-sonnet-4-6`); provider-prefix strip fallback resolves `("nous", "anthropic/claude-sonnet-4.6")` → `("anthropic", "claude-sonnet-4.6")`
11. `fix(native_vision): OpenRouter catalog as aggregator catch-all fallback`: Try `("openrouter", full_slug)` when direct + vendor-prefix both fail. Catches Kimi, Mistral-small, GLM-4.5v, etc. that use OR-compatible slugs but aren't in their native vendor catalog
12. `fix(model_metadata): add dict overhead to count_message_chars to avoid regression`: The rewritten counter undercounted relative to `len(str(msg))` for tool-heavy text messages. Add per-message (30 chars) and per-tool-call (40 chars) overhead to keep the estimator within ~10% of legacy for non-image cases
Docs
7. `docs(env): document HERMES_FORCE_NATIVE_VISION env var`
9. `docs(vision): document capability-aware native vision routing`
End-to-End Verification (Nous endpoint)
Test script: create a PNG with a recognizable 4-digit number, send to each model via `chat.completions.create` with a native `image_url` content block, check whether the model reads the number back correctly.
- `anthropic/claude-sonnet-4.6`
- `anthropic/claude-opus-4.6`
- `openai/gpt-5.4-mini`
- `google/gemini-3-flash-preview`
- `google/gemma-4-31b-it`
- `google/gemma-4-26b-a4b-it`
- `google/gemma-3-4b-it`
- `moonshotai/kimi-k2.5`
- `mistralai/mistral-small-2603` (`mistralai/` vs catalog `mistral`)
- `mistralai/mistral-small-3.2-24b-instruct`
- `z-ai/glm-4.5v` (`z-ai/` vs catalog `zai`)
- `xiaomi/mimo-v2-omni`
- `mistralai/pixtral-large-2411` (`HERMES_FORCE_NATIVE_VISION=1`)
Before this PR: 0/13 received native vision; every image was text-flattened regardless of model capability.
After this PR: 12/13 automatically routed to native vision; 1 edge case covered by the env var escape hatch.
Backwards Compatibility
This is the section I want reviewers to scrutinize hardest, because image-handling touches so many layers.
Text-only messages: zero behavior change
- `hermes_state.append_message(content="hello")` → `content_blocks` is NULL, `content` TEXT stores `"hello"` → read path returns `"hello"` unchanged
- `run_conversation(user_message="hello")` → `user_msg = {"role": "user", "content": "hello"}` (unchanged)
- `_prepare_anthropic_messages_for_api([...])` → early return `if not any(... has_image_parts ...)` (unchanged)
- `_chat_messages_to_responses_input([...])` → string input falls through the new helper and returns unchanged
- `_prepare_inbound_message_text` → the `if image_paths` branch is skipped entirely for text-only events
Image messages + non-vision model: legacy path unchanged
- `_should_use_native_vision_for_source` returns False → `_enrich_message_with_vision` is called just like before
- `AIAgent._model_supports_native_vision` returns False → `_prepare_anthropic_messages_for_api` runs the legacy text-flatten path
- `chat()` with no agent or non-vision model → falls through to `_preprocess_images_with_vision`
Image messages + vision model: new native path
- `_run_agent → run_conversation(list, ...)`
- `user_msg = {"role": "user", "content": [list]}` is sent to the provider adapter
- `anthropic_adapter.py:829` already handles `image_url` → Anthropic native image block (zero change)
- `chat_completions` path passes through (zero change)
- `codex_responses` path converts via the helper in commit 3
Token estimator regression caught during testing
Commit 12 is important: the rewrite in commit 2 removed the implicit dict-serialization overhead that `len(str(msg))` included. For text/tool conversations, the new estimator was reporting ~26% of the legacy value, a 74% undercount that would make preflight compression fire far later than before and put 32K-context models at risk of hitting real overflow.
Comparison on a realistic 9-turn tool conversation (empirical):
- legacy `len(str)`: 894 chars / 224 tokens
- before commit 12: 274 chars / 69 tokens (31%, regression)
- after commit 12: 809 chars / 203 tokens (91%, accurate)
And on a 1MB base64 image message:
- legacy `len(str)`: 1,048,728 chars / 262,182 tokens ← broken (the bug we're fixing)
- after commit 12: 6,056 chars / 1,514 tokens (correct)
Test Coverage
~80 new tests across 6 files:
- `tests/test_hermes_state.py`: 8 round-trip cases + 1 v6→v7 migration test
- `tests/agent/test_model_metadata.py`: 11 token estimation / multimodal char counting cases + 4 dot/hyphen normalization cases
- `tests/run_agent/test_run_agent.py`: 14 multimodal `run_conversation` + 6 native vision capability + 4 prefix fallback + 3 OpenRouter fallback
- `tests/run_agent/test_run_agent_codex_responses.py`: 11 content conversion + 2 end-to-end codex responses
- `tests/gateway/test_native_vision_routing.py`: 15 helper tests + 2 OpenRouter fallback (new file)
- `tests/cli/test_cli_native_vision.py`: 9 helper + chat() routing tests (new file)
All ~380 tests in the touched areas pass. 9 pre-existing failures in unrelated files (`test_auxiliary_client.py`, `test_session_race_guard.py`, etc.) remain unchanged by this branch, confirmed by running the same tests against `main` with the branch stashed.
Known Limitations
- `mistralai/pixtral-large-2411`: Missing from the models.dev OpenRouter catalog so capability detection returns False. Users can set `HERMES_FORCE_NATIVE_VISION=1` until the catalog is updated.
- Self-hosted and uncatalogued vision models need `HERMES_FORCE_NATIVE_VISION=1`. Documented in `vision.md` and `environment-variables.md`.
- `function_call_output` in the Codex Responses API takes a string `output` field, so images in tool results still get flattened. Out of scope for this PR; would need a separate design for tool result multimodal.
Env Vars Added
- `HERMES_FORCE_NATIVE_VISION=1`: Force native vision routing in both CLI and gateway regardless of capability lookup. Documented in `website/docs/reference/environment-variables.md`.
Test plan
- `/paste` with vision-capable model sends native `image_url` content block
- `MessageEvent` with `media_urls` → native content list when model has vision