feat: native multimodal vision routing for vision-capable models by 0xbyt4 · Pull Request #8610 · NousResearch/hermes-agent

0xbyt4 · 2026-04-12T21:17:27Z

Summary

Hermes currently routes every image attachment through the auxiliary vision model (Gemini Flash by default) to produce a text description, regardless of whether the active model has native vision support. Claude Opus 4.6, GPT-5.4, Gemini 3 Flash, Xiaomi MiMo Omni and every other vision-capable model receives a lossy text summary instead of the actual pixels.

This PR makes the CLI, gateway, and agent loop capability-aware: when the active model declares native vision per models.dev, images flow through the API as typed content blocks (image_url / input_image / Anthropic image). Models without native vision keep the legacy vision_analyze fallback unchanged. Zero regressions for text-only or non-vision use cases.

Why This Matters

Real webdev workflow today:

User drops a screenshot of a broken layout into Telegram
Gateway calls vision_analyze → Gemini Flash produces "I see a navbar at the top..."
That text description is prepended to the user's message
Claude Sonnet 4.6 receives only the text description and is asked "what's wrong?"
Claude cannot diagnose pixel-level issues because it never sees the pixels

After this PR:

Gateway detects Sonnet 4.6 has native vision
Builds a multimodal content list (text + image_url)
Sonnet 4.6 receives the actual pixels and diagnoses the 20px offset directly

Verified end-to-end against the Nous endpoint: models that previously saw text now read pixel-level content (see test matrix below).

Problem Scope

The naive "just add a content list at the gateway" approach doesn't work because the pipeline has five independent layers that each flatten multimodal content to text:

Persistence (hermes_state.py): SQLite content TEXT column stores raw str(content) → lists become Python repr blobs → session reload corrupted
Token estimation (agent/model_metadata.py): len(str(msg)) counts base64 as text → 1MB image reports as ~262K tokens → compression wipes the image immediately
Codex Responses API path (run_agent.py): str(content) coerces list to repr → API receives literal "[{'type': 'image_url', ...}]" as user text
run_conversation signature (run_agent.py): user_message: str → no way to pass multimodal through in the first place
Anthropic preprocess (run_agent.py): _prepare_anthropic_messages_for_api unconditionally converts image content blocks to text descriptions via _describe_image_for_anthropic_fallback, even when the model has native vision

Plus the gateway itself unconditionally called _enrich_message_with_vision for every image, and the CLI's chat() method did the same via _preprocess_images_with_vision.

Each layer needs its own targeted fix; bundling them would make the PR unreviewable. This PR is split into 12 logically-ordered commits — 5 bug fixes, 4 feature commits, 2 docs, 1 regression fix discovered during testing.

Commit Walkthrough

Bug fixes (layer-by-layer correctness)

#	Commit	What it fixes
1	`fix(hermes_state): persist multimodal content blocks for vision sessions`	Add schema v7 `content_blocks TEXT` column; round-trip list content losslessly while keeping flattened searchable text in `content` for FTS5 and legacy rows
2	`fix(model_metadata): image-aware token estimation for multimodal turns`	Replace `len(str(msg))` with `count_message_chars()` that skips base64 payloads; applies ~1500-token budget per image instead of ~262K
3	`fix(run_agent): preserve multimodal content in codex responses input`	`_chat_content_to_responses_content()` walks content lists and emits `input_text` / `input_image` items in the shape the Codex Responses API expects

Feature commits (native vision path)

#	Commit	What it adds
4	`feat(run_agent): accept multimodal list user_message in run_conversation`	Widen type to `Union[str, List[Dict[str, Any]]]`; `_user_message_preview()` + `_sanitize_user_message_in_place()` helpers; flatten to text for string-only downstream consumers (plugins, memory, trajectory)
5	`feat(run_agent): skip anthropic image-to-text preprocess for vision-capable models`	`_model_supports_native_vision()` with result caching and `HERMES_FORCE_NATIVE_VISION` escape hatch; `_prepare_anthropic_messages_for_api` skips flatten when capability says yes
6	`feat(gateway): native multimodal routing for vision-capable models`	Capability check in `_prepare_inbound_message_text`; `_build_native_vision_content` reads local image files into base64 data URLs and emits OpenAI-style content blocks; both callers + `_run_agent` now accept `Union[str, list]`
7	`feat(cli): native multimodal routing for vision-capable models`	Mirror the gateway routing in the CLI's `chat()` method so `/paste` and drag-and-drop images get native vision too

Capability detection robustness (discovered during real-API testing)

#	Commit	What it fixes
8	`fix(native_vision): catalog matching for dotted versions and aggregator providers`	`_find_model_entry` normalizes dots/hyphens (`claude-sonnet-4.6` ↔ `claude-sonnet-4-6`); provider-prefix strip fallback resolves `("nous", "anthropic/claude-sonnet-4.6")` → `("anthropic", "claude-sonnet-4.6")`
9	`fix(native_vision): OpenRouter catalog as aggregator catch-all fallback`	Third fallback layer: query `("openrouter", full_slug)` when direct + vendor-prefix both fail. Catches Kimi, Mistral-small, GLM-4.5v, etc. that use OR-compatible slugs but aren't in their native vendor catalog
12	`fix(model_metadata): add dict overhead to count_message_chars to avoid regression`	The rewrite in commit 2 was reporting ~26% of the legacy `len(str(msg))` for tool-heavy text messages. Add per-message (30 chars) and per-tool-call (40 chars) overhead to keep the estimator within ~10% of legacy for non-image cases

Docs

#	Commit
10	`docs(env): document HERMES_FORCE_NATIVE_VISION env var`
11	`docs(vision): document capability-aware native vision routing`

End-to-End Verification (Nous endpoint)

Test script: create a PNG with a recognizable 4-digit number, send to each model via chat.completions.create with a native image_url content block, check whether the model reads the number back correctly.

Model	API accepts vision	Hermes detects capability	Note
`anthropic/claude-sonnet-4.6`	✅	✅
`anthropic/claude-opus-4.6`	✅	✅
`openai/gpt-5.4-mini`	✅	✅
`google/gemini-3-flash-preview`	✅	✅
`google/gemma-4-31b-it`	✅	✅
`google/gemma-4-26b-a4b-it`	✅	✅
`google/gemma-3-4b-it`	✅	✅
`moonshotai/kimi-k2.5`	✅	✅	Fixed by commit 9 (OpenRouter fallback — moonshotai catalog empty in models.dev)
`mistralai/mistral-small-2603`	✅	✅	Fixed by commit 9 (slug `mistralai/` vs catalog `mistral`)
`mistralai/mistral-small-3.2-24b-instruct`	✅	✅	Fixed by commit 9
`z-ai/glm-4.5v`	✅	✅	Fixed by commit 9 (slug `z-ai/` vs catalog `zai`)
`xiaomi/mimo-v2-omni`	✅	✅	Described the blue background AND read the number
`mistralai/pixtral-large-2411`	✅	❌	Missing from models.dev OpenRouter catalog entirely; users hitting this model can set `HERMES_FORCE_NATIVE_VISION=1`

Before this PR: 0/13 received native vision — every image was text-flattened regardless of model capability.
After this PR: 12/13 automatically routed to native vision; 1 edge case covered by the env var escape hatch.

Backwards Compatibility

This is the section I want reviewers to scrutinize hardest, because image-handling touches so many layers.

Text-only messages — zero behavior change

hermes_state.append_message(content="hello") → content_blocks is NULL, content TEXT stores "hello" → read path returns "hello" unchanged
run_conversation(user_message="hello") → user_msg = {"role": "user", "content": "hello"} (unchanged)
_prepare_anthropic_messages_for_api([...]) → early return if not any(... has_image_parts ...) (unchanged)
_chat_messages_to_responses_input([...]) → string input falls through the new helper and returns unchanged
Gateway _prepare_inbound_message_text → the if image_paths branch is skipped entirely for text-only events

Image messages + non-vision model — legacy path unchanged

_should_use_native_vision_for_source returns False → _enrich_message_with_vision is called just like before
AIAgent._model_supports_native_vision returns False → _prepare_anthropic_messages_for_api runs the legacy text-flatten path
CLI chat() with no agent or non-vision model → falls through to _preprocess_images_with_vision

Image messages + vision model — new native path

Gateway builds content list, passes it through _run_agent → run_conversation(list, ...)
user_msg = {"role": "user", "content": [list]} is sent to the provider adapter
anthropic_adapter.py:829 already handles image_url → Anthropic native image block (zero change)
chat_completions path passes through (zero change)
codex_responses path converts via the helper in commit 3

Token estimator regression caught during testing

Commit 12 is important: the rewrite in commit 2 removed the implicit dict-serialization overhead that len(str(msg)) included. For text/tool conversations, the new estimator was reporting ~26% of the legacy value — a 74% undercount that would make preflight compression fire far later than before and put 32K-context models at risk of hitting real overflow.

Comparison on a realistic 9-turn tool conversation (empirical):

Legacy len(str): 894 chars / 224 tokens
After commit 2: 274 chars / 69 tokens ← ~31% of legacy (a ~69% undercount), regression
After commit 12: 809 chars / 203 tokens ← 91% of legacy, within noise

And on a 1MB base64 image message:

Legacy len(str): 1,048,728 chars / 262,182 tokens ← broken (the bug we're fixing)
After commit 12: 6,056 chars / 1,514 tokens ← correct (1500 tokens ≈ actual image cost)
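The token figures above are all consistent with a `chars / 4` estimate rounded up. The sketch below reproduces that arithmetic; the ceiling rounding is inferred from the numbers and is an assumption, and `rough_tokens` is a hypothetical name, not the function in `agent/model_metadata.py`.

```python
def rough_tokens(chars: int) -> int:
    """chars/4 rounded up (assumed rounding; it matches the figures quoted above)."""
    return (chars + 3) // 4

# Character counts quoted in the PR description.
nine_turn = {"legacy": 894, "after_commit_2": 274, "after_commit_12": 809}
image_1mb = {"legacy": 1_048_728, "after_commit_12": 6_056}

nine_turn_tokens = {name: rough_tokens(chars) for name, chars in nine_turn.items()}
image_tokens = {name: rough_tokens(chars) for name, chars in image_1mb.items()}

# How many times smaller the image estimate becomes after the fix.
image_reduction = image_tokens["legacy"] / image_tokens["after_commit_12"]

print(nine_turn_tokens)   # {'legacy': 224, 'after_commit_2': 69, 'after_commit_12': 203}
print(image_tokens)       # {'legacy': 262182, 'after_commit_12': 1514}
print(round(image_reduction))  # 173
```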

Test Coverage

~80 new tests across 6 files:

tests/test_hermes_state.py — 8 round-trip cases + 1 v6→v7 migration test
tests/agent/test_model_metadata.py — 11 token estimation / multimodal char counting cases + 4 dot/hyphen normalization cases
tests/run_agent/test_run_agent.py — 14 multimodal run_conversation + 6 native vision capability + 4 prefix fallback + 3 OpenRouter fallback
tests/run_agent/test_run_agent_codex_responses.py — 11 content conversion + 2 end-to-end codex responses
tests/gateway/test_native_vision_routing.py — 15 helper tests + 2 OpenRouter fallback (new file)
tests/cli/test_cli_native_vision.py — 9 helper + chat() routing tests (new file)

All ~380 tests in the touched areas pass. 9 pre-existing failures in unrelated files (test_auxiliary_client.py, test_session_race_guard.py, etc.) remain unchanged by this branch — confirmed by running the same tests against main with the branch stashed.

Known Limitations

mistralai/pixtral-large-2411 — Missing from the models.dev OpenRouter catalog so capability detection returns False. Users can set HERMES_FORCE_NATIVE_VISION=1 until the catalog is updated.
Self-hosted vision models (vLLM + Llama 3.2 Vision, etc.) — Not in models.dev, need HERMES_FORCE_NATIVE_VISION=1. Documented in vision.md and environment-variables.md.
Tool-result images — function_call_output in the Codex Responses API takes a string output field, so images in tool results still get flattened. Out of scope for this PR; would need a separate design for tool result multimodal.

Env Vars Added

HERMES_FORCE_NATIVE_VISION=1 — Force native vision routing in both CLI and gateway regardless of capability lookup. Documented in website/docs/reference/environment-variables.md.

Test plan

End-to-end: all 13 models from the Nous popular list tested against real API (test matrix above)
Unit tests: ~80 new tests, all pass
Regression: 9-turn tool conversation token estimate stays within 10% of legacy
Backwards compat: text-only messages exercise zero new code paths
CLI /paste with vision-capable model sends native image_url content block
Gateway routing from Telegram-style MessageEvent with media_urls → native content list when model has vision

Messages with content as a list of blocks (text + image_url/input_image) were stored via raw `content TEXT` insertion, which Python coerced to repr() and read back as an unparseable string. Session reload silently broke for any vision-capable model that received native image content.

Add a `content_blocks` TEXT column (schema v7) that stores the JSON serialization of the multimodal structure. The legacy `content` column keeps a flattened text representation so FTS5 search and any plain-string callers continue to work unchanged.

Round-trips both shapes losslessly:

- Plain string content stays in `content`, `content_blocks` is NULL
- List content gets searchable text in `content` AND JSON in `content_blocks`

Read paths (get_messages, get_messages_as_conversation) prefer `content_blocks` when present, fall back to `content` for legacy rows and the v6→v7 migration case.

Tests: 8 round-trip cases (text, image_url, input_image, audio, None, legacy plaintext, FTS5 searchability) + v6→v7 migration test.
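A minimal sketch of the two-column round trip this commit message describes, assuming a simplified `messages` table. The table layout and the helper names `flatten_text`, `append_message` and `get_messages` are illustrative stand-ins, not the actual code in `hermes_state.py`.

```python
import json
import sqlite3

def flatten_text(content):
    """Searchable text for the legacy `content` column (text blocks only)."""
    if content is None or isinstance(content, str):
        return content
    parts = [b.get("text", "") for b in content if b.get("type") in ("text", "input_text")]
    return "\n".join(parts)

def append_message(conn, role, content):
    # List content is stored as JSON in content_blocks; plain strings leave it NULL.
    blocks = json.dumps(content) if isinstance(content, list) else None
    conn.execute(
        "INSERT INTO messages (role, content, content_blocks) VALUES (?, ?, ?)",
        (role, flatten_text(content), blocks),
    )

def get_messages(conn):
    rows = conn.execute(
        "SELECT role, content, content_blocks FROM messages ORDER BY id"
    ).fetchall()
    # Prefer content_blocks when present, fall back to content for plain/legacy rows.
    return [
        {"role": role, "content": json.loads(blocks) if blocks is not None else text}
        for role, text, blocks in rows
    ]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, role TEXT, content TEXT, content_blocks TEXT)"
)

multimodal = [
    {"type": "text", "text": "what's wrong with this layout?"},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,AAAA"}},
]
append_message(conn, "user", "hello")
append_message(conn, "user", multimodal)
restored = get_messages(conn)
```

Keeping the flattened text in `content` is what lets a full-text index keep working without knowing about the JSON column.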

Token estimators (estimate_messages_tokens_rough, estimate_request_tokens_rough) used `len(str(msg))` over the entire message dict. For multimodal content, str() expands base64 image data inline, so a 1MB image inflated the token estimate by ~173x (~262K "tokens" for what is actually ~1500 tokens). The compressor and pre-flight context checks then either wiped the vision turn immediately or tripped false context-overflow rejections, making vision-capable models effectively unusable for any image bigger than a thumbnail.

Add `count_message_chars(msg)` that walks the message structure:

- Plain text content: counted directly
- text/input_text blocks: count text field only
- image_url/input_image/image blocks: fixed ~6000-char budget (≈1500 tokens, the average across Anthropic/OpenAI/Gemini)
- input_audio/audio blocks: fixed ~2000-char budget
- Tool calls: name + arguments
- Skip data: URLs in unknown block types

Use the helper in both estimators and in run_agent.py's pre-API request size logging hook so all paths report consistent numbers.

Tests: 11 multimodal cases (text-only, image_url, input_image, anthropic image type, audio, tool_calls, None content, request-level estimate, 2MB image stress test) + updated existing concrete-value tests to match the new accurate counting.
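The counting rules listed in this commit message can be sketched as follows. This is an approximation under stated assumptions: the budgets come from the message, but the handling of unknown blocks, the tool-call shape (`function.name` / `function.arguments`) and the helper `estimate_tokens_rough` are guesses for illustration, not the code in `agent/model_metadata.py`.

```python
IMAGE_CHAR_BUDGET = 6000   # about 1500 tokens at chars/4, per the commit message
AUDIO_CHAR_BUDGET = 2000

TEXT_TYPES = {"text", "input_text"}
IMAGE_TYPES = {"image_url", "input_image", "image"}
AUDIO_TYPES = {"input_audio", "audio"}

def count_message_chars(msg: dict) -> int:
    """Walk the message instead of stringifying it, so base64 payloads are never counted."""
    total = 0
    content = msg.get("content")
    if isinstance(content, str):
        total += len(content)
    elif isinstance(content, list):
        for block in content:
            if not isinstance(block, dict):
                total += len(str(block))
                continue
            btype = block.get("type")
            if btype in TEXT_TYPES:
                total += len(block.get("text") or "")
            elif btype in IMAGE_TYPES:
                total += IMAGE_CHAR_BUDGET
            elif btype in AUDIO_TYPES:
                total += AUDIO_CHAR_BUDGET
            else:
                # Unknown block type (assumed handling): count string fields, skip data: URLs.
                for value in block.values():
                    if isinstance(value, str) and not value.startswith("data:"):
                        total += len(value)
    for call in msg.get("tool_calls") or []:
        fn = call.get("function", {})
        total += len(fn.get("name") or "") + len(fn.get("arguments") or "")
    return total

def estimate_tokens_rough(messages: list) -> int:
    """Hypothetical wrapper: chars/4 over the walked character count."""
    return sum(count_message_chars(m) for m in messages) // 4

image_msg = {
    "role": "user",
    "content": [
        {"type": "text", "text": "hi"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64," + "A" * 1_000_000}},
    ],
}
legacy_chars = len(str(image_msg))            # over one million characters
walked_chars = count_message_chars(image_msg)  # 2 + 6000
```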

The Codex Responses API path in _chat_messages_to_responses_input collapsed message content via `str(content) if content is not None else ""`. For text content this was a no-op, but for a multimodal list it produced a Python repr (`"[{'type': 'image_url', ...}]"`) which the API treated as opaque user text. Codex/OpenAI Responses models with native vision never received the actual image, so vision was effectively dead on this code path.

Add `_chat_content_to_responses_content(content, role)` that walks the structure and emits the right Responses input shape:

- Plain string → unchanged
- text/input_text/output_text blocks → input_text (or output_text for assistant role)
- image_url/input_image blocks → input_image with URL string
- Anthropic-native image blocks (source.type=base64|url) → input_image with derived data URL

Plus `_responses_content_is_empty()` to handle the ``has_codex_reasoning + empty content`` edge case for both string and list shapes.

User and assistant message branches both use the helper now. Tool result messages keep `str()` because the Responses API ``function_call_output`` shape requires output as a string; multimodal tool results are out of scope here.

Tests: 11 conversion cases (string, None, text-only list, assistant output_text, OpenAI image_url dict, image_url string, input_image passthrough, anthropic base64 source, anthropic url source) + 2 end-to-end tests via _chat_messages_to_responses_input that verify list content reaches the API as a list of input_image blocks instead of a stringified repr.
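The conversion table in this commit message can be illustrated with a small stand-alone function. It is a sketch of the mapping only; the name `chat_content_to_responses_content` (without the leading underscore) and the handling of unrecognized blocks are assumptions, not the helper in `run_agent.py`.

```python
def chat_content_to_responses_content(content, role="user"):
    """Map chat-completions style content to Responses-style input items."""
    if content is None:
        return ""
    if isinstance(content, str):
        return content  # plain strings pass through unchanged
    text_type = "output_text" if role == "assistant" else "input_text"
    items = []
    for block in content:
        btype = block.get("type")
        if btype in ("text", "input_text", "output_text"):
            items.append({"type": text_type, "text": block.get("text", "")})
        elif btype in ("image_url", "input_image"):
            url = block.get("image_url")
            if isinstance(url, dict):  # chat-completions shape: {"url": "..."}
                url = url.get("url")
            items.append({"type": "input_image", "image_url": url})
        elif btype == "image":  # Anthropic-native block with a `source` dict
            source = block.get("source", {})
            if source.get("type") == "base64":
                url = f"data:{source.get('media_type')};base64,{source.get('data')}"
            else:
                url = source.get("url")
            items.append({"type": "input_image", "image_url": url})
        # Other block types are dropped in this sketch (assumed behavior).
    return items

converted = chat_content_to_responses_content(
    [
        {"type": "text", "text": "what number is this?"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,AAAA"}},
    ]
)
```

The point of the mapping is that the API receives a list of typed items rather than the `str()` of a Python list.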

run_conversation only accepted ``user_message: str``, which forced the gateway to flatten any image attachments to a text description via the vision_analyze tool, losing the actual pixels even when the target model had native vision support.

Widen the type to ``Union[str, List[Dict[str, Any]]]``. The list shape is the standard OpenAI multimodal content format (text + image_url blocks), which the chat_completions adapter passes through natively and the codex_responses adapter now converts via the helper added in the previous commit.

Required adjustments:

- _user_message_preview(): static helper that flattens str-or-list into a one-line preview (text + [image]/[audio] markers) for log lines and the "💬 Starting conversation" banner that previously did naked string slicing.
- _sanitize_user_message_in_place(): walks list content, sanitizes surrogates inside text blocks, leaves image/audio blocks untouched.
- original_user_message is now always coerced to a string before being passed to plugin hooks, _looks_like_codex_intermediate_ack, _save_trajectory, and the memory manager. These are all string-only consumers that don't need (and would break on) raw multimodal.

The user_msg dict construction at the API boundary already worked for both shapes since it just wraps content as-is.

Anthropic native API path will still re-flatten via _prepare_anthropic_messages_for_api until the next commit makes it capability-aware. chat_completions and codex paths get native multimodal immediately.

Tests: 14 cases: preview helper (string, list, image-only, audio, anthropic image block, truncation), sanitize helper (surrogates in nested text fields, image fields untouched, no in-place mutation), end-to-end run_conversation with list content reaching the chat completions API as a list, plus a backwards-compat test that plain string content still produces string content.
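A rough sketch of the preview flattening this commit message describes, assuming `[image]` and `[audio]` markers and a simple truncation rule. The function name, the `limit` parameter and the `...` suffix are illustrative choices, not the actual `_user_message_preview()` implementation.

```python
from typing import Any, Dict, List, Union

UserMessage = Union[str, List[Dict[str, Any]]]

def user_message_preview(message: UserMessage, limit: int = 80) -> str:
    """Flatten str-or-list content into a single log-friendly line."""
    if isinstance(message, str):
        text = message
    else:
        parts = []
        for block in message:
            btype = block.get("type")
            if btype in ("text", "input_text"):
                parts.append(block.get("text", ""))
            elif btype in ("image_url", "input_image", "image"):
                parts.append("[image]")
            elif btype in ("input_audio", "audio"):
                parts.append("[audio]")
        text = " ".join(parts)
    text = " ".join(text.split())  # collapse newlines and repeated whitespace
    return text if len(text) <= limit else text[: limit - 3] + "..."

preview = user_message_preview(
    [
        {"type": "text", "text": "fix this\nlayout"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,AAAA"}},
    ]
)
print(preview)  # fix this layout [image]
```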

…apable models

_prepare_anthropic_messages_for_api unconditionally flattened image content blocks via _preprocess_anthropic_content, which calls vision_analyze_tool to produce a text description. This was a sane fallback for older Anthropic models without image input, but for Claude 3+, Opus 4.6, Sonnet 4.6, and every other vision-capable Anthropic-compatible model the legacy path:

- Made an extra LLM call to a vision describer (slower, more cost)
- Lost pixel-level information (text descriptions are lossy)
- Made vision-capable models behave like blind ones for webdev, pixel-perfect debugging, OCR, etc.

The anthropic_adapter at agent/anthropic_adapter.py:829 already converts image_url/input_image content blocks into Anthropic's native ``{"type": "image", "source": ...}`` format. Skip the legacy preprocess when the model declares native vision support and let the adapter do its job.

Add ``_model_supports_native_vision()`` that:

- Caches the lookup result per (provider, model) tuple
- Looks up via ``agent.models_dev.get_model_capabilities``
- Returns False on lookup failure (safe legacy fallback)
- Honors ``HERMES_FORCE_NATIVE_VISION=1`` env var for self-hosted models that aren't catalogued in models.dev (e.g. vLLM serving Llama 3.2 Vision)

Tests: 6 new cases: capability cache, unknown-model fallback, env var override, lookup-exception safety, vision-capable passthrough (verifies vision_analyze_tool is NOT called and image blocks are preserved), non-vision flatten (verifies legacy path still works when capability says False).

Existing TestAnthropicImageFallback tests still pass. The agent fixture has an empty model name, so capability lookup returns None and the legacy fallback path runs unchanged.

When a user attached an image and the active model declares native vision support, build a typed content list (text + image_url blocks) and pass it through the agent unchanged so the model receives the actual pixels. Models without native vision keep the legacy vision_analyze fallback that flattens images to text descriptions.

This is the user-visible end of the native vision feature: bug fixes 1-3 made the persistence, token estimator, and codex API path multimodal-safe; feature commits 4-5 widened run_conversation and made the anthropic preprocess capability-aware. Now the gateway actually routes the image content the right way at the platform boundary.

Three new helpers on GatewayRunner:

- _message_preview_for_hook(message): flattens str-or-list to text for the agent:start hook payload, log lines, and the auto-title generator (which expects a plain string).
- _should_use_native_vision_for_source(source): resolves the active model+provider via _resolve_session_agent_runtime, looks up vision support via agent.models_dev.get_model_capabilities, and honors HERMES_FORCE_NATIVE_VISION=1 for self-hosted vision models not catalogued in models.dev.
- _build_native_vision_content(text, paths): reads each image into a base64 data URL via the existing tools.vision_tools helper and emits OpenAI-style image_url blocks. The user's caption becomes the first text block. Bad image paths are skipped with a warning rather than failing the whole request.

_prepare_inbound_message_text return type is widened to Optional[Any] (string OR list of content blocks). The two callers and _run_agent's ``message`` parameter all accept the wider type. The pending model-switch-note prepend at line 8049 is updated to prepend a text block when message is a list, so multimodal content isn't broken by string concatenation.

Tests: 15 cases: preview helper (5 shapes), capability resolver (force env var, vision-capable, non-vision, unknown model, empty model name, runtime resolution exception), content builder (text+image, image-only, multiple images, unreadable image graceful skip).
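As a rough illustration of the content builder described above, the sketch below reads local files into base64 data URLs and emits OpenAI-style `image_url` blocks, skipping unreadable paths with a warning. It uses only the standard library rather than the `tools.vision_tools` helper the commit refers to, so the MIME guessing and the function name are assumptions, and it omits any resizing.

```python
import base64
import logging
import mimetypes
import tempfile
from pathlib import Path

logger = logging.getLogger("vision_sketch")

def build_native_vision_content(text, paths):
    """Caption first, then one image_url block per readable image."""
    blocks = []
    if text:
        blocks.append({"type": "text", "text": text})
    for path in paths:
        try:
            data = Path(path).read_bytes()
        except OSError as exc:
            # A bad path should not fail the whole request.
            logger.warning("skipping unreadable image %s: %s", path, exc)
            continue
        mime = mimetypes.guess_type(str(path))[0] or "image/png"
        encoded = base64.b64encode(data).decode("ascii")
        blocks.append(
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}}
        )
    return blocks

# Usage: one readable file and one missing file.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "shot.png"
    good.write_bytes(b"\x89PNG placeholder bytes")
    content = build_native_vision_content(
        "what's wrong with this layout?", [good, Path(tmp) / "missing.png"]
    )
```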

The new gateway native vision routing auto-detects model capability via models.dev. Self-hosted vision models (vLLM serving Llama 3.2 Vision, etc.) and brand-new models not yet catalogued in models.dev need an opt-in escape hatch — add HERMES_FORCE_NATIVE_VISION to the env var reference table next to the AUXILIARY_VISION_* entries.

The CLI's chat() method always called _preprocess_images_with_vision when the user attached an image, which routes the image through the auxiliary vision model (Gemini Flash) to produce a text description and prepends that to the user's message. The actual pixels never reached the main model, even when running Claude Opus 4.6 or another vision-capable model that would have done a better job natively.

The vision.md user guide also incorrectly claimed images were sent "as base64-encoded content blocks, so any vision-capable model can process them". That was the intent but never the reality.

Mirror the gateway native vision routing here:

- Add _build_native_vision_content_cli(text, images): reads each attached image into a base64 data URL via the existing _image_to_base64_data_url helper and emits OpenAI-style image_url content blocks that the provider adapter converts to native form.
- chat() branches on agent._model_supports_native_vision() (the capability check added in commit 5). Native-capable models receive list content; everything else still gets the legacy text-flatten path so non-vision and unknown models keep working unchanged.
- Capability-check exceptions and missing agent both fall back to the legacy path safely.

Tests: 9 cases: content builder (text+image, image-only, multiple images, missing image graceful skip, all-bad fallback to placeholder string), and chat() routing (vision-capable uses native, non-vision uses legacy, capability exception falls back, no agent falls back).

The previous vision.md claimed images were sent as base64 content blocks "so any vision-capable model can process them". That was the intended behavior, but the actual code (both CLI and gateway) always routed images through the auxiliary vision model and prepended a text description. The docs were aspirational, not factual.

Now that the CLI and gateway both implement capability-aware routing, update the docs to accurately describe what happens:

- Add a "How It Works" enumeration of the two paths (native vision vs vision_analyze fallback) so users understand which model gets what shape.
- Add a "Messaging Platforms" section explaining that the same routing applies to Telegram, Discord, Matrix, etc. Webdev workflows now work end-to-end through messaging because the model receives actual pixels.
- Add a "Self-Hosted & Uncatalogued Vision Models" section documenting HERMES_FORCE_NATIVE_VISION=1 for vLLM-served Llama 3.2 Vision and other models that aren't yet in models.dev.
- Rewrite the "Supported Models" section to list models confirmed to use the native path and explain the legacy fallback for non-vision models, including the AUXILIARY_VISION_* override pointer.

…or providers

Two robustness gaps in the capability lookup pipeline became visible once we tested against the Nous endpoint with claude-sonnet-4.6:

1. **Dot vs hyphen mismatch**. Anthropic's catalog stores ``claude-sonnet-4-6`` (hyphens) while OpenRouter's catalog stores ``anthropic/claude-sonnet-4.6`` (dots). _find_model_entry only did exact and case-insensitive matching, so a dotted query against the hyphenated catalog returned None even though the model is the same.
2. **Aggregator providers not in models.dev**. ``nous``, custom OpenAI-compatible proxies, and similar aggregators aren't in PROVIDER_TO_MODELS_DEV at all, so the lookup short-circuits to None. But the model name typically carries an upstream vendor prefix (``anthropic/claude-sonnet-4.6``) that points at a real catalogued vendor, so we just need to follow the slug.

Fixes:

- ``_find_model_entry``: add a third matching pass that normalizes both dots and hyphens to underscores before comparing. ``4.6`` and ``4-6`` both become ``4_6`` and resolve to the same entry. Case-insensitive too. Unrelated families still don't match because their base names differ.
- ``_model_supports_native_vision`` (run_agent.py) and ``_should_use_native_vision_for_source`` (gateway/run.py): when the direct ``(provider, model)`` lookup returns None and the model name contains a slash, split on ``/`` and try the upstream vendor's catalog. ``("nous", "anthropic/claude-sonnet-4.6")`` → fall back to ``("anthropic", "claude-sonnet-4.6")`` → matches via the new normalization → returns supports_vision=True.

End-to-end verified against the Nous endpoint: ``_model_supports_native_vision()`` was returning False for ``nous + anthropic/claude-sonnet-4.6`` before this commit; now returns True and the gateway routes images natively to Sonnet 4.6.

Tests: 4 dot/hyphen normalization cases (dotted query matches hyphen catalog, hyphen query backwards-compat, uppercase + dot combined, unrelated family stays unmatched), 3 prefix-fallback cases on AIAgent (unmapped provider with vendor slug → True, no slash → False, unknown vendor → False), 2 prefix-fallback cases on GatewayRunner (aggregator with known vendor → True, unknown vendor → False).
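The three matching passes (exact, case-insensitive, dot/hyphen-normalized) can be sketched against a plain dict catalog. The dict-based catalog shape and the un-prefixed function names are assumptions for illustration; the real `_find_model_entry` works on the models.dev data structures.

```python
def normalize_model_id(model_id: str) -> str:
    """Lowercase and map both dots and hyphens to underscores."""
    return model_id.lower().replace(".", "_").replace("-", "_")

def find_model_entry(catalog: dict, model: str):
    # Pass 1: exact match.
    if model in catalog:
        return catalog[model]
    # Pass 2: case-insensitive match.
    lowered = model.lower()
    for key, entry in catalog.items():
        if key.lower() == lowered:
            return entry
    # Pass 3: dot/hyphen-insensitive match ("4.6" and "4-6" both become "4_6").
    wanted = normalize_model_id(model)
    for key, entry in catalog.items():
        if normalize_model_id(key) == wanted:
            return entry
    return None

# Hypothetical catalog using the hyphenated ID form.
catalog = {"claude-sonnet-4-6": {"supports_vision": True}}

dotted = find_model_entry(catalog, "claude-sonnet-4.6")
shouty = find_model_entry(catalog, "Claude-Sonnet-4.6")
other = find_model_entry(catalog, "claude-haiku-4.6")
```

Because the whole ID is compared after normalization, a different family name such as `claude-haiku-4.6` still fails to match.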

Testing the previous capability fix against the Nous endpoint with the full 9-model lineup revealed remaining false negatives:

  moonshotai/kimi-k2.5                  api=PASS  hermes=NO
  mistralai/mistral-small-2603          api=PASS  hermes=NO
  mistralai/mistral-small-3.2-24b-...   api=PASS  hermes=NO
  z-ai/glm-4.5v                         api=PASS  hermes=NO

Root cause: the vendor-prefix fallback from the previous commit split ``moonshotai/kimi-k2.5`` into ``("moonshotai", "kimi-k2.5")`` but the moonshotai catalog in models.dev is empty (0 entries). Mistral's slug uses ``mistralai/`` but the catalog ID is ``mistral`` with different model version names. The upstream vendor fallback couldn't recover any of these.

The OpenRouter catalog, on the other hand, has all four models because Nous, custom proxies, and most aggregators share the OpenRouter slug format: ``moonshotai/kimi-k2.5``, ``mistralai/mistral-small-2603``, ``z-ai/glm-4.5v``, etc. Querying the OpenRouter catalog with the full slug is a strong catch-all.

Add a third fallback layer to both ``_model_supports_native_vision`` (run_agent.py) and ``_should_use_native_vision_for_source`` (gateway/run.py):

1. Direct (provider, model): unchanged
2. Vendor prefix strip (vendor, vendor_model): unchanged
3. OpenRouter aggregator: ("openrouter", model) when model has a slash prefix AND provider is not already "openrouter" (avoids redundant double-lookup for direct OR users)

After this commit, all 8 vision-capable models Nous actually serves are correctly detected by Hermes: Claude Opus/Sonnet 4.6, GPT-5.4 Mini, Gemini 3 Flash, Kimi K2.5, Mistral Small (2603 + 3.2-24b), and GLM-4.5v. The only remaining mismatch is ``mistralai/pixtral-large-2411`` because it's missing from the models.dev OpenRouter catalog entirely. Users hitting that model can set HERMES_FORCE_NATIVE_VISION=1 as an escape hatch.

Tests: 3 new cases on AIAgent (openrouter catalog fallback for unmapped aggregator, respect non-vision flag from OR catalog, skip the redundant second lookup when provider is already openrouter), 1 new case on GatewayRunner.
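The three-layer lookup order described above can be sketched as a candidate list tried in sequence. The `get_caps` callable and its `supports_vision` key stand in for `agent.models_dev.get_model_capabilities` and are assumptions, as is the choice to stop at the first catalog that knows the model.

```python
def supports_native_vision(provider: str, model: str, get_caps) -> bool:
    """Try direct, vendor-prefix, then OpenRouter lookups; default to False."""
    candidates = [(provider, model)]                  # 1. direct
    if "/" in model:
        vendor, vendor_model = model.split("/", 1)
        candidates.append((vendor, vendor_model))     # 2. vendor prefix strip
        if provider != "openrouter":
            candidates.append(("openrouter", model))  # 3. aggregator catch-all
    for candidate_provider, candidate_model in candidates:
        caps = get_caps(candidate_provider, candidate_model)
        if caps is not None:
            # First catalog that knows the model decides, including a "no".
            return bool(caps.get("supports_vision"))
    return False

# Hypothetical catalog contents for illustration only.
fake_catalog = {
    ("anthropic", "claude-sonnet-4.6"): {"supports_vision": True},
    ("openrouter", "moonshotai/kimi-k2.5"): {"supports_vision": True},
    ("openrouter", "example/text-only"): {"supports_vision": False},
}
lookups = []

def get_caps(provider, model):
    lookups.append((provider, model))
    return fake_catalog.get((provider, model))

via_vendor = supports_native_vision("nous", "anthropic/claude-sonnet-4.6", get_caps)
via_openrouter = supports_native_vision("nous", "moonshotai/kimi-k2.5", get_caps)
```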

…d regression

Bug 2's original fix replaced ``len(str(msg))`` with a field-walking counter in count_message_chars. That fixed the ~100x overcount for messages carrying base64 image payloads but introduced a new problem: the new counter didn't include dict serialization overhead (braces, quoted field names, separators) that the legacy formula implicitly counted. For tool-heavy text conversations the new counter reported roughly 26% of the old value, a 74% undercount that would push preflight compression, budget pre-flight checks, and the context compressor to fire much later than they used to.

This is a real regression for non-vision users on smaller context windows (32K Kimi, local Gemma, etc.) because it means the estimator under-reports until the real context limit is almost hit.

Fix: add a per-message overhead constant (30 chars) to approximate the dict braces + ``'role':``/``'content':`` wrappers, plus a per-tool-call wrapper overhead (40 chars) for tool-calling turns. These constants were tuned by diffing the field-walking count against ``len(str(msg))`` for typical chat / tool / multi-turn conversations; the result now stays within ~10% of the legacy formula for text/tool cases (well inside the noise of a ``chars/4`` rough estimator) while keeping the massive 173x savings for base64-bearing multimodal turns.

Empirical comparison on a realistic 9-turn tool conversation:

  legacy len(str):  894 chars / 224 tokens
  before this fix:  274 chars / 69 tokens   (31%, regression)
  after this fix:   809 chars / 203 tokens  (91%, accurate)

And on a 1MB base64 image message:

  legacy len(str):  1,048,728 chars / 262,182 tokens  (broken)
  after this fix:   6,056 chars / 1,514 tokens        (correct)

Tests: updated 6 existing concrete-value expectations to match the new overhead constants, kept the upper-bound assertions on image messages (< 10,000 chars) that already had margin baked in.
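To see why a fixed overhead closes most of the gap, the sketch below compares `len(str(msg))` with a field-walking count on a tiny text message. The constants come from the commit message; the helper names and the choice to count only `content` plus tool-call name and arguments are assumptions of this sketch.

```python
PER_MESSAGE_OVERHEAD = 30    # approximates braces and 'role'/'content' wrappers
PER_TOOL_CALL_OVERHEAD = 40  # approximates the tool-call wrapper dict

def field_chars(msg: dict) -> int:
    """Count only payload fields (assumed: content text, tool-call name and arguments)."""
    content = msg.get("content")
    total = len(content) if isinstance(content, str) else 0
    for call in msg.get("tool_calls") or []:
        fn = call.get("function", {})
        total += len(fn.get("name", "")) + len(fn.get("arguments", ""))
    return total

def count_with_overhead(msg: dict) -> int:
    calls = msg.get("tool_calls") or []
    return field_chars(msg) + PER_MESSAGE_OVERHEAD + PER_TOOL_CALL_OVERHEAD * len(calls)

msg = {"role": "user", "content": "hello"}
legacy = len(str(msg))               # 36: "{'role': 'user', 'content': 'hello'}"
walked = field_chars(msg)            # 5: just "hello"
adjusted = count_with_overhead(msg)  # 35: 5 + 30
```

On this one message the adjusted count lands one character below the legacy value, while the bare field count would have reported 5.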

Polish pass after self-review of the native vision work:

1. **Auto-resize large images before native send**. Both gateway and CLI were calling ``_image_to_base64_data_url`` directly, which encodes the image at its original resolution. A user dropping a 4K screenshot would get a 13MB base64 payload sent to the model, costing ~2000+ tokens per image on OpenAI high-detail vs ~1500 expected. Switch to ``_resize_image_for_vision``, which is already battle-tested (it's what the legacy ``_enrich_message_with_vision`` path uses) and auto-resizes to the standard ~5MB budget when the encoded size would exceed it.

2. **Set ``detail: "auto"`` explicitly on image_url blocks**. Without an explicit detail value, providers can default to "high" for large images, which can silently double the token cost. Setting "auto" lets the provider pick based on resolution and stays consistent with what users see in typical OpenAI-compatible clients.

3. **Route logging for observability**. Both the gateway and CLI decision points now emit a ``logger.info("[vision] route=... ...")`` line so operators can trace whether a given image-bearing turn went through the native path or the legacy vision_analyze fallback. Previously there was no way to debug "why did Claude say it can't see my image?" without reading the source.

4. **Type annotation fix**. ``_build_native_vision_content`` was declared as returning ``List[Dict[str, Any]]`` but the fallback branch (all images failed to encode) returns the caption string. Widen to ``Union[str, List[Dict[str, Any]]]`` so the annotation matches reality; mypy and other strict type-checkers would otherwise complain.

5. **Caption-only fallback returns a string, not a wrapped list**. Previously, if no images encoded successfully but the caption was non-empty, the helper returned a list with a single text block. That's semantically weird (a list with no image content has no reason to be a list) and confused the edge-case test. Return the caption string directly; let the regular text path handle it.

Tests: 2 updated skip-image tests to assert the new string fallback, 2 new tests covering the ``detail: "auto"`` contract on both the gateway and CLI helpers.
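Items 2 through 5 can be illustrated together with a small sketch. It assumes OpenAI-style ``image_url`` content parts and that images which failed to encode arrive as ``None`` or an empty string; the function name, its signature, and the exact log fields are hypothetical stand-ins, not the PR's ``_build_native_vision_content``.

```python
import logging
from typing import Any, Dict, List, Optional, Union

logger = logging.getLogger(__name__)


def build_native_vision_content(
    caption: str, data_urls: List[Optional[str]]
) -> Union[str, List[Dict[str, Any]]]:
    """Build content parts for a vision turn; fall back to the bare caption."""
    image_parts = [
        # detail is set explicitly so the provider default never applies.
        {"type": "image_url", "image_url": {"url": url, "detail": "auto"}}
        for url in data_urls
        if url  # None or "" marks an image that failed to encode
    ]
    if not image_parts:
        # No usable image: return a plain string so the text path handles it.
        logger.info("[vision] route=text-fallback images=0")
        return caption
    logger.info("[vision] route=native images=%d", len(image_parts))
    parts: List[Dict[str, Any]] = []
    if caption:
        parts.append({"type": "text", "text": caption})
    return parts + image_parts
```

Returning ``Union[str, List[...]]`` keeps the caller's branching simple: a string goes down the ordinary text path, and a list is sent as multimodal content.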

…ient

The auxiliary client's _try_openrouter() helper read the OpenRouter API key from OPENROUTER_API_KEY but built the OpenAI client against the hardcoded https://openrouter.ai/api/v1 endpoint, ignoring any OPENROUTER_BASE_URL env override that the main agent path respects.

The visible failure mode is subtle and confusing:

1. User runs Hermes with OPENROUTER_API_KEY pointed at an alternate OR-compatible endpoint (Nous Portal, custom proxy) and sets OPENROUTER_BASE_URL accordingly.
2. Main agent works because the gateway runtime resolver respects the env override.
3. Vision auto-resolver tries OpenRouter, builds a client against the canonical openrouter.ai endpoint with the wrong API key.
4. OR returns 401, credential pool marks the entry exhausted.
5. ``check_vision_requirements()`` returns False on the next call.
6. ``vision_analyze`` is silently de-registered from the agent's toolset (it has a check_fn).
7. When the user later sends an image to a non-vision model, the gateway's legacy fallback path tells the agent to "use vision_analyze", but the tool isn't even in the agent's tool list. The agent improvises with browser_vision (Playwright missing) or read_file (binary), and a poorly-aligned model can then hallucinate a fabricated image description instead of admitting failure.

This was reproduced empirically: routing to Nous via OPENROUTER_BASE_URL and asking xiaomi/mimo-v2-pro about a manga panel resulted in a confident "Yemeksepeti food delivery bag" hallucination on one run and "Xiaomi smartphone interface" on the next, neither of which had any relationship to the actual image.

Fix: read OPENROUTER_BASE_URL from env when building the auxiliary OpenRouter client (same pattern the main agent path uses), falling back to the hardcoded default when unset. The pool path already honors per-entry base_url overrides; this change brings the env-var path in line.

Tests: 4 cases: env override applied, default fallback, no API key short-circuit, empty string treated as unset.

…l seeding

The previous commit fixed the `_try_openrouter()` env-var path to honor OPENROUTER_BASE_URL. But the credential pool path takes precedence, and the openrouter-specific seeding branch in `_seed_from_env()` was hardcoded to use the canonical openrouter.ai URL regardless of any env override:

    if provider == "openrouter":
        token = os.getenv("OPENROUTER_API_KEY", "").strip()
        if token:
            ...
            "base_url": OPENROUTER_BASE_URL,  # ← hardcoded, ignored env
            ...

The generic seeding path further down DOES read the per-provider base URL env var (via `pconfig.base_url_env_var`), but the openrouter branch early-returns before reaching it.

Effect for users routing OPENROUTER_API_KEY through an alternate OR-compatible endpoint (Nous Portal, custom proxy):

1. First gateway start: pool seeds an entry with the alternate endpoint key + the canonical openrouter.ai URL.
2. Auxiliary vision call goes to openrouter.ai with the wrong key.
3. OpenRouter returns 401 ``Missing Authentication header``.
4. Pool entry persists with stale base_url even after env vars are corrected; restarting Hermes doesn't help unless the user manually deletes auth.json's openrouter pool section.

This was reproduced empirically on the Nous endpoint: with OPENROUTER_API_KEY=sk-bpby... and OPENROUTER_BASE_URL pointed at Nous, auxiliary vision_analyze calls returned 401 every time even though the env vars were correct. Inspecting auth.json showed the pool entry had base_url ``https://openrouter.ai/api/v1``, the hardcoded constant from this seeding function.

Fix: read OPENROUTER_BASE_URL the same way the generic path reads ``pconfig.base_url_env_var``, fall back to the hardcoded default when unset, and strip trailing slashes.

Tests: 4 cases: env override applied (with and without trailing slash), default fallback when env var unset, default fallback on explicit empty string.
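The base-URL resolution described in this commit (env override, default fallback, empty string treated as unset, trailing slashes stripped) might look roughly like the sketch below. The function names, the ``env`` parameter used for testability, and the ``api_key`` field of the seeded entry are assumptions for illustration; only the env var names, the ``base_url`` key, and the default URL come from the commit messages.

```python
import os
from typing import Dict, Mapping, Optional

DEFAULT_OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"


def resolve_openrouter_base_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the OPENROUTER_BASE_URL override, or the default when unset/empty."""
    env = os.environ if env is None else env
    override = env.get("OPENROUTER_BASE_URL", "").strip().rstrip("/")
    return override or DEFAULT_OPENROUTER_BASE_URL


def seed_openrouter_entry(
    env: Optional[Mapping[str, str]] = None,
) -> Optional[Dict[str, str]]:
    """Build a pool entry from env vars; None when no API key is configured."""
    env = os.environ if env is None else env
    token = env.get("OPENROUTER_API_KEY", "").strip()
    if not token:
        return None  # short-circuit: nothing to seed without a key
    # "api_key" is a hypothetical field name; the source elides the other keys.
    return {"api_key": token, "base_url": resolve_openrouter_base_url(env)}
```

Resolving the URL in one helper means the env-var path and the pool-seeding path cannot drift apart again, which is the failure this commit and the previous one describe.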

linxule · 2026-04-23T10:03:14Z

+1 — this is the architecture I've been trying to achieve via aux-vision config and couldn't, because that subsystem has no fallback list. Native-vision-through-main-model is the right shape.

One data point if useful for the test matrix: kimi-for-coding on api.kimi.com/coding/v1 (Kimi's subscription coding endpoint, distinct from the moonshotai/kimi-k2.5 you already tested) also accepts native multimodal — verified via direct curl, K2.6 reads images correctly. Endpoint quirk worth noting in docs: it rejects http(s):// image URLs with "unsupported image url" and requires base64 data: URLs — which _build_native_vision_content already emits, so no code change needed. Happy to test the branch end-to-end against a Kimi+Codex setup if it'd help unstick this.
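For readers unfamiliar with the base64 ``data:`` URL form mentioned above, a minimal sketch of producing one from raw image bytes follows. The helper name and its error handling are hypothetical; it is not the project's ``_image_to_base64_data_url``, whose behavior is not shown here.

```python
import base64
import mimetypes


def image_bytes_to_data_url(data: bytes, filename: str) -> str:
    """Encode raw image bytes as a base64 data: URL, inferring MIME from the name."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from {filename!r}")
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

A URL built this way embeds the pixels in the request itself, so an endpoint that refuses to fetch http(s):// image URLs can still accept it.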

teknium1 · 2026-05-10T01:09:42Z

Closing as superseded: native multimodal vision routing for vision-capable main models shipped in #16506 (commit ec671c4, "feat(image-input): native multimodal routing based on model vision capability"). It reaches the same architectural goal this PR proposed: capability-aware routing of inbound user-attached images as native image_url content parts when the active model declares supports_vision=true, with the legacy vision_analyze enrichment retained as the fallback for non-vision models. The shipped implementation lives in agent/image_routing.py (decide_image_input_mode, build_native_content_parts) and is wired into the CLI, gateway, and TUI gateway paths.

Thanks for the work. The design exploration here directly informed the shipped version. Closing in favor of #16506 to consolidate the open-PR list.

0xbyt4 added 15 commits April 12, 2026 18:28

LiuYangArt mentioned this pull request Apr 26, 2026

[Bug]: stale session override api_mode can misroute custom /v1 image turns to anthropic_messages #16000

Open

teknium1 closed this May 10, 2026

0xbyt4 wants to merge 15 commits into NousResearch:main from 0xbyt4:feat/native-multimodal-vision

4 participants

Summary

Why This Matters

Problem Scope

Commit Walkthrough

Bug fixes (layer-by-layer correctness)

Feature commits (native vision path)

Capability detection robustness (discovered during real-API testing)

Docs

End-to-End Verification (Nous endpoint)

Backwards Compatibility

Text-only messages — zero behavior change

Image messages + non-vision model — legacy path unchanged

Image messages + vision model — new native path

Token estimator regression caught during testing

Test Coverage

Known Limitations

Env Vars Added

Test plan
