
feat: native multimodal vision routing for vision-capable models #8610

Closed
0xbyt4 wants to merge 15 commits into NousResearch:main from 0xbyt4:feat/native-multimodal-vision

@0xbyt4 0xbyt4 commented Apr 12, 2026

Summary

Hermes currently routes every image attachment through the auxiliary vision model (Gemini Flash by default) to produce a text description, regardless of whether the active model has native vision support. Claude Opus 4.6, GPT-5.4, Gemini 3 Flash, Xiaomi MiMo Omni, and every other vision-capable model receive a lossy text summary instead of the actual pixels.

This PR makes the CLI, gateway, and agent loop capability-aware: when the active model declares native vision per models.dev, images flow through the API as typed content blocks (image_url / input_image / Anthropic image). Models without native vision keep the legacy vision_analyze fallback unchanged. Zero regressions for text-only or non-vision use cases.

Why This Matters

Real webdev workflow today:

  1. User drops a screenshot of a broken layout into Telegram
  2. Gateway calls vision_analyze → Gemini Flash produces "I see a navbar at the top..."
  3. That text description is prepended to the user's message
  4. Claude Sonnet 4.6 receives only the text description and is asked "what's wrong?"
  5. Claude cannot diagnose pixel-level issues because it never sees the pixels

After this PR:

  1. Gateway detects Sonnet 4.6 has native vision
  2. Builds a multimodal content list (text + image_url)
  3. Sonnet 4.6 receives the actual pixels and diagnoses the 20px offset directly

Verified end-to-end against the Nous endpoint: models that previously saw text now read pixel-level content (see test matrix below).

Problem Scope

The naive "just add a content list at the gateway" approach doesn't work because the pipeline has five independent layers that each flatten multimodal content to text:

  1. Persistence (hermes_state.py): SQLite content TEXT column stores raw str(content) → lists become Python repr blobs → session reload corrupted
  2. Token estimation (agent/model_metadata.py): len(str(msg)) counts base64 as text → 1MB image reports as ~262K tokens → compression wipes the image immediately
  3. Codex Responses API path (run_agent.py): str(content) coerces list to repr → API receives literal "[{'type': 'image_url', ...}]" as user text
  4. run_conversation signature (run_agent.py): user_message: str → no way to pass multimodal through in the first place
  5. Anthropic preprocess (run_agent.py): _prepare_anthropic_messages_for_api unconditionally converts image content blocks to text descriptions via _describe_image_for_anthropic_fallback, even when the model has native vision

Plus the gateway itself unconditionally called _enrich_message_with_vision for every image, and the CLI's chat() method did the same via _preprocess_images_with_vision.
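
The flattening described in layers 1 and 3 is easy to reproduce in isolation. The snippet below is a minimal sketch, not code from this PR: the `content` value is a made-up example in the OpenAI-style block format, and `is_valid_json` is a hypothetical helper added only for the demonstration.

```python
import json

# Hypothetical multimodal user content (OpenAI-style content blocks).
content = [
    {"type": "text", "text": "what is wrong with this layout?"},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBORw0KGgo="}},
]

# str() on a list produces a Python repr with single-quoted keys.
flattened = str(content)


def is_valid_json(s: str) -> bool:
    """Return True if the string parses as JSON."""
    try:
        json.loads(s)
        return True
    except json.JSONDecodeError:
        return False


# json.dumps, by contrast, round-trips the structure losslessly.
serialized = json.dumps(content)
restored = json.loads(serialized)

print(flattened[:40])
print(is_valid_json(flattened), restored == content)
```

The repr string cannot be parsed back as JSON, which is why a session reload or an API call that receives it sees opaque text rather than an image block.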

Each layer needs its own targeted fix; bundling them would make the PR unreviewable. This PR is split into 12 logically-ordered commits — 5 bug fixes, 4 feature commits, 2 docs, 1 regression fix discovered during testing.

Commit Walkthrough

Bug fixes (layer-by-layer correctness)

| # | Commit | What it fixes |
|---|---|---|
| 1 | fix(hermes_state): persist multimodal content blocks for vision sessions | Add schema v7 content_blocks TEXT column; round-trip list content losslessly while keeping flattened searchable text in content for FTS5 and legacy rows |
| 2 | fix(model_metadata): image-aware token estimation for multimodal turns | Replace len(str(msg)) with count_message_chars() that skips base64 payloads; applies ~1500-token budget per image instead of ~262K |
| 3 | fix(run_agent): preserve multimodal content in codex responses input | _chat_content_to_responses_content() walks content lists and emits input_text / input_image items in the shape the Codex Responses API expects |

Feature commits (native vision path)

| # | Commit | What it adds |
|---|---|---|
| 4 | feat(run_agent): accept multimodal list user_message in run_conversation | Widen type to Union[str, List[Dict[str, Any]]]; _user_message_preview() + _sanitize_user_message_in_place() helpers; flatten to text for string-only downstream consumers (plugins, memory, trajectory) |
| 5 | feat(run_agent): skip anthropic image-to-text preprocess for vision-capable models | _model_supports_native_vision() with result caching and HERMES_FORCE_NATIVE_VISION escape hatch; _prepare_anthropic_messages_for_api skips flatten when capability says yes |
| 6 | feat(gateway): native multimodal routing for vision-capable models | Capability check in _prepare_inbound_message_text; _build_native_vision_content reads local image files into base64 data URLs and emits OpenAI-style content blocks; both callers + _run_agent now accept Union[str, list] |
| 7 | feat(cli): native multimodal routing for vision-capable models | Mirror the gateway routing in the CLI's chat() method so /paste and drag-and-drop images get native vision too |

Capability detection robustness (discovered during real-API testing)

| # | Commit | What it fixes |
|---|---|---|
| 8 | fix(native_vision): catalog matching for dotted versions and aggregator providers | _find_model_entry normalizes dots/hyphens (claude-sonnet-4.6 → claude-sonnet-4-6); provider-prefix strip fallback resolves ("nous", "anthropic/claude-sonnet-4.6") → ("anthropic", "claude-sonnet-4.6") |
| 9 | fix(native_vision): OpenRouter catalog as aggregator catch-all fallback | Third fallback layer: query ("openrouter", full_slug) when direct + vendor-prefix both fail. Catches Kimi, Mistral-small, GLM-4.5v, etc. that use OR-compatible slugs but aren't in their native vendor catalog |
| 12 | fix(model_metadata): add dict overhead to count_message_chars to avoid regression | The rewrite in commit 2 was reporting ~26% of the legacy len(str(msg)) for tool-heavy text messages. Add per-message (30 chars) and per-tool-call (40 chars) overhead to keep the estimator within ~10% of legacy for non-image cases |

Docs

| # | Commit |
|---|---|
| 10 | docs(env): document HERMES_FORCE_NATIVE_VISION env var |
| 11 | docs(vision): document capability-aware native vision routing |

End-to-End Verification (Nous endpoint)

Test script: create a PNG with a recognizable 4-digit number, send to each model via chat.completions.create with a native image_url content block, check whether the model reads the number back correctly.

| Model | API accepts vision | Hermes detects capability | Note |
|---|---|---|---|
| anthropic/claude-sonnet-4.6 | | | |
| anthropic/claude-opus-4.6 | | | |
| openai/gpt-5.4-mini | | | |
| google/gemini-3-flash-preview | | | |
| google/gemma-4-31b-it | | | |
| google/gemma-4-26b-a4b-it | | | |
| google/gemma-3-4b-it | | | |
| moonshotai/kimi-k2.5 | | | Fixed by commit 9 (OpenRouter fallback; moonshotai catalog empty in models.dev) |
| mistralai/mistral-small-2603 | | | Fixed by commit 9 (slug mistralai/ vs catalog mistral) |
| mistralai/mistral-small-3.2-24b-instruct | | | Fixed by commit 9 |
| z-ai/glm-4.5v | | | Fixed by commit 9 (slug z-ai/ vs catalog zai) |
| xiaomi/mimo-v2-omni | | | Described the blue background AND read the number |
| mistralai/pixtral-large-2411 | | | Missing from models.dev OpenRouter catalog entirely; users hitting this model can set HERMES_FORCE_NATIVE_VISION=1 |

Before this PR: 0/13 received native vision — every image was text-flattened regardless of model capability.
After this PR: 12/13 automatically routed to native vision; 1 edge case covered by the env var escape hatch.

Backwards Compatibility

This is the section I want reviewers to scrutinize hardest, because image-handling touches so many layers.

Text-only messages — zero behavior change

  • hermes_state.append_message(content="hello") → content_blocks is NULL, content TEXT stores "hello" → read path returns "hello" unchanged
  • run_conversation(user_message="hello") → user_msg = {"role": "user", "content": "hello"} (unchanged)
  • _prepare_anthropic_messages_for_api([...]) → early return if not any(... has_image_parts ...) (unchanged)
  • _chat_messages_to_responses_input([...]) → string input falls through the new helper and returns unchanged
  • Gateway _prepare_inbound_message_text → the if image_paths branch is skipped entirely for text-only events

Image messages + non-vision model — legacy path unchanged

  • _should_use_native_vision_for_source returns False → _enrich_message_with_vision is called just like before
  • AIAgent._model_supports_native_vision returns False → _prepare_anthropic_messages_for_api runs the legacy text-flatten path
  • CLI chat() with no agent or non-vision model → falls through to _preprocess_images_with_vision

Image messages + vision model — new native path

  • Gateway builds content list, passes it through _run_agent → run_conversation(list, ...)
  • user_msg = {"role": "user", "content": [list]} is sent to the provider adapter
  • anthropic_adapter.py:829 already handles image_url → Anthropic native image block (zero change)
  • chat_completions path passes through (zero change)
  • codex_responses path converts via the helper in commit 3

Token estimator regression caught during testing

Commit 12 is important: the rewrite in commit 2 removed the implicit dict-serialization overhead that len(str(msg)) included. For text/tool conversations, the new estimator was reporting ~26% of the legacy value — a 74% undercount that would make preflight compression fire far later than before and put 32K-context models at risk of hitting real overflow.

Comparison on a realistic 9-turn tool conversation (empirical):

  • Legacy len(str): 894 chars / 224 tokens
  • After commit 2: 274 chars / 69 tokens ← 31% of legacy (69% undercount), regression
  • After commit 12: 809 chars / 203 tokens ← 91% of legacy, within noise

And on a 1MB base64 image message:

  • Legacy len(str): 1,048,728 chars / 262,182 tokens ← broken (the bug we're fixing)
  • After commit 12: 6,056 chars / 1,514 tokens ← correct (1500 tokens ≈ actual image cost)

Test Coverage

~80 new tests across 6 files:

  • tests/test_hermes_state.py — 8 round-trip cases + 1 v6→v7 migration test
  • tests/agent/test_model_metadata.py — 11 token estimation / multimodal char counting cases + 4 dot/hyphen normalization cases
  • tests/run_agent/test_run_agent.py — 14 multimodal run_conversation + 6 native vision capability + 4 prefix fallback + 3 OpenRouter fallback
  • tests/run_agent/test_run_agent_codex_responses.py — 11 content conversion + 2 end-to-end codex responses
  • tests/gateway/test_native_vision_routing.py — 15 helper tests + 2 OpenRouter fallback (new file)
  • tests/cli/test_cli_native_vision.py — 9 helper + chat() routing tests (new file)

All ~380 tests in the touched areas pass. 9 pre-existing failures in unrelated files (test_auxiliary_client.py, test_session_race_guard.py, etc.) remain unchanged by this branch — confirmed by running the same tests against main with the branch stashed.

Known Limitations

  1. mistralai/pixtral-large-2411 — Missing from the models.dev OpenRouter catalog so capability detection returns False. Users can set HERMES_FORCE_NATIVE_VISION=1 until the catalog is updated.
  2. Self-hosted vision models (vLLM + Llama 3.2 Vision, etc.) — Not in models.dev, need HERMES_FORCE_NATIVE_VISION=1. Documented in vision.md and environment-variables.md.
  3. Tool-result images: function_call_output in the Codex Responses API takes a string output field, so images in tool results still get flattened. Out of scope for this PR; would need a separate design for tool result multimodal.

Env Vars Added

  • HERMES_FORCE_NATIVE_VISION=1 — Force native vision routing in both CLI and gateway regardless of capability lookup. Documented in website/docs/reference/environment-variables.md.

Test plan

  • End-to-end: all 13 models from the Nous popular list tested against real API (test matrix above)
  • Unit tests: ~80 new tests, all pass
  • Regression: 9-turn tool conversation token estimate stays within 10% of legacy
  • Backwards compat: text-only messages exercise zero new code paths
  • CLI /paste with vision-capable model sends native image_url content block
  • Gateway routing from Telegram-style MessageEvent with media_urls → native content list when model has vision

0xbyt4 added 15 commits April 12, 2026 18:28
Messages with content as a list of blocks (text + image_url/input_image)
were stored via raw `content TEXT` insertion, which Python coerced to
repr() and read back as an unparseable string. Session reload silently
broke for any vision-capable model that received native image content.

Add a `content_blocks` TEXT column (schema v7) that stores the JSON
serialization of the multimodal structure. The legacy `content` column
keeps a flattened text representation so FTS5 search and any
plain-string callers continue to work unchanged.

Round-trips both shapes losslessly:
- Plain string content stays in `content`, `content_blocks` is NULL
- List content gets searchable text in `content` AND JSON in `content_blocks`

Read paths (get_messages, get_messages_as_conversation) prefer
`content_blocks` when present, fall back to `content` for legacy rows
and the v6→v7 migration case.

Tests: 8 round-trip cases (text, image_url, input_image, audio, None,
legacy plaintext, FTS5 searchability) + v6→v7 migration test.
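
The two-column layout described above can be sketched with an in-memory SQLite database. This is an illustration under assumptions, not the code in hermes_state.py: the table layout, `flatten_text`, and the function signatures are simplified stand-ins, and FTS5 indexing and the v6→v7 migration are left out.

```python
import json
import sqlite3
from typing import Any, Dict, List, Union

Content = Union[str, List[Dict[str, Any]], None]


def flatten_text(blocks: List[Dict[str, Any]]) -> str:
    """Join text-like blocks so the legacy `content` column stays searchable."""
    texts = [b.get("text", "") for b in blocks if b.get("type") in ("text", "input_text")]
    return " ".join(t for t in texts if t)


def append_message(conn: sqlite3.Connection, role: str, content: Content) -> None:
    """Store plain strings in `content`; store lists as JSON in `content_blocks`."""
    if isinstance(content, list):
        row = (role, flatten_text(content), json.dumps(content))
    else:
        row = (role, content, None)
    conn.execute(
        "INSERT INTO messages (role, content, content_blocks) VALUES (?, ?, ?)", row
    )


def get_messages(conn: sqlite3.Connection) -> List[Dict[str, Any]]:
    """Prefer `content_blocks` when present, fall back to `content` otherwise."""
    rows = conn.execute(
        "SELECT role, content, content_blocks FROM messages ORDER BY id"
    ).fetchall()
    return [
        {"role": role, "content": json.loads(blocks) if blocks is not None else text}
        for role, text, blocks in rows
    ]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages ("
    "id INTEGER PRIMARY KEY, role TEXT, content TEXT, content_blocks TEXT)"
)
```

Legacy rows written before the new column existed simply have NULL in `content_blocks`, so the same read path covers them without a data rewrite.
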
Token estimators (estimate_messages_tokens_rough,
estimate_request_tokens_rough) used `len(str(msg))` over the entire
message dict. For multimodal content, str() expands base64 image data
inline, so a 1MB image inflated the token estimate by ~175x (~262K
"tokens" for what is actually ~1500 tokens).

The compressor and pre-flight context checks then either wiped the
vision turn immediately or tripped false context-overflow rejections,
making vision-capable models effectively unusable for any image bigger
than a thumbnail.

Add `count_message_chars(msg)` that walks the message structure:
- Plain text content: counted directly
- text/input_text blocks: count text field only
- image_url/input_image/image blocks: fixed ~6000-char budget
  (≈1500 tokens, the average across Anthropic/OpenAI/Gemini)
- input_audio/audio blocks: fixed ~2000-char budget
- Tool calls: name + arguments
- Skip data: URLs in unknown block types

Use the helper in both estimators and in run_agent.py's pre-API
request size logging hook so all paths report consistent numbers.

Tests: 11 multimodal cases (text-only, image_url, input_image, anthropic
image type, audio, tool_calls, None content, request-level estimate,
2MB image stress test) + updated existing concrete-value tests to
match the new accurate counting.
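
A rough sketch of the field-walking counter, assuming the budgets listed above (6000 chars per image, 2000 per audio block). The names mirror the commit message, but the body is a simplified reconstruction rather than the code in agent/model_metadata.py, and the rounding in `estimate_tokens_rough` is an assumption.

```python
from typing import Any, Dict

IMAGE_CHAR_BUDGET = 6000  # about 1500 tokens at 4 chars per token
AUDIO_CHAR_BUDGET = 2000

TEXT_TYPES = {"text", "input_text"}
IMAGE_TYPES = {"image_url", "input_image", "image"}
AUDIO_TYPES = {"input_audio", "audio"}


def count_message_chars(msg: Dict[str, Any]) -> int:
    """Count characters without expanding base64 payloads."""
    total = 0
    content = msg.get("content")
    if isinstance(content, str):
        total += len(content)
    elif isinstance(content, list):
        for block in content:
            if not isinstance(block, dict):
                total += len(str(block))
                continue
            btype = block.get("type")
            if btype in TEXT_TYPES:
                total += len(block.get("text") or "")
            elif btype in IMAGE_TYPES:
                total += IMAGE_CHAR_BUDGET
            elif btype in AUDIO_TYPES:
                total += AUDIO_CHAR_BUDGET
            else:
                # Unknown block type: count string fields but skip data: URLs.
                for value in block.values():
                    if isinstance(value, str) and not value.startswith("data:"):
                        total += len(value)
    for call in msg.get("tool_calls") or []:
        fn = call.get("function") or {}
        total += len(fn.get("name") or "") + len(fn.get("arguments") or "")
    return total


def estimate_tokens_rough(msg: Dict[str, Any]) -> int:
    """chars/4 estimate, rounded up (the rounding rule is an assumption)."""
    return (count_message_chars(msg) + 3) // 4
```

With this shape the estimate for an image turn no longer depends on the size of the encoded payload, only on the number of image blocks.
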
The Codex Responses API path in _chat_messages_to_responses_input
collapsed message content via `str(content) if content is not None
else ""`. For text content this was a no-op, but for a multimodal
list it produced a Python repr (`"[{'type': 'image_url', ...}]"`)
which the API treated as opaque user text. Codex/OpenAI Responses
models with native vision never received the actual image — vision
was effectively dead on this code path.

Add `_chat_content_to_responses_content(content, role)` that walks
the structure and emits the right Responses input shape:
- Plain string → unchanged
- text/input_text/output_text blocks → input_text (or output_text
  for assistant role)
- image_url/input_image blocks → input_image with URL string
- Anthropic-native image blocks (source.type=base64|url) →
  input_image with derived data URL

Plus `_responses_content_is_empty()` to handle the
``has_codex_reasoning + empty content`` edge case for both string
and list shapes.

User and assistant message branches both use the helper now.
Tool result messages keep `str()` because the Responses API
``function_call_output`` shape requires output as a string —
multimodal tool results are out of scope here.

Tests: 11 conversion cases (string, None, text-only list, assistant
output_text, OpenAI image_url dict, image_url string, input_image
passthrough, anthropic base64 source, anthropic url source) + 2
end-to-end tests via _chat_messages_to_responses_input that verify
list content reaches the API as a list of input_image blocks
instead of a stringified repr.
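
The conversion rules above can be sketched as a small standalone function. This is an approximation of what the helper might look like, not the implementation in run_agent.py; the default media type for Anthropic base64 sources is an assumption made for the example.

```python
from typing import Any, Dict, List, Optional, Union

ChatContent = Union[str, List[Dict[str, Any]], None]


def chat_content_to_responses_content(
    content: ChatContent, role: str = "user"
) -> Union[str, List[Dict[str, Any]]]:
    """Map chat-style content to Responses-style input items."""
    if content is None:
        return ""
    if isinstance(content, str):
        return content  # plain strings pass through unchanged
    text_type = "output_text" if role == "assistant" else "input_text"
    items: List[Dict[str, Any]] = []
    for block in content:
        btype = block.get("type")
        url: Optional[str] = None
        if btype in ("text", "input_text", "output_text"):
            items.append({"type": text_type, "text": block.get("text", "")})
            continue
        if btype in ("image_url", "input_image"):
            raw = block.get("image_url")
            # Chat completions nests the URL in a dict; Responses uses a string.
            url = raw.get("url") if isinstance(raw, dict) else raw
        elif btype == "image":
            source = block.get("source") or {}
            if source.get("type") == "base64":
                media_type = source.get("media_type", "image/png")  # assumed default
                url = f"data:{media_type};base64,{source.get('data', '')}"
            elif source.get("type") == "url":
                url = source.get("url")
        if url:
            items.append({"type": "input_image", "image_url": url})
    return items
```

Tool result messages are deliberately not routed through a helper like this, since the commit keeps `str()` for the string-only `function_call_output` field.
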
run_conversation only accepted ``user_message: str``, which forced
the gateway to flatten any image attachments to a text description
via the vision_analyze tool — losing the actual pixels even when the
target model had native vision support.

Widen the type to ``Union[str, List[Dict[str, Any]]]``. The list shape
is the standard OpenAI multimodal content format (text + image_url
blocks), which the chat_completions adapter passes through natively
and the codex_responses adapter now converts via the helper added in
the previous commit.

Required adjustments:

- _user_message_preview(): static helper that flattens str-or-list
  into a one-line preview (text + [image]/[audio] markers) for log
  lines and the "💬 Starting conversation" banner that previously
  did naked string slicing.

- _sanitize_user_message_in_place(): walks list content, sanitizes
  surrogates inside text blocks, leaves image/audio blocks untouched.

- original_user_message is now always coerced to a string before
  being passed to plugin hooks, _looks_like_codex_intermediate_ack,
  _save_trajectory, and the memory manager — all string-only
  consumers that don't need (and would break on) raw multimodal.

The user_msg dict construction at the API boundary already worked
for both shapes since it just wraps content as-is.

Anthropic native API path will still re-flatten via
_prepare_anthropic_messages_for_api until the next commit makes it
capability-aware. chat_completions and codex paths get native
multimodal immediately.

Tests: 14 cases — preview helper (string, list, image-only, audio,
anthropic image block, truncation), sanitize helper (surrogates in
nested text fields, image fields untouched, no in-place mutation),
end-to-end run_conversation with list content reaching the chat
completions API as a list, plus a backwards-compat test that plain
string content still produces string content.
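
For illustration, a preview helper along the lines described above might look like the sketch below. The 80-character limit and the exact marker strings are assumptions; only the str-or-list flattening behaviour is taken from the commit message.

```python
from typing import Any, Dict, List, Union

UserMessage = Union[str, List[Dict[str, Any]]]


def user_message_preview(message: UserMessage, limit: int = 80) -> str:
    """Flatten a str-or-list user message into a one-line preview."""
    if isinstance(message, str):
        text = message
    else:
        parts: List[str] = []
        for block in message:
            btype = block.get("type")
            if btype in ("text", "input_text"):
                parts.append(block.get("text", ""))
            elif btype in ("image_url", "input_image", "image"):
                parts.append("[image]")
            elif btype in ("input_audio", "audio"):
                parts.append("[audio]")
        text = " ".join(p for p in parts if p)
    text = " ".join(text.split())  # collapse newlines and repeated spaces
    if len(text) <= limit:
        return text
    return text[: limit - 3] + "..."
```

Because the helper never slices the raw message, log lines and banners stay safe whether the caller passes a string or a content list.
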
…apable models

_prepare_anthropic_messages_for_api unconditionally flattened image
content blocks via _preprocess_anthropic_content, which calls
vision_analyze_tool to produce a text description. This was a sane
fallback for older Anthropic models without image input, but for
Claude 3+, Opus 4.6, Sonnet 4.6, and every other vision-capable
Anthropic-compatible model the legacy path:

- Made an extra LLM call to a vision describer (slower, more cost)
- Lost pixel-level information (text descriptions are lossy)
- Made vision-capable models behave like blind ones for webdev,
  pixel-perfect debugging, OCR, etc.

The anthropic_adapter at agent/anthropic_adapter.py:829 already
converts image_url/input_image content blocks into Anthropic's
native ``{"type": "image", "source": ...}`` format. Skip the legacy
preprocess when the model declares native vision support and let
the adapter do its job.

Add ``_model_supports_native_vision()`` that:
- Caches the lookup result per (provider, model) tuple
- Looks up via ``agent.models_dev.get_model_capabilities``
- Returns False on lookup failure (safe legacy fallback)
- Honors ``HERMES_FORCE_NATIVE_VISION=1`` env var for self-hosted
  models that aren't catalogued in models.dev (e.g. vLLM serving
  Llama 3.2 Vision)

Tests: 6 new cases — capability cache, unknown-model fallback, env
var override, lookup-exception safety, vision-capable passthrough
(verifies vision_analyze_tool is NOT called and image blocks are
preserved), non-vision flatten (verifies legacy path still works
when capability says False).

Existing TestAnthropicImageFallback tests still pass — the agent
fixture has an empty model name, so capability lookup returns None
and the legacy fallback path runs unchanged.
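
The caching and escape-hatch behaviour can be sketched as follows. `VisionCapabilityChecker` and the injected `lookup` callable are hypothetical stand-ins for the agent method and for ``agent.models_dev.get_model_capabilities``; treating only the value ``1`` as enabling the override is also an assumption.

```python
import os
from typing import Callable, Dict, Optional, Tuple

# Hypothetical lookup signature: returns True/False, or None when unknown.
Lookup = Callable[[str, str], Optional[bool]]


class VisionCapabilityChecker:
    """Cache capability lookups per (provider, model) and fail safe to False."""

    def __init__(self, lookup: Lookup) -> None:
        self._lookup = lookup
        self._cache: Dict[Tuple[str, str], bool] = {}

    def supports_native_vision(self, provider: str, model: str) -> bool:
        # The env var wins over any catalog result.
        if os.environ.get("HERMES_FORCE_NATIVE_VISION") == "1":
            return True
        key = (provider, model)
        if key not in self._cache:
            try:
                result = self._lookup(provider, model)
            except Exception:
                result = None  # lookup failure means legacy fallback
            self._cache[key] = bool(result)
        return self._cache[key]
```

Returning False on any failure keeps the legacy text-flatten path as the default, so a catalog outage degrades quality rather than breaking image turns.
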
When a user attached an image and the active model declares native
vision support, build a typed content list (text + image_url blocks)
and pass it through the agent unchanged so the model receives the
actual pixels. Models without native vision keep the legacy
vision_analyze fallback that flattens images to text descriptions.

This is the user-visible end of the native vision feature: bug fixes
1-3 made the persistence, token estimator, and codex API path
multimodal-safe; feature commits 4-5 widened run_conversation and
made the anthropic preprocess capability-aware. Now the gateway
actually routes the image content the right way at the platform
boundary.

Three new helpers on GatewayRunner:

- _message_preview_for_hook(message): flattens str-or-list to text
  for the agent:start hook payload, log lines, and the auto-title
  generator (which expects a plain string).

- _should_use_native_vision_for_source(source): resolves the active
  model+provider via _resolve_session_agent_runtime, looks up
  vision support via agent.models_dev.get_model_capabilities, and
  honors HERMES_FORCE_NATIVE_VISION=1 for self-hosted vision models
  not catalogued in models.dev.

- _build_native_vision_content(text, paths): reads each image into
  a base64 data URL via the existing tools.vision_tools helper and
  emits OpenAI-style image_url blocks. The user's caption becomes
  the first text block. Bad image paths are skipped with a warning
  rather than failing the whole request.

_prepare_inbound_message_text return type is widened to Optional[Any]
(string OR list of content blocks). The two callers and _run_agent's
``message`` parameter all accept the wider type.

The pending model-switch-note prepend at line 8049 is updated to
prepend a text block when message is a list, so multimodal content
isn't broken by string concatenation.

Tests: 15 cases — preview helper (5 shapes), capability resolver
(force env var, vision-capable, non-vision, unknown model, empty
model name, runtime resolution exception), content builder (text+
image, image-only, multiple images, unreadable image graceful skip).
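
A minimal sketch of the content builder, assuming plain file reads and MIME guessing from the file extension. The real helper goes through an existing tools.vision_tools function that is not reproduced here, and what it returns when every image fails is not modelled.

```python
import base64
import logging
import mimetypes
from pathlib import Path
from typing import Any, Dict, List

logger = logging.getLogger(__name__)


def build_native_vision_content(text: str, paths: List[str]) -> List[Dict[str, Any]]:
    """Build OpenAI-style content blocks: caption first, then one block per image."""
    blocks: List[Dict[str, Any]] = []
    if text:
        blocks.append({"type": "text", "text": text})
    for path in paths:
        try:
            raw = Path(path).read_bytes()
        except OSError as exc:
            # Skip unreadable images instead of failing the whole request.
            logger.warning("skipping unreadable image %s: %s", path, exc)
            continue
        mime = mimetypes.guess_type(path)[0] or "image/png"  # assumed default
        encoded = base64.b64encode(raw).decode("ascii")
        blocks.append(
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}}
        )
    return blocks
```

Keeping the caption as the first block means the downstream preview and flattening helpers can recover the user's text without inspecting image data.
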
The new gateway native vision routing auto-detects model capability
via models.dev. Self-hosted vision models (vLLM serving Llama 3.2
Vision, etc.) and brand-new models not yet catalogued in models.dev
need an opt-in escape hatch — add HERMES_FORCE_NATIVE_VISION to the
env var reference table next to the AUXILIARY_VISION_* entries.

The CLI's chat() method always called _preprocess_images_with_vision
when the user attached an image, which routes the image through the
auxiliary vision model (Gemini Flash) to produce a text description
and prepends that to the user's message. The actual pixels never
reached the main model — even when running Claude Opus 4.6 or another
vision-capable model that would have done a better job natively.

The vision.md user guide also incorrectly claimed images were sent
"as base64-encoded content blocks, so any vision-capable model can
process them" — that was the intent but never the reality.

Mirror the gateway native vision routing here:

- Add _build_native_vision_content_cli(text, images): reads each
  attached image into a base64 data URL via the existing
  _image_to_base64_data_url helper and emits OpenAI-style image_url
  content blocks that the provider adapter converts to native form.

- chat() branches on agent._model_supports_native_vision() (the
  capability check added in commit 5). Native-capable models receive
  list content; everything else still gets the legacy text-flatten
  path so non-vision and unknown models keep working unchanged.

- Capability-check exceptions and missing agent both fall back to
  the legacy path safely.

Tests: 9 cases — content builder (text+image, image-only, multiple
images, missing image graceful skip, all-bad fallback to placeholder
string), and chat() routing (vision-capable uses native, non-vision
uses legacy, capability exception falls back, no agent falls back).

The previous vision.md claimed images were sent as base64 content
blocks "so any vision-capable model can process them" — that was the
intended behavior but the actual code (both CLI and gateway) always
routed images through the auxiliary vision model and prepended a
text description. The docs were aspirational, not factual.

Now that the CLI and gateway both implement capability-aware routing,
update the docs to accurately describe what happens:

- Add a "How It Works" enumeration of the two paths (native vision
  vs vision_analyze fallback) so users understand which model gets
  what shape.

- Add a "Messaging Platforms" section explaining that the same
  routing applies to Telegram, Discord, Matrix, etc. — webdev
  workflows now work end-to-end through messaging because the model
  receives actual pixels.

- Add a "Self-Hosted & Uncatalogued Vision Models" section
  documenting HERMES_FORCE_NATIVE_VISION=1 for vLLM-served Llama
  3.2 Vision and other models that aren't yet in models.dev.

- Rewrite the "Supported Models" section to list models confirmed to
  use the native path and explain the legacy fallback for non-vision
  models, including the AUXILIARY_VISION_* override pointer.

…or providers

Two robustness gaps in the capability lookup pipeline became visible
once we tested against the Nous endpoint with claude-sonnet-4.6:

1. **Dot vs hyphen mismatch**. Anthropic's catalog stores
   ``claude-sonnet-4-6`` (hyphens) while OpenRouter's catalog stores
   ``anthropic/claude-sonnet-4.6`` (dots). _find_model_entry only did
   exact and case-insensitive matching, so a dotted query against the
   hyphenated catalog returned None even though the model is the same.

2. **Aggregator providers not in models.dev**. ``nous``, custom OpenAI-
   compatible proxies, and similar aggregators aren't in
   PROVIDER_TO_MODELS_DEV at all, so the lookup short-circuits to None.
   But the model name typically carries an upstream vendor prefix
   (``anthropic/claude-sonnet-4.6``) that points at a real catalogued
   vendor — we just need to follow the slug.

Fixes:

- ``_find_model_entry``: add a third matching pass that normalizes
  both dots and hyphens to underscores before comparing. ``4.6`` and
  ``4-6`` both become ``4_6`` and resolve to the same entry. Case-
  insensitive too. Unrelated families still don't match because their
  base names differ.

- ``_model_supports_native_vision`` (run_agent.py) and
  ``_should_use_native_vision_for_source`` (gateway/run.py): when the
  direct ``(provider, model)`` lookup returns None and the model name
  contains a slash, split on ``/`` and try the upstream vendor's
  catalog. ``("nous", "anthropic/claude-sonnet-4.6")`` → fall back to
  ``("anthropic", "claude-sonnet-4.6")`` → matches via the new
  normalization → returns supports_vision=True.

End-to-end verified against the Nous endpoint:
``_model_supports_native_vision()`` was returning False for
``nous + anthropic/claude-sonnet-4.6`` before this commit; now returns
True and the gateway routes images natively to Sonnet 4.6.

Tests: 4 dot/hyphen normalization cases (dotted query matches hyphen
catalog, hyphen query backwards-compat, uppercase + dot combined,
unrelated family stays unmatched), 3 prefix-fallback cases on AIAgent
(unmapped provider with vendor slug → True, no slash → False, unknown
vendor → False), 2 prefix-fallback cases on GatewayRunner (aggregator
with known vendor → True, unknown vendor → False).
Testing the previous capability fix against the Nous endpoint with
the full 9-model lineup revealed remaining false negatives:

  moonshotai/kimi-k2.5                  api=PASS  hermes=NO
  mistralai/mistral-small-2603          api=PASS  hermes=NO
  mistralai/mistral-small-3.2-24b-...   api=PASS  hermes=NO
  z-ai/glm-4.5v                         api=PASS  hermes=NO

Root cause: the vendor-prefix fallback from the previous commit split
``moonshotai/kimi-k2.5`` into ``("moonshotai", "kimi-k2.5")`` but the
moonshotai catalog in models.dev is empty (0 entries). Mistral's
slug uses ``mistralai/`` but the catalog ID is ``mistral`` with
different model version names. The upstream vendor fallback couldn't
recover any of these.

The OpenRouter catalog, on the other hand, has all four models
because Nous, custom proxies, and most aggregators share the
OpenRouter slug format: ``moonshotai/kimi-k2.5``,
``mistralai/mistral-small-2603``, ``z-ai/glm-4.5v``, etc. Querying
the OpenRouter catalog with the full slug is a strong catch-all.

Add a third fallback layer to both ``_model_supports_native_vision``
(run_agent.py) and ``_should_use_native_vision_for_source``
(gateway/run.py):

  1. Direct (provider, model) — unchanged
  2. Vendor prefix strip (vendor, vendor_model) — unchanged
  3. OpenRouter aggregator: ("openrouter", model) when model has a
     slash prefix AND provider is not already "openrouter" (avoids
     redundant double-lookup for direct OR users)

After this commit, all 8 vision-capable models Nous actually serves
are correctly detected by Hermes: Claude Opus/Sonnet 4.6, GPT-5.4
Mini, Gemini 3 Flash, Kimi K2.5, Mistral Small (2603 + 3.2-24b),
and GLM-4.5v. The only remaining mismatch is
``mistralai/pixtral-large-2411`` because it's missing from the
models.dev OpenRouter catalog entirely — users hitting that model
can set HERMES_FORCE_NATIVE_VISION=1 as an escape hatch.

Tests: 3 new cases on AIAgent (openrouter catalog fallback for
unmapped aggregator, respect non-vision flag from OR catalog, skip
the redundant second lookup when provider is already openrouter),
1 new case on GatewayRunner.
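
The fallback order can be sketched as a single function over an injected lookup. `resolve_vision_support` and the `lookup` callable are hypothetical names; the lookup is assumed to return None for "not in catalog" and a bool otherwise.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical lookup: True/False when the catalog has the model, None otherwise.
Lookup = Callable[[str, str], Optional[bool]]


def resolve_vision_support(provider: str, model: str, lookup: Lookup) -> bool:
    """Try direct, vendor-prefix, then OpenRouter lookups; default to False."""
    candidates: List[Tuple[str, str]] = [(provider, model)]
    if "/" in model:
        vendor, vendor_model = model.split("/", 1)
        candidates.append((vendor, vendor_model))
        # Skip the aggregator layer when the direct lookup already was OpenRouter.
        if provider != "openrouter":
            candidates.append(("openrouter", model))
    for cand_provider, cand_model in candidates:
        result = lookup(cand_provider, cand_model)
        if result is not None:
            return bool(result)  # an explicit False from a catalog is respected
    return False
```

The first catalog that knows the model decides the answer, so an explicit non-vision flag is never overridden by a later layer.
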
…d regression

Bug 2's original fix replaced ``len(str(msg))`` with a field-walking
counter in count_message_chars. That fixed the ~173x overcount for
messages carrying base64 image payloads but introduced a new problem:
the new counter didn't include dict serialization overhead (braces,
quoted field names, separators) that the legacy formula implicitly
counted. For tool-heavy text conversations the new counter reported
roughly 26% of the old value — a 74% undercount that would push
preflight compression, budget pre-flight checks, and the context
compressor to fire much later than they used to.

This is a real regression for non-vision users on smaller context
windows (32K Kimi, local Gemma, etc.) because it means the estimator
under-reports until the real context limit is almost hit.

Fix: add a per-message overhead constant (30 chars) to approximate
the dict braces + ``'role':``/``'content':`` wrappers, plus a
per-tool-call wrapper overhead (40 chars) for tool-calling turns.
These constants were tuned by diffing the field-walking count
against ``len(str(msg))`` for typical chat / tool / multi-turn
conversations; the result now stays within ~10% of the legacy
formula for text/tool cases (well inside the noise of a ``chars/4``
rough estimator) while keeping the massive 173x savings for
base64-bearing multimodal turns.

Empirical comparison on a realistic 9-turn tool conversation:
  legacy len(str): 894 chars / 224 tokens
  before this fix:  274 chars /  69 tokens  (31% — regression)
  after this fix:   809 chars / 203 tokens  (91% — accurate)

And on a 1MB base64 image message:
  legacy len(str):  1,048,728 chars / 262,182 tokens  (broken)
  after this fix:       6,056 chars /   1,514 tokens  (correct)

Tests: updated 6 existing concrete-value expectations to match the
new overhead constants, kept the upper-bound assertions on image
messages (< 10,000 chars) that already had margin baked in.
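
The percentages quoted above can be re-derived from the character counts. The sketch assumes the rough estimator is chars/4 rounded up, which is consistent with the token figures in this message but is not confirmed by the source.

```python
import math


def rough_tokens(chars: int) -> int:
    """chars/4 estimate, rounded up (assumed rounding rule)."""
    return math.ceil(chars / 4)


# Character counts copied from the empirical comparison above.
legacy_chars, before_fix_chars, after_fix_chars = 894, 274, 809
image_legacy_chars, image_after_chars = 1_048_728, 6_056

ratio_before = rough_tokens(before_fix_chars) / rough_tokens(legacy_chars)
ratio_after = rough_tokens(after_fix_chars) / rough_tokens(legacy_chars)
image_savings = rough_tokens(image_legacy_chars) / rough_tokens(image_after_chars)

print(f"before fix: {ratio_before:.0%} of legacy")
print(f"after fix:  {ratio_after:.0%} of legacy")
print(f"image turn: {image_savings:.0f}x smaller estimate")
```
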
Polish pass after self-review of the native vision work:

1. **Auto-resize large images before native send**. Both gateway and
   CLI were calling ``_image_to_base64_data_url`` directly, which
   encodes the image at its original resolution. A user dropping a
   4K screenshot would get a 13MB base64 payload sent to the model,
   costing ~2000+ tokens per image on OpenAI high-detail vs ~1500
   expected. Switch to ``_resize_image_for_vision`` which is already
   battle-tested (it's what the legacy ``_enrich_message_with_vision``
   path uses) and auto-resizes to the standard ~5MB budget when the
   encoded size would exceed it.

2. **Set ``detail: "auto"`` explicitly on image_url blocks**. Without
   an explicit detail value, providers default to "high" for large
   images which can double the token cost silently. Setting "auto"
   lets the provider pick based on resolution and stays consistent
   with what users see in typical OpenAI-compatible clients.

3. **Route logging for observability**. Both the gateway and CLI
   decision points now emit a ``logger.info("[vision] route=... ...")``
   line so operators can trace whether a given image-bearing turn
   went through the native path or the legacy vision_analyze
   fallback. Previously there was no way to debug "why did Claude
   say it can't see my image?" without reading the source.

4. **Type annotation fix**. ``_build_native_vision_content`` was
   declared as returning ``List[Dict[str, Any]]`` but the fallback
   branch (all images failed to encode) returns the caption string.
   Widen to ``Union[str, List[Dict[str, Any]]]`` so the annotation
   matches reality; mypy and other strict type checkers would
   otherwise flag the mismatch.

5. **Caption-only fallback returns a string, not a wrapped list**.
   Previously if no images encoded successfully but the caption was
   non-empty, the helper returned a list with a single text block.
   That's semantically weird (a list with no image content has no
   reason to be a list) and confused the edge-case test. Return the
   caption string directly; let the regular text path handle it.
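
Items 2 to 5 above can be illustrated together with a small sketch. It is not the helper from this PR: `encode_image_as_data_url` is a hypothetical stand-in that only base64-encodes the file (no resizing), and the log fields after `route=` are invented for the example. What it does show is the contract described above: every `image_url` block carries ``detail: "auto"``, and the function returns the plain caption string when no image could be encoded.

```python
import base64
import logging
from pathlib import Path
from typing import Any, Dict, List, Optional, Union

logger = logging.getLogger(__name__)

_MIME_BY_SUFFIX = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".webp": "image/webp",
    ".gif": "image/gif",
}


def encode_image_as_data_url(path: str) -> Optional[str]:
    """Hypothetical encoder: returns a base64 data URL, or None on read failure."""
    try:
        raw = Path(path).read_bytes()
    except OSError:
        return None
    mime = _MIME_BY_SUFFIX.get(Path(path).suffix.lower(), "image/png")
    return f"data:{mime};base64,{base64.b64encode(raw).decode('ascii')}"


def build_native_vision_content(
    caption: str, image_paths: List[str]
) -> Union[str, List[Dict[str, Any]]]:
    """Return a multimodal content list, or the caption string if no image encoded."""
    blocks: List[Dict[str, Any]] = []
    for path in image_paths:
        url = encode_image_as_data_url(path)
        if url is None:
            logger.warning("[vision] skipping unreadable image: %s", path)
            continue
        # Explicit detail so the provider does not silently default to "high".
        blocks.append({"type": "image_url", "image_url": {"url": url, "detail": "auto"}})
    if not blocks:
        logger.info("[vision] route=text reason=no_encodable_images")
        return caption  # string fallback, not a one-element list
    logger.info("[vision] route=native images=%d", len(blocks))
    if caption:
        blocks.insert(0, {"type": "text", "text": caption})
    return blocks
```

Callers then branch on `isinstance(result, str)` to decide whether the turn goes through the regular text path.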

Tests: 2 updated skip-image tests to assert the new string fallback,
2 new tests covering the ``detail: "auto"`` contract on both the
gateway and CLI helpers.
…ient

The auxiliary client's _try_openrouter() helper read the OpenRouter
API key from OPENROUTER_API_KEY but built the OpenAI client against
the hardcoded https://openrouter.ai/api/v1 endpoint, ignoring any
OPENROUTER_BASE_URL env override that the main agent path respects.

The visible failure mode is subtle and confusing:
1. User runs Hermes with OPENROUTER_API_KEY pointed at an alternate
   OR-compatible endpoint (Nous Portal, custom proxy) and sets
   OPENROUTER_BASE_URL accordingly.
2. Main agent works because the gateway runtime resolver respects
   the env override.
3. Vision auto-resolver tries OpenRouter, builds a client against
   the canonical openrouter.ai endpoint with the wrong API key.
4. OR returns 401, credential pool marks the entry exhausted.
5. ``check_vision_requirements()`` returns False on the next call.
6. ``vision_analyze`` is silently de-registered from the agent's
   toolset (it has a check_fn).
7. When the user later sends an image to a non-vision model, the
   gateway's legacy fallback path tells the agent to "use
   vision_analyze" — but the tool isn't even in the agent's tool
   list. The agent improvises with browser_vision (Playwright
   missing) or read_file (binary), and a poorly-aligned model can
   then hallucinate an image description instead of
   admitting failure.
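
Steps 5 to 7 hinge on how a tool with a ``check_fn`` drops out of the toolset. A minimal sketch of that gating, assuming a simple registry shape (the `Tool` dataclass and `active_tool_names` are hypothetical names, not Hermes code):

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Tool:
    name: str
    # Optional availability probe; a tool without one is always offered.
    check_fn: Optional[Callable[[], bool]] = None


def active_tool_names(tools: List[Tool]) -> List[str]:
    """Keep only tools whose availability check passes (or that have none)."""
    return [t.name for t in tools if t.check_fn is None or t.check_fn()]


# Once the credential pool marks the OpenRouter entry exhausted, the
# vision check starts returning False and the tool silently disappears.
vision_available = False
tools = [
    Tool("read_file"),
    Tool("vision_analyze", check_fn=lambda: vision_available),
]
remaining = active_tool_names(tools)
```

Here `remaining` no longer contains `vision_analyze`, so a prompt instructing the agent to call it refers to a tool the agent cannot see.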

This was reproduced empirically: routing to Nous via
OPENROUTER_BASE_URL and asking xiaomi/mimo-v2-pro about a manga
panel resulted in a confident "Yemeksepeti food delivery bag"
hallucination on one run and "Xiaomi smartphone interface" on the
next, neither of which had any relationship to the actual image.

Fix: read OPENROUTER_BASE_URL from env when building the auxiliary
OpenRouter client (same pattern the main agent path uses), falling
back to the hardcoded default when unset. The pool path already
honors per-entry base_url overrides; this change brings the env-var
path in line.
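
A minimal sketch of the resolution rule described in this fix, assuming hypothetical helper names (`resolve_openrouter_base_url`, `openrouter_client_kwargs`) and a dict of keyword arguments in place of a real client object:

```python
import os
from typing import Dict, Mapping, Optional

DEFAULT_OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"


def resolve_openrouter_base_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Prefer OPENROUTER_BASE_URL; treat unset and empty string the same."""
    env = os.environ if env is None else env
    override = (env.get("OPENROUTER_BASE_URL") or "").strip()
    return override or DEFAULT_OPENROUTER_BASE_URL


def openrouter_client_kwargs(
    env: Optional[Mapping[str, str]] = None,
) -> Optional[Dict[str, str]]:
    """Return kwargs for building the auxiliary client, or None without a key."""
    env = os.environ if env is None else env
    api_key = (env.get("OPENROUTER_API_KEY") or "").strip()
    if not api_key:
        return None  # short-circuit: nothing to build without a key
    return {"api_key": api_key, "base_url": resolve_openrouter_base_url(env)}
```

Passing `env` explicitly keeps the four cases listed under Tests easy to exercise without patching `os.environ`.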

Tests: 4 cases — env override applied, default fallback, no API
key short-circuit, empty string treated as unset.
…l seeding

The previous commit fixed the `_try_openrouter()` env-var path to honor
OPENROUTER_BASE_URL. But the credential pool path takes precedence,
and the openrouter-specific seeding branch in `_seed_from_env()` was
hardcoded to use the canonical openrouter.ai URL regardless of any
env override:

    if provider == "openrouter":
        token = os.getenv("OPENROUTER_API_KEY", "").strip()
        if token:
            ...
            "base_url": OPENROUTER_BASE_URL,  # ← hardcoded, ignored env
            ...

The generic seeding path further down DOES read the per-provider base
URL env var (via `pconfig.base_url_env_var`), but the openrouter
branch early-returns before reaching it.

Effect for users routing OPENROUTER_API_KEY through an alternate
OR-compatible endpoint (Nous Portal, custom proxy):

  1. First gateway start: pool seeds an entry with the alternate
     endpoint key + the canonical openrouter.ai URL.
  2. Auxiliary vision call goes to openrouter.ai with the wrong key.
  3. OpenRouter returns 401 ``Missing Authentication header``.
  4. Pool entry persists with stale base_url even after env vars are
     corrected — restarting Hermes doesn't help unless the user
     manually deletes auth.json's openrouter pool section.

This was reproduced empirically on the Nous endpoint: with
OPENROUTER_API_KEY=sk-bpby... and OPENROUTER_BASE_URL pointed at Nous,
auxiliary vision_analyze calls returned 401 every time even though
the env vars were correct.  Inspecting auth.json showed the pool
entry had base_url ``https://openrouter.ai/api/v1`` — the hardcoded
constant from this seeding function.

Fix: read OPENROUTER_BASE_URL the same way the generic path reads
``pconfig.base_url_env_var``, fall back to the hardcoded default
when unset, and strip trailing slashes.
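
For the pool path, the seeded entry might look like the sketch below. The function name and the entry fields other than ``base_url`` are assumptions for illustration; the part taken from this message is the rule itself: env override first, hardcoded default when unset or empty, trailing slashes stripped.

```python
from typing import Dict, Mapping, Optional

DEFAULT_OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"


def seed_openrouter_entry(env: Mapping[str, str]) -> Optional[Dict[str, str]]:
    """Build a pool entry from env vars, or None when no API key is set."""
    token = (env.get("OPENROUTER_API_KEY") or "").strip()
    if not token:
        return None
    # Normalize so "https://host/v1/" and "https://host/v1" seed the same entry.
    base_url = (env.get("OPENROUTER_BASE_URL") or "").strip().rstrip("/")
    return {
        "provider": "openrouter",
        "api_key": token,
        "base_url": base_url or DEFAULT_OPENROUTER_BASE_URL,
    }
```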

Tests: 4 cases — env override applied (with and without trailing
slash), default fallback when env var unset, default fallback on
explicit empty string.
@linxule

linxule commented Apr 23, 2026

Contributor

+1 — this is the architecture I've been trying to achieve via aux-vision config and couldn't, because that subsystem has no fallback list. Native-vision-through-main-model is the right shape.

One data point if useful for the test matrix: kimi-for-coding on api.kimi.com/coding/v1 (Kimi's subscription coding endpoint, distinct from the moonshotai/kimi-k2.5 you already tested) also accepts native multimodal — verified via direct curl, K2.6 reads images correctly. Endpoint quirk worth noting in docs: it rejects http(s):// image URLs with "unsupported image url" and requires base64 data: URLs — which _build_native_vision_content already emits, so no code change needed. Happy to test the branch end-to-end against a Kimi+Codex setup if it'd help unstick this.

@teknium1

Contributor

Closing as superseded: native multimodal vision routing for vision-capable main models shipped in #16506 (commit ec671c4, "feat(image-input): native multimodal routing based on model vision capability"). It reaches the same architectural goal this PR proposed: capability-aware routing of inbound user-attached images as native image_url content parts when the active model declares supports_vision=true, with the legacy vision_analyze enrichment retained as the fallback for non-vision models. The shipped implementation lives in agent/image_routing.py (decide_image_input_mode, build_native_content_parts) and is wired into the CLI, gateway, and TUI gateway paths.

Thanks for the work; the design exploration here directly informed the shipped version. Closing in favor of #16506 to consolidate the open-PR list.

@teknium1 teknium1 closed this May 10, 2026

Labels

- comp/agent: Core agent loop, run_agent.py, prompt builder
- comp/cli: CLI entry point, hermes_cli/, setup wizard
- comp/gateway: Gateway runner, session dispatch, delivery
- P2: Medium, degraded but workaround exists
- tool/vision: Vision analysis and image generation
- type/feature: New feature or request
