feat(multimodal): implement native image content flow across gateway, agent, and read_file, with safer context budgeting by latentharbor · Pull Request #4535 · NousResearch/hermes-agent

latentharbor · 2026-04-02T04:15:36Z

What does this PR do?

This PR implements native multimodal image input across Hermes' Gateway, agent loop, provider adapters, persistence, token budgeting, and file-reading path.

Before this PR, Hermes could accept images at the Gateway level, but the system largely treated them as text:

image-bearing user messages were often flattened into strings
non-text content was routed through vision_analyze even when the selected model already supported native vision
read_file(image) behaved like a legacy binary read path instead of a multimodal input path
several downstream layers assumed tool results and message content were always strings
rough token estimation significantly underestimated multimodal turns, especially for image-bearing conversations and non-ASCII-heavy text, which made context compression and hygiene less reliable

This PR changes that architecture so Hermes can preserve structured text + image content end-to-end for supported models, while still falling back safely for models that do not support native vision.

Compared with the old flow, the new approach has several advantages:

native vision models now receive the actual image instead of a lossy text description
Gateway routing now uses the resolved turn model, so vision support is checked against the model actually selected for that turn
read_file becomes the single image entrypoint for the model, removing the need to expose a separate vision_analyze tool
tool results are no longer forced into strings before provider adaptation
session persistence, token estimation, compression, and history handling are multimodal-safe instead of string-assumption-driven
repeated image reads no longer collapse into text-only dedup stubs that cannot reconstruct the original image payload
context pressure logic now uses safer rough budgeting for multilingual text and image inputs, which reduces the risk of silently overfilling context windows in long multimodal sessions

In short: this PR upgrades Hermes from "image-aware text plumbing" to a working native text + image multimodal content pipeline.

Related Issue

Fixes #

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)
✨ New feature (non-breaking change that adds functionality)
🔒 Security fix
📝 Documentation update
✅ Tests (adding or improving test coverage)
♻️ Refactor (no behavior change)
🎯 New skill (bundled or hub)

Changes Made

Added unified multimodal capability detection in agent/model_metadata.py and hermes_cli/models.py.
- Introduces a shared supports_vision source of truth so Gateway and agent logic can reliably decide whether to use native image input.
Added shared multimodal content helpers in agent/message_content.py.
- Centralizes:
  - structured content detection
  - text projection for search/logging/compression
  - provider-side content conversion
  - image path -> data URL conversion
- This is the new foundation for text + image handling.
Updated the core agent loop in run_agent.py.
- Preserves structured multimodal content instead of flattening it prematurely.
- Adapts read_file(image) results based on model capability:
  - native vision models receive image-bearing structured content
  - non-vision models get automatic auxiliary image analysis
- Removes vision_analyze from the model-visible tool surface.
- Fixes ordering so image tool results are processed before any truncation guard.
- Makes low-frequency tool/display paths tolerant of structured tool results.
- Removes temporary request-body debug printing used during development.
Updated tool surface generation in model_tools.py.
- Hides vision_analyze from the model and routes image inspection through read_file instead.
Updated Gateway inbound image handling in gateway/run.py.
- Loads incoming images from local cache and converts them to data URLs.
- Checks vision support against the resolved turn model rather than a stale/default model.
- Sends native multimodal content directly for vision-capable models.
- Falls back to auxiliary analysis only for non-vision models.
- Removes temporary image-routing debug printing used during validation.
Updated API server content handling in gateway/platforms/api_server.py.
- Stops collapsing structured content from API requests into plain strings.
- Preserves multimodal request content for downstream processing.
Updated Anthropic provider adaptation in agent/anthropic_adapter.py.
- Allows tool_result.content to carry structured text/image blocks instead of always JSON-stringifying tool output.
Updated session persistence in hermes_state.py.
- Adds content_json storage for original structured content.
- Keeps a text projection for search/preview while preserving raw multimodal content for round-trip history recovery.
Updated context/token handling in agent/context_compressor.py and agent/model_metadata.py.
- Compression and hygiene logic now operate on multimodal-safe text projections instead of assuming raw strings.
- Rough token estimation is materially improved:
  - text estimation is now more conservative for non-ASCII-heavy languages instead of relying on a pure len(text) // 4 heuristic
  - image inputs are assigned explicit token cost using an OpenAI-style image estimation model instead of being effectively counted as a tiny placeholder
- This reduces the risk of underestimating prompt size and accidentally overfilling context windows in long multimodal conversations.
Updated file tools in tools/file_operations.py and tools/file_tools.py.
- read_file now supports image files directly.
- Images are loaded through the same Python data URL helper used by Gateway.
- Oversized or failed image reads now instruct the model to stop retrying terminal/PIL/tesseract workarounds and report the limitation honestly.
- Image reads are excluded from the text-file dedup stub path so repeated image reads still return usable image payloads.
Updated session search formatting in tools/session_search_tool.py.
- Uses multimodal-safe text projection for transcript summarization/search views.
Updated display/error handling in agent/display.py.
- Makes tool failure detection safe for structured tool results instead of assuming every result is a string.
Added targeted regression coverage:

How to Test

Verify native Gateway image routing on a vision-capable model.
- Send an image through Gateway with a model that reports supports_vision=True.
- Confirm the model receives native multimodal image input and does not call vision_analyze.
Verify fallback behavior on a non-vision model.
- Send an image through Gateway with a model that does not support native vision.
- Confirm Hermes performs auxiliary analysis and continues the conversation without exposing vision_analyze to the model.
Verify read_file(image) behavior.
- Ask the model to inspect a local image via read_file(path).
- On a native vision model, confirm the image is attached back to the model as structured multimodal content.
- On a non-vision model, confirm read_file automatically returns analyzed text instead of requiring a separate vision tool.
Verify transcript/persistence safety.
- Resume a session with prior image-bearing turns.
- Confirm history reload does not break and multimodal-safe persistence still works.
Run targeted regression tests.
- mkdir -p /tmp/hermes-test-home && HERMES_HOME=/tmp/hermes-test-home python3 -m pytest tests/test_run_agent.py::TestNativeMultimodalRouting tests/test_anthropic_adapter.py tests/agent/test_display.py -q -n0
- python3 -m pytest tests/gateway/test_api_server.py tests/gateway/test_native_multimodal_gateway.py tests/gateway/test_session.py -q
- python3 -m pytest tests/tools/test_file_operations.py tests/tools/test_file_read_guards.py tests/tools/test_file_tools_live.py -q

Checklist

Code

I've read the Contributing Guide
My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
I searched for existing PRs to make sure this isn't a duplicate
My PR contains only changes related to this fix/feature (no unrelated commits)
I've run pytest tests/ -q and all tests pass
I've added tests for my changes (required for bug fixes, strongly encouraged for features)
I've tested on my platform: macOS

Documentation & Housekeeping

I've updated relevant documentation (README, docs/, docstrings) — or N/A
I've updated cli-config.yaml.example if I added/changed config keys — or N/A
I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Screenshots / Logs

Native vision path: Gateway image input now arrives as structured multimodal content instead of being flattened into [The user sent an image ...] text.
Non-vision path: read_file(image) now automatically analyzes the image rather than instructing the model to call vision_analyze.
Chat Completions / Responses / Anthropic Messages now each preserve multimodal content in the protocol-appropriate way instead of forcing the same string-only tool result path.

- Extract vision/audio capabilities from OpenRouter API (architecture.input_modalities) - Add get_model_capabilities() query function in model_metadata.py - Cache capabilities alongside context_length and pricing data - Add tests for capability extraction (vision, audio, text-only models) - Provides foundation for run_agent to decide native vs fallback multimodal handling Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

This changes Hermes from a text-first image fallback pipeline into a multimodal-preserving pipeline for models with native vision support, while keeping the old auxiliary vision path as an explicit fallback for non-vision models. Previous flow - Gateway received an image and eagerly converted it into a plain-text description via vision_analyze. - User image input was flattened into strings like "[The user sent an image~ Here's what I can see: ...]" before the main agent loop. - Chat Completions / Responses / gateway API handling frequently coerced structured content into plain strings. - Even when the resolved model supported native vision, the gateway still routed images through the auxiliary vision tool. - The agent still exposed vision_analyze in its tool surface, so native-vision models could re-route already-native image inputs back through the plugin. - Session storage only preserved the text form, so multimodal history could not round-trip structurally. - Memory/search/compression/token estimation assumed message content was text and often used stringification. - Rough token estimation counted text with len(text)//4 and effectively treated images as a tiny placeholder, which severely undercounted multimodal turns and also undercounted CJK / other non-ASCII-heavy text. New flow - Gateway receives an image, keeps the cached local file, and converts it to a data URL only when constructing native multimodal content. - The gateway now resolves the actual per-turn model first, then checks supports_vision on that resolved model instead of using stale/default model state. - If the resolved model supports vision, the gateway constructs native multimodal content blocks and passes them into the agent loop without flattening them into strings. - If the resolved model does not support vision, the gateway preserves the old auxiliary vision enrichment path as the explicit fallback. - The agent preserves structured content internally and converts between OpenAI Chat Completions, Responses, and Anthropic-style shapes only at protocol boundaries. - For native-vision models, vision_analyze is removed from the model-facing tool surface so the model cannot bounce already-native image input back through the auxiliary plugin. - Session persistence now stores both a text projection and the original structured content JSON, allowing multimodal history to round-trip correctly. - Memory/search/display/compression/token-estimation code now derives text projections from structured content instead of blindly stringifying message objects. - Rough token estimation now uses a conservative mixed-language text heuristic plus image-specific costs based on the OpenAI GPT-5.4 image sizing/tokenization model, so multimodal and non-ASCII conversations are no longer dramatically undercounted. Key implementation details - Added agent/message_content.py as the single helper module for multimodal content handling: - detect image parts - project structured content to text for logging/search/compression - serialize/deserialize structured content for storage - convert chat-style multimodal content into Responses input blocks - Updated gateway/run.py to: - route image handling based on the resolved turn model - construct native image_url content with data URLs for vision-capable models - keep vision_analyze fallback only for non-vision models - avoid assuming message_text is always a string in reply/document/audio/session-hygiene paths - emit a gateway image-routing debug line when request-body debug printing is enabled - Updated run_agent.py to: - preserve structured user/assistant content through the loop - convert multimodal content correctly for Responses API payloads - stop coercing assistant content into plain strings as a stored canonical form - derive text projections only when needed for logs, Honcho sync, trajectory saving, and fallback handling - print outbound provider request bodies behind HERMES_PRINT_API_REQUEST_BODY - Updated model_tools.py and AIAgent initialization to hide vision_analyze from native-vision models. - Updated hermes_state.py to store content_json alongside text content so multimodal transcripts survive reloads. - Updated agent/context_compressor.py, agent/model_metadata.py, and tools/session_search_tool.py to use structured-content-aware text extraction. - Upgraded rough token estimation: - text estimate now uses max(len(text)//4, ascii_chars//4 + non_ascii_chars) - image estimate now parses local files/data URLs to infer dimensions without Pillow - image costs follow a GPT-5.4-style patch budget model for low/high/original detail, with a conservative fallback when dimensions are unavailable Behavioral consequences - Native-vision models now actually receive images as images. - Non-vision models still get the previous text-enrichment behavior. - Multimodal history is preserved across session reloads instead of collapsing irreversibly into text. - Compression/session hygiene/token budgeting are more conservative and more accurate for image-heavy and CJK-heavy conversations. - Request-body debugging is now useful for verifying that provider payloads still contain image blocks rather than fallback text descriptions. Tests added/updated - Added gateway regression coverage for native multimodal routing vs auxiliary-vision fallback, including the case where the resolved turn model differs from the runner's default model. - Added API server tests ensuring chat-completions and responses endpoints preserve multimodal content arrays. - Added session DB coverage verifying structured content round-trips via content_json. - Added agent tests verifying native multimodal payload generation and hiding of auxiliary vision tools for native-vision models. - Added model metadata tests covering conservative non-ASCII text estimation and GPT-5.4-style image token estimation behavior. Validation run during this work - python3 -m pytest tests/agent/test_model_metadata.py -q - python3 -m pytest tests/gateway/test_session_hygiene.py -q - python3 -m pytest tests/gateway/test_native_multimodal_gateway.py tests/gateway/test_api_server.py tests/gateway/test_session.py -q - HERMES_HOME=/tmp/hermes-test-home python3 -m pytest tests/test_run_agent.py::TestNativeMultimodalRouting -q

…odal content Before this change, Hermes had two different image-read flows. Gateway inbound images used Python-side bytes->base64 conversion and could route directly into native multimodal requests, but read_file(image) still behaved like a legacy binary file tool. It either told the model to use vision_analyze, inlined huge base64 JSON strings that were later truncated, or collapsed back into plain text before provider formatting. This also meant the tool-call loop still had multiple string-only assumptions, so structured read_file image results could fail in quiet/display paths or be lost before reaching Responses or Anthropic tool-result payloads. This change makes read_file the single image entrypoint for models. vision_analyze is removed from the model-visible tool surface, Gateway fallback guidance now points to read_file, and read_file(image) becomes capability-aware: - native vision models: read_file loads the image and keeps it as structured multimodal content - non-vision models: read_file automatically runs auxiliary image analysis and returns text guidance without exposing a separate vision tool - oversized or failed image reads: the tool now explicitly tells the model to stop retrying via terminal/PIL/tesseract-style side channels and to report the limitation honestly to the user The implementation also unifies the actual image loading path. Gateway and read_file now share the same Python helper for path->data URL conversion, so image ingestion no longer depends on shell base64 behavior or platform-specific flags. Provider / protocol behavior after this commit: OpenAI Chat Completions - tool messages remain text-only, because the API does not support image blocks in tool role content - read_file(image) therefore returns a sanitized text tool result plus a synthetic multimodal follow-up message containing the image OpenAI Responses - function_call_output.output can now remain structured - read_file(image) tool results are converted into input_text + input_image blocks instead of being force-stringified - tool image payloads are processed before truncation guards, so large base64 blobs are never truncated into unusable JSON strings first Anthropic Messages - tool_result.content now supports structured text + image blocks - read_file(image) no longer gets json.dumps()-flattened on the Anthropic adapter path The file-read dedup layer is also corrected for multimodal files. Text reads still deduplicate by unchanged path/range, but image and binary reads no longer return the generic 'File unchanged since last read' stub. Re-reading the same image must return a fresh image payload, because the earlier result may already have been transformed into a provider-specific multimodal follow-up and is not safely reusable as plain tool text. Token / context handling remains on the newer multimodal-safe path introduced in the previous commit: - original structured content is preserved where needed - text projection is used only for search/logging/compression views - image costs are estimated separately in rough token accounting - display/error-detection code now tolerates structured tool results instead of assuming every tool output is a string Verified with targeted regression coverage: - tests/test_run_agent.py::TestNativeMultimodalRouting - tests/test_anthropic_adapter.py - tests/tools/test_file_operations.py - tests/tools/test_file_read_guards.py::TestFileDedup::test_image_reads_are_not_deduped - tests/agent/test_display.py Anthropic native tool-result image handling is covered by adapter/agent tests in this repo, but was not manually smoke-tested against a live Anthropic endpoint in this round.

Drop the temporary stdout/env-hook debugging that was added while validating native multimodal request assembly. This removes: - HERMES_PRINT_API_REQUEST_BODY request-body printing - HERMES_DUMP_REQUESTS preflight dump trigger - HERMES_DUMP_REQUEST_STDOUT echoing of request dump payloads - the Gateway image-routing stdout trace line The regular error-path request dump helper remains in place, so provider failures can still write a structured debug artifact to the session logs without leaving always-on or env-triggered request/body printing in normal runtime paths. This keeps the multimodal codepath cleaner and avoids leaking large structured payloads or image-bearing request bodies to stdout during real gateway runs.

teknium1 · 2026-05-10T01:09:48Z

Closing as superseded — native multimodal vision routing for vision-capable main models shipped in #16506 (commit ec671c4, "feat(image-input): native multimodal routing based on model vision capability"). It reaches the same architectural goal this PR proposed: capability-aware routing of inbound user-attached images as native image_url content parts when the active model declares supports_vision=true, with the legacy vision_analyze enrichment retained as the fallback for non-vision models. The shipped implementation lives in agent/image_routing.py (decide_image_input_mode, build_native_content_parts) and is wired into the CLI, gateway, and TUI gateway paths.\n\nThanks for the work — the design exploration here directly informed the shipped version. Closing in favor of #16506 to consolidate the open-PR list.

Gang Wang and others added 4 commits April 2, 2026 11:59

latentharbor mentioned this pull request Apr 2, 2026

feat: Computer Use Tool — macOS desktop control via Anthropic native API #4562

Closed

This was referenced Apr 27, 2026

feat(image-input): native multimodal routing based on model vision capability #16506

Merged

[Feature Request] Native multimodal input instead of vision_analyze tool #7641

Closed

alt-glitch added type/feature New feature or request P3 Low — cosmetic, nice to have comp/gateway Gateway runner, session dispatch, delivery comp/agent Core agent loop, run_agent.py, prompt builder tool/vision Vision analysis and image generation labels May 1, 2026

teknium1 closed this May 10, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(multimodal): implement native image content flow across gateway, agent, and read_file, with safer context budgeting#4535

feat(multimodal): implement native image content flow across gateway, agent, and read_file, with safer context budgeting#4535
latentharbor wants to merge 4 commits into
NousResearch:mainfrom
latentharbor:feat/native-multimodal-pr-clean

latentharbor commented Apr 2, 2026

Uh oh!

teknium1 commented May 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

latentharbor commented Apr 2, 2026

What does this PR do?

Related Issue

Type of Change

Changes Made

How to Test

Checklist

Code

Documentation & Housekeeping

Screenshots / Logs

Uh oh!

teknium1 commented May 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants