feat(multimodal): implement native image content flow across gateway, agent, and read_file, with safer context budgeting#4535
Conversation
- Extract vision/audio capabilities from OpenRouter API (architecture.input_modalities) - Add get_model_capabilities() query function in model_metadata.py - Cache capabilities alongside context_length and pricing data - Add tests for capability extraction (vision, audio, text-only models) - Provides foundation for run_agent to decide native vs fallback multimodal handling Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This changes Hermes from a text-first image fallback pipeline into a multimodal-preserving pipeline for models with native vision support, while keeping the old auxiliary vision path as an explicit fallback for non-vision models. Previous flow - Gateway received an image and eagerly converted it into a plain-text description via vision_analyze. - User image input was flattened into strings like "[The user sent an image~ Here's what I can see: ...]" before the main agent loop. - Chat Completions / Responses / gateway API handling frequently coerced structured content into plain strings. - Even when the resolved model supported native vision, the gateway still routed images through the auxiliary vision tool. - The agent still exposed vision_analyze in its tool surface, so native-vision models could re-route already-native image inputs back through the plugin. - Session storage only preserved the text form, so multimodal history could not round-trip structurally. - Memory/search/compression/token estimation assumed message content was text and often used stringification. - Rough token estimation counted text with len(text)//4 and effectively treated images as a tiny placeholder, which severely undercounted multimodal turns and also undercounted CJK / other non-ASCII-heavy text. New flow - Gateway receives an image, keeps the cached local file, and converts it to a data URL only when constructing native multimodal content. - The gateway now resolves the actual per-turn model first, then checks supports_vision on that resolved model instead of using stale/default model state. - If the resolved model supports vision, the gateway constructs native multimodal content blocks and passes them into the agent loop without flattening them into strings. - If the resolved model does not support vision, the gateway preserves the old auxiliary vision enrichment path as the explicit fallback. - The agent preserves structured content internally and converts between OpenAI Chat Completions, Responses, and Anthropic-style shapes only at protocol boundaries. - For native-vision models, vision_analyze is removed from the model-facing tool surface so the model cannot bounce already-native image input back through the auxiliary plugin. - Session persistence now stores both a text projection and the original structured content JSON, allowing multimodal history to round-trip correctly. - Memory/search/display/compression/token-estimation code now derives text projections from structured content instead of blindly stringifying message objects. - Rough token estimation now uses a conservative mixed-language text heuristic plus image-specific costs based on the OpenAI GPT-5.4 image sizing/tokenization model, so multimodal and non-ASCII conversations are no longer dramatically undercounted. Key implementation details - Added agent/message_content.py as the single helper module for multimodal content handling: - detect image parts - project structured content to text for logging/search/compression - serialize/deserialize structured content for storage - convert chat-style multimodal content into Responses input blocks - Updated gateway/run.py to: - route image handling based on the resolved turn model - construct native image_url content with data URLs for vision-capable models - keep vision_analyze fallback only for non-vision models - avoid assuming message_text is always a string in reply/document/audio/session-hygiene paths - emit a gateway image-routing debug line when request-body debug printing is enabled - Updated run_agent.py to: - preserve structured user/assistant content through the loop - convert multimodal content correctly for Responses API payloads - stop coercing assistant content into plain strings as a stored canonical form - derive text projections only when needed for logs, Honcho sync, trajectory saving, and fallback handling - print outbound provider request bodies behind HERMES_PRINT_API_REQUEST_BODY - Updated model_tools.py and AIAgent initialization to hide vision_analyze from native-vision models. - Updated hermes_state.py to store content_json alongside text content so multimodal transcripts survive reloads. - Updated agent/context_compressor.py, agent/model_metadata.py, and tools/session_search_tool.py to use structured-content-aware text extraction. - Upgraded rough token estimation: - text estimate now uses max(len(text)//4, ascii_chars//4 + non_ascii_chars) - image estimate now parses local files/data URLs to infer dimensions without Pillow - image costs follow a GPT-5.4-style patch budget model for low/high/original detail, with a conservative fallback when dimensions are unavailable Behavioral consequences - Native-vision models now actually receive images as images. - Non-vision models still get the previous text-enrichment behavior. - Multimodal history is preserved across session reloads instead of collapsing irreversibly into text. - Compression/session hygiene/token budgeting are more conservative and more accurate for image-heavy and CJK-heavy conversations. - Request-body debugging is now useful for verifying that provider payloads still contain image blocks rather than fallback text descriptions. Tests added/updated - Added gateway regression coverage for native multimodal routing vs auxiliary-vision fallback, including the case where the resolved turn model differs from the runner's default model. - Added API server tests ensuring chat-completions and responses endpoints preserve multimodal content arrays. - Added session DB coverage verifying structured content round-trips via content_json. - Added agent tests verifying native multimodal payload generation and hiding of auxiliary vision tools for native-vision models. - Added model metadata tests covering conservative non-ASCII text estimation and GPT-5.4-style image token estimation behavior. Validation run during this work - python3 -m pytest tests/agent/test_model_metadata.py -q - python3 -m pytest tests/gateway/test_session_hygiene.py -q - python3 -m pytest tests/gateway/test_native_multimodal_gateway.py tests/gateway/test_api_server.py tests/gateway/test_session.py -q - HERMES_HOME=/tmp/hermes-test-home python3 -m pytest tests/test_run_agent.py::TestNativeMultimodalRouting -q
…odal content Before this change, Hermes had two different image-read flows. Gateway inbound images used Python-side bytes->base64 conversion and could route directly into native multimodal requests, but read_file(image) still behaved like a legacy binary file tool. It either told the model to use vision_analyze, inlined huge base64 JSON strings that were later truncated, or collapsed back into plain text before provider formatting. This also meant the tool-call loop still had multiple string-only assumptions, so structured read_file image results could fail in quiet/display paths or be lost before reaching Responses or Anthropic tool-result payloads. This change makes read_file the single image entrypoint for models. vision_analyze is removed from the model-visible tool surface, Gateway fallback guidance now points to read_file, and read_file(image) becomes capability-aware: - native vision models: read_file loads the image and keeps it as structured multimodal content - non-vision models: read_file automatically runs auxiliary image analysis and returns text guidance without exposing a separate vision tool - oversized or failed image reads: the tool now explicitly tells the model to stop retrying via terminal/PIL/tesseract-style side channels and to report the limitation honestly to the user The implementation also unifies the actual image loading path. Gateway and read_file now share the same Python helper for path->data URL conversion, so image ingestion no longer depends on shell base64 behavior or platform-specific flags. Provider / protocol behavior after this commit: OpenAI Chat Completions - tool messages remain text-only, because the API does not support image blocks in tool role content - read_file(image) therefore returns a sanitized text tool result plus a synthetic multimodal follow-up message containing the image OpenAI Responses - function_call_output.output can now remain structured - read_file(image) tool results are converted into input_text + input_image blocks instead of being force-stringified - tool image payloads are processed before truncation guards, so large base64 blobs are never truncated into unusable JSON strings first Anthropic Messages - tool_result.content now supports structured text + image blocks - read_file(image) no longer gets json.dumps()-flattened on the Anthropic adapter path The file-read dedup layer is also corrected for multimodal files. Text reads still deduplicate by unchanged path/range, but image and binary reads no longer return the generic 'File unchanged since last read' stub. Re-reading the same image must return a fresh image payload, because the earlier result may already have been transformed into a provider-specific multimodal follow-up and is not safely reusable as plain tool text. Token / context handling remains on the newer multimodal-safe path introduced in the previous commit: - original structured content is preserved where needed - text projection is used only for search/logging/compression views - image costs are estimated separately in rough token accounting - display/error-detection code now tolerates structured tool results instead of assuming every tool output is a string Verified with targeted regression coverage: - tests/test_run_agent.py::TestNativeMultimodalRouting - tests/test_anthropic_adapter.py - tests/tools/test_file_operations.py - tests/tools/test_file_read_guards.py::TestFileDedup::test_image_reads_are_not_deduped - tests/agent/test_display.py Anthropic native tool-result image handling is covered by adapter/agent tests in this repo, but was not manually smoke-tested against a live Anthropic endpoint in this round.
Drop the temporary stdout/env-hook debugging that was added while validating native multimodal request assembly. This removes: - HERMES_PRINT_API_REQUEST_BODY request-body printing - HERMES_DUMP_REQUESTS preflight dump trigger - HERMES_DUMP_REQUEST_STDOUT echoing of request dump payloads - the Gateway image-routing stdout trace line The regular error-path request dump helper remains in place, so provider failures can still write a structured debug artifact to the session logs without leaving always-on or env-triggered request/body printing in normal runtime paths. This keeps the multimodal codepath cleaner and avoids leaking large structured payloads or image-bearing request bodies to stdout during real gateway runs.
|
Closing as superseded — native multimodal vision routing for vision-capable main models shipped in #16506 (commit ec671c4, "feat(image-input): native multimodal routing based on model vision capability"). It reaches the same architectural goal this PR proposed: capability-aware routing of inbound user-attached images as native |
What does this PR do?
This PR implements native multimodal image input across Hermes' Gateway, agent loop, provider adapters, persistence, token budgeting, and file-reading path.
Before this PR, Hermes could accept images at the Gateway level, but the system largely treated them as text:
vision_analyzeeven when the selected model already supported native visionread_file(image)behaved like a legacy binary read path instead of a multimodal input pathThis PR changes that architecture so Hermes can preserve structured
text + imagecontent end-to-end for supported models, while still falling back safely for models that do not support native vision.Compared with the old flow, the new approach has several advantages:
read_filebecomes the single image entrypoint for the model, removing the need to expose a separatevision_analyzetoolIn short: this PR upgrades Hermes from "image-aware text plumbing" to a working native
text + imagemultimodal content pipeline.Related Issue
Fixes #
Type of Change
Changes Made
Added unified multimodal capability detection in agent/model_metadata.py and hermes_cli/models.py.
supports_visionsource of truth so Gateway and agent logic can reliably decide whether to use native image input.Added shared multimodal content helpers in agent/message_content.py.
text + imagehandling.Updated the core agent loop in run_agent.py.
read_file(image)results based on model capability:vision_analyzefrom the model-visible tool surface.Updated tool surface generation in model_tools.py.
vision_analyzefrom the model and routes image inspection throughread_fileinstead.Updated Gateway inbound image handling in gateway/run.py.
Updated API server content handling in gateway/platforms/api_server.py.
Updated Anthropic provider adaptation in agent/anthropic_adapter.py.
tool_result.contentto carry structured text/image blocks instead of always JSON-stringifying tool output.Updated session persistence in hermes_state.py.
content_jsonstorage for original structured content.Updated context/token handling in agent/context_compressor.py and agent/model_metadata.py.
len(text) // 4heuristicUpdated file tools in tools/file_operations.py and tools/file_tools.py.
read_filenow supports image files directly.Updated session search formatting in tools/session_search_tool.py.
Updated display/error handling in agent/display.py.
Added targeted regression coverage:
How to Test
Verify native Gateway image routing on a vision-capable model.
supports_vision=True.vision_analyze.Verify fallback behavior on a non-vision model.
vision_analyzeto the model.Verify
read_file(image)behavior.read_file(path).read_fileautomatically returns analyzed text instead of requiring a separate vision tool.Verify transcript/persistence safety.
Run targeted regression tests.
mkdir -p /tmp/hermes-test-home && HERMES_HOME=/tmp/hermes-test-home python3 -m pytest tests/test_run_agent.py::TestNativeMultimodalRouting tests/test_anthropic_adapter.py tests/agent/test_display.py -q -n0python3 -m pytest tests/gateway/test_api_server.py tests/gateway/test_native_multimodal_gateway.py tests/gateway/test_session.py -qpython3 -m pytest tests/tools/test_file_operations.py tests/tools/test_file_read_guards.py tests/tools/test_file_tools_live.py -qChecklist
Code
fix(scope):,feat(scope):, etc.)pytest tests/ -qand all tests passDocumentation & Housekeeping
docs/, docstrings) — or N/Acli-config.yaml.exampleif I added/changed config keys — or N/ACONTRIBUTING.mdorAGENTS.mdif I changed architecture or workflows — or N/AScreenshots / Logs
[The user sent an image ...]text.read_file(image)now automatically analyzes the image rather than instructing the model to callvision_analyze.