Skip to content

feat(multimodal): implement native image content flow across gateway, agent, and read_file, with safer context budgeting#4535

Closed
latentharbor wants to merge 4 commits into
NousResearch:mainfrom
latentharbor:feat/native-multimodal-pr-clean
Closed

feat(multimodal): implement native image content flow across gateway, agent, and read_file, with safer context budgeting#4535
latentharbor wants to merge 4 commits into
NousResearch:mainfrom
latentharbor:feat/native-multimodal-pr-clean

Conversation

@latentharbor

Copy link
Copy Markdown

What does this PR do?

This PR implements native multimodal image input across Hermes' Gateway, agent loop, provider adapters, persistence, token budgeting, and file-reading path.

Before this PR, Hermes could accept images at the Gateway level, but the system largely treated them as text:

  • image-bearing user messages were often flattened into strings
  • non-text content was routed through vision_analyze even when the selected model already supported native vision
  • read_file(image) behaved like a legacy binary read path instead of a multimodal input path
  • several downstream layers assumed tool results and message content were always strings
  • rough token estimation significantly underestimated multimodal turns, especially for image-bearing conversations and non-ASCII-heavy text, which made context compression and hygiene less reliable

This PR changes that architecture so Hermes can preserve structured text + image content end-to-end for supported models, while still falling back safely for models that do not support native vision.

Compared with the old flow, the new approach has several advantages:

  • native vision models now receive the actual image instead of a lossy text description
  • Gateway routing now uses the resolved turn model, so vision support is checked against the model actually selected for that turn
  • read_file becomes the single image entrypoint for the model, removing the need to expose a separate vision_analyze tool
  • tool results are no longer forced into strings before provider adaptation
  • session persistence, token estimation, compression, and history handling are multimodal-safe instead of string-assumption-driven
  • repeated image reads no longer collapse into text-only dedup stubs that cannot reconstruct the original image payload
  • context pressure logic now uses safer rough budgeting for multilingual text and image inputs, which reduces the risk of silently overfilling context windows in long multimodal sessions

In short: this PR upgrades Hermes from "image-aware text plumbing" to a working native text + image multimodal content pipeline.

Related Issue

Fixes #

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • Added unified multimodal capability detection in agent/model_metadata.py and hermes_cli/models.py.

    • Introduces a shared supports_vision source of truth so Gateway and agent logic can reliably decide whether to use native image input.
  • Added shared multimodal content helpers in agent/message_content.py.

    • Centralizes:
      • structured content detection
      • text projection for search/logging/compression
      • provider-side content conversion
      • image path -> data URL conversion
    • This is the new foundation for text + image handling.
  • Updated the core agent loop in run_agent.py.

    • Preserves structured multimodal content instead of flattening it prematurely.
    • Adapts read_file(image) results based on model capability:
      • native vision models receive image-bearing structured content
      • non-vision models get automatic auxiliary image analysis
    • Removes vision_analyze from the model-visible tool surface.
    • Fixes ordering so image tool results are processed before any truncation guard.
    • Makes low-frequency tool/display paths tolerant of structured tool results.
    • Removes temporary request-body debug printing used during development.
  • Updated tool surface generation in model_tools.py.

    • Hides vision_analyze from the model and routes image inspection through read_file instead.
  • Updated Gateway inbound image handling in gateway/run.py.

    • Loads incoming images from local cache and converts them to data URLs.
    • Checks vision support against the resolved turn model rather than a stale/default model.
    • Sends native multimodal content directly for vision-capable models.
    • Falls back to auxiliary analysis only for non-vision models.
    • Removes temporary image-routing debug printing used during validation.
  • Updated API server content handling in gateway/platforms/api_server.py.

    • Stops collapsing structured content from API requests into plain strings.
    • Preserves multimodal request content for downstream processing.
  • Updated Anthropic provider adaptation in agent/anthropic_adapter.py.

    • Allows tool_result.content to carry structured text/image blocks instead of always JSON-stringifying tool output.
  • Updated session persistence in hermes_state.py.

    • Adds content_json storage for original structured content.
    • Keeps a text projection for search/preview while preserving raw multimodal content for round-trip history recovery.
  • Updated context/token handling in agent/context_compressor.py and agent/model_metadata.py.

    • Compression and hygiene logic now operate on multimodal-safe text projections instead of assuming raw strings.
    • Rough token estimation is materially improved:
      • text estimation is now more conservative for non-ASCII-heavy languages instead of relying on a pure len(text) // 4 heuristic
      • image inputs are assigned explicit token cost using an OpenAI-style image estimation model instead of being effectively counted as a tiny placeholder
    • This reduces the risk of underestimating prompt size and accidentally overfilling context windows in long multimodal conversations.
  • Updated file tools in tools/file_operations.py and tools/file_tools.py.

    • read_file now supports image files directly.
    • Images are loaded through the same Python data URL helper used by Gateway.
    • Oversized or failed image reads now instruct the model to stop retrying terminal/PIL/tesseract workarounds and report the limitation honestly.
    • Image reads are excluded from the text-file dedup stub path so repeated image reads still return usable image payloads.
  • Updated session search formatting in tools/session_search_tool.py.

    • Uses multimodal-safe text projection for transcript summarization/search views.
  • Updated display/error handling in agent/display.py.

    • Makes tool failure detection safe for structured tool results instead of assuming every result is a string.
  • Added targeted regression coverage:

How to Test

  1. Verify native Gateway image routing on a vision-capable model.

    • Send an image through Gateway with a model that reports supports_vision=True.
    • Confirm the model receives native multimodal image input and does not call vision_analyze.
  2. Verify fallback behavior on a non-vision model.

    • Send an image through Gateway with a model that does not support native vision.
    • Confirm Hermes performs auxiliary analysis and continues the conversation without exposing vision_analyze to the model.
  3. Verify read_file(image) behavior.

    • Ask the model to inspect a local image via read_file(path).
    • On a native vision model, confirm the image is attached back to the model as structured multimodal content.
    • On a non-vision model, confirm read_file automatically returns analyzed text instead of requiring a separate vision tool.
  4. Verify transcript/persistence safety.

    • Resume a session with prior image-bearing turns.
    • Confirm history reload does not break and multimodal-safe persistence still works.
  5. Run targeted regression tests.

    • mkdir -p /tmp/hermes-test-home && HERMES_HOME=/tmp/hermes-test-home python3 -m pytest tests/test_run_agent.py::TestNativeMultimodalRouting tests/test_anthropic_adapter.py tests/agent/test_display.py -q -n0
    • python3 -m pytest tests/gateway/test_api_server.py tests/gateway/test_native_multimodal_gateway.py tests/gateway/test_session.py -q
    • python3 -m pytest tests/tools/test_file_operations.py tests/tools/test_file_read_guards.py tests/tools/test_file_tools_live.py -q

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: macOS

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Screenshots / Logs

  • Native vision path: Gateway image input now arrives as structured multimodal content instead of being flattened into [The user sent an image ...] text.
  • Non-vision path: read_file(image) now automatically analyzes the image rather than instructing the model to call vision_analyze.
  • Chat Completions / Responses / Anthropic Messages now each preserve multimodal content in the protocol-appropriate way instead of forcing the same string-only tool result path.

Gang Wang and others added 4 commits April 2, 2026 11:59
- Extract vision/audio capabilities from OpenRouter API (architecture.input_modalities)
- Add get_model_capabilities() query function in model_metadata.py
- Cache capabilities alongside context_length and pricing data
- Add tests for capability extraction (vision, audio, text-only models)
- Provides foundation for run_agent to decide native vs fallback multimodal handling

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This changes Hermes from a text-first image fallback pipeline into a multimodal-preserving pipeline for models with native vision support, while keeping the old auxiliary vision path as an explicit fallback for non-vision models.

Previous flow
- Gateway received an image and eagerly converted it into a plain-text description via vision_analyze.
- User image input was flattened into strings like "[The user sent an image~ Here's what I can see: ...]" before the main agent loop.
- Chat Completions / Responses / gateway API handling frequently coerced structured content into plain strings.
- Even when the resolved model supported native vision, the gateway still routed images through the auxiliary vision tool.
- The agent still exposed vision_analyze in its tool surface, so native-vision models could re-route already-native image inputs back through the plugin.
- Session storage only preserved the text form, so multimodal history could not round-trip structurally.
- Memory/search/compression/token estimation assumed message content was text and often used stringification.
- Rough token estimation counted text with len(text)//4 and effectively treated images as a tiny placeholder, which severely undercounted multimodal turns and also undercounted CJK / other non-ASCII-heavy text.

New flow
- Gateway receives an image, keeps the cached local file, and converts it to a data URL only when constructing native multimodal content.
- The gateway now resolves the actual per-turn model first, then checks supports_vision on that resolved model instead of using stale/default model state.
- If the resolved model supports vision, the gateway constructs native multimodal content blocks and passes them into the agent loop without flattening them into strings.
- If the resolved model does not support vision, the gateway preserves the old auxiliary vision enrichment path as the explicit fallback.
- The agent preserves structured content internally and converts between OpenAI Chat Completions, Responses, and Anthropic-style shapes only at protocol boundaries.
- For native-vision models, vision_analyze is removed from the model-facing tool surface so the model cannot bounce already-native image input back through the auxiliary plugin.
- Session persistence now stores both a text projection and the original structured content JSON, allowing multimodal history to round-trip correctly.
- Memory/search/display/compression/token-estimation code now derives text projections from structured content instead of blindly stringifying message objects.
- Rough token estimation now uses a conservative mixed-language text heuristic plus image-specific costs based on the OpenAI GPT-5.4 image sizing/tokenization model, so multimodal and non-ASCII conversations are no longer dramatically undercounted.

Key implementation details
- Added agent/message_content.py as the single helper module for multimodal content handling:
  - detect image parts
  - project structured content to text for logging/search/compression
  - serialize/deserialize structured content for storage
  - convert chat-style multimodal content into Responses input blocks
- Updated gateway/run.py to:
  - route image handling based on the resolved turn model
  - construct native image_url content with data URLs for vision-capable models
  - keep vision_analyze fallback only for non-vision models
  - avoid assuming message_text is always a string in reply/document/audio/session-hygiene paths
  - emit a gateway image-routing debug line when request-body debug printing is enabled
- Updated run_agent.py to:
  - preserve structured user/assistant content through the loop
  - convert multimodal content correctly for Responses API payloads
  - stop coercing assistant content into plain strings as a stored canonical form
  - derive text projections only when needed for logs, Honcho sync, trajectory saving, and fallback handling
  - print outbound provider request bodies behind HERMES_PRINT_API_REQUEST_BODY
- Updated model_tools.py and AIAgent initialization to hide vision_analyze from native-vision models.
- Updated hermes_state.py to store content_json alongside text content so multimodal transcripts survive reloads.
- Updated agent/context_compressor.py, agent/model_metadata.py, and tools/session_search_tool.py to use structured-content-aware text extraction.
- Upgraded rough token estimation:
  - text estimate now uses max(len(text)//4, ascii_chars//4 + non_ascii_chars)
  - image estimate now parses local files/data URLs to infer dimensions without Pillow
  - image costs follow a GPT-5.4-style patch budget model for low/high/original detail, with a conservative fallback when dimensions are unavailable

Behavioral consequences
- Native-vision models now actually receive images as images.
- Non-vision models still get the previous text-enrichment behavior.
- Multimodal history is preserved across session reloads instead of collapsing irreversibly into text.
- Compression/session hygiene/token budgeting are more conservative and more accurate for image-heavy and CJK-heavy conversations.
- Request-body debugging is now useful for verifying that provider payloads still contain image blocks rather than fallback text descriptions.

Tests added/updated
- Added gateway regression coverage for native multimodal routing vs auxiliary-vision fallback, including the case where the resolved turn model differs from the runner's default model.
- Added API server tests ensuring chat-completions and responses endpoints preserve multimodal content arrays.
- Added session DB coverage verifying structured content round-trips via content_json.
- Added agent tests verifying native multimodal payload generation and hiding of auxiliary vision tools for native-vision models.
- Added model metadata tests covering conservative non-ASCII text estimation and GPT-5.4-style image token estimation behavior.

Validation run during this work
- python3 -m pytest tests/agent/test_model_metadata.py -q
- python3 -m pytest tests/gateway/test_session_hygiene.py -q
- python3 -m pytest tests/gateway/test_native_multimodal_gateway.py tests/gateway/test_api_server.py tests/gateway/test_session.py -q
- HERMES_HOME=/tmp/hermes-test-home python3 -m pytest tests/test_run_agent.py::TestNativeMultimodalRouting -q
…odal content

Before this change, Hermes had two different image-read flows. Gateway inbound images used Python-side bytes->base64 conversion and could route directly into native multimodal requests, but read_file(image) still behaved like a legacy binary file tool. It either told the model to use vision_analyze, inlined huge base64 JSON strings that were later truncated, or collapsed back into plain text before provider formatting. This also meant the tool-call loop still had multiple string-only assumptions, so structured read_file image results could fail in quiet/display paths or be lost before reaching Responses or Anthropic tool-result payloads.

This change makes read_file the single image entrypoint for models. vision_analyze is removed from the model-visible tool surface, Gateway fallback guidance now points to read_file, and read_file(image) becomes capability-aware:

- native vision models: read_file loads the image and keeps it as structured multimodal content
- non-vision models: read_file automatically runs auxiliary image analysis and returns text guidance without exposing a separate vision tool
- oversized or failed image reads: the tool now explicitly tells the model to stop retrying via terminal/PIL/tesseract-style side channels and to report the limitation honestly to the user

The implementation also unifies the actual image loading path. Gateway and read_file now share the same Python helper for path->data URL conversion, so image ingestion no longer depends on shell base64 behavior or platform-specific flags.

Provider / protocol behavior after this commit:

OpenAI Chat Completions
- tool messages remain text-only, because the API does not support image blocks in tool role content
- read_file(image) therefore returns a sanitized text tool result plus a synthetic multimodal follow-up message containing the image

OpenAI Responses
- function_call_output.output can now remain structured
- read_file(image) tool results are converted into input_text + input_image blocks instead of being force-stringified
- tool image payloads are processed before truncation guards, so large base64 blobs are never truncated into unusable JSON strings first

Anthropic Messages
- tool_result.content now supports structured text + image blocks
- read_file(image) no longer gets json.dumps()-flattened on the Anthropic adapter path

The file-read dedup layer is also corrected for multimodal files. Text reads still deduplicate by unchanged path/range, but image and binary reads no longer return the generic 'File unchanged since last read' stub. Re-reading the same image must return a fresh image payload, because the earlier result may already have been transformed into a provider-specific multimodal follow-up and is not safely reusable as plain tool text.

Token / context handling remains on the newer multimodal-safe path introduced in the previous commit:
- original structured content is preserved where needed
- text projection is used only for search/logging/compression views
- image costs are estimated separately in rough token accounting
- display/error-detection code now tolerates structured tool results instead of assuming every tool output is a string

Verified with targeted regression coverage:
- tests/test_run_agent.py::TestNativeMultimodalRouting
- tests/test_anthropic_adapter.py
- tests/tools/test_file_operations.py
- tests/tools/test_file_read_guards.py::TestFileDedup::test_image_reads_are_not_deduped
- tests/agent/test_display.py

Anthropic native tool-result image handling is covered by adapter/agent tests in this repo, but was not manually smoke-tested against a live Anthropic endpoint in this round.
Drop the temporary stdout/env-hook debugging that was added while validating native multimodal request assembly.

This removes:
- HERMES_PRINT_API_REQUEST_BODY request-body printing
- HERMES_DUMP_REQUESTS preflight dump trigger
- HERMES_DUMP_REQUEST_STDOUT echoing of request dump payloads
- the Gateway image-routing stdout trace line

The regular error-path request dump helper remains in place, so provider failures can still write a structured debug artifact to the session logs without leaving always-on or env-triggered request/body printing in normal runtime paths.

This keeps the multimodal codepath cleaner and avoids leaking large structured payloads or image-bearing request bodies to stdout during real gateway runs.
@alt-glitch alt-glitch added type/feature New feature or request P3 Low — cosmetic, nice to have comp/gateway Gateway runner, session dispatch, delivery comp/agent Core agent loop, run_agent.py, prompt builder tool/vision Vision analysis and image generation labels May 1, 2026
@teknium1

Copy link
Copy Markdown
Contributor

Closing as superseded — native multimodal vision routing for vision-capable main models shipped in #16506 (commit ec671c4, "feat(image-input): native multimodal routing based on model vision capability"). It reaches the same architectural goal this PR proposed: capability-aware routing of inbound user-attached images as native image_url content parts when the active model declares supports_vision=true, with the legacy vision_analyze enrichment retained as the fallback for non-vision models. The shipped implementation lives in agent/image_routing.py (decide_image_input_mode, build_native_content_parts) and is wired into the CLI, gateway, and TUI gateway paths.\n\nThanks for the work — the design exploration here directly informed the shipped version. Closing in favor of #16506 to consolidate the open-PR list.

@teknium1 teknium1 closed this May 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder comp/gateway Gateway runner, session dispatch, delivery P3 Low — cosmetic, nice to have tool/vision Vision analysis and image generation type/feature New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants