Fix #521: inline multimodal read_file attachments by kshitijk4poor · Pull Request #777 · NousResearch/hermes-agent

kshitijk4poor · 2026-03-09T16:59:13Z

Summary

implement issue Feature: Inline Image/PDF Support in read_file — Skip the Extra Tool Call (inspired by Kilocode) #521 by letting read_file inline small image/PDF payloads into the next model turn instead of always bouncing the user to a separate vision step
mirror the original KiloCode read-tool behavior closely enough for direct cross-verification against its implementation, docs, and read-tool tests, while adapting Hermes' stringified tool-result architecture
keep PDF inlining intentionally limited to supported OpenRouter chat-completions paths and document that provider-specific constraint explicitly

What Changed

tools/file_operations.py
- detect small images and PDFs in read_file
- return structured attachment metadata plus base64 payload for supported inline handoff
run_agent.py
- strip raw base64 from visible tool output so it never bloats the tool transcript
- inject a synthetic multimodal follow-up user message so the next model turn can actually see the attachment
- preserve multimodal user content on the Codex Responses path
- gate inline image/PDF injection by provider/API mode instead of assuming all chat-completions backends are multimodal-safe
tools/file_tools.py
- document the intentional provider-specific PDF limitation in the tool description
tests
- add regression coverage for inline image/PDF extraction, multimodal injection, Codex conversion, and provider gating/fallback behavior

Cross-Verification Against Original Implementation

This change is intentionally modeled after KiloCode's read tool behavior and can be checked directly against these upstream references:

KiloCode implementation: https://github.com/Kilo-Org/kilocode/blob/main/packages/opencode/src/tool/read.ts
KiloCode tests: https://github.com/Kilo-Org/kilocode/blob/main/packages/opencode/test/tool/read.test.ts
KiloCode docs: https://github.com/Kilo-Org/kilocode/blob/main/packages/kilo-docs/pages/automate/tools/read-file.md
Hermes issue: Feature: Inline Image/PDF Support in read_file — Skip the Extra Tool Call (inspired by Kilocode) #521

The main architectural difference is intentional: KiloCode can return attachment objects directly from the tool result, while Hermes tool handlers still return JSON strings. Hermes therefore follows the same user-visible behavior through a different transport: sanitize the tool payload, then inject a synthetic multimodal follow-up message for the next turn.

Provider Behavior

images inline on supported multimodal paths:
- OpenRouter chat-completions
- OpenAI Codex Responses
PDFs inline only on supported OpenRouter chat-completions paths
unsupported/text-only backends keep attachment metadata and emit an explicit fallback hint instead of receiving multimodal message parts

Closes #521.

- Extract _classify_attachment() replacing _is_image/_is_pdf/_is_inline_attachment to avoid computing os.path.splitext 4 times per read - Extract module-level MIME_TYPES constant from inline dict - Add substring guard before json.loads in _build_tool_result_messages to skip parsing for ~95% of text-file read_file results - Fix silent base64 stripping when attachment_kind is None by scoping hint/strip logic to known attachment types only - Extend MAX_TOOL_RESULT_CHARS truncation guard to all tool_messages, closing gap where inline user message with ~700KB data URL bypassed it - Replace nested ternary hint construction with explicit if/elif/else - Extract _is_openrouter property to deduplicate base_url checks - Add .strip() to multimodal part values for whitespace safety

teknium1 · 2026-03-11T10:10:23Z

Thanks for the thorough work on this @kshitijk4poor — the cross-referencing with KiloCode and the provider gating show real attention to detail. However, we're going to pass on this approach for a few architectural reasons:

1. Synthetic user messages break message flow invariants
Hermes enforces strict message alternation: after the system message, it's always user/tool ↔ assistant. Injecting a {"role": "user"} message after a tool result violates this — the expected sequence is tool→assistant, not tool→user→assistant. Some providers will reject this, and it creates a precedent we don't want to maintain.

2. Context cost is too high
A 512KB image inlined as base64 is ~680KB = ~170K tokens in the conversation context. That's nearly an entire 200K context window for one image. The existing flow (read_file detects image → redirects to vision_analyze) uses a cheap auxiliary model and doesn't bloat the main context at all.

3. The existing flow already works well
read_file → vision_analyze is clean, works on all providers, and keeps image processing costs separate from the main conversation. This is the right pattern.

4. Complexity in run_agent.py
+130 lines in the core agent loop (including multimodal format conversion and provider gating) is significant maintenance surface for a feature that only works on specific providers.

The PDF inline support being limited to "supported OpenRouter chat-completions paths" further underscores that this is too niche for the complexity it adds.

Appreciate the contribution though — if you're interested in improving the image experience, enhancing the vision_analyze tool itself would be a great direction! 🙏

kshitijk4poor added 2 commits March 9, 2026 22:28

Fix inline multimodal read_file attachments

079e510

kshitijk4poor force-pushed the fix/issue-521-inline-read-file-multimodal branch from 2b1dc51 to 9984383 Compare March 9, 2026 17:28

teknium1 closed this Mar 11, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix #521: inline multimodal read_file attachments#777

Fix #521: inline multimodal read_file attachments#777
kshitijk4poor wants to merge 2 commits into
NousResearch:mainfrom
kshitijk4poor:fix/issue-521-inline-read-file-multimodal

kshitijk4poor commented Mar 9, 2026 •

edited

Loading

Uh oh!

teknium1 commented Mar 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kshitijk4poor commented Mar 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What Changed

Cross-Verification Against Original Implementation

Provider Behavior

Uh oh!

teknium1 commented Mar 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

kshitijk4poor commented Mar 9, 2026 •

edited

Loading