Skip to content

Fix #521: inline multimodal read_file attachments#777

Closed
kshitijk4poor wants to merge 2 commits into
NousResearch:mainfrom
kshitijk4poor:fix/issue-521-inline-read-file-multimodal
Closed

Fix #521: inline multimodal read_file attachments#777
kshitijk4poor wants to merge 2 commits into
NousResearch:mainfrom
kshitijk4poor:fix/issue-521-inline-read-file-multimodal

Conversation

@kshitijk4poor

@kshitijk4poor kshitijk4poor commented Mar 9, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • implement issue Feature: Inline Image/PDF Support in read_file — Skip the Extra Tool Call (inspired by Kilocode) #521 by letting read_file inline small image/PDF payloads into the next model turn instead of always bouncing the user to a separate vision step
  • mirror the original KiloCode read-tool behavior closely enough for direct cross-verification against its implementation, docs, and read-tool tests, while adapting Hermes' stringified tool-result architecture
  • keep PDF inlining intentionally limited to supported OpenRouter chat-completions paths and document that provider-specific constraint explicitly

What Changed

  • tools/file_operations.py
    • detect small images and PDFs in read_file
    • return structured attachment metadata plus base64 payload for supported inline handoff
  • run_agent.py
    • strip raw base64 from visible tool output so it never bloats the tool transcript
    • inject a synthetic multimodal follow-up user message so the next model turn can actually see the attachment
    • preserve multimodal user content on the Codex Responses path
    • gate inline image/PDF injection by provider/API mode instead of assuming all chat-completions backends are multimodal-safe
  • tools/file_tools.py
    • document the intentional provider-specific PDF limitation in the tool description
  • tests
    • add regression coverage for inline image/PDF extraction, multimodal injection, Codex conversion, and provider gating/fallback behavior

Cross-Verification Against Original Implementation

This change is intentionally modeled after KiloCode's read tool behavior and can be checked directly against these upstream references:

The main architectural difference is intentional: KiloCode can return attachment objects directly from the tool result, while Hermes tool handlers still return JSON strings. Hermes therefore follows the same user-visible behavior through a different transport: sanitize the tool payload, then inject a synthetic multimodal follow-up message for the next turn.

Provider Behavior

  • images inline on supported multimodal paths:
    • OpenRouter chat-completions
    • OpenAI Codex Responses
  • PDFs inline only on supported OpenRouter chat-completions paths
  • unsupported/text-only backends keep attachment metadata and emit an explicit fallback hint instead of receiving multimodal message parts

Closes #521.

- Extract _classify_attachment() replacing _is_image/_is_pdf/_is_inline_attachment
  to avoid computing os.path.splitext 4 times per read
- Extract module-level MIME_TYPES constant from inline dict
- Add substring guard before json.loads in _build_tool_result_messages
  to skip parsing for ~95% of text-file read_file results
- Fix silent base64 stripping when attachment_kind is None by scoping
  hint/strip logic to known attachment types only
- Extend MAX_TOOL_RESULT_CHARS truncation guard to all tool_messages,
  closing gap where inline user message with ~700KB data URL bypassed it
- Replace nested ternary hint construction with explicit if/elif/else
- Extract _is_openrouter property to deduplicate base_url checks
- Add .strip() to multimodal part values for whitespace safety
@kshitijk4poor kshitijk4poor force-pushed the fix/issue-521-inline-read-file-multimodal branch from 2b1dc51 to 9984383 Compare March 9, 2026 17:28
@teknium1

Copy link
Copy Markdown
Contributor

Thanks for the thorough work on this @kshitijk4poor — the cross-referencing with KiloCode and the provider gating show real attention to detail. However, we're going to pass on this approach for a few architectural reasons:

1. Synthetic user messages break message flow invariants
Hermes enforces strict message alternation: after the system message, it's always user/tool ↔ assistant. Injecting a {"role": "user"} message after a tool result violates this — the expected sequence is tool→assistant, not tool→user→assistant. Some providers will reject this, and it creates a precedent we don't want to maintain.

2. Context cost is too high
A 512KB image inlined as base64 is ~680KB = ~170K tokens in the conversation context. That's nearly an entire 200K context window for one image. The existing flow (read_file detects image → redirects to vision_analyze) uses a cheap auxiliary model and doesn't bloat the main context at all.

3. The existing flow already works well
read_filevision_analyze is clean, works on all providers, and keeps image processing costs separate from the main conversation. This is the right pattern.

4. Complexity in run_agent.py
+130 lines in the core agent loop (including multimodal format conversion and provider gating) is significant maintenance surface for a feature that only works on specific providers.

The PDF inline support being limited to "supported OpenRouter chat-completions paths" further underscores that this is too niche for the complexity it adds.

Appreciate the contribution though — if you're interested in improving the image experience, enhancing the vision_analyze tool itself would be a great direction! 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature: Inline Image/PDF Support in read_file — Skip the Extra Tool Call (inspired by Kilocode)

2 participants