Skip to content

Feature: Inline Image/PDF Support in read_file — Skip the Extra Tool Call (inspired by Kilocode) #521

@teknium1

Description

@teknium1

Overview

When Hermes's read_file encounters an image file, it returns: "This is an image file. Use vision_analyze tool to examine it." The agent then needs a second tool call to actually see the image. Kilocode reads images inline as base64 attachments, letting the agent see the image in a single tool call.

This is a small quality-of-life improvement that saves one round trip per image read.


Research Findings

Kilocode (tool/read.ts)

When read_file encounters an image or PDF:

  • Reads the file as base64
  • Returns it as an attachment in the tool result (multipart content)
  • The model sees the image inline without a separate tool call
  • PDFs are handled similarly

Hermes (tools/file_operations.py)

When read_file encounters an image:

if is_image:
    return "This is an image file. Use the vision_analyze tool to examine it."

The agent must then call vision_analyze(image_url=path, question="...") as a separate step.


Implementation

In file_operations.py, when an image is detected:

if is_image:
    import base64
    with open(resolved_path, "rb") as f:
        img_data = base64.b64encode(f.read()).decode()
    mime = {
        "png": "image/png", "jpg": "image/jpeg", "jpeg": "image/jpeg",
        "gif": "image/gif", "webp": "image/webp"
    }.get(ext, "image/png")
    return [
        {"type": "text", "text": f"Image file: {path} ({os.path.getsize(resolved_path)} bytes)"},
        {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{img_data}"}}
    ]

This requires the tool result handling in run_agent.py to support multipart content (list of content parts) in addition to plain strings. Check if this is already supported.

For PDFs: could use pdf2image or pymupdf to render pages as images, or fall back to text extraction.

Effort: Medium (~50 LOC + tool result format change).


Pros & Cons

Pros

  • Saves one tool call per image read (faster, cheaper)
  • More natural workflow — "read this file" works for all file types
  • Consistent with how vision-capable models expect image input

Cons

  • Large images consume significant tokens (base64 is ~33% overhead)
  • Need to handle size limits (don't base64-encode a 50MB image)
  • Requires multipart content support in tool results (may need changes in run_agent.py)
  • Not all models support image input — need graceful fallback for text-only models

References

  • Kilocode tool/read.ts — Inline image reading
  • Hermes tools/file_operations.py — Current image redirect
  • Hermes tools/vision_tools.py — Current vision_analyze tool

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions