Overview
When Hermes's read_file encounters an image file, it returns: "This is an image file. Use the vision_analyze tool to examine it." The agent then needs a second tool call to actually see the image. Kilocode instead reads images inline as base64 attachments, so the agent sees the image in a single tool call.
This is a small quality-of-life improvement that saves one round trip per image read.
Research Findings
Kilocode (tool/read.ts)
When read_file encounters an image or PDF:
- Reads the file as base64
- Returns it as an attachment in the tool result (multipart content)
- The model sees the image inline without a separate tool call
- PDFs are handled similarly
Hermes (tools/file_operations.py)
When read_file encounters an image:
if is_image:
    return "This is an image file. Use the vision_analyze tool to examine it."
The agent must then call vision_analyze(image_url=path, question="...") as a separate step.
Implementation
In file_operations.py, when an image is detected:
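if is_image:
    import base64  # better placed with the module-level imports
    # Read the raw bytes and encode them for a data URL.
    with open(resolved_path, "rb") as f:
        img_data = base64.b64encode(f.read()).decode()
    # ext is assumed to be the lower-cased extension without the dot.
    mime = {
        "png": "image/png", "jpg": "image/jpeg", "jpeg": "image/jpeg",
        "gif": "image/gif", "webp": "image/webp"
    }.get(ext, "image/png")
    # Return a list of content parts instead of a plain string.
    return [
        {"type": "text", "text": f"Image file: {path} ({os.path.getsize(resolved_path)} bytes)"},
        {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{img_data}"}}
    ]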
This requires the tool result handling in run_agent.py to support multipart content (list of content parts) in addition to plain strings. Check if this is already supported.
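One way to accept both shapes is to normalize every tool result to a list of content parts before it is added to the message history. The sketch below assumes the OpenAI-style content-part dictionaries used in the snippet above. The names normalize_tool_result and flatten_for_text_model are hypothetical helpers for illustration, not existing functions in run_agent.py.

```python
from typing import Any, Dict, List, Union

ContentPart = Dict[str, Any]
ToolResult = Union[str, List[ContentPart]]


def normalize_tool_result(result: ToolResult) -> List[ContentPart]:
    """Return the tool result as a list of content parts.

    Hypothetical helper: a plain string becomes a single text part,
    and a list of parts is copied through unchanged.
    """
    if isinstance(result, str):
        return [{"type": "text", "text": result}]
    return list(result)


def flatten_for_text_model(parts: List[ContentPart]) -> str:
    """Collapse content parts to plain text for models without image input.

    Non-text parts are replaced by a short placeholder so the model
    can still tell that something was omitted.
    """
    chunks = []
    for part in parts:
        if part.get("type") == "text":
            chunks.append(part["text"])
        else:
            chunks.append(f"[{part.get('type', 'unknown')} part omitted]")
    return "\n".join(chunks)
```

With this shape, existing tools that return strings need no changes, and only the code that builds the tool message has to know about lists.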
For PDFs, pages could be rendered as images with pdf2image or pymupdf, or the tool could fall back to text extraction.
Effort: Medium (~50 LOC + tool result format change).
Pros & Cons
Pros
- Saves one tool call per image read (faster, cheaper)
- More natural workflow — "read this file" works for all file types
- Consistent with how vision-capable models expect image input
Cons
- Large images consume significant tokens (base64 is ~33% overhead)
- Need to handle size limits (don't base64-encode a 50MB image)
- Requires multipart content support in tool results (may need changes in run_agent.py)
- Not all models support image input — need graceful fallback for text-only models
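The size and fallback concerns above could be handled with a guard in front of the encoding step. This is a minimal sketch under stated assumptions: the 5 MB cap (MAX_IMAGE_BYTES), the supports_images flag, and the function name read_image_result are all hypothetical, and falling back to the existing redirect message is one possible choice, not a confirmed design.

```python
import base64
import os

# Assumed cap on the raw file size; the right value depends on the model.
MAX_IMAGE_BYTES = 5 * 1024 * 1024

REDIRECT_MESSAGE = "This is an image file. Use the vision_analyze tool to examine it."


def base64_length(raw_size: int) -> int:
    """Length in characters of the padded base64 encoding of raw_size bytes."""
    return 4 * ((raw_size + 2) // 3)


def read_image_result(path: str, mime: str, supports_images: bool = True):
    """Return inline image parts, or the redirect string as a fallback.

    Falls back when the model cannot take image input or when the file
    is larger than MAX_IMAGE_BYTES, so oversized files are never read.
    """
    size = os.path.getsize(path)
    if not supports_images or size > MAX_IMAGE_BYTES:
        return REDIRECT_MESSAGE
    with open(path, "rb") as f:
        img_data = base64.b64encode(f.read()).decode("ascii")
    return [
        {"type": "text", "text": f"Image file: {path} ({size} bytes)"},
        {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{img_data}"}},
    ]
```

Because base64 emits 4 characters for every 3 input bytes, base64_length makes the overhead explicit: a 3 MiB image becomes 4 MiB of encoded text. Checking the size before opening the file keeps a very large image from being loaded into memory at all.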
References
- Kilocode tool/read.ts: inline image reading
- Hermes tools/file_operations.py: current image redirect
- Hermes tools/vision_tools.py: current vision_analyze tool