Feature: Inline Image/PDF Support in read_file — Skip the Extra Tool Call (inspired by Kilocode)

## Overview

When Hermes's `read_file` encounters an image file, it returns "This is an image file. Use the vision_analyze tool to examine it." The agent then needs a second tool call to actually see the image. Kilocode instead returns images inline as base64 attachments, so the agent sees the image in a single tool call.

This is a small quality-of-life improvement that saves one round trip per image read.

---

## Research Findings

### Kilocode (tool/read.ts)

When `read_file` encounters an image or PDF:
- Reads the file as base64
- Returns it as an attachment in the tool result (multipart content)
- The model sees the image inline without a separate tool call
- PDFs are handled similarly

### Hermes (tools/file_operations.py)

When `read_file` encounters an image:

```python
if is_image:
    return "This is an image file. Use the vision_analyze tool to examine it."
```

The agent must then call `vision_analyze(image_url=path, question="...")` as a separate step.

---

## Implementation

In `file_operations.py`, when an image is detected:

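```python
import base64
import os

# Inside read_file, replacing the current redirect message:
if is_image:
    # Derive the MIME type from the file extension; default to PNG if unknown.
    ext = os.path.splitext(resolved_path)[1].lstrip(".").lower()
    mime = {
        "png": "image/png", "jpg": "image/jpeg", "jpeg": "image/jpeg",
        "gif": "image/gif", "webp": "image/webp"
    }.get(ext, "image/png")
    with open(resolved_path, "rb") as f:
        img_data = base64.b64encode(f.read()).decode("ascii")
    # Multipart result: a text header followed by the image as a data URL.
    return [
        {"type": "text", "text": f"Image file: {path} ({os.path.getsize(resolved_path)} bytes)"},
        {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{img_data}"}}
    ]
```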

This requires the tool result handling in `run_agent.py` to accept multipart content (a list of content parts) in addition to plain strings. Whether it already does so still needs to be verified.
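If multipart results turn out not to be supported, one possible approach is a small normalization step at the point where tool results are turned into messages. The sketch below is an assumption about how that could look, not the existing code in `run_agent.py`; the helper names `normalize_tool_result` and `split_text_and_images` are hypothetical, and the part shapes simply mirror the `text` / `image_url` dictionaries returned above.

```python
from typing import Any, Dict, List, Tuple, Union

ContentPart = Dict[str, Any]
ToolResult = Union[str, List[ContentPart]]

# Part types this sketch knows how to forward (assumption: matches the
# shapes produced by the read_file change proposed in this issue).
SUPPORTED_PART_TYPES = {"text", "image_url"}


def normalize_tool_result(result: ToolResult) -> List[ContentPart]:
    """Return a tool result as a list of content parts.

    Plain strings (the current behaviour) become a single text part;
    lists are validated and passed through unchanged.
    """
    if isinstance(result, str):
        return [{"type": "text", "text": result}]
    if isinstance(result, list):
        for part in result:
            if not isinstance(part, dict) or part.get("type") not in SUPPORTED_PART_TYPES:
                raise ValueError(f"Unsupported content part: {part!r}")
        return result
    raise TypeError(f"Unsupported tool result type: {type(result).__name__}")


def split_text_and_images(parts: List[ContentPart]) -> Tuple[str, List[ContentPart]]:
    """Separate text from image parts, for backends that only take text in tool messages."""
    text = "\n".join(p["text"] for p in parts if p["type"] == "text")
    images = [p for p in parts if p["type"] == "image_url"]
    return text, images
```

Existing string-returning tools keep working because they pass through the first branch. The split helper is there because some chat APIs may accept only text in tool-role messages; in that case the image parts would likely need to be forwarded in a follow-up message instead, which is a design decision this sketch leaves open.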

For PDFs, one option is to render pages as images with `pdf2image` or `pymupdf`; another is to fall back to text extraction.

**Effort:** Medium (~50 LOC + tool result format change).

---

## Pros & Cons

### Pros
- Saves one tool call per image read (faster, cheaper)
- More natural workflow — "read this file" works for all file types
- Consistent with how vision-capable models expect image input

### Cons
- Large images consume significant tokens and bandwidth (base64 encoding inflates the payload by ~33%)
- Need to enforce a size limit (don't base64-encode a 50 MB image)
- Requires multipart content support in tool results (may need changes in `run_agent.py`)
- Not all models support image input — need graceful fallback for text-only models
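
The size-limit and text-only-model concerns could both be handled by a guard in front of the encoding step. The following is a minimal sketch under stated assumptions: `read_image_result`, the `supports_vision` flag, and the 5 MB cap are all hypothetical and not part of the current Hermes code; the fallback string is the existing redirect message.

```python
import base64
import os
from typing import Any, Dict, List, Union

# Assumed cap for inline images; the right value depends on the provider.
MAX_INLINE_IMAGE_BYTES = 5 * 1024 * 1024

MIME_BY_EXT = {
    "png": "image/png", "jpg": "image/jpeg", "jpeg": "image/jpeg",
    "gif": "image/gif", "webp": "image/webp",
}

REDIRECT_MESSAGE = "This is an image file. Use the vision_analyze tool to examine it."


def read_image_result(
    path: str,
    supports_vision: bool,
    max_bytes: int = MAX_INLINE_IMAGE_BYTES,
) -> Union[str, List[Dict[str, Any]]]:
    """Return an inline image result, or a text fallback when inlining is not possible."""
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    mime = MIME_BY_EXT.get(ext)

    # Text-only model or unknown image type: keep today's redirect behaviour.
    if not supports_vision or mime is None:
        return REDIRECT_MESSAGE

    # Check the size before reading, so a huge file is never loaded or encoded.
    size = os.path.getsize(path)
    if size > max_bytes:
        return (
            f"Image file: {path} ({size} bytes) exceeds the inline limit of "
            f"{max_bytes} bytes. Use the vision_analyze tool to examine it."
        )

    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return [
        {"type": "text", "text": f"Image file: {path} ({size} bytes)"},
        {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
    ]
```

Because the limit is applied to the raw file size, the encoded payload can be up to roughly a third larger than `max_bytes`; a stricter variant could compare against the encoded length instead.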

---

## References

- [Kilocode tool/read.ts](https://github.com/Kilo-Org/kilocode/blob/main/packages/opencode/src/tool/read.ts) — Inline image reading
- Hermes `tools/file_operations.py` — Current image redirect
- Hermes `tools/vision_tools.py` — Current vision_analyze tool

