feat(provider): send attached images to vision-capable models by esengine · Pull Request #4204 · esengine/DeepSeek-Reasonix

esengine · 2026-06-12T15:00:57Z

Refs #4158. First cut of multimodal: send attached images to vision-capable models. Video / keyframe decoding is intentionally out of scope.

The gap

Images dragged/pasted into chat are saved to .reasonix/attachments/ and referenced as @…png. Today the controller turns that into a text block — [image attachment available… use a vision MCP tool] — and provider.Message.Content is a plain string, so the model never receives image data. The issue's root-cause read (openai chatMessage.Content is *string) was correct.

What changed

- provider.Message gains an Images []string field (data URLs). Agent history is []provider.Message, so this one field carries images through history, session persistence, and both providers.
- openai: chatMessage.Content changes from *string to any. A vision user turn sends a []chatContentPart array (text + image_url); every other message keeps the same string/null shape as before.
- anthropic: a {type:image, source:{type:base64,media_type,data}} block sits beside the text block (parsed from the data URL via the shared provider.ParseImageDataURL).
- Gating: a per-model vision config flag (forwarded through Config.Extra). Only vision models get image parts; text-only models never receive them (they would return HTTP 400) and their prompt-cache prefix is byte-for-byte unchanged. Set vision = true on the DeepSeek model entry once its multimodal API lands.
- Threading: the controller resolves image @refs to data URLs and passes them via agent.WithUserImages(ctx, …), mirroring WithParentSession, so agent stays free of an internal/control import.
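
For readers who want to see the two wire shapes side by side, here is a minimal sketch of the gating described above. It is not the code from this PR: `Message`, `part`, `urlRef`, `b64Source`, `openAIContent`, `anthropicBlocks`, and `parseImageDataURL` are hypothetical stand-ins (the last one for provider.ParseImageDataURL, whose real signature is not shown here), and putting the text block before the image blocks is an assumption. The null content shape used by non-user messages is not modeled.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// Message is a simplified, hypothetical stand-in for provider.Message.
type Message struct {
	Role, Content string
	Images        []string // data URLs
}

// parseImageDataURL is a hypothetical stand-in for provider.ParseImageDataURL:
// it splits "data:image/<type>;base64,<payload>" into media type and payload.
func parseImageDataURL(u string) (mediaType, data string, err error) {
	if !strings.HasPrefix(u, "data:image/") {
		return "", "", errors.New("not an image data URL")
	}
	i := strings.Index(u, ";base64,")
	if i < 0 {
		return "", "", errors.New("not base64-encoded")
	}
	return u[len("data:"):i], u[i+len(";base64,"):], nil
}

// part covers both wire shapes; omitempty drops the fields a block type does not use.
type part struct {
	Type     string     `json:"type"`
	Text     string     `json:"text,omitempty"`
	ImageURL *urlRef    `json:"image_url,omitempty"` // OpenAI-style image part
	Source   *b64Source `json:"source,omitempty"`    // Anthropic-style image block
}

type urlRef struct {
	URL string `json:"url"`
}

type b64Source struct {
	Type      string `json:"type"`
	MediaType string `json:"media_type"`
	Data      string `json:"data"`
}

// openAIContent returns a plain string unless this is a vision user turn with images.
func openAIContent(m Message, vision bool) any {
	if !vision || m.Role != "user" || len(m.Images) == 0 {
		return m.Content
	}
	parts := []part{{Type: "text", Text: m.Content}}
	for _, u := range m.Images {
		parts = append(parts, part{Type: "image_url", ImageURL: &urlRef{URL: u}})
	}
	return parts
}

// anthropicBlocks returns a text block, plus one base64 image block per image when vision is on.
func anthropicBlocks(m Message, vision bool) ([]part, error) {
	blocks := []part{{Type: "text", Text: m.Content}}
	if !vision || m.Role != "user" {
		return blocks, nil
	}
	for _, u := range m.Images {
		mt, data, err := parseImageDataURL(u)
		if err != nil {
			return nil, err
		}
		src := &b64Source{Type: "base64", MediaType: mt, Data: data}
		blocks = append(blocks, part{Type: "image", Source: src})
	}
	return blocks, nil
}

func main() {
	m := Message{
		Role:    "user",
		Content: "What is in this screenshot?",
		Images:  []string{"data:image/png;base64,iVBORw0KGgo="},
	}
	for _, vision := range []bool{false, true} {
		oa, _ := json.Marshal(openAIContent(m, vision))
		blocks, err := anthropicBlocks(m, vision)
		if err != nil {
			panic(err)
		}
		an, _ := json.Marshal(blocks)
		fmt.Printf("vision=%v\n  openai:    %s\n  anthropic: %s\n", vision, oa, an)
	}
}
```

With vision off, the OpenAI-style content stays a bare JSON string, which is what keeps text-only requests identical to their pre-change form.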

Cache safety

Images live only in user turns; embedding one doesn't retroactively invalidate the prefix before it (append-only history). Text-only flows are untouched because of the vision gate.
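
The append-only argument can be checked mechanically. The sketch below is illustrative only: the `Message` struct with its JSON tags and the `stablePrefix` helper are hypothetical and do not reflect the real session format. It serializes each message on its own and counts how many leading messages are byte-identical before and after an image turn is appended.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Message is a hypothetical, simplified history entry.
type Message struct {
	Role    string   `json:"role"`
	Content string   `json:"content"`
	Images  []string `json:"images,omitempty"`
}

// stablePrefix reports how many leading messages serialize to identical
// bytes in both histories.
func stablePrefix(before, after []Message) int {
	n := 0
	for n < len(before) && n < len(after) {
		a, _ := json.Marshal(before[n])
		b, _ := json.Marshal(after[n])
		if string(a) != string(b) {
			break
		}
		n++
	}
	return n
}

func main() {
	history := []Message{
		{Role: "system", Content: "You are a coding agent."},
		{Role: "user", Content: "Open main.go"},
		{Role: "assistant", Content: "Done."},
	}
	// Appending an image turn only adds bytes after the existing messages.
	next := append(append([]Message{}, history...), Message{
		Role:    "user",
		Content: "What does this error dialog say?",
		Images:  []string{"data:image/png;base64,iVBORw0KGgo="},
	})
	fmt.Println(stablePrefix(history, next) == len(history)) // prints "true"
}
```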

Tests

Covered: provider (ParseImageDataURL), openai / anthropic (image parts/blocks when vision is on, plain content when it is off), control (image @ref → data URL on the turn). go vet and the tests for the touched packages all pass locally.

Known tradeoffs / follow-ups

- Images store inline as data URLs in session history → .jsonl bloat. A path-and-resolve-at-send design would keep history compact (deferred).
- Video → keyframes (ffmpeg) is a separate, larger effort.
- Anthropic vision works today; DeepSeek's main models are text-only until their multimodal API ships, so the immediate beneficiaries are Claude and other OpenAI-compatible vision endpoints.

@refs

Refs #4158

Attached images were saved to .reasonix/attachments and referenced as @-paths, so a vision model only ever saw the path text: the message content was a plain string with no way to carry image data. This embeds them for models that declare vision support.

- provider.Message gains Images []string (data URLs). Agent history is already []provider.Message, so images ride through history, persistence, and both providers with one field.
- openai: chatMessage.Content becomes `any`: a []chatContentPart array (text + image_url) for a vision user turn, the string/null shape otherwise.
- anthropic: a base64 image block alongside the text block.
- A model's `vision` config flag (forwarded via Extra) gates emission: text-only models never receive images (they 400) and their prompt cache stays untouched.
- The controller resolves image @refs to data URLs and carries them via agent.WithUserImages(ctx, …); the agent must not import control.

Images store inline in history (session bloat is the known tradeoff). Video and per-frame decoding are out of scope.

…ob (#4210)

Follow-up to #4204. Sent images had no size control: an attached photo went out at full resolution (up to the 10 MB cap), wasting request bytes and image tokens since vision models downscale server-side anyway.

- internal/control: a vision-only send path (visionImageDataURL) downscales an oversized image to 1568px on its longest side and re-encodes it. PNG/GIF stay lossless (screenshots, text, transparency), JPEG/WebP go to JPEG q85. Decoding is guarded against decompression bombs. Best-effort: an undecodable format passes through untouched. The desktop preview path (ImageDataURL) is unchanged, full res.
- A per-model `vision_detail` (low|high) config flag sets the openai image_url detail hint; empty = auto/omit. "low" pins an image to ~85 tokens.
- Deliberately no request-body gzip: it only helps the wire (~25%, and provider-support-dependent) and nothing for tokens, so downscaling is the lever.

Also corrects the #4204 comment that claimed images "break prefix-cache stability". They don't (vision-gated, append-only, byte-stable); the real concern was always cost.
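
The size rule in the follow-up comes down to a little integer arithmetic. The sketch below shows only that arithmetic and the stated format mapping; it is not the visionImageDataURL implementation. `fitWithin` and `reencodeAs` are hypothetical helper names, the rounding choice is an assumption, and actual decoding, resampling, and the decompression-bomb guard are left out.

```go
package main

import "fmt"

// maxLongestSide is the 1568px limit quoted in the commit message.
const maxLongestSide = 1568

// fitWithin scales (w, h) so the longest side is at most limit, preserving
// aspect ratio. Images already within the limit are returned unchanged.
func fitWithin(w, h, limit int) (int, int) {
	longest := w
	if h > longest {
		longest = h
	}
	if limit <= 0 || longest <= limit {
		return w, h
	}
	scale := func(v int) int {
		s := (v*limit + longest/2) / longest // round to nearest
		if s < 1 {
			s = 1 // never collapse a side to zero pixels
		}
		return s
	}
	return scale(w), scale(h)
}

// reencodeAs maps a decoded source format to the output described above.
// Keying on a format name is an assumption made for this sketch.
func reencodeAs(format string) string {
	switch format {
	case "png", "gif":
		return "lossless"
	case "jpeg", "webp":
		return "jpeg q85"
	default:
		return "passthrough" // undecodable input is sent untouched
	}
}

func main() {
	w, h := fitWithin(4032, 3024, maxLongestSide)
	fmt.Println(w, h, reencodeAs("jpeg")) // prints "1568 1176 jpeg q85"

	w, h = fitWithin(800, 600, maxLongestSide)
	fmt.Println(w, h, reencodeAs("png")) // prints "800 600 lossless"
}
```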

esengine requested a review from SivanCola as a code owner June 12, 2026 15:00

github-actions bot added the v2, agent, config, and provider labels Jun 12, 2026

esengine merged commit 278984f into main-v2 Jun 12, 2026
14 checks passed

esengine deleted the feat/multimodal-images branch June 12, 2026 15:04

esengine mentioned this pull request Jun 12, 2026

feat(vision): downscale attached images before sending; add detail knob #4210

Merged

This was referenced Jun 13, 2026

[Feature]: Hoping the author can add a feature to configure a vision model for viewing images #3877

Closed

[Feature]: Feature request: support pasting clipboard images directly into the chat input box #4178

Closed

[Feature]: Support uploading videos/images for multimodal analysis #4158

Open

esengine merged 1 commit into main-v2 from feat/multimodal-images
