feat(provider): send attached images to vision-capable models #4204
Merged
Refs #4158

Attached images were saved to .reasonix/attachments and referenced as @-paths, so a vision model only ever saw the path text: the message content was a plain string with no way to carry image data. This embeds them for models that declare vision support.

- provider.Message gains Images []string (data URLs). Agent history is already []provider.Message, so images ride through history, persistence, and both providers with one field.
- openai: chatMessage.Content becomes `any`: a []chatContentPart array (text + image_url) for a vision user turn, the string/null shape otherwise.
- anthropic: a base64 image block alongside the text block.
- A model's `vision` config flag (forwarded via Extra) gates emission: text-only models never receive images (they 400) and their prompt cache stays untouched.
- The controller resolves image @refs to data URLs and carries them via agent.WithUserImages(ctx, …); the agent must not import control.

Images store inline in history (session bloat is the known tradeoff). Video and per-frame decoding are out of scope.
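The vision gate and the two content shapes on the openai side can be illustrated with a minimal sketch. Only `Images []string` and the text + `image_url` part layout come from the description above; the type names `contentPart`, `imageURL`, `wireMessage` and the helpers `toWire` and `encode` are hypothetical stand-ins, not the project's actual identifiers. The `null` content case is left out for brevity.

```go
package main

import "encoding/json"

// Message follows the shape described above: text content plus optional
// image data URLs. Fields other than Images are assumptions.
type Message struct {
	Role    string
	Content string
	Images  []string // data URLs, e.g. "data:image/png;base64,..."
}

// contentPart is a hypothetical stand-in for chatContentPart.
type contentPart struct {
	Type     string    `json:"type"`
	Text     string    `json:"text,omitempty"`
	ImageURL *imageURL `json:"image_url,omitempty"`
}

type imageURL struct {
	URL string `json:"url"`
}

// wireMessage is a hypothetical stand-in for chatMessage; Content is `any`
// so it can hold either a plain string or a slice of parts.
type wireMessage struct {
	Role    string `json:"role"`
	Content any    `json:"content"`
}

// toWire keeps the plain-string shape unless the model is vision-capable
// and the user turn actually carries images.
func toWire(m Message, vision bool) wireMessage {
	if !vision || m.Role != "user" || len(m.Images) == 0 {
		return wireMessage{Role: m.Role, Content: m.Content}
	}
	parts := make([]contentPart, 0, len(m.Images)+1)
	if m.Content != "" {
		parts = append(parts, contentPart{Type: "text", Text: m.Content})
	}
	for _, u := range m.Images {
		parts = append(parts, contentPart{Type: "image_url", ImageURL: &imageURL{URL: u}})
	}
	return wireMessage{Role: m.Role, Content: parts}
}

// encode marshals the wire form so the two shapes can be compared as JSON.
func encode(m Message, vision bool) (string, error) {
	b, err := json.Marshal(toWire(m, vision))
	return string(b), err
}
```

With `vision` false the encoded message is identical to what a message without images would produce, which is the property the gate relies on for text-only models.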
esengine added a commit that referenced this pull request on Jun 12, 2026
…ob (#4210)

Follow-up to #4204. Sent images had no size control: an attached photo went out at full resolution (up to the 10 MB cap), wasting request bytes and image tokens since vision models downscale server-side anyway.

- internal/control: a vision-only send path (visionImageDataURL) downscales an oversized image to 1568px on its longest side and re-encodes it. PNG/GIF stay lossless (screenshots, text, transparency), JPEG/WebP go to JPEG q85. The path is guarded against decompression bombs. Best-effort: an undecodable format passes through untouched. The desktop preview path (ImageDataURL) is unchanged, full res.
- A per-model `vision_detail` (low|high) config flag sets the openai image_url detail hint; empty = auto/omit. "low" pins an image to ~85 tokens.
- Deliberately no request-body gzip: it only helps the wire (~25%, and provider-support-dependent) and nothing for tokens, so downscaling is the lever.

Also corrects the #4204 comment that claimed images "break prefix-cache stability". They don't (vision-gated, append-only, byte-stable); the real concern was always cost.
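The sizing rule in the follow-up can be sketched as plain arithmetic. The 1568px limit is taken from the commit message; the function names `fitLongestSide` and `safeToDecode` and the `maxPixels` budget are assumptions for illustration, since the commit does not state how the decompression-bomb guard is implemented or what threshold it uses.

```go
package main

// maxSide is the longest-side limit quoted in the commit message.
const maxSide = 1568

// maxPixels is a hypothetical decode budget; the real threshold is not
// stated in the source.
const maxPixels = 64_000_000

// fitLongestSide scales (w, h) so the longest side is at most limit while
// preserving the aspect ratio. Images already within the limit are returned
// unchanged, so nothing is ever upscaled.
func fitLongestSide(w, h, limit int) (int, int) {
	if w <= 0 || h <= 0 || limit <= 0 {
		return w, h
	}
	longest := w
	if h > longest {
		longest = h
	}
	if longest <= limit {
		return w, h
	}
	// Round to nearest using integer math, and never collapse a side to 0.
	nw := (w*limit + longest/2) / longest
	nh := (h*limit + longest/2) / longest
	if nw < 1 {
		nw = 1
	}
	if nh < 1 {
		nh = 1
	}
	return nw, nh
}

// safeToDecode reports whether the declared dimensions fit the pixel budget.
// The check uses int64 so that very large headers cannot overflow.
func safeToDecode(w, h int) bool {
	return w > 0 && h > 0 && int64(w)*int64(h) <= maxPixels
}
```

A 4032x3024 photo maps to 1568x1176 under this rule, while an 800x600 screenshot is left alone. In Go the declared dimensions can be read with `image.DecodeConfig` before any pixel data is decoded, which is one way to apply such a guard cheaply.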
This was referenced Jun 13, 2026
Refs #4158. First cut of multimodal: send attached images to vision-capable models. Video / keyframe decoding is intentionally out of scope.
The gap
Images dragged/pasted into chat are saved to `.reasonix/attachments/` and referenced as `@…png`. Today the controller turns that into a text block (`[image attachment available… use a vision MCP tool]`) and `provider.Message.Content` is a plain string, so the model never receives image data. The issue's root-cause read (openai `chatMessage.Content` is `*string`) was correct.
What changed
- `provider.Message.Images []string` (data URLs). Agent history is `[]provider.Message`, so one field carries images through history, session persistence, and both providers.
- openai: `chatMessage.Content` → `any`; a `[]chatContentPart` array (text + `image_url`) for a vision user turn, the same string/`null` shape as before otherwise.
- anthropic: a `{type:image, source:{type:base64,media_type,data}}` block beside the text block (parsed from the data URL via the shared `provider.ParseImageDataURL`).
- A model's `vision` config flag (forwarded through `Config.Extra`) gates emission. Only `vision` models get image parts; text-only models never receive them (they'd `400`) and their prompt-cache prefix is byte-for-byte unchanged. Set `vision = true` on the DeepSeek model entry once its multimodal API lands.
- The controller resolves image `@refs` to data URLs and passes them via `agent.WithUserImages(ctx, …)`, mirroring `WithParentSession`, so `agent` stays free of an `internal/control` import.
Cache safety
Images live only in user turns; embedding one doesn't retroactively invalidate the prefix before it (append-only history). Text-only flows are untouched because of the `vision` gate.
Tests
`provider` (`ParseImageDataURL`), `openai`/`anthropic` (image parts/blocks when vision, plain content when not), `control` (image `@ref` → data URL on the turn). `go vet` + the touched packages all green locally.
Known tradeoffs / follow-ups
- Images are stored inline in history, so session `.jsonl` files bloat. A path-and-resolve-at-send design would keep history compact (deferred).
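For reference, the anthropic item under "What changed" depends on splitting a data URL into a media type and a base64 payload. The sketch below shows one way that could look. `parseImageDataURL` is a hypothetical stand-in for `provider.ParseImageDataURL`, whose real signature and validation rules are not shown in this PR; the struct names are likewise assumed, and only the `{type:image, source:{type:base64,media_type,data}}` layout comes from the description.

```go
package main

import (
	"encoding/base64"
	"errors"
	"strings"
)

// parseImageDataURL splits "data:<media type>;base64,<payload>" into its
// media type and still-encoded payload. Hypothetical helper: it accepts only
// base64 image data URLs without extra media-type parameters.
func parseImageDataURL(u string) (mediaType, data string, err error) {
	if !strings.HasPrefix(u, "data:") {
		return "", "", errors.New("not a data URL")
	}
	meta, payload, ok := strings.Cut(strings.TrimPrefix(u, "data:"), ",")
	if !ok {
		return "", "", errors.New("data URL has no payload")
	}
	if !strings.HasSuffix(meta, ";base64") {
		return "", "", errors.New("data URL is not base64-encoded")
	}
	mediaType = strings.TrimSuffix(meta, ";base64")
	if !strings.HasPrefix(mediaType, "image/") {
		return "", "", errors.New("not an image media type")
	}
	// Validate the payload here so malformed data fails before the request.
	if _, err := base64.StdEncoding.DecodeString(payload); err != nil {
		return "", "", err
	}
	return mediaType, payload, nil
}

// imageSource and imageBlock mirror the block layout quoted in the PR text.
type imageSource struct {
	Type      string `json:"type"`
	MediaType string `json:"media_type"`
	Data      string `json:"data"`
}

type imageBlock struct {
	Type   string      `json:"type"`
	Source imageSource `json:"source"`
}

// anthropicImageBlock builds the base64 image block from a data URL.
func anthropicImageBlock(dataURL string) (imageBlock, error) {
	mt, data, err := parseImageDataURL(dataURL)
	if err != nil {
		return imageBlock{}, err
	}
	return imageBlock{
		Type:   "image",
		Source: imageSource{Type: "base64", MediaType: mt, Data: data},
	}, nil
}
```

Keeping the parser in the shared `provider` package, as the PR does, means both providers agree on what counts as a valid image reference.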