feat(provider): send attached images to vision-capable models #4204

Merged
esengine merged 1 commit into main-v2 from feat/multimodal-images
Jun 12, 2026

@esengine

Owner

Refs #4158. First cut of multimodal: send attached images to vision-capable models. Video / keyframe decoding is intentionally out of scope.

The gap

Images dragged/pasted into chat are saved to .reasonix/attachments/ and referenced as @…png. Today the controller turns that into a text block — [image attachment available… use a vision MCP tool] — and provider.Message.Content is a plain string, so the model never receives image data. The issue's root-cause read (openai chatMessage.Content is *string) was correct.

What changed

  • provider.Message.Images []string (data URLs). Agent history is []provider.Message, so one field carries images through history, session persistence, and both providers.
  • openai: chatMessage.Content → any; a []chatContentPart array (text + image_url) for a vision user turn, the same string/null shape as before otherwise.
  • anthropic: a {type:image, source:{type:base64,media_type,data}} block beside the text block (parsed from the data URL via the shared provider.ParseImageDataURL).
  • Gating — a per-model vision config flag (forwarded through Config.Extra). Only vision models get image parts; text-only models never receive them (they'd 400) and their prompt-cache prefix is byte-for-byte unchanged. Set vision = true on the DeepSeek model entry once its multimodal API lands.
  • Threading — the controller resolves image @refs to data URLs and passes them via agent.WithUserImages(ctx, …), mirroring WithParentSession, so agent stays free of an internal/control import.
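
A minimal sketch of the gated OpenAI content shape described above, not the code from this PR. The type and function names (`contentPart`, `toChatMessage`, `encode`) are hypothetical, and the `null` content case for tool-call turns is left out for brevity. It only illustrates that a text-only model gets the same string content as before, while a vision user turn gets a part array.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Message mirrors the PR's provider.Message: text plus optional data-URL images.
type Message struct {
	Role    string
	Content string
	Images  []string // data URLs, e.g. "data:image/png;base64,..."
}

// imageURL and contentPart follow the OpenAI chat-completions content-part shape.
type imageURL struct {
	URL string `json:"url"`
}

type contentPart struct {
	Type     string    `json:"type"`
	Text     string    `json:"text,omitempty"`
	ImageURL *imageURL `json:"image_url,omitempty"`
}

// chatMessage.Content is `any`: a string for text turns, a part array for vision turns.
type chatMessage struct {
	Role    string `json:"role"`
	Content any    `json:"content"`
}

// toChatMessage emits image parts only for a user turn on a vision-enabled model.
func toChatMessage(m Message, vision bool) chatMessage {
	if !vision || m.Role != "user" || len(m.Images) == 0 {
		return chatMessage{Role: m.Role, Content: m.Content}
	}
	parts := []contentPart{{Type: "text", Text: m.Content}}
	for _, u := range m.Images {
		parts = append(parts, contentPart{Type: "image_url", ImageURL: &imageURL{URL: u}})
	}
	return chatMessage{Role: m.Role, Content: parts}
}

// encode returns the JSON wire form of one message.
func encode(m Message, vision bool) string {
	b, err := json.Marshal(toChatMessage(m, vision))
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	m := Message{
		Role:    "user",
		Content: "What is in this screenshot?",
		Images:  []string{"data:image/png;base64,AAAA"},
	}
	fmt.Println(encode(m, false)) // text-only model: plain string content
	fmt.Println(encode(m, true))  // vision model: text part + image_url part
}
```

Keeping the gate inside the serializer means the stored history can carry images unconditionally while each provider decides per request whether to emit them.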

Cache safety

Images live only in user turns; embedding one doesn't retroactively invalidate the prefix before it (append-only history). Text-only flows are untouched because of the vision gate.
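
The cache argument can be checked with a toy model. The sketch below uses a simplified, hypothetical serialization (`turn`, `render`, `extends`), not the real request encoder. It assumes only what the paragraph above states: each turn's bytes depend on that turn alone, and images are dropped when the vision flag is off.

```go
package main

import (
	"fmt"
	"strings"
)

// turn is a simplified stand-in for one provider.Message in history.
type turn struct {
	Role   string
	Text   string
	Images []string // data URLs; only ever set on user turns
}

// render serializes history in order. Each turn's bytes depend only on that
// turn and the vision flag, which is what keeps the prefix append-only.
func render(history []turn, vision bool) string {
	var b strings.Builder
	for _, t := range history {
		fmt.Fprintf(&b, "%s: %q", t.Role, t.Text)
		if vision {
			for _, img := range t.Images {
				fmt.Fprintf(&b, " [image %d bytes]", len(img))
			}
		}
		b.WriteByte('\n')
	}
	return b.String()
}

// extends reports whether next starts with prev, i.e. prev is still a valid cached prefix.
func extends(prev, next string) bool {
	return strings.HasPrefix(next, prev)
}

func main() {
	base := []turn{{Role: "user", Text: "hello"}, {Role: "assistant", Text: "hi"}}
	imageTurn := turn{Role: "user", Text: "what is this?", Images: []string{"data:image/png;base64,AAAA"}}
	plainTurn := turn{Role: "user", Text: "what is this?"}

	withImage := append(append([]turn{}, base...), imageTurn)
	withoutImage := append(append([]turn{}, base...), plainTurn)

	// Appending an image turn leaves the earlier bytes untouched.
	fmt.Println(extends(render(base, true), render(withImage, true))) // true
	// With the vision gate off, the image has no effect on the bytes at all.
	fmt.Println(render(withImage, false) == render(withoutImage, false)) // true
}
```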

Tests

provider (ParseImageDataURL), openai / anthropic (image parts/blocks when vision, plain content when not), control (image @ref → data URL on the turn). go vet + the touched packages all green locally.
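
For orientation, the data-URL parsing these tests exercise could look roughly like the sketch below. `parseImageDataURL` is a hypothetical stand-in, since the real provider.ParseImageDataURL signature is not shown in this PR, and `imageBlock` / `anthropicImageBlock` are illustrative names. The JSON shape follows the Anthropic block quoted in "What changed".

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// parseImageDataURL splits "data:image/png;base64,<payload>" into its media
// type and base64 payload. Hypothetical stand-in; the real helper may differ.
func parseImageDataURL(u string) (mediaType, data string, err error) {
	rest, ok := strings.CutPrefix(u, "data:")
	if !ok {
		return "", "", errors.New("not a data URL")
	}
	meta, payload, ok := strings.Cut(rest, ",")
	if !ok || payload == "" {
		return "", "", errors.New("missing payload")
	}
	mediaType, ok = strings.CutSuffix(meta, ";base64")
	if !ok || !strings.HasPrefix(mediaType, "image/") {
		return "", "", errors.New("expected a base64 image data URL")
	}
	return mediaType, payload, nil
}

// imageSource and imageBlock mirror {type:image, source:{type:base64,media_type,data}}.
type imageSource struct {
	Type      string `json:"type"`
	MediaType string `json:"media_type"`
	Data      string `json:"data"`
}

type imageBlock struct {
	Type   string      `json:"type"`
	Source imageSource `json:"source"`
}

// anthropicImageBlock converts one data URL into an Anthropic image block.
func anthropicImageBlock(dataURL string) (imageBlock, error) {
	mediaType, data, err := parseImageDataURL(dataURL)
	if err != nil {
		return imageBlock{}, err
	}
	return imageBlock{
		Type:   "image",
		Source: imageSource{Type: "base64", MediaType: mediaType, Data: data},
	}, nil
}

func main() {
	block, err := anthropicImageBlock("data:image/png;base64,AAAA")
	if err != nil {
		panic(err)
	}
	out, _ := json.Marshal(block)
	fmt.Println(string(out))
	// {"type":"image","source":{"type":"base64","media_type":"image/png","data":"AAAA"}}
}
```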

Known tradeoffs / follow-ups

  • Images store inline as data URLs in session history → .jsonl bloat. A path-and-resolve-at-send design would keep history compact (deferred).
  • Video → keyframes (ffmpeg) is a separate, larger effort.
  • Anthropic vision works today; DeepSeek's main models are text-only until their multimodal API ships, so the immediate beneficiaries are Claude and other OpenAI-compatible vision endpoints.

Refs #4158

Attached images were saved to .reasonix/attachments and referenced as @-paths, so
a vision model only ever saw the path text — the message content was a plain
string with no way to carry image data. This embeds them for models that declare
vision support.

- provider.Message gains Images []string (data URLs). Agent history is already
  []provider.Message, so images ride through history, persistence, and both
  providers with one field.
- openai: chatMessage.Content becomes `any` — a []chatContentPart array (text +
  image_url) for a vision user turn, the string/null shape otherwise.
- anthropic: a base64 image block alongside the text block.
- A model's `vision` config flag (forwarded via Extra) gates emission: text-only
  models never receive images (they 400) and their prompt cache stays untouched.
- The controller resolves image @refs to data URLs and carries them via
  agent.WithUserImages(ctx, …); the agent must not import control.

Images store inline in history (session bloat is the known tradeoff). Video and
per-frame decoding are out of scope.
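
The context-threading bullet above can be sketched as follows. This is an illustration under assumptions rather than the merged code: only the name agent.WithUserImages comes from the PR, while the key type, `userImagesFrom`, and `newTurnContext` are hypothetical. The point is that the agent package owns the key and both helpers, so the controller depends on agent and never the reverse.

```go
package main

import (
	"context"
	"fmt"
)

// userImagesKey is an unexported key type, so no other package can collide with it.
type userImagesKey struct{}

// WithUserImages returns a context carrying the data URLs for the next user turn.
func WithUserImages(ctx context.Context, images []string) context.Context {
	if len(images) == 0 {
		return ctx // nothing to attach; keep the context unchanged
	}
	return context.WithValue(ctx, userImagesKey{}, images)
}

// userImagesFrom is what the agent side would call while building the user message.
func userImagesFrom(ctx context.Context) []string {
	images, _ := ctx.Value(userImagesKey{}).([]string)
	return images
}

// newTurnContext plays the controller's role: resolve refs, then attach them.
func newTurnContext(images []string) context.Context {
	return WithUserImages(context.Background(), images)
}

func main() {
	ctx := newTurnContext([]string{"data:image/png;base64,AAAA"})
	fmt.Println(len(userImagesFrom(ctx)))                  // 1
	fmt.Println(len(userImagesFrom(newTurnContext(nil)))) // 0
}
```
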
@esengine requested a review from SivanCola as a code owner on June 12, 2026 15:00
@github-actions (bot) added labels on Jun 12, 2026: v2 (Go rewrite (1.x), main-v2 branch, active development), agent (Core agent loop: internal/agent, internal/control), config (Configuration & setup: internal/config), provider (Model providers & selection: internal/provider)
@esengine merged commit 278984f into main-v2 on Jun 12, 2026
14 checks passed
@esengine deleted the feat/multimodal-images branch on June 12, 2026 15:04
esengine added a commit that referenced this pull request Jun 12, 2026
…ob (#4210)

Follow-up to #4204. Sent images had no size control — an attached photo went out
at full resolution (up to the 10 MB cap), wasting request bytes and image tokens
since vision models downscale server-side anyway.

- internal/control: a vision-only send path (visionImageDataURL) downscales an
  oversized image to 1568px on its longest side and re-encodes it — PNG/GIF stay
  lossless (screenshots, text, transparency), JPEG/WebP go to JPEG q85 — guarded
  against decompression bombs. Best-effort: an undecodable format passes through
  untouched. The desktop preview path (ImageDataURL) is unchanged, full res.
- A per-model `vision_detail` (low|high) config flag sets the openai image_url
  detail hint; empty = auto/omit. "low" pins an image to ~85 tokens.
- Deliberately no request-body gzip: it only helps the wire (~25%, and
  provider-support-dependent) and does nothing for tokens, so downscaling is the lever.

Also corrects the #4204 comment that claimed images "break prefix-cache
stability" — they don't (vision-gated, append-only, byte-stable); the real
concern was always cost.
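
A rough sketch of the sizing and format decisions in the follow-up commit, assuming only what its message states: a 1568px longest-side cap, lossless output for PNG/GIF, JPEG q85 for JPEG/WebP, and pass-through for anything else. `fitWithin` and `targetEncoding` are hypothetical names. The actual pixel resampling, the decompression-bomb guard, and the `vision_detail` hint are omitted.

```go
package main

import "fmt"

// maxSide is the longest-side cap named in the commit message.
const maxSide = 1568

// fitWithin scales (w, h) down so the longest side is at most limit while
// keeping the aspect ratio. Images already within the limit are not upscaled.
func fitWithin(w, h, limit int) (int, int) {
	longest := w
	if h > longest {
		longest = h
	}
	if longest <= limit || longest <= 0 {
		return w, h
	}
	// Round to nearest and clamp to 1 so extreme aspect ratios stay valid.
	scale := func(side int) int {
		s := (side*limit + longest/2) / longest
		if s < 1 {
			return 1
		}
		return s
	}
	return scale(w), scale(h)
}

// targetEncoding picks the re-encode strategy for a source media type.
func targetEncoding(mediaType string) string {
	switch mediaType {
	case "image/png", "image/gif":
		return "lossless" // screenshots, text, transparency
	case "image/jpeg", "image/webp":
		return "jpeg-q85"
	default:
		return "passthrough" // unknown or undecodable formats are sent untouched
	}
}

func main() {
	w, h := fitWithin(4032, 3024, maxSide)
	fmt.Println(w, h, targetEncoding("image/jpeg")) // 1568 1176 jpeg-q85

	w, h = fitWithin(800, 600, maxSide)
	fmt.Println(w, h, targetEncoding("image/png")) // 800 600 lossless
}
```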