
Image egress reads the whole file and inline-base64-copies it every LLM call (no downscale, no streaming) #1296

@Aaronontheweb

What happens

When a message carries an image, every LLM call reads the whole image file off disk and hands the raw bytes to the provider, which base64-serializes them inline into the request JSON. No downscale, no streaming, and it happens on every turn the image is still in the un-compacted window — not once.

A single message is bounded at admission (25MB/file, 10 files/message), but 10×25MB is 250MB raw, which is about 333MB of base64 text and roughly 660MB if that text is held as a UTF-16 string, re-materialized each turn. On a memory-limited daemon (we run at 1Gi) that's a real spike, and it's the same allocation shape as #1293 (ReadAllBytes plus a large derived copy), just for media instead of shell output.
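The worst-case figure can be checked with a little arithmetic. This is a minimal sketch, assuming 1MB = 10^6 bytes and that the encoded text ends up in a UTF-16 .NET string at some point (an assumption; a provider may hold UTF-8 bytes instead). `Base64Footprint` is a hypothetical helper, not something in the codebase.

```csharp
using System;

public static class Base64Footprint
{
    // Characters in the padded base64 encoding of rawBytes bytes:
    // every group of up to 3 input bytes becomes 4 output characters.
    public static long EncodedChars(long rawBytes) => (rawBytes + 2) / 3 * 4;

    // Bytes needed to hold that text as a .NET string (UTF-16, 2 bytes per
    // char), ignoring object header overhead.
    public static long Utf16Bytes(long rawBytes) => EncodedChars(rawBytes) * 2;
}
```

For ten files of 25,000,000 bytes each, `EncodedChars` gives 33,333,336 characters per file, so about 333 million characters in total and about 667MB if all of it sits in UTF-16 strings, on top of the 250MB of raw bytes.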

Why

The egress reads the full file and wraps it for inline encoding:
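As a rough illustration of that allocation shape (not the project's actual egress code), the pattern looks like the sketch below. `InlineImageEgressSketch` is a hypothetical name, and `Convert.ToBase64String` stands in for whatever the provider does when it serializes the `DataContent` bytes into JSON.

```csharp
using System;
using System.IO;

public static class InlineImageEgressSketch
{
    // Hypothetical stand-in for the per-call egress: read the whole file,
    // then build a second, larger copy as base64 text.
    public static string ToInlineBase64(string path)
    {
        byte[] raw = File.ReadAllBytes(path);   // copy 1: entire file in memory
        return Convert.ToBase64String(raw);     // copy 2: ~4/3 as many chars, UTF-16
    }
}
```

Both copies are live at the same time, and nothing is cached between turns, so the cost repeats on every call while the image stays in the window.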

Admission caps exist, but they bound what's allowed in, not the per-turn materialization:

    if (inspection.Category == AttachmentCategory.Image && inlined)
    {
        if (inspection.SizeBytes > MaxModelInputFileBytes)
        {
            return BuildMetadataResponse(
                authorizedPath,
                inspection,
                $"Image exceeds the {ByteSizeFormatter.Format(MaxModelInputFileBytes)} model-input handoff limit. Raw binary output is not returned by file_read.");
        }

        context.AddModelInputFile(authorizedPath, Path.GetFileName(authorizedPath), inspection.MimeType);
        return BuildMetadataResponse(
            authorizedPath,
            inspection,
            "Image loaded for model-visible inspection on the next LLM call.");
    }

The compaction layer already knows base64 inflates by ~4/3 and counts it toward tokens, so the token accounting is handled — it's the memory and raw-byte side that isn't:

    /// <summary>
    /// Naive token estimation: total character count / 4. Includes the system
    /// prompt (if present), text content, tool-call arguments, and media-payload
    /// inflation across all messages. Media is estimated as <c>fileSize * 4 / 3</c>
    /// chars (base64 inflation) and contributes to the token count even though
    /// the bytes are stored as references on disk — at LLM-call time
    /// <see cref="ChatMessageConverter.ToAiMessages"/> loads them as
    /// <c>DataContent</c> and the provider base64-serializes them into the JSON
    /// payload. The accumulator is <see cref="long"/> to handle multi-megabyte
    /// images without int overflow.
    /// </summary>
    public static int EstimateTokens(
        List<SerializableChatMessage> messages,
        SerializableChatMessage? systemPrompt)
    {
        var totalChars = 0L;

        if (systemPrompt is not null)
        {
            totalChars += systemPrompt.Content?.Length ?? 0;
            totalChars += EstimateMediaChars(systemPrompt);
        }

        foreach (var msg in messages)
        {
            totalChars += msg.Content?.Length ?? 0;
            foreach (var tc in msg.ToolCalls)
                totalChars += tc.ArgumentsJson?.Length ?? 0;
            totalChars += EstimateMediaChars(msg);
        }

        return (int)(totalChars / 4);
    }

    private static long EstimateMediaChars(SerializableChatMessage msg)
    {
        var total = 0L;
        foreach (var media in msg.MediaReferences)
        {
            // base64 inflates raw bytes by 4/3
            total += media.FileSizeBytes * 4 / 3;
        }

        return total;
    }

file_read recently gained image-from-disk handoff (the AddModelInputFile path above), so the agent can now pull arbitrary on-disk images into context — which widens the surface for this.

Why the text playbook doesn't apply

You can't truncate or slice an image — the model needs a coherent frame, so the head+tail/grep approach we'd use for shell output is off the table. The levers are different:

  • Downscale before encode. Vision models tile images and cap effective resolution (roughly 1–2MP of usable detail for most providers). A 25MB high-res photo carries no more signal than a modest resize for nearly any task, at a fraction of the bytes and tokens. Resizing at egress shrinks both the spike and the bill. We don't do it today.
  • Stream the encode, or use a provider data/file API. ReadAllBytes + DataContent materializes the whole payload; encoding straight into the request stream — or uploading once and referencing by ID — avoids holding the giant base64 string.
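To make the first lever concrete, here is a minimal sketch of the sizing step only: given source dimensions and a pixel budget, pick target dimensions that preserve aspect ratio. `DownscalePlanner` and the budget parameter are hypothetical; the real budget would have to come from each provider's documented limits, and the actual resample needs an imaging library, which is left out here.

```csharp
using System;

public static class DownscalePlanner
{
    // Returns dimensions that keep the aspect ratio and fit within maxPixels.
    // Images already within the budget come back unchanged (never upscaled).
    public static (int Width, int Height) FitToPixelBudget(int width, int height, long maxPixels)
    {
        if (width <= 0) throw new ArgumentOutOfRangeException(nameof(width));
        if (height <= 0) throw new ArgumentOutOfRangeException(nameof(height));
        if (maxPixels <= 0) throw new ArgumentOutOfRangeException(nameof(maxPixels));

        long pixels = (long)width * height;
        if (pixels <= maxPixels) return (width, height);

        // Scale both sides by the same factor; rounding down keeps the
        // result at or under the budget.
        double scale = Math.Sqrt((double)maxPixels / pixels);
        int w = Math.Max(1, (int)Math.Floor(width * scale));
        int h = Math.Max(1, (int)Math.Floor(height * scale));
        return (w, h);
    }
}
```

With an assumed 2MP budget, an 8000×6000 source maps to 1632×1224, about 4% of the original pixel count, before any re-encode savings.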

Suggested direction

  1. Downscale/re-encode images to the model's effective max resolution before handoff. Biggest win, and it helps memory and tokens at once.
  2. Stream the base64 encode instead of materializing a full copy; use provider file/data APIs where available.
  3. Tighten or make deployment-aware the 10×25MB per-message admission ceiling for small-footprint pods.
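For item 2, one way to stream the encode with only BCL types is `ToBase64Transform` behind a `CryptoStream`. The sketch below covers the encode side only; `StreamingBase64` is a hypothetical helper, and whether a given provider SDK lets us write into the request body stream is an assumption that would need checking per provider.

```csharp
using System.IO;
using System.Security.Cryptography;

public static class StreamingBase64
{
    // Copies source to destination as base64 text (ASCII bytes), holding only
    // small fixed-size buffers rather than the whole file or the whole encoding.
    public static void Encode(Stream source, Stream destination)
    {
        using var transform = new ToBase64Transform();
        using var encoder = new CryptoStream(
            destination, transform, CryptoStreamMode.Write, leaveOpen: true);

        source.CopyTo(encoder, bufferSize: 81920);
        encoder.FlushFinalBlock(); // writes the last group, with padding if needed
    }
}
```

In practice `source` would be a `FileStream` over the image and `destination` the outgoing request stream, so peak memory stays near the copy buffer size regardless of file size.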

Related

#1293 (same allocation pattern, shell side). #1266 (audio/video native input) — see the egress notes there; A/V must not extend this inline-base64 path.


Labels: context-pipeline, enhancement, providers
