Image egress reads the whole file and inline-base64-copies it every LLM call (no downscale, no streaming)

## What happens

When a message carries an image, every LLM call reads the whole image file off disk and hands the raw bytes to the provider, which base64-serializes them inline into the request JSON. No downscale, no streaming, and it happens on **every turn** the image is still in the un-compacted window — not once.

A single message is bounded at admission (25MB/file, 10 files/message), but 10×25MB is 250MB of raw bytes and ~333MB of base64 text in one request (~667MB if that text is held as a UTF-16 .NET string), re-materialized each turn. On a memory-limited daemon (we run at 1Gi) that's a real spike, and it's the same allocation shape as #1293 (`ReadAllBytes` plus a large derived copy), just for media instead of shell output.
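To show where a number of that size can come from, here is a minimal sketch of the arithmetic. The 25MB and 10-file figures are the admission caps quoted above. Reading "MB" as 10^6 bytes and holding the base64 text as a UTF-16 `string` are assumptions for illustration, and `EgressMath` is a hypothetical name, not something in the codebase.

```csharp
using System;

// Hypothetical helper for illustration; not part of the codebase.
public static class EgressMath
{
    // Base64 emits 4 characters for every 3 input bytes and pads the last group.
    public static long Base64Length(long rawBytes) => (rawBytes + 2) / 3 * 4;

    // Worst-case sizes for one message carrying `files` images of `bytesPerFile` each.
    public static (long RawBytes, long Base64Chars, long Utf16Bytes) WorstCase(
        long bytesPerFile, int files)
    {
        long raw = bytesPerFile * files;
        long base64Chars = Base64Length(bytesPerFile) * files;
        long utf16Bytes = base64Chars * 2; // assumption: the text is held as a .NET string
        return (raw, base64Chars, utf16Bytes);
    }
}
```

`EgressMath.WorstCase(25_000_000, 10)` comes out at 250,000,000 raw bytes, 333,333,360 base64 characters, and 666,666,720 bytes if those characters sit in a UTF-16 string. Whether the provider actually holds the encoded text as a string is not confirmed by the linked code.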

## Why

The egress reads the full file and wraps it for inline encoding:

- https://github.com/netclaw-dev/netclaw/blob/60601c6cc82cdffbc38c61a062aac28d2d8b3444/src/Netclaw.Actors/Protocol/ChatMessageConverter.cs#L99-L100

Admission caps exist, but they bound what's allowed in, not the per-turn materialization:

- https://github.com/netclaw-dev/netclaw/blob/60601c6cc82cdffbc38c61a062aac28d2d8b3444/src/Netclaw.Actors/Tools/FileReadTool.cs#L196-L211

The compaction layer already knows base64 inflates by ~4/3 and counts it toward tokens, so the *token* accounting is handled — it's the memory and raw-byte side that isn't:

- https://github.com/netclaw-dev/netclaw/blob/60601c6cc82cdffbc38c61a062aac28d2d8b3444/src/Netclaw.Actors/Sessions/Pipelines/SessionCompactionPipeline.cs#L205-L241

`file_read` recently gained image-from-disk handoff (the `AddModelInputFile` path above), so the agent can now pull arbitrary on-disk images into context — which widens the surface for this.

## Why the text playbook doesn't apply

You can't truncate or slice an image — the model needs a coherent frame, so the head+tail/grep approach we'd use for shell output is off the table. The levers are different:

- **Downscale before encode.** Vision models tile images and cap effective resolution (roughly 1–2MP of usable detail for most providers). A 25MB high-res photo carries no more signal than a modest resize for nearly any task, at a fraction of the bytes and tokens. Resizing at egress shrinks both the spike and the bill. We don't do it today.
- **Stream the encode, or use a provider data/file API.** `ReadAllBytes` + `DataContent` materializes the whole payload; encoding straight into the request stream — or uploading once and referencing by ID — avoids holding the giant base64 string.
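For the downscale lever, the sizing decision can be sketched on its own. This is only the dimension math: the pixel budget is a hypothetical parameter (the right value per provider is not stated here), `DownscalePlan` is an illustrative name, and the actual resize and re-encode would need an imaging library that this issue does not name.

```csharp
using System;

// Hypothetical helper for illustration; not part of the codebase.
public static class DownscalePlan
{
    // Picks target dimensions under a pixel budget while keeping the aspect ratio.
    // Images already within the budget are returned unchanged (no upscaling).
    // Each side is clamped to at least 1px, so an extremely thin strip can
    // still end up over the budget.
    public static (int Width, int Height) Fit(int width, int height, long maxPixels)
    {
        if (width <= 0) throw new ArgumentOutOfRangeException(nameof(width));
        if (height <= 0) throw new ArgumentOutOfRangeException(nameof(height));
        if (maxPixels <= 0) throw new ArgumentOutOfRangeException(nameof(maxPixels));

        long pixels = (long)width * height;
        if (pixels <= maxPixels) return (width, height);

        // Scale both sides by the same factor so the area shrinks to the budget.
        double scale = Math.Sqrt((double)maxPixels / pixels);
        int w = Math.Max(1, (int)Math.Floor(width * scale));
        int h = Math.Max(1, (int)Math.Floor(height * scale));
        return (w, h);
    }
}
```

With an assumed 2MP budget, `DownscalePlan.Fit(8000, 6000, 2_000_000)` yields 1632×1224, while an 800×600 image passes through untouched.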

## Suggested direction

1. Downscale/re-encode images to the model's effective max resolution before handoff. Biggest win, and it helps memory and tokens at once.
2. Stream the base64 encode instead of materializing a full copy; use provider file/data APIs where available.
3. Tighten the 10×25MB per-message admission ceiling, or make it deployment-aware, for small-footprint pods.
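Item 2 can be sketched with the BCL's `ToBase64Transform` and `CryptoStream`, which encode in small blocks as bytes flow through. This is a minimal synchronous sketch under assumptions: `StreamingBase64` is a hypothetical name, and how the destination stream would be wired into a provider's request body is not shown in the linked code, so that part is left out.

```csharp
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper for illustration; not part of the codebase.
public static class StreamingBase64
{
    // Writes `source` to `destination` as base64 text (ASCII bytes), buffering
    // only small chunks instead of the whole file plus a full encoded copy.
    public static void Encode(Stream source, Stream destination)
    {
        using var transform = new ToBase64Transform();
        // leaveOpen: true so the caller keeps ownership of the destination stream.
        using var encoder = new CryptoStream(
            destination, transform, CryptoStreamMode.Write, leaveOpen: true);

        source.CopyTo(encoder);
        encoder.FlushFinalBlock(); // emits the final padded group, if any
    }
}
```

A caller would open the image with `File.OpenRead(path)` and pass the request or upload stream as `destination`. The output matches what `Convert.ToBase64String` would produce for the same bytes, without the intermediate `byte[]` and `string`.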

## Related

#1293 (same allocation pattern, shell side). #1266 (audio/video native input) — see the egress notes there; A/V must **not** extend this inline-base64 path.


## Referenced code

Excerpt from `src/Netclaw.Actors/Tools/FileReadTool.cs` (L196-L211 at the commit linked above), the `file_read` image handoff and its size limit:

```csharp
if (inspection.Category == AttachmentCategory.Image && inlined)
{
    if (inspection.SizeBytes > MaxModelInputFileBytes)
    {
        return BuildMetadataResponse(
            authorizedPath,
            inspection,
            $"Image exceeds the {ByteSizeFormatter.Format(MaxModelInputFileBytes)} model-input handoff limit. Raw binary output is not returned by file_read.");
    }

    context.AddModelInputFile(authorizedPath, Path.GetFileName(authorizedPath), inspection.MimeType);
    return BuildMetadataResponse(
        authorizedPath,
        inspection,
        "Image loaded for model-visible inspection on the next LLM call.");
}
```

Excerpt from `src/Netclaw.Actors/Sessions/Pipelines/SessionCompactionPipeline.cs` (L205-L241 at the commit linked above), the token estimate that already accounts for base64 inflation. The excerpt starts inside the doc comment and stops mid-loop:

```csharp
/// Naive token estimation: total character count / 4. Includes the system
/// prompt (if present), text content, tool-call arguments, and media-payload
/// inflation across all messages. Media is estimated as <c>fileSize * 4 / 3</c>
/// chars (base64 inflation) and contributes to the token count even though
/// the bytes are stored as references on disk — at LLM-call time
/// <see cref="ChatMessageConverter.ToAiMessages"/> loads them as
/// <c>DataContent</c> and the provider base64-serializes them into the JSON
/// payload. The accumulator is <see cref="long"/> to handle multi-megabyte
/// images without int overflow.
/// </summary>
public static int EstimateTokens(
    List<SerializableChatMessage> messages,
    SerializableChatMessage? systemPrompt)
{
    var totalChars = 0L;
    if (systemPrompt is not null)
    {
        totalChars += systemPrompt.Content?.Length ?? 0;
        totalChars += EstimateMediaChars(systemPrompt);
    }
    foreach (var msg in messages)
    {
        totalChars += msg.Content?.Length ?? 0;
        foreach (var tc in msg.ToolCalls)
            totalChars += tc.ArgumentsJson?.Length ?? 0;
        totalChars += EstimateMediaChars(msg);
    }
    return (int)(totalChars / 4);
}

private static long EstimateMediaChars(SerializableChatMessage msg)
{
    var total = 0L;
    foreach (var media in msg.MediaReferences)
    {
        // base64 inflates raw bytes by 4/3
        total += media.FileSizeBytes * 4 / 3;
```

Excerpt from `src/Netclaw.Actors/Protocol/ChatMessageConverter.cs` (L99-L100 at the commit linked above), the per-turn read that this issue (#1296) is about:

```csharp
var bytes = File.ReadAllBytes(fullPath);
contents.Add(new DataContent(bytes, media.MimeType.Value));
```
