What happens
When a message carries an image, every LLM call reads the whole image file off disk and hands the raw bytes to the provider, which base64-serializes them inline into the request JSON. No downscale, no streaming, and it happens on every turn the image is still in the un-compacted window — not once.
A single message is bounded at admission (25MB/file, 10 files/message), but 10×25MB is 250MB of raw bytes and about 333MB once base64-encoded (roughly 660MB if that text is held in memory as a UTF-16 .NET string), re-materialized each turn. On a memory-limited daemon (we run at 1Gi) that's a real spike, and it's the same allocation shape as #1293 (ReadAllBytes plus a large derived copy), just for media instead of shell output.
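The worst case can be worked through directly. The sketch below is illustrative only: `MediaBudget` and its members are hypothetical names, and it assumes the 25MB cap means decimal megabytes. It computes the base64 character count for a full message and the size of that text if it were held as a UTF-16 string.

```csharp
using System;

// Hypothetical helper, not part of the codebase: worst-case sizing for one message.
public static class MediaBudget
{
    // Assumption: the admission caps are 25 decimal MB per file, 10 files per message.
    public const long MaxFileBytes = 25L * 1000 * 1000;
    public const int MaxFilesPerMessage = 10;

    // Base64 emits 4 characters for every 3 input bytes, rounding up for padding.
    public static long Base64Length(long rawBytes) => (rawBytes + 2) / 3 * 4;

    // Raw bytes read off disk for a message filled to both caps.
    public static long WorstCaseRawBytes() => MaxFileBytes * MaxFilesPerMessage;

    // Base64 characters produced for the same message (each file encoded separately).
    public static long WorstCaseBase64Chars() => Base64Length(MaxFileBytes) * MaxFilesPerMessage;

    // Size of that text if it sits in memory as a .NET string (2 bytes per char).
    public static long WorstCaseUtf16Bytes() => WorstCaseBase64Chars() * 2;
}
```

Under those assumptions the raw payload is 250,000,000 bytes, the encoded text is 333,333,360 characters, and a UTF-16 copy of it is 666,666,720 bytes, all on top of the original byte arrays.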
Why
The egress reads the full file and wraps it for inline encoding:
netclaw/src/Netclaw.Actors/Protocol/ChatMessageConverter.cs, lines 99 to 100 in 60601c6:
```csharp
var bytes = File.ReadAllBytes(fullPath);
contents.Add(new DataContent(bytes, media.MimeType.Value));
```
Admission caps exist, but they bound what's allowed in, not the per-turn materialization:
netclaw/src/Netclaw.Actors/Tools/FileReadTool.cs, lines 196 to 211 in 60601c6:
```csharp
if (inspection.Category == AttachmentCategory.Image && inlined)
{
    if (inspection.SizeBytes > MaxModelInputFileBytes)
    {
        return BuildMetadataResponse(
            authorizedPath,
            inspection,
            $"Image exceeds the {ByteSizeFormatter.Format(MaxModelInputFileBytes)} model-input handoff limit. Raw binary output is not returned by file_read.");
    }

    context.AddModelInputFile(authorizedPath, Path.GetFileName(authorizedPath), inspection.MimeType);
    return BuildMetadataResponse(
        authorizedPath,
        inspection,
        "Image loaded for model-visible inspection on the next LLM call.");
}
```
The compaction layer already knows base64 inflates by ~4/3 and counts it toward tokens, so the token accounting is handled — it's the memory and raw-byte side that isn't:
netclaw/src/Netclaw.Actors/Sessions/Pipelines/SessionCompactionPipeline.cs, lines 205 to 241 in 60601c6 (excerpt):
```csharp
/// Naive token estimation: total character count / 4. Includes the system
/// prompt (if present), text content, tool-call arguments, and media-payload
/// inflation across all messages. Media is estimated as <c>fileSize * 4 / 3</c>
/// chars (base64 inflation) and contributes to the token count even though
/// the bytes are stored as references on disk — at LLM-call time
/// <see cref="ChatMessageConverter.ToAiMessages"/> loads them as
/// <c>DataContent</c> and the provider base64-serializes them into the JSON
/// payload. The accumulator is <see cref="long"/> to handle multi-megabyte
/// images without int overflow.
/// </summary>
public static int EstimateTokens(
    List<SerializableChatMessage> messages,
    SerializableChatMessage? systemPrompt)
{
    var totalChars = 0L;
    if (systemPrompt is not null)
    {
        totalChars += systemPrompt.Content?.Length ?? 0;
        totalChars += EstimateMediaChars(systemPrompt);
    }
    foreach (var msg in messages)
    {
        totalChars += msg.Content?.Length ?? 0;
        foreach (var tc in msg.ToolCalls)
            totalChars += tc.ArgumentsJson?.Length ?? 0;
        totalChars += EstimateMediaChars(msg);
    }
    return (int)(totalChars / 4);
}

private static long EstimateMediaChars(SerializableChatMessage msg)
{
    var total = 0L;
    foreach (var media in msg.MediaReferences)
    {
        // base64 inflates raw bytes by 4/3
        total += media.FileSizeBytes * 4 / 3;
```
file_read recently gained image-from-disk handoff (the AddModelInputFile path above), so the agent can now pull arbitrary on-disk images into context — which widens the surface for this.
Why the text playbook doesn't apply
You can't truncate or slice an image — the model needs a coherent frame, so the head+tail/grep approach we'd use for shell output is off the table. The levers are different:
- Downscale before encode. Vision models tile images and cap effective resolution (roughly 1–2MP of usable detail for most providers). A 25MB high-res photo carries no more signal than a modest resize for nearly any task, at a fraction of the bytes and tokens. Resizing at egress shrinks both the spike and the bill. We don't do it today.
- Stream the encode, or use a provider data/file API. ReadAllBytes + DataContent materializes the whole payload; encoding straight into the request stream, or uploading once and referencing by ID, avoids holding the giant base64 string.
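As a rough illustration of the streaming lever, the sketch below base64-encodes a file straight into a destination stream using the BCL's `ToBase64Transform` and `CryptoStream`, so only small buffers are live at any time. `StreamingBase64.EncodeFile` is a hypothetical name, and wiring the destination into the provider's request body is an assumption about what the provider client allows; the sketch does not show that part.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper, not part of the codebase.
public static class StreamingBase64
{
    // Writes the base64 text of the file at `path` into `destination` as ASCII bytes,
    // without ever holding the whole file or the whole encoded string in memory.
    public static void EncodeFile(string path, Stream destination)
    {
        using var source = File.OpenRead(path);
        using var transform = new ToBase64Transform();
        // leaveOpen: true so the caller keeps ownership of the destination stream.
        using var encoder = new CryptoStream(destination, transform, CryptoStreamMode.Write, leaveOpen: true);

        source.CopyTo(encoder);     // copies through a small fixed-size buffer
        encoder.FlushFinalBlock();  // emits the trailing block and '=' padding
    }
}
```

The output is byte-for-byte what `Convert.ToBase64String` would produce for the same file, so it can sit inside a JSON string value without escaping. Whether the JSON writer in use can accept a streamed value is a separate question.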
Suggested direction
- Downscale/re-encode images to the model's effective max resolution before handoff. Biggest win, and it helps memory and tokens at once.
- Stream the base64 encode instead of materializing a full copy; use provider file/data APIs where available.
- Tighten or make deployment-aware the 10×25MB per-message admission ceiling for small-footprint pods.
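The first and third bullets reduce to small calculations, sketched here. All names (`MediaLimits`, `TargetSize`, `PerMessageCeiling`, `DetectMemoryLimit`) are hypothetical, and the pixel budget and memory fraction are placeholder parameters rather than measured values. The resize itself would need an imaging library and is not shown.

```csharp
using System;

// Hypothetical helper, not part of the codebase.
public static class MediaLimits
{
    // Returns dimensions that keep the aspect ratio while fitting within maxPixels.
    // Images already under the budget are returned unchanged.
    public static (int Width, int Height) TargetSize(int width, int height, long maxPixels)
    {
        long pixels = (long)width * height;
        if (pixels <= maxPixels) return (width, height);

        // Scaling both sides by sqrt(maxPixels / pixels) scales the area to maxPixels.
        double scale = Math.Sqrt((double)maxPixels / pixels);
        int w = Math.Max(1, (int)Math.Floor(width * scale));
        int h = Math.Max(1, (int)Math.Floor(height * scale));
        return (w, h);
    }

    // Memory the runtime believes it may use; in a container this reflects the limit.
    public static long DetectMemoryLimit() => GC.GetGCMemoryInfo().TotalAvailableMemoryBytes;

    // Per-message media ceiling: a fraction of available memory, never above the hard cap.
    public static long PerMessageCeiling(long memoryLimitBytes, double fraction, long hardCapBytes)
        => Math.Min(hardCapBytes, (long)(memoryLimitBytes * fraction));
}
```

For example, an 8000×6000 photo with a 2,000,000-pixel budget comes out at 1632×1224. With a 1Gi limit and a 10% fraction, the per-message ceiling falls to roughly 107MB instead of the fixed 250MB.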
Related
#1293 (same allocation pattern, shell side). #1266 (audio/video native input) — see the egress notes there; A/V must not extend this inline-base64 path.