Summary
Codex locally estimates detail: "original" image token usage by counting raw 32px patches across the decoded source image, without applying the image sizing caps that GPT-5.4/GPT-5.5 enforce server-side.
In the Computer Use guide, the documentation states that detail: "original" preserves the full screenshot resolution of up to 10.24 megapixels before images are downscaled.
This can greatly overestimate context usage for large images and materially affect local behavior such as auto-compaction and remote compaction trimming. In practice, uncapped estimates can lead to endless compaction loops when the estimated bytes for an original image diverge significantly from the true token cost of consuming that image.
Research
The estimator in codex-rs/core/src/context_manager/history.rs replaces inline base64 image payload bytes with a model-visible byte estimate:
- non-original images use RESIZED_IMAGE_BYTES_ESTIMATE = 7373
- original-detail images call estimate_original_image_bytes
estimate_original_image_bytes decodes the image, reads dimensions, computes:
ceil(width / 32) * ceil(height / 32) patches
and returns:
approx_bytes_for_tokens(patch_count) == patch_count * 4 bytes
Because the later token estimate is ceil(bytes / 4), the original image estimate is effectively one token per raw 32px patch. However, the local estimator does not cap the token estimate, which is inconsistent with how tokens are counted on the server.
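The arithmetic described above can be sketched as follows. This is a minimal illustration, not the code in history.rs: it takes already-decoded dimensions instead of decoding a base64 payload, the function names `ceil_div`, `estimated_original_image_bytes`, and `estimated_original_image_tokens` are hypothetical, and the integer types of the two constants are assumptions.

```rust
/// Patch edge length in pixels, as described in this issue (type is an assumption).
const ORIGINAL_IMAGE_PATCH_SIZE: u64 = 32;
/// Bytes-per-token heuristic, as described in this issue (type is an assumption).
const APPROX_BYTES_PER_TOKEN: u64 = 4;

/// Integer ceiling division (hypothetical helper for this sketch).
fn ceil_div(numerator: u64, denominator: u64) -> u64 {
    (numerator + denominator - 1) / denominator
}

/// Uncapped byte estimate: ceil(width / 32) * ceil(height / 32) patches, 4 bytes each.
fn estimated_original_image_bytes(width: u64, height: u64) -> u64 {
    let patch_count = ceil_div(width, ORIGINAL_IMAGE_PATCH_SIZE)
        * ceil_div(height, ORIGINAL_IMAGE_PATCH_SIZE);
    patch_count * APPROX_BYTES_PER_TOKEN
}

/// Token estimate derived from the byte estimate: ceil(bytes / 4),
/// which works out to one token per patch.
fn estimated_original_image_tokens(width: u64, height: u64) -> u64 {
    ceil_div(
        estimated_original_image_bytes(width, height),
        APPROX_BYTES_PER_TOKEN,
    )
}
```

For a 6000x6000 image this gives 188 * 188 = 35,344 patches, so 141,376 estimated bytes and 35,344 estimated tokens. Nothing in the calculation bounds the result as the dimensions grow.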
Example local estimates:
| Image | Local estimated image bytes | Local estimated image tokens |
| --- | --- | --- |
| 6000x6000 | 141,376 | 35,344 |
| 10000x10000 | 391,876 | 97,969 |
| 12000x12000 | 562,500 | 140,625 |
| 20000x20000 | 1,562,500 | 390,625 |
OpenAI's image sizing docs for GPT-5.4/GPT-5.5 say original allows up to 10,000 patches or a 6000px max dimension, and images over either limit are resized while preserving aspect ratio (https://developers.openai.com/api/docs/guides/images-vision#model-sizing-behavior).
However, the documentation does not state the model's token multiplier, so I ran a series of API requests on a 10.24 megapixel image at various aspect ratios to find the maximum usage a single image can incur. The numbers are the server-reported token cost of the vision request.
According to those measurements, detail: "original" vision payloads consume at most 12,000 tokens (which implies a 1.2x token multiplier on 10,000 patches).
Relevant code:
- codex-rs/core/src/context_manager/history.rs
  - RESIZED_IMAGE_BYTES_ESTIMATE
  - ORIGINAL_IMAGE_PATCH_SIZE
  - estimate_original_image_bytes
  - image_data_url_estimate_adjustment
- codex-rs/utils/string/src/truncate.rs
  - APPROX_BYTES_PER_TOKEN = 4
  - approx_token_count
  - approx_bytes_for_tokens
  - approx_tokens_from_byte_count
Impact
- ContextManager::get_total_token_usage adds estimated tokens for items after the last model-generated item.
- session/turn.rs uses total token usage to decide pre-turn and mid-turn auto-compaction.
- compact_remote.rs trims function-call history while estimate_token_count_with_base_instructions exceeds the model context window.
- recompute_token_usage can synthesize token-count events from the estimate.

So one or more large, original-detail images can make Codex believe the session is larger than what the API actually sees, causing premature compaction or trimming.
Replication
Generate a 6000x6000 pixel image. Provide the path to that image and tell Codex: "Please use the view image tool 10 times on the same image, at original resolution". Codex will go into an endless compaction loop because the local byte estimates far exceed the compaction threshold.
Expected behavior
detail: "original" image estimation should be capped at 12,000 tokens (~48,000 bytes).
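One way to express such a cap is sketched below. It is only an illustration of the expected behavior, not a patch against history.rs: `MAX_ORIGINAL_IMAGE_TOKENS`, `ceil_div`, and `capped_original_image_bytes` are hypothetical names, the function takes already-decoded dimensions, and the 12,000-token ceiling is the measured maximum reported in this issue rather than a documented constant.

```rust
/// Patch edge length in pixels, as described in this issue (type is an assumption).
const ORIGINAL_IMAGE_PATCH_SIZE: u64 = 32;
/// Bytes-per-token heuristic, as described in this issue (type is an assumption).
const APPROX_BYTES_PER_TOKEN: u64 = 4;
/// Hypothetical constant: the largest server-reported cost measured in this issue.
const MAX_ORIGINAL_IMAGE_TOKENS: u64 = 12_000;

/// Integer ceiling division (hypothetical helper for this sketch).
fn ceil_div(numerator: u64, denominator: u64) -> u64 {
    (numerator + denominator - 1) / denominator
}

/// Byte estimate for an original-detail image, clamped so that the derived
/// token estimate never exceeds MAX_ORIGINAL_IMAGE_TOKENS.
fn capped_original_image_bytes(width: u64, height: u64) -> u64 {
    let patch_count = ceil_div(width, ORIGINAL_IMAGE_PATCH_SIZE)
        * ceil_div(height, ORIGINAL_IMAGE_PATCH_SIZE);
    // One token per patch, but never more than the measured maximum.
    let estimated_tokens = patch_count.min(MAX_ORIGINAL_IMAGE_TOKENS);
    estimated_tokens * APPROX_BYTES_PER_TOKEN
}
```

With this clamp, a 6000x6000 image is estimated at 48,000 bytes (12,000 tokens) instead of 141,376 bytes, while a 1024x768 image keeps its uncapped estimate of 768 patches (3,072 bytes). In the replication scenario, ten views of the 6000x6000 image would then total at most 120,000 estimated tokens rather than 353,440.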