Bug type
Behavior bug
Beta release blocker
No
Summary
On the Codex app-server transport (provider=openai, modelApi=openai-responses, modelId=gpt-5.5), inbound image attachments from Discord (and structurally any channel that populates ctx.MediaPath / ctx.MediaPaths) are inlined into the prompt body as a text reference but never converted to params.images, so the vision-capable model receives no pixels and imagesCount: 0 for every image-bearing turn.
Steps to reproduce
- Configure an agent on this transport (provider: openai, modelApi: openai-responses, modelId: gpt-5.5) connected to a Discord channel.
- Send a message with an image attachment in that channel.
- Inspect <session>.trajectory.jsonl for that turn. The prompt.submitted event contains a prompt body with a [media attached: <abs-path> (image/png) | <abs-path>] line and data.imagesCount: 0. The model.completed.usage shows input token counts consistent with no image parts in the wire payload.
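The inspection step can be scripted. The following is a minimal sketch, assuming the trajectory file holds one JSON event per line with the field names shown in the excerpt under "Logs, screenshots, and evidence". The function name findTextOnlyImageTurns is hypothetical and is not part of OpenClaw.

```javascript
// Sketch: flag prompt.submitted events whose prompt text mentions an
// attachment while the structured image count is zero.
// Assumes one JSON object per line, shaped like the excerpt in this report.
function findTextOnlyImageTurns(jsonlText) {
  const flagged = [];
  for (const line of jsonlText.split("\n")) {
    if (!line.trim()) continue;
    let event;
    try {
      event = JSON.parse(line);
    } catch {
      continue; // skip malformed lines rather than abort the scan
    }
    if (event.type !== "prompt.submitted") continue;
    const prompt = event.data?.prompt ?? "";
    const imagesCount = event.data?.imagesCount ?? 0;
    if (prompt.includes("[media attached:") && imagesCount === 0) {
      flagged.push({
        turnId: event.data?.turnId,
        provider: event.provider,
        modelApi: event.modelApi,
      });
    }
  }
  return flagged;
}
```

To use it, pass the file contents, for example the result of fs.readFileSync(sessionFile, "utf8"). Every entry returned is a turn where the attachment reached the prompt as text only.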
Expected behavior
The same image should reach the model as a vision input, equivalent to the behavior on the provider=openai-codex, modelApi=openai-codex-responses route on the same host, which loads MediaPath into base64 entries via resolveAcpAttachments() (dispatch-acp-oy1KlABg.js:425-455) and detectAndLoadPromptImages() (src/agents/pi-embedded-runner/run/images.ts). On those routes, imagesCount > 0 is observed when an image is attached.
Actual behavior
For every image-bearing turn on the openai/openai-responses route, the prompt body has the [media attached: ...] line, data.imagesCount is 0, and the model has no pixels available. The model still receives a built-in image tool whose description (openclaw-tools-BfDU2PXL.js:4056) states "Only use this tool when images were NOT already provided in the user's message. Images mentioned in the prompt are automatically visible to you." On this transport that second sentence is not true, so the model frequently does not call the image tool and answers as if it had inspected the image. In a sample of 1,123 turns across recent sessions on the same host: 284 turns referenced [media attached: in the prompt body; 246 of those had imagesCount: 0 on this route, and 0 of those 246 had any image data in the wire payload.
Code path observed in the installed package:
- buildUserInput in /opt/homebrew/lib/node_modules/openclaw/dist/thread-lifecycle-BQKXEdzO.js:1084 consumes params.images if populated:

    function buildUserInput(params, promptText = params.prompt) {
      return [
        { type: "text", text: promptText, text_elements: [] },
        ...(params.images ?? []).map((image) => ({
          type: "image",
          url: `data:${image.mimeType};base64,${image.data}`
        }))
      ];
    }

- buildInboundMediaNote in get-reply-wxuuKnTx.js:2225 converts ctx.MediaPath / ctx.MediaPaths into the text line that ends up in the prompt body.
- No code in the Codex app-server reply assembly reads ctx.MediaPath(s) and populates params.images with base64-encoded image bytes. Equivalent loaders exist on the other transports.
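To illustrate the gap, the sketch below copies buildUserInput as quoted in this report and calls it twice: once with only a prompt (what this route does today) and once with params.images populated. The file path and the base64 string are placeholders made up for illustration.

```javascript
// buildUserInput as quoted in this report.
function buildUserInput(params, promptText = params.prompt) {
  return [
    { type: "text", text: promptText, text_elements: [] },
    ...(params.images ?? []).map((image) => ({
      type: "image",
      url: `data:${image.mimeType};base64,${image.data}`,
    })),
  ];
}

// Today on this route: the attachment exists only as text in the prompt.
// The path is a placeholder.
const textOnly = buildUserInput({
  prompt: "look at this [media attached: /tmp/example.png (image/png) | /tmp/example.png]",
});

// If params.images were populated upstream (placeholder base64 payload).
const withImage = buildUserInput({
  prompt: "look at this",
  images: [{ mimeType: "image/png", data: "iVBORw0KGgo=" }],
});

console.log(textOnly.map((item) => item.type)); // [ 'text' ]
console.log(withImage.map((item) => item.type)); // [ 'text', 'image' ]
```

The function itself handles images correctly; the missing piece is whatever should fill params.images before it is called.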
OpenClaw version
2026.5.12
Operating system
macOS 15.4 (Darwin 25.4.0)
Install method
npm global
Model
openai/gpt-5.5
Provider / routing chain
openclaw -> openai (Codex app-server, modelApi=openai-responses) -> OpenAI Responses API
Additional provider/model setup details
- Default agent on openai/gpt-5.5 via Codex app-server. agents.defaults.imageModel not set in ~/.openclaw/openclaw.json (the image tool falls back to the provider's default vision model).
- Same host runs other agents/sessions on openai-codex/gpt-5.5 (Codex CLI native auth, modelApi=openai-codex-responses); on that route, imagesCount > 0 is observed in trajectory logs for image-bearing turns. This is the parity gap.
Logs, screenshots, and evidence
Trajectory event excerpt (sanitized) for a representative image-bearing turn on this transport:
{
  "type": "prompt.submitted",
  "provider": "openai",
  "modelId": "gpt-5.5",
  "modelApi": "openai-responses",
  "data": {
    "turnId": "...",
    "prompt": "... [media attached: ~/.openclaw/media/inbound/image---<uuid>.png (image/png) | ~/.openclaw/media/inbound/image---<uuid>.png] ...",
    "imagesCount": 0
  }
}
Tool call observed when the model does invoke the `image` tool (roughly 5% of image-bearing turns):
{
  "type": "tool.call",
  "data": {
    "name": "image",
    "arguments": {
      "image": "~/.openclaw/media/inbound/image---<uuid>.png",
      "prompt": "Inspect this supplement bottle label proof ..."
    }
  }
}
Aggregate over 30 days on one host:
- 1,123 total turns across all main-agent sessions
- 284 turns with `[media attached:` in the prompt body
- On `provider=openai, modelApi=openai-responses` (this bug): 246 image-bearing turns, 0 had imagesCount > 0
- On `provider=openai-codex, modelApi=openai-codex-responses` (parity): 38 image-bearing turns, 17 had imagesCount > 0
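A per-route breakdown like the one above can be reproduced with a small tally. This is a sketch under the same assumptions about event shape as the excerpt; tallyByRoute is a hypothetical name, and the input is an array of already-parsed trajectory events.

```javascript
// Sketch: for each provider/modelApi pair, count image-bearing turns and
// how many of them reported at least one structured image.
function tallyByRoute(events) {
  const tally = new Map();
  for (const event of events) {
    if (event.type !== "prompt.submitted") continue;
    if (!(event.data?.prompt ?? "").includes("[media attached:")) continue;
    const route = `${event.provider}/${event.modelApi}`;
    const row = tally.get(route) ?? { imageBearing: 0, withImages: 0 };
    row.imageBearing += 1;
    if ((event.data?.imagesCount ?? 0) > 0) row.withImages += 1;
    tally.set(route, row);
  }
  return tally;
}
```

A route where withImages stays at 0 while imageBearing grows is the signature of this bug.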
Impact and severity
- Affected: any agent on provider=openai, modelApi=openai-responses receiving images from a transport that populates ctx.MediaPath(s) (observed on Discord; structurally affects any channel using the same media pipeline).
- Severity: High for any workflow that depends on the model actually seeing an attached image. The model is told images are automatically visible (per the image tool description), so it typically skips the fallback tool and produces text that reads as if it had inspected the image. Downstream this drives confidently wrong visual claims and decision drift.
- Frequency: Always on this transport. 246/246 image-bearing turns observed had imagesCount: 0 in the trajectory and no image parts in the wire payload.
- Consequence: agents return hallucinated visual observations; users acting on those observations (e.g., approving print-ready label proofs) can be misled. Indirectly worsens conversation drift because the model commits to invented details and is then asked to refine them.
Additional information
- A clean fix likely belongs in the Codex app-server reply assembly upstream of buildTurnStartParams: a shared helper that reads ctx.MediaPath / ctx.MediaPaths plus MIME metadata, filters to image/*, applies the same local-root / timeout / max-byte / sanitization rules already used by resolveAcpAttachments() (dispatch-acp-oy1KlABg.js:425-455) and the embedded runner's detectAndLoadPromptImages(), and returns { mimeType, data }[] that gets merged into params.images, preserving ordering with any caller-supplied opts.images. Structured context is the primary source; prompt-text parsing should remain a compatibility fallback.
- The image tool description in openclaw-tools-BfDU2PXL.js:4056 is structurally wrong on transports that do not deliver images even though the underlying model is vision-capable. Detecting transport delivery (not just modelHasVision) and adjusting the description accordingly would defuse the confabulation. Worth folding into the same fix; if not, it is a small follow-up issue.
- Regression markers worth covering in tests: data-URL size caps for OpenAI Responses, dedupe when both opts.images and ctx.MediaPath are present, multi-image ordering, and trajectory coverage asserting data.imagesCount > 0 for image-bearing turns on openai/openai-responses.
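A rough sketch of what such a helper could look like follows. It is not the existing resolveAcpAttachments() or detectAndLoadPromptImages() logic, which this report does not reproduce. The names loadInboundImages, mergeImages and isInsideRoot, the { path, mimeType } input shape, and the 5 MiB cap are all assumptions for illustration. Timeouts and sanitization are left out, and reads are synchronous to keep the example short.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical cap; the real limit used by the existing loaders is not shown here.
const MAX_IMAGE_BYTES = 5 * 1024 * 1024;

// True only when filePath resolves to a location strictly inside root.
function isInsideRoot(filePath, root) {
  const rel = path.relative(path.resolve(root), path.resolve(filePath));
  return rel !== "" && rel.split(path.sep)[0] !== ".." && !path.isAbsolute(rel);
}

// media: [{ path, mimeType }], assumed to be derived from ctx.MediaPath(s)
// plus MIME metadata. Returns [{ mimeType, data }] with base64 data.
function loadInboundImages(media, allowedRoot, maxBytes = MAX_IMAGE_BYTES) {
  const images = [];
  for (const item of media) {
    if (!item.mimeType?.startsWith("image/")) continue; // images only
    if (!isInsideRoot(item.path, allowedRoot)) continue; // local-root rule
    let stat;
    try {
      stat = fs.statSync(item.path);
    } catch {
      continue; // missing or unreadable file
    }
    if (!stat.isFile() || stat.size > maxBytes) continue; // max-byte rule
    images.push({
      mimeType: item.mimeType,
      data: fs.readFileSync(item.path).toString("base64"),
    });
  }
  return images;
}

// Caller-supplied images come first, then inbound ones; exact duplicates
// (same MIME type and same bytes) are dropped, keeping the first occurrence.
function mergeImages(callerImages = [], inboundImages = []) {
  const seen = new Set();
  const merged = [];
  for (const image of [...callerImages, ...inboundImages]) {
    const key = `${image.mimeType}:${image.data}`;
    if (seen.has(key)) continue;
    seen.add(key);
    merged.push(image);
  }
  return merged;
}
```

The merged array would then be assigned to params.images before buildUserInput runs. Deduplicating on content rather than on path is one possible choice. It covers the case where opts.images and ctx.MediaPath point at the same image, at the cost of holding each base64 string as a set key.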