Agent Vision is a macOS-only Codex plugin that lets a local Codex session capture camera frames through a signed local app and materialize them as JPEG files.
It gives Codex a tiny, explicit window into the physical world around your Mac. Not a browser camera hack. Not a cloud vision service. Not an always-on surveillance product wearing a fake mustache and pretending to be productivity software. Just a signed native macOS app, an MCP-backed capture path, and a local JPEG file when you ask for one.
Some people will love this. Some people will absolutely hate it. Both reactions are reasonable.
If the idea of an AI assistant seeing your desk makes your soul leave your body and file a formal complaint, this plugin is not trying to convert you. Agent Vision is for the person who already trusts a local Codex session with real work and wants to say, "look at this thing," without taking a screenshot, emailing themself a photo, dragging files around, or performing the tiny office ritual where you hold a circuit board up to a laptop camera like you are negotiating with the future.
Version 1.0.1 gives Codex four explicit MCP tools:
- agent_vision_snapshot
- agent_vision_start
- agent_vision_frame
- agent_vision_stop
The user-facing slash command is intentionally small:
/agent-vision snapshot
/agent-vision streaming
/agent-vision roast
Snapshot mode starts the camera if needed, waits for a usable JPEG frame, materializes that frame under ~/.codex/agent-vision/frames, displays it with an absolute Markdown image link, and stops the camera only if snapshot started it. If the camera returns a black warm-up frame, Agent Vision keeps the camera on, waits 5 seconds between attempts, and tries up to 3 total attempts.
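The snapshot lifecycle above can be sketched as a small retry loop. This is an illustration of the described behavior, not the plugin's actual implementation: FakeCamera, capture_with_retry, snapshot, and the is_black callback are hypothetical names, and how the real app decides that a frame is black is not stated here.

```python
import time
from typing import Callable

MAX_ATTEMPTS = 3     # total attempts, as described above
WAIT_SECONDS = 5.0   # pause between attempts, never after the last one


class FakeCamera:
    """Hypothetical stand-in for the real capture session."""

    def __init__(self, frames, running=False):
        self.frames = list(frames)
        self.running = running

    def start(self):
        self.running = True

    def stop(self):
        self.running = False

    def frame(self) -> bytes:
        return self.frames.pop(0)


def capture_with_retry(
    capture: Callable[[], bytes],
    is_black: Callable[[bytes], bool],
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Return the first usable frame, or raise after MAX_ATTEMPTS black frames."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        frame = capture()
        if not is_black(frame):
            return frame
        if attempt < MAX_ATTEMPTS:
            sleep(WAIT_SECONDS)  # camera stays on while it warms up
    raise RuntimeError(f"camera returned {MAX_ATTEMPTS} black frames")


def snapshot(camera, is_black, sleep=time.sleep) -> bytes:
    """Capture one frame; stop the camera only if this call started it."""
    started_here = not camera.running
    if started_here:
        camera.start()
    try:
        return capture_with_retry(camera.frame, is_black, sleep)
    finally:
        if started_here:
            camera.stop()
```

Passing sleep as a parameter keeps the sketch testable without waiting; a camera that was already running, for example because streaming mode is active, is left running.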
Streaming mode keeps the camera session active so Codex can pull frames when visual context would help. The Mac camera indicator should stay on while streaming mode is active.
Roast mode is snapshot plus prose: it materializes a usable JPEG frame, passes that exact file to codex exec -i, and returns one opt-in playful roast of 400 characters or fewer. There is no separate roast MCP tool in version 1.0.1.
To stop streaming, tell Codex to stop camera use:
Agent Vision streaming off
The installed skill maps that request to agent_vision_stop.
Agent Vision does not implement:
- Cloud upload.
- Background recording.
- Audio capture.
- Device selection.
- Browser getUserMedia.
- Remote camera access.
- Automatic frame ingestion when streaming mode is off.
The camera stays local. Snapshot and roast mode use a saved JPEG file as the user-visible image contract; MCP image bytes are not treated as directly inspectable model vision input.
Agent Vision is for local-first Codex users who want the assistant to inspect physical things near the computer.
It is useful when the thing you need help with is real, visible, and annoying to describe:
- A handwritten note that says either token or toker, and unfortunately the distinction matters.
- A breadboard where one jumper wire is doing interpretive dance.
- A router light pattern that appears to be communicating in passive aggression.
- A whiteboard diagram that made sense during the meeting and has since become a corporate cave painting.
- A printed error code on a device whose manufacturer believed fonts were a moral weakness.
- A desk setup where the cable situation has entered its final form.
- A receipt, shipping label, part number, serial number, or sticker that you do not want to retype.
- A physical prototype where you need another set of eyes and those eyes can also read Swift.
It is not for people who want their camera to be completely absent from their AI workflow. That is a good boundary. Keep it. This plugin is deliberately explicit because the camera is not a casual permission.
Ask Codex to install Agent Vision from the repository URL:
Install Agent Vision from https://github.com/zfifteen/agent-vision
Codex should download the packaged release from that repository, extract it, run the package install.sh, and then open a new Codex session so /agent-vision and the Agent Vision MCP tools are loaded.
For QA evidence that the install and uninstall lifecycle maps to the available OpenAI/Codex plugin guidance, see docs/agent-vision-install-uninstall-traceability.md.
Manual package install:
curl -L -o agent-vision-1.0.1.tar.gz https://github.com/zfifteen/agent-vision/releases/download/v1.0.1/agent-vision-1.0.1.tar.gz
tar -xzf agent-vision-1.0.1.tar.gz
cd agent-vision-1.0.1
./install.sh
If you are asking Codex to install the plugin for you, use a prompt like this:
Install Agent Vision from https://github.com/zfifteen/agent-vision. Use the packaged release archive from the repo releases, not the source/developer installer. Extract the archive, run ./install.sh, and open a new Codex session before using /agent-vision.
Ask Codex:
Use Agent Vision to start the camera, inspect the latest frame, and tell me what you can read from my note.
Take one image and turn the camera off:
/agent-vision snapshot
Use this when you want one usable image and then want the camera off. If the camera is already on because streaming mode is active, snapshot leaves streaming mode active. Codex should show the saved JPEG through an absolute Markdown image link.
Start streaming mode and keep the camera available:
/agent-vision streaming
Use this when Codex may need to inspect more than one moment in time. While streaming mode is active, Codex can pull frames as needed without asking for each frame. The camera indicator should stay on while this mode is active.
Stop streaming mode and release the camera:
Agent Vision streaming off
You can also say stop streaming or turn off the camera. Codex maps those requests to agent_vision_stop.
Take one image and request immediate emotional damage, responsibly:
/agent-vision roast
Roast mode uses the same camera lifecycle as snapshot mode, then adds a short text response. The roast is opt-in and based only on visible non-sensitive details such as outfit, posture, expression, lighting, or room chaos. It should not infer or attack protected traits, body size, age, disability, or other sensitive attributes. It is a tiny comedy mode, not a license to become a municipal cruelty department.
Read something in the room:
/agent-vision snapshot
What does the label on this device say?
Debug a physical setup:
/agent-vision streaming
Watch this prototype while I press the button. Tell me whether the status LED changes.
Use it as the least glamorous lab assistant ever hired:
/agent-vision snapshot
Is this connector seated correctly, or am I about to spend 45 minutes blaming software for a cable problem?
Use it for desk archaeology:
/agent-vision snapshot
Find the sticky note with the part number and read it back to me.
Use it for gentle accountability:
/agent-vision snapshot
Does my whiteboard plan contain an actual architecture, or did I draw six boxes and hope confidence would do the rest?
Use it when you have made the bold choice to ask your computer for fashion notes:
/agent-vision roast
Roast me in 400 characters or fewer.
The plugin cannot touch objects, move the camera, choose a different camera, or infer anything outside the returned image. If the camera cannot see it, Agent Vision cannot see it either. This is still software, not a dramatic scene from a hacking movie.
flowchart LR
A["Codex slash command"] --> B["agent-vision-capture-file"]
B --> C["Agent Vision MCP wrapper"]
C --> D["AgentVision.app"]
D --> E["AVFoundation"]
E --> F["Built-in Mac camera"]
F --> G["JPEG frame"]
G --> C
C --> B
B --> H["Saved JPEG file"]
H --> A
A --> I["Markdown image link or codex exec -i"]
The plugin package contains:
- .codex-plugin/plugin.json
- .mcp.json
- commands/agent-vision.md
- skills/camera-control/SKILL.md
- dist/AgentVision.app
- dist/agent-vision-mcp
- dist/agent-vision-capture-file
The native app owns the camera permission. The MCP wrapper launches the signed app bundle and bridges JSON-RPC over named FIFOs. The file materializer calls the wrapper, decodes exactly one returned JPEG image, writes it to an explicit absolute path, and prints JSON. This preserves the macOS app identity that Camera permission is attached to while giving Codex an inspectable local file.
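To make the file materializer's job concrete, here is a minimal sketch of the decode-and-write step. It assumes the wrapper returns a standard MCP tool result whose content list carries base64 image items; the function name materialize and the path key in the printed JSON are illustrative assumptions, since the text above only confirms that JSON is printed.

```python
import base64
import json
import os

JPEG_MAGIC = b"\xff\xd8\xff"  # every JPEG stream starts with these bytes


def materialize(tool_result: dict, output_path: str) -> str:
    """Write the single JPEG in tool_result to output_path; return a JSON line."""
    if not os.path.isabs(output_path):
        raise ValueError("output path must be absolute")

    # Assumed MCP shape: {"content": [{"type": "image", "data": "<base64>", ...}]}
    images = [c for c in tool_result.get("content", []) if c.get("type") == "image"]
    if len(images) != 1:
        raise ValueError(f"expected exactly one image, got {len(images)}")

    data = base64.b64decode(images[0]["data"], validate=True)
    if not data.startswith(JPEG_MAGIC):
        raise ValueError("returned image is not a JPEG")

    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    with open(output_path, "wb") as fh:
        fh.write(data)

    # "path" is a hypothetical field; only a success flag is documented.
    return json.dumps({"ok": True, "path": output_path})
```

Rejecting zero or multiple images up front keeps the contract simple: one call, one file, or an error.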
The installer stages the plugin under ~/plugins/agent-vision, caches it under ~/.codex/plugins/cache/local/agent-vision/1.0.1, registers the home-local marketplace and agent-vision@local plugin entry in ~/.codex/config.toml, removes legacy duplicate mcp_servers.agent_vision config, verifies the MCP tool list, and runs a Codex admission check before exiting.
Snapshot mode:
- Codex runs agent-vision-capture-file --output "$OUTPUT" --json.
- The file materializer calls agent_vision_snapshot through the installed MCP wrapper.
- AgentVision.app starts the built-in camera if it is not already running.
- The app waits for and returns one usable JPEG frame.
- The file materializer writes the JPEG to $OUTPUT and prints JSON with ok: true.
- Codex displays the saved JPEG with an absolute Markdown image link.
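The snapshot steps above can be sketched from the caller's side. The --output and --json flags and the ok field come from the steps; everything else is an assumption for illustration, including the helper names, the binary being on PATH, and the alt text of the link.

```python
import json
import os
import subprocess


def parse_capture_output(stdout: str) -> dict:
    """Parse the materializer's JSON and insist on ok: true."""
    result = json.loads(stdout)
    if not isinstance(result, dict) or result.get("ok") is not True:
        raise RuntimeError(f"capture failed: {stdout.strip()}")
    return result


def markdown_image_link(path: str, alt: str = "Agent Vision snapshot") -> str:
    """Build the absolute Markdown image link shown to the user."""
    if not os.path.isabs(path):
        raise ValueError("image link must use an absolute path")
    return f"![{alt}]({path})"


def run_capture(output_path: str, binary: str = "agent-vision-capture-file") -> dict:
    """Run the materializer (assumed to be on PATH) and validate its output."""
    proc = subprocess.run(
        [binary, "--output", output_path, "--json"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_capture_output(proc.stdout)
```

A path containing spaces or parentheses would need escaping before it goes into a Markdown link; the sketch leaves that out.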
Roast mode:
- Codex runs agent-vision-capture-file --output "$OUTPUT" --json.
- The file materializer writes one usable JPEG to $OUTPUT.
- Codex runs codex exec --ephemeral -i "$OUTPUT" -- "...roast prompt...".
- Codex returns the saved JPEG link and the roast text from that image-input pass.
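As a rough sketch of the roast steps, the image-input pass can be assembled as an argument list and the reply checked against the 400-character limit. The codex exec flags are taken from the steps above; roast_command, within_limit, and the sample prompt are hypothetical, and how the real command enforces the limit is not specified.

```python
import shlex

ROAST_LIMIT = 400  # characters, per the roast mode description


def roast_command(image_path: str, prompt: str) -> list:
    """Build the argv for the image-input pass without running it."""
    return ["codex", "exec", "--ephemeral", "-i", image_path, "--", prompt]


def within_limit(text: str, limit: int = ROAST_LIMIT) -> bool:
    """True when the roast fits the documented length budget."""
    return len(text) <= limit


if __name__ == "__main__":
    argv = roast_command("/tmp/frame.jpg", "Roast me in 400 characters or fewer.")
    print(shlex.join(argv))  # shell-quoted form, handy for logging
```

Building a list instead of a shell string means the image path and the prompt are passed as single arguments, whatever characters they contain.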
Streaming mode:
- Codex calls agent_vision_start.
- AgentVision.app keeps a capture session active.
- Codex calls agent_vision_frame whenever visual context would help.
- Codex calls agent_vision_stop when the user asks to stop camera use.
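On the wire, the streaming steps above amount to a short sequence of MCP tool calls. The sketch below builds them as JSON-RPC 2.0 tools/call requests, the general MCP shape for invoking a tool. It assumes the Agent Vision tools take no arguments, which is not confirmed here, and the helper names are illustrative.

```python
import itertools
import json

_ids = itertools.count(1)  # JSON-RPC request ids must be unique per session


def tool_call(name: str, arguments: dict = None) -> dict:
    """Build one MCP tools/call request."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        # Assumption: the Agent Vision tools accept an empty arguments object.
        "params": {"name": name, "arguments": arguments or {}},
    }


def streaming_session(frame_pulls: int) -> list:
    """Start, pull frame_pulls frames on demand, then stop."""
    calls = [tool_call("agent_vision_start")]
    calls += [tool_call("agent_vision_frame") for _ in range(frame_pulls)]
    calls.append(tool_call("agent_vision_stop"))
    return calls


if __name__ == "__main__":
    for request in streaming_session(frame_pulls=2):
        print(json.dumps(request))
```

Every frame is an explicit request between start and stop, which is what makes the mode pull-based: nothing reaches Codex unless it asks.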
The user-visible invariant is simple: snapshot should blink the camera on briefly; streaming should keep the camera indicator on until stopped.
Agent Vision is explicit and pull-based. Snapshot mode starts the camera only for one frame. Streaming mode starts only when Codex calls agent_vision_start; frames are returned only when Codex calls agent_vision_frame; the session stops when Codex calls agent_vision_stop.
macOS asks for camera permission for the signed AgentVision.app the first time the capture session starts. Repeated prompts usually mean the app identity changed and the local installer should be rerun.
Version 1.0.1 does not implement background recording, cloud upload, device selection, audio capture, or unsolicited streaming into Codex context.
See PRIVACY.md for the standalone policy.
Run the test suite:
swift test
Build the release executable:
swift build -c release
Validate manifests and release build without installing:
scripts/install-local.sh --dry-run
Run the slash-command matrix:
scripts/test-slash-commands.sh
Verify file-backed snapshot while streaming is active:
scripts/test-streaming-interaction.sh
Build a release archive:
AGENT_VISION_SIGN_IDENTITY="Developer ID Application: Your Name (TEAMID)" \
scripts/package-release.sh
Uninstall the local plugin:
scripts/uninstall-local.sh
The source installer is for development and release production. It builds and signs AgentVision.app locally, so it requires Swift, Xcode command line tools, and a local signing identity. The default install path for users is the packaged release installer.
If the slash command does not appear, verify the local plugin cache exists:
ls ~/.codex/plugins/cache/local/agent-vision/1.0.1
If macOS repeatedly asks for camera permission, rerun the installer. Camera permission is tied to the signed AgentVision.app identity.
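The cache check can also be scripted. This small sketch only rebuilds the cache path named above and reports whether it exists; plugin_cache_dir is a hypothetical helper, and the home parameter exists purely to make it testable.

```python
from pathlib import Path


def plugin_cache_dir(version: str = "1.0.1", home: Path = None) -> Path:
    """Return the expected local plugin cache directory for a version."""
    base = home if home is not None else Path.home()
    return base / ".codex" / "plugins" / "cache" / "local" / "agent-vision" / version


if __name__ == "__main__":
    cache = plugin_cache_dir()
    status = "found" if cache.is_dir() else "missing"
    print(f"{cache}: {status}")
```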
If streaming says it started but the camera indicator is off, the MCP process is not being kept alive. Streaming mode requires a persistent MCP session.
If frames are unavailable immediately after starting streaming, wait briefly and pull again. If frame errors persist, report the exact camera error instead of substituting another capture path.
If snapshot or roast mode sees a black frame, it treats that as camera warm-up and keeps the camera on. The 5-second wait happens between attempts, for 3 total attempts. After 3 black-frame attempts, it returns an error instead of handing Codex a useless image.
MIT
