landing page: https://landing-amber-five-23.vercel.app/

Inspiration

If you have ever built a frontend with an AI coding agent, you know the pain: the agent takes a screenshot after every action, sends the full image to a vision model, and your context window fills up fast. Most of those screenshots are nearly identical to the one before. Maybe a form field got filled in, maybe a button changed color, but the agent is paying full-resolution image tokens to rediscover that 95% of the screen looks exactly the same.

Microsoft Research's ReVision paper measured this directly: 36 to 56% of consecutive screenshots in real agent traces are pixel-identical. The rest usually differ in only a small region. ReVision and similar work (ShowUI, HistPrune-GUI) attack this problem by pruning tokens inside a specific VLM, but they require fine-tuning that model and offer no install path for end users.

We drew inspiration from their core finding (most frames are redundant) but asked a different question: what if we handled the redundancy before the image ever reaches the model? Instead of sending full screenshots and hoping the model ignores what did not change, we extract only the changes locally using cheap image diffing and OCR, represent them as structured text, and only call a vision model for the small fraction of frames where text alone cannot explain what happened. That is StateLens.

What it does

StateLens is a model-agnostic compression layer that sits between a UI agent and its reasoning model. It watches the screenshot stream, filters redundant frames, extracts semantic state changes locally, and replaces expensive image input with a 50-token JSON event when it is safe to do so. It works with any vision model (Anthropic, OpenAI, Gemini) and ships as an npm package you can drop into any project.

The pipeline runs in six stages, five of them local with no model call at all. First, a visual gate (perceptual hash plus pixelmatch) kills obvious no-ops. Then we localize the region that actually changed, OCR-diff just the changed crop, and score importance to pick keyframes worth reasoning about. Only at stage five do we optionally call a small VLM (Haiku, downscaled). Stage six emits the result as a semantic event into a per-session timeline.

We ship three delivery surfaces backed by the same pipeline, but the HTTP proxy is the primary one. You boot StateLens, point ANTHROPIC_BASE_URL at it, and you are done. Zero code change, no tool calls, no prompt edits, no MITM, no custom CA. The MCP server is the editor-compatibility adapter for Cursor, Claude Code, and Claude Desktop. The in-process library is for teams who control their own Playwright or computer-use loop and want to import StateLens directly as a package.

The numbers (and we are precise about which surface produced them):

The in-process pipeline harness (controlled SDK loop, no agent overhead) showed what the pipeline can deliver: 81.9% token reduction, 90.1% cost reduction, 100% lenient accuracy, and zero misses on a 12-frame login flow; 69.9% / 81.2% / 77.8% (token reduction / cost reduction / lenient accuracy) on a 10-frame Zara checkout. Those are real Anthropic API tokens with internal Haiku spend counted against us.

The production HTTP proxy, running over a real network round-trip on the same 12-frame login flow, delivers 31.3% cost reduction (prev+curr per turn) with 100% lenient accuracy and zero misses preserved through the wire, and 59.4% cost reduction in the single-image-per-turn shape that real computer-use and browser agents actually use. Reproducible end-to-end with node dist/eval/measure_proxy.js.
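To make the first two pipeline stages concrete, here is a minimal sketch of a visual gate followed by change localization. It is not StateLens's implementation: the real gate uses a perceptual hash plus pixelmatch, while this version does a plain per-pixel comparison on raw RGBA buffers so it runs without dependencies. The names Frame, diffFrames, gate, and the tolerance and minRatio thresholds are all assumptions made for illustration.

```typescript
// Hypothetical types and thresholds, for illustration only.
interface Frame {
  width: number;
  height: number;
  data: Uint8Array; // RGBA, 4 bytes per pixel, row-major
}

interface Box {
  x: number;
  y: number;
  width: number;
  height: number;
}

interface DiffResult {
  changedPixels: number;
  changedRatio: number; // changedPixels / total pixels
  box: Box | null; // bounding box of the change, null when nothing changed
}

// Compare two same-sized frames and return how much changed and where.
function diffFrames(prev: Frame, curr: Frame, tolerance = 16): DiffResult {
  if (prev.width !== curr.width || prev.height !== curr.height) {
    throw new Error("frames must have the same dimensions");
  }
  let changed = 0;
  let minX = curr.width;
  let minY = curr.height;
  let maxX = -1;
  let maxY = -1;
  for (let y = 0; y < curr.height; y++) {
    for (let x = 0; x < curr.width; x++) {
      const i = (y * curr.width + x) * 4;
      // Sum of absolute RGB differences; alpha is ignored.
      const delta =
        Math.abs(prev.data[i] - curr.data[i]) +
        Math.abs(prev.data[i + 1] - curr.data[i + 1]) +
        Math.abs(prev.data[i + 2] - curr.data[i + 2]);
      if (delta > tolerance) {
        changed++;
        if (x < minX) minX = x;
        if (x > maxX) maxX = x;
        if (y < minY) minY = y;
        if (y > maxY) maxY = y;
      }
    }
  }
  const box =
    changed === 0
      ? null
      : { x: minX, y: minY, width: maxX - minX + 1, height: maxY - minY + 1 };
  return {
    changedPixels: changed,
    changedRatio: changed / (curr.width * curr.height),
    box,
  };
}

// Stage 1: drop frames whose change is below minRatio.
// Stage 2: hand the changed region to later stages (OCR diff, scoring).
function gate(prev: Frame, curr: Frame, minRatio = 0.0005) {
  const result = diffFrames(prev, curr);
  const route = result.changedRatio < minRatio ? "no_change" : "localize";
  return { route, result };
}
```

Only frames routed to "localize" would go on to the OCR diff, and only the crop described by box would need to be read, which is where the savings over full-frame processing come from.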

How we built it

Written in TypeScript on Node 20+. The local-first pipeline uses sharp for image work, pixelmatch for the visual gate, and tesseract.js for OCR. The selective VLM stage calls Claude Haiku 4.5 with downscaled inputs. The proxy is a small HTTP server speaking the Anthropic /v1/messages shape; the MCP server is built on @modelcontextprotocol/sdk. Schemas are Zod, tests are Vitest, and we ship a Playwright adapter (captureAndRoute) for browser agents.

The measurement story drove the build. We built a two-scenario harness (a 12-frame Sephora login and a 10-frame Zara checkout) and ran every change through it. Accuracy uses Claude as a judge: did StateLens's compressed event describe the same thing as a full-image baseline? Every number in the README comes from committed JSON files under eval/results/ and reproduces with npm run measure.

Challenges we ran into

The baseline was wrong. Phase 3 showed only 25% reduction. Diagnosis: Sonnet was getting one image and a "summarize what changed" prompt, so it correctly said "I can't compare without a previous screenshot." We made the baseline fair (prev+curr per turn), which doubled baseline tokens but made the comparison honest. Real reduction jumped to 81.9%.

Haiku was sending full-res images. Each Haiku call cost around 3,500 tokens because we encoded the original resolution. Downscaling to 768px on the long edge produced no measurable quality loss, and per-call tokens dropped to around 1,300.

OCR garbage on form-heavy frames. Zara's stylized form fields were tesseract'd into fragments like "a / ® |", and the importance scorer accepted it as "text-present." We built isTextReliable(), which rejects short, low-alphanumeric, fragmentary OCR, so form-heavy frames route to Haiku instead. Checkout accuracy lifted from 56% to 78%.

The big one: MCP was 27% more expensive. We shipped MCP first and dogfooded it in Claude Code. Same prompt, same model, swap Read for statelens_observe, and the result was +27% cost. Diagnosis: not the pipeline. Three MCP-specific overheads caused the problem: verbose tool-call args, tool definitions cached every turn (around 22k extra cache reads), and JSON responses churning the cache because every observation is textually unique. All three exist whether the tool is called or not. Modeled breakeven on Opus: 30 to 50 frames. Below that, MCP overhead dominates. So we shipped the proxy. Same pipeline, different surface, and it runs outside the agent's context with no tool-def tax and no cache churn. Result: +27% flipped to -31% (eval parity) or -59% (computer-use shape). The same pipeline, on the right delivery surface.

Proxy gzip bug. The first end-to-end run blew up with "incorrect header check." fetch() auto-decompresses Anthropic's gzipped responses, but we were forwarding the original content-encoding: gzip and stale content-length headers. Our unit tests used Response.json() and never exercised the real network path. One-line fix; durable lesson about trusting mocked tests for transport-layer code.
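Two of the fixes above are small enough to sketch. The first function is one plausible reading of isTextReliable(): the name comes from the write-up, but the thresholds and the exact rules here are assumptions, and the ASCII-only alphanumeric check is a simplification. The second, forwardableHeaders, is a hypothetical helper showing the usual shape of the gzip fix: once fetch() has decompressed the body, the upstream content-encoding and content-length no longer describe it and should not be forwarded.

```typescript
// Assumed thresholds; the real isTextReliable() may use different rules.
function isTextReliable(
  ocrText: string,
  minLength = 6,
  minAlnumRatio = 0.6
): boolean {
  const text = ocrText.trim();
  // Rule 1: too short to carry meaning.
  if (text.length < minLength) return false;

  // Rule 2: mostly symbols rather than letters or digits (ASCII only here).
  const nonSpace = text.replace(/\s/g, "");
  const alnum = nonSpace.replace(/[^A-Za-z0-9]/g, "");
  if (alnum.length / nonSpace.length < minAlnumRatio) return false;

  // Rule 3: fragmentary, meaning most tokens are one or two characters.
  const tokens = text.split(/\s+/);
  const shortTokens = tokens.filter((t) => t.length <= 2).length;
  return shortTokens / tokens.length <= 0.5;
}

// Headers that describe the compressed upstream body, not the
// decompressed body the proxy actually sends on.
const STALE_HEADERS = ["content-encoding", "content-length"];

// Hypothetical helper: copy upstream headers, dropping the stale ones.
function forwardableHeaders(
  upstream: Record<string, string>
): Record<string, string> {
  const out: Record<string, string> = {};
  for (const name of Object.keys(upstream)) {
    if (STALE_HEADERS.indexOf(name.toLowerCase()) === -1) {
      out[name] = upstream[name];
    }
  }
  return out;
}
```

With these thresholds, the fragment "a / ® |" fails the alphanumeric check and would be routed to the VLM, while a label such as "Sign in to your account" passes.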

Accomplishments that we're proud of

The proxy ships and delivers the savings on a real network round-trip. Same code on both sides. The only difference between Run A and Run B is the baseURL of the Anthropic client. Run B goes through the StateLens proxy, hits Anthropic over real HTTP, and comes back 31.3% cheaper in eval-parity mode (100% lenient accuracy, zero misses) and 59.4% cheaper in the single-image shape that real agent loops use. Drop-in via one environment variable, no code change in the agent.

The MCP-to-proxy reversal. We shipped MCP first, dogfooded it in Claude Code, and found it was 27% more expensive than just using Read. Same pipeline, same model; the surface was the problem (tool-def cache tax, JSON cache churn, verbose tool args, all paid per turn whether the tool is called or not). We diagnosed it, shipped the proxy form, and the same pipeline flipped from +27% to -31% (eval parity) or -59% (computer-use shape). That reversal is the most credible thing in the project.

The action_failed event. When the agent passes an actionLabel and expects the action to mutate the UI but the screen does not change, the proxy emits event_type: action_failed instead of no_change. No OCR, no VLM, no false silence hiding a stuck page. The first signal a self-healing test agent needs is sitting on the same data path as the cost savings.

Honest accounting. Every Haiku token StateLens spends internally is counted against the savings, not hidden by shifting work to a cheaper model. Internal Haiku calls inside the proxy bypass StateLens via a recursion guard, so the pipeline does not loop through itself. Every number reproduces from committed JSON in eval/results/.
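The action_failed rule described above can be sketched as a small classifier. The field names event_type and actionLabel and the values action_failed and no_change come from the write-up; everything else (the Observation shape, the expectsChange flag, the state_change value, and the minRatio threshold) is an assumption made for illustration, not the proxy's actual schema.

```typescript
// "state_change" is a placeholder for whatever the later stages emit.
type EventType = "no_change" | "action_failed" | "state_change";

interface Observation {
  changedRatio: number; // fraction of pixels that differ from the previous frame
  actionLabel?: string; // set when the agent has just performed an action
  expectsChange?: boolean; // whether that action should have mutated the UI
}

interface SemanticEvent {
  event_type: EventType;
  action?: string;
}

// Decide the event type from the visual gate's output alone:
// no OCR and no VLM call is needed to detect a stuck page.
function classify(obs: Observation, minRatio = 0.0005): SemanticEvent {
  const screenChanged = obs.changedRatio >= minRatio;
  if (screenChanged) {
    return { event_type: "state_change", action: obs.actionLabel };
  }
  if (obs.actionLabel !== undefined && obs.expectsChange === true) {
    // The agent acted and expected a change, but the screen is the same.
    return { event_type: "action_failed", action: obs.actionLabel };
  }
  return { event_type: "no_change" };
}
```

For example, classify({ changedRatio: 0, actionLabel: "click Submit", expectsChange: true }) yields an action_failed event, while the same unchanged frame with no action label yields no_change.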

What we learned

The delivery surface matters as much as the pipeline. A great compression algorithm wrapped in a bad delivery surface (per-turn cache tax, tool-def overhead, JSON churn) can be net-negative. The proxy form runs outside the agent's context, on the data path. Adopting it costs one environment variable.

The compression that saves money also produces a side benefit: runs become debuggable. Raw screenshots disappear into the model context. StateLens leaves a timeline you can replay, answering "what changed, when, which frames actually needed vision, did an action do nothing?"

What's next for StateLens

A Playwright Test fixture with lens.step("submit form", action) and a built-in healer signal. A Stagehand adapter for hosted browser agents. Healer-loop packaging that wraps action_failed plus the timeline into a retry-context bundle. A cost dashboard for agentic testing.

Provider expansion is the next major surface. The in-process library is already model-agnostic: observe() returns a route decision plus a text summary, and the user picks which downstream model to call. Next on the roadmap are OpenAI- and Gemini-compatible proxies, so the zero-code-change story extends beyond Anthropic SDKs, plus a configurable internal VLM (STATELENS_INTERNAL_VLM_PROVIDER) so users can swap the Haiku triage call for GPT-4o-mini, Gemini Flash, or a local vision model, pushing the internal cost arbitrage further.

We also built a three-way measurement methodology comparing StateLens not just against direct Sonnet but also against a naive Sonnet-to-Haiku model swap, so positioning stays honest about where the visual gate is uniquely valuable versus where a cheaper model would also work.
