StateLens

Screenshot gateway for computer-use agents. A drop-in Anthropic-compatible proxy that replaces redundant screenshots with compact text observations. Same agent code, one env var change. Measured on real Anthropic API calls: about 46% fewer input tokens and about 59% lower cost through the proxy (login flow), and 70-82% fewer input tokens and 81-90% lower cost with the in-process pipeline across two UI flows.

StateLens sits between a UI agent and its reasoning model. It watches screenshot streams, filters redundant frames, extracts semantic state changes, and replaces expensive image input with compact text observations when it is safe to do so.

Use it when you are building or running screenshot-heavy browser/computer-use agents and want to stop paying for frames that did not meaningfully change.

Architecture in one paragraph

StateLens is a cost-arbitrage pipeline. It uses a cheap-but-capable vision model (Anthropic Haiku by default) to extract text observations from screenshots, so your expensive primary model (Sonnet, GPT-4o, Gemini Pro — whichever you're committed to) only processes images on frames where vision genuinely matters. On frames that are pixel-identical to the prior frame, no LLM is called at all — the visual gate skips them locally. This means our savings are above and beyond any "just use a cheaper model" swap, because cheaper models still pay per frame; the visual gate doesn't.
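
The visual gate described above can be illustrated with a minimal sketch. It assumes "pixel-identical" is approximated by comparing a SHA-256 hash of the encoded screenshot bytes; the `VisualGate` class and its method names are hypothetical and are not taken from the StateLens source.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical sketch of a local visual gate; not StateLens's actual code.
type GateResult = 'skip' | 'analyze';

class VisualGate {
  // Hash of the last frame seen, keyed by session ID.
  private lastHash = new Map<string, string>();

  check(sessionId: string, frame: Uint8Array): GateResult {
    const hash = createHash('sha256').update(frame).digest('hex');
    const previous = this.lastHash.get(sessionId);
    this.lastHash.set(sessionId, hash);
    // A first frame has nothing to compare against, so it is always analyzed.
    return previous === hash ? 'skip' : 'analyze';
  }
}

const gate = new VisualGate();
const frame = new Uint8Array([137, 80, 78, 71]);
console.log(gate.check('login_flow', frame)); // first frame of the session
console.log(gate.check('login_flow', frame)); // same bytes again
```

Byte-identical frames are necessarily pixel-identical, so a `skip` here never hides a change. The reverse does not hold if the encoder is non-deterministic; in this sketch such frames simply fall through to analysis.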

What's supported today

| Component | Supports | Multi-provider? |
| --- | --- | --- |
| Local API proxy (`statelens proxy`), the recommended path | Anthropic only today; OpenAI / Gemini planned | No (today) |
| In-process library (`observe`, `routeObservation`, `captureAndRoute`) | Any LLM provider you call yourself | Yes, model-agnostic by design |
| MCP server (`statelens serve`), secondary (see the +27% MCP overhead investigation) | Any MCP-compatible client | Yes |
| Internal VLM (for the text observation step) | Anthropic Haiku | Hard-coded today; configurable VLM provider planned for v0.2.0 |

Requirements: ANTHROPIC_API_KEY must be set for the proxy and for the library's observe(), because the internal observation step currently uses Anthropic Haiku. The library can be wired in front of any model for the primary call (the one the agent makes), which needs that provider's own key; only the internal Haiku step is Anthropic-bound.

Library is model-agnostic today: if you control your agent loop, observe() returns a route decision (skip_vision / use_text_observation / use_full_vision) plus a text summary. You decide which model to call on the resulting route. See Use In Process for the code.

Proxy is Anthropic-only today: it speaks Anthropic's wire format (/v1/messages, base64 image content blocks). OpenAI- and Gemini-compatible proxies are planned for v0.2.0. If you need them sooner, the in-process library places no constraint on the provider of your primary model.

Install

npm install -g statelens-sdk

Or from source:

git clone https://github.com/zhizhongs/statelens.git
cd statelens
npm install
npm run build
npm link

End-to-End: From `npm install` To Measured Savings

Three commands, one terminal, real Anthropic API dollars. No code changes to your agent.

1. Install

npm install -g statelens-sdk
export ANTHROPIC_API_KEY=sk-ant-...

2. Start The Proxy

statelens proxy --provider anthropic --port 8443

The proxy is an Anthropic-compatible endpoint. It intercepts POST /v1/messages, runs the StateLens pipeline on the screenshot blocks, and forwards a rewritten request upstream.

3. Point Your Agent At It

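Any Anthropic SDK-based computer-use agent works. The only line that changes is the client's baseURL (or leave the code untouched and set ANTHROPIC_BASE_URL, as shown under Use The Proxy):

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic({
  baseURL: 'http://127.0.0.1:8443',  // ← that's the entire integration
});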

4. Reproduce The Numbers

git clone https://github.com/zhizhongs/statelens.git && cd statelens
npm install && npm run build
bash demo/record.sh

You get a one-page report straight from response.usage.input_tokens, comparing the same agent code running direct vs through the proxy:

                          direct      via proxy
  input tokens             18996         10215
  cost (USD)            $0.066768     $0.027457

  token reduction: 46.2%
  cost  reduction: 58.9%
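
The two reduction percentages in the report above follow directly from the direct and via-proxy columns. A minimal sketch of that arithmetic, where `reductionPercent` is a hypothetical helper name and not part of the StateLens package:

```typescript
// Relative reduction in percent, rounded to one decimal place.
function reductionPercent(direct: number, viaProxy: number): number {
  if (direct <= 0) throw new RangeError('direct must be positive');
  return Math.round((1 - viaProxy / direct) * 1000) / 10;
}

// Figures copied from the sample report.
const tokenReduction = reductionPercent(18996, 10215);
const costReduction = reductionPercent(0.066768, 0.027457);

console.log(`token reduction: ${tokenReduction}%`); // 46.2%
console.log(`cost  reduction: ${costReduction}%`);  // 58.9%
```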

Inspect what the proxy actually did:

curl http://127.0.0.1:8443/sessions/<id>/timeline

Numbers are real Anthropic API token counts and include the Haiku tokens StateLens spends internally — no "shifted to a cheaper model" trick. Full methodology in RESULTS.md.

Use The Proxy

Use the Anthropic-compatible proxy when your agent or SDK can set ANTHROPIC_BASE_URL. This is the most transparent integration because StateLens sits directly in the model-request path.

Start StateLens:

export ANTHROPIC_API_KEY=sk-ant-...
statelens proxy --provider anthropic --port 8443

Point your Anthropic SDK-based agent at it:

export ANTHROPIC_BASE_URL=http://127.0.0.1:8443

Then run your existing agent normally.

Health check:

curl http://127.0.0.1:8443/health

Optional session controls:

export STATELENS_SESSION_ID=my-run
# or send x-statelens-session-id: my-run
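
If you would rather set the session header from code than from the environment, the options can be built once and passed to the SDK constructor. This is a sketch: `proxyClientOptions` is a hypothetical helper, and it assumes the Anthropic TypeScript SDK's `baseURL` and `defaultHeaders` constructor options.

```typescript
// Hypothetical helper: constructor options for an Anthropic SDK client that
// routes through a local StateLens proxy and tags every request with a session ID.
interface ProxyClientOptions {
  baseURL: string;
  defaultHeaders: Record<string, string>;
}

function proxyClientOptions(sessionId: string, port = 8443): ProxyClientOptions {
  if (sessionId.trim() === '') throw new Error('sessionId must not be empty');
  return {
    baseURL: `http://127.0.0.1:${port}`,
    defaultHeaders: { 'x-statelens-session-id': sessionId },
  };
}

console.log(proxyClientOptions('my-run'));
```

With the SDK installed, usage would look like `new Anthropic(proxyClientOptions('my-run'))`; the API key is still read from ANTHROPIC_API_KEY.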

Debug endpoints:

curl http://127.0.0.1:8443/sessions/my-run/timeline
curl -X POST http://127.0.0.1:8443/sessions/my-run/reset

Proxy behavior:

Targets Anthropic /v1/messages.
Detects base64 screenshot image blocks in the latest user message.
Forwards the request unchanged on first frames, on analysis errors, and when the request is ambiguous.
Replaces safe-to-compress screenshots with StateLens text observations.
Does not MITM traffic or require a custom CA certificate.
Does not synthesize provider responses by default.
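
The rewrite step listed above can be pictured as a pure function over the request's messages. The sketch below is an illustration, not the proxy's actual code: the type names and `replaceLatestScreenshots` are hypothetical, and it only handles top-level image blocks in the Anthropic content-block shape (real computer-use traffic may nest screenshots inside other blocks).

```typescript
type TextBlock = { type: 'text'; text: string };
type ImageBlock = {
  type: 'image';
  source: { type: 'base64'; media_type: string; data: string };
};
type ContentBlock = TextBlock | ImageBlock;
type Message = { role: 'user' | 'assistant'; content: string | ContentBlock[] };

// Replace image blocks in the latest user message with one text observation.
// Anything that does not match the expected shape is returned unchanged.
function replaceLatestScreenshots(messages: Message[], observation: string): Message[] {
  let index = -1;
  for (let i = messages.length - 1; i >= 0; i--) {
    if (messages[i]?.role === 'user') { index = i; break; }
  }
  const target = index >= 0 ? messages[index] : undefined;
  if (!target) return messages;

  const content = target.content;
  // Plain-string content cannot carry a screenshot.
  if (typeof content === 'string') return messages;
  if (!content.some((block) => block.type === 'image')) return messages;

  const rewritten: ContentBlock[] = [];
  let inserted = false;
  for (const block of content) {
    if (block.type !== 'image') { rewritten.push(block); continue; }
    // The first image becomes the observation; later images are dropped.
    if (!inserted) { rewritten.push({ type: 'text', text: observation }); inserted = true; }
  }
  // Copy rather than mutate, so the original request can still be forwarded as-is.
  return messages.map((m, i) => (i === index ? { ...m, content: rewritten } : m));
}
```

Returning the input untouched on every unexpected shape mirrors the "forward unchanged when in doubt" behavior in the list above.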

Use MCP (secondary)

Prefer the proxy if you have the choice. We dogfooded an MCP-based Claude Code session against the same flow and it came out 27% more expensive than baseline: MCP tool definitions, tool-call argument payloads, and JSON cache churn dominated short sessions. The proxy has no such per-turn tax. Full investigation in RESULTS.md.

MCP makes sense only when the client can't take a baseURL override (some IDEs, some legacy clients). Otherwise, use the proxy.

If you do want MCP, add to ~/.cursor/mcp.json (or the equivalent for Claude Code / Claude Desktop):

{
  "mcpServers": {
    "statelens": {
      "command": "statelens",
      "args": ["serve"]
    }
  }
}

Tool use is voluntary — the agent has to decide to call StateLens tools. That's the main reason the proxy is preferred: it's automatic and transparent.

Use In Process

When to reach for the library (vs the proxy)

The proxy is the default recommendation — one env var, zero code change. Reach for the in-process library when you need one of these:

| If you... | The library gives you... |
| --- | --- |
| Control the agent loop and want maximum savings (up to ~82% tokens / ~90% cost in-process vs ~46% / ~59% through the proxy) | The freedom to `return` and make zero model calls on no-change frames; the proxy must still forward a request, because it never synthesizes a response |
| Use a model other than Anthropic (OpenAI, Gemini, local Llama-Vision) | A model-agnostic API: `observe()` returns a route decision plus text observation; you wire in whatever model you want |
| Run UI tests with AI in the loop (Stagehand-style, Browserbase, custom AI-augmented Playwright) | The `action_failed` signal (Playwright reports that the click succeeded; StateLens reports whether the UI actually responded), plus a session timeline you can attach to failure artifacts |
| Need to branch behavior per route (e.g., write the screenshot to disk on `use_full_vision`, retry on `action_failed`) | Direct access to the routing decision before any model call happens |

If none of those apply, prefer the proxy — it's strictly less integration work.

Code

Use the TypeScript adapter when you control the agent loop.

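import { captureAndRoute } from 'statelens-sdk';

// `page` is any object exposing screenshot(): Promise<Buffer> (e.g. a Playwright page).
// `reasoningModel` is a placeholder for your own call to the primary model.
const { observation, route, screenshot } = await captureAndRoute(page, {
  sessionId: 'login_flow',
  actionLabel: 'click_submit',
});

switch (route.route) {
  case 'skip_vision':
    // Nothing meaningful changed: make no model call at all.
    break;
  case 'use_text_observation':
    // Send the compact text observation instead of the image.
    await reasoningModel({ text: route.context });
    break;
  case 'use_full_vision':
    // Vision genuinely matters on this frame: send the full screenshot.
    await reasoningModel({ image: screenshot });
    break;
}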

The adapter accepts any object with screenshot(): Promise<Buffer>, including Playwright, Puppeteer, or your own browser/desktop driver.
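
Because the contract is structural, a custom driver only needs that one method. The sketch below shows a source that replays pre-recorded frames, which can be handy in tests; `ScreenshotSource` and `ReplaySource` are hypothetical names chosen for illustration, not exports of statelens-sdk.

```typescript
// The shape described above: anything with screenshot(): Promise<Buffer>.
interface ScreenshotSource {
  screenshot(): Promise<Buffer>;
}

// Hypothetical driver that replays recorded frames in order and
// keeps returning the last one once the recording is exhausted.
class ReplaySource implements ScreenshotSource {
  private readonly frames: Buffer[];
  private index = 0;

  constructor(frames: Buffer[]) {
    if (frames.length === 0) throw new Error('at least one frame is required');
    this.frames = frames;
  }

  async screenshot(): Promise<Buffer> {
    const position = Math.min(this.index, this.frames.length - 1);
    this.index += 1;
    return this.frames[position] as Buffer;
  }
}

const source: ScreenshotSource = new ReplaySource([
  Buffer.from('frame-1'),
  Buffer.from('frame-2'),
]);
source.screenshot().then((frame) => console.log(frame.length));
```

An object like `source` would then stand in for `page` in the `captureAndRoute` call shown earlier, assuming the adapter relies on nothing beyond `screenshot()`.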

The lower-level routing helper is also available:

import { observe, routeObservation } from 'statelens-sdk';

const observation = await observe(screenshotBuffer, 'session-id', 'click_submit');
const route = routeObservation(observation);

Integration Surfaces

Ordered by recommendation, most to least:

| Surface | Status | Best for |
| --- | --- | --- |
| Local API proxy (`statelens proxy`) | Built for Anthropic | SDK-based agents or binaries that support `baseURL` / endpoint overrides. Recommended. Transparent, no per-turn tax. |
| In-process routing helper | Built, model-agnostic | Custom Playwright/Puppeteer/browser-use style loops where you control the agent code. Highest savings ceiling (up to 81.9% tokens / 90.1% cost on the login flow). |
| MCP server (`statelens serve`) | Built but secondary | Clients that can't take a `baseURL` override. Comes with per-turn MCP overhead; see the +27% Claude Code regression before choosing this surface. |
| SDK middleware | Planned (v0.2.0) | Apps that instantiate the Anthropic/OpenAI SDK in code and can wrap the client |
| OpenAI / Gemini proxy | Planned (v0.2.0) | Same as the Anthropic proxy but for other providers |

All surfaces use the same pipeline:

screenshot
  -> visual gate
  -> spatial diff
  -> OCR diff
  -> importance score
  -> optional small VLM explanation
  -> text observation, full-vision fallback, or no-change route
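
The last stage of the pipeline above picks one of three routes. The sketch below shows one plausible shape for that decision. The route names come from this README, but the `FrameSignals` fields, the `chooseRoute` function, and the 0.8 threshold are assumptions made for illustration and do not describe how StateLens actually scores frames.

```typescript
type Route = 'skip_vision' | 'use_text_observation' | 'use_full_vision';

// Hypothetical per-frame signals produced by the earlier pipeline stages.
interface FrameSignals {
  pixelIdentical: boolean;       // visual gate: frame equals the previous one
  importance: number;            // assumed score in [0, 1] from the diff stages
  observationReliable: boolean;  // assumed flag: text extraction looked trustworthy
}

function chooseRoute(signals: FrameSignals, fullVisionThreshold = 0.8): Route {
  // Unchanged frames never reach a model.
  if (signals.pixelIdentical) return 'skip_vision';
  // When in doubt, or when the change is large, fall back to the full image.
  if (!signals.observationReliable || signals.importance >= fullVisionThreshold) {
    return 'use_full_vision';
  }
  return 'use_text_observation';
}

console.log(chooseRoute({ pixelIdentical: false, importance: 0.3, observationReliable: true }));
```

Ordering the checks this way keeps the cheapest exit first and makes full vision the fallback whenever the text observation cannot be trusted.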

Results

All numbers are measured with real Anthropic API token counts (no estimates). The harness includes internal Haiku usage, so savings are not hidden by shifting work to a cheaper model.

Pipeline measurements (in-process eval, baseline = prev+curr screenshots to Sonnet)

| Scenario | Frames | Token reduction | Cost reduction | Accuracy (lenient) |
| --- | --- | --- | --- | --- |
| Login flow | 12 | 81.9% | 90.1% | 100.0% |
| Checkout flow | 10 | 69.9% | 81.2% | 77.8% |

End-to-end proxy A/B (login flow, 12 frames, real HTTP round-trip through `statelens proxy`)

Same code on both sides — the only difference between Run A and Run B is the baseURL of the Anthropic client.

| Agent pattern | Token reduction | Cost reduction | Accuracy (strict) | Accuracy (lenient) |
| --- | --- | --- | --- | --- |
| Single-image-per-turn (Claude Code / Cursor / computer-use style) | 46.7% | 59.4% | n/a | n/a |
| Prev+curr per turn (change-detection agents) | 24.2% | 31.3% | 75.0% | 100.0% (zero misses) |

The proxy form preserves the same observation quality as the in-process pipeline (visual-gate filters count as match-by-construction, same as the in-process eval). The single-image-per-turn pattern produces larger savings because there's no prior image dragging tokens along — that's the realistic pattern for most agent loops.

Why is the proxy lower than the in-process pipeline? The in-process adapter (81.9% / 90.1% on the same flow) can skip the model call entirely on no-change turns — the agent's own loop handles the skip. The proxy can't safely do that: it doesn't know whether the agent expects text, a tool_use block, or a structured JSON action, and getting the synthesized response wrong would break computer-use and most production agent loops. Use the proxy when you can't modify agent code; use the in-process adapter when you can.

A future --synthesize-on-skip proxy flag, scoped to agents with known output shapes (e.g., text-output change-detection prompts), could close more of the gap. It's intentionally not shipped in v0.1 — the proxy's current contract is "transparent rewrite, never synthesize," and that's the safer default.

For context: an MCP-based dogfood on a short Claude Code session was +27% more expensive than baseline because MCP tool definitions, tool-call args, and JSON cache churn dominated a 5-frame session. The proxy form has zero per-turn tax. Full investigation in RESULTS.md.

See RESULTS.md for the full methodology, evolution, and per-frame verdicts; docs/DEMO_AND_EVAL.md for demo and reproduction commands.

Docs

RESULTS.md — full methodology, measurement evolution, proxy A/B validation, MCP overhead investigation
docs/PROXY_IMPLEMENTATION.md — proxy implementation details
DESIGN.md — architecture and product direction
docs/DEMO_AND_EVAL.md — demo, eval, and measurement commands

License

MIT
