Which browser-automation MCP server should your AI agent use?
Vendors all claim "real Chrome." Cold-start latencies vary 51× across the field. One leader ships as a closed-source binary that touches your cookies on launch. Real tradeoffs are hidden behind marketing copy.
So we ran 7 of them on identical, frozen fixtures, scored every dimension that matters, and published the evidence. Pick a row.
⚠ Scope of v1.0: Job-application fixtures (Greenhouse server-rendered + Ashby React SPA). It's a useful proxy for "real, modern web pages" — but it is one specific use case. v1.1 will expand the fixture set to general-purpose web tasks (search, e-commerce, content extraction, multi-page navigation). The harness and rubric are designed to absorb the new fixtures without re-baselining.
| MCP | Composite | Cold-start (median) | Tier | Use it for | Skip it for |
|---|---|---|---|---|---|
| playwright | 7.93 | 197 ms | 🟢 PRIMARY | Interactive forms, batch fills, the safe default | Nothing — it's the baseline |
| lightpanda | 6.31 | 13 ms ⚡ | 🟢 PRIMARY | Read-only SSR extraction; pair it with Playwright as the cold-start specialist (13 ms is about 15× faster than Playwright's 197 ms and 51× faster than the slowest candidate) | JS-heavy SPAs (it's a Zig engine with no JS runtime, so it is React-blind by design) |
| browser-use (direct) | 5.87 | 668 ms | 🟡 SECONDARY | Direct tool mode without an LLM key for S1+S2+S3+S8 | Interactive form fill (S4–S7) — use Playwright |
| chrome-devtools-mcp | 5.60 | 358 ms | 🟡 SECONDARY | DevTools-specific debugging (network, perf, console) | First-pass scraping — use Playwright |
| firecrawl | 4.23 | 171 ms | 🟡 SECONDARY | Cloud SSR scraping at scale (9× byte-count lift on Greenhouse) | Local/loopback fixtures (cloud rejects 127.0.0.1) and React SPAs (203 bytes on Ashby) |
| cloakbrowser | 8.33 | 235 ms | 🔴 SANDBOX-ONLY | Public-fixture stealth research only | Authenticated sessions — closed-source binary touches cookies on launch |
| obscura | 3.27¹ | 158 ms | ⚫ SKIP | (nothing on macOS today — see v1.0.2 footnote on Linux re-test) | macOS production use — Sec-CH-UA-Platform-* leaks the real OS |
| browser-use (agent mode) | 0.00 | — | ⚫ TOOL-BUG | (nothing — it doesn't start) | Everything in v0.12.7's MCP path. 30s BrowserStartEvent timeout when any LLM key is in env; reproducible without the harness (stdio_probe_evidence.log). Direct mode in the same binary scores 5.87. |
¹ v1.0.2 Obscura cross-platform finding. Running on Linux x86_64 with `--stealth` (which is suppressed on macOS per SAFETY-03) does not yield a comparable score: the `obscura serve` engine has a phantom-listener bug on Linux (it logs a port 9222 binding but never actually opens the port). The macOS 3.27 composite stands; Linux is N/A pending a vendor fix. Full bisection: `results/2026-05-29-linux/obscura/DEEP_ANALYSIS.md`.
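A phantom listener of the kind described above can be told apart from a real one by attempting an actual TCP connection rather than trusting the log line. The following is a minimal sketch, not the harness's own probe; `port_is_open` is a hypothetical helper, and the loopback host is an assumption (only port 9222 comes from the finding).

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True only if a TCP connection to host:port actually succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


# 9222 is the port named in the finding; 127.0.0.1 is an assumed host.
print(port_is_open("127.0.0.1", 9222))
```

A server that logs a binding but fails this check is reporting a port it never opened.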
🐞 v1.0.1 finding: the browser-use 0.12.7 MCP-agent path is broken. Setting `OPENAI_API_KEY` (or `ANTHROPIC_API_KEY`) in env makes `browser_navigate` hang for 30s and then return a misleading success string; every CDP-dependent tool then errors with `Root CDP client not initialized`. The same binary with env-LLM-keys unset (direct mode) scores 5.87. Reproducer + bisection in `results/2026-05-26/browser-use-agent/DEEP_ANALYSIS.md`.
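Since the failure is triggered by LLM keys being present in the environment, one way to stay on the direct-mode path is to launch the server with those variables removed. This is a sketch under that assumption; `env_without_llm_keys` is a hypothetical helper, and the launch command is left out because the source does not give it.

```python
import os
from typing import Mapping, Optional

# Variable names taken from the finding above.
LLM_KEY_VARS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def env_without_llm_keys(base: Optional[Mapping[str, str]] = None) -> dict:
    """Copy an environment mapping, dropping the LLM API-key variables."""
    source = os.environ if base is None else base
    return {k: v for k, v in source.items() if k not in LLM_KEY_VARS}


# Illustrative input only; the key value is a placeholder.
clean_env = env_without_llm_keys({"PATH": "/usr/bin", "OPENAI_API_KEY": "placeholder"})
print(clean_env)  # {'PATH': '/usr/bin'}
```

The resulting mapping can be passed as `env=` to `subprocess.Popen` when spawning the server, which leaves the parent shell's keys untouched.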
🔬 v1.0.2 BrowserMCP exploratory probe: an out-of-scope 8th MCP not in v1.0's 7-candidate framing. Drives the operator's real Chrome via extension-attach (port 9009 WebSocket). Works end-to-end on real Chrome (S1+S2 navigate + accessibility-tree snapshot succeeded), captures the operator's real Chrome TLS fingerprint (`ja4_hash: 3fc5444b6956`, the production baseline for #11/G-739), and surfaces one new bug (recursive `server.close` stack overflow on shutdown). Not scored on the v1.0 rubric; full notes in `results/2026-05-29-browsermcp/EXPLORATORY.md`. Candidate decision for v1.1.
One-line takeaway: Pair Playwright (interactive) with Lightpanda (read-only). Reach for Firecrawl when you need cloud SSR at scale and your targets are publicly reachable. Treat Cloakbrowser as a research sandbox. Wait on Obscura until a Linux re-test lands.
Full per-MCP rationale and citations: results/recommendations.md. Full 8-dimension score breakdown + S1-S8 stage matrix: results/2026-05-27-mcp-comparison.md.
8 weighted dimensions (composite is a 0-10 blend, locked from the prior wave for direct comparability):
| Dimension | Weight | What it captures |
|---|---|---|
| Data Quality | 3× | Did the extracted JSON/structure match the source page faithfully? |
| Reliability | 3× | Did the same prompt produce the same answer across 3 retry passes? |
| Speed | 2× | Wall-clock per stage |
| Token Efficiency | 2× | Tokens consumed per task completed |
| Interaction Depth | 2× | How many of S4–S8 (form fill, upload, dropdown, screenshot) succeeded? |
| JS Rendering | 1× | Did the MCP see the client-rendered DOM, not just the SSR shell? |
| Setup Complexity | 1× | How painful was getting the MCP running at all? |
| Error Handling | 1× | When something broke, did we get a useful error or silence? |
N/A-aware composite. A read-only MCP (Lightpanda, Firecrawl) marked N/A for "fill the form" cells drops those cells from its weighted denominator rather than being penalized with a 0. That's why Lightpanda's 6.31 is honest, not inflated. The handler is scoring/score_with_na.py; the rubric is scoring/rubric.md (sacrosanct: byte-for-byte unchanged from the start of the wave).
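The denominator rule can be illustrated with a small weighted mean. This is a sketch of the idea, not the contents of scoring/score_with_na.py: the weights are copied from the table above, `na_aware_composite` is a hypothetical name, the sample scores are made up, and a whole dimension is treated as N/A for simplicity even though the real handler may work per cell.

```python
from typing import Dict, Optional

# Weights copied from the dimension table above.
WEIGHTS = {
    "Data Quality": 3,
    "Reliability": 3,
    "Speed": 2,
    "Token Efficiency": 2,
    "Interaction Depth": 2,
    "JS Rendering": 1,
    "Setup Complexity": 1,
    "Error Handling": 1,
}


def na_aware_composite(scores: Dict[str, Optional[float]]) -> float:
    """Weighted mean on a 0-10 scale; dimensions scored None (N/A) are excluded."""
    rated = {dim: s for dim, s in scores.items() if s is not None}
    if not rated:
        raise ValueError("every dimension is N/A")
    total_weight = sum(WEIGHTS[dim] for dim in rated)
    return sum(WEIGHTS[dim] * s for dim, s in rated.items()) / total_weight


# Illustrative scores only, not measured results.
read_only = {dim: 6.0 for dim in WEIGHTS}
read_only["Interaction Depth"] = None
print(round(na_aware_composite(read_only), 2))  # 6.0

# The same scores with the N/A cell forced to 0 instead.
zeroed = {**read_only, "Interaction Depth": 0.0}
print(round(na_aware_composite(zeroed), 2))  # 5.2
```

The gap between the two printed values is the penalty a read-only MCP would take for a capability it never claimed.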
Median of 3 passes. Every run is repeated 3× and the median is published. This surfaced agent-discovery variance (Chrome DevTools MCP ran 5.6 / 5.6 / 8.33 across passes — one pass alone found a server-side-rendering rescue trick that the other two missed). Single-shot scores would have lied.
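The effect of publishing the median can be checked with the three Chrome DevTools MCP pass scores quoted above. A minimal sketch using only the standard library:

```python
from statistics import median

# The three Chrome DevTools MCP pass scores quoted above.
passes = [5.6, 5.6, 8.33]

published = median(passes)
best_case = max(passes)
print(published)  # 5.6
print(best_case)  # 8.33
```

The median reproduces the 5.60 shown in the results table, while a single lucky pass would have reported 8.33.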
Frozen loopback fixtures. Every MCP hits a byte-for-byte snapshot of the same Greenhouse + Ashby pages served from 127.0.0.1. No live URLs, no network jitter, no "the site changed under us." Anyone with this repo can reproduce the scores: docs/REPRODUCIBILITY.md.
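Serving a frozen snapshot from loopback needs nothing beyond the standard library. The sketch below is an assumption about how such a server could look, not the repo's actual recipe (that lives in docs/REPRODUCIBILITY.md); `serve_snapshot` is a hypothetical helper.

```python
import functools
import http.server
import threading


def serve_snapshot(directory: str, port: int = 0) -> http.server.ThreadingHTTPServer:
    """Serve a snapshot directory on 127.0.0.1 only; port 0 picks a free port."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    # Binding to 127.0.0.1 keeps the fixtures unreachable from other hosts.
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `serve_snapshot("fixtures")` returns the running server; `server.server_address[1]` gives the chosen port, and `server.shutdown()` stops it after the run.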
Every MCP runs the same prompt (prompts/stage_walk.md). The first three stages are read-only; the last five are interactive:
| Stage | What the MCP must do | Page type |
|---|---|---|
| S1 | Extract structured job data | Greenhouse (server-rendered) |
| S2 | Extract from a React SPA | Ashby (client-rendered) |
| S3 | Detect the ATS platform | both |
| S4 | Navigate to the apply form | both |
| S5 | Fill the application form with mock data | both |
| S6 | Upload a mock resume PDF | both |
| S7 | Handle React-Select dropdowns | both |
| S8 | Screenshot the filled form | both |
S1–S3 measure read-only extraction quality; S4–S8 measure interactive depth. Read-only MCPs (Lightpanda, Firecrawl) are categorically N/A for S4–S8 — and that's the point: don't pick a read-only specialist for a form-fill workload, and don't penalize it for being honest about its surface.
- 51× cold-start spread. Lightpanda 13 ms vs Browser-Use direct 668 ms. If you're spawning MCPs per-request, this is your latency budget.
- Cloakbrowser leads the raw score (8.33) but is pre-tiered SANDBOX-ONLY by construction. Closed-source binary + cookie-touch trust model is the binding constraint, not the score. Composite alone cannot drive graduation tier.
- Firecrawl confirms the SSR lift, refutes the SPA claim. 9× byte-count lift on Greenhouse SSR (24 KB markdown vs Playwright's 2.6 KB structured YAML) — real. Same approach on Ashby React SPA: 203 bytes of footer chrome. Cloud LLM-extraction is the right tool for SSR-heavy targets, not a universal JS-SPA fallback.
- Browser-Use direct mode works without an LLM key for the deterministic S1+S2+S3+S8 subset; the `retry_with_browser_use_agent` LLM escape hatch was never invoked. Agent mode SKIPPED for `LLM_KEY_ABSENT` per the dual-row contract.
- Headless Chromium leaks three independently fingerprintable signals by default (`HeadlessChrome/` UA, SwiftShader WebGL, `navigator.plugins.length=0`). That's a fingerprint-resilience finding worth carrying into your stack; the deferred follow-up wave will quantify it against the live adversary set.
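The 51× cold-start spread in the first finding follows directly from the median latencies in the results table. A short sketch of the arithmetic, with the values copied from that table:

```python
# Median cold-start latencies in milliseconds, copied from the results table.
cold_start_ms = {
    "playwright": 197,
    "lightpanda": 13,
    "browser-use (direct)": 668,
    "chrome-devtools-mcp": 358,
    "firecrawl": 171,
    "cloakbrowser": 235,
    "obscura": 158,
}

fastest = min(cold_start_ms, key=cold_start_ms.get)
slowest = max(cold_start_ms, key=cold_start_ms.get)
spread = cold_start_ms[slowest] / cold_start_ms[fastest]
print(f"{slowest} / {fastest} = {spread:.1f}x")
# browser-use (direct) / lightpanda = 51.4x

# Slowdown of each server relative to the fastest one.
for name, ms in sorted(cold_start_ms.items(), key=lambda kv: kv[1]):
    print(f"{name:22s} {ms:4d} ms  {ms / cold_start_ms[fastest]:5.1f}x")
```

The per-server column makes the per-request cost of each choice explicit when MCPs are spawned on demand.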
git clone https://github.com/pleasedodisturb/web-agent-comparison
cd web-agent-comparison
# Inspect the rubric and rules
cat scoring/rubric.md
# Read the third-party reproducibility recipe
cat docs/REPRODUCIBILITY.md
# Reproduce the Playwright calibration row (~10 min)
make bench-playwright && make score

The recipe in docs/REPRODUCIBILITY.md walks you through prereq checking, snapshot serving, and per-MCP runs with the exact pinned binary SHA256s.
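Checking a binary against a pinned SHA256 can be done in a few lines. This is a sketch of the general technique, not the repo's own verification step; `sha256_of` and `matches_pin` are hypothetical helpers, and the pinned digests themselves are whatever docs/REPRODUCIBILITY.md lists.

```python
import hashlib
from pathlib import Path


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the lowercase hex digest."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        # Read in chunks so large binaries do not need to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pin(path: str, pinned_hex: str) -> bool:
    """Compare a file's digest against a pinned value, ignoring case and whitespace."""
    return sha256_of(path) == pinned_hex.strip().lower()
```

A mismatch means the binary under test is not the one the published scores were measured against, so the run should stop before benchmarking.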
v1.0 deliberately deferred four expensive measurement axes, plus the general-purpose fixture expansion, so the headline could ship. They live in the next wave:
- TLS fingerprint capture per MCP (JA3/JA4): do any of these actually pass 2025-26 bot-detection, or is "real Chrome" marketing copy?
- Bot-detection adversary set — Cloudflare, DataDome, Akamai Bot Manager, reCAPTCHA v2/v3
- Cross-machine reproducibility — MacBook parity vs the Mac Mini that ran this wave
- Obscura Linux A/B: re-test `--stealth` from a Linux host where `Sec-CH-UA-Platform-*` is honest, so the tier can actually move off SKIP
- General-purpose fixture expansion: beyond the job-application use case, into search, e-commerce, content extraction, multi-page navigation
Follow-up wave anchor: G-710. v1.0 umbrella: G-703.
.mcp.json Project-scoped MCP server registry (the 7-candidate roster)
bench/ Harness + scoring + report builders (Python 3.12, uv-locked)
docs/ Reproducibility recipe + run-environment docs
fixtures/ Mock data, resume PDF, loopback snapshot fixtures
prompts/ Locked S1–S8 stage-walk prompt
scoring/ Locked 8-dimension rubric + N/A-aware scoring engine
scripts/ Test orchestration + harness CLI
results/ Per-wave dated subdirectories with scored evidence
.mcp.json is project-scoped (not user-scoped) so the 7 MCPs only spawn when Claude opens this repo. That isolation matters — the same trick keeps your other Claude sessions free of rocket-icon dock pollution.
Stage 1 of a 3-stage pipeline:
- This repo (public): score candidate browser MCPs on standardized fixtures
- `terminal-craft` (private): package the winners as a production toolkit
- Kestrel + Eyas (private): wire the toolkit into job-hunting agents
The repo's job is to produce defensible, third-party-verifiable graduation tiers so Stage 2 isn't picking tooling based on vendor marketing. The graduation gate is results/recommendations.md. v1.0 shipped it.
Historical wave (preserved for traceability): the 2026-03-31 app-level comparison of 5 application-layer agents (Playwright MCP 9.07, WebFetch 7.87, Agent Browser 7.60, Lightpanda 5.87, BrowserMCP 5.53) lives at results/2026-03-31_run.md. Same rubric, different candidate set (apps, not MCP-layer servers).