# crawler.sh

> A local-first Markdown extractor for AI training, RAG, and agent context. Crawl any website on your own machine, render JavaScript with a custom engine, respect robots.txt by default, and export clean Markdown ready for retrieval-augmented generation or fine-tuning. No headless Chrome, no per-page cloud fees, no API quota. Also doubles as an SEO auditor with automated checks across every page.

For full documentation including complete CLI usage, desktop app usage, all flags, data models, and examples, see [llms-full.txt](https://crawler.sh/llms-full.txt).

## Project details

- **Name**: crawler.sh
- **URL**: https://crawler.sh
- **Category**: Developer Tools / AI Ingestion / SEO
- **License**: Proprietary (freemium)
- **Platforms**: macOS (Universal - Apple Silicon and Intel), Linux (x64 and ARM64, .deb). Windows coming soon.
- **Install**: `curl -fsSL https://install.crawler.sh | sh`
- **Pricing**: Free up to 50 pages per crawl when not signed in, 400 signed in. Pro $99/year unlocks 10,000 pages per crawl and Markdown archive export.
- **Author**: Mehmet Kose
- **Contact**: https://crawler.sh/contact/

## What crawler.sh does

crawler.sh crawls websites locally and turns them into clean Markdown for AI training, RAG pipelines, fine-tuning corpora, and agent context. It runs entirely on your machine: nothing routes through us, no API key required, no per-page cloud bill.

JavaScript-heavy pages are rendered by a custom in-process JavaScript engine, not a headless Chrome instance. The outbound TLS handshake matches Chrome 131, so sites that block headless Chrome on TLS fingerprint let crawler.sh through. A single cookie jar is shared between the static fetch and JavaScript-driven fetch calls during rendering, so session-walled and login-walled pages render with the right state.

The crawler is polite by default: robots.txt Disallow, Allow, and Crawl-delay are honored. Per-host adaptive backoff doubles the request delay on 429 or 403 responses and halves it after a streak of successes. Pass `--ignore-robots` only when you have authorization.

After crawling, an SEO analysis engine runs automated checks across every page (missing or duplicate titles and descriptions, H1 issues, thin content, broken links, content freshness signals, and more). Results export to NDJSON, JSON, CSV, TXT, and W3C-compliant Sitemap XML. Pro tier exports the entire crawl as a Markdown archive with YAML frontmatter (url, title, captured_at, word_count, language).

## Interfaces

- **CLI** - Subcommands: `crawl` (site crawling), `seo` (SEO analysis with CSV/TXT export), `info` (crawl file stats), `export` (format conversion including Markdown archive), `update` (self-update), `auth` (login). Install with `curl -fsSL https://install.crawler.sh | sh`.
- **Desktop app** - macOS and Linux app with a live crawl feed, SEO issue dashboard, page-status charts, and a content viewer.
- **MCP server** - `crawler-mcp` exposes crawler.sh to agents over the Model Context Protocol (Claude Code, Cursor, Cline, and other MCP clients). Tools: `crawl_site`, `fetch_page`, `discover_links`.

## Pricing

- Free, not signed in: up to 50 pages per crawl.
- Free, signed in: up to 400 pages per crawl.
- Pro ($99/year): up to 10,000 pages per crawl, Markdown archive export.

All tiers include JavaScript rendering, robots.txt support, SEO analysis, and export to NDJSON / JSON / Sitemap XML / CSV / TXT.

## Platform availability

macOS (Universal - Apple Silicon and Intel), Linux (x64 and ARM64, .deb). Windows coming soon.

## Why pick crawler.sh

- **Local-first.** Pages are fetched directly by your machine. Nothing routes through any cloud, including ours.
- **No headless Chrome.** Custom in-process JavaScript engine uses around 10 to 20 MB per render worker instead of around 200 MB for a headless Chrome instance.
- **Polite by default.** robots.txt support and adaptive per-host backoff are on out of the box.
- **One identity across the crawl.** Chrome 131 TLS fingerprint and a shared cookie jar between the render path and the main crawler mean session-walled pages render correctly.
- **Flat pricing.** Free for the long tail, $99 a year for large crawls. No per-page or per-token bill.

## Comparisons

- [crawler.sh vs Firecrawl](https://crawler.sh/compare/crawler-sh-vs-firecrawl/): local-first desktop and CLI vs cloud scraping API.
- [crawler.sh vs Jina Reader](https://crawler.sh/compare/crawler-sh-vs-jina-reader/): full-site crawl and Markdown archive vs single-URL cloud reader.
- [crawler.sh vs Crawl4AI](https://crawler.sh/compare/crawler-sh-vs-crawl4ai/): single binary with custom JS engine vs Python + headless Chrome.

## Documentation

- [Getting Started](https://crawler.sh/docs/docs.md): Overview and quick start guide
- [Installation](https://crawler.sh/docs/installation.md): How to install and uninstall the CLI and desktop app
- [CLI Overview](https://crawler.sh/docs/cli.md): CLI architecture and usage
- [CLI Features](https://crawler.sh/docs/cli/features.md): Crawl engine, subcommands, SEO checks, content extraction, .crawl format
- [CLI Reference](https://crawler.sh/docs/cli/reference.md): All flags, page data model, crawl events, output formats
- [Desktop Overview](https://crawler.sh/docs/desktop.md): Desktop app architecture and usage
- [Desktop Features](https://crawler.sh/docs/desktop/features.md): Dashboard cards, SEO analysis, content viewer
- [Desktop Reference](https://crawler.sh/docs/desktop/reference.md): Settings, data model, export formats
- [MCP Server](https://crawler.sh/docs/mcp.md): Run crawler.sh as a Model Context Protocol server for Claude Code, Cursor, Cline, and other agents

## Product pages

- [Homepage](https://crawler.sh/): Main landing page
- [Markdown for AI](https://crawler.sh/markdown-for-ai/): Local Markdown extractor for AI training, RAG, and agent context
- [Product](https://crawler.sh/product.md): Feature overview for CLI and desktop app
- [Technical SEO Audit](https://crawler.sh/technical-seo-audit.md): SEO audit feature page with all checks
- [Download](https://crawler.sh/download.md): Desktop app download and CLI install command
- [Roadmap](https://crawler.sh/roadmap.md): Planned features and development timeline
- [FAQ](https://crawler.sh/faq.md): Common questions about installation, configuration, and usage

## Blog

- [How crawler.sh renders JavaScript without headless Chrome](https://crawler.sh/blog/how-crawler-sh-renders-javascript-without-headless-chrome/): Technical deep dive on the custom render engine, fingerprint hygiene, and politeness defaults
- [Best Web Crawler for MLOps](https://crawler.sh/blog/best-web-crawler-for-mlops/): How crawler.sh fits ML training data pipelines
- [The Complete Guide to Technical SEO Audits](https://crawler.sh/blog/technical-seo-audits/): Crawlability, indexation, site speed, structured data
- [Answer Engine Optimization](https://crawler.sh/blog/answer-engine-optimization/): Optimizing pages for AI answer engines
- [Challenges Collecting Preference Data for RLHF](https://crawler.sh/blog/challenges-collecting-preference-data-rlhf/): Data quality issues in RLHF pipelines

## Changelog

- [v0.8.0: Polite Crawling and Fingerprint Hygiene](https://crawler.sh/changelog/v080-polite-crawling-and-fingerprint-hygiene/): Adaptive crawl posture, robots.txt support, per-host backoff, quieter JS rendering, shared cookie jar with Chrome 131 TLS
- [Changelog](https://crawler.sh/changelog/): Full release notes

## Company

- [About](https://crawler.sh/about.md): Mission, team, and background
- [Contact](https://crawler.sh/contact.md): Bug reports, feature requests, and questions
- [Privacy Policy](https://crawler.sh/privacy-policy/)
- [Terms of Service](https://crawler.sh/terms-of-service/)