Self-hosted Firecrawl alternative with semantic search, grounded Q&A, site adapters, and an autonomous research agent. MIT licensed. One docker compose up and you're running.
GroktoCrawl implements the Firecrawl v2 API surface — scrape, search, map, crawl, extract, browser sessions, and monitors — plus capabilities Firecrawl doesn't offer: a persistent semantic search engine with Qdrant vector index, a grounded Q&A endpoint with citations, a web portal for human users, site adapters for GitHub/Substack/Reddit/YouTube/Bluesky, an intelligent scrape cache with ETag/Last-Modified revalidation, and a full observability stack with health probes and Prometheus metrics. Runs entirely in Docker on your own hardware. Bring your own LLM or use the built-in fixtures.
cp .env.sample .env
docker compose up --build -d
Eight containers start. The stack includes SearXNG for real web search, a smart scraper, and an Ofelia-scheduled monitor system.
# CLI
./groktocrawl scrape https://example.com
./groktocrawl search "raspberry pi 5" --limit 3
./groktocrawl agent "What were the key Google I/O 2025 announcements?"
# Or raw curl
curl http://localhost:8080/health
curl -X POST http://localhost:8080/v2/scrape -H "Content-Type: application/json" \
-d '{"url": "https://example.com"}'
Edit .env to point at a real LLM:
# DeepSeek
LLM_API_KEY=sk-...
LLM_BASE_URL=https://api.deepseek.com/v1
LLM_MODEL=deepseek-v4-flash
# OpenAI
LLM_API_KEY=sk-...
LLM_BASE_URL=https://api.openai.com/v1
LLM_MODEL=gpt-4o-mini
# Ollama (local)
LLM_BASE_URL=http://host.docker.internal:11434/v1
LLM_MODEL=llama3.2
flowchart TD
subgraph compose["docker-compose.yml"]
valkey[("valkey<br/>(queue + storage)")]
searxng["searxng<br/>(web search)"]
scraper("scraper-svc<br/>(smart fetch)")
browser["browser-svc<br/>(Playwright sessions)"]
agent("agent-svc<br/>(FastAPI + workers)")
ofelia["ofelia<br/>(cron scheduler)"]
valkey --- agent
searxng --- agent
scraper --- agent
browser --- agent
ofelia -.->|docker exec| agent
end
llm_provider("LLM Provider<br/>(DeepSeek / OpenAI / Ollama)")
llm_provider -.->|LLM_BASE_URL| agent
style valkey fill:#ffe0b0
style searxng fill:#b0d4ff
style scraper fill:#b0ffb0
style browser fill:#d4b0ff
style agent fill:#ffb0b0
style ofelia fill:#b0b0b0
The scraper uses a three-tier strategy: check /llms.txt first, try Accept: text/markdown second, render with Playwright third.
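The three tiers can be pictured as an ordered plan that stops at the first attempt returning content. The sketch below is illustrative, not the scraper's actual code: `plan_fetch_tiers`, `first_success`, and the `attempt` callable are hypothetical names, and it assumes `/llms.txt` is looked up at the site root.

```python
from urllib.parse import urlsplit


def plan_fetch_tiers(url: str) -> list[dict]:
    """Return the ordered fetch attempts for one URL (illustrative only)."""
    parts = urlsplit(url)
    origin = f"{parts.scheme}://{parts.netloc}"
    return [
        # Tier 1: look for a site-level llms.txt before touching the page.
        {"tier": 1, "method": "http", "url": f"{origin}/llms.txt", "headers": {}},
        # Tier 2: ask the origin for markdown via content negotiation.
        {"tier": 2, "method": "http", "url": url,
         "headers": {"Accept": "text/markdown"}},
        # Tier 3: fall back to a full browser render.
        {"tier": 3, "method": "playwright", "url": url, "headers": {}},
    ]


def first_success(url, attempt):
    """Run tiers in order; `attempt` is a hypothetical callable returning text or None."""
    for step in plan_fetch_tiers(url):
        content = attempt(step)
        if content:
            return step["tier"], content
    return None, None
```

Keeping the plan as data makes it easy to log which tier produced a given result.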
Every scrape response includes a quality field with post-extraction content quality assessment (boilerplate detection, completeness checks, block page detection). See docs/adr/0016-extraction-quality-gates.md for details.
groktocrawl is a CLI tool in the repo root. It requires the Python requests package.
If you want to avoid installing dependencies into your global Python, use a repo-local uv environment:
uv venv
uv pip install requests
uv run ./groktocrawl scrape https://example.com
To expose a global groktocrawl command while keeping dependencies isolated, create a small wrapper somewhere on your PATH:
cat > ~/.local/bin/groktocrawl <<'EOF'
#!/bin/sh
cd "$HOME/groktocrawl" || exit 1
exec uv run ./groktocrawl "$@"
EOF
chmod +x ~/.local/bin/groktocrawl
Or install requests into the Python that runs the script:
python3 -m pip install requests
./groktocrawl scrape <url> # Scrape a page to markdown
./groktocrawl search <query> --limit 5 # Search the web (default: general)
./groktocrawl search <query> --sources news # Search news sources only
./groktocrawl search <query> --categories research # Search with content category (mapped to SearXNG)
./groktocrawl search <query> --sources news --categories research # Combined filter
./groktocrawl map <url> --limit 100 # Discover URLs on a site
./groktocrawl crawl <url> --max-depth 2 # Crawl a website
./groktocrawl agent "<prompt>" # Autonomous research agent
./groktocrawl --json --server <url> <cmd> # JSON output, custom server
| Method | Endpoint | Description |
|---|---|---|
| POST | `/v2/scrape` | Scrape a single URL to clean markdown |
| POST | `/v2/agent` | Start an autonomous research agent |
| GET | `/v2/agent/:jobId` | Get agent job status and results |
| DELETE | `/v2/agent/:jobId` | Cancel an agent job |
| POST | `/v2/answer` | Grounded Q&A: search, synthesize, and cite in one round-trip |
| POST | `/v2/extract` | Extract structured data from URLs (with schema) |
| GET | `/v2/extract/:jobId` | Get extract status and results |
| POST | `/v2/crawl` | Crawl a website |
| GET | `/v2/crawl/:jobId` | Get crawl status |
| DELETE | `/v2/crawl/:jobId` | Cancel a crawl |
| POST | `/v2/batch/scrape` | Scrape multiple URLs |
| POST | `/v2/search` | Search the web with content |
| POST | `/v2/map` | Discover URLs on a site |
| POST | `/v2/parse` | Upload a file (PDF, DOCX, PPTX, XLSX) and get markdown back |
| POST | `/v2/browser` | Create a headless browser session |
| GET | `/v2/browser` | List active browser sessions |
| POST | `/v2/browser/:id/execute` | Execute action (navigate, click, screenshot, etc.) |
| DELETE | `/v2/browser/:id` | Destroy a browser session |
| POST | `/v2/monitor` | Create a scheduled change monitor |
| GET | `/v2/monitor` | List all monitors |
| GET | `/v2/monitor/:id` | Get monitor status and history |
| PATCH | `/v2/monitor/:id` | Update monitor config |
| DELETE | `/v2/monitor/:id` | Delete a monitor |
| POST | `/v2/generate-llmstxt` | Generate an llms.txt file for a website |
| GET | `/v2/generate-llmstxt/:jobId` | Get generation status and result |
All endpoints are compatible with the Firecrawl v2 API in request and response shape.
POST /v2/search accepts Firecrawl v2's two-dimensional search model:
| Parameter | Type | Description |
|---|---|---|
| `query` | `string` | Required. Search query |
| `limit` | `int` | Max results (default: 5) |
| `sources` | `string[]` | Source type filter: web, news, images, video, social |
| `categories` | `string[]` | Content category: research, github, pdf, news, science, it, general |
Both sources and categories are translated to SearXNG-native categories and can be combined:
| Firecrawl value | Mapped to SearXNG |
|---|---|
| `sources=news` | `categories=news` |
| `sources=images` | `categories=images` |
| `sources=web` | `categories=general` |
| `categories=research` | `categories=science` |
| `categories=github` | `categories=it` |
| `categories=pdf` | `categories=general` |
Unknown values pass through to SearXNG as-is for forward compatibility. When neither
sources nor categories is specified, defaults to general.
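The translation rules above fit in a short function. This is a minimal sketch rather than the server's implementation: `to_searxng_categories` is a hypothetical name, and collapsing duplicates (for example when `sources=web` and `categories=pdf` both map to `general`) is an assumption.

```python
SOURCE_MAP = {"news": "news", "images": "images", "web": "general"}
CATEGORY_MAP = {"research": "science", "github": "it", "pdf": "general"}


def to_searxng_categories(sources=None, categories=None):
    """Translate Firecrawl-style filters into a list of SearXNG categories."""
    mapped = []
    for value in sources or []:
        # Unknown values pass through unchanged for forward compatibility.
        mapped.append(SOURCE_MAP.get(value, value))
    for value in categories or []:
        mapped.append(CATEGORY_MAP.get(value, value))
    if not mapped:
        # Neither filter given: default to the general category.
        return ["general"]
    # Assumption: duplicates are collapsed, keeping first-seen order.
    return list(dict.fromkeys(mapped))
```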
Results are grouped by source type in the response:
{"data": {"web": [...], "images": [], "news": []}}
The POST /v2/agent endpoint accepts an optional model field to override the environment-configured LLM on a per-request basis:
{
  "prompt": "Research the latest AI safety papers",
  "model": "gpt-4o"
}
When model is omitted or set to "default", the LLM_MODEL from .env is used. This is useful for routing simple lookups to a cheaper model and complex research to a more capable one.
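The override rule is simple enough to state as code. The sketch below mirrors the described behaviour from the client's point of view; `resolve_model` and `build_agent_payload` are hypothetical helpers, not part of the GroktoCrawl codebase.

```python
def resolve_model(requested, env_default):
    """Pick the model for one agent request.

    A missing value or the literal string "default" falls back to LLM_MODEL.
    """
    if requested is None or requested == "default":
        return env_default
    return requested


def build_agent_payload(prompt, model=None):
    """Build the JSON body for POST /v2/agent, omitting model when unset."""
    payload = {"prompt": prompt}
    if model is not None:
        payload["model"] = model
    return payload
```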
The agent is driven by a research prompt that directs it to evaluate source quality, synthesize across multiple pages, detect contradictions, and cite sources by URL. It is instructed not to fabricate information: if the available sources don't answer the question, it says so and suggests what would be needed.
Interactive API documentation is available when the stack is running:
- Swagger UI: http://localhost:8080/docs
- Raw OpenAPI spec: http://localhost:8080/openapi.json
The spec is auto-generated by FastAPI from the route handlers and Pydantic models — always up to date with the running code. All 17+ endpoints with request/response schemas are documented.
| Feature | Firecrawl Cloud | Firecrawl Self-Hosted | GroktoCrawl |
|---|---|---|---|
| Scrape / Crawl / Map / Search | ✅ | ✅ | ✅ |
| Agent endpoint | ✅ | ❌ | ✅ |
| Extract (schema-based) | ✅ | ❌ | ✅ |
| Browser sessions | ✅ | ❌ | ✅ |
| Scheduled monitors | ✅ | ❌ | ✅ |
| Parse (PDF, DOCX) | ✅ | ✅ | ✅ |
| Generate llms.txt | ❌ | ❌ | ✅ |
| Webhook delivery | ✅ | ❌ | ✅ |
| License | Proprietary | AGPL-3.0 | MIT |
| Self-contained Docker | ❌ | ✅ | ✅ |
| LLM integration | Built-in | Requires API key | BYO or fixture |
| Beyond Firecrawl | | | |
| Semantic search / vector index | ❌ | ❌ | ✅ |
| Grounded Q&A (/v2/answer) | ❌ | ❌ | ✅ |
| Web portal for human users | ❌ | ❌ | ✅ |
| Site adapters (GitHub, Substack, Reddit, YouTube, Bluesky) | ❌ | ❌ | ✅ |
| Intelligent scrape cache (ETag/Last-Modified) | ❌ | ❌ | ✅ |
| Politeness protocol (robots.txt, rate limiting) | ❌ | ❌ | ✅ |
| Proxy support | ❌ | ❌ | ✅ |
| Agent SSE streaming | ❌ | ❌ | ✅ |
| Search type spectrum (fast / rich / structured) | ❌ | ❌ | ✅ |
| Artifact-pyramid CLI output | ❌ | ❌ | ✅ |
GroktoCrawl ships as an AgentSkills-compatible skill at skills/groktocrawl/. Any agent that supports the AgentSkills format (Claude Code, Cursor, etc.) can load it:
skills/groktocrawl/
├── SKILL.md # Metadata + instructions
├── scripts/groktocrawl # CLI — all endpoints
├── references/triggers.md # When to use which command
└── assets/examples.md # Usage examples
The skill bundles the CLI directly — no additional setup required beyond having the repo on disk.
If you use Hermes Agent, GroktoCrawl replaces the built-in web_search and web_extract tools with more capable alternatives. To avoid competition between tools:
Remove web from default_toolsets and platform_toolsets.cli in ~/.hermes/config.yaml:
# Before
default_toolsets:
- terminal
- file
- web # ← remove
# After
default_toolsets:
- terminal
- file
This removes web_search and web_extract from your agent's toolset. All web tasks will route through groktocrawl instead.
The CLI is at groktocrawl in the repo root. Copy it to your PATH:
cp groktocrawl ~/.local/bin/
The bundled skill at skills/groktocrawl/ follows the AgentSkills spec. Symlink it into your Hermes skills directory:
ln -sf "$PWD/skills/groktocrawl" ~/.hermes/skills/
Then load it in-session with /skill groktocrawl or preload it via hermes -s groktocrawl.
The CLI discovers the server in this order:
- `--server <url>` flag
- `GROKTOCRAWL_API_URL` env var
- `FIRECRAWL_API_URL` env var (backward compat)
- `~/.hermes/.env` file
- Default: `http://localhost:8080`
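The lookup order can be read as a chain of fallbacks. The sketch below is one way to express it, not the CLI's actual code: `resolve_server_url` and `parse_env_file` are hypothetical names, and it assumes the `~/.hermes/.env` file is read for the same `GROKTOCRAWL_API_URL` key.

```python
DEFAULT_URL = "http://localhost:8080"


def parse_env_file(text):
    """Parse KEY=VALUE lines, skipping blanks and comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_server_url(flag=None, environ=None, hermes_env_text=""):
    """Return the server URL using the documented precedence."""
    environ = environ or {}
    if flag:  # 1. explicit --server flag wins
        return flag
    # 2 and 3. environment variables, newest name first
    for name in ("GROKTOCRAWL_API_URL", "FIRECRAWL_API_URL"):
        if environ.get(name):
            return environ[name]
    # 4. Assumption: the Hermes env file uses the GROKTOCRAWL_API_URL key.
    file_values = parse_env_file(hermes_env_text)
    if file_values.get("GROKTOCRAWL_API_URL"):
        return file_values["GROKTOCRAWL_API_URL"]
    return DEFAULT_URL  # 5. built-in default
```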
Add to ~/.hermes/.env if your instance runs elsewhere:
GROKTOCRAWL_API_URL=http://localhost:8080
Set API_KEY in your .env file to enable bearer token authentication:
API_KEY=sk-your-secret-key-here
Once set, every API call must include an Authorization or X-API-Key header:
curl -X POST http://localhost:8080/v2/scrape \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-your-secret-key-here" \
-d '{"url": "https://example.com"}'
# Or via CLI:
groktocrawl --api-key sk-your-secret-key-here scrape https://example.com
When no API_KEY is configured, the API is fully open (backward compatible). Each response includes an X-Security-Warning header, and the /health endpoint adds a security field to warn callers.
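A client can build either header form and check whether the server it reached is running open. This is a hedged sketch: `auth_headers` and `server_is_open` are hypothetical helpers, and it assumes `X-API-Key` carries the raw key with no scheme prefix.

```python
def auth_headers(api_key=None, use_x_api_key=False):
    """Build request headers for an authenticated or open instance."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        if use_x_api_key:
            # Assumption: X-API-Key carries the raw key with no scheme prefix.
            headers["X-API-Key"] = api_key
        else:
            headers["Authorization"] = f"Bearer {api_key}"
    return headers


def server_is_open(response_headers):
    """True when the response carries the X-Security-Warning header."""
    # HTTP header names are case-insensitive, so compare in lowercase.
    lowered = {name.lower() for name in response_headers}
    return "x-security-warning" in lowered
```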
The built-in browser and scraper services block navigation to private IPs (RFC 1918), loopback addresses, cloud metadata endpoints, and the Docker host machine. This prevents SSRF-based pivoting through the headless browser. The blocklist applies to both direct URLs and resolved hostnames (DNS rebinding protection).
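The address classes named above map onto checks in Python's standard `ipaddress` module. The sketch below illustrates the idea only; it is not the project's blocklist, and `is_blocked_ip` and `is_blocked_target` are hypothetical names. It takes already-resolved addresses as input so that every address a hostname resolves to is checked.

```python
import ipaddress


def is_blocked_ip(address: str) -> bool:
    """Return True for addresses a scraper should refuse to visit."""
    ip = ipaddress.ip_address(address)
    return (
        ip.is_private        # includes the RFC 1918 ranges
        or ip.is_loopback    # 127.0.0.0/8 and ::1
        or ip.is_link_local  # 169.254.0.0/16, where cloud metadata endpoints live
        or ip.is_unspecified # 0.0.0.0 and ::
    )


def is_blocked_target(resolved_addresses) -> bool:
    """Block a hostname if any address it resolves to is internal."""
    return any(is_blocked_ip(addr) for addr in resolved_addresses)
```

Checking the resolved addresses rather than the hostname string is what lets a filter like this catch a public name that points at an internal IP.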
Only the agent API (port 8080) is exposed to the host. Internal
services (browser-svc, scraper-svc, parse-svc) are reachable only
via Docker internal DNS — they do not publish host ports. All requests
route through the agent API.
See SECURITY.md for our disclosure policy and how to privately report security issues.
GroktoCrawl supports outbound proxy routing via the SCRAPER_PROXY_URL environment variable. When set, all scrape requests route through the specified proxy before reaching their target.
SCRAPER_PROXY_URL=http://user:pass@residential-proxy:8080
Supported schemes: http://, https://, socks5://, socks5h://
Behavior:
- The proxy is applied at the transport layer across the full scrape pipeline — httpx clients (Tiers 1-2) and Playwright browser context (Tier 3)
- If the proxy is unreachable, GroktoCrawl fails open: it retries the request without a proxy and logs the fallback at WARN level
- Every proxied scrape records `proxy_host=<host:port>` in its structured log for operational debugging
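The fail-open behaviour amounts to one retry without the proxy. The sketch below shows the shape of that logic under stated assumptions: `fetch_with_fail_open` and the injected `fetch` callable are hypothetical, and an unreachable proxy is assumed to surface as `ConnectionError`.

```python
import logging

logger = logging.getLogger("scraper.proxy")


def fetch_with_fail_open(fetch, url, proxy_url=None):
    """Call fetch(url, proxy=...) and retry once without the proxy on failure."""
    if proxy_url is None:
        # Opt-in only: with no proxy configured, behaviour is unchanged.
        return fetch(url, proxy=None)
    try:
        return fetch(url, proxy=proxy_url)
    except ConnectionError:
        # Assumption: an unreachable proxy surfaces as ConnectionError.
        # The proxy URL is deliberately left out of the log message.
        logger.warning("proxy unreachable, retrying without proxy")
        return fetch(url, proxy=None)
```

Failing open favours availability; an operator who relies on the proxy for anonymity would want to watch for these WARN entries.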
Guardrails:
- Opt-in only: users who don't set this variable see zero behavioral change
- Single static proxy: one URL only. For proxy rotation or pool management, front GroktoCrawl with a rotating proxy orchestrator (HAProxy, Scrapoxy, etc.)
- Credentials never logged: only the host and port are recorded in scrape logs; the full URL (including auth) is redacted at the logging boundary
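Redacting at the logging boundary can be done with the standard URL parser, which separates credentials from the host. The following is a minimal sketch, not the project's logging code; `proxy_log_host` is a hypothetical name, and the scheme set is taken from the supported schemes listed above.

```python
from urllib.parse import urlsplit

SUPPORTED_SCHEMES = {"http", "https", "socks5", "socks5h"}


def proxy_log_host(proxy_url: str) -> str:
    """Return only host:port for logging, dropping any user:pass part."""
    parts = urlsplit(proxy_url)
    if parts.scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported proxy scheme: {parts.scheme!r}")
    if parts.port is None:
        return parts.hostname
    return f"{parts.hostname}:{parts.port}"
```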
Notes:
- Proxy credentials embedded in the URL use standard HTTP Basic Auth encoding. Avoid special characters (@, #, %) in passwords — they can conflict with URL parsing.
- The Playwright proxy uses context-level assignment (`browser.new_context(proxy=...)`) for per-job isolation, not launch-level args (`--proxy-server`).
See ADR-0020 for the full architecture decision.
GroktoCrawl supports site-specific content handlers that extract richer content from JavaScript-heavy sites. When scrape <url> is called, the adapter registry checks if a handler matches the URL before running the generic pipeline. If it matches, the adapter handles extraction with its own fallback chain. If it fails, the generic pipeline runs as normal.
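The match-then-fall-back flow described above looks roughly like this. It is a sketch under assumptions, not the registry's real interface: `scrape_with_adapters` is a hypothetical function, adapters are modelled as `(pattern, handler)` pairs, and a handler signals failure by raising or returning `None`.

```python
import re


def scrape_with_adapters(url, adapters, generic):
    """Try the first matching adapter, then fall back to the generic pipeline.

    adapters: list of (compiled_pattern, handler) pairs (hypothetical shape).
    generic:  callable implementing the generic scrape pipeline.
    """
    for pattern, handler in adapters:
        if pattern.search(url):
            try:
                result = handler(url)
            except Exception:
                result = None  # an adapter failure is not fatal
            if result is not None:
                return result
            break  # the matched adapter failed; use the generic pipeline
    return generic(url)
```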
scrape <youtube-url> returns a markdown document with:
- YAML frontmatter: video_id, title, channel, channel_url, thumbnail_url, source
- Markdown body: full video transcript
Fallback chain: youtube_transcript_api (free, no key) → browser render + DOM extraction
Configuration:
| Variable | Default | Description |
|---|---|---|
| `ADAPTER_YOUTUBE_API_KEY` | (none) | YouTube Data API v3 key (optional; transcript works without it) |
scrape <bsky.app-url> returns a markdown document with:
- YAML frontmatter: author, handle, did, post_id, timestamp, reply_count, like_count, repost_count
- Markdown body: post text + thread replies
Fallback chain: AT Protocol XRPC API (public, no auth) → browser render + DOM extraction
Configuration: None — the public API requires no authentication.
scrape <substack-url> returns a markdown document with:
- YAML frontmatter: title, author, publication, published_date, source
- Markdown body: full article text in clean markdown
Fallback chain: RSS feed (fast, structured, no auth) → readability-lxml page extraction → browser render
Configuration: None — Substack requires no API keys.
Vanity domain detection: The adapter automatically detects Substack-hosted publications behind custom domains (e.g. www.lennysnewsletter.com) by probing {domain}/feed for the Substack RSS generator tag. Results are cached per-domain for 1 hour.
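A probe with a one-hour cache could be sketched as follows. This is illustrative only: `is_substack_domain` and the injected `fetch_feed` callable are hypothetical, and the exact generator tag text (`<generator>Substack</generator>`) is an assumption about the feed format.

```python
import re
import time

CACHE_TTL_SECONDS = 3600  # results are cached per domain for 1 hour
# Assumption: Substack feeds identify themselves with this generator tag.
_GENERATOR = re.compile(r"<generator>\s*Substack\s*</generator>", re.IGNORECASE)
_cache = {}  # domain -> (expires_at, is_substack)


def is_substack_domain(domain, fetch_feed, now=time.monotonic):
    """Probe {domain}/feed once per TTL window and remember the answer.

    fetch_feed(url) is a hypothetical callable returning feed XML or None.
    """
    cached = _cache.get(domain)
    if cached and cached[0] > now():
        return cached[1]
    xml = fetch_feed(f"https://{domain}/feed") or ""
    result = bool(_GENERATOR.search(xml))
    _cache[domain] = (now() + CACHE_TTL_SECONDS, result)
    return result
```

Caching negative results as well as positive ones keeps non-Substack domains from being probed on every scrape.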
Two adapters handle different URL types on github.com, working together via priority dispatch:
| Priority | Adapter | Handles | Primary Strategy |
|---|---|---|---|
| 200 | GitHub File | raw files, blobs, READMEs, directory listings | raw.githubusercontent.com direct fetch |
| 190 | GitHub Social | issues, PRs, discussions, releases, commits | GraphQL API (v4) |
scrape <github-url> returns structured markdown with YAML frontmatter containing owner, repo, and type-specific metadata.
Resource coverage:
| URL Pattern | Handled By | Features |
|---|---|---|
| `raw.githubusercontent.com/*` | File adapter | Raw content, no rate limit |
| `github.com/*/blob/*` | File adapter | Rewrites to raw URL |
| `github.com/*` (repo root) | File adapter | README + stars/forks/language/topics |
| `github.com/*/tree/*` | File adapter | Directory listing, items sorted dirs-first |
| `github.com/*/issues/{n}` | Social adapter | Body, comments, labels, state, milestone |
| `github.com/*/pull/{n}` | Social adapter | Body, reviews, diff stats, changed files, merge status |
| `github.com/*/discussions/{n}` | Social adapter | Category, upvotes, answer, comments |
| `github.com/*/releases/tag/{v}` | Social adapter | Release notes, assets, download URLs |
| `github.com/*/releases` | Social adapter | Releases list with descriptions |
| `github.com/*/commit/{sha}` | Social adapter | Message, author, associated PRs |
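The blob-to-raw rewrite in the table is a small path transformation. The sketch below shows one way to do it; `blob_to_raw` is a hypothetical helper, and it assumes the ref is a single path segment (branch names containing slashes would need extra handling).

```python
from urllib.parse import urlsplit


def blob_to_raw(url: str) -> str:
    """Rewrite a github.com blob URL to its raw.githubusercontent.com form."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Expected shape: owner/repo/blob/<ref>/<path...>
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a GitHub blob URL: {url}")
    # Assumption: <ref> is one segment, so "feature/x" style refs are not handled.
    owner, repo, _, ref, *path = segments
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{'/'.join(path)}"
```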
Fallback chains:
- File adapter: raw.githubusercontent.com direct fetch → GitHub Contents API → generic tier
- Social adapter: GitHub GraphQL API (single query) → GitHub REST API → HTML page scrape (readability) → generic tier
Configuration:
The GITHUB_TOKEN environment variable enables authenticated access:
| Variable | Default | Effect |
|---|---|---|
| `GITHUB_TOKEN` | (none) | 5,000 API req/hr vs 60/hr unauth; enables GraphQL; always falls back to HTML scrape |
A token with public_repo scope is sufficient for public repositories. For private repos, use repo scope. Without a token, the file adapter works fully and the social adapter falls back to REST (60 req/hr) then HTML scrape — every URL type returns useful content.
- Create `scraper-svc/scraper/adapters/<site>.py`
- Subclass `SiteAdapter`, set `name`, `patterns`, `priority`, and implement `scrape()`
- Decorate with `@adapter` for auto-registration
- Add any new dependencies to `scraper-svc/pyproject.toml`
- Add `.env` variables to `.env.sample` and document them in this section
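The steps above might produce a module shaped like the sketch below. Everything here is a stand-in: the real `SiteAdapter` base class and `@adapter` decorator live in scraper-svc and their signatures are not shown in this README, so this block defines simplified versions purely to illustrate the pattern. `ExampleDocsAdapter`, `find_adapter`, and the `docs.example.com` pattern are hypothetical.

```python
import re

_REGISTRY = []  # stand-in for the project's adapter registry


class SiteAdapter:
    """Stand-in base class; the real one lives in scraper-svc."""
    name = ""
    patterns = ()
    priority = 0

    def matches(self, url):
        return any(re.search(p, url) for p in self.patterns)

    def scrape(self, url):
        raise NotImplementedError


def adapter(cls):
    """Stand-in decorator: register an instance, highest priority first."""
    _REGISTRY.append(cls())
    _REGISTRY.sort(key=lambda a: a.priority, reverse=True)
    return cls


@adapter
class ExampleDocsAdapter(SiteAdapter):
    name = "example-docs"
    patterns = (r"^https://docs\.example\.com/",)
    priority = 150  # hypothetical value

    def scrape(self, url):
        # A real adapter would fetch and convert content here.
        return f"---\nsource: {self.name}\n---\nContent for {url}"


def find_adapter(url):
    """Return the highest-priority adapter whose patterns match, or None."""
    return next((a for a in _REGISTRY if a.matches(url)), None)
```

Sorting by priority on registration is what would let two adapters share a domain, with the higher-priority one getting first refusal.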
See docs/adr/ for the architecture decision records behind the adapter pattern, and CONTRIBUTING.md for the ADR convention.
External anti-detection browser hook: For sites that require advanced browser fingerprinting evasion (Turnstile challenges, DDoS-Guard), the adapter framework supports routing scrape requests to a self-hosted anti-detection browser service. `AdapterContext.config` provides access to environment variables for configuring the external service endpoint. This is an advanced operator pattern: the external browser service is self-managed and outside the project's scope. GroktoCrawl provides the dispatch interface; the external service's behavior, compatibility, and reliability are the operator's responsibility.
Active development. All core Firecrawl v2 API endpoints implemented and integration-tested. See issues for upcoming features. Contributions welcome — see CONTRIBUTING.md.