GroktoCrawl

Self-hosted Firecrawl alternative with semantic search, grounded Q&A, site adapters, and an autonomous research agent. MIT licensed. One docker compose up and you're running.

GroktoCrawl implements the Firecrawl v2 API surface — scrape, search, map, crawl, extract, browser sessions, and monitors — plus capabilities Firecrawl doesn't offer: a persistent semantic search engine with Qdrant vector index, a grounded Q&A endpoint with citations, a web portal for human users, site adapters for GitHub/Substack/Reddit/YouTube/Bluesky, an intelligent scrape cache with ETag/Last-Modified revalidation, and a full observability stack with health probes and Prometheus metrics. Runs entirely in Docker on your own hardware. Bring your own LLM or use the built-in fixtures.

Quick Start

cp .env.sample .env
docker compose up --build -d

Eight containers start. The stack includes SearXNG for real web search, a smart scraper, and an Ofelia-scheduled monitor system.

# CLI
./groktocrawl scrape https://example.com
./groktocrawl search "raspberry pi 5" --limit 3
./groktocrawl agent "What were the key Google I/O 2025 announcements?"

# Or raw curl
curl http://localhost:8080/health
curl -X POST http://localhost:8080/v2/scrape -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'
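
The same scrape call can be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_scrape_request` and `scrape` are illustrative and not part of the project, and the response is returned as decoded JSON without assuming any particular field layout.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # default server address used in this README


def build_scrape_request(url, base_url=BASE_URL):
    """Build (but do not send) a POST /v2/scrape request for `url`."""
    body = json.dumps({"url": url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v2/scrape",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scrape(url, base_url=BASE_URL, timeout=60):
    """Send the request and return the decoded JSON response body."""
    request = build_scrape_request(url, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Requires a running stack; prints whatever JSON the server returns.
    print(json.dumps(scrape("https://example.com"), indent=2))
```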

Production Setup

Edit .env to point at a real LLM:

# DeepSeek
LLM_API_KEY=sk-...
LLM_BASE_URL=https://api.deepseek.com/v1
LLM_MODEL=deepseek-v4-flash

# OpenAI
LLM_API_KEY=sk-...
LLM_BASE_URL=https://api.openai.com/v1
LLM_MODEL=gpt-4o-mini

# Ollama (local)
LLM_BASE_URL=http://host.docker.internal:11434/v1
LLM_MODEL=llama3.2

Architecture

flowchart TD
    subgraph compose["docker-compose.yml"]
        valkey[("valkey<br/>(queue + storage)")]
        searxng["searxng<br/>(web search)"]
        scraper("scraper-svc<br/>(smart fetch)")
        browser["browser-svc<br/>(Playwright sessions)"]
        agent("agent-svc<br/>(FastAPI + workers)")
        ofelia["ofelia<br/>(cron scheduler)"]

        valkey --- agent
        searxng --- agent
        scraper --- agent
        browser --- agent
        ofelia -.->|docker exec| agent
    end
    llm_provider("LLM Provider<br/>(DeepSeek / OpenAI / Ollama)")
    llm_provider -.->|LLM_BASE_URL| agent

    style valkey fill:#ffe0b0
    style searxng fill:#b0d4ff
    style scraper fill:#b0ffb0
    style browser fill:#d4b0ff
    style agent fill:#ffb0b0
    style ofelia fill:#b0b0b0

The scraper uses a three-tier strategy: check /llms.txt first, try Accept: text/markdown second, render with Playwright third.
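
The tier ordering can be pictured as a simple fall-through loop. This is a sketch under stated assumptions, not the scraper's actual code: `scrape_tiered` and the stub fetchers are hypothetical names, and each tier is modelled as a callable that returns content or `None`.

```python
from typing import Callable, List, Optional, Tuple

# A fetcher takes a URL and returns content, or None when the tier has nothing.
Fetcher = Callable[[str], Optional[str]]


def scrape_tiered(url: str, tiers: List[Tuple[str, Fetcher]]) -> Tuple[str, str]:
    """Try each (name, fetcher) in order; return (tier_name, content) for the first hit."""
    for name, fetch in tiers:
        try:
            content = fetch(url)
        except Exception:
            content = None  # a failing tier counts as a miss; fall through to the next
        if content:
            return name, content
    raise RuntimeError(f"all tiers failed for {url}")


# Stub fetchers standing in for the three tiers described above.
def fetch_llms_txt(url: str) -> Optional[str]:
    return None  # pretend the site publishes no /llms.txt


def fetch_markdown(url: str) -> Optional[str]:
    return "# Example Domain"  # pretend the server honoured Accept: text/markdown


def render_with_browser(url: str) -> Optional[str]:
    return "rendered page text"  # placeholder for a Playwright render


tier_name, content = scrape_tiered(
    "https://example.com",
    [
        ("llms.txt", fetch_llms_txt),
        ("accept-markdown", fetch_markdown),
        ("playwright", render_with_browser),
    ],
)
```

Ordering the cheapest tier first means the browser is only started when the two plain HTTP attempts produce nothing.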

Every scrape response includes a quality field with post-extraction content quality assessment (boilerplate detection, completeness checks, block page detection). See docs/adr/0016-extraction-quality-gates.md for details.

CLI

groktocrawl is a Python CLI script in the repo root. It requires the requests library.

If you want to avoid installing dependencies into your global Python, use a repo-local uv environment:

uv venv
uv pip install requests
uv run ./groktocrawl scrape https://example.com

To expose a global groktocrawl command while keeping dependencies isolated, create a small wrapper somewhere on your PATH:

cat > ~/.local/bin/groktocrawl <<'EOF'
#!/bin/sh
cd "$HOME/groktocrawl" || exit 1
exec uv run ./groktocrawl "$@"
EOF
chmod +x ~/.local/bin/groktocrawl

Or install requests into the Python that runs the script:

python3 -m pip install requests

./groktocrawl scrape <url>                      # Scrape a page to markdown
./groktocrawl search <query> --limit 5          # Search the web (default: general)
./groktocrawl search <query> --sources news     # Search news sources only
./groktocrawl search <query> --categories research  # Search with content category (mapped to SearXNG)
./groktocrawl search <query> --sources news --categories research  # Combined filter
./groktocrawl map <url> --limit 100             # Discover URLs on a site
./groktocrawl crawl <url> --max-depth 2         # Crawl a website
./groktocrawl agent "<prompt>"                  # Autonomous research agent
./groktocrawl --json --server <url> <cmd>       # JSON output, custom server

API Endpoints

Method	Endpoint	Description
POST	`/v2/scrape`	Scrape a single URL to clean markdown
POST	`/v2/agent`	Start an autonomous research agent
GET	`/v2/agent/:jobId`	Get agent job status and results
DELETE	`/v2/agent/:jobId`	Cancel an agent job
POST	`/v2/answer`	Grounded Q&A — search, synthesize, cite in one round-trip
POST	`/v2/extract`	Extract structured data from URLs (with schema)
GET	`/v2/extract/:jobId`	Get extract status and results
POST	`/v2/crawl`	Crawl a website
GET	`/v2/crawl/:jobId`	Get crawl status
DELETE	`/v2/crawl/:jobId`	Cancel a crawl
POST	`/v2/batch/scrape`	Scrape multiple URLs
POST	`/v2/search`	Search the web with content
POST	`/v2/map`	Discover URLs on a site
POST	`/v2/parse`	Upload a file (PDF, DOCX, PPTX, XLSX) and get markdown back
POST	`/v2/browser`	Create a headless browser session
GET	`/v2/browser`	List active browser sessions
POST	`/v2/browser/:id/execute`	Execute action (navigate, click, screenshot, etc.)
DELETE	`/v2/browser/:id`	Destroy a browser session
POST	`/v2/monitor`	Create a scheduled change monitor
GET	`/v2/monitor`	List all monitors
GET	`/v2/monitor/:id`	Get monitor status and history
PATCH	`/v2/monitor/:id`	Update monitor config
DELETE	`/v2/monitor/:id`	Delete a monitor
POST	`/v2/generate-llmstxt`	Generate an llms.txt file for a website
GET	`/v2/generate-llmstxt/:jobId`	Get generation status and result

All endpoints are compatible with the Firecrawl v2 API in request and response shape.

Search endpoint

POST /v2/search accepts Firecrawl v2's two-dimensional search model:

Parameter	Type	Description
`query`	`string`	Required. Search query
`limit`	`int`	Max results (default: 5)
`sources`	`string[]`	Source type filter: `web`, `news`, `images`, `video`, `social`
`categories`	`string[]`	Content category: `research`, `github`, `pdf`, `news`, `science`, `it`, `general`

Both sources and categories are translated to SearXNG-native categories and can be combined:

Firecrawl value	Mapped to SearXNG
`sources=news`	`categories=news`
`sources=images`	`categories=images`
`sources=web`	`categories=general`
`categories=research`	`categories=science`
`categories=github`	`categories=it`
`categories=pdf`	`categories=general`

Unknown values pass through to SearXNG as-is for forward compatibility. When neither sources nor categories is specified, defaults to general.
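
The translation rules above fit in a few lines. This sketch copies the mapping table from this section; the function name `to_searxng_categories` and the order-preserving de-duplication are assumptions made for illustration, not the service's actual behavior.

```python
# Mapping tables copied from the table above.
SOURCE_MAP = {"news": "news", "images": "images", "web": "general"}
CATEGORY_MAP = {"research": "science", "github": "it", "pdf": "general"}


def to_searxng_categories(sources=None, categories=None):
    """Translate Firecrawl-style sources/categories into SearXNG categories."""
    mapped = []
    for value in sources or []:
        mapped.append(SOURCE_MAP.get(value, value))    # unknown values pass through
    for value in categories or []:
        mapped.append(CATEGORY_MAP.get(value, value))  # unknown values pass through
    unique = list(dict.fromkeys(mapped))               # drop duplicates, keep order (assumption)
    return unique or ["general"]                       # default when nothing is specified


combined = to_searxng_categories(sources=["news"], categories=["research"])
```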

Results are grouped by source type in the response:

{"data": {"web": [...], "images": [], "news": []}}

Agent endpoint

The POST /v2/agent endpoint accepts an optional model field to override the environment-configured LLM on a per-request basis:

{
  "prompt": "Research the latest AI safety papers",
  "model": "gpt-4o"
}

When model is omitted or set to "default", the LLM_MODEL from .env is used. This is useful for routing simple lookups to a cheaper model and complex research to a more capable one.
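
The override rule can be expressed as a small resolver. This is an illustrative sketch only: `resolve_model` is a hypothetical helper, and the environment is passed in as a plain mapping so the logic can be exercised without touching the real process environment.

```python
import os


def resolve_model(requested, env=None):
    """Return the per-request model, or LLM_MODEL when omitted or "default"."""
    env = os.environ if env is None else env
    if requested is None or requested == "default":
        return env.get("LLM_MODEL")
    return requested


env = {"LLM_MODEL": "gpt-4o-mini"}
cheap = resolve_model(None, env)         # falls back to the configured model
capable = resolve_model("gpt-4o", env)   # per-request override wins
```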

The agent is driven by a research prompt that directs it to evaluate source quality, synthesize across multiple pages, detect contradictions, and cite sources by URL. It is instructed not to fabricate information: if the available sources don't answer the question, it says so and suggests what would be needed.

OpenAPI / Swagger Docs

Interactive API documentation is available when the stack is running:

Swagger UI: http://localhost:8080/docs
Raw OpenAPI spec: http://localhost:8080/openapi.json

The spec is auto-generated by FastAPI from the route handlers and Pydantic models — always up to date with the running code. All 17+ endpoints with request/response schemas are documented.

Comparison to Firecrawl

Feature	Firecrawl Cloud	Firecrawl Self-Hosted	GroktoCrawl
Scrape / Crawl / Map / Search	✅	✅	✅
Agent endpoint	✅	❌	✅
Extract (schema-based)	✅	❌	✅
Browser sessions	✅	❌	✅
Scheduled monitors	✅	❌	✅
Parse (PDF, DOCX)	✅	✅	✅
Generate llms.txt	❌	❌	✅
Webhook delivery	✅	❌	✅
License	Proprietary	AGPL-3.0	MIT
Self-contained Docker	❌	✅	✅
LLM integration	Built-in	Requires API key	BYO or fixture
Beyond Firecrawl
Semantic search / vector index	❌	❌	✅
Grounded Q&A (/v2/answer)	❌	❌	✅
Web portal for human users	❌	❌	✅
Site adapters (GitHub, Substack, Reddit, YouTube, Bluesky)	❌	❌	✅
Intelligent scrape cache (ETag/Last-Modified)	❌	❌	✅
Politeness protocol (robots.txt, rate limiting)	❌	❌	✅
Proxy support	❌	❌	✅
Agent SSE streaming	❌	❌	✅
Search type spectrum (fast / rich / structured)	❌	❌	✅
Artifact-pyramid CLI output	❌	❌	✅

AgentSkills Compatibility

GroktoCrawl ships as an AgentSkills-compatible skill at skills/groktocrawl/. Any agent that supports the AgentSkills format (Claude Code, Cursor, etc.) can load it:

skills/groktocrawl/
├── SKILL.md                  # Metadata + instructions
├── scripts/groktocrawl       # CLI — all endpoints
├── references/triggers.md    # When to use which command
└── assets/examples.md        # Usage examples

The skill bundles the CLI directly — no additional setup required beyond having the repo on disk.

Hermes Agent Considerations

If you use Hermes Agent, GroktoCrawl replaces the built-in web_search and web_extract tools with more capable alternatives. To avoid competition between tools:

Disable the `web` toolset

Remove web from default_toolsets and platform_toolsets.cli in ~/.hermes/config.yaml:

# Before
default_toolsets:
  - terminal
  - file
  - web              # ← remove

# After
default_toolsets:
  - terminal
  - file

This removes web_search and web_extract from your agent's toolset. All web tasks will route through groktocrawl instead.

Install the CLI

The CLI is at groktocrawl in the repo root. Copy it to your PATH:

cp groktocrawl ~/.local/bin/

Install the AgentSkills skill

The bundled skill at skills/groktocrawl/ follows the AgentSkills spec. Symlink it into your Hermes skills directory:

ln -sf "$PWD/skills/groktocrawl" ~/.hermes/skills/

Then load it in-session with /skill groktocrawl or preload it via hermes -s groktocrawl.

Environment variables

The CLI discovers the server in this order:

1. --server <url> flag
2. GROKTOCRAWL_API_URL env var
3. FIRECRAWL_API_URL env var (backward compat)
4. ~/.hermes/.env file
5. Default: http://localhost:8080

Add to ~/.hermes/.env if your instance runs elsewhere:

GROKTOCRAWL_API_URL=http://localhost:8080
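
The lookup order can be sketched as a chain of early returns. The function names are hypothetical, the `.env` parser handles only simple `KEY=VALUE` lines, and reading only `GROKTOCRAWL_API_URL` from the file is an assumption; the real CLI may parse the file differently.

```python
DEFAULT_SERVER = "http://localhost:8080"


def parse_env_file(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def discover_server(flag=None, env=None, hermes_env_text=""):
    """Resolve the server URL: flag, then env vars, then ~/.hermes/.env, then default."""
    env = env or {}
    if flag:
        return flag
    for name in ("GROKTOCRAWL_API_URL", "FIRECRAWL_API_URL"):
        if env.get(name):
            return env[name]
    from_file = parse_env_file(hermes_env_text).get("GROKTOCRAWL_API_URL")
    return from_file or DEFAULT_SERVER


server = discover_server(env={"FIRECRAWL_API_URL": "http://crawl.internal:8080"})
```

In real use, `env` would be `os.environ` and `hermes_env_text` the contents of `~/.hermes/.env` if that file exists.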

Security

API Authentication (recommended for production)

Set API_KEY in your .env file to enable bearer token authentication:

API_KEY=sk-your-secret-key-here

Once set, every API call must include an Authorization or X-API-Key header:

curl -X POST http://localhost:8080/v2/scrape \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-your-secret-key-here" \
  -d '{"url": "https://example.com"}'

# Or via CLI:
groktocrawl --api-key sk-your-secret-key-here scrape https://example.com

When no API_KEY is configured, the API is fully open (backward compatible). Each response includes an X-Security-Warning header and the /health endpoint adds a security field to warn callers.
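
The accept/reject rule described above can be sketched as a header check. This is not the project's middleware: `extract_api_key` and `is_authorized` are hypothetical names, and the constant-time comparison is a general good practice rather than something this README states.

```python
import hmac


def extract_api_key(headers):
    """Return the key from 'Authorization: Bearer <key>' or 'X-API-Key', else None."""
    normalized = {name.lower(): value for name, value in headers.items()}
    scheme, _, token = normalized.get("authorization", "").partition(" ")
    if scheme.lower() == "bearer" and token.strip():
        return token.strip()
    return normalized.get("x-api-key")


def is_authorized(headers, configured_key):
    """Open when no key is configured; otherwise require a matching key."""
    if not configured_key:
        return True  # open mode, as described for an unset API_KEY
    presented = extract_api_key(headers)
    if presented is None:
        return False
    # compare_digest avoids leaking key contents through timing differences
    return hmac.compare_digest(presented.encode(), configured_key.encode())


ok = is_authorized({"Authorization": "Bearer sk-your-secret-key-here"}, "sk-your-secret-key-here")
```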

Private Network Protection

The built-in browser and scraper services block navigation to private IPs (RFC 1918), loopback addresses, cloud metadata endpoints, and the Docker host machine. This prevents SSRF-based pivoting through the headless browser. The blocklist applies to both direct URLs and resolved hostnames (DNS rebinding protection).
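
A check of this kind can be built on the standard `ipaddress` module. The sketch below is an approximation under stated assumptions, not the project's blocklist: `is_blocked_ip`, `resolve_all`, and `is_blocked_host` are hypothetical names, and the set of rejected ranges is a guess at what such a filter would cover.

```python
import ipaddress
import socket


def is_blocked_ip(address):
    """True for private, loopback, link-local (incl. 169.254.169.254) and similar ranges."""
    ip = ipaddress.ip_address(address)
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_unspecified
        or ip.is_reserved
        or ip.is_multicast
    )


def resolve_all(host):
    """Resolve a hostname to every address it currently maps to."""
    return sorted({info[4][0] for info in socket.getaddrinfo(host, None)})


def is_blocked_host(host, resolver=resolve_all):
    """Block the host if ANY resolved address is in a blocked range."""
    return any(is_blocked_ip(address) for address in resolver(host))


metadata_blocked = is_blocked_ip("169.254.169.254")
```

Checking the resolved addresses rather than the hostname string is what catches a public name that points at an internal address. A complete defense against DNS rebinding would also need to connect to the address that was checked instead of resolving again.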

Service Architecture

Only the agent API (port 8080) is exposed to the host. Internal services (browser-svc, scraper-svc, parse-svc) are reachable only via Docker internal DNS — they do not publish host ports. All requests route through the agent API.

Reporting Vulnerabilities

See SECURITY.md for our disclosure policy and how to privately report security issues.

Proxy Configuration

GroktoCrawl supports outbound proxy routing via the SCRAPER_PROXY_URL environment variable. When set, all scrape requests route through the specified proxy before reaching their target.

SCRAPER_PROXY_URL=http://user:pass@residential-proxy:8080

Supported schemes: http://, https://, socks5://, socks5h://

Behavior:

The proxy is applied at the transport layer across the full scrape pipeline — httpx clients (Tiers 1-2) and Playwright browser context (Tier 3)
If the proxy is unreachable, GroktoCrawl fails open: it retries the request without a proxy and logs the fallback at WARN level
Every proxied scrape records proxy_host=<host:port> in its structured log for operational debugging

Guardrails:

Opt-in only — users who don't set this variable see zero behavioral change
Single static proxy — one URL only. For proxy rotation or pool management, front GroktoCrawl with a rotating proxy orchestrator (HAProxy, Scrapoxy, etc.)
Credentials never logged — only the host and port are recorded in scrape logs; the full URL (including auth) is redacted at the logging boundary
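
Redacting at the logging boundary amounts to keeping only the host and port of the proxy URL. A minimal sketch, assuming a hypothetical helper named `proxy_host_for_log`; the project's actual redaction code is not shown in this README.

```python
from urllib.parse import urlsplit


def proxy_host_for_log(proxy_url):
    """Return 'host:port' for logging, dropping scheme, username, and password."""
    parts = urlsplit(proxy_url)
    if parts.hostname is None:
        return "<invalid-proxy-url>"
    if parts.port is None:
        return parts.hostname
    return f"{parts.hostname}:{parts.port}"


log_value = proxy_host_for_log("http://user:pass@residential-proxy:8080")
```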

Notes:

Proxy credentials embedded in the URL use standard HTTP Basic Auth encoding. Avoid special characters (@, #, %) in passwords — they can conflict with URL parsing.
The Playwright proxy uses context-level assignment (browser.new_context(proxy=...)) for per-job isolation, not launch-level args (--proxy-server).
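
If a password does contain reserved characters, percent-encoding is the standard way to embed it in a URL. The sketch below shows that encoding and how a proxy URL could be split into the `server` / `username` / `password` dictionary that Playwright's `proxy` option accepts. Both helper names are hypothetical, and whether GroktoCrawl decodes percent-encoded credentials is not stated here, so test before relying on it.

```python
from urllib.parse import quote, unquote, urlsplit


def build_proxy_url(scheme, host, port, username, password):
    """Percent-encode credentials so '@', '#', and '%' cannot break URL parsing."""
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"{scheme}://{user}:{secret}@{host}:{port}"


def playwright_proxy_settings(proxy_url):
    """Split a proxy URL into a Playwright-style proxy dict (assumes an explicit port)."""
    parts = urlsplit(proxy_url)
    settings = {"server": f"{parts.scheme}://{parts.hostname}:{parts.port}"}
    if parts.username:
        settings["username"] = unquote(parts.username)
    if parts.password:
        settings["password"] = unquote(parts.password)
    return settings


url = build_proxy_url("http", "residential-proxy", 8080, "user", "p@ss#1%")
settings = playwright_proxy_settings(url)
# settings could then be passed as browser.new_context(proxy=settings)
```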

See ADR-0020 for the full architecture decision.

Adapters

GroktoCrawl supports site-specific content handlers that extract richer content from JavaScript-heavy sites. When scrape <url> is called, the adapter registry checks whether a handler matches the URL before running the generic pipeline. If one matches, the adapter handles extraction with its own fallback chain. If the adapter fails, the generic pipeline runs as normal.
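
The match-then-fall-back flow can be sketched as follows. This is an illustration, not the registry's real code: the `Adapter` dataclass and `dispatch` function are hypothetical, adapters are tried in descending priority (an assumption), and only the first matching adapter is attempted before handing off to the generic pipeline.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Adapter:
    name: str
    pattern: str                              # regex tested against the URL
    priority: int                             # higher runs first (assumption)
    scrape: Callable[[str], Optional[str]]    # returns markdown, or None on failure


def dispatch(url: str, adapters: List[Adapter], generic: Callable[[str], str]) -> Tuple[str, str]:
    """Return (handler_name, content), preferring a matching adapter over the generic pipeline."""
    for adapter in sorted(adapters, key=lambda a: a.priority, reverse=True):
        if re.search(adapter.pattern, url):
            try:
                result = adapter.scrape(url)
            except Exception:
                result = None  # adapter errors are treated as a miss
            if result:
                return adapter.name, result
            break  # the matching adapter failed: hand off to the generic pipeline
    return "generic", generic(url)


adapters = [
    Adapter("youtube", r"^https://(www\.)?youtube\.com/watch", 100, lambda u: "transcript text"),
    Adapter("bluesky", r"^https://bsky\.app/", 100, lambda u: None),  # simulated failure
]
handled_by, body = dispatch("https://bsky.app/profile/example/post/1", adapters, lambda u: "generic markdown")
```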

YouTube Adapter

scrape <youtube-url> returns a markdown document with:

YAML frontmatter: video_id, title, channel, channel_url, thumbnail_url, source
Markdown body: full video transcript

Fallback chain: youtube_transcript_api (free, no key) → browser render + DOM extraction

Configuration:

Variable	Default	Description
`ADAPTER_YOUTUBE_API_KEY`	(none)	YouTube Data API v3 key (optional — transcript works without it)

Bluesky Adapter

scrape <bsky.app-url> returns a markdown document with:

YAML frontmatter: author, handle, did, post_id, timestamp, reply_count, like_count, repost_count
Markdown body: post text + thread replies

Fallback chain: AT Protocol XRPC API (public, no auth) → browser render + DOM extraction

Configuration: None — the public API requires no authentication.

Substack Adapter

scrape <substack-url> returns a markdown document with:

YAML frontmatter: title, author, publication, published_date, source
Markdown body: full article text in clean markdown

Fallback chain: RSS feed (fast, structured, no auth) → readability-lxml page extraction → browser render

Configuration: None — Substack requires no API keys.

Vanity domain detection: The adapter automatically detects Substack-hosted publications behind custom domains (e.g. www.lennysnewsletter.com) by probing {domain}/feed for the Substack RSS generator tag. Results are cached per-domain for 1 hour.
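
The probe-and-cache behavior might look like the sketch below. Everything here is illustrative: the class and function names are hypothetical, the exact text of the generator tag is an assumption, and the feed fetcher and clock are injected so the logic runs without network access.

```python
import re
import time

CACHE_TTL_SECONDS = 3600  # results are cached per domain for 1 hour

# Assumed shape of the generator tag in a Substack RSS feed.
_GENERATOR_RE = re.compile(r"<generator>\s*Substack\s*</generator>", re.IGNORECASE)


def looks_like_substack_feed(feed_xml):
    """True when the feed text carries a Substack generator tag."""
    return bool(_GENERATOR_RE.search(feed_xml))


class VanityDomainDetector:
    def __init__(self, fetch_feed, clock=time.monotonic):
        self._fetch_feed = fetch_feed  # callable: domain -> feed XML text, or None
        self._clock = clock
        self._cache = {}               # domain -> (expires_at, is_substack)

    def is_substack(self, domain):
        now = self._clock()
        cached = self._cache.get(domain)
        if cached and cached[0] > now:
            return cached[1]           # still fresh: skip the network probe
        feed = self._fetch_feed(domain) or ""
        result = looks_like_substack_feed(feed)
        self._cache[domain] = (now + CACHE_TTL_SECONDS, result)
        return result
```

In real use, `fetch_feed` would issue a GET to `https://{domain}/feed` and return the response text.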

GitHub Adapters

Two adapters handle different URL types on github.com, working together via priority dispatch:

Priority	Adapter	Handles	Primary Strategy
200	GitHub File	raw files, blobs, READMEs, directory listings	raw.githubusercontent.com direct fetch
190	GitHub Social	issues, PRs, discussions, releases, commits	GraphQL API (v4)

scrape <github-url> returns structured markdown with YAML frontmatter containing owner, repo, and type-specific metadata.

Resource coverage:

URL Pattern	Handled By	Features
`raw.githubusercontent.com/*`	File adapter	Raw content, no rate limit
`github.com/*/blob/*`	File adapter	Rewrites to raw URL
`github.com/*` (repo root)	File adapter	README + stars/forks/language/topics
`github.com/*/tree/*`	File adapter	Directory listing, items sorted dirs-first
`github.com/*/issues/{n}`	Social adapter	Body, comments, labels, state, milestone
`github.com/*/pull/{n}`	Social adapter	Body, reviews, diff stats, changed files, merge status
`github.com/*/discussions/{n}`	Social adapter	Category, upvotes, answer, comments
`github.com/*/releases/tag/{v}`	Social adapter	Release notes, assets, download URLs
`github.com/*/releases`	Social adapter	Releases list with descriptions
`github.com/*/commit/{sha}`	Social adapter	Message, author, associated PRs
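
The blob-to-raw rewrite in the table is a mechanical path transformation. A minimal sketch, assuming a hypothetical helper `blob_to_raw` and a ref (branch, tag, or SHA) that contains no slash; refs with slashes are ambiguous in the URL and would need an API lookup.

```python
from urllib.parse import urlsplit


def blob_to_raw(url):
    """Rewrite github.com/<owner>/<repo>/blob/<ref>/<path> to its raw.githubusercontent.com URL."""
    parts = urlsplit(url)
    segments = [segment for segment in parts.path.split("/") if segment]
    if parts.hostname != "github.com" or len(segments) < 5 or segments[2] != "blob":
        return None  # not a blob URL; leave it to another handler
    owner, repo, _, ref, *path = segments
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{'/'.join(path)}"


raw_url = blob_to_raw("https://github.com/octocat/Hello-World/blob/master/README")
```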

Fallback chains:

File adapter: raw.githubusercontent.com direct fetch → GitHub Contents API → generic tier
Social adapter: GitHub GraphQL API (single query) → GitHub REST API → HTML page scrape (readability) → generic tier

Configuration:

The GITHUB_TOKEN environment variable enables authenticated access:

Variable	Default	Effect
`GITHUB_TOKEN`	(none)	5,000 API req/hr vs 60/hr unauth; enables GraphQL; always falls back to HTML scrape

A token with public_repo scope is sufficient for public repositories. For private repos, use repo scope. Without a token, the file adapter works fully and the social adapter falls back to REST (60 req/hr) then HTML scrape — every URL type returns useful content.

Adding a New Adapter

1. Create scraper-svc/scraper/adapters/<site>.py
2. Subclass SiteAdapter, set name, patterns, priority, implement scrape()
3. Decorate with @adapter for auto-registration
4. Add any new dependencies to scraper-svc/pyproject.toml
5. Add .env variables to .env.sample and document them in this section

See docs/adr/ for the architecture decision records behind the adapter pattern, and CONTRIBUTING.md for the ADR convention.

External anti-detection browser hook: For sites that require advanced browser fingerprinting evasion (Turnstile challenges, DDoS-Guard), the adapter framework supports routing scrape requests to a self-hosted anti-detection browser service. The AdapterContext.config provides access to environment variables for configuring the external service endpoint. This is an advanced operator pattern — the external browser service is self-managed and outside the project's scope. GroktoCrawl provides the dispatch interface; the external service's behavior, compatibility, and reliability are the operator's responsibility.

Project Status

Active development. All core Firecrawl v2 API endpoints implemented and integration-tested. See issues for upcoming features. Contributions welcome — see CONTRIBUTING.md.
