Production-ready web crawler for Ruby, powered by Playwright. Bringing the power of modern browser automation to the Ruby ecosystem with first-class Rails support.
RubyCrawl provides accurate, JavaScript-enabled web scraping using Playwright's battle-tested browser automation, wrapped in a clean Ruby API. Perfect for extracting content from modern SPAs, dynamic websites, and building RAG knowledge bases.
Why RubyCrawl?
- Real browser – Handles JavaScript, AJAX, and SPAs correctly
- Zero config – Works out of the box, no Playwright knowledge needed
- Production-ready – Auto-retry, error handling, resource optimization
- Multi-page crawling – BFS algorithm with smart URL deduplication
- Rails-friendly – Generators, initializers, and ActiveJob integration
- Modular architecture – Clean, testable, maintainable codebase
- Playwright-powered: Real browser automation for JavaScript-heavy sites and SPAs
- Production-ready: Designed for Rails apps and production environments with auto-retry and error handling
- Simple API: Clean, minimal Ruby interface – zero Playwright or Node.js knowledge required
- Resource optimization: Built-in resource blocking for 2-3x faster crawls
- Auto-managed browsers: Browser process reuse and automatic lifecycle management
- Content extraction: HTML, links (with metadata), and lazy-loaded Markdown conversion
- Multi-page crawling: BFS (breadth-first search) crawler with configurable depth limits and URL deduplication
- Smart URL handling: Automatic normalization, tracking parameter removal, and same-host filtering (see the sketch after this list)
- Rails integration: First-class Rails support with generators and initializers
- Modular design: Clean separation of concerns with focused, testable modules
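To make the smart URL handling item concrete, here is a rough illustration in plain Ruby of what normalization and tracking-parameter removal look like. This is a sketch of the general technique only, not RubyCrawl's internal implementation; the parameter list is an assumption:

```ruby
require "uri"

# Illustrative sketch only: the kind of normalization the feature list
# describes. Not RubyCrawl's actual code; the tracking-parameter list
# is a common convention, assumed here for the example.
TRACKING_PARAMS = %w[utm_source utm_medium utm_campaign utm_term utm_content fbclid gclid].freeze

def normalize_url(raw)
  uri = URI.parse(raw)
  uri.fragment = nil # drop #anchors, which never change page content
  if uri.query
    kept = URI.decode_www_form(uri.query).reject { |k, _| TRACKING_PARAMS.include?(k) }
    uri.query = kept.empty? ? nil : URI.encode_www_form(kept)
  end
  uri.to_s
end

normalize_url("https://example.com/a?utm_source=x&page=2#top")
# => "https://example.com/a?page=2"
```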
- Features
- Installation
- Quick Start
- Use Cases
- Usage
- Rails Integration
- Production Deployment
- Architecture
- Performance
- Development
- Roadmap
- Contributing
- Why Choose RubyCrawl?
- License
- Support
- Ruby >= 3.0
- Node.js LTS (v18+ recommended) – required for the bundled Playwright service
gem "rubycrawl"Then install:
bundle installAfter bundling, install the Playwright browsers:
bundle exec rake rubycrawl:installThis command:
- β
Installs Node.js dependencies in the bundled
node/directory - β Downloads Playwright browsers (Chromium, Firefox, WebKit) β ~300MB download
- β Creates a Rails initializer (if using Rails)
Note: You only need to run this once. The installation task is idempotent and safe to run multiple times.
Troubleshooting installation:
```bash
# If installation fails, check Node.js version
node --version  # Should be v18+ LTS

# Enable verbose logging
RUBYCRAWL_NODE_LOG=/tmp/rubycrawl.log bundle exec rake rubycrawl:install

# Check installation status
cd node && npm list
```

```ruby
require "rubycrawl"

# Simple crawl
result = RubyCrawl.crawl("https://example.com")

# Access extracted content
puts result.html      # Raw HTML content
puts result.markdown  # Converted to Markdown
puts result.links     # Extracted links from the page
puts result.metadata  # Status code, final URL, etc.
```

RubyCrawl is perfect for:
- Data aggregation: Crawl product catalogs, job listings, or news articles
- RAG applications: Build knowledge bases for LLM/AI applications by crawling documentation sites
- SEO analysis: Extract metadata, links, and content structure
- Content migration: Convert existing sites to Markdown for static site generators
- Testing: Verify deployed site structure and content
- Documentation scraping: Create local copies of documentation with preserved links
The simplest way to crawl a URL:
```ruby
result = RubyCrawl.crawl("https://example.com")

# Access the results
result.html      # => "<html>...</html>"
result.markdown  # => "# Example Domain\n\nThis domain is..." (lazy-loaded)
result.links     # => [{ "url" => "https://...", "text" => "More info" }, ...]
result.metadata  # => { "status" => 200, "final_url" => "https://example.com" }
result.text      # => "" (coming soon)
```

Crawl an entire site following links with BFS (breadth-first search):
```ruby
# Crawl up to 100 pages, max 3 links deep
RubyCrawl.crawl_site("https://example.com", max_pages: 100, max_depth: 3) do |page|
  # Each page is yielded as it's crawled (streaming)
  puts "Crawled: #{page.url} (depth: #{page.depth})"

  # Save to database
  Page.create!(
    url: page.url,
    html: page.html,
    markdown: page.markdown,
    depth: page.depth
  )
end
```

Real-world example: Building a RAG knowledge base
```ruby
# Crawl documentation site for AI/RAG application
require "rubycrawl"
require "digest"

RubyCrawl.configure(
  wait_until: "networkidle",  # Ensure JS content loads
  block_resources: true       # Skip images/fonts for speed
)

pages_crawled = RubyCrawl.crawl_site(
  "https://docs.example.com",
  max_pages: 500,
  max_depth: 5,
  same_host_only: true
) do |page|
  # Store in vector database for RAG
  VectorDB.upsert(
    id: Digest::SHA256.hexdigest(page.url),
    content: page.markdown,  # Clean markdown for better embeddings
    metadata: {
      url: page.url,
      title: page.metadata["title"],
      depth: page.depth
    }
  )
  puts "Indexed: #{page.metadata['title']} (#{page.depth} levels deep)"
end

puts "Crawled #{pages_crawled} pages into knowledge base"
```

| Option | Default | Description |
|---|---|---|
| `max_pages` | 50 | Maximum number of pages to crawl |
| `max_depth` | 3 | Maximum link depth from start URL |
| `same_host_only` | true | Only follow links on the same domain |
| `wait_until` | inherited | Page load strategy |
| `block_resources` | inherited | Block images/fonts/CSS |
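For intuition, the breadth-first traversal these options control can be sketched in a few lines of plain Ruby. This is a conceptual model of what `crawl_site` does, not the gem's internal code; `fetch_links` is a helper defined here for the example:

```ruby
require "set"

# Conceptual BFS sketch: visit pages level by level, dedupe URLs,
# and stop at max_pages / max_depth. Not RubyCrawl's internal code.
def fetch_links(url)
  RubyCrawl.crawl(url).links.map { |link| link["url"] }
end

def bfs_crawl(start_url, max_pages: 50, max_depth: 3)
  queue   = [[start_url, 0]] # [url, depth] pairs
  visited = Set.new([start_url])
  crawled = 0

  until queue.empty? || crawled >= max_pages
    url, depth = queue.shift
    links = fetch_links(url)
    crawled += 1
    yield url, depth if block_given?

    next if depth >= max_depth
    links.each do |link|
      # Set#add? returns nil when the URL was already seen
      queue << [link, depth + 1] if visited.add?(link)
    end
  end
  crawled
end
```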
The block receives a `PageResult` with:

```ruby
page.url       # String: Final URL after redirects
page.html      # String: Full HTML content
page.markdown  # String: Lazy-converted Markdown
page.links     # Array: URLs extracted from page
page.metadata  # Hash: HTTP status, final URL, etc.
page.depth     # Integer: Link depth from start URL
```

Set default options that apply to all crawls:
```ruby
RubyCrawl.configure(
  wait_until: "networkidle",  # Wait until network is idle
  block_resources: true       # Block images, fonts, CSS for speed
)

# All subsequent crawls use these defaults
result = RubyCrawl.crawl("https://example.com")
```

Override defaults for specific requests:
```ruby
# Use global defaults
result = RubyCrawl.crawl("https://example.com")

# Override for this request only
result = RubyCrawl.crawl(
  "https://example.com",
  wait_until: "domcontentloaded",
  block_resources: false
)
```

| Option | Values | Default | Description |
|---|---|---|---|
| `wait_until` | `"load"`, `"domcontentloaded"`, `"networkidle"` | `"load"` | When to consider the page loaded |
| `block_resources` | `true`, `false` | `true` | Block images, fonts, CSS, media for faster crawls |
Wait strategies explained:
- `load` – Wait for the load event (fastest, good for static sites)
- `domcontentloaded` – Wait for DOM ready (medium speed)
- `networkidle` – Wait until there have been no network requests for 500ms (slowest, best for SPAs)
Sessions allow reusing browser contexts for better performance when crawling multiple pages. They're automatically used by `crawl_site`, but you can manage them manually for advanced use cases:
```ruby
# Create a session (reusable browser context)
session_id = RubyCrawl.create_session

begin
  # All crawls with this session_id share the same browser context
  result1 = RubyCrawl.crawl("https://example.com/page1", session_id: session_id)
  result2 = RubyCrawl.crawl("https://example.com/page2", session_id: session_id)
  # Browser state (cookies, localStorage) persists between crawls
ensure
  # Always destroy the session when done
  RubyCrawl.destroy_session(session_id)
end
```

When to use sessions:
- Multiple sequential crawls to the same domain (better performance)
- Preserving cookies/state set by the site between page visits
- Avoiding browser context creation overhead
Important: Sessions are for performance optimization only. RubyCrawl is designed for crawling public websites. It does not provide authentication or login functionality for protected content.
Note: `crawl_site` automatically creates and manages a session internally, so you don't need manual session management for multi-page crawling.
Session lifecycle:
- Sessions automatically expire after 30 minutes of inactivity
- Sessions are cleaned up every 5 minutes
- Always call `destroy_session` when done to free resources immediately (a wrapper that guarantees this is sketched below)
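A small convenience wrapper can make the create/destroy pairing automatic. This helper is not part of the gem; it is just a pattern built on the public API shown above:

```ruby
# Hypothetical helper, not provided by RubyCrawl: guarantees the
# session is destroyed even if a crawl raises.
def with_crawl_session
  session_id = RubyCrawl.create_session
  yield session_id
ensure
  RubyCrawl.destroy_session(session_id) if session_id
end

with_crawl_session do |sid|
  RubyCrawl.crawl("https://example.com/page1", session_id: sid)
  RubyCrawl.crawl("https://example.com/page2", session_id: sid)
end
```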
The crawl result is a `RubyCrawl::Result` object with these attributes:

```ruby
result = RubyCrawl.crawl("https://example.com")

result.html      # String: Raw HTML content from page
result.markdown  # String: Markdown conversion (lazy-loaded on first access)
result.links     # Array: Extracted links with url and text
result.text      # String: Plain text (coming soon)
result.metadata  # Hash: Comprehensive metadata (see below)
```

Links are extracted with full metadata:
```ruby
result.links
# => [
#   {
#     "url"   => "https://example.com/about",
#     "text"  => "About Us",
#     "title" => "Learn more about us",  # <a title="...">
#     "rel"   => nil                     # <a rel="nofollow">
#   },
#   {
#     "url"   => "https://example.com/contact",
#     "text"  => "Contact",
#     "title" => nil,
#     "rel"   => "nofollow"
#   },
#   ...
# ]
```

Note: URLs are automatically converted to absolute URLs by the browser, so relative links like `/about` become `https://example.com/about`.
Markdown is lazy-loaded – conversion only happens when you access `.markdown`:

```ruby
result = RubyCrawl.crawl(url)
result.html      # No overhead
result.markdown  # Conversion happens here (first call only)
result.markdown  # Cached, instant
```

Uses reverse_markdown with GitHub-flavored output.
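If you ever need the same conversion outside a crawl result, the reverse_markdown gem can be called directly. A minimal sketch (the exact options RubyCrawl passes internally are not documented here, so this is just standard reverse_markdown usage):

```ruby
require "reverse_markdown"

html = "<h1>Example Domain</h1><p>This domain is for examples.</p>"

# github_flavored: true enables GitHub-flavored Markdown output
ReverseMarkdown.convert(html, github_flavored: true)
# => "# Example Domain\n\nThis domain is for examples.\n" (approximately)
```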
The metadata hash includes HTTP and HTML metadata:
```ruby
result.metadata
# => {
#   "status"              => 200,            # HTTP status code
#   "final_url"           => "https://...",  # Final URL after redirects
#   "title"               => "Page Title",   # <title> tag
#   "description"         => "...",          # Meta description
#   "keywords"            => "ruby, web",    # Meta keywords
#   "author"              => "Author Name",  # Meta author
#   "og_title"            => "...",          # Open Graph title
#   "og_description"      => "...",          # Open Graph description
#   "og_image"            => "https://...",  # Open Graph image
#   "og_url"              => "https://...",  # Open Graph URL
#   "og_type"             => "website",      # Open Graph type
#   "twitter_card"        => "summary",      # Twitter card type
#   "twitter_title"       => "...",          # Twitter title
#   "twitter_description" => "...",          # Twitter description
#   "twitter_image"       => "https://...",  # Twitter image
#   "canonical"           => "https://...",  # Canonical URL
#   "lang"                => "en",           # Page language
#   "charset"             => "UTF-8"         # Character encoding
# }
```

Note: All HTML metadata fields may be nil if not present on the page.
RubyCrawl provides specific exception classes for different error scenarios:
```ruby
begin
  result = RubyCrawl.crawl(url)
rescue RubyCrawl::ConfigurationError => e
  # Invalid URL or configuration
  puts "Configuration error: #{e.message}"
rescue RubyCrawl::TimeoutError => e
  # Page load timeout or network timeout
  puts "Timeout: #{e.message}"
rescue RubyCrawl::NavigationError => e
  # Page navigation failed (404, DNS error, SSL error, etc.)
  puts "Navigation failed: #{e.message}"
rescue RubyCrawl::ServiceError => e
  # Node service unavailable or crashed
  puts "Service error: #{e.message}"
rescue RubyCrawl::Error => e
  # Catch-all for any RubyCrawl error
  puts "Crawl error: #{e.message}"
end
```

Exception Hierarchy:

- `RubyCrawl::Error` (base class)
  - `RubyCrawl::ConfigurationError` – Invalid URL or configuration
  - `RubyCrawl::TimeoutError` – Timeout during crawl
  - `RubyCrawl::NavigationError` – Page navigation failed
  - `RubyCrawl::ServiceError` – Node service issues
Automatic Retry: RubyCrawl automatically retries transient failures (service errors, timeouts) up to 3 times with exponential backoff (2s, 4s, 8s). Configure with:
```ruby
RubyCrawl.configure(max_retries: 5)

# or per-request
RubyCrawl.crawl(url, retries: 1)  # Disable retry
```
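The retry behavior described above amounts to the following loop. This is an illustrative sketch of exponential backoff, not the gem's actual source; the set of rescued exceptions follows the "transient failures" the docs name:

```ruby
# Illustrative sketch of retry with exponential backoff (2s, 4s, 8s).
def crawl_with_retry(url, max_retries: 3)
  attempt = 0
  begin
    RubyCrawl.crawl(url)
  rescue RubyCrawl::ServiceError, RubyCrawl::TimeoutError => e
    attempt += 1
    raise e if attempt > max_retries
    sleep(2**attempt) # 2s, then 4s, then 8s
    retry
  end
end
```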
Run the installer in your Rails app:

```bash
bundle exec rake rubycrawl:install
```

This creates `config/initializers/rubycrawl.rb`:
```ruby
# frozen_string_literal: true

# rubycrawl default configuration
RubyCrawl.configure(
  wait_until: "load",
  block_resources: true
)
```

```ruby
# In a controller, service, or background job
class ContentScraperJob < ApplicationJob
  def perform(url)
    result = RubyCrawl.crawl(url)

    # Save to database
    ScrapedContent.create!(
      url: url,
      html: result.html,
      status: result.metadata["status"]  # metadata keys are strings
    )
  end
end
```

- Install Node.js on your production servers (LTS version recommended)
- Run the installer during deployment: `bundle exec rake rubycrawl:install`
- Set environment variables (optional):

```bash
export RUBYCRAWL_NODE_BIN=/usr/bin/node           # Custom Node.js path
export RUBYCRAWL_NODE_LOG=/var/log/rubycrawl.log  # Service logs
```
```dockerfile
FROM ruby:3.2

# Install Node.js LTS
RUN curl -fsSL https://deb.nodesource.com/setup_lts.x | bash - \
    && apt-get install -y nodejs

# Install system dependencies for Playwright
RUN npx playwright install-deps

WORKDIR /app
COPY Gemfile* ./
RUN bundle install

# Install Playwright browsers
RUN bundle exec rake rubycrawl:install

COPY . .
CMD ["rails", "server"]
```

Add the Node.js buildpack:
```bash
heroku buildpacks:add heroku/nodejs
heroku buildpacks:add heroku/ruby
```

Add to `package.json` in your Rails root:

```json
{
  "engines": {
    "node": "18.x"
  }
}
```

RubyCrawl uses a simple architecture:
- Ruby Gem provides the public API and handles orchestration
- Node.js Service (bundled, auto-started) manages Playwright browsers
- Communication via HTTP/JSON on localhost
This design keeps things stable and easy to debug. The browser runs in a separate process, so crashes won't affect your Ruby application.
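To make the transport concrete, the Ruby-to-Node exchange looks conceptually like a JSON POST to the local service. This is a hypothetical sketch only: the actual endpoint path, port, and payload fields are internal to the gem and assumed here for illustration:

```ruby
require "net/http"
require "json"

# Hypothetical illustration of the localhost HTTP/JSON transport.
# The endpoint, port, and field names are assumptions, not the real protocol.
uri = URI("http://127.0.0.1:4567/crawl")
response = Net::HTTP.post(
  uri,
  { url: "https://example.com", wait_until: "load" }.to_json,
  "Content-Type" => "application/json"
)
payload = JSON.parse(response.body) # e.g. { "html" => "...", "metadata" => { ... } }
```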
- Resource blocking: Keep `block_resources: true` (default) for 2-3x faster crawls when you don't need images/CSS
- Wait strategy: Use `wait_until: "load"` for static sites, `"networkidle"` for SPAs
- Concurrency: Use background jobs (Sidekiq, etc.) for parallel crawling (see the sketch after this list)
- Browser reuse: The first crawl is slower (~2s) due to browser launch; subsequent crawls are much faster (~500ms)
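For the concurrency tip, a minimal ActiveJob sketch that fans crawls out across workers. The job class, queue name, and `PageSnapshot` model are placeholders for this example, not part of the gem:

```ruby
# Sketch: fan crawls out to background workers (Sidekiq, etc.).
# CrawlPageJob and PageSnapshot are hypothetical names.
class CrawlPageJob < ApplicationJob
  queue_as :crawling

  def perform(url)
    result = RubyCrawl.crawl(url)
    PageSnapshot.create!(url: url, html: result.html)
  end
end

urls = ["https://example.com/a", "https://example.com/b"]
urls.each { |url| CrawlPageJob.perform_later(url) }
```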
Want to contribute? Check out the contributor guidelines.
```bash
# Setup
git clone git@github.com:craft-wise/rubycrawl.git
cd rubycrawl
bin/setup

# Run tests
bundle exec rspec

# Manual testing
bin/console
> RubyCrawl.crawl("https://example.com")
```

Maturity Goals:
- Production battle-tested (1000+ stars, real-world usage)
- Full documentation with video tutorials
- Performance benchmarks vs. alternatives
- Migration guides from Nokogiri, Mechanize, etc.
Contributions are welcome! Please read our contribution guidelines first.
- Simplicity over cleverness: Prefer clear, explicit code
- Stability over speed: Correctness first, optimization second
- Ruby-first: Hide Node.js/Playwright complexity from users
- No vendor lock-in: Pure open source, no SaaS dependencies
RubyCrawl stands out in the Ruby ecosystem with its unique combination of features:
- Idiomatic Ruby API – Feels natural to Rubyists, no need to learn Playwright
- Rails-first design – Generators, initializers, and ActiveJob integration out of the box
- Modular architecture – Clean, testable code following Ruby best practices
- Automatic retry with exponential backoff for transient failures
- Smart error handling with custom exception hierarchy
- Process isolation – Browser crashes don't affect your Ruby application
- Battle-tested – Built on Playwright's proven browser automation
- Zero configuration – Works immediately after installation
- Lazy loading – Markdown conversion only when you need it
- Smart URL handling – Automatic normalization and deduplication
- Comprehensive docs – Clear examples for common use cases
- JavaScript-enabled crawling (SPAs, AJAX, dynamic content)
- Multi-page crawling with BFS algorithm
- Link extraction with metadata (url, text, title, rel)
- Markdown conversion (GitHub-flavored)
- Metadata extraction (OG tags, Twitter cards, etc.)
- Resource blocking for 2-3x performance boost
- RAG applications – Build AI knowledge bases from documentation
- Data aggregation – Extract structured data from multiple pages
- Content migration – Convert sites to Markdown for static generators
- SEO analysis – Extract metadata and link structures
- Testing – Verify deployed site content and structure
The gem is available as open source under the terms of the MIT License.
Built with Playwright by Microsoft – the industry-standard browser automation framework.
Powered by reverse_markdown for GitHub-flavored Markdown conversion.
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: ganesh.navale@zohomail.in
Special thanks to:
- Microsoft Playwright team for the robust, production-grade browser automation framework
- The Ruby community for building an ecosystem that values developer happiness and code clarity
- The Node.js community for excellent tooling and libraries that make cross-language integration seamless
- Open source contributors worldwide who make projects like this possible