RubyCrawl 🎭

Production-ready web crawler for Ruby powered by Playwright – bringing the power of modern browser automation to the Ruby ecosystem with first-class Rails support.

RubyCrawl provides accurate, JavaScript-enabled web scraping using Playwright's battle-tested browser automation, wrapped in a clean Ruby API. It's perfect for extracting content from modern SPAs and dynamic websites, and for building RAG knowledge bases.

Why RubyCrawl?

  • ✅ Real browser – Handles JavaScript, AJAX, and SPAs correctly
  • ✅ Zero config – Works out of the box, no Playwright knowledge needed
  • ✅ Production-ready – Auto-retry, error handling, resource optimization
  • ✅ Multi-page crawling – BFS algorithm with smart URL deduplication
  • ✅ Rails-friendly – Generators, initializers, and ActiveJob integration
  • ✅ Modular architecture – Clean, testable, maintainable codebase

Features

  • 🎭 Playwright-powered: Real browser automation for JavaScript-heavy sites and SPAs
  • 🚀 Production-ready: Designed for Rails apps and production environments with auto-retry and error handling
  • 🎯 Simple API: Clean, minimal Ruby interface – zero Playwright or Node.js knowledge required
  • ⚡ Resource optimization: Built-in resource blocking for 2-3x faster crawls
  • 🔄 Auto-managed browsers: Browser process reuse and automatic lifecycle management
  • 📄 Content extraction: HTML, links (with metadata), and lazy-loaded Markdown conversion
  • 🌐 Multi-page crawling: BFS (breadth-first search) crawler with configurable depth limits and URL deduplication
  • 🛡️ Smart URL handling: Automatic normalization, tracking parameter removal, and same-host filtering
  • 🔧 Rails integration: First-class Rails support with generators and initializers
  • 💎 Modular design: Clean separation of concerns with focused, testable modules


Installation

Requirements

  • Ruby >= 3.0
  • Node.js LTS (v18+ recommended) – required for the bundled Playwright service

Add to Gemfile

gem "rubycrawl"

Then install:

bundle install

Install Playwright browsers

After bundling, install the Playwright browsers:

bundle exec rake rubycrawl:install

This command:

  • ✅ Installs Node.js dependencies in the bundled node/ directory
  • ✅ Downloads Playwright browsers (Chromium, Firefox, WebKit) – ~300MB download
  • ✅ Creates a Rails initializer (if using Rails)

Note: You only need to run this once. The installation task is idempotent and safe to run multiple times.

Troubleshooting installation:

# If installation fails, check Node.js version
node --version  # Should be v18+ LTS

# Enable verbose logging
RUBYCRAWL_NODE_LOG=/tmp/rubycrawl.log bundle exec rake rubycrawl:install

# Check installation status
cd node && npm list

Quick Start

require "rubycrawl"

# Simple crawl
result = RubyCrawl.crawl("https://example.com")

# Access extracted content
puts result.html      # Raw HTML content
puts result.markdown  # Converted to Markdown
puts result.links     # Extracted links from the page
puts result.metadata  # Status code, final URL, etc.

Use Cases

RubyCrawl is perfect for:

  • 📊 Data aggregation: Crawl product catalogs, job listings, or news articles
  • 🤖 RAG applications: Build knowledge bases for LLM/AI applications by crawling documentation sites
  • 🔍 SEO analysis: Extract metadata, links, and content structure
  • 📱 Content migration: Convert existing sites to Markdown for static site generators
  • 🧪 Testing: Verify deployed site structure and content
  • 📚 Documentation scraping: Create local copies of documentation with preserved links

Usage

Basic Crawling

The simplest way to crawl a URL:

result = RubyCrawl.crawl("https://example.com")

# Access the results
result.html      # => "<html>...</html>"
result.markdown  # => "# Example Domain\n\nThis domain is..." (lazy-loaded)
result.links     # => [{ "url" => "https://...", "text" => "More info" }, ...]
result.metadata  # => { "status" => 200, "final_url" => "https://example.com" }
result.text      # => "" (coming soon)

Multi-Page Crawling

Crawl an entire site following links with BFS (breadth-first search):

# Crawl up to 100 pages, max 3 links deep
RubyCrawl.crawl_site("https://example.com", max_pages: 100, max_depth: 3) do |page|
  # Each page is yielded as it's crawled (streaming)
  puts "Crawled: #{page.url} (depth: #{page.depth})"

  # Save to database
  Page.create!(
    url: page.url,
    html: page.html,
    markdown: page.markdown,
    depth: page.depth
  )
end

Real-world example: Building a RAG knowledge base

# Crawl documentation site for AI/RAG application
require "rubycrawl"

RubyCrawl.configure(
  wait_until: "networkidle",  # Ensure JS content loads
  block_resources: true       # Skip images/fonts for speed
)

pages_crawled = RubyCrawl.crawl_site(
  "https://docs.example.com",
  max_pages: 500,
  max_depth: 5,
  same_host_only: true
) do |page|
  # Store in vector database for RAG
  VectorDB.upsert(
    id: Digest::SHA256.hexdigest(page.url),
    content: page.markdown,  # Clean markdown for better embeddings
    metadata: {
      url: page.url,
      title: page.metadata["title"],
      depth: page.depth
    }
  )

  puts "βœ“ Indexed: #{page.metadata['title']} (#{page.depth} levels deep)"
end

puts "Crawled #{pages_crawled} pages into knowledge base"

Multi-Page Options

Option           Default    Description
max_pages        50         Maximum number of pages to crawl
max_depth        3          Maximum link depth from the start URL
same_host_only   true       Only follow links on the same domain
wait_until       inherited  Page load strategy
block_resources  inherited  Block images/fonts/CSS
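
For example, the inherited options pick up whatever RubyCrawl.configure set, and everything can still be passed per call; a minimal sketch using the options above:

RubyCrawl.configure(wait_until: "load", block_resources: true)

RubyCrawl.crawl_site(
  "https://example.com",
  max_pages: 25,              # stop after 25 pages
  max_depth: 2,               # follow links at most 2 hops from the start URL
  same_host_only: true,       # ignore links pointing to other domains
  wait_until: "networkidle"   # override the inherited strategy for this crawl only
) do |page|
  puts "#{page.depth}: #{page.url}"
end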

Page Result Object

The block receives a PageResult with:

page.url       # String: Final URL after redirects
page.html      # String: Full HTML content
page.markdown  # String: Lazy-converted Markdown
page.links     # Array: URLs extracted from page
page.metadata  # Hash: HTTP status, final URL, etc.
page.depth     # Integer: Link depth from start URL

Configuration

Global Configuration

Set default options that apply to all crawls:

RubyCrawl.configure(
  wait_until: "networkidle",  # Wait until network is idle
  block_resources: true        # Block images, fonts, CSS for speed
)

# All subsequent crawls use these defaults
result = RubyCrawl.crawl("https://example.com")

Per-Request Options

Override defaults for specific requests:

# Use global defaults
result = RubyCrawl.crawl("https://example.com")

# Override for this request only
result = RubyCrawl.crawl(
  "https://example.com",
  wait_until: "domcontentloaded",
  block_resources: false
)

Configuration Options

Option           Values                                      Default  Description
wait_until       "load", "domcontentloaded", "networkidle"   "load"   When to consider the page loaded
block_resources  true, false                                 true     Block images, fonts, CSS, and media for faster crawls

Wait strategies explained:

  • load – Wait for the load event (fastest; good for static sites)
  • domcontentloaded – Wait for DOM ready (medium speed)
  • networkidle – Wait until there have been no network requests for 500ms (slowest; best for SPAs); see the sketch below
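
A small sketch matching the strategy to the site (URLs are illustrative), using the per-request override shown earlier:

# Static page: the load event fires early and is the fastest option
static = RubyCrawl.crawl("https://example.com", wait_until: "load")

# JavaScript-rendered SPA: wait for the network to go quiet so
# client-side rendering has a chance to finish
spa = RubyCrawl.crawl("https://spa.example.com", wait_until: "networkidle")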

Advanced Usage

Session-Based Crawling

Sessions allow reusing browser contexts for better performance when crawling multiple pages. They're automatically used by crawl_site, but you can manage them manually for advanced use cases:

# Create a session (reusable browser context)
session_id = RubyCrawl.create_session

begin
  # All crawls with this session_id share the same browser context
  result1 = RubyCrawl.crawl("https://example.com/page1", session_id: session_id)
  result2 = RubyCrawl.crawl("https://example.com/page2", session_id: session_id)
  # Browser state (cookies, localStorage) persists between crawls
ensure
  # Always destroy session when done
  RubyCrawl.destroy_session(session_id)
end

When to use sessions:

  • Multiple sequential crawls to the same domain (better performance)
  • Preserving cookies/state set by the site between page visits
  • Avoiding browser context creation overhead

Important: Sessions are for performance optimization only. RubyCrawl is designed for crawling public websites. It does not provide authentication or login functionality for protected content.

Note: crawl_site automatically creates and manages a session internally, so you don't need manual session management for multi-page crawling.

Session lifecycle:

  • Sessions automatically expire after 30 minutes of inactivity
  • Sessions are cleaned up every 5 minutes
  • Always call destroy_session when done to free resources immediately

Result Object

The crawl result is a RubyCrawl::Result object with these attributes:

result = RubyCrawl.crawl("https://example.com")

result.html      # String: Raw HTML content from page
result.markdown  # String: Markdown conversion (lazy-loaded on first access)
result.links     # Array: Extracted links with url and text
result.text      # String: Plain text (coming soon)
result.metadata  # Hash: Comprehensive metadata (see below)

Links Format

Links are extracted with full metadata:

result.links
# => [
#   {
#     "url" => "https://example.com/about",
#     "text" => "About Us",
#     "title" => "Learn more about us",  # <a title="...">
#     "rel" => nil                        # <a rel="nofollow">
#   },
#   {
#     "url" => "https://example.com/contact",
#     "text" => "Contact",
#     "title" => null,
#     "rel" => "nofollow"
#   },
#   ...
# ]

Note: URLs are automatically converted to absolute URLs by the browser, so relative links like /about become https://example.com/about.
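
For example, to keep only followable links, a sketch built on the link fields above:

result = RubyCrawl.crawl("https://example.com")

# Drop links marked rel="nofollow"; rel may be nil, so coerce it to a string first
followable = result.links.reject { |link| link["rel"].to_s.include?("nofollow") }

followable.each do |link|
  puts "#{link['text']} -> #{link['url']}"
end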

Markdown Conversion

Markdown is lazy-loaded – conversion only happens when you access .markdown:

result = RubyCrawl.crawl(url)
result.html       # ✅ No overhead
result.markdown   # ⬅️ Conversion happens here (first call only)
result.markdown   # ✅ Cached, instant

Uses reverse_markdown with GitHub-flavored output.
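
This pairs naturally with the content-migration use case: write each page's Markdown to disk, paying the conversion cost once per page. A sketch (the export/ layout is just an example):

require "fileutils"
require "uri"
require "rubycrawl"

RubyCrawl.crawl_site("https://example.com", max_pages: 50) do |page|
  # Derive a file path from the URL path, e.g. /docs/intro -> export/docs/intro.md
  path = URI(page.url).path.chomp("/")
  path = "/index" if path.empty?
  file = File.join("export", "#{path}.md")

  FileUtils.mkdir_p(File.dirname(file))
  File.write(file, page.markdown)  # lazy conversion happens here
end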

Metadata Fields

The metadata hash includes HTTP and HTML metadata:

result.metadata
# => {
#   "status" => 200,                 # HTTP status code
#   "final_url" => "https://...",    # Final URL after redirects
#   "title" => "Page Title",         # <title> tag
#   "description" => "...",          # Meta description
#   "keywords" => "ruby, web",       # Meta keywords
#   "author" => "Author Name",       # Meta author
#   "og_title" => "...",             # Open Graph title
#   "og_description" => "...",       # Open Graph description
#   "og_image" => "https://...",     # Open Graph image
#   "og_url" => "https://...",       # Open Graph URL
#   "og_type" => "website",          # Open Graph type
#   "twitter_card" => "summary",     # Twitter card type
#   "twitter_title" => "...",        # Twitter title
#   "twitter_description" => "...",  # Twitter description
#   "twitter_image" => "https://...",# Twitter image
#   "canonical" => "https://...",    # Canonical URL
#   "lang" => "en",                  # Page language
#   "charset" => "UTF-8"             # Character encoding
# }

Note: All HTML metadata fields may be nil if not present on the page.
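
Since any of these fields can be nil, fall back explicitly when you need a value, for instance when building a link preview (a sketch):

meta = RubyCrawl.crawl("https://example.com").metadata

# Prefer Open Graph data, then standard tags, then the URL itself
title       = meta["og_title"] || meta["title"] || meta["final_url"]
description = meta["og_description"] || meta["description"] || ""

puts "#{title}: #{description}"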

Error Handling

RubyCrawl provides specific exception classes for different error scenarios:

begin
  result = RubyCrawl.crawl(url)
rescue RubyCrawl::ConfigurationError => e
  # Invalid URL or configuration
  puts "Configuration error: #{e.message}"
rescue RubyCrawl::TimeoutError => e
  # Page load timeout or network timeout
  puts "Timeout: #{e.message}"
rescue RubyCrawl::NavigationError => e
  # Page navigation failed (404, DNS error, SSL error, etc.)
  puts "Navigation failed: #{e.message}"
rescue RubyCrawl::ServiceError => e
  # Node service unavailable or crashed
  puts "Service error: #{e.message}"
rescue RubyCrawl::Error => e
  # Catch-all for any RubyCrawl error
  puts "Crawl error: #{e.message}"
end

Exception Hierarchy:

  • RubyCrawl::Error (base class)
    • RubyCrawl::ConfigurationError - Invalid URL or configuration
    • RubyCrawl::TimeoutError - Timeout during crawl
    • RubyCrawl::NavigationError - Page navigation failed
    • RubyCrawl::ServiceError - Node service issues

Automatic Retry: RubyCrawl automatically retries transient failures (service errors, timeouts) up to 3 times with exponential backoff (2s, 4s, 8s). Configure with:

RubyCrawl.configure(max_retries: 5)
# or per-request
RubyCrawl.crawl(url, retries: 1)  # Disable retry

Rails Integration

Installation

Run the installer in your Rails app:

bundle exec rake rubycrawl:install

This creates config/initializers/rubycrawl.rb:

# frozen_string_literal: true

# rubycrawl default configuration
RubyCrawl.configure(
  wait_until: "load",
  block_resources: true
)

Usage in Rails

# In a controller, service, or background job
class ContentScraperJob < ApplicationJob
  def perform(url)
    result = RubyCrawl.crawl(url)

    # Save to database
    ScrapedContent.create!(
      url: url,
      html: result.html,
      status: result.metadata["status"]
    )
  end
end
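
The exception hierarchy (see Error Handling above) also composes with ActiveJob's retry hooks; a sketch, where the retry policy is just an example:

class ContentScraperJob < ApplicationJob
  # Queue-level backoff for timeouts that survive RubyCrawl's own retries
  retry_on RubyCrawl::TimeoutError, wait: 30.seconds, attempts: 5

  # An invalid URL will never succeed, so drop the job instead of retrying
  discard_on RubyCrawl::ConfigurationError

  def perform(url)
    result = RubyCrawl.crawl(url)
    ScrapedContent.create!(url: url, html: result.html)
  end
end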

Production Deployment

Pre-deployment Checklist

  1. Install Node.js on your production servers (LTS version recommended)
  2. Run installer during deployment:
    bundle exec rake rubycrawl:install
  3. Set environment variables (optional):
    export RUBYCRAWL_NODE_BIN=/usr/bin/node  # Custom Node.js path
    export RUBYCRAWL_NODE_LOG=/var/log/rubycrawl.log  # Service logs

Docker Example

FROM ruby:3.2

# Install Node.js LTS
RUN curl -fsSL https://deb.nodesource.com/setup_lts.x | bash - \
    && apt-get install -y nodejs

# Install system dependencies for Playwright
RUN npx playwright install-deps

WORKDIR /app
COPY Gemfile* ./
RUN bundle install

# Install Playwright browsers
RUN bundle exec rake rubycrawl:install

COPY . .
CMD ["rails", "server"]

Heroku Deployment

Add the Node.js buildpack:

heroku buildpacks:add heroku/nodejs
heroku buildpacks:add heroku/ruby

Add to package.json in your Rails root:

{
  "engines": {
    "node": "18.x"
  }
}

How It Works

RubyCrawl uses a simple architecture:

  • Ruby Gem provides the public API and handles orchestration
  • Node.js Service (bundled, auto-started) manages Playwright browsers
  • Communication via HTTP/JSON on localhost

This design keeps things stable and easy to debug. The browser runs in a separate process, so crashes won't affect your Ruby application.

Performance Tips

  • Resource blocking: Keep block_resources: true (default) for 2-3x faster crawls when you don't need images/CSS
  • Wait strategy: Use wait_until: "load" for static sites, "networkidle" for SPAs
  • Concurrency: Use background jobs (Sidekiq, etc.) for parallel crawling; see the sketch after this list
  • Browser reuse: The first crawl is slower (~2s) due to browser launch; subsequent crawls are much faster (~500ms)

Development

Want to contribute? Check out the contributor guidelines.

# Setup
git clone git@github.com:craft-wise/rubycrawl.git
cd rubycrawl
bin/setup

# Run tests
bundle exec rspec

# Manual testing
bin/console
> RubyCrawl.crawl("https://example.com")

💎 Long-term (v1.0.0)

Maturity Goals:

  • Production battle-tested (1000+ stars, real-world usage)
  • Full documentation with video tutorials
  • Performance benchmarks vs. alternatives
  • Migration guides from Nokogiri, Mechanize, etc.

Contributing

Contributions are welcome! Please read our contribution guidelines first.

Development Philosophy

  • Simplicity over cleverness: Prefer clear, explicit code
  • Stability over speed: Correctness first, optimization second
  • Ruby-first: Hide Node.js/Playwright complexity from users
  • No vendor lock-in: Pure open source, no SaaS dependencies

Why Choose RubyCrawl?

RubyCrawl stands out in the Ruby ecosystem with its unique combination of features:

🎯 Built for Ruby Developers

  • Idiomatic Ruby API – Feels natural to Rubyists, no need to learn Playwright
  • Rails-first design – Generators, initializers, and ActiveJob integration out of the box
  • Modular architecture – Clean, testable code following Ruby best practices

🚀 Production-Grade Reliability

  • Automatic retry with exponential backoff for transient failures
  • Smart error handling with custom exception hierarchy
  • Process isolation – Browser crashes don't affect your Ruby application
  • Battle-tested – Built on Playwright's proven browser automation

💎 Developer Experience

  • Zero configuration – Works immediately after installation
  • Lazy loading – Markdown conversion only when you need it
  • Smart URL handling – Automatic normalization and deduplication
  • Comprehensive docs – Clear examples for common use cases

🌐 Rich Feature Set

  • ✅ JavaScript-enabled crawling (SPAs, AJAX, dynamic content)
  • ✅ Multi-page crawling with BFS algorithm
  • ✅ Link extraction with metadata (url, text, title, rel)
  • ✅ Markdown conversion (GitHub-flavored)
  • ✅ Metadata extraction (OG tags, Twitter cards, etc.)
  • ✅ Resource blocking for 2-3x performance boost

📊 Perfect for Modern Use Cases

  • RAG applications – Build AI knowledge bases from documentation
  • Data aggregation – Extract structured data from multiple pages
  • Content migration – Convert sites to Markdown for static generators
  • SEO analysis – Extract metadata and link structures
  • Testing – Verify deployed site content and structure

License

The gem is available as open source under the terms of the MIT License.

Credits

Built with Playwright by Microsoft – the industry-standard browser automation framework.

Powered by reverse_markdown for GitHub-flavored Markdown conversion.

Acknowledgments

Special thanks to:

  • Microsoft Playwright team for the robust, production-grade browser automation framework
  • The Ruby community for building an ecosystem that values developer happiness and code clarity
  • The Node.js community for excellent tooling and libraries that make cross-language integration seamless
  • Open source contributors worldwide who make projects like this possible
