parsanol/parsanol-ruby

Parsanol

A high-performance PEG (Parsing Expression Grammar) parser construction library for Ruby with optional Rust native extensions.

Purpose

Parsanol provides a declarative DSL for constructing parsers using PEG semantics. It offers excellent error reporting, memory efficiency through object pooling, and optional Rust native extensions for maximum performance. The library is designed as a drop-in replacement for Parslet while offering significant performance improvements.

Note

Parsanol is inspired by the Parslet library by Kaspar Schiess. While maintaining full API compatibility with Parslet, Parsanol features a complete independent implementation with additional performance optimizations and features.

Features

Installation

Add this line to your application’s Gemfile:

gem 'parsanol'

And then execute:

bundle install

Or install it yourself as:

gem install parsanol

Usage

Basic Parser

Define parsers by creating a class that inherits from Parsanol::Parser and declaring rules:

require 'parsanol'

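class MyParser < Parsanol::Parser
  rule(:keyword) { str('if') | str('while') }
  rule(:identifier) { match('[a-z]').repeat(1) }
  # An expression is a keyword applied to a nested expression or an identifier.
  rule(:expression) { keyword >> str('(') >> (expression | identifier) >> str(')') }
  root(:expression)
end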

parser = MyParser.new
result = parser.parse('if(x)')
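
To make the PEG semantics behind >> (sequence) and | (ordered choice) concrete, here is a small standalone sketch in plain Ruby. It does not use Parsanol; the Atom class and this str helper are hypothetical stand-ins that only track how far a match got.

```ruby
# Standalone sketch of PEG sequence (>>) and ordered choice (|).
# Atom and str are hypothetical, not Parsanol classes.
class Atom
  def initialize(&matcher)
    @matcher = matcher # (input, pos) -> new position, or nil on failure
  end

  def match(input, pos = 0)
    @matcher.call(input, pos)
  end

  # Sequence: the right side starts where the left side stopped.
  def >>(other)
    Atom.new do |input, pos|
      mid = match(input, pos)
      mid && other.match(input, mid)
    end
  end

  # Ordered choice: try the left side first; use the right side only on failure.
  def |(other)
    Atom.new { |input, pos| match(input, pos) || other.match(input, pos) }
  end
end

def str(literal)
  Atom.new { |input, pos| input[pos, literal.length] == literal ? pos + literal.length : nil }
end

keyword = str('if') | str('while')
call    = keyword >> str('(') >> str('x') >> str(')')

call.match('if(x)')     # => 5 (all five characters consumed)
call.match('while(x)')  # => 8
call.match('for(x)')    # => nil

# Ordered choice commits to the first alternative that matches:
(str('a') | str('ab')).match('ab')  # => 1, the 'ab' branch is never tried
```

The last line is the practical consequence of ordered choice: alternatives that share a prefix should usually be listed longest first.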

Error Reporting

Parsanol provides detailed error messages when parsing fails:

begin
  parser.parse('invalid input')
rescue Parsanol::ParseFailed => e
  puts e.message
  # => "Expected 'if' at line 1 char 1."
end

Transformation

Convert parse trees to AST using pattern-based transformations:

class MyTransform < Parsanol::Transform
  rule(keyword: simple(:k)) { KeywordNode.new(k) }
  rule(expression: subtree(:e)) { ExpressionNode.new(e) }
end

ast = MyTransform.new.apply(parse_tree)
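
The idea of a pattern-based, bottom-up rewrite can be sketched without the library. The MiniTransform class below is hypothetical and far simpler than Parsanol::Transform: it only matches hashes with exactly one key, and it assumes children are rewritten before their parent.

```ruby
# Standalone sketch of a bottom-up, pattern-based tree rewrite.
# MiniTransform is hypothetical; it only handles single-key hash patterns.
class MiniTransform
  def initialize
    @rules = {}
  end

  # Register a block to run for hashes of the form { key => value }.
  def rule(key, &block)
    @rules[key] = block
    self
  end

  def apply(tree)
    case tree
    when Array
      tree.map { |item| apply(item) }
    when Hash
      # Children are transformed first, then the node itself.
      node = tree.transform_values { |value| apply(value) }
      key, value = node.first
      node.size == 1 && @rules.key?(key) ? @rules[key].call(value) : node
    else
      tree
    end
  end
end

transform = MiniTransform.new
  .rule(:int) { |digits| Integer(digits) }
  .rule(:sum) { |terms| terms.sum }

transform.apply({ sum: [{ int: '1' }, { int: '2' }, { int: '39' }] })  # => 42
```

Nodes that match no rule pass through unchanged, which is why a transform can be built up one rule at a time.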

Native Extension

For maximum performance, compile the Rust native extension:

# Install Rust toolchain first
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Compile the extension
bundle exec rake compile

Slice Support

All parse results include source position information through Parsanol::Slice objects:

# Parse returns results with position info
input = "hello world"
result = parser.parse(input, mode: :native)
name = result[:name]

# Access the value
name.to_s     # => "hello"

# Access position information
name.offset   # => 0 (byte offset in original input)
name.length   # => 5
name.line_and_column  # => [1, 1] (1-indexed)

# Compare with strings (Slice compares by content)
name == "hello"  # => true

# Extract from original source
name.extract_from(input)  # => "hello"

JSON Output with Position Info

When using JSON mode, position information is included inline with each value:

result = parser.parse("hello", mode: :json)
# => {
#   "name": {
#     "value": "hello",
#     "offset": 0,
#     "length": 5,
#     "line": 1,
#     "column": 1
#   }
# }

This format ensures position information is available for all downstream consumers including IDEs, linters, and error reporting tools.
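
As a rough illustration of a downstream consumer, the sketch below reads a payload in the shape shown above using only Ruby's standard json library. The payload is a hand-written sample rather than real parser output, and format_diagnostic is a hypothetical helper.

```ruby
require 'json'

# Sample payload copied from the documented JSON-mode shape above.
payload = <<~JSON
  { "name": { "value": "hello", "offset": 0, "length": 5, "line": 1, "column": 1 } }
JSON

# format_diagnostic is a hypothetical helper, not part of Parsanol.
def format_diagnostic(field, node, message)
  "#{node['line']}:#{node['column']}: #{message} (#{field} = #{node['value'].inspect})"
end

tree = JSON.parse(payload)
puts format_diagnostic('name', tree['name'], 'unknown identifier')
# prints: 1:1: unknown identifier (name = "hello")
```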

Slice API

class Parsanol::Slice
  # Core attributes
  def content       # String content
  def offset        # Byte offset in original input
  def length        # Length of the slice
  def line_and_column  # [line, column] tuple (requires line cache)

  # String compatibility
  def to_s          # Returns content
  def to_str        # Implicit string conversion
  def ==(other)     # Compares content with String or Slice

  # JSON serialization
  def to_json       # Returns { "value" => ..., "offset" => ..., ... }
  def as_json       # Returns hash with position info

  # Utility
  def to_span(input)  # Returns SourceSpan object
  def extract_from(input)  # Extracts content from original input
end

Position information is essential for:

  • Linters - Map errors back to source locations

  • IDEs - Provide go-to-definition, hover info

  • Comment attachment - Attach remarks to AST nodes by position

  • Source extraction - Get original text for any parsed element
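
For readers who want to see how a byte offset maps to a line and column, here is a minimal standalone sketch. MiniSlice is hypothetical: it borrows the documented method names but is not Parsanol::Slice, and it assumes columns are counted in bytes.

```ruby
# Standalone sketch of a position-carrying slice; MiniSlice is hypothetical
# and only mirrors the documented method names.
class MiniSlice
  attr_reader :content, :offset

  def initialize(content, offset, source)
    @content = content
    @offset = offset # byte offset into source
    @source = source
  end

  def length
    content.bytesize
  end

  def to_s
    content
  end

  # Compare by content, so a slice can stand in for a plain String.
  def ==(other)
    content == other.to_s
  end

  # 1-indexed [line, column]; the column is counted in bytes here.
  def line_and_column
    before = @source.byteslice(0, offset).b # binary view, so indexes are bytes
    line = before.count("\n") + 1
    line_start = (before.rindex("\n") || -1) + 1
    [line, offset - line_start + 1]
  end
end

source = "hello\nworld"
word = MiniSlice.new('world', 6, source)
word == 'world'       # => true
word.length           # => 5
word.line_and_column  # => [2, 1]
```

This version rescans the text before the offset on every call; a per-input cache of line-start offsets would avoid that when positions are looked up often.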

Migrating from Parslet

Parsanol provides full Parslet API compatibility with two migration modes.

Drop-in Replacement (Zero Code Changes)

Simply replace the parslet gem with parsanol in your Gemfile:

# Gemfile
- gem 'parslet'
+ gem 'parsanol'

Your existing code works without modification:

# No changes needed!
require 'parslet'  # Parsanol aliases itself

class MyParser < Parslet::Parser
  rule(:number) { match('[0-9]').repeat(1) }
  root(:number)
end

parser = MyParser.new
parser.parse('123')  # Works exactly the same

API Compatibility Matrix

Parslet API          Notes
str('foo')           Literal string match
match('[0-9]')       Character class
any                  Any single character
>> (sequence)        Sequential composition
| (choice)           Ordered choice
.repeat(n, m)        Repetition with bounds
.maybe               Optional (zero or one)
.as(:name)           Label capture
.absent?             Negative lookahead
.present?            Positive lookahead
infix_expression     Precedence climbing
exp('…')             Treetop-style expression parsing
Parslet::Transform   Tree transformation
simple(:x)           Match simple value
sequence(:x)         Match array of values
subtree(:x)          Match any subtree
Parslet::Slice       Parsanol::Slice compatible
.capture(:name)      Named capture extraction (NEW in 1.2.0)
scope { }            Isolated capture context (NEW in 1.2.0)
dynamic { |ctx| }    Runtime-determined parsing (NEW in 1.2.0)

Note
The new capture, scope, and dynamic atoms provide powerful extraction and context-sensitive parsing capabilities. See the Captures section for details.

Architecture

Parsanol architecture overview
                        ┌─────────────────────────────────────┐
                        │           User Parser               │
                        │   (inherits from Parsanol::Parser)  │
                        └─────────────────┬───────────────────┘
                                          │
                        ┌─────────────────▼───────────────────┐
                        │         Parsing Backend             │
                        ├─────────────────┬───────────────────┤
                        │   Pure Ruby     │   Rust Native     │
                        │   (default)     │   (optional)      │
                        └─────────────────┴───────────────────┘
                                          │
                        ┌─────────────────▼───────────────────┐
                        │         Parse Tree                  │
                        │   (with Slice position info)        │
                        └─────────────────┬───────────────────┘
                                          │
                        ┌─────────────────▼───────────────────┐
                        │      Parsanol::Transform            │
                        │   (pattern-based transformation)    │
                        └─────────────────┬───────────────────┘
                                          │
                        ┌─────────────────▼───────────────────┐
                        │            User AST                 │
                        └─────────────────────────────────────┘

Parse Modes

Parsanol offers three parsing modes through the parse method. Every mode carries position information: the :ruby and :native modes return Parsanol::Slice objects, while :json returns hashes with inline position fields:

result = parser.parse(input, mode: :native)  # mode is optional, :native is default

Mode      Backend    Keys    Values           Best For
:ruby     Pure Ruby  Symbol  Slice            Debugging, fallback
:native   Rust FFI   Symbol  Slice            Production (DEFAULT)
:json     Rust FFI   String  Hash + position  APIs, serialization

All modes include position info (offset, length, line, column) by default.

Mode Details

Ruby Mode (:ruby)

Pure Ruby parsing engine. Use for debugging grammar issues or when native extension is unavailable.

Native Mode (:native)

Rust parser via FFI with automatic transformation to Ruby-friendly format (Symbol keys). ~20x faster than pure Ruby. This is the default mode.

JSON Mode (:json)

Rust parser that returns JSON-serializable output with inline position information. Use for APIs and when you need JSON-compatible output.

ZeroCopy Interface (Low-Level API)

For maximum performance (~29x faster than pure Ruby), use the ZeroCopy interface which bypasses Ruby transformation:

# Low-level API: Direct Rust access, String keys
grammar = Parsanol::Native.serialize_grammar(parser.root)
result = Parsanol::Native.parse_to_ruby_objects(grammar, input)
# Returns: { "name" => Slice("hello", offset: 0, length: 5) }

# High-level ZeroCopy: Include module for direct Ruby objects
class FastParser < Parsanol::Parser
  include Parsanol::ZeroCopy

  rule(:number) { match('[0-9]').repeat(1) }
  root(:number)

  output_types(number: MyNumberClass)  # Map to Ruby classes
end

parser = FastParser.new
expr = parser.parse("42")  # Returns MyNumberClass instance directly

Method                     Keys          Use Case
parse_to_ruby_objects      String        Low-level, Slice objects directly from Rust
Parsanol::ZeroCopy module  Ruby objects  Maximum performance, direct object construction

Note
ZeroCopy requires the native extension and type mapping definitions.

When to Use Parse Modes vs ZeroCopy

Your Need                   Use This                    Why
Building an API             JSON mode (:json)           Direct JSON serialization
Building a linter/IDE       Native mode (:native)       Position info for errors
Need position info          Parse Modes (not ZeroCopy)  ZeroCopy skips position tracking
High-throughput parsing     ZeroCopy                    Maximum performance
Type-safe AST with methods  ZeroCopy                    Direct typed object construction
Debugging grammar           Ruby mode (:ruby)           Pure Ruby, easier to trace

ZeroCopy Example: Calculator with Direct Object Construction

# 1. Define your AST classes with methods
module Calculator
  class Expr
    def eval = raise NotImplementedError
  end

  class Number < Expr
    attr_reader :value
    def initialize(value) = @value = value
    def eval = @value
  end

  class BinOp < Expr
    attr_reader :left, :op, :right
    def initialize(left:, op:, right:)
      @left, @op, @right = left, op, right
    end
    def eval
      case @op
      when '+' then @left.eval + @right.eval
      when '-' then @left.eval - @right.eval
      when '*' then @left.eval * @right.eval
      when '/' then @left.eval / @right.eval
      end
    end
  end
end

# 2. Define parser with ZeroCopy and output_types
class CalculatorParser < Parsanol::Parser
  include Parsanol::ZeroCopy

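  rule(:space?) { match('\s').repeat }
  rule(:number) { match('[0-9]').repeat(1).as(:int) >> space? }
  rule(:operator) { match('[-+*/]').as(:op) >> space? }
  rule(:expression) { (number.as(:left) >> operator >> expression.as(:right)).as(:binop) | number }
  root(:expression)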

  # Map rules to Ruby classes - Rust constructs these directly!
  output_types(
    number: Calculator::Number,
    binop: Calculator::BinOp
  )
end

# 3. Parse and evaluate - no transform needed!
parser = CalculatorParser.new
expr = parser.parse("2 + 3 * 4")  # Returns Calculator::BinOp directly
puts expr.eval  # => 14 (the grammar is right-recursive, so this groups as 2 + (3 * 4); it does not implement operator precedence)

Low-Level ZeroCopy: parse_to_ruby_objects

When you don’t need typed objects, use parse_to_ruby_objects for direct Slice access:

# Direct FFI call - bypasses transformation, String keys
grammar = Parsanol::Native.serialize_grammar(MyParser.new.root)
result = Parsanol::Native.parse_to_ruby_objects(grammar, input)

# Result structure (String keys, Slice values):
# { "name" => Slice("hello", offset: 0, length: 5),
#   "value" => Slice("42", offset: 10, length: 2) }

# Access position info directly
result["name"].offset    # => 0
result["name"].to_s      # => "hello"

ZeroCopy Requirements

The ZeroCopy module requires:

  1. Native extension - Run bundle exec rake compile

  2. Type mapping - Define output_types in your parser

  3. Matching constructors - Your Ruby classes must accept the parsed attributes

For complex types, you may also need Rust-side type definitions with #[derive(RubyObject)] for full zero-copy FFI construction.

Parsing Backends (Rust Core)

Behind the scenes, the Rust implementation uses one of two parsing backends, Packrat or the Bytecode VM, and an Auto setting chooses between them:

Backend            Use Case            Characteristics
Packrat (default)  Complex grammars    O(n) guaranteed, higher memory
Bytecode VM        Simple patterns     Lower memory, faster for linear patterns
Auto               Variable workloads  Analyzes grammar, selects best backend

The Ruby bindings automatically use the best backend for your grammar:

  • Uses Backend::Auto by default (same as parsanol-rs)

  • Detects nested repetitions, overlapping choices

  • Recommends Packrat for complex grammars

  • Falls back to Bytecode for simple patterns

Note
The backend selection is transparent to Ruby users. The parser object automatically uses the optimal backend based on grammar analysis.

For more details on backend selection and grammar analysis, see the Parsing Backends documentation.
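
The trade-off between the two backends comes from memoization: a packrat parser caches the result of each rule at each input position, which costs memory but means backtracking never repeats work. The sketch below shows only that idea in plain Ruby; MemoParser and its tiny grammar are hypothetical and unrelated to the Rust backend's actual data structures.

```ruby
# Standalone sketch of packrat memoization; MemoParser is hypothetical.
class MemoParser
  attr_reader :calls

  def initialize(input, memoize: true)
    @input = input
    @memoize = memoize
    @memo = {}
    @calls = 0
  end

  # Run a rule body at most once per (rule name, position) pair.
  # Failures (nil) are cached too, because fetch only checks for the key.
  def apply(name, pos)
    return yield unless @memoize

    key = [name, pos]
    @memo.fetch(key) { @memo[key] = yield }
  end

  # digits <- [0-9]+ ; returns the end position or nil.
  def digits(pos)
    apply(:digits, pos) do
      @calls += 1
      stop = pos
      stop += 1 while stop < @input.length && @input[stop].match?(/[0-9]/)
      stop > pos ? stop : nil
    end
  end

  # value <- digits '%' / digits 'px' / digits
  # Each alternative re-enters digits at the same position after backtracking.
  def value(pos)
    ['%', 'px', ''].each do |suffix|
      stop = digits(pos)
      return stop + suffix.length if stop && @input[stop, suffix.length] == suffix
    end
    nil
  end
end

plain = MemoParser.new('120', memoize: false)
plain.value(0)  # => 3
plain.calls     # => 3, digits ran once per alternative

memo = MemoParser.new('120')
memo.value(0)   # => 3
memo.calls      # => 1, later alternatives reuse the cached result
```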

Captures, Scopes, and Dynamic Atoms

Parsanol 1.2.0 introduces powerful new features for extracting and managing parsed data.

Capture Atoms

Extract named values from parsed input, similar to named groups in regular expressions:

require 'parsanol/parslet'

include Parsanol::Parslet

# Basic capture
parser = str('hello').capture(:greeting)
result = parser.parse("hello")
puts result[:greeting].to_s  # => "hello"

# Multiple captures - parse key=value pairs
kv_parser = match('[a-z]').repeat(1).capture(:key) >>
              str('=') >>
              match('[a-zA-Z0-9]').repeat(1).capture(:value)

result = kv_parser.parse("name=Alice")
puts result[:key].to_s    # => "name"
puts result[:value].to_s  # => "Alice"

Scope Atoms

Create isolated capture contexts. Captures inside a scope are discarded when the scope exits:

# Without scope: inner captures leak out
parser = str('a').capture(:temp) >> str('b') >> str('c').capture(:temp)

# With scope: inner captures are discarded
parser = str('prefix').capture(:outer) >> str(' ') >>
         scope { str('inner').capture(:inner) } >> str(' ') >>
         str('suffix').capture(:outer_end)

result = parser.parse("prefix inner suffix")
puts result[:inner]  # => nil (discarded)
puts result[:outer]  # => "prefix"

Scopes are essential for:

  • Parsing nested structures without capture pollution

  • Recursive parsing with isolated capture state

  • Memory-bounded parsing of repeated structures
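
One way to picture scope isolation is a stack of capture frames: entering a scope pushes a frame, and leaving it pops the frame along with everything captured inside. The CaptureStore class below is a hypothetical plain-Ruby sketch of that model; in particular, letting inner code read outer captures is an assumption, not documented Parsanol behavior.

```ruby
# Standalone sketch of scoped captures; CaptureStore is hypothetical.
class CaptureStore
  def initialize
    @frames = [{}]
  end

  def capture(name, value)
    @frames.last[name] = value
  end

  # Look up from the innermost frame outwards.
  def [](name)
    frame = @frames.reverse.find { |f| f.key?(name) }
    frame && frame[name]
  end

  # Captures made inside the block live in their own frame,
  # which is dropped when the block returns (even on error).
  def scope
    @frames.push({})
    yield
  ensure
    @frames.pop
  end
end

store = CaptureStore.new
store.capture(:outer, 'prefix')
store.scope do
  store.capture(:inner, 'inner')
  store[:inner]   # => "inner" (visible inside the scope)
  store[:outer]   # => "prefix" (outer captures stay readable)
end
store[:inner]     # => nil (discarded on exit)
store[:outer]     # => "prefix"
```

Because each frame is released on exit, a repeated structure holds captures for only one iteration at a time.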

Dynamic Atoms

Runtime-determined parsing via callbacks. The grammar can change based on context:

# Type-driven value parsing
class TypeParser < Parsanol::Parser
  include Parsanol::Parslet

  rule(:type) { match('[a-z]').repeat(1).capture(:type) }
  rule(:value) do
    dynamic do |ctx|
      # Pick the value grammar from the type captured earlier in the input.
      case ctx[:type].to_s
      when 'int' then match('[0-9]').repeat(1)
      when 'str' then match('[a-z]').repeat(1)
      when 'bool' then str('true') | str('false')
      else match('[a-z]').repeat(1)
      end.capture(:value)
    end
  end
  rule(:declaration) { type >> str(':') >> match('[a-z]').repeat(1).capture(:name) >> str('=') >> value }
  root :declaration
end

parser = TypeParser.new
result = parser.parse("int:count=42")
puts result[:type].to_s   # => "int"
puts result[:value].to_s  # => "42"

The DynamicContext provides:

  • ctx[:name] - Access captured values

  • ctx.remaining - Remaining input from current position

  • ctx.pos - Current byte position

  • ctx.input - Full input string

Streaming Builder API

For maximum performance, use the streaming builder API which eliminates intermediate AST construction:

require 'parsanol'

class StringCollector
  include Parsanol::BuilderCallbacks

  def initialize
    @strings = []
  end

  def on_string(value, offset, length)
    @strings << value
  end

  def finish
    @strings
  end
end

grammar = Parsanol::Native.serialize_grammar(MyParser.new.root)
builder = StringCollector.new
result = Parsanol::Native.parse_with_builder(grammar, input, builder)
# result: ["hello", "world"]

Available Callback Methods

Method                            Description             Default
on_start(input)                   Parsing started         No-op
on_success                        Parsing succeeded       No-op
on_error(message)                 Parsing failed          No-op
on_string(value, offset, length)  String/slice matched    No-op
on_int(value)                     Integer matched         No-op
on_float(value)                   Float matched           No-op
on_bool(value)                    Boolean matched         No-op
on_nil                            Nil matched             No-op
on_hash_start(size)               Entering a hash/object  No-op
on_hash_key(key)                  Hash key encountered    No-op
on_hash_end(size)                 Exiting a hash/object   No-op
on_array_start(size)              Entering an array       No-op
on_array_end(size)                Exiting an array        No-op
finish                            Parsing complete        Returns nil
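
To get a feel for how these callbacks might nest, the sketch below replays an ordinary Ruby value as a stream of builder-style events. It is plain Ruby with no native extension involved: emit_events and EventLog are hypothetical, and the event order produced by the real parser is an assumption here.

```ruby
# Standalone sketch: replay a nested Ruby value as builder-style events.
# emit_events and EventLog are hypothetical helpers.
class EventLog
  attr_reader :events

  def initialize
    @events = []
  end

  # Record any on_* callback instead of defining each one by hand.
  def method_missing(name, *args)
    return super unless name.start_with?('on_')

    @events << [name, *args]
  end

  def respond_to_missing?(name, include_private = false)
    name.start_with?('on_') || super
  end
end

def emit_events(value, builder)
  case value
  when Hash
    builder.on_hash_start(value.size)
    value.each do |key, child|
      builder.on_hash_key(key.to_s)
      emit_events(child, builder)
    end
    builder.on_hash_end(value.size)
  when Array
    builder.on_array_start(value.size)
    value.each { |child| emit_events(child, builder) }
    builder.on_array_end(value.size)
  when String      then builder.on_string(value, 0, value.bytesize) # offset unknown in this sketch
  when Integer     then builder.on_int(value)
  when Float       then builder.on_float(value)
  when true, false then builder.on_bool(value)
  when nil         then builder.on_nil
  end
end

log = EventLog.new
emit_events({ tags: ['a', 'b'] }, log)
log.events.map(&:first)
# => [:on_hash_start, :on_hash_key, :on_array_start, :on_string, :on_string, :on_array_end, :on_hash_end]
```

A builder that only cares about some events, like the StringCollector above, can ignore the rest, which is why every callback defaults to a no-op.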

Parallel Parsing

Parse multiple inputs using all CPU cores:

require 'parsanol/parallel'

grammar = MyParser.new.serialize_grammar
inputs = Dir.glob("*.json").map { |f| File.read(f) }

# Parse all files in parallel
results = Parsanol::Parallel.parse_batch(grammar, inputs)

# With configuration
config = Parsanol::Parallel::Config.new
  .with_num_threads(4)
  .with_min_chunk_size(50)

results = Parsanol::Parallel.parse_batch(grammar, inputs, config: config)

Infix Expression Parsing

Built-in support for parsing infix expressions with operator precedence:

class CalculatorParser < Parsanol::Parser
  rule(:number) { match('[0-9]').repeat(1).as(:int) }
  rule(:primary) { number | str('(') >> expr >> str(')') }

  rule(:expr) {
    infix_expression(primary,
      [str('*'), 2, :left],
      [str('/'), 2, :left],
      [str('+'), 1, :left],
      [str('-'), 1, :left],
      [str('^'), 3, :right]  # Right-associative
    )
  }
  root(:expr)
end
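
The compatibility matrix describes infix_expression as precedence climbing. As a rough standalone illustration of that algorithm (not Parsanol's implementation), the sketch below applies the same operator table to a flat token list; tokenize, parse_infix, and evaluate are hypothetical helpers.

```ruby
# Standalone sketch of precedence climbing over a token array.
# OPERATORS mirrors the table passed to infix_expression above.
OPERATORS = {
  '*' => [2, :left], '/' => [2, :left],
  '+' => [1, :left], '-' => [1, :left],
  '^' => [3, :right]
}.freeze

def tokenize(source)
  source.scan(/\d+|[-+*\/^]/).map { |t| t.match?(/\A\d+\z/) ? t.to_i : t }
end

# Consumes tokens from the front and returns a nested [left, op, right] tree.
def parse_infix(tokens, min_prec = 1)
  left = tokens.shift
  while (op = tokens.first) && OPERATORS.key?(op) && OPERATORS[op][0] >= min_prec
    prec, assoc = OPERATORS[tokens.shift]
    # Left-associative operators require a strictly higher precedence on the right.
    right = parse_infix(tokens, assoc == :left ? prec + 1 : prec)
    left = [left, op, right]
  end
  left
end

def evaluate(node)
  return node if node.is_a?(Integer)

  left, op, right = node
  l = evaluate(left)
  r = evaluate(right)
  op == '^' ? l**r : l.public_send(op, r)
end

evaluate(parse_infix(tokenize('2 + 3 * 4')))   # => 14
evaluate(parse_infix(tokenize('10 - 4 - 3')))  # => 3   (left-associative)
evaluate(parse_infix(tokenize('2 ^ 3 ^ 2')))   # => 512 (right-associative)
```

The only difference between left and right associativity is the minimum precedence passed to the recursive call.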

Treetop Expression Syntax

Parsanol supports treetop-style expression strings for quick grammar definition:

# Using exp() for treetop-style expressions
class QuickParser < Parsanol::Parser
  rule(:word) { exp("'a' 'b' ?") }  # 'a' followed by optional 'b'
  root(:word)
end

# Equivalent to:
rule(:word) { str('a') >> str('b').maybe }

Treetop Syntax Reference

Syntax       Description
'hello'      Literal string match
[a-z]        Character class
.            Any single character
'a' 'b'      Sequence (concatenation)
'a' / 'b'    Alternative (choice)
'a' ?        Optional (zero or one)
'a' *        Zero or more repetitions
'a' +        One or more repetitions
'a'{2,5}     Between 2 and 5 repetitions
('a' / 'b')  Grouping

Note

Whitespace is required before operators: 'a' ? not 'a'?

Expression Parsing Performance

The expression parser is pure Ruby (not Rust-accelerated) since it runs only at grammar definition time. The resulting atoms can still be used with Rust-accelerated parsing:

atom = Parsanol.exp("'a' +")

# Ruby parsing
atom.parse('aaa')

# Rust-accelerated parsing (if native extension available)
grammar = Parsanol::Native.serialize_grammar(atom)
Parsanol::Native.parse_to_ruby_objects(grammar, 'aaa')

Security Features

For parsing untrusted input, use built-in limits:

result = Parsanol::Native.parse_with_limits(
  grammar_json,
  untrusted_input,
  max_input_size: 10 * 1024 * 1024,  # 10 MB max
  max_recursion_depth: 100            # Limit recursion
)

Debug Tools

Enable tracing for debugging grammars:

# Parse with trace
result, trace = Parsanol::Native.parse_with_trace(grammar_json, input)
puts trace

# Generate grammar visualization
mermaid = Parsanol::Native.grammar_to_mermaid(grammar_json)
dot = Parsanol::Native.grammar_to_dot(grammar_json)

Development

Setup

bundle install

Testing

# Run all tests
bundle exec rake spec

# Run unit tests only
bundle exec rake spec:unit

# Run specific test file
bundle exec rspec spec/parsanol/atoms/str_spec.rb

Compiling Native Extension

# Install Rust (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Compile the native extension
bundle exec rake compile

# Verify native extension is working
ruby -I lib -e "require 'parsanol'; puts Parsanol::Native.available?"
# => true

Running Benchmarks

# Quick benchmarks
bundle exec rake benchmark

# Comprehensive benchmark suite
bundle exec rake benchmark:all

License

MIT License - see LICENSE file for details.

Acknowledgments

Parsanol is inspired by the Parslet library. We thank Kaspar Schiess and all Parslet contributors for creating an excellent parser library that served as inspiration for this project.
