XML and HTML underpin much of the modern web: almost every web page uses HTML to structure its content, and XML remains common for exchanging data. This guide shows how to parse XML and HTML files in Ruby using the popular Nokogiri gem.
What is XML?
XML stands for Extensible Markup Language. It is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.
XML is commonly used for exchanging data between systems, configuration files, and various types of documents. It structures, stores and transports information in a way that is flexible and platform-independent.
What is HTML?
HTML stands for HyperText Markup Language. It is used to create web pages and web applications. HTML documents contain text content which describes the structure and presentation of a web page using markup tags.
The tags provide context and semantic meaning to the content. For example, the <h1> tag marks a heading and the <p> tag marks a paragraph.
Why Parse XML and HTML in Ruby?
Here are some common use cases for parsing XML and HTML documents in Ruby:
- Web Scraping – Extracting content from websites by analyzing the underlying HTML documents. This is useful for aggregating data from various sites.
- Consuming Web Services – Many web APIs return data in XML format which needs to be parsed before usage in Ruby programs.
- Processing Configuration Files – Ruby applications may leverage XML configuration files which require XML parsing.
- Testing Tools – Parsing HTML documents to verify correctness or extract coverage metrics.
Parsing XML and HTML using Nokogiri
Nokogiri is the most popular library in Ruby for parsing, searching, modifying and querying XML/HTML documents.
Let's explore how to use Nokogiri for common parsing tasks.
Installation
Install Nokogiri using the gem command:
gem install nokogiri
Then require it in your Ruby file:
require 'nokogiri'
Loading XML Documents
We can load and parse external XML documents in several ways:
From String
xml_data = Nokogiri::XML('<root><child/></root>')
puts xml_data.class # Prints Nokogiri::XML::Document
This parses the XML string and loads it as a Nokogiri::XML::Document.
From File
# The block form closes the file as soon as parsing finishes
doc = File.open('data.xml') { |file| Nokogiri::XML(file) }
puts doc.class # Prints Nokogiri::XML::Document
This loads the XML file and parses the contents into a document.
From HTTP Response
require 'open-uri'
# Kernel#open no longer opens URLs in Ruby 3.0+, so call URI.open
xml_data = URI.open('https://example.com/data.xml')
doc = Nokogiri::XML(xml_data)
This makes an HTTP request to fetch the remote XML document and parses it.
Searching XML with XPath
XPath is a query language for selecting nodes from XML documents.
Let's see an example. Consider the following XML file, data.xml:
<?xml version="1.0" encoding="UTF-8"?>
<document>
  <products>
    <product id="001">
      <name>widget</name>
      <price>9.99</price>
      <inStock>true</inStock>
    </product>
    <product id="002">
      <name>gadget</name>
      <price>14.99</price>
      <inStock>false</inStock>
    </product>
  </products>
</document>
To fetch all names of products:
require 'nokogiri'
doc = File.open('data.xml') { |file| Nokogiri::XML(file) }
names = doc.xpath('//name')
puts names.map(&:text)
# Prints "widget" and "gadget" on separate lines
The XPath query //name selects all <name> nodes in the entire document.
Some other useful XPath examples:
- /document/products – Selects the <products> node under the root
- //price – All price nodes
- //*[@id] – All nodes with an id attribute
- /document/products/product[1] – The first product
This demonstrates the power of XPath for searching relevant elements in complex XML documents.
Modifying XML
Nokogiri also provides ways to modify existing XML documents or generate new ones programmatically.
Editing Values
Continuing the products example – to change the price of the first product:
doc.at_xpath('//product[1]/price').content = '10.99'
puts doc.to_xml
# Updated price to 10.99
This locates the node and updates the content text.
Adding Nodes
New nodes can be added using the add_child or << methods.
new_product = Nokogiri::XML::Node.new('product', doc)
new_product.parent = doc.at_xpath('//products')
name = Nokogiri::XML::Node.new('name', doc)
name.content = 'router'
new_product << name
price = Nokogiri::XML::Node.new('price', doc)
price.content = '29.99'
new_product << price
puts doc.to_xml # New product added
Here we:
- Created a new product node
- Set its parent to the products tag
- Added child elements such as name and price
- Printed the final modified XML
This allows generating XML dynamically in Ruby.
Removing Nodes
Nodes can be removed in a similar way:
doc.search('//product').last.remove
puts doc.to_xml # Last product removed
Converting XML to JSON
For web services and APIs, JSON is a more popular exchange format than XML nowadays.
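Nokogiri has no built-in JSON export, but a small helper that turns elements into a Hash, combined with the Oj gem, covers simple documents:
Gemfile
gem 'nokogiri'
gem 'oj'
app.rb
require 'nokogiri'
require 'oj'
# Recursively convert an element into nested Hashes; leaf elements become stripped strings
def element_to_hash(node)
  children = node.element_children
  return node.text.strip if children.empty?
  children.map { |child| [child.name, element_to_hash(child)] }.to_h
end
xml = '<data><id>001</id><name> ACME </name></data>'
doc = Nokogiri::XML(xml)
hash = { doc.root.name => element_to_hash(doc.root) }
puts Oj.dump(hash, mode: :compat)
# {"data":{"id":"001","name":"ACME"}}
# JSON back to XML: load the Hash and create one element per key
json = '{"data":{"id":"001","name":"ACME"}}'
data = Oj.load(json)
doc = Nokogiri::XML::Document.new
doc.root = doc.create_element('data')
data['data'].each { |key, value| doc.root << doc.create_element(key, value.to_s) }
puts doc.to_xml
This gives a simple XML-JSON conversion pipeline using Nokogiri and Oj. Note that the helper ignores attributes and keeps only the last of any repeated sibling elements.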
Parsing HTML using Nokogiri
As with XML, Nokogiri can parse and process HTML content through its Nokogiri::HTML API.
Example:
html = <<~HTML
  <html>
    <head>
      <title>Services</title>
    </head>
    <body>
      <div class="services">
        <h1>Our Services</h1>
        <ul>
          <li>Web Development</li>
          <li>Mobile Apps</li>
        </ul>
      </div>
    </body>
  </html>
HTML
doc = Nokogiri::HTML(html)
h1 = doc.at('h1')
puts h1.text # Our Services
titles = doc.search('title')
puts titles.first.text # Services
This demonstrates parsing the sample HTML document and extracting information using Nokogiri's search API.
Some key differences from XML are:
- HTML parsing is more forgiving of invalid markup
- Nodes are usually located by CSS selector or element name through at and search (these methods also work on XML documents)
- More focus on text content rather than structure
Advanced Usage
We've covered the basics of parsing and searching XML/HTML using Nokogiri. Here are some more advanced use cases:
Web Scraping
Nokogiri shines for web scraping needs – extracting information from websites.
For example, this script scrapes product listings from an ecommerce page:
require 'nokogiri'
require 'open-uri'
url = 'https://store.example.com/products'
html = URI.open(url) # Kernel#open no longer opens URLs in Ruby 3.0+
doc = Nokogiri::HTML(html)
# Find product listings
doc.search('.product-item').each do |item|
  name = item.at('.product-name').text
  price = item.at('.price').text
  puts "#{name} - #{price}"
  # E.g. Super Widget - $9.99
end
It loads the HTML page, searches for product DOM elements, extracts details like name and price and prints them.
Processing Configuration Files
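Nokogiri can parse XML configuration files and extract settings for use inside Ruby programs. Note that files such as .travis.yml, Jekyll's _config.yml and docker-compose.yml are YAML, not XML, and should be read with Ruby's YAML library instead.
For example, given a hypothetical config.xml containing <env name="..." value="..."/> elements:
require 'nokogiri'
# config.xml is an assumed file name used for illustration
doc = File.open('config.xml') { |file| Nokogiri::XML(file) }
config = {}
doc.search('//env').each do |node|
  config[node['name']] = node['value']
end
p config # Prints environment configs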
Writing Testing Tools
For testing frameworks and assertions, comparison of HTML/XML content is useful.
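Nokogiri does not ship assertion helpers of its own, but parsing both documents and comparing their serialized output works with any test framework:
# expected.rb
require 'nokogiri'
doc = Nokogiri::XML(...) # expected XML source, elided here
EXP_XML = doc.to_xml # a constant, so the test file can see it
# actual.rb
require 'test/unit'
require 'nokogiri'
require_relative 'expected'
class TestingExample < Test::Unit::TestCase
  def test_match
    # generate_xml stands for the application code under test (not shown)
    act_xml = Nokogiri::XML(generate_xml).to_xml
    assert_equal EXP_XML, act_xml, 'Expected XML does not match actual'
  end
end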
This shows usage of Nokogiri for building testing tools for XML/HTML comparisons.
Conclusion
Nokogiri is an indispensable tool for processing XML and HTML documents in Ruby. It has a powerful search API via XPath and CSS patterns along with options to edit, generate or convert documents.
Some examples we covered:
- Parse/load XML from files, strings or URLs
- Query documents using XPath
- Modify documents – add/update/delete nodes
- Convert XML to JSON
- HTML parsing for web scraping
- Build testing tools for comparisons
- Process configuration files
Nokogiri provides a mature way to tackle many XML/HTML related tasks in Ruby with performance and stability across platforms. For production systems involving document processing, do consider using it over regular expressions or string manipulation.


