How to Parse XML in Ruby

XML and HTML documents are the foundation that powers the modern internet. Almost every web page utilizes HTML formatting to display content. In this comprehensive guide, we will explore how to parse XML and HTML files in Ruby using the popular Nokogiri gem.

What is XML?

XML stands for Extensible Markup Language. It is a markup language that defines a set of rules for encoding documents in both a human and machine-readable format.

XML is commonly used for exchanging data between systems, configuration files, and various types of documents. It structures, stores and transports information in a way that is flexible and platform-independent.

What is HTML?

HTML stands for HyperText Markup Language. It is used to create web pages and web applications. HTML documents contain text content which describes the structure and presentation of a web page using markup tags.

The tags provide context and semantic meaning to the content. For example, the <h1> tag signifies a heading and the <p> tag signifies a paragraph.

Why Parse XML and HTML in Ruby?

Here are some common use cases for parsing XML and HTML documents in Ruby:

Web Scraping – Extracting content from websites by analyzing the underlying HTML documents. This is useful for aggregating data from various sites.
Consuming Web Services – Many web APIs return data in XML format which needs to be parsed before usage in Ruby programs.
Processing Configuration Files – Ruby applications may leverage XML configuration files which require XML parsing.
Testing Tools – Parsing HTML documents to verify correctness or extract coverage metrics.

Parsing XML and HTML using Nokogiri

Nokogiri is the most popular library in Ruby for parsing, searching, modifying and querying XML/HTML documents.

Let's explore how to use Nokogiri for common parsing tasks.

Installation

Install Nokogiri using the gem command:

gem install nokogiri

Then require Nokogiri in your Ruby file:

require 'nokogiri'

Loading XML Documents

We can load and parse XML documents from several sources:

From String

xml_data = Nokogiri::XML('<root><child/></root>')
puts xml_data.class # Prints Nokogiri::XML::Document

This parses the XML string and loads it as a Nokogiri::XML::Document.

From File

# The block form closes the file handle once parsing is done
doc = File.open('data.xml') { |file| Nokogiri::XML(file) }
puts doc.class # Prints Nokogiri::XML::Document

This loads the XML file and parses the contents into a document.

From HTTP Response

require 'nokogiri'
require 'open-uri'

# Since Ruby 3.0, Kernel#open no longer fetches URLs; use URI.open instead
xml_data = URI.open('https://example.com/data.xml')
doc = Nokogiri::XML(xml_data)

This makes an HTTP request to fetch the remote XML document and parses it.

Searching XML with XPath

XPath is a query language for selecting nodes from XML documents.

Let's see an example. Consider the following XML file, data.xml:

<?xml version="1.0" encoding="UTF-8"?>
<document>
  <products>
    <product id="001">
      <name>widget</name>
      <price>9.99</price>
      <inStock>true</inStock>
    </product>
    <product id="002">
       <name>gadget</name>
       <price>14.99</price>
       <inStock>false</inStock>
    </product>
  </products>
</document>

To fetch all names of products:

require 'nokogiri'

doc = File.open('data.xml') { |file| Nokogiri::XML(file) }

names = doc.xpath('//name')
puts names.map(&:text)
# Prints widget and gadget on separate lines

The XPath query //name selects all <name> nodes in the entire document.

Some other useful XPath examples:

/document/products – Selects <products> node under root
//price – All price nodes
//*[@id] – All nodes with id attribute
/document/products/product[1] – First product (an absolute path must start at the root element, <document>)

This demonstrates the power of XPath for searching relevant elements in complex XML documents.

Modifying XML

Nokogiri also provides ways to modify existing XML documents or generate new ones programmatically.

Editing Values

Continuing the products example – to change the price of the first product:

doc.at_xpath('//product[1]/price').content = '10.99'
puts doc.to_xml
# Updated price to 10.99

This locates the node and updates the content text.

Adding Nodes

New nodes can be added using the add_child or << methods.

new_product = Nokogiri::XML::Node.new('product', doc)
new_product.parent = doc.at_xpath('//products')

name = Nokogiri::XML::Node.new('name', doc)
name.content = 'router'
new_product << name

price = Nokogiri::XML::Node.new('price', doc)
price.content = '29.99'
new_product << price

puts doc.to_xml # New product added

Here we:

Created new product node
Set its parent to the products element
Added child elements like name, price
Printed final modified XML

This allows generating XML dynamically in Ruby.

Removing Nodes

Similarly nodes can also be removed:

doc.search('//product').last.remove
puts doc.to_xml # Last product removed

Converting XML to JSON

For web services and APIs, JSON is a more popular exchange format than XML nowadays.

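Nokogiri has no built-in JSON conversion, but a small helper combined with the Oj gem covers simple documents:

Gemfile

gem 'nokogiri'
gem 'oj'

app.rb

require 'nokogiri'
require 'oj'

# Walk the element tree by hand and build a nested hash.
def element_to_hash(node)
  children = node.element_children
  return node.text.strip if children.empty?

  children.to_h { |child| [child.name, element_to_hash(child)] }
end

xml = '<data><id>001</id><name> ACME </name></data>'
doc = Nokogiri::XML(xml)

hash = { doc.root.name => element_to_hash(doc.root) }
puts Oj.dump(hash, mode: :compat)
# {"data":{"id":"001","name":"ACME"}}

# JSON back to XML: load the JSON into a hash, then rebuild the elements.
json = '{"data":{"id":"001","name":"ACME"}}'
data = Oj.load(json)

doc = Nokogiri::XML::Document.new
doc.root = doc.create_element(data.keys.first)
data.values.first.each { |key, value| doc.root.add_child(doc.create_element(key, value)) }
puts doc.to_xml

This gives a simple XML-JSON conversion pipeline using Nokogiri and Oj. Note that the helper ignores attributes and keeps only the last of any repeated sibling elements.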

Parsing HTML using Nokogiri

As with XML, Nokogiri can parse and process HTML content using its Nokogiri::HTML API.

Example:

html = <<~HTML
  <html>
    <head>
      <title>Services</title>
    </head>
    <body>
      <h1>Our Services</h1>
      <div class="services">
        <ul>
          <li>Web Development</li>
          <li>Mobile Apps</li>
        </ul>
      </div>
    </body>
  </html>
HTML

doc = Nokogiri::HTML(html)

h1 = doc.at('h1')
puts h1.text # Our Services

titles = doc.search('title')
puts titles.first.text # Services

This demonstrates parsing the sample HTML document and extracting information using Nokogiri search API.

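Some key differences from XML parsing are:

The HTML parser is lenient: it repairs invalid markup, such as unclosed tags, instead of rejecting it
Nodes are usually located with CSS selectors; at and search accept both CSS and XPath expressions, and work on XML documents too
The focus is usually on extracting text content rather than on document structure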

Advanced Usage

We've covered the basics of parsing and searching XML/HTML using Nokogiri. Here are some more advanced use cases:

Web Scraping

Nokogiri shines for web scraping needs – extracting information from websites.

For example, this script scrapes product listings from an ecommerce page:

require 'nokogiri'
require 'open-uri'

url = 'https://store.example.com/products'

# URI.open is required on Ruby 3.0+, where Kernel#open no longer fetches URLs
html = URI.open(url)
doc = Nokogiri::HTML(html)

# Find product listings
doc.search('.product-item').each do |item|
  name = item.at('.product-name').text
  price = item.at('.price').text

  puts "#{name} - #{price}"
  # E.g. Super Widget - $9.99
end

It loads the HTML page, searches for product DOM elements, extracts details like name and price and prints them.

Processing Configuration Files

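Nokogiri can parse XML configuration files, such as a Maven pom.xml or a .NET web.config, and extract settings for use inside Ruby programs. YAML files such as .travis.yml or Jekyll's _config.yml are not XML and need a YAML parser instead.

For example, assuming a hypothetical config.xml that holds elements like <env name="RAILS_ENV" value="test"/>:

require 'nokogiri'

xml = File.read('config.xml')

doc = Nokogiri::XML(xml)
config = {}

# Each <env> element contributes one name/value pair
doc.search('//env').each do |node|
  config[node['name']] = node['value']
end

p config # Prints environment configs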

Writing Testing Tools

In test suites, it is often useful to compare HTML/XML content.

Nokogiri itself does not ship test assertions, but its serialized output can be compared using the test framework's own assertions:

# expected.rb
require 'nokogiri'

doc = Nokogiri::XML(...) # pass the expected XML here
EXP_XML = doc.to_xml

# actual.rb
require 'test/unit'
require 'nokogiri'
require_relative 'expected'

class TestingExample < Test::Unit::TestCase
  def test_match
    act_xml = generate_xml # placeholder for the code under test

    # String comparison is sensitive to whitespace and attribute order
    assert_equal EXP_XML, act_xml, 'Expected XML does not match actual'
  end
end

This shows usage of Nokogiri for building testing tools for XML/HTML comparisons.

Conclusion

Nokogiri is an indispensable tool for processing XML and HTML documents in Ruby. It has a powerful search API via XPath and CSS patterns along with options to edit, generate or convert documents.

Some examples we covered:

Parse/load XML from files, strings or URLs
Query documents using XPath
Modify documents – add/update/delete nodes
Convert XML to JSON
HTML parsing for web scraping
Build testing tools for comparisons
Process configuration files

Nokogiri provides a mature way to tackle many XML/HTML related tasks in Ruby with performance and stability across platforms. For production systems involving document processing, do consider using it over regular expressions or string manipulation.

Linuxhaxor.net – About Open Source & Linux

What is XML?

What is HTML?

Why Parse XML and HTML in Ruby?

Parsing XML and HTML using Nokogiri

Installation

Loading XML Documents

From String

From File

From HTTP Response

Searching XML with XPath

Modifying XML

Editing Values

Adding Nodes

Removing Nodes

Converting XML to JSON

Parsing HTML using Nokogiri

Advanced Usage

Web Scraping

Processing Configuration Files

Writing Testing Tools

Conclusion

Related posts:

Similar Posts

Linuxhaxor.net – About Open Source & Linux