YAML (YAML Ain't Markup Language) has become a popular data serialization language used for configuration files and data storage. Its simple and human-readable format makes YAML easy to hand-edit while remaining structured enough for machines to parse.

As a system administrator, you'll often find yourself needing to parse YAML files to extract or modify configuration data. While you can use programming languages like Python or Ruby to parse YAML, it's useful to know how to parse YAML directly in Bash scripts.

In this comprehensive guide, we'll cover different methods and best practices for parsing YAML in Bash, including:

  • Using built-in Bash capabilities
  • Leveraging command-line utilities like grep, sed, awk
  • Installing and using YAML parsers like yq
  • Handling more complex YAML data structures
  • Accounting for poor YAML formatting
  • Performance considerations

We'll look at specific examples of parsing and modifying YAML config files for common applications like Kubernetes, Docker, Ansible, and more. By the end, you'll be well-equipped to wrangle YAML data with ease in your Bash scripting and system administration workflows.

Overview of YAML Format

Before we dive into the details of parsing, let's briefly recap YAML syntax and data structures.

YAML uses spaces for indentation rather than brackets to denote nested elements. Here's an example YAML file:

top_level_key: value
SECTION1:
  key1: value1  
  key2: 
    nested_key1: nested_value1
    nested_key2: nested_value2
SECTION2:
  keyA: valueA
  keyB: valueB  

As you can see, YAML supports:

  • Key-value pairs
  • Nested elements denoted by indentation
  • Lists as denoted by the "-" indicator:
fruits: 
  - Apple
  - Orange
  - Banana
  • More complex features like anchors, references, and custom tags, which we won't cover here

This combination of simplicity and extensibility is what makes YAML so ubiquitously used across domains.

Now let's look at different methods for parsing this versatile data format in Bash.

Parsing YAML in Bash with Built-in Capabilities

Bash provides built-in capabilities to perform simple parsing and extraction of values from YAML content stored in a shell variable or file.

The fundamentals include:

  • File redirections for input/output
  • Command substitutions to capture output
  • Parameter expansions to manipulate strings
  • Conditional logic like if/then/else
  • Ubiquitous text utilities like awk and sed (external commands, but available nearly everywhere)

Let's see some examples of parsing YAML using these basics.

Accessing a top-level key

Here's a simple YAML file config.yml:

database:
  host: localhost
  port: 5432  

To access the database host, we can use:

#!/bin/bash

CONFIG=$(<config.yml)
DATABASE_HOST=$(echo "$CONFIG" | awk '/host:/{print $2}')

echo "Database host: $DATABASE_HOST"

This prints:

Database host: localhost

Here's what's happening:

  1. Load the contents of the YAML file into the $CONFIG variable
  2. Extract the line containing "host:" using awk
  3. Print the 2nd field, which is the host value
  4. Access the value via the $DATABASE_HOST variable

The same result could also be achieved with grep and cut instead of awk.
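For illustration, here is a minimal self-contained sketch of that grep/cut variant; the sample file is recreated inline so the snippet runs standalone:

```shell
#!/bin/bash
# Recreate the sample config so the snippet runs standalone.
printf 'database:\n  host: localhost\n  port: 5432\n' > /tmp/config.yml

# grep isolates the line, cut takes the part after the colon,
# and xargs trims the surrounding whitespace.
DATABASE_HOST=$(grep 'host:' /tmp/config.yml | cut -d':' -f2 | xargs)
echo "Database host: $DATABASE_HOST"   # Database host: localhost
```

The xargs trim avoids having to count space-separated fields, which shift with indentation depth.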

You can encapsulate this in a function to cleanly access any top-level key:

function get_yaml_value {
  local key=$1
  local value=$(echo "$CONFIG" | grep "^[[:space:]]*$key:" | cut -d':' -f2 | xargs)
  echo "$value"
}

DATABASE_PORT=$(get_yaml_value "port")
echo "Database port: $DATABASE_PORT"

This makes reuse easy without repeating the parsing logic, though note the search is flat: it simply returns the first key of that name, at any depth.
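Parameter expansion can do the same lookup without forking any external processes; here is a minimal sketch, assuming a flat key: value layout (the file path and function name are illustrative):

```shell
#!/bin/bash
# Sketch: pure-Bash lookup using a while-read loop and parameter
# expansion; no grep/cut forks. Assumes flat "key: value" lines.
get_flat_value() {
  local target=$1 file=$2 line key value
  while IFS= read -r line; do
    key=${line%%:*}              # text before the first colon
    key=${key//[[:space:]]/}     # drop indentation
    value=${line#*:}             # text after the first colon
    value=${value# }             # trim one leading space
    if [ "$key" = "$target" ]; then
      echo "$value"
      return 0
    fi
  done < "$file"
  return 1
}

printf 'database:\n  host: localhost\n  port: 5432\n' > /tmp/flat.yml
get_flat_value port /tmp/flat.yml   # prints 5432
```

Because everything happens in-process, this also scales better inside tight loops than the grep/cut pipeline.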

Accessing nested YAML elements

Accessing nested elements requires grepping and cutting each level in turn until you reach the desired key.

Consider this nested YAML:

server:
  application:
    enabled: true
    port: 3000
    environment: production

To grab the application port, we need 3 levels of searching:

APP_PORT=$(echo "$CONFIG" | grep -A4 "server:" | grep -A3 "application:" | grep "port:" | cut -d':' -f2 | xargs)
echo "APP Port: $APP_PORT"

Breaking this messy chain down:

  1. Select the block under the server: key using the -A4 flag
  2. Narrow to the application: block with -A3
  3. Find the port: key within that block
  4. Extract the value by cutting on the ":" delimiter
  5. Trim whitespace with xargs

Repeating this manual traversal gets tedious fast. But it works as a last resort when lacking other tools.
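When the chain grows unwieldy, a single awk program can express the same traversal in one pass. Here is a sketch assuming two-space indentation per level:

```shell
#!/bin/bash
# Sketch: one awk pass instead of chained greps; assumes two-space
# indentation and the structure shown above.
cat > /tmp/nested.yml <<'EOF'
server:
  application:
    enabled: true
    port: 3000
    environment: production
EOF

APP_PORT=$(awk '
  /^server:/                      { in_server = 1; next }
  in_server && /^  application:/  { in_app = 1; next }
  in_app && /^    port:/          { sub(/.*: */, ""); print; exit }
' /tmp/nested.yml)
echo "APP Port: $APP_PORT"   # APP Port: 3000
```

The state flags make the traversal explicit, so adding another nesting level is one more pattern rather than another pipeline stage.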

Modifying YAML elements

To modify rather than just access YAML data, we can redirect the output to a new file after transforming the value.

For example, to change the database port in our earlier example config.yml, we can use:

NEW_PORT=5433

sed "s/port:.*/port: $NEW_PORT/" config.yml > config_modified.yml

The substitution command s/// replaces whatever follows the "port:" key with our new desired port number.

While doable for simple cases, this substitution hits every port: key in the file, and the approach quickly gets complex with nested data.
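One way to keep the substitution from touching identically named keys elsewhere is a sed address range that bounds it to a single block; a sketch:

```shell
#!/bin/bash
# Sketch: restrict the substitution to the database: block so a
# port: key under another section is left alone.
cat > /tmp/config.yml <<'EOF'
database:
  host: localhost
  port: 5432
server:
  port: 8080
EOF

# The range runs from "database:" to the next unindented line.
sed '/^database:/,/^[^ ]/ s/^\(  port:\).*/\1 5433/' /tmp/config.yml > /tmp/config_modified.yml
cat /tmp/config_modified.yml
```

Here the database port becomes 5433 while the server port stays 8080, because the s command only fires inside the addressed range.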

Limitations of Built-in YAML Parsing

Bash's built-ins can parse simple YAML files in a pinch. But real-world configurations are often more complex, containing nested objects, anchors and references, arrays, and so on.

Manually handling these intricate structures requires convoluted pipelines that become fragile and hard to maintain.

So for production-grade parsing, we need to turn to more purpose-built tools.

Parsing YAML using Command-Line Tools

The Linux ecosystem provides specialized command-line programs and libraries for parsing structured data formats like YAML. These make handling production configs much easier.

Some popular YAML parsers to consider:

1. yq – a handy command-line tool for manipulating YAML. Works similarly to the jq utility for JSON.

2. PyYAML – Python library with a full YAML parser, usable from shell one-liners via python -c.

3. Psych – Ruby's standard-library YAML module, accessible in scripts via require 'yaml'.

Let's examine each option, starting with yq, which strikes the best balance of usability and capability.

Parsing YAML with yq

The yq tool provides a jq-like interface tailored for YAML processing from the command line.

Note: two popular tools share the name yq: the Go implementation (mikefarah/yq) and a Python-based jq wrapper (kislyuk/yq), and their syntax differs. This guide assumes the Go implementation, v4+.

Here's an example docker-compose.yml:

services:

  webapp:
    image: nginx:latest
    ports: 
      - "80:80"
    env_file: .env

  database:
    image: postgres:13
    volumes:
      - dbdata:/var/lib/postgresql/data  
    env_file:
      - .secrets  

volumes:
  dbdata:

Basic YAML Extraction

Let's grab some simple elements:

# Get webapp image
yq '.services.webapp.image' docker-compose.yml
# nginx:latest

# Get db volume name
yq '.volumes | keys[]' docker-compose.yml
# dbdata

This reads similarly to parsing JSON with jq.

We can also capture the output into Bash variables:

# Capture in vars 
WEB_IMAGE=$(yq '.services.webapp.image' docker-compose.yml)
echo "$WEB_IMAGE" # nginx:latest

Nested YAML Parsing

yq makes accessing nested elements easier than with grep/cut:

# Get db image 
yq '.services.database.image' docker-compose.yml
# postgres:13

JSON-style wildcards also work for iterating YAML lists:

# Get env_files 
yq '.services.database.env_file[]' docker-compose.yml
# .secrets

The -r (raw output) flag strips YAML quoting and formatting from printed values, which keeps them clean for programmatic use:

# Raw output for clean capture
SECRET_FILES=$(yq -r '.services.database.env_file[]' docker-compose.yml)
echo "$SECRET_FILES" # .secrets

This is handy whenever Bash needs to consume YAML values without interference.

Editing YAML Values

Similar to jq, yq also enables modifying YAML instead of just reading it:

# Set new webapp image
yq '.services.webapp.image = "nginx:alpine"' docker-compose.yml

# Append an additional env_file to the database list
yq '.services.database.env_file += [".env.production"]' docker-compose.yml

# Redirect output to a new file
yq '.services.database.env_file += [".env.production"]' docker-compose.yml > docker-compose-updated.yml

This facilitates updating configuration without hand-editing YAML source files. (yq can also edit files in place with the -i flag.)

Conversion Between Formats

Interestingly, yq can also translate between JSON and YAML in either direction:

# YAML to JSON
yq -o=json docker-compose.yml

# JSON to YAML
echo '{"key":"value"}' | yq -p=json -o=yaml

This allows for integrating YAML config data into JSON-based application code flows.

Python & Ruby Libraries

Python and Ruby also provide YAML parsing, via the PyYAML package and Ruby's standard library respectively:

Python (requires PyYAML):

import yaml

with open("config.yml") as f:
    data = yaml.safe_load(f)
print(data['database']['host'])

Ruby (standard library):

require 'yaml'

config = YAML.load_file('config.yml')
puts config['database']['host']

These libraries parse YAML into native data structures for fully programmatic manipulation, versus streaming text through pipelines. The tradeoffs are mainly syntactic preference and environment availability.

For most ad-hoc scripting use cases, yq provides the best balance of flexibility and ease-of-use. But when building larger applications, the programming libraries may facilitate integration.

Dealing with Complex YAML Files

Up to this point, we assumed clean and well-structured YAML adhering strictly to conventions.

Unfortunately, real-world configurations tend to be messy: inconsistent indentation, custom extensions, duplicate keys, and so on.

Such "malformed" YAML can break parsers that rely on strict validity. So we need to handle common issues that arise:

Missing YAML Document Start Indicator

The YAML spec defines --- as the document start indicator; it is required when a stream contains multiple documents, and many style guides recommend it even for single documents:

---
key: value

But many configs omit this for brevity:

key: value 

Certain strict parsers and linters will complain about streams missing the document indicators.

Solution: Use a lenient parser mode where available (for example, ruamel.yaml's round-trip mode in Python relaxes several formatting expectations), or simply prepend --- before parsing.
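Where the indicator genuinely matters is in multi-document streams, for example a single file holding several Kubernetes manifests with --- separating the documents:

```yaml
---
apiVersion: v1
kind: Namespace
metadata:
  name: staging
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
  namespace: staging
```

Without the separators, a parser would reject this as one invalid document with duplicate keys.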

Inconsistent Indentation

Proper indentation denotes structure and scope in YAML:

key: 
  inner_key: value

But inconsistent widths, or illegal tab characters (YAML forbids tabs for indentation), often get introduced:

key:
   inner_key: value

Solution: Use a formatter such as yamlfmt to standardize spacing, or simply round-trip the file through yq, whose output uses consistent two-space indentation:

yq '.' messy.yml > clean.yml

Custom Tag Extensions

YAML allows custom tag annotations for domain-specific typing:

product: !my_app_tag TypeA 

Custom extensions can cause difficulty for parsers not configured to handle these.

Solution: Generic parsers can often be configured to ignore unknown tags or treat them as plain values; check your parser's options (yq, for instance, exposes a tag operator for inspecting and rewriting tags).

Duplicate Keys

Most YAML parsers silently keep just one value (typically the last) when duplicate keys are encountered, while strict parsers reject the file outright:

server:
  host: A
  host: B

This can lead to unexpected config losses.

Solution: Establish a convention to name keys uniquely, e.g:

server:
  host_primary: A
  host_secondary: B

Also normalize configs to remove duplicates before parsing.
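Finding the duplicates is the first step in normalizing; here is a small awk sketch that flags repeated keys (a heuristic: same-named keys under different parents at the same depth will also be reported):

```shell
#!/bin/bash
# Sketch: flag keys that repeat at the same indentation.
# Heuristic only: same-named keys under *different* parents at the
# same depth are also reported.
cat > /tmp/dupes.yml <<'EOF'
server:
  host: A
  host: B
EOF

awk -F: '/^[[:space:]]*[A-Za-z_]/ {
  if (seen[$1]++) printf "duplicate key:%s (line %d)\n", $1, NR
}' /tmp/dupes.yml
# duplicate key:  host (line 3)
```

Running such a check in CI catches silent overrides before they reach production configs.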

Malformed YAML

Sometimes YAML configs become so malformed they defy programmatic repair.

Solution: As a last resort, regex can help scrape values by pattern matching despite widespread issues. For example:

# Scrape port despite syntax issues
PORT=$(grep -Eo "port:[[:space:]]*[0-9]+" bad.yml)
echo "${PORT##* }" # 3000

This is brittle but can work temporarily before re-generating proper YAML.

Parsing Performance Considerations

When parsing large YAML configurations or in performance-sensitive environments like lambdas or containers, optimization may be necessary.

Here are some tips:

  • Minimize external processes: Reduce forking additional processes like grep, cut, etc which add overhead. Libraries tend to parse natively in-process.

  • Reduce file access: Read files once instead of multiple passes. Map to data structures instead of recurring access.

  • Increase locality: Parse into native data structures so lookups minimize file traversal.

  • Stream parse: Incrementally load when possible instead of all at once.

  • Validate once: Check validity first before repeated handling to allow optimizations.

  • Consider binary formats: for hot paths, converting config to a binary codec such as MessagePack or CBOR can beat repeatedly parsing text.

Profile to compare alternatives and identify bottlenecks when performance is lacking.
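To see fork overhead concretely, Bash's built-in time keyword makes a quick micro-benchmark. A sketch (absolute numbers vary by machine):

```shell
#!/bin/bash
# Sketch: compare a fork-per-iteration pipeline against in-process
# parameter expansion. The second loop is typically far faster
# because it spawns no subprocesses.
line="port: 5432"

time for i in $(seq 1 200); do
  v=$(echo "$line" | cut -d' ' -f2)   # forks a subshell + cut each pass
done

time for i in $(seq 1 200); do
  v=${line#*: }                       # pure parameter expansion
done

echo "parsed value: $v"   # parsed value: 5432
```

The gap widens with iteration count, which is why the "minimize external processes" tip matters most inside loops.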

Putting into Practice

The true test of any technical guide is whether it enables tangible outcomes.

So let's go from theory to application by looking at some real-world examples of parsing YAML configs you're likely to encounter as a Linux system manager.

We'll leverage what we've covered to solve common tasks:

1. Modifying Kubernetes Deployment YAML

Task: Change Docker image used in a Deployment.

Solution:

# Original container image
yq '.spec.template.spec.containers[0].image' deploy.yaml

# Update image using yq
yq '.spec.template.spec.containers[0].image = "nginx:alpine"' deploy.yaml > updated_deploy.yaml

# Validate change
yq '.spec.template.spec.containers[0].image' updated_deploy.yaml
# nginx:alpine - updated!

This allows modifying Kubernetes YAML declaratively without needing to understand low-level API details.

2. Adding additional port map in Docker Compose

Task: Append an extra port binding for a service without modifying the original YAML file.

Solution:

# Define new port map (exported so yq's strenv() can read it)
export NEW_PORT_MAP="11020:80"

# Append the port to the existing list
yq '.services.web.ports += [strenv(NEW_PORT_MAP)]' docker-compose.yml > docker-compose-updated.yml

# Validate change
yq '.services.web.ports[]' docker-compose-updated.yml
# 80:80
# 11020:80

Now we've extended the compose file while keeping the original YAML intact.

3. Modifying Ansible Variables

Task: Change a cache timeout variable used by Ansible playbooks.

Note: ansible.cfg itself is INI, not YAML, so yq cannot parse it directly. Playbooks, inventories, and group_vars files are YAML, however, so here we assume a hypothetical group_vars/all.yml defining a cache_timeout variable.

Solution:

# View current value
yq '.cache_timeout' group_vars/all.yml

# Set new value
yq '.cache_timeout = 3600' group_vars/all.yml > all-modified.yml

# Confirm new value
yq '.cache_timeout' all-modified.yml
# 3600

Variables set this way flow into subsequent playbook runs automatically.

Recap

As you can see, whether deploying containers, configuring CLIs, or managing a fleet, being able to directly manipulate YAML unlocks administrative superpowers!

Conclusion

YAML has cemented itself as a fixture in DevOps toolchains – and by extension – a common surface area for Linux systems administrators. Knowing how to effectively parse and modify YAML removes friction when interacting with the modern application stack.

In this guide, we covered core approaches for working with YAML in Bash scripts leveraging both built-in and external tools – focusing on the excellent yq utility. You should now feel empowered to handle YAML configurations in automation workflows.

While YAML is designed as "human-friendly", the proliferation of tools undoubtedly adds overhead and complexity over direct JSON usage. Do evaluate if YAML remains appropriate for new internal tooling, especially when speed is critical.

But when inheriting existing systems predicated on YAML pipelines, this guide equips you to parse configurations with confidence from your Bash scripts. Just don't forget to regularly reformat for maintainability as cruft inevitably accumulates!
