YAML (YAML Ain't Markup Language) has become a popular data serialization language used for configuration files and data storage. Its simple, human-readable format makes YAML easy to hand-edit while remaining structured enough for machines to parse.
As a system administrator, you'll often find yourself needing to parse YAML files to extract or modify configuration data. While you can use programming languages like Python or Ruby to parse YAML, it's useful to know how to parse YAML directly in Bash scripts.
In this comprehensive guide, we'll cover different methods and best practices for parsing YAML in Bash, including:
- Using built-in Bash capabilities
- Leveraging command-line utilities like grep, sed, awk
- Installing and using YAML parsers like yq
- Handling more complex YAML data structures
- Accounting for poor YAML formatting
- Performance considerations
We'll look at specific examples of parsing and modifying YAML config files for common applications like Kubernetes, Docker, Ansible, and more. By the end, you'll be well-equipped to wrangle YAML data with ease in your Bash scripting and system administration workflows.
Overview of YAML Format
Before we dive into the details of parsing, let's briefly recap YAML syntax and data structures.
YAML uses spaces for indentation rather than brackets to denote nested elements. Here's an example YAML file:
top_level_key: value
SECTION1:
  key1: value1
  key2:
    nested_key1: nested_value1
    nested_key2: nested_value2
SECTION2:
  keyA: valueA
  keyB: valueB
As you can see, YAML supports:
- Key-value pairs
- Nested elements denoted by indentation
- Lists as denoted by the "-" indicator:
fruits:
  - Apple
  - Orange
  - Banana
- More complex data types like maps, objects, references, etc., which we won't cover here
This combination of simplicity and extensibility is what makes YAML so ubiquitously used across domains.
Now let's look at different methods for parsing this versatile data format in Bash.
Parsing YAML in Bash with Built-in Capabilities
Bash provides built-in capabilities to perform simple parsing and extraction of values from YAML content stored in a shell variable or file.
The fundamentals include:
- File redirections for input/output
- Command substitutions to capture output
- Parameter expansions to manipulate strings
- Conditional logic like if/then/else
- Built-in tools like awk and sed
Let's see some examples of parsing YAML using just Bash builtins.
Accessing a top-level key
Here's a simple YAML file config.yml:
database:
  host: localhost
  port: 5432
To access the database host, we can use:
#!/bin/bash
CONFIG=$(<config.yml)
DATABASE_HOST=$(echo "$CONFIG" | awk '/host:/{print $2}')
echo "Database host: $DATABASE_HOST"
This prints:
Database host: localhost
Here's what's happening:
- Load the contents of the YAML file into the $CONFIG variable
- Extract the line matching "host:" using awk
- Print the 2nd field, which is the host value
- Access the value via the $DATABASE_HOST variable
The same result could also be achieved with grep and cut instead of awk.
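For instance, here is a minimal sketch of the grep/cut variant; the heredoc recreates config.yml inline so the snippet is self-contained:

```shell
# Recreate the sample config.yml inline so the demo is self-contained
cat > config.yml <<'EOF'
database:
  host: localhost
  port: 5432
EOF

# grep selects the "host:" line, cut keeps the value after the colon,
# and xargs trims the surrounding whitespace
DATABASE_HOST=$(grep 'host:' config.yml | cut -d':' -f2 | xargs)
echo "Database host: $DATABASE_HOST"   # Database host: localhost
```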
You can encapsulate this in a function to cleanly access any top-level key:
function get_yaml_value {
local key=$1
local value=$(echo "$CONFIG" | grep "$key:" | cut -d‘ ‘ -f2)
echo $value
}
DATABASE_PORT=$(get_yaml_value "database.port")
echo "Database port: $DATABASE_PORT"
This makes reuse easy without repeating the parsing logic.
Accessing nested YAML elements
Accessing nested elements requires repeatedly grepping and cutting each level until you reach the desired key.
Consider this nested YAML:
server:
  application:
    enabled: true
    port: 3000
  environment: production
To grab the application port, we need 3 levels of searching:
APP_PORT=$(echo "$CONFIG" | grep -A3 "server:" | grep -A2 "application:" | grep "port:" | cut -d':' -f2 | xargs)
echo "APP Port: $APP_PORT"
Breaking this messy chain down:
- Select the block under the server: key using the -A3 flag
- Narrow to the application: block with -A2
- Find the port: key within the application block
- Extract the port value by cutting field 2 delimited by ":"
- Trim whitespace with xargs
Repeating this manual traversal gets tedious fast. But it works as a last resort when lacking other tools.
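One middle ground before reaching for external parsers is a single awk program that tracks which block it is inside, rather than chaining greps. This is a sketch that assumes the two-space indentation of the example above:

```shell
# Recreate the nested example inline
cat > nested.yml <<'EOF'
server:
  application:
    enabled: true
    port: 3000
  environment: production
EOF

# Track scope with flags instead of chaining greps (assumes 2-space indents)
APP_PORT=$(awk '
  /^server:/                      { in_server = 1; next }
  in_server && /^  application:/  { in_app = 1; next }
  in_app && /^    port:/          { sub(/.*port: */, ""); print; exit }
' nested.yml)
echo "APP Port: $APP_PORT"   # APP Port: 3000
```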
Modifying YAML elements
To modify rather than just access YAML data, we can redirect the output to a new file after transforming the value.
For example, to change the database port in our earlier config.yml from 5432 to 5433, we can use:
NEW_PORT=5433
sed "s/port:.*/port: $NEW_PORT/" config.yml > config_modified.yml
The substitution command s/// replaces whatever follows the "port:" key with our new desired port number.
While doable for simple cases, this approach also gets complex with nested data.
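One way to keep sed manageable with nested data is to scope the substitution to a single block using an address range. A sketch, assuming a hypothetical multi.yml with two sections that both contain a port key:

```shell
# Two sections that both contain a "port" key
cat > multi.yml <<'EOF'
database:
  port: 5432
webserver:
  port: 8080
EOF

NEW_PORT=5433
# The address range /^database:/,/^[^ ]/ limits s/// to the database block,
# so webserver's port is left untouched
sed "/^database:/,/^[^ ]/ s/port:.*/port: $NEW_PORT/" multi.yml > multi_modified.yml
cat multi_modified.yml
```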
Limitations of Built-in YAML Parsing
Bash's built-ins can parse simple YAML files in a pinch. But real-world configurations are often more complex, containing nested objects, references, arrays, and more.
Manually handling these intricate structures requires convoluted Linux piping that becomes fragile and hard to maintain.
So for production-grade parsing, we need to turn to more purpose-built tools.
Parsing YAML using Command-Line Tools
The Linux ecosystem provides more specialized command-line programs and libraries for parsing structured data like YAML. These make handling production YAML configs easier.
Some popular YAML parsers to consider:
1. yq – a handy command-line tool for manipulating YAML. Works similarly to the jq utility for JSON.
2. PyYAML – Python library with a YAML parser, usable from short scripts and python -c one-liners.
3. ruby-yaml – Ruby's standard-library YAML parser (Psych), accessible in scripts
Let's examine each option starting with yq, which strikes the best balance of usability and capability.
Parsing YAML with yq
The yq tool provides a jq-like interface tailored for YAML processing from the command line.
Note: yq v4 was recently rewritten in Go after earlier Python versions. This guide assumes v4+.
Here's an example docker-compose.yml:
services:
  webapp:
    image: nginx:latest
    ports:
      - "80:80"
    env_file: .env
  database:
    image: postgres:13
    volumes:
      - dbdata:/var/lib/postgresql/data
    env_file:
      - .secrets
volumes:
  dbdata:
Basic YAML Extraction
Let's grab some simple elements:
# Get webapp image
yq '.services.webapp.image' docker-compose.yml
# nginx:latest
# Get db volume name
yq '.volumes | keys[]' docker-compose.yml
# dbdata
This reads similarly to parsing JSON with jq.
We can also pipe the output to extract into bash variables:
# Capture in vars
WEB_IMAGE=$(yq '.services.webapp.image' docker-compose.yml)
echo "$WEB_IMAGE" # nginx:latest
Nested YAML Parsing
yq makes accessing nested elements easier than with grep/cut:
# Get db image
yq '.services.database.image' docker-compose.yml
# postgres:13
jq-style iteration syntax also works for walking YAML lists:
# Get env_files
yq '.services.database.env_file[]' docker-compose.yml
# .secrets
Go yq prints scalar values unwrapped (unquoted) by default, via its --unwrapScalar option, so output can be captured cleanly for programmatic use:
# Capture a clean scalar value
SECRET_FILES=$(yq '.services.database.env_file[]' docker-compose.yml)
echo "$SECRET_FILES" # .secrets
This default behavior lets Bash consume YAML values without quoting interference (the older Python-based yq wrapper instead relies on jq's -r raw output flag).
Editing YAML Values
Similar to jq, yq also enables modifying YAML instead of just reading it:
# Set new webapp image
yq '.services.webapp.image = "nginx:alpine"' docker-compose.yml
# Append an additional env_file to the database list
yq '.services.database.env_file += [".env.production"]' docker-compose.yml
# Redirect output to a new file (or use -i to edit in place)
yq '.services.database.env_file += [".env.production"]' docker-compose.yml > docker-compose-updated.yml
This facilitates updating configuration without directly editing YAML source files.
Conversion Between Formats
Interestingly, yq can also translate between JSON and YAML in either direction:
# YAML to JSON
yq -o=json docker-compose.yml
# JSON to YAML
DATA='{"key":"value"}'; echo "$DATA" | yq -p=json -o=yaml
This allows for integrating YAML config data into JSON-based application code flows.
Python & Ruby Libraries
Python and Ruby also provide YAML parsing from their extensive standard libraries:
import yaml
with open("config.yml") as f:
    data = yaml.safe_load(f)
print(data['database']['host'])
require 'yaml'
config = YAML.load_file('config.yml')
puts config['database']['host']
This enables strictly programmatic manipulation versus text-based streaming parsing. Tradeoffs are mainly just syntactic preference and environment availability.
For most ad-hoc scripting use cases, yq provides the best balance of flexibility and ease-of-use. But when building larger applications, the programming libraries may facilitate integration.
Dealing with Complex YAML Files
Up to this point, we assumed clean and well-structured YAML adhering strictly to conventions.
Unfortunately, real-world configurations tend to be messy: inconsistent indentation, custom extensions, duplicate keys, and so on.
Such "malformed" YAML can break parsers that rely on strict validity. So we need to handle the common issues that arise:
Missing YAML Document Start Indicator
The YAML spec allows each document to begin with --- as an explicit document start marker, and multi-document streams require it between documents:
---
key: value
But many configs omit this for brevity:
key: value
Certain strict parsers and linters will complain when the document start marker is missing.
Solution: If supported, enable a more lenient parsing mode (for example, ruamel.yaml's round-trip loader), which relaxes expectations around some formatting.
Inconsistent Indentation
Proper indentation denotes structure and scope in YAML:
key:
  inner_key: value
But inconsistent tabs and spaces often get introduced:
key:
	inner_key: value    # indented with a tab instead of spaces
Solution: Use a tool like yamlfmt to standardize formatting with consistent spaces before handing the file to other tools (note that yamlfmt rewrites files in place by default):
yamlfmt messy.yml && yq '.' messy.yml
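If yamlfmt is not available, standard tools can at least detect and normalize tab indentation (which the YAML spec forbids for indentation). A rough sketch:

```shell
# YAML forbids tabs for indentation; messy.yml uses one anyway
printf 'key:\n\tinner_key: value\n' > messy.yml

# Report any lines containing a tab character
grep -n "$(printf '\t')" messy.yml || echo "no tabs found"

# expand converts tabs to spaces (-t 2 sets two-space tab stops)
expand -t 2 messy.yml > clean.yml
cat clean.yml
```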
Custom Tag Extensions
YAML allows custom tag annotations for domain-specific typing:
product: !my_app_tag TypeA
Custom extensions can cause difficulty for parsers not configured to handle these.
Solution: Many generic parsers can skip or pass through unknown tags by disabling custom type processing; check your tool's tag-handling options (Go yq, for instance, generally round-trips unrecognized tags untouched).
Duplicate Keys
Most YAML parsers silently keep only the last value when a mapping contains duplicate keys:
server:
  host: A
  host: B
This can lead to unexpected config losses.
Solution: Establish a convention to name keys uniquely, e.g.:
server:
  host_primary: A
  host_secondary: B
Also normalize configs to remove duplicates before parsing.
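A quick way to surface duplicates before parsing is to count each indentation-plus-key pair with awk. This naive sketch treats the whole file as one scope, so it can over-report on deeply nested files, but it catches obvious copy-paste duplicates:

```shell
# Sample file with a duplicated "host" key in the same mapping
cat > dup.yml <<'EOF'
server:
  host: A
  host: B
EOF

# Count each "indentation + key" pair; report any seen more than once.
# Lines starting with '-' or '#' (list items, comments) are skipped.
DUPES=$(awk -F: '/^[[:space:]]*[^[:space:]#-][^:]*:/ {
  if (seen[$1]++) print "duplicate key on line " NR ": " $1
}' dup.yml)
echo "$DUPES"
```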
Malformed YAML
Sometimes YAML configs become so malformed they defy programmatic repair.
Solution: As a last resort, regex can help scrape values by pattern matching despite widespread issues. For example:
# Scrape port despite syntax issues
PORT=$(grep -Eo "port:[[:space:]]*[0-9]+" bad.yml)
echo "${PORT##*[: ]}" # 3000
This is brittle but can work temporarily before re-generating proper YAML.
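To make such scraping slightly safer, you can wrap it in a helper that falls back to a default when nothing matches. scrape_number below is a hypothetical helper, not a standard tool:

```shell
# Hypothetical helper: scrape an integer value by pattern, with a fallback
scrape_number() {
  local key=$1 file=$2 default=$3
  local match
  match=$(grep -Eo "$key:[[:space:]]*[0-9]+" "$file" | head -n1)
  if [ -n "$match" ]; then
    echo "${match##*[: ]}"   # strip everything up to the last ':' or space
  else
    echo "$default"          # nothing recoverable: fall back to the default
  fi
}

printf 'port: 3000\nbroken [line\n' > bad.yml
PORT=$(scrape_number "port" bad.yml 8080)
echo "Port: $PORT"   # Port: 3000
```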
Parsing Performance Considerations
When parsing large YAML configurations or in performance-sensitive environments like lambdas or containers, optimization may be necessary.
Here are some tips:
- Minimize external processes: Reduce forking of additional processes like grep, cut, etc., which add overhead. Libraries tend to parse natively in-process.
- Reduce file access: Read files once instead of making multiple passes. Map into data structures instead of re-reading.
- Increase locality: Parse into native data structures so lookups avoid repeated file traversal.
- Stream parse: Incrementally load when possible instead of reading everything at once.
- Validate once: Check validity up front so repeated handling can skip re-validation.
- Consider binary formats: Binary serialization codecs can outperform bulkier YAML text on hot paths.
Profile to compare alternatives and identify bottlenecks when performance is lacking.
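Several of these tips can be combined in pure Bash: read the file once and cache flat key/value pairs in an associative array (Bash 4+), so later lookups are in-memory hash accesses rather than repeated grep forks. A sketch for simple, non-nested YAML:

```shell
# Cache flat "key: value" pairs in one pass (requires Bash 4+ for declare -A)
declare -A CONF
cat > flat.yml <<'EOF'
host: localhost
port: 5432
EOF

while IFS=': ' read -r key value; do
  # Skip blank lines; everything else is stored as CONF[key]=value
  if [ -n "$key" ]; then
    CONF[$key]=$value
  fi
done < flat.yml

# Later lookups hit the in-memory array, not the file
echo "${CONF[host]}"   # localhost
echo "${CONF[port]}"   # 5432
```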
Putting into Practice
The true test of any technical guide is whether it enables tangible outcomes.
So let's go from theory to application by looking at some real-world examples of parsing YAML configs you're likely to encounter as a Linux system manager.
We'll leverage what we've covered to solve common tasks:
1. Modifying Kubernetes Deployment YAML
Task: Change Docker image used in a Deployment.
Solution:
# Original container image
yq '.spec.template.spec.containers[0].image' deploy.yaml
# Update image using yq
yq '.spec.template.spec.containers[0].image = "nginx:alpine"' deploy.yaml > updated_deploy.yaml
# Validate change
yq '.spec.template.spec.containers[0].image' updated_deploy.yaml
# nginx:alpine - updated!
This allows modifying Kubernetes YAML declaratively without needing to understand low-level API details.
2. Adding additional port map in Docker Compose
Task: Append an extra port binding for a service without modifying original yaml
Solution:
# Define the new port map and export it so yq's strenv() can read it
export NEW_PORT_MAP="11020:80"
# Append the port to the existing list
yq '.services.web.ports += [strenv(NEW_PORT_MAP)]' docker-compose.yml > docker-compose-updated.yml
# Validate change
yq '.services.web.ports[]' docker-compose-updated.yml
# 80:80
# 11020:80
Now we've extended the compose file while keeping the original YAML intact.
3. Modifying Ansible Configuration
Task: Change cache timeout for Ansible deployments from default
Solution:
Note that ansible.cfg is INI-format rather than YAML, so yq cannot parse it; plain sed handles this one:
# View the current timeout
grep 'fact_caching_timeout' ansible.cfg
# Set a new value
sed 's/^fact_caching_timeout *=.*/fact_caching_timeout = 3600/' ansible.cfg > ansible-modified.cfg
# Confirm the new timeout
grep 'fact_caching_timeout' ansible-modified.cfg
# fact_caching_timeout = 3600
This will accelerate future playbook runs.
Recap
As you can see, whether deploying containers, configuring CLIs, or managing a fleet, being able to directly manipulate YAML unlocks administrative superpowers!
Conclusion
YAML has cemented itself as a fixture in DevOps toolchains – and by extension – a common surface area for Linux systems administrators. Knowing how to effectively parse and modify YAML removes friction when interacting with the modern application stack.
In this guide, we covered core approaches for working with YAML in Bash scripts leveraging both built-in and external tools – focusing on the excellent yq utility. You should now feel empowered to handle YAML configurations in automation workflows.
While YAML is designed as "human-friendly", the proliferation of tools undoubtedly adds overhead and complexity over direct JSON usage. Do evaluate if YAML remains appropriate for new internal tooling, especially when speed is critical.
But when inheriting existing systems predicated on YAML pipelines, this guide equips you to parse configurations with confidence using pure Bash scripting. Just don't forget to regularly reformat for maintainability as cruft inevitably accumulates!


