As a developer, processing and analyzing text files is a common task. Awk is a handy command-line tool for working with files containing string data organized in rows and columns. With awk, you can easily split large files into manageable chunks, extract only the data you need, and perform complex pattern matching and data manipulation operations.
In this comprehensive guide, we will explore the ins and outs of splitting file strings with awk. Whether you are a Linux admin, DevOps engineer, or application developer, mastering awk will make you more productive in processing log files, CSV data, and other text-based assets.
An Introduction to Awk
Awk is a standard Linux utility that owes its name to the initials of its creators: Alfred Aho, Peter Weinberger, and Brian Kernighan. It interprets a special-purpose programming language designed for processing text files.
The awk language allows you to:
- Scan a file line-by-line
- Split each input line into fields or columns
- Compare the content of each line against patterns
- Perform actions like printing, counting, or replacing text on lines that match your specified conditions
In a nutshell, awk lets you easily slice and dice text files to filter, transform, and report on data.
Awk is well-suited for formatting output, validating input data, creating reports from log files and databases, performing simple numeric computations, and a wide variety of other text processing tasks. Its interpreted nature and built-in features like associative arrays, regular expressions, and numeric functions make it a handy tool compared to full-fledged programming languages.
Now that we know what awk is and why it is useful, let's look at how to use it for splitting string data in files.
Splitting Files with Awk
Awk handles a text file as a series of records. By default, each line in the text file is considered a record. This makes processing files organized as rows and columns quite straightforward.
The basic workflow when using awk is:
- Specify a pattern that matches the lines you want to operate on
- Execute actions like printing, counting, replacing etc. on the matching lines
For example:
awk '/search pattern/ {action}' inputfile
This will apply the action to lines matching "search pattern" in inputfile.
The action can print the entire line (print $0), print specific columns (print $1,$2 etc.) or perform other text processing and reporting functions.
Now let's go through some practical examples of using awk to split and transform files containing string data.
Example 1: Print the Entire File
Printing the full contents of the file is what awk does when an action has no pattern: with nothing to filter on, every line matches.
awk '{print}' inputfile
This loops through each line of inputfile and prints it with the print statement.
For example, if data.csv contains:
Name,Age,City
John,30,New York
Jane,25,Chicago
Bob,20,Miami
Then awk '{print}' data.csv would output:
Name,Age,City
John,30,New York
Jane,25,Chicago
Bob,20,Miami
While this behaves like cat data.csv, awk gives us more flexibility to further process the file's contents.
Example 2: Print Matching Lines
We can filter the file to only print lines that match a specific pattern using:
awk '/pattern/ {print}' inputfile
For example, awk '/John/ {print}' data.csv prints only the line containing "John":
John,30,New York
And awk '/Chicago/ {print}' data.csv prints only the line with Chicago:
Jane,25,Chicago
This allows us to extract records matching complex logic like names, ages, locations etc.
Example 3: Print Specific Columns
To print only certain columns from the matched lines, we use the $N syntax.
$0 refers to the full line, $1 is the first column, $2 the second column and so on.
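The $N fields depend on awk's field separator, which is whitespace by default. Since data.csv uses commas, we need to set the separator with the -F option (or the FS variable). A minimal sketch against the data.csv shown earlier:

```shell
# With the default whitespace separator, "John,30,New" would all be part of $1.
# -F, tells awk to split each line on commas instead.
awk -F, '{print $2}' data.csv
# Prints the Age column, one value per line: Age, 30, 25, 20
```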
For example, since data.csv is comma-separated, we set the field separator with the -F, option and print just the name and city from the line where "John" appears:
awk -F, '/John/ {print $1,$3}' data.csv
John New York
(The comma in print separates the output fields with a space, awk's default output separator.)
And we can combine the column printing with a different search pattern like:
awk -F, '/Miami/ {print $1,$2}' data.csv
This prints the name and age where Miami appears:
Bob 20
Splitting columns like this along with pattern matching enables extracting subsets and summaries from large files.
Example 4: Save Output to a File
To save the output to a new file instead of printing to standard output, we redirect it using > filename:
awk -F, -v OFS=, '{print $1,$2}' data.csv > names_ages.csv
This saves a two-column CSV with just names and ages to names_ages.csv. The -F, option splits input fields on commas, and -v OFS=, makes print join the output fields with commas as well.
The same applies for any other print statements:
awk '/John/ {print $0}' data.csv > john_record.csv
Saves the full record for John to the file john_record.csv.
Example 5: Count Pattern Matches
To count occurrences of a pattern like cities or names, awk provides convenient variables to increment and print:
awk '/Miami/ {++cities} END {print "City count:", cities}' data.csv
This prints:
City count: 1
Here ++cities increments the counter each time "Miami" appears on a line. And END {print} tallies the final count after processing the whole file.
We can adapt this easily for names, ages or any other field we want to count matches for.
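The same counter technique combines naturally with numeric conditions on fields. As a small sketch, assuming the comma-separated data.csv from the earlier examples, with NR > 1 skipping the header row:

```shell
# Count data rows whose second field (Age) is greater than 21
awk -F, 'NR > 1 && $2 > 21 { ++adults } END { print "Over 21:", adults }' data.csv
# John (30) and Jane (25) match, so this prints: Over 21: 2
```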
Example 6: Filter Lines by Length
Awk's built-in length function returns the number of characters in the current line when called without an argument (it defaults to operating on $0).
We can use this to filter lines shorter or longer than N characters:
# Lines longer than N
awk 'length > N' inputfile
# Lines shorter than N
awk 'length < N' inputfile
For example, to print only the lines in data.csv longer than 13 characters:
awk 'length > 13' data.csv
This prints the John and Jane records (16 and 15 characters respectively). And to print only the shorter lines:
awk 'length < 13' data.csv
This prints only the Bob record (12 characters).
The length check runs on each line and does the filtering for us automatically.
Example 7: Print Non-Empty Lines
Another handy built-in variable NF contains the number of fields or columns in the current input line.
We can use this to print only non-empty lines:
awk 'NF > 0' inputfile
And print only empty lines with:
awk 'NF == 0' inputfile
This provides an easy way to weed out extraneous newlines mid-file or other edge cases.
Example 8: Number of Lines
To print the total number of lines, awk provides the NR variable storing the number of input records so far:
awk 'END {print NR}' inputfile
This keeps a running count in NR and prints the total when it reaches end-of-file.
We can extend this to an average, percentage or other calculation based on the line count.
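As one sketch of such an extension, here is a way to compute the average of the Age column in data.csv, dividing the running sum by NR - 1 to discount the header row:

```shell
# Sum the Age field over the data rows, then divide by the row count at the end
awk -F, 'NR > 1 { sum += $2 } END { print "Average age:", sum / (NR - 1) }' data.csv
# (30 + 25 + 20) / 3 = 25, so this prints: Average age: 25
```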
Example 9: Miscellaneous Checks
Here are some other handy text processing techniques with awk:
- Print alphabetic lines only: awk '/^[A-Za-z]+$/' inputfile
- Print numeric lines only: awk '/^[0-9]+$/' inputfile
- Print empty lines: awk '/^$/' inputfile
- Print lines containing the character c: awk '/c/' inputfile
- Print lines not containing the character c: awk '!/c/' inputfile
And many more advanced criteria are possible using awk's pattern matching operators and regular expressions.
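One such operator worth knowing is ~, which matches a regular expression against a single field rather than the whole line. A small sketch against data.csv, using -F, for its comma-separated fields:

```shell
# Print records whose third field (City) starts with "New"
awk -F, '$3 ~ /^New/' data.csv
# Matches only: John,30,New York
```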
Going Further with Awk
While the above examples cover basic file splitting with awk, we've only scratched the surface of awk's capabilities.
Here are some additional topics for leveling up your awk skills:
- User-defined variables and functions
- Control flow statements like if-else conditions and while loops
- Built-in arithmetic, string and I/O functions
- Associative arrays
- Multiple input file handling
- Generating formatted reports
- Debugging scripts
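As a taste of associative arrays, here is a sketch that tallies how many records in data.csv fall in each city:

```shell
# count[] is an associative array keyed by the City field;
# the END block iterates over the keys accumulated while reading the file
awk -F, 'NR > 1 { count[$3]++ } END { for (c in count) print c, count[c] }' data.csv
```

Note that the iteration order of for (c in count) is unspecified, so pipe the output through sort if you need stable ordering.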
By combining awk with other Linux text processing tools like sed, grep, sort, and uniq, you can solve complex data extraction and transformation challenges without reaching for heavier tools like Python or Perl.
To recap, here is a cheat sheet of some useful awk syntax we covered for quick reference:
# Print entire file
awk '{print}' inputfile
# Print matching lines
awk '/pattern/ {print}' inputfile
# Print specific columns
awk '/pattern/ {print $1,$2}' inputfile
# Save output to file
awk '{print $1}' inputfile > outfile
# Line count for pattern
awk '/pattern/ {++count} END {print count}' inputfile
# Line length filter
awk 'length > 20' inputfile
# Number of lines
awk 'END {print NR}' inputfile
I hope this overview inspires you to reach deeper into awk for all your file parsing and reporting needs! Let me know in the comments if you have any favorite awk tricks or other text processing tools worth covering.


