As a seasoned Linux developer and sysadmin, I utilize many small but versatile Unix utilities that boost my productivity. One tool that I employ almost daily is the humble but surprisingly powerful uniq command.
In this comprehensive guide, I'll share everything I've learned about using uniq effectively in real-world tasks – drawing on over a decade of Linux experience distilled into one place.
Understanding the Role of uniq
The core purpose of uniq is straightforward: filtering duplicate lines from text files or input streams. It works hand-in-hand with sort. The standard boilerplate is:
sort file | uniq
This sorts the contents of file and feeds the result into uniq, which filters out adjacent duplicate lines. The output contains only the unique lines.
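A quick sanity check of that behavior, on a toy fruit list:

```shell
# uniq only removes *adjacent* duplicates, which is why sort runs first.
printf 'banana\napple\nbanana\ncherry\n' | sort | uniq
# -> apple
#    banana
#    cherry
```

Without the sort, the two banana lines would not be adjacent and both would survive.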
But text processing doesn't end there. With various options and Unix piping, uniq can handle far more advanced tasks:
- Analyzing log files for patterns
- Generating reports from data files
- Removing duplicates from database exports
- Deduplicating records extracted from JSON and XML exports
These are just some examples – but together they form the foundation for enormous productivity gains in text wrangling.
Based on my experience applying uniq across organizations and codebases, proper use of uniq routinely trims significant time off text-processing tasks – savings that compound quickly across teams and infrastructures.
It's an easily overlooked tool, but it has immense leverage. Now let's dive deeper into applying that leverage properly with uniq options and techniques.
Options for Advanced uniq Usage
Mastering the various command line options is key to unlocking the full potential here. Here are the most important ones with examples:
Count Occurrences with -c
Prefixing each line with occurrence counts is a frequent need in analysis workflows:
sort file.txt | uniq -c
For example, examining web access logs:
5 /index.html
3 /about.html
8 /home
Adding counts lets me instantly identify hotspots.
I also rely on -c when diagnosing configuration issues by counting error codes in logs:
22 [ERROR 405]
89 [ERROR 500]
This shows 500 errors outnumbering 405s roughly 4:1 – pointing to a server-side application bug rather than malformed client requests.
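The whole tally is one pipeline. A self-contained sketch, with synthetic stand-in lines where a real application log would be read:

```shell
# Pull out the bracketed error codes, then count each distinct code,
# most frequent first. In practice the printf would be a cat of the log.
printf '%s\n' 'x [ERROR 500]' 'y [ERROR 405]' 'z [ERROR 500]' \
  | grep -oE '\[ERROR [0-9]+\]' \
  | sort | uniq -c | sort -nr
```

The `-o` flag makes grep emit only the matched code, which keeps the sort/uniq comparison clean.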
Highlight Duplicates with -d
For many admin tasks, I'm specifically looking to identify and investigate duplicate entries that should not occur. The -d option filters input to only show duplicates:
sort data.json | uniq -d
Applied to user databases, IP address blocks, or identifier lists, -d instantly surfaces duplicates for review. This works across formats – including JSON:
[
{"id": 101, "user": "amy"},
{"id": 102, "user": "bob"},
{"id": 103, "user": "carol"},
{"id": 101, "user": "amy"}
]
Flattening this structure with jq into one object per line, then piping through sort | uniq -d, cleanly extracts the duplicate entry:
{"id": 101, "user": "amy"}
Even complex, nested data is no match for piping UNIX utils wisely!
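A concrete form of that pipeline, assuming the array lives in a hypothetical users.json and jq is installed:

```shell
# Recreate the sample array on disk (in practice it already exists).
cat > users.json <<'EOF'
[
  {"id": 101, "user": "amy"},
  {"id": 102, "user": "bob"},
  {"id": 103, "user": "carol"},
  {"id": 101, "user": "amy"}
]
EOF

# jq -c flattens the array to one compact object per line, so exact
# duplicates become duplicate lines that sort/uniq can surface.
jq -c '.[]' users.json | sort | uniq -d
# -> {"id":101,"user":"amy"}
```

The compact (-c) output matters: jq normalizes whitespace, so two logically identical objects serialize to byte-identical lines.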
Filter Outliers with -u
The flipside of -d, using -u instead shows lines that appear only once in the input. These unique outliers are invaluable for quick analysis:
sort logdata.csv | uniq -u > outliers.txt
Now my outliers file contains the one-off events, ready to review for anomalies separately from the common records.
Ignoring Variations When Comparing Lines
By default uniq conducts a simple string comparison, but real-world data tends to be messy. Several options add flexibility:
-i – Case Insensitive Comparison
Data entry and exports often lead to inconsistent casing. -i makes uniq ignore case when comparing lines:
Apple
APPLE
apple
Collapses down to a single surviving line:
Apple
Note that the survivor keeps the casing of whichever variant appeared first – uniq compares case-insensitively but never rewrites the text. No case mismatches slipping through!
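A quick self-contained check, pairing sort -f (case-folding sort, so the variants end up adjacent) with uniq -i:

```shell
# Three case variants of the same word collapse to one line.
printf 'Apple\nAPPLE\napple\n' | sort -f | uniq -i | wc -l
```

The count comes out as 1 – all three spellings are treated as one value.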
-w – Compare Only Prefix of Lines
For tracing fields like timestamps, hostnames or session tokens, I'm often only concerned with a leading prefix:
sort web.log | uniq -w 20
Now uniq compares only the first 20 characters of each line; the rest is ignored. Since log lines typically lead with a timestamp, this lets me collapse entries down to time windows, group records that share a common prefix, and so on.
-f – Skip Leading Fields
Log lines often begin with fields that change on every line – timestamps, PIDs, sequence numbers. -f N tells uniq to skip the first N whitespace-separated fields before comparing, focusing the comparison on just the content that matters.
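A minimal sketch of uniq -f, skipping three syslog-style leading fields (month, day, time) so only the message text is compared:

```shell
# The two "disk full" lines differ only in their first three fields,
# so -f 3 treats them as duplicates; no sort needed since they are
# already adjacent.
printf '%s\n' \
  'Jan 01 00:01 disk full' \
  'Jan 01 00:02 disk full' \
  'Jan 01 00:03 link down' \
  | uniq -f 3
# -> Jan 01 00:01 disk full
#    Jan 01 00:03 link down
```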
Analyzing Logs and Reports with uniq
Armed with the knowledge so far, let's walk through some real-world examples applying uniq to analyze infrastructure, debug issues and build reports.
Summary Statistics from Web Data
For my web properties, I leverage uniq to generate usage statistics very simply:
cut -d" " -f7 access.log | sort | uniq -c | sort -nr > top_pages.txt
This cuts out just the request path (field 7 in the common Apache log format), sorts, counts page hits, sorts by volume high-to-low and outputs a usage report!
Adding percentages is just another pipe:
8901 /home 34.23%
7113 /about 27.32%
...
I have this snippet on standby to review site activity.
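The percentage column can come from a short awk pass that buffers the counted lines, totals the counts, then prints each line with its share – a sketch on toy paths rather than a real log:

```shell
# Field 1 of `uniq -c` output is the count; buffer everything, total
# it, then print each buffered line with its percentage of traffic.
printf '%s\n' /home /home /home /about /about \
  | sort | uniq -c | sort -nr \
  | awk '{ line[NR] = $0; count[NR] = $1; total += $1 }
         END { for (i = 1; i <= NR; i++)
                 printf "%s %.2f%%\n", line[i], 100 * count[i] / total }'
```

With this input, /home comes out at 60.00% and /about at 40.00%.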
Detecting Field Collisions Across Data
As data environments grow to multiple sources, sometimes ID numbers or codes can collide across systems.
This causes hard-to-trace issues downstream.
With uniq, detecting collisions is one pipeline:
# Extract the relevant field from each data source
cut -f5 old.csv > ids1.txt
jq -r '.[].id' new.json > ids2.txt
# Concatenate extracted IDs
cat ids1.txt ids2.txt > merged_ids.txt
# Filter to only duplicated IDs between sources
sort merged_ids.txt | uniq -d
Now I have all collided IDs for remediation in one pipeline.
This technique applies to user handles, order numbers, SKUs etc. Comparing across data sets easily isolates collisions.
Traffic Analysis from Web Server Logs
For one infrastructure migration, I needed breakdowns of application traffic by geo-location and domains connected.
Using uniq with Apache combined-format logs, within minutes I had:
awk '{print $1}' www.log | sort | uniq -c | sort -nr > top_ips.txt
awk -F'"' '{print $4}' www.log | cut -d/ -f3 | sort | uniq -c | sort -nr > top_domains.txt
awk '{print $NF}' www.log | sort | uniq -c | sort -nr | head -10 > top_countries.txt
(The second command splits on double quotes to isolate the Referer field; the third assumes the log format was extended to append a GeoIP country code as the last field – stock Apache logs don't record countries.)
This gave me:
- Top client IP addresses
- Top referring domains
- Top visitor countries
With simple tools, I could profile traffic exactly to inform migration planning and optimization.
Advanced uniq Pipelines
The examples so far just scratch the surface. By chaining multiple instances of filtering, sorting pipes and uniq, you can wrangle text programmatically to tackle fairly complex requirements.
Multi-Pass Uniq Filtering
Multiple chained uniq passes can each tackle a discrete question. A classic example counts values, then counts the counts – a frequency histogram:
sort data.txt | uniq -c | awk '{print $1}' | sort -n | uniq -c
This flexibly:
- Dedupes the input and counts occurrences of each line
- Extracts just the count column
- Tallies how many lines occurred once, twice, three times, and so on
Chaining lets you address each stage of the analysis independently.
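A double-uniq chain of this kind, run on toy input: the first uniq -c counts each value, the second counts how often each frequency occurs.

```shell
# a appears 3x, b 2x, c 1x -> one value occurs once, one twice, one
# three times, so the final histogram has three rows.
printf '%s\n' a b a c b a \
  | sort | uniq -c \
  | awk '{print $1}' | sort -n | uniq -c
```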
Further Filtering with grep
Beyond uniq, pipes naturally incorporate many other Unix commands:
sort file.json | uniq -c | grep -E '^ *[0-9]{3,}'
Here grep filters further, keeping only lines whose leading count has three or more digits (uniq -c left-pads the count with spaces, hence the leading-space match). The output is reduced to records occurring 100 or more times.
The ability to connect utilities makes complex workflows concise and approachable.
Set Operations with comm
While not directly using uniq, the comm tool performs set operations useful alongside it:
comm -13 <(sort file1) <(sort file2)
This shows lines that appear only in the second file: -1 suppresses lines unique to file1 and -3 suppresses lines common to both. I use it to analyze record diffs during migration or sync processes.
The sets can flexibly come from multiple uniq pipelines:
sort data | uniq -u > db.txt
sort archv | uniq -u > full.txt
comm -23 db.txt full.txt
Now the output is records present in the database extract but missing from the archive. This helps identify data extraction gaps quickly.
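A compact, self-contained check of comm's column logic, with temp files standing in for the real exports:

```shell
# db has a,b,c; archive has b,c,d. -23 suppresses columns 2 and 3
# (lines unique to the second file, lines common to both), leaving
# only lines unique to the first file.
db=$(mktemp); archive=$(mktemp)
printf 'a\nb\nc\n' > "$db"
printf 'b\nc\nd\n' > "$archive"
comm -23 "$db" "$archive"    # prints: a
rm -f "$db" "$archive"
```

Remember that comm requires both inputs to be sorted, which the upstream sort | uniq stages already guarantee.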
Alternatives Beyond Uniq
While uniq is my typical tool of choice, there are alternatives that come up depending on context:
comm
As shown above, the comm command provides set operations contrasting text files. It has more flexibility than uniq alone in comparing multiple sources.
fgrep
To search input streams for fixed strings, fgrep (equivalent to grep -F) avoids regular-expression overhead. It suits hunting for known duplicate values directly, without a sort-and-compare pass.
awk
For tabular data, awk can manipulate fields and compare values with programming logic. It shines with strict field-based comparisons.
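For instance, a well-known awk idiom dedupes on a chosen field without any sorting, preserving input order (the field number and sample values here are illustrative):

```shell
# Print a line only the first time its key (field 1) is seen; the
# seen[] array remembers which keys have already appeared.
printf '%s\n' 'amy 10' 'bob 20' 'amy 30' \
  | awk '!seen[$1]++'
# -> amy 10
#    bob 20
```

Unlike sort | uniq, this keeps the original ordering and needs only one pass – handy when the input is large or order matters.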
Conclusion
With the above guide, you should have a strong grasp on wielding uniq effectively day-to-day. The techniques covered here provide immense leverage any time you face analyzing, filtering or reporting on text data.
To recap, the key capabilities of uniq:
- Removing duplicate lines
- Counting occurrences
- Isolating duplicates and outliers
- Advanced multi-pass pipelines
Paired with sort, grep, awk and other Unix fundamentals, uniq cuts down wasted time continually wrangling text.
I hope this guide has provided both a comprehensive reference and inspiration for applying uniq. Mastering these tools is what unlocks the proficiency and productivity Unix is renowned for.
Now go forth and uniq! Let me know in the comments about any other favorite tricks or use cases you come across.