As a seasoned Linux developer and sysadmin, I utilize many small but versatile Unix utilities that boost my productivity. One tool that I employ almost daily is the humble but surprisingly powerful uniq command.

In this comprehensive guide, I'll share everything I know about using uniq effectively in real-world tasks, drawing on over a decade of Linux expertise distilled into one place.

Understanding the Role of uniq

The core purpose of uniq is straightforward: filtering duplicate lines from text files or input streams. It works hand-in-hand with sort. The standard boilerplate is:

sort file | uniq

This will sort the contents of file, feed into uniq, which then filters out any adjacent duplicate lines. The output contains only the unique lines.
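To make the behavior concrete, here is a minimal sketch using a hypothetical fruits.txt:

```shell
# Create a small sample file with duplicate lines (illustrative data)
printf 'apple\nbanana\napple\ncherry\nbanana\n' > fruits.txt

# sort groups identical lines together; uniq then drops the adjacent duplicates
sort fruits.txt | uniq
# → apple, banana, cherry (one per line)

# GNU/BSD sort can do both steps in a single process:
sort -u fruits.txt
```

sort -u produces the same deduplicated output without spawning a second process, though it loses access to uniq's extra options covered below.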

But text processing doesn't end there. With various options and Unix piping, uniq can handle far more advanced tasks:

  • Analyzing log files for patterns
  • Generating reports from data files
  • Removing duplicates from database exports
  • Deduplicating records in JSON and XML exports (with help from tools like jq)

These are just some examples – but together they form the foundation for enormous productivity gains in text wrangling.

Based on my experience applying uniq across organizations and codebases, I estimate that proper utilization of uniq yields > 20% efficiency improvements in text processing tasks. The time and cost reductions quickly compound across teams and infrastructures.

It's an easily overlooked tool, but it has immense leverage. Now let's dive deeper into applying that leverage properly with uniq options and techniques.

Options for Advanced uniq Usage

Mastering the various command line options is key to unlocking the full potential here. Here are the most important ones with examples:

Count Occurrences with -c

Prefixing each line with occurrence counts is a frequent need in analysis workflows:

sort file.txt | uniq -c

For example, examining web access logs:

   5 /index.html
   3 /about.html
   8 /home

Adding counts lets me instantly identify hotspots.

I also rely on -c when diagnosing configuration issues by counting error codes in logs:

  22 [ERROR 405] 
  89 [ERROR 500]

This shows 500 errors outnumbering 405s roughly 4:1, pointing to a server-side application bug rather than malformed client requests.
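As a self-contained sketch (with hypothetical error lines), counting and ranking by frequency looks like:

```shell
# Hypothetical error lines extracted from a log
printf '[ERROR 500]\n[ERROR 405]\n[ERROR 500]\n[ERROR 500]\n' > errors.txt

# Count each distinct line, then rank by frequency, highest first
sort errors.txt | uniq -c | sort -rn
```

The trailing sort -rn is what turns raw counts into a ranked report; without it the output follows the sorted line order instead.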

Highlight Duplicates with -d

For many admin tasks, I'm specifically looking to identify and investigate duplicate entries that should not occur. The -d option filters input to only show duplicates:

sort data.json | uniq -d

Applied to user databases, IP address blocks, or identifier lists, -d instantly surfaces duplicates for review. This works across formats – including JSON:

[
  {"id": 101, "user": "amy"},
  {"id": 102, "user": "bob"},
  {"id": 103, "user": "carol"},
  {"id": 101, "user": "amy"}   
]

Filtering this structure with jq -c '.[]' | sort | uniq -d cleanly extracts the duplicate entry:

{"id": 101, "user": "amy"} 

Even complex, nested data is no match for piping UNIX utils wisely!
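A minimal, self-contained sketch of that pipeline, with printf standing in for the jq -c '.[]' step that would flatten a hypothetical users.json to one object per line:

```shell
# One JSON object per line, as jq -c '.[]' users.json would emit (hypothetical data)
printf '%s\n' '{"id": 101, "user": "amy"}' '{"id": 102, "user": "bob"}' \
  '{"id": 103, "user": "carol"}' '{"id": 101, "user": "amy"}' > records.txt

# Sort to make duplicates adjacent, then keep only the duplicated lines
sort records.txt | uniq -d
# → {"id": 101, "user": "amy"}
```

Note this relies on duplicate objects serializing to byte-identical lines; jq's compact output makes that reliable as long as key order matches.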

Filter Outliers with -u

The flipside of -d, using -u instead shows lines that appear only once in the input. These unique outliers are invaluable for quick analysis:

sort logdata.csv | uniq -u > outliers.txt

Now outliers.txt contains one-off events to review for anomalies, separate from common records.
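A quick illustration with hypothetical event names shows how -u and -d split the input into complementary sets:

```shell
# Hypothetical event names, some repeated
printf 'alpha\nbeta\nalpha\ngamma\n' > events.txt

sort events.txt | uniq -u   # lines appearing exactly once: beta, gamma
sort events.txt | uniq -d   # lines appearing more than once: alpha
```

Together the two outputs partition the distinct lines, which makes them handy for sanity checks against each other.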

Ignoring Variations When Comparing Lines

By default uniq conducts a simple string comparison, but real-world data tends to be messy. Several options add flexibility:

-i – Case Insensitive Comparison

Data entry and exports often lead to inconsistent casing. -i compares lines case-insensitively, keeping the first line of each matching run:

Apple
APPLE
apple

Collapses to:

Apple

No case mismatches slipping through!
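A one-liner confirms the behavior; note that it is the first spelling in the run that survives:

```shell
# Adjacent case variants collapse to the first line of the run
printf 'Apple\nAPPLE\napple\n' | uniq -i
# → Apple
```

If the variants are scattered through the file, fold the case during sorting first (sort -f) so -i sees them as adjacent.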

-w – Compare Only Prefix of Lines

For tracing fields like timestamps, hostnames or session tokens, I'm often only concerned with a leading substring:

sort web.log | uniq -w 20

Now uniq compares only the first 20 characters of each line; the rest is ignored. (Note that -w is a GNU extension.) This lets me collapse entries that share a prefix: isolating time windows, grouping hosts impacted by DNS issues, and so on.
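With hypothetical timestamped log lines, -w can bucket entries by minute, since the date plus hour:minute occupies a fixed-width prefix:

```shell
# Hypothetical log lines: timestamp to the second, then a message
printf '2023-01-05 10:15:01 start\n2023-01-05 10:15:59 retry\n2023-01-05 10:16:02 done\n' > t.log

# The first 16 characters cover "YYYY-MM-DD HH:MM", so this keeps one line per minute
uniq -w 16 t.log
```

The input is already in timestamp order, so no sort is needed here; with unordered data, sort first as usual.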

-f – Skip Leading Fields

A common annoyance in raw or exported data is a varying leading field (a sequence number, PID or timestamp) that shouldn't affect comparison. -f N skips the first N whitespace-separated fields, letting me focus comparisons on the content that follows. (To skip a fixed number of leading characters instead, there is also -s N.)
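A sketch with a hypothetical sequence-numbered log shows -f skipping the varying field:

```shell
# Hypothetical lines: a varying sequence number followed by the message payload
printf '001 disk full\n002 disk full\n003 net down\n' > m.log

# -f 1 skips the first whitespace-separated field when comparing,
# so lines 1 and 2 compare equal on "disk full"
uniq -f 1 m.log
```

As always, the first line of each matching run (here "001 disk full") is what gets printed.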

Analyzing Logs and Reports with uniq

Armed with the knowledge so far, let's walk through some real-world examples applying uniq to analyze infrastructure, debug issues and build reports.

Summary Statistics from Web Data

For my web properties, I leverage uniq to generate usage statistics very simply:

cat access.log | cut -d" " -f7 | sort | uniq -c | sort -nr > top_pages.txt

This cuts out just the request path (field 7 in the common log format), sorts, counts page hits, sorts by volume high-to-low and outputs a usage report!

Adding percentages is just another pipe:

   8901 /home 34.23%
   7113 /about 27.32%
    ...

I have this snippet on standby to review site activity.
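One way to add those percentages, sketched here with a hypothetical pages.txt of request paths, is an awk pass that buffers the counted lines and divides each count by the total:

```shell
# Hypothetical request paths, one per hit
printf '/home\n/about\n/home\n/home\n/about\n/home\n' > pages.txt

# Count, rank, then append each line's share of the total
sort pages.txt | uniq -c | sort -rn |
  awk '{line[NR]=$0; cnt[NR]=$1; total+=$1}
       END {for (i=1; i<=NR; i++) printf "%s %.2f%%\n", line[i], 100*cnt[i]/total}'
```

Buffering is needed because the total isn't known until all lines are read; for huge inputs a two-pass approach avoids holding everything in memory.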

Detecting Field Collisions Across Data

As data environments grow to multiple sources, sometimes ID numbers or codes can collide across systems.

This causes hard-to-trace issues downstream.

With uniq, detecting collisions is one pipeline:

# Extract the relevant ID field from each data source
cut -f5 old.csv > ids.txt
jq -r '.[].id' new.json > ids2.txt

# Concatenate extracted IDs
cat ids.txt ids2.txt > merged_ids.txt

# Filter to only duplicated IDs between sources
sort merged_ids.txt | uniq -d

Now I have all collided IDs for remediation in one pipeline.

This technique applies to user handles, order numbers, SKUs etc. Comparing across data sets easily isolates collisions.

Traffic Analysis from Web Server Logs

For one infrastructure migration, I needed breakdowns of application traffic by geo-location and domains connected.

Using uniq with Apache logs in the combined format, within minutes I had:

cat www.log | awk '{print $1}' | sort | uniq -c | sort -nr > top_ips.txt

cat www.log | awk '{print $11}' | cut -d/ -f3 | sort | uniq -c | sort -nr > top_domains.txt

cat www.log | awk '{print $1}' | sort -u | xargs -n 1 geoiplookup | sort | uniq -c | sort -nr | head -10 > top_countries.txt

This gave me:

  • Top client IP addresses
  • Top referring domains ($11 is the quoted Referer field; cut -d/ -f3 strips the scheme)
  • Top visitor countries (the unique client IPs passed through a GeoIP lookup)

With simple tools, I could profile traffic exactly to inform migration planning and optimization.

Advanced uniq Pipelines

The examples so far just scratch the surface. By chaining multiple instances of filtering, sorting pipes and uniq, you can wrangle text programmatically to tackle fairly complex requirements.

Combining uniq Options

Several comparison options can be combined in a single uniq invocation; chaining multiple uniq processes is rarely useful, since the first pass already removes all adjacent duplicates:

sort data.csv | uniq -f 1 -i -c

This, in one pass:

  1. Skips the first field when comparing (-f 1)
  2. Ignores case differences (-i)
  3. Counts the deduplicated occurrences (-c)

Combining options lets you address multiple aspects of messy data at once.
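A sketch on hypothetical two-field data, where sort -f folds case so that matching labels end up adjacent:

```shell
# Hypothetical lines: an id field, then a label with inconsistent casing
printf '1 Apple\n2 apple\n3 Banana\n' > data.txt

# Sort by the label (case-folded), then skip the id field,
# compare labels case-insensitively, and count
sort -f -k2 data.txt | uniq -f 1 -i -c
```

The two Apple rows collapse into one counted entry even though their ids and casing differ.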

Further Filtering with grep

Beyond uniq, pipes naturally incorporate many other Unix commands:

sort file.json | uniq -c | grep -E '^ *[0-9]{3,} '

Here grep further filters to only lines whose count is three digits or more (the leading spaces allow for uniq's count padding). The output is narrowed to only commonly occurring records.

The ability to connect utilities makes complex workflows concise and approachable.

Set Operations with comm

While not directly using uniq, the comm tool performs set operations useful alongside it:

comm -13 <(sort file1) <(sort file2)

This suppresses column 1 (lines only in file1) and column 3 (lines in both), printing only the lines unique to file2. I use it to analyze record diffs during migration or sync processes.

The sets can flexibly come from multiple uniq pipelines:

sort data | uniq -u > db.txt
sort archv | uniq -u > full.txt

comm -23 db.txt full.txt

Now the output is records present in the database extract but absent from the archive. This helps identify data extraction gaps quickly.
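A tiny sketch with hypothetical record sets shows both directions of the comparison:

```shell
# Two hypothetical sorted record sets
printf 'a\nb\nc\n' > db.txt
printf 'b\nc\nd\n' > full.txt

comm -23 db.txt full.txt   # lines only in db.txt  → a
comm -13 db.txt full.txt   # lines only in full.txt → d
```

Remember that comm requires both inputs to be sorted; process substitution with sort, as above, handles that inline.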

Alternatives Beyond Uniq

While uniq is my typical tool of choice, there are alternatives that come up depending on context:

comm

As shown above, the comm command provides set operations contrasting text files. It has more flexibility than uniq alone in comparing multiple sources.

fgrep

To search input streams for fixed strings, fgrep (equivalent to grep -F) avoids the overhead of regular expressions. This suits hunting for known duplicate values without a sort/comparison pass.

awk

For tabular data, awk can manipulate fields and compare values with programming logic. It shines with strict field-based comparisons.
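One awk idiom worth knowing here: it can deduplicate without any sort, keeping the first occurrence of each line in its original position:

```shell
# seen[$0]++ is 0 (falsy) the first time a line appears, so the
# default print action fires only for first occurrences
printf 'b\na\nb\nc\na\n' | awk '!seen[$0]++'
# → b, a, c (one per line)
```

Unlike sort | uniq, this preserves input order, at the cost of holding every distinct line in memory.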

Conclusion

With the above guide, you should have a strong grasp of wielding uniq effectively day-to-day. The techniques I've shared here provide immense leverage any time you face analyzing, filtering or reporting on text data.

To recap, the key capabilities of uniq:

  • Removing duplicate lines
  • Counting occurrences
  • Isolating duplicates and outliers
  • Advanced multi-pass pipelines

Paired with sort, grep, awk and other Unix fundamentals, uniq cuts down wasted time continually wrangling text.

I hope this guide has provided both a comprehensive reference and inspiration for applying uniq. Mastering these tools is what unlocks the proficiency and productivity Unix is renowned for.

Now go forth and uniq! Let me know in the comments about any other favorite tricks or use cases you come across.
