Duplodocus CLI

High-performance exact and fuzzy (MinHash) document deduplication tool, natively implemented in Rust for processing large-scale JSONL datasets.

Overview

This tool provides four deduplication strategies optimized for different dataset sizes and requirements:

Method            Storage     Best For
Exact + Memory    In-memory   Small datasets (<10GB), simple exact matching
Exact + Disk      Disk-based  Large datasets, exact matching, distributed processing
MinHash + Memory  In-memory   Small datasets (<10GB), fuzzy matching
MinHash + Disk    Disk-based  Large datasets, fuzzy matching, distributed processing

Key Features

  • Exact Deduplication: Removes documents with identical content using fast hash-based matching
  • Fuzzy Deduplication: Identifies near-duplicates using MinHash LSH based on Lee et al. 2021
  • Scalable: Memory-based for simplicity or disk-based for datasets that don't fit in RAM
  • Distributed: Disk-based methods support parallel processing across multiple machines
  • Flexible: Annotate duplicates or remove them entirely

Theory

Notes on the theory behind this tooling and details about its internals can be found in the primer.

Installation

Prerequisites

  • Rust toolchain (1.70+)
  • Git

AWS EC2 Setup (Optional)

For large-scale processing on AWS i4i/i7i instances with NVMe drives:

# Configure RAID0 array from NVMe drives
sudo yum install mdadm -y
sudo mdadm --create /dev/md0 --level=0 --raid-devices=8 \
  /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 \
  /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1 /dev/nvme8n1
sudo mkfs.xfs /dev/md0
sudo mkdir /mnt/raid0
sudo mount /dev/md0 /mnt/raid0
sudo chown -R $USER /mnt/raid0

# Install build dependencies
sudo yum install gcc cmake openssl-devel g++ htop git -y

# Install s5cmd for fast S3 transfers
wget https://github.com/peak/s5cmd/releases/download/v2.2.2/s5cmd_2.2.2_Linux-64bit.tar.gz
tar -xvzf s5cmd_2.2.2_Linux-64bit.tar.gz
sudo mv s5cmd /usr/local/bin

Build from Source

# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source ~/.bashrc

# Clone and build
git clone git@github.com:allenai/duplodocus.git
cd duplodocus
cargo build --release

# Binary will be at: ./target/release/dedup-tool

Download Data (if using S3)

# Configure AWS credentials
aws configure

# Download JSONL files
s5cmd cp -sp s3://your-bucket/path/to/data/* /mnt/raid0/input_data/

Quick Start

Exact Deduplication (Small Dataset)

Remove documents with identical content:

cargo run --release -- exact-dedup-memory \
  --input-dir /data/documents \
  --output-dir /data/unique \
  --text-key "content"
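
Input directories are expected to hold JSONL files, one JSON document per line, with the text to compare stored under the field named by --text-key (the file name and record below are purely illustrative):

# Peek at one (hypothetical) input record; --text-key "content" points at the "content" field
head -1 /data/documents/example.jsonl
# {"id": "doc-0001", "content": "The quick brown fox jumps over the lazy dog."}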

Fuzzy Deduplication (Small Dataset)

Find and remove near-duplicates:

cargo run --release -- minhash-memory \
  --input-dir /data/documents \
  --storage-dir /tmp/work \
  --output-dir /data/deduped \
  --text-key "text" \
  --num-buckets 20 \
  --bucket-size 5 \
  --remove-duplicates true \
  --cleanup-storage

Deduplication Methods

Exact Deduplication

Memory-Based (Simple)

Best for datasets under 100GB. Processes everything in one pass:

cargo run --release -- exact-dedup-memory \
  --input-dir /data/docs \
  --output-dir /data/unique \
  --text-key "content" \
  --annotate-key "duplicate_info"  # Optional: annotate instead of remove

Options:

  • --hash-key: Use a pre-computed hash field instead of hashing the text (see the sketch below for one way to add such a field)
  • --hash-bits: Number of bits for hash (default: 128)
  • --annotate-key: Add duplicate metadata instead of removing documents
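
If your records carry no precomputed hash yet, one can be added ahead of time. Below is a minimal, deliberately slow sketch using jq and sha256sum (one process per line); any stable hash of the text field works for exact matching, and the field and file names here are only examples:

# Add a "doc_hash" field derived from the "content" field (illustrative only;
# for large datasets, compute hashes in your ingestion pipeline instead)
while IFS= read -r line; do
  h=$(printf '%s' "$line" | jq -r '.content' | sha256sum | cut -d' ' -f1)
  printf '%s' "$line" | jq -c --arg h "$h" '. + {doc_hash: $h}'
done < /data/docs/example.jsonl > /data/docs_hashed/example.jsonl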

Disk-Based (Distributed)

For large datasets or distributed processing:

Step 1: Group documents by hash

cargo run --release -- exact-dedup-disk-group \
  --input-dir /data/docs \
  --storage-dir /scratch/work \
  --hash-key "doc_hash" \
  --num-bins 100

Step 2: Remove duplicates

cargo run --release -- exact-dedup-disk-prune \
  --storage-dir /scratch/work \
  --output-dir /data/unique \
  --hash-key "doc_hash"

Fuzzy Deduplication (MinHash)

Memory-Based (Simple)

All-in-one fuzzy deduplication for smaller datasets:

cargo run --release -- minhash-memory \
  --input-dir /data/docs \
  --storage-dir /tmp/work \
  --output-dir /data/deduped \
  --text-key "text" \
  --num-buckets 20 \
  --bucket-size 5 \
  --ngram-size 5 \
  --remove-duplicates true \
  --cleanup-storage

Key Parameters:

  • --num-buckets: Number of LSH bands (more bands lower the effective similarity threshold, i.e. looser matching; default: 20)
  • --bucket-size: Hashes per band (more hashes per band = stricter matching; default: 5); see the threshold estimate after this list
  • --ngram-size: N-gram size for document shingling (default: 5)
  • --tokenizer: One of "cl100k", "p50k", "uniseg", or character-level
  • --config: Optional YAML config file covering all parameters
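
As a rule of thumb (using the standard LSH banding approximation; the tool's exact behavior may differ slightly), the Jaccard similarity at which near-duplicates start being reliably caught can be estimated from these two parameters as (1/b)^(1/r), where b is --num-buckets and r is --bucket-size:

# Approximate similarity threshold for the defaults (b = 20 bands, r = 5 hashes per band)
awk -v b=20 -v r=5 'BEGIN { printf "approximate threshold: %.2f\n", (1/b)^(1/r) }'
# prints: approximate threshold: 0.55

With the values in the YAML example further below (26 buckets of size 11), the estimate rises to roughly 0.74.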

Disk-Based (Distributed)

For large-scale distributed processing across multiple machines:

Step 1: Build file map (run once)

cargo run --release -- mh-build-file-map \
  --input-dir /data/docs \
  --storage-dir /shared/work

Step 2: Hash documents (parallel across workers)

# Worker 0
cargo run --release -- mh-hash-docs \
  --local-input /data/docs \
  --storage-dir /shared/work \
  --text-key "text" \
  --path-chunk 0 \
  --num-path-chunks 10 \
  --num-buckets 20 \
  --bucket-size 5

# Worker 1
cargo run --release -- mh-hash-docs \
  --local-input /data/docs \
  --storage-dir /shared/work \
  --text-key "text" \
  --path-chunk 1 \
  --num-path-chunks 10 \
  --num-buckets 20 \
  --bucket-size 5

# ... repeat for workers 2-9
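
The remaining chunks follow the same pattern. As a sketch, they could all be launched from one machine with a simple loop (in a genuinely distributed run, each chunk would instead be assigned to its own worker):

# Run the remaining chunks sequentially on a single machine (illustrative only)
for chunk in $(seq 2 9); do
  cargo run --release -- mh-hash-docs \
    --local-input /data/docs \
    --storage-dir /shared/work \
    --text-key "text" \
    --path-chunk "$chunk" \
    --num-path-chunks 10 \
    --num-buckets 20 \
    --bucket-size 5
done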

Step 3: Gather edges (run once, requires all signatures)

cargo run --release -- mh-gather-edges \
  --storage-dir /shared/work

Step 4: Build Union-Find (run once on single machine)

cargo run --release -- mh-build-uf \
  --storage-dir /shared/work \
  --num-path-chunks 10

Step 5: Clean files (parallel across workers)

# Worker 0
cargo run --release -- mh-clean-files \
  --input-dir /data/docs \
  --storage-dir /shared/work \
  --output-dir /data/deduped \
  --path-chunk 0 \
  --num-path-chunks 10 \
  --remove-duplicates true

# Repeat for other workers...
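
Once every worker has finished, a quick sanity check (assuming plain, uncompressed .jsonl files) is to compare total document counts before and after cleaning:

# Total line (document) counts for the input vs. the deduplicated output
wc -l /data/docs/*.jsonl | tail -1
wc -l /data/deduped/*.jsonl | tail -1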

Examples

Detailed examples with step-by-step instructions are available in the examples/ directory:

  • examples/exact_simple/ - Simple exact deduplication
  • examples/exact_multi/ - Distributed exact deduplication
  • examples/fuzzy_simple/ - Simple fuzzy deduplication
  • examples/fuzzy_multi/ - Distributed fuzzy deduplication
  • examples/essential/ - Essential patterns and best practices

Configuration

YAML Configuration (Optional)

For complex setups, you can use a YAML config file:

# minhash_config.yaml
minhash_params:
  num_buckets: 26
  bucket_size: 11
  ngram_size: 5
  permutation_seed: 42
  tokenizer: "cl100k_base"
eng_params:
  num_docs: 1000000
  max_lines_per_path: 100000
  num_sig_chunks: 8
output_params:
  annotate: false
  annotate_key: metadata.minhash # minhash output data location
  remove_duplicates: true # drop duplicates from the output (set to false to only annotate)
  delete_while_cleaning: false

Use with:

cargo run --release -- minhash-memory \
  --input-dir /data/docs \
  --storage-dir /tmp/work \
  --output-dir /data/deduped \
  --text-key "text" \
  --config minhash_config.yaml

System Requirements

Memory-Based Methods

  • RAM: Dataset size + 2-3GB overhead
  • Storage: Input size + output size
  • Best for: Datasets under 100GB

Disk-Based Methods

  • RAM: ~8-16GB minimum
  • Storage: 3-5x input dataset size (for intermediate files)
  • Fast local storage strongly recommended (NVMe/SSD)
  • Best for: Datasets over 100GB or distributed processing

Recommended Instances (AWS)

  • Small jobs: Any instance with enough memory to fit the dataset in RAM.
  • Large jobs: i4i.32xlarge or larger (NVMe storage)
  • Distributed: Multiple i4i.32xlarge instances

Design Principles

  • No remote I/O in Rust: All S3 interaction happens outside Rust (use s5cmd, boto3, etc.); see the upload example below
  • Fast local storage: Assumes fast disk for intermediate files
  • Small file assumption: Individual JSONL files should fit in memory
  • Unique basenames: Input files must have unique basenames within input directory
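
For example, once a run finishes, the deduplicated output can be pushed back to S3 with s5cmd (the bucket path is a placeholder):

s5cmd cp "/data/deduped/*" s3://your-bucket/path/to/output/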

Performance Tips

  1. Use RAID0 for NVMe drives on cloud instances for maximum I/O throughput
  2. Adjust --num-path-chunks based on available workers
  3. Monitor disk space - intermediate files can be 3-5x input size (see the quick check below)
  4. Use --cleanup-storage carefully in distributed settings
  5. Set appropriate --num-buckets and --bucket-size for your similarity threshold
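
For tip 3, a quick way to compare input size against free scratch space before a run (paths are illustrative; adjust to your layout):

# Input size vs. free space on the scratch volume
du -sh /mnt/raid0/input_data
df -h /scratch/work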

Troubleshooting

Out of memory errors: Use disk-based methods instead of memory-based

Slow performance: Ensure you're using fast local storage (NVMe/SSD), not network storage

Missing intermediate files: Ensure all parallel steps complete before running sequential steps
