High-performance exact and fuzzy (MinHash) document deduplication tool, natively implemented in Rust for processing large-scale JSONL datasets.
- Overview
- Theory/Primer
- Installation
- Quick Start
- Deduplication Methods
- Examples
- Configuration
- System Requirements
This tool provides four deduplication strategies optimized for different dataset sizes and requirements:
| Method | Storage | Best For |
|---|---|---|
| Exact + Memory | In-memory | Small datasets (<10GB), simple exact matching |
| Exact + Disk | Disk-based | Large datasets, exact matching, distributed processing |
| MinHash + Memory | In-memory | Small datasets (<10GB), fuzzy matching |
| MinHash + Disk | Disk-based | Large datasets, fuzzy matching, distributed processing |
- Exact Deduplication: Removes documents with identical content using fast hash-based matching
- Fuzzy Deduplication: Identifies near-duplicates using MinHash LSH based on Lee et al. 2021
- Scalable: Memory-based for simplicity or disk-based for datasets that don't fit in RAM
- Distributed: Disk-based methods support parallel processing across multiple machines
- Flexible: Annotate duplicates or remove them entirely
Notes on the theory behind this tooling, along with details about its internals, can be found in the primer.
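If you just want the intuition behind the fuzzy methods before reading the primer: each document is shingled into n-grams, the shingle set is compressed into a MinHash signature, and the signature is split into bands ("buckets") of a few hashes each; two documents become candidate duplicates when any band matches exactly. The sketch below illustrates that idea with plain `std` hashing and whitespace shingles; the function names and the seeded hasher are illustrative, not the tool's internals.
```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash one shingle under a given seed; one seed stands in for one MinHash permutation.
fn seeded_hash(seed: u64, shingle: &[&str]) -> u64 {
    let mut h = DefaultHasher::new();
    seed.hash(&mut h);
    shingle.hash(&mut h);
    h.finish()
}

/// Build a MinHash signature of `num_hashes` values from word n-grams of `text`.
/// (The real tool shingles tokenizer output, e.g. cl100k tokens, not whitespace words.)
fn minhash_signature(text: &str, ngram_size: usize, num_hashes: u64) -> Vec<u64> {
    let words: Vec<&str> = text.split_whitespace().collect();
    let n = ngram_size.min(words.len()).max(1);
    (0..num_hashes)
        .map(|seed| {
            words
                .windows(n)
                .map(|ngram| seeded_hash(seed, ngram))
                .min()
                .unwrap_or(u64::MAX)
        })
        .collect()
}

fn main() {
    // Defaults quoted elsewhere in this README: 20 buckets (bands) of 5 hashes each.
    let (num_buckets, bucket_size, ngram_size) = (20usize, 5usize, 5usize);
    let num_hashes = (num_buckets * bucket_size) as u64;

    let a = minhash_signature("the quick brown fox jumps over the lazy dog today", ngram_size, num_hashes);
    let b = minhash_signature("the quick brown fox jumps over the lazy dog again", ngram_size, num_hashes);

    // LSH banding: the documents become candidate duplicates if ANY band matches exactly.
    let candidate = a
        .chunks(bucket_size)
        .zip(b.chunks(bucket_size))
        .any(|(band_a, band_b)| band_a == band_b);
    println!("candidate duplicates: {candidate}");
}
```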
- Rust toolchain (1.70+)
- Git
For large-scale processing on AWS i4i/i7i instances with NVMe drives:
# Configure RAID0 array from NVMe drives
sudo yum install mdadm -y
sudo mdadm --create /dev/md0 --level=0 --raid-devices=8 \
/dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 \
/dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1 /dev/nvme8n1
sudo mkfs.xfs /dev/md0
sudo mkdir /mnt/raid0
sudo mount /dev/md0 /mnt/raid0
sudo chown -R $USER /mnt/raid0
# Install build dependencies
sudo yum install gcc gcc-c++ cmake openssl-devel htop git -y
# Install s5cmd for fast S3 transfers
wget https://github.com/peak/s5cmd/releases/download/v2.2.2/s5cmd_2.2.2_Linux-64bit.tar.gz
tar -xvzf s5cmd_2.2.2_Linux-64bit.tar.gz
sudo mv s5cmd /usr/local/bin
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source ~/.bashrc
# Clone and build
git clone git@github.com:allenai/duplodocus.git
cd duplodocus
cargo build --release
# Binary will be at: ./target/release/dedup-tool
# Configure AWS credentials
aws configure
# Download JSONL files
s5cmd cp -sp s3://your-bucket/path/to/data/* /mnt/raid0/input_data/
Remove documents with identical content:
cargo run --release -- exact-dedup-memory \
--input-dir /data/documents \
--output-dir /data/unique \
--text-key "content"Find and remove near-duplicates:
cargo run --release -- minhash-memory \
--input-dir /data/documents \
--storage-dir /tmp/work \
--output-dir /data/deduped \
--text-key "text" \
--num-buckets 20 \
--bucket-size 5 \
--remove-duplicates true \
--cleanup-storage
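Both quick-start commands read newline-delimited JSON, one document per line, with the field named by `--text-key` holding the document text. A minimal sketch of that input contract, using `serde_json` here purely for illustration (the tool's own JSONL parsing is not shown):
```rust
use serde_json::Value;

fn main() {
    // One line of a JSONL input file; --text-key names the field holding the text.
    let line = r#"{"id": "doc-0001", "content": "the quick brown fox", "source": "web"}"#;

    let doc: Value = serde_json::from_str(line).expect("each line must be a valid JSON object");
    let text_key = "content"; // matches --text-key "content" above
    let text = doc.get(text_key).and_then(|v| v.as_str()).unwrap_or("");
    println!("text to be hashed / minhashed: {text}");
}
```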
Best for datasets under 100GB. Processes everything in one pass:
cargo run --release -- exact-dedup-memory \
--input-dir /data/docs \
--output-dir /data/unique \
--text-key "content" \
--annotate-key "duplicate_info" # Optional: annotate instead of removeOptions:
- `--hash-key`: Use a pre-computed hash field instead of hashing the text
- `--hash-bits`: Number of bits for the hash (default: 128)
- `--annotate-key`: Add duplicate metadata instead of removing documents
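Conceptually, exact deduplication hashes each document's text (or reads the field named by `--hash-key`) and keeps only the first document seen per digest; with `--annotate-key` set, later copies are annotated rather than dropped. A minimal sketch of that idea, using a 64-bit `std` hash for brevity rather than the 128-bit default mentioned above:
```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Illustrative 64-bit content hash (the tool's --hash-bits defaults to 128).
fn content_hash(text: &str) -> u64 {
    let mut h = DefaultHasher::new();
    text.hash(&mut h);
    h.finish()
}

fn main() {
    let docs = [
        "the quick brown fox",
        "an entirely different document",
        "the quick brown fox", // byte-identical duplicate of the first
    ];

    let mut seen = HashSet::new();
    for (i, text) in docs.iter().enumerate() {
        // Keep a document only if its digest is new; with --annotate-key the
        // tool would tag later copies instead of dropping them.
        if seen.insert(content_hash(text)) {
            println!("keep   doc {i}");
        } else {
            println!("remove doc {i} (exact duplicate)");
        }
    }
}
```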
For large datasets or distributed processing:
Step 1: Group documents by hash
cargo run --release -- exact-dedup-disk-group \
--input-dir /data/docs \
--storage-dir /scratch/work \
--hash-key "doc_hash" \
--num-bins 100
Step 2: Remove duplicates
cargo run --release -- exact-dedup-disk-prune \
--storage-dir /scratch/work \
--output-dir /data/unique \
--hash-key "doc_hash"All-in-one fuzzy deduplication for smaller datasets:
All-in-one fuzzy deduplication for smaller datasets:
cargo run --release -- minhash-memory \
--input-dir /data/docs \
--storage-dir /tmp/work \
--output-dir /data/deduped \
--text-key "text" \
--num-buckets 20 \
--bucket-size 5 \
--ngram-size 5 \
--remove-duplicates true \
--cleanup-storage
Key Parameters:
- `--num-buckets`: Number of LSH bands (more bands = more candidate pairs, i.e., looser matching; default: 20)
- `--bucket-size`: Hashes per band (more = stricter matching, default: 5)
- `--ngram-size`: N-gram size for document shingling (default: 5)
- `--tokenizer`: Options: "cl100k", "p50k", "uniseg", or character-level
- `--config`: Optional YAML config file for all parameters
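Under the standard LSH banding analysis, two documents with Jaccard similarity s share at least one band with probability 1 - (1 - s^r)^b, where b is the number of bands (`--num-buckets`) and r the hashes per band (`--bucket-size`); the curve crosses 0.5 near (1/b)^(1/r). The snippet below evaluates that textbook formula for the defaults as a rough guide for picking parameters; it is not a statement about the tool's exact behaviour.
```rust
/// Probability that two documents with Jaccard similarity `s` share at least
/// one LSH band, given `b` bands of `r` hashes each (textbook banding formula).
fn collision_probability(s: f64, b: u32, r: u32) -> f64 {
    1.0 - (1.0 - s.powi(r as i32)).powi(b as i32)
}

fn main() {
    let (b, r) = (20, 5); // --num-buckets 20, --bucket-size 5 (the defaults)

    // Approximate similarity threshold where the S-curve crosses ~0.5.
    let threshold = (1.0 / b as f64).powf(1.0 / r as f64);
    println!("approximate threshold: {threshold:.2}"); // ~0.55 for b=20, r=5

    for s in [0.3, 0.5, 0.7, 0.9] {
        println!("s = {s:.1} -> P(candidate) = {:.3}", collision_probability(s, b, r));
    }
}
```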
For large-scale distributed processing across multiple machines:
Step 1: Build file map (run once)
cargo run --release -- mh-build-file-map \
--input-dir /data/docs \
--storage-dir /shared/work
Step 2: Hash documents (parallel across workers)
# Worker 0
cargo run --release -- mh-hash-docs \
--local-input /data/docs \
--storage-dir /shared/work \
--text-key "text" \
--path-chunk 0 \
--num-path-chunks 10 \
--num-buckets 20 \
--bucket-size 5
# Worker 1
cargo run --release -- mh-hash-docs \
--local-input /data/docs \
--storage-dir /shared/work \
--text-key "text" \
--path-chunk 1 \
--num-path-chunks 10 \
--num-buckets 20 \
--bucket-size 5
# ... repeat for workers 2-9
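The `--path-chunk` / `--num-path-chunks` pair shards the input files across workers so each file is hashed exactly once. One plausible assignment is round-robin by file index, sketched below for illustration; the actual file map built in Step 1 may partition differently:
```rust
/// Illustrative round-robin sharding: worker `path_chunk` handles every file
/// whose index satisfies index % num_path_chunks == path_chunk.
fn files_for_chunk<'a>(files: &[&'a str], path_chunk: usize, num_path_chunks: usize) -> Vec<&'a str> {
    files
        .iter()
        .enumerate()
        .filter(|&(i, _)| i % num_path_chunks == path_chunk)
        .map(|(_, &f)| f)
        .collect()
}

fn main() {
    let files = ["a.jsonl", "b.jsonl", "c.jsonl", "d.jsonl", "e.jsonl"];
    let num_path_chunks = 10; // matches --num-path-chunks 10 above
    for chunk in 0..num_path_chunks {
        let assigned = files_for_chunk(&files, chunk, num_path_chunks);
        if !assigned.is_empty() {
            println!("worker {chunk} hashes {assigned:?}");
        }
    }
}
```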
Step 3: Gather edges (run once, requires all signatures)
cargo run --release -- mh-gather-edges \
--storage-dir /shared/work
Step 4: Build Union-Find (run once on single machine)
cargo run --release -- mh-build-uf \
--storage-dir /shared/work \
--num-path-chunks 10
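Step 4 collapses the duplicate-pair edges gathered in Step 3 into clusters using a union-find (disjoint-set) structure, so each cluster can later keep a single representative document. A minimal in-memory sketch of that structure (the tool additionally has to work over the chunked on-disk data):
```rust
/// Minimal union-find with path compression.
struct UnionFind {
    parent: Vec<usize>,
}

impl UnionFind {
    fn new(n: usize) -> Self {
        Self { parent: (0..n).collect() }
    }

    fn find(&mut self, x: usize) -> usize {
        let p = self.parent[x];
        if p == x {
            return x;
        }
        let root = self.find(p);
        self.parent[x] = root; // path compression
        root
    }

    fn union(&mut self, a: usize, b: usize) {
        let (ra, rb) = (self.find(a), self.find(b));
        if ra != rb {
            self.parent[rb] = ra;
        }
    }
}

fn main() {
    // Edges produced by the LSH step: pairs of document ids that shared a band.
    let edges = [(0, 1), (1, 2), (5, 6)];
    let mut uf = UnionFind::new(8);
    for (a, b) in edges {
        uf.union(a, b);
    }
    // Keep one representative per cluster; everything else is a duplicate.
    for doc in 0..8 {
        let root = uf.find(doc);
        if root == doc {
            println!("doc {doc}: keep");
        } else {
            println!("doc {doc}: duplicate of {root}");
        }
    }
}
```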
Step 5: Clean files (parallel across workers)
# Worker 0
cargo run --release -- mh-clean-files \
--input-dir /data/docs \
--storage-dir /shared/work \
--output-dir /data/deduped \
--path-chunk 0 \
--num-path-chunks 10 \
--remove-duplicates true
# Repeat for other workers...
Detailed examples with step-by-step instructions are available in the examples/ directory:
- `examples/exact_simple/` - Simple exact deduplication
- `examples/exact_multi/` - Distributed exact deduplication
- `examples/fuzzy_simple/` - Simple fuzzy deduplication
- `examples/fuzzy_multi/` - Distributed fuzzy deduplication
- `examples/essential/` - Essential patterns and best practices
For complex setups, you can use a YAML config file:
# minhash_config.yaml
minhash_params:
  num_buckets: 26
  bucket_size: 11
  ngram_size: 5
  permutation_seed: 42
  tokenizer: "cl100k_base"
eng_params:
  num_docs: 1000000
  max_lines_per_path: 100000
  num_sig_chunks: 8
output_params:
  annotate: false
  annotate_key: metadata.minhash  # minhash output data location
  remove_duplicates: true  # drop duplicates instead of only annotating them
  delete_while_cleaning: false
Use with:
cargo run --release -- minhash-memory \
--input-dir /data/docs \
--storage-dir /tmp/work \
--output-dir /data/deduped \
--text-key "text" \
--config minhash_config.yaml
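The YAML groups map naturally onto nested structs. The sketch below shows one way such a file could be loaded with the `serde` and `serde_yaml` crates; the struct and field names simply mirror the example config above and are hypothetical, not the tool's actual types:
```rust
use serde::Deserialize;

// Hypothetical mirror of minhash_config.yaml above; field names follow the
// example config, not necessarily the tool's real structs.
#[derive(Debug, Deserialize)]
struct MinhashParams {
    num_buckets: u32,
    bucket_size: u32,
    ngram_size: u32,
    permutation_seed: u64,
    tokenizer: String,
}

#[derive(Debug, Deserialize)]
struct EngParams {
    num_docs: u64,
    max_lines_per_path: u64,
    num_sig_chunks: u32,
}

#[derive(Debug, Deserialize)]
struct OutputParams {
    annotate: bool,
    annotate_key: String,
    remove_duplicates: bool,
    delete_while_cleaning: bool,
}

#[derive(Debug, Deserialize)]
struct Config {
    minhash_params: MinhashParams,
    eng_params: EngParams,
    output_params: OutputParams,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let yaml = std::fs::read_to_string("minhash_config.yaml")?;
    let config: Config = serde_yaml::from_str(&yaml)?;
    println!("{config:#?}");
    Ok(())
}
```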
- RAM: Dataset size + 2-3GB overhead
- Storage: Input size + output size
- Best for: Datasets under 100GB
- RAM: ~8-16GB minimum
- Storage: 3-5x input dataset size (for intermediate files)
- Fast local storage strongly recommended (NVMe/SSD)
- Best for: Datasets over 100GB or distributed processing
- Small jobs: Any instance with enough memory to fit the dataset in RAM.
- Large jobs: i4i.32xlarge or larger (NVMe storage)
- Distributed: Multiple i4i.32xlarge instances
- No remote I/O in Rust: All S3 interaction happens outside Rust (use s5cmd, boto3, etc.)
- Fast local storage: Assumes fast disk for intermediate files
- Small file assumption: Individual JSONL files should fit in memory
- Unique basenames: Input files must have unique basenames within input directory
- Use RAID0 for NVMe drives on cloud instances for maximum I/O throughput
- Adjust `--num-path-chunks` based on available workers
- Monitor disk space - intermediate files can be 3-5x input size
- Use `--cleanup-storage` carefully in distributed settings
- Set appropriate `--num-buckets` and `--bucket-size` for your similarity threshold
Out of memory errors: Use disk-based methods instead of memory-based
Slow performance: Ensure you're using fast local storage (NVMe/SSD), not network storage
Missing intermediate files: Ensure all parallel steps complete before running sequential steps
