FastDedup (FDedup) is a fast and memory-efficient FASTX PCR deduplication tool written in Rust. It utilizes needletail for high-performance sequence parsing, xxh3 for rapid hashing, and fxhash for a low-overhead memory cache.
A paper is in preparation; you can check it here.
- Fast & Memory Efficient: Uses zero-allocation sequence parsing and a non-cryptographic high-speed hashing cache, which automatically scales based on the estimated input file size.
- Supports Compressed Formats: Transparently reads and writes both uncompressed and GZIP-compressed (.gz) FASTQ/FASTA files.
- Incremental Deduplication & Auto-Recovery: By default, FDedup appends new sequences to an existing output file, safely pre-loading existing hashes to prevent duplicates. If an uncompressed output file was corrupted by a previous crash, FDedup automatically truncates it to the last valid sequence and resumes safely.
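The core deduplication idea can be sketched as follows. This is a minimal Python illustration only: FDedup itself hashes sequences with xxh3 in Rust, and Python's built-in `hash` stands in for the digest here.

```python
def dedup_records(records):
    """Yield (header, sequence) pairs whose sequence has not been seen yet."""
    seen = set()
    for header, seq in records:
        h = hash(seq)  # stand-in for a 64/128-bit xxh3 digest
        if h not in seen:
            seen.add(h)
            yield header, seq

reads = [("@r1", "ACGT"), ("@r2", "ACGT"), ("@r3", "TTGA")]
print(list(dedup_records(reads)))  # the duplicate "ACGT" read is dropped
```

Storing only fixed-size hashes, rather than the sequences themselves, is what keeps the memory footprint low.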
If you want to build it from source, you need to have the following dependencies installed:
You can download the latest pre-compiled binaries from the releases page.
The recommended way to install FastDedup is with pixi through bioconda:

```
pixi add bioconda::fdedup
```

You can install FastDedup directly from Cargo:

```
cargo install fastdedup
```

Usage:

```
fdedup [OPTIONS] --input <INPUT>
```

- `-1, --input <INPUT>`: Path to the input FASTA/FASTQ/GZ file (R1 or Single-End).
- `-2, --input-r2 <INPUT_R2>`: Path to the input R2 file (optional; enables Paired-End mode).
- `-o, --output <OUTPUT>`: Path to the output file (R1 or Single-End). Defaults to `output_R1.fastq.gz`.
- `-p, --output-r2 <OUTPUT_R2>`: Path to the output R2 file (required if `-2` is provided).
- `-f, --force`: Overwrite the output file if it exists (instead of pre-loading hashes and appending).
- `-v, --verbose`: Print processing stats, such as execution time, number of sequences, and duplication rates.
- `-s, --dry-run`: Calculate the duplication rate without creating an output file.
- `-t, --threshold <THRESHOLD>`: Threshold for automatic hash size selection$^1$ (default: 0.01).
- `-H, --hash <HASH>`: Manually specify the hash size (64 or 128 bits).
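For example, a typical paired-end run might look like this (the read file names are placeholders):

```shell
# Deduplicate a paired-end library and print processing stats.
# reads_R1.fastq.gz / reads_R2.fastq.gz are placeholder input names.
fdedup -1 reads_R1.fastq.gz -2 reads_R2.fastq.gz \
       -o dedup_R1.fastq.gz -p dedup_R2.fastq.gz -v
```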
$^1$: The probability of a hash collision tolerated by the automatic hash size selection.

Note: by the birthday-bound approximation, you need $\sqrt{2 \cdot 2^{64} \cdot 10^{-3}} \approx 0.19 \times 10^{9}$ sequences to have a 1‰ chance of a collision with 64-bit hashing, and $\sqrt{2 \cdot 2^{128} \cdot 10^{-3}} \approx 8.2 \times 10^{17}$ sequences to have the same chance with 128-bit hashing.
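These thresholds follow from the birthday-bound approximation $n \approx \sqrt{2 \cdot 2^{b} \cdot p}$ for $b$-bit hashes and a small collision probability $p$. A quick check in Python (the function name is ours):

```python
from math import sqrt

def sequences_for_collision_prob(bits: int, p: float) -> float:
    """Birthday bound: roughly how many uniform `bits`-bit hashes can be
    drawn before the probability of at least one collision reaches `p`
    (approximation valid for small p)."""
    return sqrt(2 * 2**bits * p)

print(f"{sequences_for_collision_prob(64, 1e-3):.2e}")   # ~1.92e8 sequences
print(f"{sequences_for_collision_prob(128, 1e-3):.2e}")  # ~8.25e17 sequences
```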
You can run it directly from Cargo:

```
cargo run --release -- --input <INPUT> [OPTIONS]
```

You can also rely on Pixi to run it:

```
pixi run cargo build --release
pixi run fdedup --input <INPUT> [OPTIONS]
```

You can download the latest release and run the containerized version of FDedup:
Using Apptainer:

```
apptainer run fdedup.sif fdedup --input <INPUT> [OPTIONS]
```

Using Singularity:

```
singularity run fdedup.sif fdedup --input <INPUT> [OPTIONS]
```

Note: `--force` is very slow when used in a Singularity container. We recommend simply deleting the output file before running the container if you want to start from scratch.
You can build the container yourself using pixitainer:

- Install pixitainer:

```
pixi global install -c https://prefix.dev/raphaelribes -c https://prefix.dev/conda-forge pixitainer
```

- Build the container:

```
pixi containerize
```

If you are using FDedup in a pre-processing step, we recommend not exporting your file to a .gz format.
If there is a crash, FDedup cannot restart from a compressed file, and you will lose all progress: a corrupted gzip stream makes the file unreadable, so you will have to start from scratch with `--force`. However, if you output to an uncompressed format, FDedup will automatically detect crash-induced corruption, safely truncate the file to the last valid sequence, and seamlessly resume deduplication.
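The recovery step can be sketched as follows. This is a simplified Python sketch assuming plain 4-line FASTQ records with no line wrapping, not FDedup's actual Rust implementation:

```python
def truncate_to_last_valid_fastq(path):
    """Truncate a trailing partial FASTQ record left behind by a crash.

    Scans 4-line records from the start; `valid_end` tracks the byte
    offset just past the last complete record.
    """
    valid_end = 0
    with open(path, "rb") as f:
        while True:
            record = [f.readline() for _ in range(4)]
            # A record is complete if all 4 lines are newline-terminated
            # and start with the expected '@' / '+' markers.
            if (all(line.endswith(b"\n") for line in record)
                    and record[0].startswith(b"@")
                    and record[2].startswith(b"+")):
                valid_end = f.tell()
            else:
                break
    with open(path, "r+b") as f:
        f.truncate(valid_end)
```

Since the scan only ever discards bytes after the last complete record, already-written sequences are preserved and deduplication can resume from there.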
- Support for Paired-End read deduplication.
- Add Multithreading to parallelize sequence hashing and processing.
- Support tracking sequence abundances (counts) instead of naively discarding duplicates.
- Add an option to export sequences as FASTA.
- Improve error handling.
This project is licensed under the MIT License. See the LICENSE file for details.
Raphaël Ribes (coding and design)
Céline Mandier (design)
Computations were performed on the ISDM-MESO HPC platform, funded in the framework of State-region planning contracts (Contrat de plan État-région – CPER) by the French Government, the Occitanie/Pyrénées-Méditerranée Region, Montpellier Méditerranée Métropole, and the University of Montpellier.