fastq-dl

Download FASTQ files from the European Nucleotide Archive or the Sequence Read Archive repositories.

Introduction

fastq-dl takes an ENA/SRA accession (BioProject/Study, BioSample/Sample, Experiment, or Run) and queries ENA (via the Data Warehouse API) to determine the associated metadata. It then downloads FASTQ files for each Run. For Samples or Experiments with multiple Runs, users can optionally merge the Runs.

Installation

Dependencies

fastq-dl depends on the following:

    - pip
    - python >=3.10
    - pysradb >=2.3
    - sracha
    - wget

Bioconda

fastq-dl is available from Bioconda, and I highly recommend this route for installation because it also handles the dependencies.

conda create -n fastq-dl -c conda-forge -c bioconda fastq-dl
conda activate fastq-dl

PyPI

fastq-dl is also available from PyPI, so you can use pip to install it.

Note: You will need to ensure you have installed the dependencies.

pip install fastq-dl
fastq-dl --version
fastq-dl --help
fastq-dl --check
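If you install with pip, a quick pre-flight check can confirm that the external tools from the dependency list are on your PATH before you run anything. The function below is a minimal sketch; the name `missing_tools` is hypothetical and is not part of fastq-dl, which provides its own `--check` option once installed.

```shell
# Report which of the given command-line tools are missing from PATH.
# Prints one missing tool per line and returns non-zero if any are absent.
# NOTE: missing_tools is an illustrative helper, not part of fastq-dl.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            rc=1
        fi
    done
    return "$rc"
}
```

For example, `missing_tools wget sracha` prints nothing when both tools are available.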

Usage

fastq-dl --help
                                                                               
 Usage: fastq-dl [OPTIONS]                                                     
                                                                               
 Download FASTQ files from ENA or SRA.                                         
                                                                               
╭─ Required Options ──────────────────────────────────────────────────────────╮
│ *  --accession  -a  TEXT  ENA/SRA accession to query. (Study, Sample,       │
│                           Experiment, Run accession) [required]             │
╰─────────────────────────────────────────────────────────────────────────────╯
╭─ Provider Options ──────────────────────────────────────────────────────────╮
│ --provider          [ena|sra]                Specify which provider (ENA or │
│                                              SRA) to use. [default: ena]    │
│ --protocol          [ftp|https]              Protocol to use for ENA        │
│                                              downloads. [default: ftp]      │
│ --sra-lite                                   Set preference to SRA Lite     │
│ --skip-compression                           Skip compression of SRA        │
│                                              downloads.                     │
│ --gzip-level        INTEGER RANGE [1<=x<=9]  Gzip compression level for SRA │
│                                              downloads (1=fast, 9=best).    │
│                                              [default: 1]                   │
╰─────────────────────────────────────────────────────────────────────────────╯
╭─ Download Options ──────────────────────────────────────────────────────────╮
│ --max-attempts            -m  INTEGER  Maximum number of download attempts. │
│                                        [default: 3]                         │
│ --only-provider                        Only attempt download from specified │
│                                        provider.                            │
│ --only-download-metadata               Skip FASTQ downloads, and retrieve   │
│                                        only the metadata.                   │
│ --group-by-experiment                  Group Runs by experiment accession.  │
│ --group-by-sample                      Group Runs by sample accession.      │
│ --ignore                  -I           Skip MD5 validation (ENA) or relax   │
│                                        integrity checks (SRA).              │
╰─────────────────────────────────────────────────────────────────────────────╯
╭─ Additional Options ────────────────────────────────────────────────────────╮
│ --outdir       -o  TEXT     Directory to output downloads to. [default: ./] │
│ --prefix           TEXT     Prefix to use for naming log files. [default:   │
│                             fastq]                                          │
│ --cpus             INTEGER  Total cpus used for SRA conversion and          │
│                             compression. [default: 4]                       │
│ --connections      INTEGER  HTTP connections per file for SRA downloads.    │
│                             [default: 8]                                    │
│ --force        -F           Overwrite existing files.                       │
│ --silent                    Only critical errors will be printed.           │
│ --sleep        -s  INTEGER  Minimum amount of time to sleep between retries │
│                             (API query and download) [default: 10]          │
│ --check                     Check that required external tools are          │
│                             installed and exit.                             │
│ --version      -V           Show the version and exit.                      │
│ --verbose      -v           Print debug related text.                       │
│ --help         -h           Show this message and exit.                     │
╰─────────────────────────────────────────────────────────────────────────────╯

fastq-dl requires a single ENA/SRA Study, Sample, Experiment, or Run accession, and FASTQs for all Runs that fall under the given accession will be downloaded. For example, if a Study accession is given, all Runs under that Study's umbrella will be downloaded. By default, fastq-dl will try to download from ENA first, then SRA.
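Because each invocation takes a single accession, working through a list means calling fastq-dl once per accession. The following is a sketch of one way to do that, not a feature of fastq-dl: the function name `fastq_dl_batch`, the `DRY_RUN` variable, and the one-directory-per-accession layout are all assumptions made for illustration. Only the `--accession` and `--outdir` options come from the help text above.

```shell
# Run fastq-dl once per accession listed in a file (one accession per line).
# Blank lines and lines starting with '#' are skipped.
# Set DRY_RUN=1 to print the commands instead of executing them.
# NOTE: fastq_dl_batch and DRY_RUN are illustrative, not part of fastq-dl.
fastq_dl_batch() {
    list_file=$1
    outdir=${2:-.}
    while read -r accession || [ -n "$accession" ]; do
        case $accession in
            ''|'#'*) continue ;;
        esac
        if [ "${DRY_RUN:-0}" = "1" ]; then
            printf 'fastq-dl --accession %s --outdir %s\n' \
                "$accession" "$outdir/$accession"
        else
            # Create the per-accession directory up front, and keep fastq-dl
            # from reading the accession list on stdin.
            mkdir -p "$outdir/$accession"
            fastq-dl --accession "$accession" \
                --outdir "$outdir/$accession" </dev/null
        fi
    done < "$list_file"
}
```

Running it first with `DRY_RUN=1` shows the commands that would be executed, which is a cheap way to check the list before starting large downloads.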

--accession

The accession you would like to download associated FASTQs for. Currently, the following types of accessions are accepted.

Accession Type	Prefixes	Example
BioProject	PRJEB, PRJNA, PRJDB	PRJEB42779, PRJNA480016, PRJDB14838
Study	ERP, DRP, SRP	ERP126685, DRP009283, SRP158268
BioSample	SAMD, SAME, SAMN	SAMD00258402, SAMEA7997453, SAMN06479985
Sample	ERS, DRS, SRS	ERS5684710, DRS259711, SRS2024210
Experiment	ERX, DRX, SRX	ERX5050800, DRX406443, SRX4563689
Run	ERR, DRR, SRR	ERR5260405, DRR421224, SRR7706354

The accession types are identified using regular expressions from the ENA Training Modules - Accession Numbers section.
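To make the prefix table concrete, here is a small sketch that guesses an accession's type from its prefix. It uses simplified shell glob patterns derived from the table, not the actual ENA regular expressions, so it is looser than real validation (for example, it does not check the number of digits). The function name `accession_type` is hypothetical.

```shell
# Guess the accession type from its prefix, following the prefix table.
# These glob patterns are a simplification, NOT the ENA regular expressions.
# NOTE: accession_type is an illustrative helper, not part of fastq-dl.
accession_type() {
    case $1 in
        PRJEB[0-9]*|PRJNA[0-9]*|PRJDB[0-9]*) echo "BioProject" ;;
        [EDS]RP[0-9]*)                       echo "Study" ;;
        SAMD[0-9]*|SAME[A-Z0-9]*|SAMN[0-9]*) echo "BioSample" ;;
        [EDS]RS[0-9]*)                       echo "Sample" ;;
        [EDS]RX[0-9]*)                       echo "Experiment" ;;
        [EDS]RR[0-9]*)                       echo "Run" ;;
        *)                                   echo "Unknown"; return 1 ;;
    esac
}
```

With the examples from the table, `accession_type SAMEA7997453` prints `BioSample` and `accession_type SRR7706354` prints `Run`.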

--provider

fastq-dl gives you the option to download from ENA or SRA. The --provider option specifies which provider you would like to attempt downloads from first. If a download fails from the first provider, additional attempts will be made using the other provider.

ENA was selected as the default provider because the FASTQs are available directly without the need for conversion.

--only-provider

By default, fastq-dl will fall back on a secondary provider to attempt downloads. There may be cases where you would prefer to disable this feature, and that is exactly the purpose of --only-provider. When provided, if a FASTQ cannot be downloaded from the original provider, no additional attempts will be made.

--group-by-experiment & --group-by-sample

There may be times when you want to group Run accessions by their Experiment or Sample accession. These options merge the FASTQs associated with each Run accession based on its associated Experiment accession (--group-by-experiment) or Sample accession (--group-by-sample).

--sra-lite

Downloads from SRA are provided in SRA Normalized and SRA Lite formats. SRA Normalized is the original format with full base quality scores, while SRA Lite files are smaller because the quality scores are simplified to a uniform Q30. By default, the preference is set to SRA Normalized; if you prefer SRA Lite, use --sra-lite.

--skip-compression

By default, FASTQs downloaded from SRA are compressed using sracha to save space. However, this can be time consuming (especially for large files!). You can use the --skip-compression option to skip this step and save time at the cost of disk space.

This option is ignored for ENA downloads, as they are already provided as gzip-compressed files.

--gzip-level

Controls the gzip compression level for SRA downloads, ranging from 1 (fastest, least compression) to 9 (slowest, best compression). The default is 1, which prioritizes speed. If disk space is a concern and you can afford longer download times, increase this value.

This option is ignored when --skip-compression is used or for ENA downloads.
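To get a feel for the speed versus size trade-off before choosing a level, you can compare compressed sizes on a FASTQ you already have. The sketch below uses the standalone `gzip` command as a stand-in; it is an assumption that its levels behave comparably to the compression fastq-dl applies to SRA downloads, and the function name `gzip_size` is hypothetical.

```shell
# Print the compressed size in bytes of a file at a given gzip level (1-9).
# The input file is left untouched; the output is streamed and counted.
# NOTE: gzip_size is an illustrative helper using plain gzip, not fastq-dl.
gzip_size() {
    gzip -c "-$2" "$1" | wc -c | tr -d ' '
}
```

For example, comparing `gzip_size reads.fastq 1` with `gzip_size reads.fastq 9` (where `reads.fastq` is a placeholder for your own uncompressed file) shows how much space the slower level would save on your data.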

--connections

Controls the number of HTTP connections used per file for SRA downloads. The default is 8. Increasing this value may improve download speeds on high-bandwidth connections, while decreasing it can help on slower or less stable networks.

This option only applies to SRA downloads.

Output Files

Extension	Description
`-run-info.tsv`	Tab-delimited file containing metadata for each Run downloaded
`-run-mergers.tsv`	Tab-delimited file containing merge information from `--group-by-experiment` or `--group-by-sample`
`.fastq.gz`	FASTQ files downloaded from ENA or SRA
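Since the run-info file is tab-delimited, it can be summarized with standard tools, for instance to see how many Runs belong to each Experiment or Sample before deciding whether to group them. The sketch below assumes the file has a header row and a column named `experiment_accession` (an ENA-style field name that is assumed here, not confirmed by this README); the function name `runs_per_group` is hypothetical.

```shell
# Count rows per distinct value of a named column in a tab-delimited file
# that has a header row. Prints "<value><TAB><count>" lines sorted by value.
# NOTE: runs_per_group is an illustrative helper, not part of fastq-dl.
runs_per_group() {
    tsv=$1
    column=$2
    awk -F '\t' -v col="$column" '
        NR == 1 {
            # Locate the requested column by its header name.
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            next
        }
        { count[$idx]++ }
        END { for (key in count) print key "\t" count[key] }
    ' "$tsv" | sort
}
```

A call might look like `runs_per_group fastq-run-info.tsv experiment_accession`, where the file name assumes the default `--prefix` of `fastq`.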

Example Usage

Download FASTQs associated with a Study

Sometimes you might be reading a paper and they very kindly provided a BioProject of all the samples they sequenced. So, you decide you want to download FASTQs for all the samples associated with the BioProject. fastq-dl can help you with that!

fastq-dl --accession PRJNA248678 --provider SRA
fastq-dl --accession PRJNA248678

The above commands will download the 3 Runs that fall under Study accession PRJNA248678 from either SRA (--provider SRA) or ENA (without --provider).

Download FASTQs associated with an Experiment

Let's say instead of the whole BioProject you just want a single Experiment. You can do that as well.

fastq-dl --accession SRX477044

The above command would download the Run accessions from ENA that fall under Experiment SRX477044.

The relationship of Experiment to Run is 1-to-many: a single Experiment accession can have many Run accessions associated with it (e.g. re-sequencing the same sample). In most cases it is a 1-to-1 relationship, but when it is not, you can use --group-by-experiment to merge the multiple Runs associated with an Experiment accession into a single FASTQ file.

Download FASTQs associated with a Sample

Ok, this time you just want a single Sample, or BioSample.

fastq-dl --accession SRS1904245 --provider SRA

The above command would download the Run accessions from SRA that fall under Sample SRS1904245.

Similar to Experiment accessions, the relationship of Sample to Run is 1-to-many: a single Sample accession can have many Run accessions associated with it. In most cases it is a 1-to-1 relationship, but when it is not, you can use --group-by-sample to merge the multiple Runs associated with a Sample accession into a single FASTQ file.

Warning! For some type strains (e.g. S. aureus USA300) a BioSample accession might be associated with 100s or 1000s of Run accessions. These Runs are likely associated with many different conditions and really should not fall under a single BioSample accession. Please consider this when using --group-by-sample.

Download FASTQs associated with a Run

Let's keep it super simple and just download a Run.

fastq-dl --accession SRR1178105 --provider SRA

The above command would download the Run SRR1178105 from SRA. Run accessions are the end of the line (1-to-1 relationship), so you will always get the expected Run.

Motivation

fastq-dl is a spin-off of ena-dl (pre-2017) that has been developed for use with Bactopia. With this in mind, EBI and NCBI provide their own tools (enaBrowserTools and SRA Toolkit) that offer more extensive access to their databases.

Disclaimer

AI tools were used in the development of this project.
