This GitHub repository describes and distributes all scripts used in "Efficient whole genome haplotyping and high-throughput single molecule phasing with barcode-linked reads"; see Figure 1 for an overview. The main repo handles pre-processing of read data: it takes raw fastq files as input and outputs either (1) fastq files for metagenomic de novo analysis, (2) fastq files for human genome haplotyping, or (3) bam files ready for custom variant calling and phasing analysis.
To run subsequent analysis on output (1) and obtain metagenomic assemblies, see BLR metagenomics. To process output (2) for human genome haplotyping or human reference-free assembly, see the wfa2tenx GitHub repository.
BLR Analysis is now also available at OMICtools.
Here follows a list with links to all bioinformatics software needed to use this part of the pipeline.
It will also be required to have downloaded Picard Tools and a Bowtie2 reference genome (e.g. GRCh38), available at e.g. Illumina iGenomes. Lastly, to utilize all aspects of the pipeline, some GNU software is also needed.
First, download this GitHub repository by running the cloning command in your terminal.
git clone https://github.com/FrickTobias/BLR.git
Then provide BLR_Analysis with the appropriate paths to Picard Tools and your Bowtie2 reference data (consult the example folder for further details).
bash setpath.sh </path/to/picardtools/jarfile> </path/to/bowtie2_reference/fastafile>
For all available options, see -h (--help) and for more details consult the step-by-step folder which describes all steps performed by BLR_automation. For examples and analysis file contents, see the example folder.
First, trim read sequences and extract barcode sequences with BLR_automation, and cluster the barcode sequences (stop the analysis at the second step using -e 2, --end 2).
bash BLR_automation.sh -e 2 -r -m <john.doe@myworkplace.com> -p <processors> <read_1.fq> <read_2.fq> <output>
Following this, run athena_assembly.sh provided in the BLR_metagenomics GitHub repository.
Start by running the complete pre-processing pipeline with the fastq generation option -f (--fastq).
bash BLR_automation.sh -f -r -m <john.doe@myworkplace.com> -p <processors> <read_1.fq> <read_2.fq> <output>
Continue by converting filtered fastq files to Long Ranger/Supernova input format using wfa2tenx and run the appropriate pipeline.
Run the preprocessing pipeline using default settings.
bash BLR_automation.sh -r -m <john.doe@myworkplace.com> -p <processors> <read_1.fq> <read_2.fq> <output>
Use the .rmdup.x2.filt.bam files for further analysis.
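The barcode-aware duplicate filtering behind these files treats reads at the same mapping position as duplicates only when they also carry the same barcode cluster. A minimal sketch of that idea in Python (the tuple record format is hypothetical; the actual pipeline operates on tagged bam files via cluster_rmdup.py):

```python
def remove_barcode_duplicates(reads):
    """Keep one read per (chromosome, position, barcode_cluster) key.

    `reads` is a list of (chrom, pos, barcode_cluster, name) tuples.
    Reads sharing a mapping position but carrying different barcode
    clusters originate from different molecules and are all kept.
    """
    seen = set()
    kept = []
    for chrom, pos, barcode_cluster, name in reads:
        key = (chrom, pos, barcode_cluster)
        if key not in seen:
            seen.add(key)
            kept.append(name)
    return kept

reads = [
    ("chr1", 100, 7, "r1"),  # first read at this position/barcode
    ("chr1", 100, 7, "r2"),  # same position and barcode: duplicate
    ("chr1", 100, 9, "r3"),  # same position, different barcode: kept
]
kept = remove_barcode_duplicates(reads)
```

This is why position-only duplicate marking is insufficient for linked reads: distinct molecules from different droplets can legitimately map to identical coordinates.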
Figure 1: BLR data analysis overview. (a) Reads are trimmed for their first handle using cutadapt, followed by extraction of the barcode sequence to a separate fasta file. Reads are then trimmed for another handle sequence just before the insert sequence, and lastly stripped of any traces of reverse complements of handle sequences from their 3' end. (b) Barcodes are split into several files depending on their first three bases and clustered independently using CD-HIT-454. The results are then combined into a summary file, NNN.clstr. (c) Trimmed reads are assembled into an initial assembly with IDBA-UD, which is then used as a reference for mapping the original trimmed reads with BWA, which also incorporates the clustered barcode sequences into the resulting bam file. The bam file is used to assemble the original read data, using the spatially divided (by mapping position) barcode information. The resulting assembly is then processed by ARCS and passed to LINKS to yield the final scaffolds. (d) Trimmed reads are mapped with Bowtie2 and converted to bam files. This bam file is tagged with barcode information by tag_bam.py. Picard Tools is used to remove PCR and optical duplicates, and then used again to mark duplicate positions where reads have different barcodes. The marked bam file is filtered for barcode duplicates using cluster_rmdup.py and subsequently filtered for clusters with large numbers of molecules by filter_clusters.py. The reads of this bam file are converted to fastq files formatted according to the input specifications of Long Ranger and Supernova by wfa2tenx.py.
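The barcode binning in step (b) can be sketched as follows; a minimal illustration in Python, assuming a simple two-line-per-record barcode fasta (the record layout and function name are hypothetical; the actual pipeline runs CD-HIT-454 on each resulting bin):

```python
from collections import defaultdict

def bin_barcodes_by_prefix(fasta_lines, prefix_len=3):
    """Group barcode fasta records by the first `prefix_len` bases.

    Returns a dict mapping a prefix (e.g. 'ACG') to a list of
    (header, sequence) records, mirroring the per-NNN split in
    Figure 1b that makes each clustering job small and independent.
    """
    bins = defaultdict(list)
    header = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            header = line
        elif header is not None:
            bins[line[:prefix_len]].append((header, line))
            header = None
    return dict(bins)

# Each bin would then be written to its own file, clustered
# independently (CD-HIT-454 in the real pipeline), and the per-bin
# .clstr outputs concatenated into the NNN.clstr summary file.
records = [">bc1", "ACGTTGCA", ">bc2", "ACGAAGGT", ">bc3", "TTGACGTA"]
bins = bin_barcodes_by_prefix(records)
```

Splitting on a fixed prefix is safe here because two barcodes differing in their first bases will not cluster together anyway, so no clusters are lost by keeping the bins separate.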
