Team Lead + Project info for participant promotion #20

@NoushinN

Hi everyone. Baraa and I interviewed Sam Chorlton last week and, as Baraa posted, we thought it was a great project that needed more refinement at the team lead level. Sam got back to us and filled in all these fields, so we probably won't need to contact him again for more info. His response is pasted here; please also note his optional abstract graphics! :)

Listed Team Leader. Optional: contact email, link to website

Sam Chorlton
sam.chorlton@pm.me
https://github.com/chorltsd/REUSE

The Hook. In two sentences advertise the goal of your project and why it would be fun to join this particular team

Help change the world by filtering unneeded sequences from a next-generation sequencing dataset, enriching signal from noise and enabling rapid pathogen discovery, isolation of sequence types (e.g. rRNA), contaminant removal, and more.

Project Abstract: (~300 words) Should have been completed with the application but may be revised. Detail what the project is about and what you're going to do.

Filtering unwanted sequences from nucleic acid sequencing data is an important step in many analyses. It has been used to remove technical artefacts (e.g. PhiX), discover known and novel pathogens, isolate nucleic acid types (e.g. rRNA), and remove noise in metagenomic studies. This step significantly improves the speed and quality of subsequent analyses.

Here I propose an end-to-end pipeline, REUSE, for Rapidly Eliminating Unwanted SEquences from large sequencing datasets. The output of REUSE will be the sequences that do not match a reference sequence. The pipeline will be based on previously established techniques for isolating known and novel pathogens from sequencing data. It will seek to dramatically speed up the process, address flaws in other pipelines, and automate the process from start to finish. It will likely include a k-mer filter, read alignment, read assembly, and contig alignment. Some of these steps will be based on publicly available tools, such as RNA-STAR and Trinity, whereas others will need to be programmed from the ground up.

The work at HackSeq18 will focus on developing the most novel and needed module, the k-mer filter (k-REUSE). Previous evidence indicates that k-mers can be used to rapidly screen and filter sequences, and that a k-mer length of 21 base pairs is sufficient to discriminate between unrelated species.(1) Currently published applications, such as Kontaminant(1), Cookiecutter(2), BBDuk(3), and others, have several limitations, including lack of parallelization, high memory requirements (>50 GB for the human genome), and the inability to save the reference index to disk. Other techniques, such as read alignment, are too slow to use on large datasets.
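
To make the k-mer filtering idea concrete, here is a minimal Python sketch. It is not the k-REUSE implementation (which does not exist yet) and it ignores the speed and memory goals described above; it only illustrates the logic of indexing reference k-mers and dropping reads that share k-mers with the reference. The function names, the `min_hits` threshold, and the use of canonical k-mers (so that both strands match) are assumptions made for illustration.

```python
from typing import Iterable, Iterator, Set

K = 21  # k-mer length suggested in the abstract
_VALID_BASES = frozenset("ACGT")
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmers(seq: str, k: int = K) -> Iterator[str]:
    """Yield the canonical form of every k-mer in seq.

    The canonical form is the lexicographically smaller of a k-mer and its
    reverse complement, so a read matches regardless of strand. K-mers that
    contain ambiguous bases (e.g. N) are skipped.
    """
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if _VALID_BASES.issuperset(kmer):
            yield min(kmer, reverse_complement(kmer))


def build_index(references: Iterable[str], k: int = K) -> Set[str]:
    """Collect the canonical k-mers of all reference sequences in a set.

    Assumption: a plain Python set of strings is used for clarity only.
    It would be far too memory-hungry for a human-sized reference.
    """
    index: Set[str] = set()
    for ref in references:
        index.update(canonical_kmers(ref, k))
    return index


def filter_reads(reads: Iterable[str], index: Set[str],
                 k: int = K, min_hits: int = 1) -> Iterator[str]:
    """Yield only the reads that do NOT look like the reference.

    A read is dropped when at least min_hits of its k-mers are in the index.
    Reads shorter than k have no k-mers and are therefore always kept.
    """
    for read in reads:
        hits = sum(1 for kmer in canonical_kmers(read, k) if kmer in index)
        if hits < min_hits:
            yield read
```

Calling `build_index` on the unwanted reference sequences and passing the result to `filter_reads` yields the reads that remain after filtering. A production filter would presumably replace the string set with a compact structure (for example, 2-bit packed integers or a probabilistic filter) and persist it to disk, which is exactly the gap the abstract describes.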

The goal of HackSeq18 will be the development of k-REUSE and comparison to other filters. Further development will likely be needed after the hackathon for integration of k-REUSE into the complete REUSE pipeline and ultimate application to extremely large datasets.
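
One way the comparison to other filters could be scored is sketched below. It assumes a benchmark in which the true origin of every read is known (for example, simulated reads), which the abstract does not specify; the metric choice and all names (`FilterMetrics`, `score_filter`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class FilterMetrics:
    true_positives: int   # unwanted reads that the filter removed
    false_positives: int  # wanted reads that the filter removed
    false_negatives: int  # unwanted reads that the filter kept
    true_negatives: int   # wanted reads that the filter kept

    @property
    def sensitivity(self) -> float:
        """Fraction of unwanted reads that were removed."""
        total = self.true_positives + self.false_negatives
        return self.true_positives / total if total else 0.0

    @property
    def specificity(self) -> float:
        """Fraction of wanted reads that were kept."""
        total = self.true_negatives + self.false_positives
        return self.true_negatives / total if total else 0.0


def score_filter(all_ids: Set[str], unwanted_ids: Set[str],
                 removed_ids: Set[str]) -> FilterMetrics:
    """Compare the reads a filter removed against the known unwanted reads."""
    wanted_ids = all_ids - unwanted_ids
    return FilterMetrics(
        true_positives=len(removed_ids & unwanted_ids),
        false_positives=len(removed_ids & wanted_ids),
        false_negatives=len(unwanted_ids - removed_ids),
        true_negatives=len(wanted_ids - removed_ids),
    )
```

Running each filter on the same labelled read set and tabulating these values alongside run time and peak memory would give a like-for-like comparison.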

  1. Daly GM, Leggett RM, Rowe W, Stubbs S, Wilkinson M, Ramirez-Gonzalez RH, et al. Host Subtraction, Filtering and Assembly Validations for Novel Viral Discovery Using Next Generation Sequencing Data. PLoS One. 2015;10(6):e0129059.

  2. Starostina E, Tamazian G, Dobrynin P, O’Brien S, Komissarov A. Cookiecutter: a tool for kmer-based read filtering and extraction. bioRxiv. 2015 Aug 16;024679.

  3. Bushnell B. BBTools [Internet]. DOE Joint Genome Institute. [cited 2018 Jul 25]. Available from: https://jgi.doe.gov/data-and-tools/bbtools/

Required Skills: What skills or knowledge is required a priori from the participants to be able to take part on this team

-Strong skills in a fast programming language such as C or C++, or the ability to call fast libraries from within Python or another language. Team members will be needed to program the k-mer based filter.
-If you have the above skills, you likely have GitHub skills.
-Understanding of the basics of next-generation sequencing data and formats (e.g. FASTA, FASTQ).
-Ability to have fun.

Optional Skills: What skills or knowledge would be beneficial for the participants to have but is not necessary to take part.

-Familiarity with existing pathogen discovery pipelines (RINS, SURPI, Kontaminant, Cookiecutter, DeconSeq).
-Ability to write (particularly science).
-Ability to science (particularly to evaluate our pipeline in comparison with existing pipelines, generate comparative graphs and tables, and perform basic statistics).

[Image: abstract graphic]
