Home

Welcome to the 2016_project_9 wiki!

Original project description

The genetic diversity in Africa is immense, and the diversity across the entire continent has not yet been captured on any commercial SNP array. Developing a cost-efficient and representative genotype array with SNPs that provide good coverage across the African continent is key to conducting large-scale medical genetic studies in Africa. I propose a project in which we write a SNP selection algorithm applicable to whole genome sequence (WGS) data. This will involve writing an algorithm that chooses SNPs that tag other SNPs most efficiently across individuals from several African populations. The algorithm will be applied in conjunction with commercial lists of pre-approved SNPs, lists of SNPs of general interest, and various lists that rank SNPs, in order to select a set of tag SNPs that can be put on a commercial SNP array. We intend to write the code in Python. Other tag SNP selection algorithms exist, but none of these are geared towards handling WGS data efficiently. By making use of random access to block-gzipped files, we intend to write a memory-efficient algorithm applicable to WGS data. We envision this algorithm being used in combination with existing imputation methods to make use of haplotype (multi-marker) tagging in addition to simple pairwise LD. We aim to provide a fully functional piece of software and a list of tag SNPs at the end of hackseq.

Team

https://github.com/tommycarstensen - Tommy
https://github.com/awreynolds - Austin
https://github.com/dfornika - Dan
https://github.com/ameintjes - Ayton
https://github.com/shaze - Scott
https://github.com/marciam - Marcia
https://github.com/jocelynjyl - Jocelyn
https://github.com/alyeffy - Alyssa - not present Oct.15
https://github.com/vmon588 - Vince
https://github.com/brianlee99 - Brian

Group Communication

https://hackseq.slack.com/messages/project9/

General Help

Background reading

Xu Z, Kaplan NL, Taylor JA. TAGster: efficient selection of LD tag SNPs in single or multiple populations. Bioinformatics. 2007 Dec 1;23(23):3254–5. doi:10.1093/bioinformatics/btm426.
Hoffmann TJ, Zhan Y, Kvale MN, Hesselson SE, Gollub J, Iribarren C, et al. Design and coverage of high throughput genotyping arrays optimized for individuals of East Asian, African American, and Latino race/ethnicity using imputation and a novel hybrid SNP selection algorithm. Genomics. 2011 Dec;98(6):422–30. doi: 10.1016/j.ygeno.2011.08.007.
Gurdasani D, Carstensen T, Tekola-Ayele F, Pagani L, Tachmazidou I, Hatzikotoulas K, et al. The African Genome Variation Project shapes medical genetics in Africa. Nature. 2014 Dec 3;517(7534):327–32. doi: 10.1038/nature13997.

Tommy's proposed tasks in advance of Hackseq

Calculation of LD values with existing code (e.g. PLINK) or new Python code (medium task).
Selection of tag SNPs from phased (and unphased) data with new code (big task).
Automated calling of the LD based method and IMPUTE2 in each cycle of a hybrid algorithm, if we choose to go down this path. Making all elements of the code play well together. (big task)
Evaluation of the hybrid and strictly LD based algorithm (big task).
Evaluation of our algorithm (speed and accuracy) against existing algorithms; e.g. TAGster (big task).
Write a paper (medium task).
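For the first task, a minimal sketch of what "new Python code" for pairwise LD could look like, assuming phased, biallelic data coded as 0/1 per haplotype. The function name `r_squared` and the convention of returning 0 for monomorphic sites are illustrative assumptions, not project decisions.

```python
def r_squared(hap_a, hap_b):
    """Pairwise LD (r^2) between two biallelic SNPs from phased haplotypes.

    hap_a and hap_b are equal-length sequences of 0/1 alleles, one entry
    per haplotype (two entries per diploid individual).
    """
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("haplotype vectors must be non-empty and equal length")

    p_a = sum(hap_a) / n  # frequency of allele 1 at SNP A
    p_b = sum(hap_b) / n  # frequency of allele 1 at SNP B
    # Frequency of haplotypes carrying allele 1 at both SNPs.
    p_ab = sum(a & b for a, b in zip(hap_a, hap_b)) / n

    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        # At least one SNP is monomorphic, so r^2 is undefined.
        # Returning 0.0 is an assumption made for this sketch.
        return 0.0

    d = p_ab - p_a * p_b  # the disequilibrium coefficient D
    return d * d / denom
```

For whole-genome data, PLINK is likely to be much faster than a pure Python loop like this; the sketch is mainly useful for checking small cases by hand.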

1000 Genomes populations

YRI Yoruba in Ibadan, Nigeria
LWK Luhya in Webuye, Kenya
GWD Gambian in Western Divisions in the Gambia
MSL Mende in Sierra Leone
ESN Esan in Nigeria

Various documents

Google Slide Google Sheet

Saturday Morning action items

Ayton - to get 5 different reference panels, each with 1 population left out, in IMPUTE format (excluding ref populations) (?)
Scott, Vince, Marcia - calculate LD on 1000G data using PLINK
Austin, Jocelyn, Dan, Brian - to write outline about how to process chromosomes separately
Dan, Tommy - tag SNP selection
Marcia - document on Github

misc notes from first planning discussion

PLINK to calculate LD
push everything to Github regularly
someone gets 1000 genomes haplotype panel and get PLINK, start calculating LD values
capture common variation
emphasis on writing algorithm, not producing SNPs; existing relevant data is limiting at this point but will improve in the next few months
select tag SNPs from each of 5 1000 Genomes populations, impute with whole panel, exclude population we're going with... (?)
TAGster won't work on whole genome data - main problem with TAGster is it takes too long and uses too much memory
VCF files - can come from the 1000 Genomes FTP site
will be computationally heavy in the end when we run the algorithm, but early work can be run on a laptop
when 2 tag SNPs, choose the 1 with higher score (Affymetrix?) - white list
expect to have white list and black list
pre-selected SNPs to be included (i.e. "white list")
ignore indels
input: a vcf for multiple chromosomes
output: ~1M tag SNPs
then extract 1 population...impute missing SNPs, see how well it compares (?)
don't use rs IDs
we'll use build 37
Illumina has uploaded the [relevant data] just recently?
VCF files will be split by chromosome
parallelization wouldn't work since won't know budget for each chromosome
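Two of the notes above ("ignore indels" and "don't use rs IDs") can be sketched as a small VCF line filter. This is only an illustration under assumptions: the function name `iter_biallelic_snps` is hypothetical, and it treats any record whose REF or ALT is not a single A/C/G/T base as something to skip, which also drops multi-allelic sites.

```python
VALID_BASES = {"A", "C", "G", "T"}


def iter_biallelic_snps(vcf_lines):
    """Yield (chrom, pos, ref, alt) for biallelic SNPs in VCF text lines.

    Header lines, indels and multi-allelic records are skipped. SNPs are
    identified by chromosome and position rather than by rs ID.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # meta-information and column header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # malformed or truncated record
        chrom, pos, _rsid, ref, alt = fields[:5]
        # A single-base REF and a single-base ALT means a biallelic SNP.
        if ref.upper() in VALID_BASES and alt.upper() in VALID_BASES:
            yield chrom, int(pos), ref, alt
```

The function takes any iterable of text lines, so it could be fed from `gzip.open(path, "rt")` for a compressed per-chromosome file.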

Calculation of LD panels

PLINK 1.9
chromosome 20 to start
use PLINK to do initial filtering of SNPs; at least minor allele frequency 0.05 on a population level
use windows of 200K bp (?)
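The settings above could be turned into a PLINK 1.9 call roughly as follows. This is a sketch, not the team's actual wrapper: the function name, the executable name `plink`, the file names, the `min_r2` default of 0.8 and the `window_variants` value are all assumptions. The function only builds the argument list, so it can be inspected without PLINK installed. Because `--maf` filters on the samples in the input, a population-level MAF filter would presumably need the input restricted to one population first (for example with `--keep`).

```python
def plink_ld_command(vcf_path, out_prefix, chrom="20", maf=0.05,
                     window_kb=200, window_variants=99999, min_r2=0.8):
    """Build an argument list for a PLINK 1.9 pairwise r^2 report.

    min_r2 and window_variants are illustrative assumptions; the notes
    only specify the MAF cutoff, the chromosome and the window size.
    """
    return [
        "plink",                              # assumed executable name
        "--vcf", vcf_path,
        "--chr", str(chrom),
        "--snps-only",                        # drop indels
        "--maf", str(maf),                    # minor allele frequency filter
        "--r2",                               # pairwise r^2 report
        "--ld-window-kb", str(window_kb),     # max distance between SNP pairs
        "--ld-window", str(window_variants),  # lift the default variant-count limit
        "--ld-window-r2", str(min_r2),        # only report pairs at or above this r^2
        "--out", out_prefix,
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(plink_ld_command("chr20.vcf.gz", "yri_chr20")))
```

The list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.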

From TC's slides

Input

VCF files (chrom20 for testing) for calculation of LD values as input for the actual algorithm
Pre-selected tag SNPs
Blacklisted SNPs - not contributing to LD threshold counts, so only LD values can be calculated prior to
SNP scores for selection of tag SNP between SNPs with identical LD scores
LD values or count/identity of SNPs above LD threshold for each population
Minimum MAF threshold
Maximum distance (cM or bp) that SNPs can be apart
Maximum number of tag SNPs
Minimum LD threshold
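The inputs listed above map fairly naturally onto command-line options. The following argparse sketch is one possible layout, not the parser the team wrote: every flag name is hypothetical, and the defaults are taken from the notes where available (MAF 0.05, 200K bp windows, about 1M tag SNPs), except the LD threshold of 0.8, which is an assumption.

```python
import argparse


def build_parser():
    """Build a command-line parser for the tag SNP selection inputs.

    All flag names here are hypothetical and chosen for illustration.
    """
    parser = argparse.ArgumentParser(
        description="Select tag SNPs from per-population LD information.")
    parser.add_argument("--vcf", nargs="+", required=True,
                        help="one or more VCF files (e.g. chromosome 20 for testing)")
    parser.add_argument("--preselected",
                        help="file of pre-selected tag SNPs (white list)")
    parser.add_argument("--blacklist",
                        help="file of SNPs that must not be chosen as tags")
    parser.add_argument("--scores",
                        help="file of SNP scores used to break ties")
    parser.add_argument("--min-maf", type=float, default=0.05,
                        help="minimum minor allele frequency")
    parser.add_argument("--max-distance-bp", type=int, default=200000,
                        help="maximum distance in bp between a tag and a tagged SNP")
    parser.add_argument("--max-tags", type=int, default=1000000,
                        help="maximum number of tag SNPs to select")
    parser.add_argument("--min-ld", type=float, default=0.8,
                        help="minimum LD (r^2) threshold; 0.8 is an assumed default")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```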

Placeholders

Set of candidate SNPs
File or dictionary of LD counts/scores

Output

Candidate set of SNPs that should be ignored during each iteration

End of day recap for Saturday

Ayton - finished with reference panel; wrote script that runs, looks for indices that it should exclude; output on shared space (?); input is IMPUTE2 data set, large file (11GB, compressed?)
Scott - calculated LD files; now working on taking request list, find any SNPs that are in LD with that, combine 2 LD files(?)
Brian - working on learning concepts
Marcia - learned file formats, git, started wiki
Jocelyn, Austin - prepared architecture, code for parsing input, argparse
Austin - start of python wrapper to call out for PLINK
Dan - splitting out VCF files for input
Tommy - led the team, provided tech support, software design

Plan for Sunday

1. PLINK wrapper - Austin to edit to include other functions, such as black list
2. input parser (done)
3. preselect SNPs - Scott
4. select tag SNPs based on counts - Vince
5. update counts - Vince (combined with selecting tag SNPs)
6. write output (probably VCF) - Dan
7. evaluate imputation accuracy w IMPUTE2 - Ayton
8. visualize results (in R ?), plot w minor allele frequency in X-axis, imputation accuracy on y-axis, for each of 5 populations - Marcia

For later

black list

Sunday morning check-in

concentrate on greedy algorithm, not hybrid algorithm
Tommy has written lots of code as scaffold.py, including code to select tag SNPs and update counts (i.e. 4 and 5 above)
Need someone to generate flow-chart to use in Monday presentation -> Marcia
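The greedy approach (select the SNP that tags the most still-untagged SNPs, then update the counts) can be sketched as below. This is not the contents of scaffold.py, which is not reproduced on this page. The function name, the `ld_partners` dictionary layout and the tie-breaking order (count, then score, then SNP identifier) are assumptions, and for simplicity only SNPs that are still untagged are considered as candidate tags.

```python
def greedy_tag_snps(ld_partners, max_tags, preselected=(), scores=None):
    """Greedy tag SNP selection on one population (illustrative sketch).

    ld_partners maps each SNP to the set of SNPs in LD with it above the
    chosen threshold. Returns the list of tag SNPs in selection order.
    """
    scores = scores or {}
    untagged = set(ld_partners)
    tags = []

    def add_tag(snp):
        # Selecting a tag covers the tag itself and all of its LD partners.
        tags.append(snp)
        untagged.discard(snp)
        untagged.difference_update(ld_partners.get(snp, ()))

    # Pre-selected (white list) SNPs are placed first.
    for snp in preselected:
        if len(tags) >= max_tags:
            break
        add_tag(snp)

    while untagged and len(tags) < max_tags:
        # Counts are recomputed against the current untagged set, which is
        # the "update counts" step. Ties fall back to the SNP score, then
        # to the identifier so that results are deterministic.
        best = max(
            untagged,
            key=lambda s: (len(ld_partners[s] & untagged), scores.get(s, 0), s),
        )
        add_tag(best)
    return tags


if __name__ == "__main__":
    # Toy LD structure with made-up SNP identifiers.
    ld = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"},
          "d": {"e"}, "e": {"d"}, "f": set()}
    print(greedy_tag_snps(ld, max_tags=10, scores={"d": 5}))
```

Recomputing every count on each pass is quadratic in the number of SNPs, so a real implementation for WGS data would need to update counts incrementally and handle several populations at once.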
