Home
Welcome to the 2016_project_9 wiki!
The genetic diversity in Africa is immense, and no commercial SNP array to date has captured the diversity across the entire continent. Developing a cost-efficient and representative genotype array with SNPs that provide good coverage across the African continent is key to conducting large-scale medical genetic studies in Africa. I propose a project in which we write a SNP selection algorithm applicable to whole genome sequence (WGS) data. This will involve writing an algorithm that chooses SNPs that tag other SNPs most efficiently across individuals from several African populations. The algorithm will be applied in conjunction with commercial lists of pre-approved SNPs, lists of SNPs of general interest, and various lists that rank SNPs, in order to select a set of tag SNPs that can be put on a commercial SNP array. We intend to write the code in Python. Other tag SNP selection algorithms exist, but none of these are geared towards handling WGS data efficiently. By making use of random access to block-gzipped files, we intend to write a memory-efficient algorithm applicable to WGS data. We envision this algorithm being used in combination with existing imputation methods to make use of haplotype (multi-marker) tagging in addition to simple pairwise LD. We aim to provide a fully functional piece of software and a list of tag SNPs at the end of hackseq.
- https://github.com/tommycarstensen - Tommy
- https://github.com/awreynolds - Austin
- https://github.com/dfornika - Dan
- https://github.com/ameintjes - Ayton
- https://github.com/shaze - Scott
- https://github.com/marciam - Marcia
- https://github.com/jocelynjyl - Jocelyn
- https://github.com/alyeffy - Alyssa - not present Oct.15
- https://github.com/vmon588 - Vince
- https://github.com/brianlee99 - Brian
https://hackseq.slack.com/messages/project9/
- Xu Z, Kaplan NL, Taylor JA. TAGster: efficient selection of LD tag SNPs in single or multiple populations. Bioinformatics. 2007 Dec 1;23(23):3254–5. doi:10.1093/bioinformatics/btm426.
- Hoffmann TJ, Zhan Y, Kvale MN, Hesselson SE, Gollub J, Iribarren C, et al. Design and coverage of high throughput genotyping arrays optimized for individuals of East Asian, African American, and Latino race/ethnicity using imputation and a novel hybrid SNP selection algorithm. Genomics. 2011 Dec;98(6):422–30. doi: 10.1016/j.ygeno.2011.08.007.
- Gurdasani D, Carstensen T, Tekola-Ayele F, Pagani L, Tachmazidou I, Hatzikotoulas K, et al. The African Genome Variation Project shapes medical genetics in Africa. Nature. 2014 Dec 3;517(7534):327–32. doi: 10.1038/nature13997.
- Calculation of LD values with existing code (e.g. PLINK) or new Python code (medium task).
- Selection of tag SNPs from phased (and unphased) data with new code (big task).
- Automated calling of the LD based method and IMPUTE2 in each cycle of a hybrid algorithm, if we choose to go down this path. Making all elements of the code play well together. (big task)
- Evaluation of the hybrid and strictly LD based algorithm (big task).
- Evaluation of our algorithm (speed and accuracy) against existing algorithms; e.g. TAGster (big task).
- Write a paper (medium task).
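For the "new Python code" option in the LD task above, a minimal sketch of pairwise r² between two biallelic SNPs could look like the following. It assumes phased haplotypes coded as 0/1 alleles; the function name `r_squared` and the convention of returning 0 for monomorphic sites are illustrative assumptions, not part of the project code.

```python
def r_squared(hap_a, hap_b):
    """Pairwise LD (r^2) between two biallelic SNPs.

    hap_a, hap_b: equal-length sequences of 0/1 alleles, one entry per
    phased haplotype (hypothetical input layout for this sketch).
    """
    if len(hap_a) != len(hap_b) or len(hap_a) == 0:
        raise ValueError("haplotype vectors must be non-empty and of equal length")
    n = len(hap_a)
    p_a = sum(hap_a) / n  # frequency of allele 1 at SNP A
    p_b = sum(hap_b) / n  # frequency of allele 1 at SNP B
    # frequency of the haplotype carrying allele 1 at both SNPs
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        # At least one SNP is monomorphic, so r^2 is undefined;
        # this sketch reports 0.0 by convention.
        return 0.0
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / denom
```

A pure-Python loop like this is only meant for checking values on small examples; for whole chromosomes, PLINK is likely to remain the practical choice.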
- YRI Yoruba in Ibadan, Nigeria
- LWK Luhya in Webuye, Kenya
- GWD Gambian in Western Divisions in the Gambia
- MSL Mende in Sierra Leone
- ESN Esan in Nigeria
- Ayton - to get 5 different reference panels, each with 1 population left out, in IMPUTE format (excluding ref populations) (?)
- Scott, Vince, Marcia - calculate LD on 1000G data using PLINK
- Austin, Jocelyn, Dan, Brian - to write outline about how to process chromosomes separately
- Dan, Tommy - tag SNP selection
- Marcia - document on Github
- PLINK to calculate LD
- push everything to Github regularly
- someone gets 1000 genomes haplotype panel and get PLINK, start calculating LD values
- capture common variation
- emphasis on writing algorithm, not producing SNPs; existing relevant data is limiting at this point but will improve in the next few months
- select tag SNPs from each of 5 1000 Genomes populations, impute with whole panel, exclude population we're going with... (?)
- TAGster won't work on whole genome data - main problem with TAGster is it takes too long and uses too much memory
- VCF files - can come from the 1000 Genomes FTP site
- will be computationally heavy in the end when we run the algorithm, but early work can be run on a laptop
- when 2 tag SNPs, choose the 1 with higher score (Affymetrix?) - white list
- expect to have white list and black list
- pre-selected SNPs to be included (i.e. "white list")
- ignore indels
- input: a vcf for multiple chromosomes
- output: ~1M tag SNPs
- then extract 1 population...impute missing SNPs, see how well it compares (?)
- don't use rs IDs
- we'll use build 37
- Illumina has uploaded the [relevant data] just recently?
- VCF files will be split by chromosome
- parallelization wouldn't work since won't know budget for each chromosome
- PLINK 1.9
- chromosome 20 to start
- use PLINK to do initial filtering of SNPs; at least minor allele frequency 0.05 on a population level
- use windows of 200K bp (?)
- VCF files (chrom20 for testing) for calculation of LD values as input for the actual algorithm
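The PLINK settings noted above (PLINK 1.9, SNPs only, MAF of at least 0.05, 200 kb windows) could be collected in a small Python helper that builds the command line. This is a sketch only: the function name, the file names, and the r² reporting threshold of 0.8 are assumptions for illustration, and the per-population filtering (e.g. via `--keep`) is left out.

```python
def build_plink_ld_command(vcf_path, out_prefix, maf=0.05, window_kb=200,
                           min_r2=0.8, plink="plink"):
    """Return a PLINK 1.9 argument list for pairwise LD on one chromosome.

    min_r2=0.8 is an assumed default; the notes do not fix a threshold.
    """
    return [
        plink,
        "--vcf", vcf_path,               # per-chromosome VCF input
        "--snps-only",                   # ignore indels
        "--maf", str(maf),               # minimum minor allele frequency
        "--r2",                          # report pairwise r^2
        "--ld-window-kb", str(window_kb),
        "--ld-window", "99999",          # do not cap the pair count per window
        "--ld-window-r2", str(min_r2),   # only report pairs at or above this r^2
        "--out", out_prefix,
    ]


# Hypothetical file names; chromosome 20 is the suggested starting point.
cmd = build_plink_ld_command("chr20.vcf.gz", "chr20_ld")
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once PLINK is on the `PATH`; building the command separately from running it keeps the wrapper easy to test.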
- Pre-selected tag SNPs
- Blacklisted SNPs - not contributing to LD threshold counts, so only LD values can be calculated prior to
- SNP scores for selection of tag SNP between SNPs with identical LD scores
- LD values or count/identity of SNPs above LD threshold for each population
- Minimum MAF threshold
- Maximum distance (cM or bp) that SNPs can be apart
- Maximum number of tag SNPs
- Minimum LD threshold
- Set of candidate SNPs
- File or dictionary of LD counts/scores
- Candidate set of SNPs that should be ignored during each iteration
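To make the inputs listed above concrete, here is a minimal sketch of a greedy, count-based selection loop. It is not the project's scaffold.py: the function name, the single-population `ld_partners` dictionary (SNP to the set of SNPs above the LD threshold, assumed symmetric), and the `chrom:pos` identifiers are all assumptions made for illustration.

```python
def select_tag_snps(ld_partners, max_tags, preselected=(), blacklist=(), scores=None):
    """Greedy tag SNP selection (illustrative sketch).

    ld_partners: dict mapping each SNP to the set of SNPs it is in LD with
                 above the chosen threshold (assumed symmetric).
    max_tags:    maximum number of tag SNPs, pre-selected ones included.
    preselected: white-listed SNPs that are always chosen first.
    blacklist:   SNPs that can neither be tags nor count as needing a tag.
    scores:      optional dict of SNP scores used to break ties.
    """
    scores = scores or {}
    blacklist = set(blacklist)
    untagged = set(ld_partners) - blacklist
    tags = []

    def mark_tagged(snp):
        # Choosing a tag covers the SNP itself and all of its LD partners.
        tags.append(snp)
        untagged.discard(snp)
        untagged.difference_update(ld_partners.get(snp, ()))

    for snp in preselected:
        if snp not in tags:
            mark_tagged(snp)

    candidates = set(ld_partners) - blacklist - set(tags)

    def gain(snp):
        # Number of still-untagged SNPs this candidate would cover.
        return len((ld_partners[snp] | {snp}) & untagged)

    while untagged and candidates and len(tags) < max_tags:
        # Highest count wins; ties go to the higher score, then to sort order.
        best = max(sorted(candidates), key=lambda s: (gain(s), scores.get(s, 0.0)))
        if gain(best) == 0:
            break
        mark_tagged(best)
        candidates.discard(best)
    return tags
```

Recomputing every count in each iteration is fine for a toy example but would be too slow for WGS data; updating only the counts of SNPs near the chosen tag is the obvious refinement.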
- Ayton - finished with reference panel; wrote script that runs, looks for indices that it should exclude; output on shared space (?); input is IMPUTE2 data set, large file (11GB, compressed?)
- Scott - calculated LD files; now working on taking request list, find any SNPs that are in LD with that, combine 2 LD files(?)
- Brian - working on learning concepts
- Marcia - learned file formats, git, started wiki
- Jocelyn, Austin - prepared architecture, code for parsing input, argparse
- Austin - start of python wrapper to call out for PLINK
- Dan - splitting out VCF files for input
- Tommy - led the team, provided tech support, software design
1. PLINK wrapper - Austin to edit to include other functions, such as black list
2. input parser (done)
3. preselect SNPs - Scott
4. select tag SNPs based on counts - Vince
5. update counts - Vince (combined with selecting tag SNPs)
6. write output (probably VCF) - Dan
7. evaluate imputation accuracy w IMPUTE2 - Ayton
8. visualize results (in R ?), plot w minor allele frequency in X-axis, imputation accuracy on y-axis, for each of 5 populations - Marcia
- black list
- concentrate on greedy algorithm, not hybrid algorithm
- Tommy has written lots of code as scaffold.py, including code to select tag SNPs and update counts (i.e. 4 and 5 above)
- Need someone to generate flow-chart to use in Monday presentation -> Marcia
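For the evaluation step (imputation accuracy plotted against minor allele frequency for each population), the per-SNP results would probably need to be summarised by MAF bin before plotting. A minimal sketch, assuming the accuracy values have already been extracted into `(maf, accuracy)` pairs; the function name and the choice of 10 equal-width bins over 0 to 0.5 are assumptions, not decisions recorded in the notes.

```python
def mean_accuracy_by_maf_bin(records, n_bins=10):
    """Average imputation accuracy per MAF bin.

    records: iterable of (maf, accuracy) pairs for one population.
    Returns a dict mapping bin index (0 = lowest MAF) to mean accuracy;
    bins are equal-width over [0, 0.5] and empty bins are omitted.
    """
    totals = {}
    for maf, accuracy in records:
        if not 0.0 <= maf <= 0.5:
            raise ValueError("minor allele frequency must lie in [0, 0.5]")
        # MAF of exactly 0.5 is placed in the last bin.
        idx = min(int(maf / 0.5 * n_bins), n_bins - 1)
        total, count = totals.get(idx, (0.0, 0))
        totals[idx] = (total + accuracy, count + 1)
    return {idx: total / count for idx, (total, count) in sorted(totals.items())}
```

Running this once per population gives one curve per population, which can then be plotted in R or Python.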