Genomics – nicholastsmith

Posted on February 19, 2021February 19, 2021 by Nicholas T Smith

Simulated Genetic Drift in Populations

This video shows a simulation of genetic drift in a synthetically generated population as a result of sexual reproduction. A fixed population size is divided into different distinct sub-populations with differing allele frequencies. Starting with an initial population, 80 simulated epochs are passed in which each population member is replaced via sexual reproduction from two randomly selected parents. Pairs of parents are chosen such that the probability they come from the same sub-population is higher than the probability that they from come different sub-populations.

Figure 1: Simulated Genetic Drift in Populations

In the animation, the high-dimensional allele features for each population member are represented in two dimensions using PCA. The size and color of each marker encodes information about sub-population makeup and may be used to help distinguish highly mixed samples.

Although the data is entirely synthetic, it demonstrates the middling affect that cross-breeding has on the genetic makeup of the overall population. As the number of sub-populations grows and the overall population size remains fixed, the impact of cross-breeding is intensified and the population as a whole converges more rapidly.

Posted on February 5, 2021December 11, 2022 by Nicholas T Smith

CoV-2 Genome Explorer

At the time of this writing, NCBI Virus has over 40,000 complete CoV-2 sequences available. I made an interactive webpage to help visualize a random subset of these genomes using principal component analysis.

Figure 1: CoV-2 Explorer Interface

The x, y, size, and color axes respectively represent the first four principal axes. The plot markers identify viruses with unique nucleotide sequences in the GU280_gp02 (spike glycoprotein) gene region. The range sliders can be used to visualize shift in the viral genomes over time. The bar chart in the bottom right counts the frequency of SNP mutations in the sequences present in the specified date range.

Posted on January 26, 2021January 26, 2021 by Nicholas T Smith

Ancestry Determination Part 3: openSNP Data Evaluation

The website openSNP allows users to share and discuss genetic information [1]. The purpose of the site is perhaps best summarized by its bio on Twitter: “crowdsourcing genome wide association studies” [2]. This post uses the machine learning and statistical techniques developed in previous blog entries to analyze the ancestry of openSNP members who have genotyping data associated to their accounts. By comparing the derived results with the self-reported answers from openSNP users, the accuracy of these methods on real-world data is assessed.
Read more

Posted on January 2, 2021January 2, 2021 by Nicholas T Smith

Identifying Family Relationships using Genetic Similarity Measures

In general, are parents more genetically similar to their children or to their siblings? Why do some siblings look alike while others look completely different? How can the genetic similarity between two individuals be computed? This post uses biostatistics to explore genetic similarity and tries to answer these and other questions.