Computer Science – nicholastsmith

Posted on February 16, 2023February 16, 2023 by Nicholas T Smith

The Matrix-Vector Product and the SVD

The singular value decomposition (SVD) allows one to re-write a given matrix as a sum of rank one matrices. Specifically, using the SVD, one may re-write a given matrix A as follows:

$\mathbf{A}=\mathbf{U}\mathbf{\Sigma}\mathbf{V}^T = \sum\limits_{i=1}^{n}{\mathbf{U}_i\mathbf{\Sigma}_{ii}\mathbf{V}_i^T}$ ,

where $\mathbf{V}_i^T$ is the transpose of the i-th column of V. Further, the Eckart-Young-Mirsky theorem proves that the best rank k approximation to the matrix A is found by summing only the first k elements of the right-hand sum.
Read more

Posted on October 31, 2022November 1, 2022 by Nicholas T Smith

Kernel Recipes and Kernel PCA

One strength of kernel methods is their ability to operate directly on non-numerical objects like sets. As seen in the previous post, the Jaccard index on sets satisfies Mercer’s condition and thus is a valid kernel. The process of proving a similarity measure is a valid kernel is somewhat involved, but thankfully several theorems can be employed to get more mileage out of the set of known good kernels. This post outlines some recipes for producing new valid kernels and introduces a method for obtaining numerical representations of samples using kernel methods.
Read more

Posted on September 8, 2022 by Nicholas T Smith

On The Importance of Centering in PCA

The previous post presents methods for efficiently performing principal component analysis (PCA) on certain rectangular sparse matrices. Since routines for performing the singular value decomposition (SVD) on sparse matrices are readily available (e.g. svds and TruncatedSVD), it is reasonable to investigate the influence centering has on the resulting transformation.
Read more

Posted on September 5, 2022December 11, 2022 by Nicholas T Smith

Weighted Sparse PCA for Rectangular Matrices

Consider a sparse mxn rectangular matrix $\mathbf{X}$ , where either $m >> n$ or $m << n$ . Performing a principal component analysis (PCA) on $\mathbf{X}$ involves computing the eigenvectors of its covariance matrix. This is often accomplished using the singular value decomposition (SVD) of the centered matrix $\mathbf{X}-\mathbf{U}$ . But, with large sparse matrices, this centering step is frequently intractable. If it is tractable, however, to compute the eigendecomposition of either an mxm or an nxn dense matrix in memory, other approaches are possible.

Posted on May 29, 2021December 11, 2022 by Nicholas T Smith

ICP In Practice

This post explores the iterative constrained pathways rule ensemble (ICPRE) method introduced in an earlier post using the Titanic dataset popularized by Kaggle [1]. The purpose of the text is to introduce the features and explore the behavior of the library.

Some of the code snippets in this post are shortened for brevity sake. To obtain the full source and data, please see the ICPExamples GitHub page [2].
Read more

Posted on May 18, 2021December 11, 2022 by Nicholas T Smith

The Iterative Constrained Pathways Optimizer

Many optimization methods seek an optimal parameter set with regard to error or likelihood. Such a solution is most desirable in many regards. However, when the broader context of a problem is included, the indisputable superiority of the optimum frequently becomes less clear. This context often includes other guidelines and restrictions that may limit the usefulness of solutions lacking certain properties. Unfortunately, typical loss criteria can rarely take these into account.

This blog post presents a method that abandons the quest for optimality and instead focuses on better satisfying the broader context of a problem. It describes a method that does not attempt to find the minimum, but instead simply tries to get closer to it while respecting imposed constraints. This blog post describes the iterative constrained pathways optimizer.
Read more

Posted on January 2, 2021January 2, 2021 by Nicholas T Smith

Identifying Family Relationships using Genetic Similarity Measures

In general, are parents more genetically similar to their children or to their siblings? Why do some siblings look alike while others look completely different? How can the genetic similarity between two individuals be computed? This post uses biostatistics to explore genetic similarity and tries to answer these and other questions.

Posted on September 15, 2020December 11, 2022 by Nicholas T Smith

The Shooting Regressor; Randomized Gradient-Based Ensembles

Gradient boosting stands apart from many other supervised learning approaches in some regard. It is an ensemble method, However, the ensemble is constructed sequentially. It estimates a target value. But, it does so indirectly through the gradient.

Posted on July 3, 2019December 14, 2022 by Nicholas T Smith

RangeMap: A Simple Interval Query Datastructure

Maps are a fundamental data structure. Their prevalence is a testament to their importance. Indeed, many search problems can be reduced to the construction of an appropriate map. However, a search problem occasionally arises that is difficult to solve, at least directly, with a map. The interval query problem is one such problem.

Posted on October 27, 2018November 2, 2018 by Nicholas T Smith

A Method for Addressing Nonhomogeneous Data using an Implicit Hierarchical Linear Model

Datasets containing nonhomogenous groups of samples present a challenge to linear models. In particular, such datasets violate the assumption that there is a linear relationship between the independent and dependent variables. If the data is grouped into distinct clusters, linear models may predict responses that fall in between the clusters. These predictions can be quite far from the targets depending on how the data is structured. In this post, a method is presented for automatically handling nonhomogenous datasets using linear models.