Principal component analysis (PCA) is a longstanding and well-studied approach for dimension reduction. It rests upon the assumption that the underlying signal in the data has low rank, and thus can be well-summarized using a small number of dimensions. The output of PCA is typically represented using a scree plot, which displays the proportion of variance explained (PVE) by each principal component. While the PVE is extensively reported in routine data analyses, to the best of our knowledge the notion of inference on the PVE remains unexplored.
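For readers unfamiliar with the quantity on a scree plot, the sample PVE can be sketched in a few lines of base R. This is a minimal illustration and not the code in this repository: the toy matrix `X`, its dimensions, and the choice to center columns before the SVD are all assumptions made for the example.

```r
# Hypothetical toy data: 50 observations, 5 features (not from the paper).
set.seed(1)
X <- matrix(rnorm(50 * 5), nrow = 50, ncol = 5)

# Assumption: center each column before the SVD, as is common in PCA.
X_centered <- scale(X, center = TRUE, scale = FALSE)

# Singular values of the centered matrix, returned in decreasing order.
d <- svd(X_centered)$d

# Sample PVE of component k: d_k^2 divided by the sum of all squared singular values.
pve <- d^2 / sum(d^2)

# Cumulative PVE across the leading components.
cum_pve <- cumsum(pve)
```

Plotting `pve` against the component index gives the usual scree plot; the entries of `pve` are non-increasing and sum to one.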
We consider inference on the PVE. We first introduce a new population quantity for the PVE with respect to an unknown matrix mean. Critically, our interest lies in the PVE of the sample principal components (as opposed to unobserved population principal components); thus, the population PVE that we introduce is defined conditional on the sample singular vectors. We show that it is possible to conduct inference, in the sense of confidence intervals and p-values, on this population quantity. Furthermore, we can conduct valid inference on the PVE of a subset of the principal components, even when the subset is selected using a data-driven approach such as the elbow rule. We demonstrate the proposed approach in simulation and in an application to a gene expression dataset.
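To make "data-driven selection" concrete, here is a sketch of one generic elbow heuristic based on discrete second differences of the squared singular values. It is an assumption for illustration only and is not necessarily the rule implemented in this repository; the values in `d2` are hypothetical.

```r
# Hypothetical squared singular values (illustrative only, not from the paper).
d2 <- c(100, 60, 30, 8, 6, 5, 4)

# Second differences: entry i equals d2[i] - 2 * d2[i + 1] + d2[i + 2],
# which is the discrete second derivative centered at component i + 1.
second_diff <- diff(d2, differences = 2)

# Position of the largest second difference, shifted by 1 to map it
# back to the component index at which the curve bends most sharply.
elbow <- which.max(second_diff) + 1
```

With these values the bend is located at component 4. Whether the components up to the elbow or strictly before it are retained is a convention that varies between rules. Because the selected subset depends on the same data used to compute the PVE, inference that ignores the selection step is generally not valid, which is the issue the proposed approach addresses.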
- `data/` stores simulated data.
- `figures/` contains raw figures from the paper.
- `simulate_confidence_intervals.R` uses command line arguments to generate confidence intervals on simulated data.
- `simulate_hypothesis_tests.R` uses command line arguments to generate p-values on simulated data.
- `run_sh.sh` stores the bash command line calls to the simulation scripts that generate data for the paper figures.
- `make_figures.sh` calls the `FigureX_.*.R` files, using simulation results, to create figures.
- `FigureX_.*.R` recreates paper Figure X, often using data generated by the simulation scripts.
- `scripts/functions/` contains functions imported by other scripts.
  - `confidence_intervals.R` contains code to construct confidence intervals.
  - `estimation.R` contains code to construct point estimates using the MLE and median estimators.
  - `hypothesis_tests.R` contains code to construct p-values.
  - `selection_methods.R` contains code to select the elbow using the ZG and discrete derivative rules.
  - `simluations.R` contains code to simulate data.