rflperry/elbow_inference

Post-selection inference on the proportion of variance explained by PCA

Principal component analysis (PCA) is a longstanding and well-studied approach for dimension reduction. It rests upon the assumption that the underlying signal in the data has low rank, and thus can be well-summarized using a small number of dimensions. The output of PCA is typically represented using a scree plot, which displays the proportion of variance explained (PVE) by each principal component. While the PVE is extensively reported in routine data analyses, to the best of our knowledge the notion of inference on the PVE remains unexplored.
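As a point of reference, the sample PVE shown in a scree plot can be computed directly from the singular values of the data matrix. The following is a minimal sketch on toy Gaussian data, not code from this repository; the variable names are illustrative, and column-centering the data before the SVD is an assumption that follows the usual PCA convention.

```r
# Sample PVE of each principal component, computed from singular values.
# Toy data and variable names are illustrative, not taken from the repository.
set.seed(1)
X <- matrix(rnorm(50 * 5), nrow = 50, ncol = 5)

# Assumption: column-center the data, as in standard PCA.
Xc <- scale(X, center = TRUE, scale = FALSE)

# Squared singular values are proportional to the variance explained.
s <- svd(Xc)$d
pve <- s^2 / sum(s^2)

# Cross-check against the component variances reported by prcomp().
pc <- prcomp(X)
pve_prcomp <- pc$sdev^2 / sum(pc$sdev^2)
stopifnot(isTRUE(all.equal(pve, pve_prcomp)))

print(round(pve, 3))
```

Plotting `pve` against the component index gives the scree plot. These values are sample quantities only; they carry no confidence intervals or p-values on their own.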

We consider inference on the PVE. We first introduce a new population quantity for the PVE with respect to an unknown matrix mean. Critically, our interest lies in the PVE of the sample principal components (as opposed to unobserved population principal components); thus, the population PVE that we introduce is defined conditional on the sample singular vectors. We show that it is possible to conduct inference, in the sense of confidence intervals and p-values, on this population quantity. Furthermore, we can conduct valid inference on the PVE of a subset of the principal components, even when the subset is selected using a data-driven approach such as the elbow rule. We demonstrate the proposed approach in simulation and in an application to a gene expression dataset.
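To illustrate what a data-driven selection of components can look like, here is a minimal sketch of one simple elbow rule based on the discrete second derivative of the singular values. The function name `select_elbow_second_diff` is hypothetical, and this is not necessarily the rule implemented in `scripts/functions/selection_methods.R`.

```r
# A simple elbow rule: retain the components that come before the sharpest
# bend in the sequence of singular values, measured by the second difference.
# Hypothetical helper for illustration; not the repository's implementation.
select_elbow_second_diff <- function(vals) {
  # Requires at least three values, sorted in non-increasing order.
  stopifnot(length(vals) >= 3, !is.unsorted(rev(vals)))

  # d2[i] = vals[i] - 2 * vals[i + 1] + vals[i + 2], the bend at index i + 1.
  d2 <- diff(vals, differences = 2)

  # The elbow sits at index which.max(d2) + 1, so the components
  # before it, which.max(d2) of them, are retained.
  which.max(d2)
}

vals <- c(10, 8, 1, 0.9, 0.8)
k <- select_elbow_second_diff(vals)
print(k)
```

Here the two leading values stand apart from the rest, so the rule retains two components. Because `k` is chosen by looking at the data, inference on the PVE of the selected components has to account for that selection, which is the setting addressed above.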

Repository overview

  • data/ stores simulated data.
  • figures/ contains raw figures from the paper.
  • simulate_confidence_intervals.R uses command line arguments to generate confidence intervals on simulated data.
  • simulate_hypothesis_tests.R uses command line arguments to generate p-values on simulated data.
  • run_sh.sh stores bash command line calls to the simulation scripts to generate data for the paper figures.
  • make_figures.sh calls FigureX_.*.R files, using simulation results, to create figures.
  • FigureX_.*.R recreates paper figure X, often using data generated by simulation scripts.
  • scripts/functions/ contains functions imported by other scripts.
    • confidence_intervals.R contains code to construct confidence intervals.
    • estimation.R contains code to construct point estimates using the MLE and median estimators.
    • hypothesis_tests.R contains code to construct p-values.
    • selection_methods.R contains code to select the elbow using the ZG and discrete derivative rules.
    • simluations.R contains code to simulate data.
