CASIdata provides the datasets from Efron & Hastie (2016, ISBN: 9781108107952), Computer Age Statistical Inference: Algorithms, Evidence, and Data Science, in an accessible R format for teaching, for study, or for reproducing and extending analyses from the book. They were downloaded from Trevor Hastie’s web site, https://hastie.su.domains/CASI_files/DATA/, but quite a few files were messy and required some processing to turn into R datasets.
Even so, some of the datasets may require data cleaning, renaming of variables, re-shaping or other tidying steps to be useful for analysis. But that’s part of learning.
This package is not yet on CRAN. You can install it from this GitHub repo or from R-universe:
remotes::install_github("friendly/CASIdata")
install.packages('CASIdata', repos = c('https://friendly.r-universe.dev'))
| Dataset | dim | Title |
|---|---|---|
| DTI | 15443x4 | DTI Brain Imaging Data |
| als | 1822x371 | ALS Data |
| baseball | 18x3 | Baseball Batting Averages |
| bivnorm | 40x2 | Bivariate Normal Data |
| butterfly | 24x2 | Butterfly Species Data |
| cellinfusion | 25x4 | Cell Infusion Data |
| cholesterol | 164x2 | Cholesterol Data |
| diabetes | 442x12 | Diabetes Data |
| doseresponse | 11x2 | Dose Response Data |
| galaxy | 270x3 | Galaxy Data |
| haplotype | 197x102 | Human Ancestry Haplotype Data |
| insurance | 60x3 | Insurance Life Table Data |
| leukemia_small | 3571x72 | Leukemia Gene Expression Data (Small) |
| ncog | 96x6 | NCOG Head and Neck Cancer Data |
| nodes | 844x2 | Lymph Nodes Cancer Data |
| pediatric | 1620x7 | Pediatric Cancer Survival Data |
| police | 2748x1 | Police Racial Bias Data |
| prostz | 6032x1 | Prostate Cancer Z-values |
| student_score | 22x5 | Student Score Data |
| supernova | 39x11 | Type Ia Supernova Data |
| vasoconstriction | 39x2 | Vasoconstriction Data |
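
A listing like the one above can also be produced from an installed copy of the package. The following is a minimal sketch using base R's `utils::data()`; the helper name `list_datasets` is hypothetical and is not part of CASIdata.

```r
# Hypothetical helper (not part of CASIdata): list the datasets a package ships.
list_datasets <- function(pkg = "CASIdata") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop("Package '", pkg, "' is not installed.", call. = FALSE)
  }
  # data(package = ) returns a list whose `results` element is a character
  # matrix with columns Package, LibPath, Item and Title.
  res <- utils::data(package = pkg)$results
  data.frame(Dataset = res[, "Item"],
             Title   = res[, "Title"],
             stringsAsFactors = FALSE)
}

# Only query CASIdata when it is actually installed.
if (requireNamespace("CASIdata", quietly = TRUE)) {
  print(list_datasets("CASIdata"))
}
```

The same helper works for any installed package, for example `list_datasets("datasets")`.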
The following dataset appears in data-raw/CASI-save.R but is not
(yet) included in the package:
| Dataset | Reason |
|---|---|
| SPAM | Variable names need cleanup; requires mapping from UCI Spambase documentation |
See data-raw/missing-datasets.md for details on resolving this.
These large datasets are referenced in the book but not included in the package due to size constraints. They can be downloaded directly from the sources listed below.
- protein_kernel: 1708 x 1708 inner-product (kernel) matrix for
human proteins (Section 19.6). Computed using a string kernel on
bag-of-4-grams amino acid representations.
- Source: https://hastie.su.domains/CASI_files/DATA/protein_kernel.txt
- Load in R:
protein_kernel <- matrix(scan("https://hastie.su.domains/CASI_files/DATA/protein_kernel.txt", what=0), 1708, 1708)
- protein_label: Response labels (-1/+1) for the 1708 proteins (45
positives, 1663 negatives).
- Source: https://hastie.su.domains/CASI_files/DATA/protein_label.txt
- Load in R:
protein_label <- scan("https://hastie.su.domains/CASI_files/DATA/protein_label.txt", what=0)
- prostmat: 6033 x 102 gene expression matrix comparing 50 controls
vs 52 prostate cancer patients (Section 3.3).
- Source: https://hastie.su.domains/CASI_files/DATA/prostmat.csv
- Load in R:
prostmat <- read.csv("https://hastie.su.domains/CASI_files/DATA/prostmat.csv")
- Note: Column names need cleanup (see data-raw/missing-datasets.md for renaming code)
- leukemia_big: 7128 x 72 gene expression matrix (10MB). A larger version of leukemia_small.
- Source: https://hastie.su.domains/CASI_files/DATA/leukemia_big.csv
- Load in R:
leukemia_big <- read.csv("https://hastie.su.domains/CASI_files/DATA/leukemia_big.csv")
- CIFAR-100: 100 image classes, 600 images each (32x32x3 color). Used in Chapter 18.
- MNIST: Handwritten digit database, 60K training + 10K test images (28x28 grayscale). Used in Chapter 18.
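
For the files hosted under the CASI DATA directory (protein_kernel, protein_label, prostmat, leukemia_big), repeated downloads can be avoided by caching a local copy. This is only a sketch: the helper names `casi_url` and `casi_fetch` are hypothetical, and `tempdir()` is used as the default cache location purely for illustration.

```r
# Base URL taken from the source links listed above.
casi_base <- "https://hastie.su.domains/CASI_files/DATA/"

# Hypothetical helper: build the URL for a file in the CASI DATA directory.
casi_url <- function(file) paste0(casi_base, file)

# Hypothetical helper: download `file` into `dir` unless a copy is already
# there, then return the local path.
casi_fetch <- function(file, dir = tempdir()) {
  dest <- file.path(dir, file)
  if (!file.exists(dest)) {
    # mode = "wb" avoids line-ending translation on Windows
    utils::download.file(casi_url(file), dest, mode = "wb")
  }
  dest
}

# Usage (needs network access, so it is left commented out):
# leukemia_big <- read.csv(casi_fetch("leukemia_big.csv"))
# dim(leukemia_big)
```

Because `tempdir()` is cleared at the end of the R session, pass a persistent directory as `dir` if the files should survive between sessions.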
Some datasets had variables renamed for clarity:
| Dataset | Original | Renamed |
|---|---|---|
| butterfly | x, y | k, count |
| police | X2.411 | z |
| prostz | X1.47236666651029 | z |
| galaxy | wide format | Reshaped to long format with mag, red, freq |
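
To illustrate the kind of wide-to-long reshaping described for galaxy, here is a sketch on a small made-up table of counts. The bin labels and values are invented, and the code actually used to build the packaged dataset may differ; only the column names mag, red and freq come from the table above.

```r
# Toy wide table (made-up counts): rows are magnitude bins, columns are
# redshift bins. Naming the dimnames gives the long columns their names.
wide <- matrix(c(1, 6, 6, 3,
                 1, 4, 5, 7,
                 3, 2, 9, 9),
               nrow = 3, byrow = TRUE,
               dimnames = list(mag = c("18", "17", "16"),
                               red = c("0.01", "0.02", "0.03", "0.04")))

# as.data.frame() on a table yields one row per cell: mag, red, freq.
long <- as.data.frame(as.table(wide),
                      responseName = "freq",
                      stringsAsFactors = FALSE)

# The bin labels come back as character; convert them to numbers.
long$mag <- as.numeric(long$mag)
long$red <- as.numeric(long$red)

str(long)
```

The long form has one row per (mag, red) cell, which is the layout most modelling and plotting functions expect.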
No examples yet.
library(CASIdata)
## basic example code