CASIdata

CASIdata provides the datasets from Efron & Hastie (2016, ISBN: 9781108107952), Computer Age Statistical Inference: Algorithms, Evidence, and Data Science, in an accessible R format for anyone who wants to use them for teaching or study, or to reproduce or extend analyses from the book. The data were downloaded from Trevor Hastie’s web site, https://hastie.su.domains/CASI_files/DATA/, but quite a few files were messy and required some processing to turn into R datasets.

Even so, some of the datasets may require data cleaning, renaming of variables, re-shaping or other tidying steps to be useful for analysis. But that’s part of learning.

Installation

This package is not yet on CRAN. You can install it either from this GitHub repo or from R-universe:

remotes::install_github("friendly/CASIdata")
install.packages('CASIdata', repos = c('https://friendly.r-universe.dev'))

Datasets included here


Dataset	dim	Title
DTI	15443x4	DTI Brain Imaging Data
als	1822x371	ALS Data
baseball	18x3	Baseball Batting Averages
bivnorm	40x2	Bivariate Normal Data
butterfly	24x2	Butterfly Species Data
cellinfusion	25x4	Cell Infusion Data
cholesterol	164x2	Cholesterol Data
diabetes	442x12	Diabetes Data
doseresponse	11x2	Dose Response Data
galaxy	270x3	Galaxy Data
haplotype	197x102	Human Ancestry Haplotype Data
insurance	60x3	Insurance Life Table Data
leukemia_small	3571x72	Leukemia Gene Expression Data (Small)
ncog	96x6	NCOG Head and Neck Cancer Data
nodes	844x2	Lymph Nodes Cancer Data
pediatric	1620x7	Pediatric Cancer Survival Data
police	2748x1	Police Racial Bias Data
prostz	6032x1	Prostate Cancer Z-values
student_score	22x5	Student Score Data
supernova	39x11	Type Ia Supernova Data
vasoconstriction	39x2	Vasoconstriction Data

Missing Datasets

The following dataset appears in data-raw/CASI-save.R but is not (yet) included in the package:

Dataset	Reason
`SPAM`	Variable names need cleanup; requires mapping from UCI Spambase documentation

See data-raw/missing-datasets.md for details on resolving this.

External Datasets (Not Included)

These large datasets are referenced in the book but not included in the package due to size constraints. They can be downloaded directly from the sources listed below.

CASI datasets (too large for CRAN)

protein_kernel: 1708 x 1708 inner-product (kernel) matrix for human proteins (Section 19.6). Computed using a string kernel on bag-of-4-grams amino acid representations.
- Source: https://hastie.su.domains/CASI_files/DATA/protein_kernel.txt
- Load in R: protein_kernel <- matrix(scan("https://hastie.su.domains/CASI_files/DATA/protein_kernel.txt", what=0), 1708, 1708)
protein_label: Response labels (-1/+1) for the 1708 proteins (45 positives, 1663 negatives).
- Source: https://hastie.su.domains/CASI_files/DATA/protein_label.txt
- Load in R: protein_label <- scan("https://hastie.su.domains/CASI_files/DATA/protein_label.txt", what=0)
prostmat: 6033 x 102 gene expression matrix comparing 50 controls vs 52 prostate cancer patients (Section 3.3).
- Source: https://hastie.su.domains/CASI_files/DATA/prostmat.csv
- Load in R: prostmat <- read.csv("https://hastie.su.domains/CASI_files/DATA/prostmat.csv")
- Note: Column names need cleanup (see data-raw/missing-datasets.md for renaming code)
leukemia_big: 7128 x 72 gene expression matrix (10MB). A larger version of leukemia_small.
- Source: https://hastie.su.domains/CASI_files/DATA/leukemia_big.csv
- Load in R: leukemia_big <- read.csv("https://hastie.su.domains/CASI_files/DATA/leukemia_big.csv")
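The one-line `scan()` calls above assume the download is complete and has exactly the expected number of values. A minimal sketch of a more defensive loader follows; `read_square_matrix` is a hypothetical helper, not part of CASIdata, and the demonstration uses a tiny temporary file so that it runs offline.

```r
# Hypothetical helper (not part of CASIdata): read n * n whitespace-separated
# numbers from a file path or URL and return them as an n x n matrix.
read_square_matrix <- function(path, n) {
  values <- scan(path, what = 0, quiet = TRUE)
  if (length(values) != n * n) {
    stop("expected ", n * n, " values but read ", length(values))
  }
  # matrix() fills column by column; for a symmetric kernel matrix this
  # gives the same result as a row-by-row layout.
  matrix(values, nrow = n, ncol = n)
}

# Offline demonstration with a small symmetric 3 x 3 matrix.
tmp <- tempfile(fileext = ".txt")
writeLines(c("2 1 0", "1 2 1", "0 1 2"), tmp)
demo_kernel <- read_square_matrix(tmp, 3)

# An inner-product (kernel) matrix should be symmetric.
isSymmetric(demo_kernel)

# For the real data (requires network access), the call would be:
# protein_kernel <- read_square_matrix(
#   "https://hastie.su.domains/CASI_files/DATA/protein_kernel.txt", 1708)
```

The length check turns a truncated or partial download into an explicit error instead of a silently recycled matrix.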

Image datasets (hosted externally)

CIFAR-100: 100 image classes, 600 images each (32x32x3 color). Used in Chapter 18.
- Source: https://www.cs.toronto.edu/~kriz/cifar.html
MNIST: Handwritten digit database, 60K training + 10K test images (28x28 grayscale). Used in Chapter 18.
- Source: http://yann.lecun.com/exdb/mnist/

Variable Renaming

Some datasets had variables renamed for clarity:

Dataset	Original	Renamed
`butterfly`	x, y	k, count
`police`	X2.411	z
`prostz`	X1.47236666651029	z
`galaxy`	(wide table)	Reshaped to long format with `mag`, `red`, `freq`
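The kinds of changes listed in this table can be sketched in base R as follows. This is an illustration on made-up toy data, not the code actually used to build the package; the object names (`raw`, `police_like`, `butterfly_like`, `wide`, `long`) and all values are hypothetical.

```r
# Toy stand-in for a one-column data frame whose name was auto-generated
# on import (values are invented for illustration).
raw <- data.frame(X2.411 = c(0.52, -1.13, 1.87))

# A single column can simply be renamed by position.
police_like <- raw
names(police_like) <- "z"

# With several columns, renaming by matching the old name is safer.
butterfly_like <- data.frame(x = 1:3, y = c(10, 5, 2))
names(butterfly_like)[names(butterfly_like) == "x"] <- "k"
names(butterfly_like)[names(butterfly_like) == "y"] <- "count"

# Wide-to-long reshaping of a count table: naming the dimnames gives
# the long data frame its column names.
wide <- matrix(1:6, nrow = 2,
               dimnames = list(mag = c("m1", "m2"),
                               red = c("r1", "r2", "r3")))
long <- as.data.frame(as.table(wide), responseName = "freq")
names(long)
```

In the long result, `mag` and `red` are factors; they would need converting if numeric bin values are wanted.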

Example

No examples yet.

library(CASIdata)
## basic example code
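Until worked examples are added, a minimal getting-started sketch might look like the following. It assumes the package is installed and uses the dataset name `baseball` from the table above; the `requireNamespace()` guard is only there so the snippet also runs where CASIdata is absent.

```r
# Guarded so the snippet also runs where CASIdata is not installed.
if (requireNamespace("CASIdata", quietly = TRUE)) {
  # List the datasets shipped with the package (name and title).
  available <- data(package = "CASIdata")$results[, c("Item", "Title")]
  print(head(available))

  # Load one dataset by name and inspect its structure.
  data("baseball", package = "CASIdata")
  str(baseball)
} else {
  message("CASIdata is not installed; see the Installation section.")
}
```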
