A fast implementation of hurdle models using Rcpp. This package provides the same functionality as the hurdle function in the pscl package, but with improved performance through C++ implementations of key functions. This package is optimized for efficient peak-gene link analysis in large-scale single-nucleus multiome datasets with millions of cells.
Pre-built binaries are available from R-universe, which does not require a C++ compiler or Fortran toolchain:
install.packages("fasthurdle", repos = c("https://mkanai.r-universe.dev", "https://cloud.r-project.org"))Installing from source requires a C++ compiler and GNU Fortran (gfortran, required by the RcppArmadillo dependency):
# install.packages("pak")
pak::pkg_install("mkanai/fasthurdle")For environments where installing compiler toolchains is difficult, a pre-built Docker image is available:
docker run --rm -it masakanai/fasthurdleOn HPC clusters that support Singularity / Apptainer, you can convert the Docker image:
singularity build fasthurdle.sif docker://masakanai/fasthurdle
singularity exec fasthurdle.sif RTo build the image locally, a Dockerfile is provided under docker/:
docker build -t masakanai/fasthurdle docker/library(fasthurdle)
# Generate some sample data
set.seed(42)
n <- 500
x <- rnorm(n)
z <- rnorm(n)
lambda <- exp(1 + 0.5 * x)
p <- plogis(0.5 - 0.5 * z)
y <- rbinom(n, size = 1, prob = p) * rpois(n, lambda = lambda)
# Create a data frame
df <- data.frame(y = y, x = x, z = z)
# Fit a hurdle model
model <- fasthurdle(y ~ x | z, data = df, dist = "poisson", zero.dist = "binomial")
# Print the model summary
summary(model)Note: fasthurdle uses OpenMP multithreading for improved performance. Control the number of threads using the OMP_NUM_THREADS environment variable. When fitting many models, avoid additional parallelization (e.g., with parallel or future) to prevent oversubscription.
Hurdle models are well-suited for analyzing peak-gene associations in single-nucleus multiome data (scRNA-seq + scATAC-seq), as originally introduced in Open4Gene. The two-part hurdle model independently fits: (1) a binomial zero-inflation model testing whether peak accessibility affects the probability of nonzero expression, and (2) a negative binomial count model testing whether peak accessibility affects expression magnitude among expressing cells. This approach explicitly accounts for technical and biological sparsity while modeling the count-based nature of expression measurements with overdispersion.
library(fasthurdle)
# Generate sample data
set.seed(42)
n <- 500
peak_acc <- rpois(n, lambda = 2)
log_total_counts <- rnorm(n, mean = 8, sd = 1)
pct_counts_mito <- runif(n, min = 0, max = 0.2)
lambda <- exp(0.5 + 0.3 * peak_acc + 0.2 * log_total_counts - 0.5 * pct_counts_mito)
p <- plogis(1 - 0.4 * peak_acc - 0.3 * pct_counts_mito)
gene_expr <- rbinom(n, size = 1, prob = p) * rpois(n, lambda = lambda)
# Create a data frame
df <- data.frame(
gene_expr = gene_expr,
peak_acc = peak_acc,
log_total_counts = log_total_counts,
pct_counts_mito = pct_counts_mito
)
# Fit hurdle model with negative binomial count model and binomial zero hurdle
# Use log_total_counts as offset in count model (modeling rates) and covariate in zero model
# Note: You can also use other distributions as needed (dist: "poisson", "geometric"; zero.dist: "poisson", "negbin", "geometric")
model <- fasthurdle(
gene_expr ~ peak_acc + pct_counts_mito + offset(log_total_counts) | peak_acc + log_total_counts + pct_counts_mito,
data = df,
dist = "negbin",
zero.dist = "binomial"
)
# Extract results
model_summary <- summary(model)
# Get coefficients for peak accessibility
peak_coef_zero <- model_summary$coefficients$zero["peak_acc", ] # Zero hurdle component
peak_coef_count <- model_summary$coefficients$count["peak_acc", ] # Count component
# Extract specific statistics
beta_zero <- peak_coef_zero[1] # Coefficient (log-odds of nonzero expression per unit increase in peak accessibility)
se_zero <- peak_coef_zero[2] # Standard error
z_zero <- peak_coef_zero[3] # Z-statistic
p_zero <- peak_coef_zero[4] # P-value
beta_count <- peak_coef_count[1] # Coefficient (change in log gene expression rate per unit increase in peak accessibility)
se_count <- peak_coef_count[2] # Standard error
z_count <- peak_coef_count[3] # Z-statistic
p_count <- peak_coef_count[4] # P-value
# Model fit statistics
aic <- AIC(model)
bic <- BIC(model)For large-scale peak-gene pair testing, use fast_negbin_hurdle, which provides the best performance by directly accepting a model matrix and skipping formula processing:
library(fasthurdle)
# Prepare model matrices and response
# Count model: peak_acc + covariates, with log_total_counts as offset
X <- model.matrix(~ peak_acc + pct_counts_mito, data = df)
y <- df$gene_expr
offsetx <- df$log_total_counts
# Zero model: peak_acc + covariates, with log_total_counts as a free covariate
Z <- model.matrix(~ peak_acc + log_total_counts + pct_counts_mito, data = df)
# Fit the model (same result extraction as fasthurdle above)
model <- fast_negbin_hurdle(X, y, Z = Z, offsetx = offsetx)Average speedup of fasthurdle compared to pscl::hurdle:
| Count Model | Zero Hurdle | Speedup Factor |
|---|---|---|
| geometric | binomial | 2.2x |
| geometric | geometric | 2.3x |
| geometric | negbin | 5.1x |
| geometric | poisson | 2.9x |
| negbin | binomial | 13x |
| negbin | geometric | 12.1x |
| negbin | negbin | 11x |
| negbin | poisson | 11.3x |
| poisson | binomial | 3.4x |
| poisson | geometric | 3.1x |
| poisson | negbin | 5.8x |
| poisson | poisson | 3.7x |
Note: Benchmarks run with sample sizes of 1,000, 10,000, and 100,000. Speedup factor is the ratio of pscl execution time to fasthurdle execution time.
- Supports the same models as
pscl::hurdle:- Count distributions: Poisson, Negative Binomial, Geometric
- Zero hurdle distributions: Binomial, Poisson, Negative Binomial, Geometric
- Compatible API with
pscl::hurdle - Improved performance through C++ implementations
The pscl package, where the original hurdle function was implemented, was developed at the Political Science Computational Laboratory, led by Simon Jackman at Stanford University. The hurdle and count data models in the pscl package were re-written by Achim Zeileis and Christian Kleiber.
The use of hurdle models for peak-gene link analysis in single-nucleus multiome data was originally introduced in Open4Gene (Liu, H. et al., 2025).
- New feature: Added statistical utilities for hurdle model p-value combination and FDR control:
CCT(): Cauchy Combination Test (ACAT) for combining p-values under arbitrary dependency structures.jiang_doerge_fdr(): Two-stage FDR procedure for hurdle models, screening on one component and confirming on the other.acat_stagewise(): ACAT-based omnibus screening with stage-wise Holm confirmation, classifying regulatory mechanisms as "dual", "switch", "rheostat", or "omnibus_only".
- Bug fix: Fixed multiple convergence issues in the negative binomial count model:
- Fixed missing
theta*log(theta)term in the count model log-likelihood (CountNegBinFunctor). The analytical gradient was correct, but the inconsistency with the function value caused BFGS line search failures and incorrect theta/coefficient estimates. - Fixed
maxitandreltolfromhurdle.control()not being forwarded to the C++ optimizer. The roptim library defaulted tomaxit=100instead of the intended10000, causing premature convergence. - Fit count model starting values on
y > 0subset only. The count component models a zero-truncated distribution, but starting values were previously computed from a Poisson GLM on the full dataset including zeros. Fitting on expressing cells provides starting values closer to the truncated MLE, reducing optimizer iterations by 40–75% at high zero fractions and avoiding degenerate local optima at >95% zeros.
- Fixed missing
- New feature: Extended
fast_negbin_hurdle()to support flexible model specification:- Added
Zparameter for specifying a separate design matrix for the zero component. This enables use cases like scRNA-seq depth correction, wherelog(library_size)is an offset in the count model but a covariate in the zero model. - Added
offsetxandoffsetzparameters for specifying offsets in the count and zero components.
- Added
- Initial release.
GPL-2
Masahiro Kanai (mkanai@broadinstitute.org)