fasthurdle

A fast implementation of hurdle models using Rcpp. This package provides the same functionality as the hurdle function in the pscl package, but with improved performance through C++ implementations of key functions. This package is optimized for efficient peak-gene link analysis in large-scale single-nucleus multiome datasets with millions of cells.

Installation

Pre-built binaries from R-universe (recommended)

Pre-built binaries are available from R-universe, which does not require a C++ compiler or Fortran toolchain:

install.packages("fasthurdle", repos = c("https://mkanai.r-universe.dev", "https://cloud.r-project.org"))

From source

Installing from source requires a C++ compiler and GNU Fortran (gfortran, required by the RcppArmadillo dependency):

# install.packages("pak")
pak::pkg_install("mkanai/fasthurdle")

Docker / Singularity

For environments where installing compiler toolchains is difficult, a pre-built Docker image is available:

docker run --rm -it masakanai/fasthurdle

On HPC clusters that support Singularity / Apptainer, you can convert the Docker image:

singularity build fasthurdle.sif docker://masakanai/fasthurdle
singularity exec fasthurdle.sif R

To build the image locally, a Dockerfile is provided under docker/:

docker build -t masakanai/fasthurdle docker/

General usage

library(fasthurdle)

# Generate some sample data
set.seed(42)
n <- 500
x <- rnorm(n)
z <- rnorm(n)
lambda <- exp(1 + 0.5 * x)
p <- plogis(0.5 - 0.5 * z)
y <- rbinom(n, size = 1, prob = p) * rpois(n, lambda = lambda)

# Create a data frame
df <- data.frame(y = y, x = x, z = z)

# Fit a hurdle model
model <- fasthurdle(y ~ x | z, data = df, dist = "poisson", zero.dist = "binomial")

# Print the model summary
summary(model)

Note: fasthurdle uses OpenMP multithreading for improved performance. Control the number of threads using the OMP_NUM_THREADS environment variable. When fitting many models, avoid additional parallelization (e.g., with parallel or future) to prevent oversubscription.

Peak-gene link analysis

Hurdle models are well-suited for analyzing peak-gene associations in single-nucleus multiome data (scRNA-seq + scATAC-seq), as originally introduced in Open4Gene. The two-part hurdle model independently fits: (1) a binomial zero-inflation model testing whether peak accessibility affects the probability of nonzero expression, and (2) a negative binomial count model testing whether peak accessibility affects expression magnitude among expressing cells. This approach explicitly accounts for technical and biological sparsity while modeling the count-based nature of expression measurements with overdispersion.

Using `fasthurdle` with formula interface

library(fasthurdle)

# Generate sample data
set.seed(42)
n <- 500
peak_acc <- rpois(n, lambda = 2)
log_total_counts <- rnorm(n, mean = 8, sd = 1)
pct_counts_mito <- runif(n, min = 0, max = 0.2)
lambda <- exp(0.5 + 0.3 * peak_acc + 0.2 * log_total_counts - 0.5 * pct_counts_mito)
p <- plogis(1 - 0.4 * peak_acc - 0.3 * pct_counts_mito)
gene_expr <- rbinom(n, size = 1, prob = p) * rpois(n, lambda = lambda)

# Create a data frame
df <- data.frame(
  gene_expr = gene_expr,
  peak_acc = peak_acc,
  log_total_counts = log_total_counts,
  pct_counts_mito = pct_counts_mito
)

# Fit hurdle model with negative binomial count model and binomial zero hurdle
# Use log_total_counts as offset in count model (modeling rates) and covariate in zero model
# Note: You can also use other distributions as needed (dist: "poisson", "geometric"; zero.dist: "poisson", "negbin", "geometric")
model <- fasthurdle(
  gene_expr ~ peak_acc + pct_counts_mito + offset(log_total_counts) | peak_acc + log_total_counts + pct_counts_mito,
  data = df,
  dist = "negbin",
  zero.dist = "binomial"
)

# Extract results
model_summary <- summary(model)

# Get coefficients for peak accessibility
peak_coef_zero <- model_summary$coefficients$zero["peak_acc", ]    # Zero hurdle component
peak_coef_count <- model_summary$coefficients$count["peak_acc", ]  # Count component

# Extract specific statistics
beta_zero <- peak_coef_zero[1]    # Coefficient (log-odds of nonzero expression per unit increase in peak accessibility)
se_zero <- peak_coef_zero[2]      # Standard error
z_zero <- peak_coef_zero[3]       # Z-statistic
p_zero <- peak_coef_zero[4]       # P-value

beta_count <- peak_coef_count[1]  # Coefficient (change in log gene expression rate per unit increase in peak accessibility)
se_count <- peak_coef_count[2]    # Standard error
z_count <- peak_coef_count[3]     # Z-statistic
p_count <- peak_coef_count[4]     # P-value

# Model fit statistics
aic <- AIC(model)
bic <- BIC(model)

Using `fast_negbin_hurdle` for high-performance analysis

For large-scale peak-gene pair testing, use fast_negbin_hurdle, which provides the best performance by directly accepting a model matrix and skipping formula processing:

library(fasthurdle)

# Prepare model matrices and response
# Count model: peak_acc + covariates, with log_total_counts as offset
X <- model.matrix(~ peak_acc + pct_counts_mito, data = df)
y <- df$gene_expr
offsetx <- df$log_total_counts

# Zero model: peak_acc + covariates, with log_total_counts as a free covariate
Z <- model.matrix(~ peak_acc + log_total_counts + pct_counts_mito, data = df)

# Fit the model (same result extraction as fasthurdle above)
model <- fast_negbin_hurdle(X, y, Z = Z, offsetx = offsetx)

Benchmark Results

Average speedup of fasthurdle compared to pscl::hurdle:

Count Model	Zero Hurdle	Speedup Factor
geometric	binomial	2.2x
geometric	geometric	2.3x
geometric	negbin	5.1x
geometric	poisson	2.9x
negbin	binomial	13x
negbin	geometric	12.1x
negbin	negbin	11x
negbin	poisson	11.3x
poisson	binomial	3.4x
poisson	geometric	3.1x
poisson	negbin	5.8x
poisson	poisson	3.7x

Note: Benchmarks run with sample sizes of 1,000, 10,000, and 100,000. Speedup factor is the ratio of pscl execution time to fasthurdle execution time.

Features

Supports the same models as pscl::hurdle:
- Count distributions: Poisson, Negative Binomial, Geometric
- Zero hurdle distributions: Binomial, Poisson, Negative Binomial, Geometric
Compatible API with pscl::hurdle
Improved performance through C++ implementations

Acknowledgements

The pscl package, where the original hurdle function was implemented, was developed at the Political Science Computational Laboratory, led by Simon Jackman at Stanford University. The hurdle and count data models in the pscl package were re-written by Achim Zeileis and Christian Kleiber.

The use of hurdle models for peak-gene link analysis in single-nucleus multiome data was originally introduced in Open4Gene (Liu, H. et al., 2025).

Changelog

v1.1.1 (2026-03-09)

New feature: Added statistical utilities for hurdle model p-value combination and FDR control:
- CCT(): Cauchy Combination Test (ACAT) for combining p-values under arbitrary dependency structures.
- jiang_doerge_fdr(): Two-stage FDR procedure for hurdle models, screening on one component and confirming on the other.
- acat_stagewise(): ACAT-based omnibus screening with stage-wise Holm confirmation, classifying regulatory mechanisms as "dual", "switch", "rheostat", or "omnibus_only".

v1.1.0 (2026-03-05)

Bug fix: Fixed multiple convergence issues in the negative binomial count model:
- Fixed missing theta*log(theta) term in the count model log-likelihood (CountNegBinFunctor). The analytical gradient was correct, but the inconsistency with the function value caused BFGS line search failures and incorrect theta/coefficient estimates.
- Fixed maxit and reltol from hurdle.control() not being forwarded to the C++ optimizer. The roptim library defaulted to maxit=100 instead of the intended 10000, causing premature convergence.
- Fit count model starting values on y > 0 subset only. The count component models a zero-truncated distribution, but starting values were previously computed from a Poisson GLM on the full dataset including zeros. Fitting on expressing cells provides starting values closer to the truncated MLE, reducing optimizer iterations by 40–75% at high zero fractions and avoiding degenerate local optima at >95% zeros.
New feature: Extended fast_negbin_hurdle() to support flexible model specification:
- Added Z parameter for specifying a separate design matrix for the zero component. This enables use cases like scRNA-seq depth correction, where log(library_size) is an offset in the count model but a covariate in the zero model.
- Added offsetx and offsetz parameters for specifying offsets in the count and zero components.

v1.0.0 (2025-07-28)

Initial release.

License

GPL-2

Contact

Masahiro Kanai (mkanai@broadinstitute.org)

Name		Name	Last commit message	Last commit date
Latest commit History 34 Commits
.github/workflows		.github/workflows
R		R
docker		docker
inst/benchmark		inst/benchmark
man		man
src		src
tests		tests
.Rbuildignore		.Rbuildignore
.clang-format		.clang-format
.gitignore		.gitignore
DESCRIPTION		DESCRIPTION
LICENSE		LICENSE
NAMESPACE		NAMESPACE
README.md		README.md
fasthurdle.Rproj		fasthurdle.Rproj

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

fasthurdle

Installation

Pre-built binaries from R-universe (recommended)

From source

Docker / Singularity

General usage

Peak-gene link analysis

Using `fasthurdle` with formula interface

Using `fast_negbin_hurdle` for high-performance analysis

Benchmark Results

Features

Acknowledgements

Changelog

v1.1.1 (2026-03-09)

v1.1.0 (2026-03-05)

v1.0.0 (2025-07-28)

License

Contact

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

fasthurdle

Installation

Pre-built binaries from R-universe (recommended)

From source

Docker / Singularity

General usage

Peak-gene link analysis

Using fasthurdle with formula interface

Using fast_negbin_hurdle for high-performance analysis

Benchmark Results

Features

Acknowledgements

Changelog

v1.1.1 (2026-03-09)

v1.1.0 (2026-03-05)

v1.0.0 (2025-07-28)

License

Contact

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Using `fasthurdle` with formula interface

Using `fast_negbin_hurdle` for high-performance analysis

Packages