collapse is sick

R
benchmark
Author

Andrew Ghazi

Published

November 29, 2023

I saw an R package I hadn’t heard of called collapse performing well on the duckdb database-like ops benchmark, and I thought I’d try it out on a little statistical simulation. tl;dr: collapse is sick.

Context

I wanted to know the null p-value distribution that you’d get when mis-applying a t-test to non-normally distributed data with a small number of data points. So to simulate this I:

  • generate 100k groups
  • generate 6*100k random t-distributed (df = 3) synthetic observations
  • compute the two-sided t-test p-value for each group. In R, this is 2*pt(sqrt(6) * abs(mean(x) / sd(x)), 5, lower.tail = FALSE), where x are the 6 random observations for the group: the t statistic is the mean divided by its standard error, with 5 degrees of freedom.
    • This can be done naively with t.test(x)$p.value of course, but that comes with a lot of slow baggage.
    • It’s probably possible to figure out the distribution I’m trying to simulate analytically. Simulating it is easier.
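The formula above can be sanity-checked against t.test() on a single group (a quick sketch; the sample size of 6 and df = 5 match the setup above):

```r
# one group of 6 observations from a t distribution with 3 df
set.seed(1)
x <- rt(6, df = 3)

# manual two-sided p-value: the t statistic is mean / (sd / sqrt(n)), df = n - 1
p_manual <- 2 * pt(sqrt(6) * abs(mean(x) / sd(x)), 5, lower.tail = FALSE)

# the same number, with all of t.test()'s slow baggage
p_ttest <- t.test(x)$p.value

all.equal(p_manual, p_ttest)  # TRUE
```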

We’ll try it in dplyr, data.table, and collapse, making sure to use a single thread on the latter two:

Code
library(dplyr)
library(data.table); setDTthreads(1)
library(collapse); set_collapse(nthreads = 1)

And generate a synthetic dataset. We’ll use 300k groups just so we can get a really smooth histogram with a large number of bins:

Code
n = 3e5

val_dt = data.table(g = rep(1:n, each = 6),
                    x = rt(6 * n, 3))

val_dt
              g           x
          <int>       <num>
      1:      1 -0.08460379
      2:      1  0.76900599
      3:      1 -3.09828899
      4:      1  0.37445603
      5:      1 -0.16769057
     ---                   
1799996: 300000 -0.18144463
1799997: 300000  0.33434800
1799998: 300000 -1.20625276
1799999: 300000  0.43735380
1800000: 300000  0.36069760

Functions

dplyr

So easy to write, so clean.

Code
dplyr_soln = function(val_dt) {
  val_dt |>
    group_by(g) |>
    summarise(p = 2 * pt(abs(mean(x) * sqrt(6) / sd(x)), 
                         5, lower.tail = FALSE))
}

data.table two ways

My first attempt didn’t perform as well as I would have hoped, which turned out to be because I wasn’t using the GForce-optimized functions for the mean and standard deviation. To use those, the expression in j can ONLY contain the GForce-optimized functions min, max, mean, median, var, sd, sum, prod, first, last, head, tail. base::Arithmetic and stats::pt are not among them!

To solve this I used two data.table chained expressions [][]. It’s possible that there’s a more performant way to do it in one pass over the groups (while still using GForce), but I wanted to keep it to within the realm of functions I know how to write intuitively.

Code
dt_soln = function(val_dt) {
  val_dt[,.(p = 2 * pt(abs(mean(x) / (sd(x) / sqrt(6))), 
                       5, lower.tail = FALSE)), 
         by = g]
}

dt_fast_soln = function(val_dt) {
  
  val_dt[,.(m = mean(x),
            s = sd(x)),
         by = g][, p := 2 * pt(abs(m * sqrt(6) / s),
                               5, lower.tail = FALSE)]
}
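
If you want to check whether GForce actually kicks in, data.table can report its query optimizations via the verbose argument (a quick diagnostic sketch; the exact messages vary by version):

```r
library(data.table)

dt <- data.table(g = rep(1:3, each = 2), x = rnorm(6))

# verbose = TRUE makes data.table print its query-optimization decisions; when
# j contains only GForce-able calls like mean()/sd(), it reports GForce was used
res <- dt[, .(m = mean(x), s = sd(x)), by = g, verbose = TRUE]

# wrapping the results in pt() inside j blocks GForce, so no such report appears
res2 <- dt[, .(p = 2 * pt(abs(mean(x) / (sd(x) / sqrt(2))), 1,
                          lower.tail = FALSE)),
           by = g, verbose = TRUE]
```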

collapse

I’ll talk about the amazing performance later, but what really impressed me is how easy it was to write. I literally just copy-pasted my dplyr solution (the gold standard of “easy to write” IMO) then typed four “f”s to switch to the collapse version of the dplyr/base functions (e.g. summarise() -> fsummarise()).

For me, “easy to write” is a HUGE benefit. Difficult to quantify, but so, so important for the interactive programming I do 80% of the time.

Code
collapse_soln = function(val_dt) {
  val_dt |>
  # V here
    fgroup_by(g) |>
  # V here                    V here               V and here
    fsummarise(p = 2 * pt(abs(fmean(x) * sqrt(6) / fsd(x)), 
                          5, lower.tail = FALSE))
}

Benchmark

I tried to follow the guidelines on the data.table benchmarking guidance page, though my per-iteration timings come in well under the recommended 1 second mark. I wanted to test my realistic use case, not something artificially huge :shrug: .

Code
# the all.equal() check fails because dplyr returns a tibble. That's the only difference,
# but it necessitates setting check = FALSE.

bench_res = bench::mark(dplyr = dplyr_soln(val_dt),
                        dt_slow = dt_soln(val_dt),
                        dt_fast = dt_fast_soln(val_dt),
                        collapse = collapse_soln(val_dt),
                        check = FALSE, 
                        min_iterations = 2)

bench_res |> 
  fselect(expression:mem_alloc, n_itr)
# A tibble: 4 × 5
  expression      min   median `itr/sec` mem_alloc
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>
1 dplyr         3.74s    3.82s     0.262      50MB
2 dt_slow       3.09s    3.28s     0.305    12.7MB
3 dt_fast     84.97ms  90.91ms    11.0      60.9MB
4 collapse    52.19ms   54.3ms    18.2        29MB

data.table is about 42× faster than dplyr, and collapse is 70.4× faster. I think collapse somehow manages to vectorize the group-wise means / sds, then uses those as inputs to a single call to the internally vectorized stats::pt().
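That hypothesized strategy can be sketched by hand with collapse’s grouped statistical functions, which take a grouping vector directly (an illustration of the idea, not necessarily what collapse does internally):

```r
library(collapse)

# toy data: 4 groups of 6 t-distributed observations
g <- rep(1:4, each = 6)
x <- rt(24, 3)

# fmean(x, g) and fsd(x, g) each return one value per group in a single
# vectorized pass; pt() is then called just once, on the vector of t statistics
p <- 2 * pt(abs(fmean(x, g) * sqrt(6) / fsd(x, g)), 5, lower.tail = FALSE)
length(p)  # 4, one p-value per group
```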

Misc

  • I also tried it in r-polars, which tied with collapse on speed IF you don’t count the conversion to the pl$DataFrame() class. The syntax was horrifying, and that conversion renders many R functions inoperable from what I can tell. It’s almost like you’re not writing R anymore; it feels more like writing a separate language through an R interface.
  • I’m also quite smitten with collapse’s three letter aliases to the data frame functions, like smr() instead of summarise() and mtt() over mutate(). I’m lazy and an inaccurate typist, so hitting three keys is easier, faster, and more accurate for me than either typing out the whole name or typing then waiting for the tab-complete dialog. My one wish for an addition to collapse would be an alias of arn() to collapse::roworder().
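For instance, the collapse solution from earlier can be rewritten with the short aliases (assuming collapse ≥ 2.0, where gby() and smr() are exported as aliases for fgroup_by() and fsummarise()):

```r
library(collapse)

# hypothetical rewrite of collapse_soln() using the three-letter aliases
collapse_short_soln <- function(val_dt) {
  val_dt |>
    gby(g) |>
    smr(p = 2 * pt(abs(fmean(x) * sqrt(6) / fsd(x)),
                   5, lower.tail = FALSE))
}
```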

And here’s the distribution I was trying to simulate so I can have a figure on this post. Violating the normality assumptions leads to a non-uniform p-value distribution and miscalibrated statistics, surprise!

Code
library(ggplot2)

collapse_soln(val_dt) |> 
  ggplot(aes(p)) + 
  geom_histogram(breaks = seq(0,1, length.out = 50),
                 aes(y = after_stat(density)),
                 color = 'grey50') + 
  geom_hline(yintercept = 1, lty = 2, color = 'grey10') + 
  theme_light() + 
  theme(text = element_text(color = "grey10"),
        strip.text = element_text(color = "grey10"),
        plot.background = element_rect(color = "grey90", fill = "grey90"),
        panel.background = element_rect(color = "grey90", fill = "grey90"),
        panel.grid = element_line(color = "grey70")) + 
  labs(title = "This distribution is not uniform!")

A histogram of simulated p-values from the distribution. The distribution is non-uniform (depleted near 0 and 1 with a mode around 0.3), indicating the t-test assumptions are violated.