collapse is sick
Andrew Ghazi
November 29, 2023

library(dplyr)
library(data.table); setDTthreads(1)
library(collapse); set_collapse(nthreads = 1)
I saw an R package I hadn’t heard of called collapse performing well on the duckdb database-like ops benchmark, and I thought I’d try it out on a little statistical simulation. tl;dr: collapse is sick.
I wanted to know the null p-value distribution that you’d get when mis-applying a t-test to non-normally distributed data with a small number of data points. So to simulate this, for each group I computed the two-sided p-value by hand: 2*pt(sqrt(6) * abs(mean(x) / sd(x)), 5, lower.tail = FALSE), where x is the vector of 6 random observations for the group. (You could get the same number from t.test(x)$p.value of course, but that comes with a lot of slow baggage.) We’ll try it in dplyr, data.table, and collapse, making sure to use a single thread on the latter two — that’s what the setDTthreads() and set_collapse() calls above are for.
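As a quick sanity check (not from the original post), the hand-rolled formula can be wrapped in a function and compared against t.test():

```r
# Two-sided one-sample t-test p-value, computed by hand for n = 6 observations:
# t = mean(x) / (sd(x) / sqrt(6)), referred to a t distribution with 5 df.
p_manual = function(x) {
  2 * pt(sqrt(6) * abs(mean(x) / sd(x)), 5, lower.tail = FALSE)
}

x = rnorm(6)
all.equal(p_manual(x), t.test(x)$p.value)  # should be TRUE
```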
Then we generate a synthetic dataset with 6 observations per group. We’ll use 300k groups just so we can get a really smooth histogram with a large number of bins:
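The data-generating code didn’t survive in this copy of the post, so here is a sketch of what it plausibly looked like. The specific non-normal distribution (a heavy-tailed t here) and the seed are my assumptions — the printed values below came from the author’s own draws and won’t match:

```r
library(data.table)

# Hypothetical reconstruction: 300k groups of 6 draws each from a mean-zero,
# non-normal distribution. The exact distribution used in the post isn't shown;
# a heavy-tailed t is one plausible choice.
set.seed(123)
n_groups = 300000
val_dt = data.table(g = rep(1:n_groups, each = 6),
                    x = rt(6 * n_groups, df = 3))
```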
g x
<int> <num>
1: 1 -0.08460379
2: 1 0.76900599
3: 1 -3.09828899
4: 1 0.37445603
5: 1 -0.16769057
---
1799996: 300000 -0.18144463
1799997: 300000 0.33434800
1799998: 300000 -1.20625276
1799999: 300000 0.43735380
1800000: 300000 0.36069760
dplyr

So easy to write, so clean.
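The solution code blocks didn’t survive extraction in this copy of the post, so here’s a sketch of what dplyr_soln() (the function benchmarked below) presumably looks like, built directly from the formula above:

```r
library(dplyr)

# Presumed reconstruction of the post's dplyr solution: group, then compute
# the manual t-test p-value from each group's mean and sd.
dplyr_soln = function(d) {
  d |>
    group_by(g) |>
    summarise(p = 2 * pt(sqrt(6) * abs(mean(x) / sd(x)), 5,
                         lower.tail = FALSE))
}
```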
data.table two ways

My first attempt didn’t perform as well as I would have hoped, which turned out to be because I wasn’t using the GForce-optimized functions for the mean and standard deviation. To use those, the expression in j can ONLY contain the GForce-optimized functions min, max, mean, median, var, sd, sum, prod, first, last, head, tail. base::Arithmetic and stats::pt are not among them!
To solve this I used two data.table chained expressions [][]. It’s possible that there’s a more performant way to do it in one pass over the groups (while still using GForce), but I wanted to keep it to within the realm of functions I know how to write intuitively.
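The original function definitions aren’t shown in this copy; here is a sketch of the two approaches as I read the description. dt_soln() does everything in one grouped j expression, which blocks GForce; dt_fast_soln() chains two expressions so the grouped j contains only mean() and sd():

```r
library(data.table)

# Presumed "slow" version: pt() and arithmetic inside the grouped j
# prevent GForce optimization of mean() and sd().
dt_soln = function(d) {
  d[, .(p = 2 * pt(sqrt(6) * abs(mean(x) / sd(x)), 5,
                   lower.tail = FALSE)), by = g]
}

# Presumed "fast" version: first pass computes GForce-eligible mean/sd per
# group, second pass applies the vectorized pt() over the whole result.
dt_fast_soln = function(d) {
  d[, .(m = mean(x), s = sd(x)), by = g][
    , .(g, p = 2 * pt(sqrt(6) * abs(m / s), 5, lower.tail = FALSE))]
}
```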
collapse

I’ll talk about the amazing performance later, but what really impressed me is how easy it was to write. I literally just copy-pasted my dplyr solution (the gold standard of “easy to write” IMO) then typed four “f”s to switch to the collapse version of the dplyr/base functions (e.g. summarise() -> fsummarise()).
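The original code block is missing here too; a sketch of what collapse_soln() presumably looks like — the dplyr version with the four “f”s added:

```r
library(collapse)

# Presumed reconstruction: fgroup_by/fsummarise/fmean/fsd in place of the
# dplyr/base equivalents.
collapse_soln = function(d) {
  d |>
    fgroup_by(g) |>
    fsummarise(p = 2 * pt(sqrt(6) * abs(fmean(x) / fsd(x)), 5,
                          lower.tail = FALSE))
}
```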
For me, “easy to write” is a HUGE benefit. Difficult to quantify, but so, so important for the interactive programming I do 80% of the time.
I tried to follow the guidelines on data.table’s benchmarking page, though my iterations run well under the recommended 1-second mark. I wanted to test my realistic use case, not something artificially huge :shrug: .
# the all.equal() check fails because dplyr returns a tibble. That's the only difference,
# but it necessitates setting check = FALSE.
bench_res = bench::mark(dplyr = dplyr_soln(val_dt),
                        dt_slow = dt_soln(val_dt),
                        dt_fast = dt_fast_soln(val_dt),
                        collapse = collapse_soln(val_dt),
                        check = FALSE,
                        min_iterations = 2)
bench_res |>
  fselect(expression:mem_alloc, n_itr)

# A tibble: 4 × 5
expression min median `itr/sec` mem_alloc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt>
1 dplyr 3.74s 3.82s 0.262 50MB
2 dt_slow 3.09s 3.28s 0.305 12.7MB
3 dt_fast 84.97ms 90.91ms 11.0 60.9MB
4 collapse 52.19ms 54.3ms 18.2 29MB
data.table is about 42× faster than dplyr, and collapse is 70.4× faster. I think it somehow manages to vectorize the group-wise means / sds, but also use those as inputs to a single call to the internally vectorized stats::pt().
I also tried r-polars, which tied with collapse in terms of speed IF you don’t count the conversion to the pl$DataFrame() class. But the syntax was horrifying, and the conversion to the pl$DataFrame() class renders many R functions inoperable from what I can tell. It’s almost like you’re not writing R anymore; it feels more like writing a separate language through an R interface.

I also like collapse’s three-letter aliases for the data frame functions, like smr() instead of summarise() and mtt() over mutate(). I’m lazy and an inaccurate typist, so hitting three keys is easier, faster, and more accurate for me than either typing out the whole name or typing then waiting for the tab-complete dialog. My one wish for an addition to collapse would be an arn() alias for collapse::roworder().

And here’s the distribution I was trying to simulate, so I can have a figure on this post. Violating the normality assumptions leads to a non-uniform p-value distribution and miscalibrated statistics, surprise!
library(ggplot2)
collapse_soln(val_dt) |>
ggplot(aes(p)) +
geom_histogram(breaks = seq(0,1, length.out = 50),
aes(y = after_stat(density)),
color = 'grey50') +
geom_hline(yintercept = 1, lty = 2, color = 'grey10') +
theme_light() +
theme(text = element_text(color = "grey10"),
strip.text = element_text(color = "grey10"),
plot.background = element_rect(color = "grey90", fill = "grey90"),
panel.background = element_rect(color = "grey90", fill = "grey90"),
panel.grid = element_line(color = "grey70")) +
labs(title = "This distribution is not uniform!")