Comparison with vroom
How does the speed of write_fst and read_fst compare to the vroom package when using the same number of CPU cores?
Hi @dipterix,
thanks for your question!
Using the package syntheticbench (on CRAN soon) and its built-in streamers for fst and vroom, you can make a comparison between the two:
```r
# devtools::install_github("fstpackage/syntheticbench", ref = "develop")
library(syntheticbench)

# dataset generator used on the fst homepage
# (a future version will have a cleaner interface)
generator <- table_generator(
  "vroom vs fst",
  function(nr_of_rows) {
    data.frame(
      Logical = sample_logical(nr_of_rows, true_false_na_ratio = c(85, 10, 5)),
      Integer = sample_integer(nr_of_rows, max_value = 100L),
      Real = sample_integer(nr_of_rows, 1, 10000, max_distinct_values = 20) / 100,
      Factor = as.factor(sample(labels(UScitiesD), nr_of_rows, replace = TRUE))
    )
  }
)
```
```r
# benchmark vroom against fst for 4e7 rows and the dataset defined above (~15 mins)
res <- synthetic_bench() %>%
  bench_tables(generator) %>%
  bench_rows(4e7) %>%
  bench_streamers(streamer_vroom(), streamer_fst()) %>%
  compute()
```
This gives you a result table like
| Mode | ID | DataID | Compression | Size | Time | NrOfRows | OrigSize | SpeedMBs |
|---|---|---|---|---|---|---|---|---|
| write | vroom | vroom vs fst | -1 | 1263284140 | 106115336301 | 4e+07 | 800002184 | 7.538987 |
| write | fst | vroom vs fst | -1 | 363285087 | 276296701 | 4e+07 | 800002184 | 2895.446023 |
| write | vroom | vroom vs fst | -1 | 1259292777 | 106478294002 | 4e+07 | 800002184 | 7.513289 |
| write | fst | vroom vs fst | -1 | 362738623 | 289502501 | 4e+07 | 800002184 | 2763.368818 |
| write | vroom | vroom vs fst | -1 | 1251302390 | 104129016100 | 4e+07 | 800002184 | 7.682798 |
| write | fst | vroom vs fst | -1 | 364656876 | 286291601 | 4e+07 | 800002184 | 2794.361348 |
| ...rows removed here | | | | | | | | |
| read | fst | vroom vs fst | -1 | 358677118 | 243049401 | 4e+07 | 800002184 | 3291.520904 |
| read | vroom | vroom vs fst | -1 | 1257336252 | 2595551401 | 4e+07 | 280004824 | 107.878744 |
| read | vroom | vroom vs fst | -1 | 1243302779 | 2586288202 | 4e+07 | 280004824 | 108.265128 |
| read | fst | vroom vs fst | -1 | 359388255 | 229142200 | 4e+07 | 800002184 | 3491.291364 |
| read | fst | vroom vs fst | -1 | 363652879 | 234680301 | 4e+07 | 800002184 | 3408.902156 |
| read | vroom | vroom vs fst | -1 | 1249281565 | 2625604101 | 4e+07 | 280004824 | 106.643962 |
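A note on the table's units (inferred from the numbers, not documented output): Time appears to be in nanoseconds and OrigSize in bytes, so the SpeedMBs column can be recomputed directly:

```r
# Recompute SpeedMBs for the first vroom write row:
# OrigSize is in bytes and Time in nanoseconds, so MB/s = bytes / ns * 1000
orig_size <- 800002184      # OrigSize column (bytes)
time_ns   <- 106115336301   # Time column (nanoseconds)
speed_mbs <- orig_size / time_ns * 1000
round(speed_mbs, 6)         # close to the reported 7.538987
```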
You can plot the results using:

```r
library(ggplot2)
library(dplyr)

# show the result bars
res %>%
  group_by(ID, Mode) %>%
  summarise(Speed = median(SpeedMBs)) %>%
  ggplot() +
  geom_bar(aes(Mode, fill = ID, weight = Speed), position = "dodge")
```

As you can see, with this dataset, fst is more than an order of magnitude faster.
Note that the vroom streamer disables ALTREP for all columns when reading a csv file. This is important: if ALTREP is not disabled, the column vectors are never actually materialized and the benchmark becomes meaningless.
There is an option to specify the column types in the vroom() method. I will try to add benchmarks using that option later on (helping vroom a bit) and add data.table to the mix as well...
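As a sketch of what that could look like (assuming the current vroom API; the compact col_types string and the altrep argument follow standard readr/vroom conventions):

```r
# Sketch: hand vroom the column types up front so it can skip type
# guessing, and disable ALTREP so every column is fully materialized
# (the same thing the vroom streamer does in the benchmark)
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(Integer = 1:5, Real = (1:5) / 100),
          csv_path, row.names = FALSE)

if (requireNamespace("vroom", quietly = TRUE)) {
  df <- vroom::vroom(
    csv_path,
    col_types = "id",  # compact spec: integer, double
    altrep = FALSE     # force full materialization on read
  )
  print(df)
}
```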
purge before benchmark
Hi @MarcusKlik. I was profiling my package these days and found something very interesting (not sure if this happens only on my system; I'm using OSX Catalina). My system loads the files to be read into Cached Files. On a Mac this is usually counted as "inactive memory" and considered not very important.

However, my profiling results are highly affected by it. Running the same IO code twice, the cold start (file not cached) is significantly slower than the warm start (file cached).
File reading, disk IO - cold start

Total run time: 74.585 seconds to process 20 GB of fst files (loading time ~56 seconds)
File reading, disk IO - warm start

Total run time: 24.699 seconds to process 20 GB of fst files (loading time ~10 seconds)
The cold start can be reproduced with the purge command, which clears all cached files. This leads me to wonder: when you perform comparisons between fst and other methods, have you tried clearing the system buffers and cached files before reading? Here are some related topics
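The warm-start effect is easy to see with base R alone by timing two back-to-back reads of the same file (the 20 MB size below is arbitrary):

```r
# Time two consecutive reads of the same file with base R.
# Note: writing the file already warms the OS page cache, so a true
# cold start requires clearing caches first (e.g. `purge` on macOS)
path <- tempfile(fileext = ".bin")
writeBin(as.raw(sample.int(256L, 2e7, replace = TRUE) - 1L), path)

t1 <- system.time(x1 <- readBin(path, "raw", n = file.size(path)))["elapsed"]
t2 <- system.time(x2 <- readBin(path, "raw", n = file.size(path)))["elapsed"]
cat(sprintf("first read: %.3fs, second read: %.3fs\n", t1, t2))
unlink(path)
```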
Another thought: since "caching files" improves IO performance so much, is it possible to detect whether a file read will be a cold start? If so, move the file to the cache first and then read it (this could be a whole new project and might be complicated; I just want to share the idea, not request a new feature). On my machine, a single thread can only reach a maximum of ~300 MB/s, which matches the cold-start result (20 GB / 56 s ≈ 300 MB/s; you are using omp critical, so reading is essentially single-threaded). However, when I load the file with multiple threads, it reaches 0.6-0.9 GB/s for quite a few seconds. This means that if we can find a method to load data into the cache at 600 MB/s, fst performance could potentially be doubled.
Although for large files (20 GB) my system falls back to <300 MB/s after a few seconds, for small files (<1 GB) this feature is quite appealing.
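The pre-warming idea can be sketched in a few lines of base R: pull the file through the page cache once, possibly from a background process, before the real parser touches it. The helper below is hypothetical, not part of fst:

```r
# Hypothetical helper: stream a file through the OS page cache in
# chunks so that a subsequent read_fst() call sees warm-start speeds
prewarm <- function(path, chunk_size = 64 * 1024^2) {
  con <- file(path, "rb")
  on.exit(close(con))
  repeat {
    chunk <- readBin(con, "raw", n = chunk_size)
    if (length(chunk) == 0) break
  }
  invisible(file.size(path))
}
```

In practice this only pays off for files comfortably smaller than RAM; for a 20 GB file the cache is evicted faster than it fills, which matches the fall-back to <300 MB/s observed above.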
To correct the comment above, I'm not sure whether >0.6~0.9 GB/s can only be reached via multi-threads or not, I was using multisession parallel in R to run a for loop. You might be able to achieve that speed by adjusting streamer buffer size to system optimal. Also I was using cstdio instead of fstream, which gives me 1.5x performance under omp critical section.
Hi @dipterix, thanks for sharing your ideas and suggestions!
And yes, you're very right, the caching effect is extremely important to take into account. To make sure the files are not cached in the benchmarks, I only use each file once, so new files are generated for each measurement. That makes the benchmark much slower, but at least we can be sure that caching effects play no role in the results.
The caching effect is an OS-level optimization where bytes read from disk are kept in RAM for subsequent reads. So the first time we measure actual disk performance, but after that we basically measure RAM retrieval performance, which is definitely not what we want to measure. Your 300 MB/s limit is most probably a disk-speed limitation on your system; single-threaded disk reads can reach multiple GB/s on fast NVMe disks (but that's still slower than RAM :-)).
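A minimal sketch of that fresh-file-per-measurement approach in base R (a simplification of what the benchmark package does, not its actual code):

```r
# Sketch: generate a brand-new file for every timing so that no
# measurement re-reads data an earlier measurement left in the cache
time_fresh_read <- function(n_bytes) {
  path <- tempfile(fileext = ".bin")
  writeBin(as.raw(sample.int(256L, n_bytes, replace = TRUE) - 1L), path)
  elapsed <- system.time(readBin(path, "raw", n = n_bytes))["elapsed"]
  unlink(path)
  unname(elapsed)
}

timings <- vapply(1:3, function(i) time_fresh_read(1e6), numeric(1))
```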
It's definitely interesting to investigate the performance differences between cstdio and fstream in more detail and see if any gains can be made there!