Comparison with vroom
How does the speed of write_fst and read_fst compare to the vroom package when using the same number of CPU cores?
Hi @dipterix,
thanks for your question!
Using the package syntheticbench (on CRAN soon) and its built-in streamers for fst and vroom, you can make a comparison between the two:
```r
# devtools::install_github("fstpackage/syntheticbench", ref = "develop")
library(syntheticbench)

# dataset generator used on the fst homepage
# (a future version will have a cleaner interface)
generator <- table_generator(
  "vroom vs fst",
  function(nr_of_rows) {
    data.frame(
      Logical = sample_logical(nr_of_rows, true_false_na_ratio = c(85, 10, 5)),
      Integer = sample_integer(nr_of_rows, max_value = 100L),
      Real = sample_integer(nr_of_rows, 1, 10000, max_distinct_values = 20) / 100,
      Factor = as.factor(sample(labels(UScitiesD), nr_of_rows, replace = TRUE))
    )
  }
)
```
```r
# benchmark vroom against fst for 4e7 rows and the dataset defined above (~15 mins)
res <- synthetic_bench() %>%
  bench_tables(generator) %>%
  bench_rows(4e7) %>%
  bench_streamers(streamer_vroom(), streamer_fst()) %>%
  compute()
```
This gives you a result table like
| Mode | ID | DataID | Compression | Size | Time | NrOfRows | OrigSize | SpeedMBs |
|---|---|---|---|---|---|---|---|---|
| write | vroom | vroom vs fst | -1 | 1263284140 | 106115336301 | 4e+07 | 800002184 | 7.538987 |
| write | fst | vroom vs fst | -1 | 363285087 | 276296701 | 4e+07 | 800002184 | 2895.446023 |
| write | vroom | vroom vs fst | -1 | 1259292777 | 106478294002 | 4e+07 | 800002184 | 7.513289 |
| write | fst | vroom vs fst | -1 | 362738623 | 289502501 | 4e+07 | 800002184 | 2763.368818 |
| write | vroom | vroom vs fst | -1 | 1251302390 | 104129016100 | 4e+07 | 800002184 | 7.682798 |
| write | fst | vroom vs fst | -1 | 364656876 | 286291601 | 4e+07 | 800002184 | 2794.361348 |
| ...rows removed here | | | | | | | | |
| read | fst | vroom vs fst | -1 | 358677118 | 243049401 | 4e+07 | 800002184 | 3291.520904 |
| read | vroom | vroom vs fst | -1 | 1257336252 | 2595551401 | 4e+07 | 280004824 | 107.878744 |
| read | vroom | vroom vs fst | -1 | 1243302779 | 2586288202 | 4e+07 | 280004824 | 108.265128 |
| read | fst | vroom vs fst | -1 | 359388255 | 229142200 | 4e+07 | 800002184 | 3491.291364 |
| read | fst | vroom vs fst | -1 | 363652879 | 234680301 | 4e+07 | 800002184 | 3408.902156 |
| read | vroom | vroom vs fst | -1 | 1249281565 | 2625604101 | 4e+07 | 280004824 | 106.643962 |
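A note on the table's units (inferred from the numbers, not documented output): Time appears to be in nanoseconds and OrigSize in bytes, so the SpeedMBs column can be recomputed directly:

```r
# Recompute SpeedMBs for the first vroom write row:
# OrigSize is in bytes and Time in nanoseconds, so MB/s = bytes / ns * 1000
orig_size <- 800002184      # OrigSize column (bytes)
time_ns   <- 106115336301   # Time column (nanoseconds)
speed_mbs <- orig_size / time_ns * 1000
round(speed_mbs, 6)         # close to the reported 7.538987
```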
You can plot the results using:

```r
library(ggplot2)
library(dplyr)

# show the result bars
res %>%
  group_by(ID, Mode) %>%
  summarise(Speed = median(SpeedMBs)) %>%
  ggplot() +
  geom_bar(aes(Mode, fill = ID, weight = Speed), position = "dodge")
```

As you can see, with this dataset, fst is more than an order of magnitude faster.
Note that the vroom streamer disables ALTREP for all columns when reading a csv file. This is important: if ALTREP is not disabled, the column vectors are never actually materialized and the benchmark becomes meaningless.
There is an option to specify the column types in the vroom() method. I will try to add benchmarks using that option later on (helping vroom a bit) and add data.table to the mix as well...
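As a sketch of what that could look like (assuming the current vroom API; the compact col_types string and the altrep argument follow standard readr/vroom conventions):

```r
# Sketch: hand vroom the column types up front so it can skip type
# guessing, and disable ALTREP so every column is fully materialized
# (the same thing the vroom streamer does in the benchmark)
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(Integer = 1:5, Real = (1:5) / 100),
          csv_path, row.names = FALSE)

if (requireNamespace("vroom", quietly = TRUE)) {
  df <- vroom::vroom(
    csv_path,
    col_types = "id",  # compact spec: integer, double
    altrep = FALSE     # force full materialization on read
  )
  print(df)
}
```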
purge before benchmark
Hi @MarcusKlik. I was profiling my package these days and found something very interesting (not sure if this happens only on my system; I'm using OSX Catalina). My system loads the files to be read into Cached Files. On a Mac this is usually counted as "inactive memory" and considered not very important.

However, my profiling results are highly affected by it. Running the same IO code twice, the cold start (file not cached) is significantly slower than the warm start (file cached).
File reading, disk IO - cold start

Total run time: 74.585 seconds to process 20 GB of fst files (loading time ~56 seconds)
File reading, disk IO - warm start

Total run time: 24.699 seconds to process 20 GB of fst files (loading time ~10 seconds)
The cold start can be reproduced with the purge command, which clears all cached files. This leads me to wonder: when you perform comparisons between fst and other methods, have you tried clearing the system buffers and cached files before reading? Here are some related topics
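The warm-start effect is easy to see with base R alone by timing two back-to-back reads of the same file (the 20 MB size below is arbitrary):

```r
# Time two consecutive reads of the same file with base R.
# Note: writing the file already warms the OS page cache, so a true
# cold start requires clearing caches first (e.g. `purge` on macOS)
path <- tempfile(fileext = ".bin")
writeBin(as.raw(sample.int(256L, 2e7, replace = TRUE) - 1L), path)

t1 <- system.time(x1 <- readBin(path, "raw", n = file.size(path)))["elapsed"]
t2 <- system.time(x2 <- readBin(path, "raw", n = file.size(path)))["elapsed"]
cat(sprintf("first read: %.3fs, second read: %.3fs\n", t1, t2))
unlink(path)
```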
Another thought: since "caching files" improves IO performance so much, is it possible to detect whether a file read will be a cold start? If so, move the file to the cache first and then read it (this could be a whole new project and might be complicated; I just want to share the idea, not request a new feature). On my machine, a single thread can only reach a maximum of ~300 MB/s, which matches the cold-start result (20 GB / 56 s ≈ 300 MB/s; you are using omp critical, so reading is essentially single-threaded). However, when I load the file with multiple threads, it reaches 0.6-0.9 GB/s for quite a few seconds. This means that if we can find a method to load data into the cache at 600 MB/s, fst performance could potentially be doubled.
Although for large files (20 GB) my system falls back to <300 MB/s after a few seconds, for small files (<1 GB) this feature is quite appealing.
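The pre-warming idea can be sketched in a few lines of base R: pull the file through the page cache once, possibly from a background process, before the real parser touches it. The helper below is hypothetical, not part of fst:

```r
# Hypothetical helper: stream a file through the OS page cache in
# chunks so that a subsequent read_fst() call sees warm-start speeds
prewarm <- function(path, chunk_size = 64 * 1024^2) {
  con <- file(path, "rb")
  on.exit(close(con))
  repeat {
    chunk <- readBin(con, "raw", n = chunk_size)
    if (length(chunk) == 0) break
  }
  invisible(file.size(path))
}
```

In practice this only pays off for files comfortably smaller than RAM; for a 20 GB file the cache is evicted faster than it fills, which matches the fall-back to <300 MB/s observed above.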
To correct the comment above, I'm not sure whether >0.6~0.9 GB/s can only be reached via multi-threads or not, I was using multisession parallel in R to run a for loop. You might be able to achieve that speed by adjusting streamer buffer size to system optimal. Also I was using cstdio instead of fstream, which gives me 1.5x performance under omp critical section.
Hi @dipterix, thanks for sharing your ideas and suggestions!
And yes, you're very right, the caching effect is extremely important to take into account. To make sure the files are not cached in the benchmarks, I only use each file once, so new files are generated for each measurement. That makes the benchmark much slower, but at least we can be sure that caching effects play no role in the results.
The caching effect is an OS-level optimization where bytes read from disk are kept in RAM for subsequent reads. So the first time we measure actual disk performance, but after that we basically measure RAM retrieval performance, which is definitely not what we want to measure. Your 300 MB/s limit is most probably a disk-speed limitation on your system; single-threaded disk reads can reach multiple GB/s on fast NVMe disks (but that's still slower than RAM :-)).
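A minimal sketch of that fresh-file-per-measurement approach in base R (a simplification of what the benchmark package does, not its actual code):

```r
# Sketch: generate a brand-new file for every timing so that no
# measurement re-reads data an earlier measurement left in the cache
time_fresh_read <- function(n_bytes) {
  path <- tempfile(fileext = ".bin")
  writeBin(as.raw(sample.int(256L, n_bytes, replace = TRUE) - 1L), path)
  elapsed <- system.time(readBin(path, "raw", n = n_bytes))["elapsed"]
  unlink(path)
  unname(elapsed)
}

timings <- vapply(1:3, function(i) time_fresh_read(1e6), numeric(1))
```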
It's definitely interesting to investigate the performance differences between cstdio and fstream in more detail and see if any gains can be made there!