
Comparison with vroom

dipterix opened this issue 6 years ago · 4 comments

How is the write_fst and read_fst speed compared to vroom package with the same number of CPU cores?

dipterix · Oct 21 '19 18:10

Hi @dipterix,

thanks for your question!

Using the syntheticbench package (on CRAN soon) and its built-in streamers for fst and vroom, you can make a comparison between the two:

# devtools::install_github("fstpackage/syntheticbench", ref = "develop")
library(syntheticbench)

# dataset generator used on fst homepage
# (future method will have a cleaner interface)
generator <- table_generator(
  "vroom vs fst",
  function(nr_of_rows) {
    data.frame(
      Logical = sample_logical(nr_of_rows, true_false_na_ratio = c(85, 10, 5)),
      Integer = sample_integer(nr_of_rows, max_value = 100L),
    Real    = sample_integer(nr_of_rows, 1, 10000, max_distinct_values = 20) / 100,
      Factor  = as.factor(sample(labels(UScitiesD), nr_of_rows, replace = TRUE))
    )}
)

# benchmark vroom against fst for 4e7 rows and dataset defined above (~15 mins)
res <- synthetic_bench() %>%
  bench_tables(generator) %>%
  bench_rows(4e7) %>%
  bench_streamers(streamer_vroom(), streamer_fst()) %>%
  compute()

This gives you a result table like this:

Mode   ID     DataID        Compression  Size        Time          NrOfRows  OrigSize   SpeedMBs
write  vroom  vroom vs fst  -1           1263284140  106115336301  4e+07     800002184  7.538987
write  fst    vroom vs fst  -1           363285087   276296701     4e+07     800002184  2895.446023
write  vroom  vroom vs fst  -1           1259292777  106478294002  4e+07     800002184  7.513289
write  fst    vroom vs fst  -1           362738623   289502501     4e+07     800002184  2763.368818
write  vroom  vroom vs fst  -1           1251302390  104129016100  4e+07     800002184  7.682798
write  fst    vroom vs fst  -1           364656876   286291601     4e+07     800002184  2794.361348
...rows removed here
read   fst    vroom vs fst  -1           358677118   243049401     4e+07     800002184  3291.520904
read   vroom  vroom vs fst  -1           1257336252  2595551401    4e+07     280004824  107.878744
read   vroom  vroom vs fst  -1           1243302779  2586288202    4e+07     280004824  108.265128
read   fst    vroom vs fst  -1           359388255   229142200     4e+07     800002184  3491.291364
read   fst    vroom vs fst  -1           363652879   234680301     4e+07     800002184  3408.902156
read   vroom  vroom vs fst  -1           1249281565  2625604101    4e+07     280004824  106.643962

You can plot the results using

library(ggplot2)
library(dplyr)

# show the result bars
res %>%
  group_by(ID, Mode) %>%
  summarise(Speed = median(SpeedMBs)) %>%
  ggplot() +
  geom_bar(aes(Mode, fill = ID, weight = Speed), position = "dodge")

[plot: median read/write speed (MB/s) per streamer]

As you can see, with this dataset, fst is more than an order of magnitude faster.
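To put a number on that, here is a quick arithmetic check using the medians of the three sample write rows and three sample read rows shown in the table above (the values are copied from the table; this is just arithmetic, not a new benchmark):

```r
# Median SpeedMBs per streamer, taken from the sample rows shown above
fst_write   <- median(c(2895.446023, 2763.368818, 2794.361348))
vroom_write <- median(c(7.538987, 7.513289, 7.682798))
fst_read    <- median(c(3291.520904, 3491.291364, 3408.902156))
vroom_read  <- median(c(107.878744, 108.265128, 106.643962))

round(fst_write / vroom_write)  # write speedup of fst over vroom: 371
round(fst_read / vroom_read)    # read speedup: 32
```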

Note that the vroom streamer disables ALTREP for all columns when reading a CSV file. This is important: if you don't disable the ALTREP representation, the column vectors are never really materialized and the benchmark becomes meaningless.

There is an option to specify the column types in the vroom() function. I will try to add benchmarks using that option later on (helping vroom a bit) and add data.table to the mix as well...
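For reference, a minimal sketch of what such a call might look like, using the column names of the generated dataset above. The file name is illustrative; `altrep` and `col_types` are existing `vroom::vroom()` arguments, but check your vroom version, since `altrep` replaced the older `altrep_opts` argument:

```r
library(vroom)

# Sketch: read with ALTREP disabled (fully materialized vectors) and
# explicit column types, so no type-guessing pass is needed
df <- vroom(
  "dataset.csv",        # illustrative file name
  altrep = FALSE,
  col_types = cols(
    Logical = col_logical(),
    Integer = col_integer(),
    Real    = col_double(),
    Factor  = col_factor()
  )
)
```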

MarcusKlik · Oct 21 '19 22:10

purge before benchmark

Hi @MarcusKlik. I was profiling my package these days and found something very interesting (not sure if this happens only on my system; I'm on macOS Catalina). My system loads files being read into Cached Files. On a Mac this is usually counted as "inactive memory" and considered not very important.

[screenshot: macOS Activity Monitor memory usage showing Cached Files]

However, my profiling results are highly affected by it. Running the same IO code twice, the cold start (file not cached) is significantly slower than the warm start (file cached).

File reading, disk IO - cold start [profiling screenshot]

Total run time: 74.585 seconds to process 20 GB of fst files (loading time ~56 seconds)

File reading, disk IO - warm start [profiling screenshot]

Total run time: 24.699 seconds to process 20 GB of fst files (loading time ~10 seconds)

A cold start can be reproduced with the purge command, which clears all cached files. This leads me to wonder: when you perform comparisons between fst and other methods, have you tried clearing the system buffers and cached files before reading? Here are some related topics

Another thought: since file caching improves IO performance so much, is it possible to detect whether a file read would be a cold start? If so, move the file into the cache first and then read it. (This could be a whole new project and might be complicated; I just want to share the idea, not request a new feature.) On my machine, a single thread can only reach a maximum of ~300 MB/s, which roughly matches the cold-start result (20 GB / 56 s ≈ 360 MB/s; you are using omp critical, so reading is essentially single-threaded). However, when I load the file with multiple threads, it reaches 0.6-0.9 GB/s for quite a few seconds. This means that if we can find a method to load data into the cache at 600 MB/s, fst performance could potentially be doubled.

Although for large files (20 GB) my system falls back to <300 MB/s after a few seconds, for small files (<1 GB) this approach is quite appealing.
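The prefetch idea above can be sketched roughly as follows. `warm_cache` is a hypothetical helper, not part of fst: it reads the file in parallel chunks and discards the bytes, so the OS page cache is populated before `read_fst()` runs. `mclapply` forks, so this sketch is POSIX-only; on macOS, run `sudo purge` first to reproduce a cold start.

```r
# Hypothetical helper: warm the OS file cache by reading the file in
# parallel chunks and discarding the bytes
warm_cache <- function(path, n_workers = 4L) {
  size   <- file.size(path)
  bounds <- floor(seq(0, size, length.out = n_workers + 1L))
  invisible(parallel::mclapply(seq_len(n_workers), function(i) {
    con <- file(path, "rb")
    on.exit(close(con))
    seek(con, bounds[i])
    readBin(con, "raw", n = bounds[i + 1L] - bounds[i])  # read and discard
  }, mc.cores = n_workers))
}

# usage sketch: warm_cache("big.fst"); df <- fst::read_fst("big.fst")
```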

dipterix · Sep 19 '20 00:09

To correct my comment above: I'm not sure whether 0.6-0.9 GB/s can only be reached with multiple threads; I was using multisession parallelism in R to run a for loop. You might be able to achieve that speed by tuning the streamer buffer size to the system optimum. Also, I was using cstdio instead of fstream, which gives me 1.5x performance inside the omp critical section.

dipterix · Sep 19 '20 00:09

Hi @dipterix, thanks for sharing your ideas and suggestions!

And yes, you're very right, the caching effect is extremely important to take into account. To make sure the files are not cached during the benchmarks, I use each file only once, so new files are generated for each measurement. That makes the benchmark much slower, but at least we can be sure the caching effect does not play a role.

The caching effect is an OS-level optimization: bytes read from disk are kept in RAM to serve subsequent reads. So the first time we measure actual disk performance, but after that we are basically measuring RAM retrieval performance, which is definitely not what we want to measure. Your 300 MB/s limit is most probably a limitation of the disk speed on your system. Single-threaded disk reads can reach multiple GB/s on fast NVMe disks (but still slower than RAM :-)).
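The effect is easy to demonstrate; here is a sketch, assuming macOS (where `purge` drops the file cache, and needs sudo) and an illustrative file name `data.fst`:

```r
# Sketch: time the same read twice; the second run is served largely
# from the OS page cache rather than from disk
system("sudo purge")                              # macOS: drop file cache
t_cold <- system.time(fst::read_fst("data.fst"))  # actual disk speed
t_warm <- system.time(fst::read_fst("data.fst"))  # mostly RAM speed
t_cold["elapsed"]
t_warm["elapsed"]
```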

It's definitely interesting to investigate the performance differences between cstdio and fstream in more detail and see if any gains can be made there!

MarcusKlik · Sep 22 '20 11:09