ARROW-6982: [R] Add bindings for compare and boolean kernels #7668
wesm
left a comment
This looks OK aside from the comment about ARROW-9380
I rebased. You should be able to support
aa53235 to 77b9b98
77b9b98 to a9149d7
Here's an informal benchmark that shows the benefit of pushing all of this work down into Arrow. The `old` expression reproduces the "old" way (on current master), calling `as.vector()` on the referenced columns so that the comparisons are evaluated in R:

```r
library(arrow)

tab <- read_parquet("nyc-taxi/2019/06/data.parquet", as_data_frame = FALSE)
dim(tab)
## [1] 6941024      18

bench::mark(
  # new: comparison, boolean AND, filter, and mean all run in Arrow
  new = as.vector(mean(tab$fare_amount[tab$trip_distance > 1 & tab$passenger_count < 4], na.rm = TRUE)),
  # old: columns are pulled into R with as.vector() before comparing
  old = mean(as.vector(tab$fare_amount[as.vector(tab$trip_distance) > 1 & as.vector(tab$passenger_count) < 4]), na.rm = TRUE)
)
## # A tibble: 2 x 13
##   expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
##   <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
## 1 new         47.4ms  47.7ms     17.6     10.2KB     1.95     9     1      512ms
## 2 old        207.3ms 213.8ms      4.70   327.6MB    12.5      3     8      638ms
## # … with 4 more variables: result <list>, memory <list>, time <list>, gc <list>
```
Merging; will address any followup concerns in ARROW-9187.
The scope of this has grown to something larger than the description. In addition to adding bindings to boolean kernels, it also changes how the dplyr filter expressions are generated and evaluated for RecordBatch and Table. Previously, any R function could be used to `filter()` because evaluation happened in R by calling `as.vector` on any Arrays referenced. Now, `filter()` translates R function names to Arrow function names, and evaluation passes the function and arguments to `call_function`. The benefit is that filtering a RecordBatch/Table happens all in Arrow, no pulling data into R and then sending back to Arrow to filter it. The cost is that only functions supported in Arrow can be used now.
In addition to these improvements, the patch includes some extra validation, testing, and print method upgrades.
There are a number of less-than-ideal design choices in here. Some are related to https://issues.apache.org/jira/browse/ARROW-9001 because we have to track/make a guess as to whether the result of `call_function` should be an Array, ChunkedArray, etc.
There's also a bit of duplication here between the two Arrow expression classes, this R-specific parse tree of array/compute expressions and the other Dataset filter expressions. I think that's unavoidable at this time but we should and I expect we will rationalize this in the near future.
Closes #7668 from nealrichardson/r-kernels
Authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
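To make the translation step in the description concrete, here is a minimal base-R sketch of turning a filter expression into a tree of Arrow function names. It is not the code from this patch: `r_to_arrow` and `translate_expr()` are hypothetical names, and the mapping table is an assumption that happens to use compute function names such as `greater` and `and_kleene`. In the real package, each node would presumably be evaluated by passing its function name and arguments to `call_function`.

```r
# Hypothetical lookup from R operators to Arrow compute function names.
# The table is illustrative only, not the package's actual mapping.
r_to_arrow <- c(
  "==" = "equal",
  "!=" = "not_equal",
  ">"  = "greater",
  ">=" = "greater_equal",
  "<"  = "less",
  "<=" = "less_equal",
  "&"  = "and_kleene",
  "|"  = "or_kleene",
  "!"  = "invert"
)

# Recursively rewrite an R call into a nested list of the form
# list(fun = <Arrow function name>, args = <translated arguments>).
translate_expr <- function(expr) {
  if (!is.call(expr)) {
    # Leaves (column names, literals) are returned unchanged
    return(expr)
  }
  head <- expr[[1]]
  r_name <- if (is.name(head)) as.character(head) else deparse(head)
  if (identical(r_name, "(")) {
    # Parentheses only group; translate what they contain
    return(translate_expr(expr[[2]]))
  }
  if (!r_name %in% names(r_to_arrow)) {
    # The cost noted above: functions without an Arrow equivalent fail
    stop("Function not supported in Arrow: ", r_name, call. = FALSE)
  }
  list(
    fun = r_to_arrow[[r_name]],
    args = lapply(as.list(expr)[-1], translate_expr)
  )
}

tree <- translate_expr(quote(trip_distance > 1 & passenger_count < 4))
str(tree)
```

An expression such as `quote(nchar(x) > 1)` stops with an error in this sketch because `nchar` has no entry in the lookup, which mirrors the trade-off described above: only functions with an Arrow counterpart can be used.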
commit 306cb2b94b893fffc4a9c862a19dc251ea11dd29
Author: Romain Francois <romain@rstudio.com>
Date: Mon Jul 27 16:55:15 2020 +0200
use cpp11::external_pointer
commit 8e20b3280e168af6321b533a6c0d881b5b818bc8
Author: Romain Francois <romain@rstudio.com>
Date: Mon Jul 27 16:49:13 2020 +0200
- IntegerVector_
commit e297475cd8d45d6315306d591b73b33c8f005272
Author: Romain Francois <romain@rstudio.com>
Date: Mon Jul 27 16:46:27 2020 +0200
- Rcpp::RawVector_
commit 8a4ad16ab41eb2f1c992433adbd6538c14086627
Author: Romain Francois <romain@rstudio.com>
Date: Mon Jul 27 16:38:32 2020 +0200
lint
commit 59d106f27d6fefd21b348a9897f5362470d88226
Author: Romain Francois <romain@rstudio.com>
Date: Mon Jul 27 16:38:10 2020 +0200
as_sexp(const std::vector<std::shared_ptr<T>>) uses to_r_list
commit 837227febd1ef4c9e095d511c70981c527279587
Merge: 0b7a9079d d16793b
Author: Romain Francois <romain@rstudio.com>
Date: Mon Jul 27 12:37:46 2020 +0200
Merge remote-tracking branch 'upstream/master' into cpp11
commit 0b7a9079d78d56aa359f2434abbfba905ac61d34
Author: Romain François <romain@rstudio.com>
Date: Mon Jul 27 12:34:50 2020 +0200
Update r/src/schema.cpp
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
commit a05e76d108b944e301a35f9704aee1839e6dc68f
Author: Romain François <romain@rstudio.com>
Date: Mon Jul 27 12:33:38 2020 +0200
Update r/src/buffer.cpp
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
commit 31f5aafb23a71b8cd4416a9d789139191f20e471
Author: Romain Francois <romain@rstudio.com>
Date: Mon Jul 27 12:32:21 2020 +0200
s/r_vec/RVector/ merge glitch
commit 27740b5fbcd1388ca47c7107db875396620fa530
Merge: aa7d43e28 508ddf2e2
Author: Romain Francois <romain@rstudio.com>
Date: Mon Jul 27 12:29:35 2020 +0200
Merge branch 'cpp11' of https://github.com/romainfrancois/arrow into cpp11
commit 508ddf2e23f91b3ee05e2151e7e65c7538346284
Author: Romain François <romain@rstudio.com>
Date: Mon Jul 27 12:29:19 2020 +0200
Update r/src/arrow_rcpp.h
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
commit 621e2ac03f1940324e803810c1b7ef10e60affea
Author: Romain François <romain@rstudio.com>
Date: Mon Jul 27 12:29:08 2020 +0200
Update r/src/arrow_rcpp.h
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
commit 9c521cb48a9367820072004147f1f07dd7a404ed
Author: Romain François <romain@rstudio.com>
Date: Mon Jul 27 12:28:56 2020 +0200
Update r/src/arrow_rcpp.h
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
commit 5a8c993f47606e4710f5272f5191610f8485aa49
Author: Romain François <romain@rstudio.com>
Date: Mon Jul 27 12:28:31 2020 +0200
Update r/src/arrow_rcpp.h
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
commit 5c9a577a1da672683377f2f2bd3273e4c9e61afb
Author: Romain François <romain@rstudio.com>
Date: Mon Jul 27 12:27:26 2020 +0200
Update r/src/array_from_vector.cpp
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
commit aa7d43e2899fddf2d41b961ae0d908f8a8c1b554
Author: Romain Francois <romain@rstudio.com>
Date: Mon Jul 27 12:22:24 2020 +0200
no longer using Rcpp::StringVector_
commit e9cd2d56aac8c73cf7e5f7c484895ab3932a1f1a
Author: Romain Francois <romain@rstudio.com>
Date: Mon Jul 27 12:21:35 2020 +0200
no longer using Rcpp::LogicalVector_
commit 245053580c12532c45e3666ecca05a3213e586cb
Author: Romain Francois <romain@rstudio.com>
Date: Mon Jul 27 12:20:24 2020 +0200
no longer using NumericVector_
commit cb5b9b6d767ac6a38c7ff682685c34ad05d4e5b2
Author: Romain Francois <romain@rstudio.com>
Date: Mon Jul 27 12:16:14 2020 +0200
no longer need Rcpp::List_
commit 7b17eb3659f9da1351b9b5a12b4dd980aa5dde0f
Author: Romain Francois <romain@rstudio.com>
Date: Mon Jul 27 12:14:52 2020 +0200
csv.cpp -Rcpp
commit facbf2198bce8c631acece8daa010fc79751f9fc
Author: Romain Francois <romain@rstudio.com>
Date: Mon Jul 27 11:30:43 2020 +0200
compute.cpp -Rcpp
commit 84644923d569a916b863ad08a50bc81d7bfd3346
Author: Romain Francois <romain@rstudio.com>
Date: Mon Jul 27 11:25:59 2020 +0200
+ utility from_r_list
commit 6dee592a79892b5e5acab83de4efc53ef3b381fb
Author: Romain Francois <romain@rstudio.com>
Date: Mon Jul 27 09:55:23 2020 +0200
json.cpp -Rcpp
commit 1dc52f39577771f17912a89f69494d1b065b071d
Author: Romain Francois <romain@rstudio.com>
Date: Mon Jul 27 09:41:48 2020 +0200
- RCPP_EXPOSED_ENUM_NODECL
commit 82ea62e0427a15af19dabf49105aaf37466aaed6
Author: Romain Francois <romain@rstudio.com>
Date: Mon Jul 27 09:41:09 2020 +0200
no need for explicit as_sexp()
commit f4d46d7811512e336b032494f03608aa767d2f0f
Author: Romain Francois <romain@rstudio.com>
Date: Mon Jul 27 09:40:32 2020 +0200
remove unused
commit 38c1412ab0fb2bf29a2bdeb2893a1d9cf70df6ff
Author: Romain Francois <romain@rstudio.com>
Date: Mon Jul 27 09:40:19 2020 +0200
forward declare as_sexp() implementations, as this is useful for cpp11::writable::list(std::initializer_list<named_arg>)
commit 66c18b82486f6aca9894e314667445d5193da3ba
Author: Romain Francois <romain@rstudio.com>
Date: Mon Jul 27 09:20:04 2020 +0200
filesystem.cpp -Rcpp
commit eee77284c6c13c0e6061f52763bd4c6f0c156d1b
Author: Romain Francois <romain@rstudio.com>
Date: Fri Jul 24 19:24:55 2020 +0200
array_to_vector using cpp11::list instead of Rcpp::List
commit f1454cad88f8729bad9dd451e0e79f43d79787ce
Author: Romain Francois <romain@rstudio.com>
Date: Fri Jul 24 18:23:47 2020 +0200
retire strings function, as we can just use cpp11::writable::strings instead
commit 41ab305ed5e8e9e2857b0e5d58b1b6f73ad051ab
Author: Romain Francois <romain@rstudio.com>
Date: Fri Jul 24 18:18:12 2020 +0200
cache tbl_df classes
commit 2d5681dd3bf13f4cda41618c20393cb84eecc1a2
Author: Romain Francois <romain@rstudio.com>
Date: Fri Jul 24 17:49:16 2020 +0200
feather.cpp -> cpp11
commit 70a3c4594e37be8258e5d1a96b43f51be62089be
Author: Romain Francois <romain@rstudio.com>
Date: Fri Jul 24 17:29:36 2020 +0200
schema.cpp -> cpp11
commit 343d8d7424a0be8d75c289fae0d37fc864d6bec4
Author: Romain Francois <romain@rstudio.com>
Date: Fri Jul 24 17:13:17 2020 +0200
retire List_to_shared_ptr_vector<> which is handled by arrow::r::input<> automatically
commit e4240524269a17fcdd756df6392b22e3b0757c62
Author: Romain Francois <romain@rstudio.com>
Date: Fri Jul 24 15:23:29 2020 +0200
no lint for now
commit 81f0b3f2fddde0c9576b6f96d4176d343a1303a3
Author: Romain Francois <romain@rstudio.com>
Date: Fri Jul 24 15:21:25 2020 +0200
oops added Mask again by error
commit ca932470c2d8376fe8776cdf34e9fdd5fc380ab9
Author: Romain Francois <romain@rstudio.com>
Date: Fri Jul 24 15:17:11 2020 +0200
not explicit just yet, as this depends on r-lib/cpp11#58
commit 20b5b0d2cb5f807bb72d53cd9fc7f968101389f7
Merge: f39b0e646 baf2094
Author: Romain François <romain@rstudio.com>
Date: Fri Jul 24 14:51:50 2020 +0200
Merge branch 'master' into cpp11
commit f39b0e6462b477386d068d7a46f6d6f60aaa175c
Author: Romain Francois <romain@rstudio.com>
Date: Fri Jul 24 14:50:09 2020 +0200
marking ctor explicit
commit 35fdb67c0ccb10078edd67a1c51bd26b7783f69b
Author: Romain Francois <romain@rstudio.com>
Date: Fri Jul 24 14:46:37 2020 +0200
enable_if suggestion from @bkietz
commit cc5c2ee9ba1527fb8ea6c966d891d160230d9096
Merge: 4c17779a4 4589b73d2
Author: Romain Francois <romain@rstudio.com>
Date: Fri Jul 24 14:43:01 2020 +0200
Merge branch 'cpp11' of https://github.com/romainfrancois/arrow into cpp11
commit 4589b73d2a9a8950a985573925950ba0be01de26
Author: Romain François <romain@rstudio.com>
Date: Fri Jul 24 14:37:05 2020 +0200
Update r/src/arrow_rcpp.h
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
commit 4632a34eaf6db3b86ecc34b7898d189c583440d5
Author: Romain François <romain@rstudio.com>
Date: Fri Jul 24 14:36:50 2020 +0200
Update r/src/arrow_rcpp.h
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
commit bedf7d99ae13c7539c100d1a3bea5e7db853490b
Author: Romain François <romain@rstudio.com>
Date: Fri Jul 24 14:36:37 2020 +0200
Update r/src/arrow_rcpp.h
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
commit 839f1527ba79e5f9019597489d32c9920e5dcfee
Author: Romain François <romain@rstudio.com>
Date: Fri Jul 24 14:36:28 2020 +0200
Update r/src/arrow_rcpp.h
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
commit d0fcbe8da2d3ee90e2ff4d2c5945dfe23af223fe
Author: Romain François <romain@rstudio.com>
Date: Fri Jul 24 14:36:16 2020 +0200
Update r/src/arrow_rcpp.h
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
commit 3223105ff0c855f72dc23d0a246f21e4f3aa6bf8
Author: Romain François <romain@rstudio.com>
Date: Fri Jul 24 14:35:24 2020 +0200
Update r/src/arrow_rcpp.h
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
commit e7aecd964ccb8882bc87e27bdec2dcdcdb65550f
Author: Romain François <romain@rstudio.com>
Date: Fri Jul 24 14:34:50 2020 +0200
Update r/src/memorypool.cpp
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
commit 4c17779a4caa5754f265d0ef1f4ff97aba723976
Author: Romain Francois <romain@rstudio.com>
Date: Fri Jul 24 14:34:10 2020 +0200
More uses of arrow::r::Index
commit 83b4fc293c1741e7daf1f92fd66b90d80444e038
Author: Romain Francois <romain@rstudio.com>
Date: Fri Jul 24 11:31:13 2020 +0200
Move Index handling to C++ side
commit 88308ad4c286a58ec4aa377728138f8e2c22ab03
Merge: 6db322bd7 a06a0f4c6
Author: Neal Richardson <neal.p.richardson@gmail.com>
Date: Thu Jul 23 08:56:28 2020 -0700
Merge branch 'master' into cpp11
commit 623c9dc4972ab60d6205b07a3dcb23418e98aa18
Author: Romain Francois <romain@rstudio.com>
Date: Thu Jul 23 17:42:17 2020 +0200
Rcpp::warning() -> cpp11::warning()
commit 4a4c5f585e9aac6b0231551a49c50a81ce8a472f
Author: Romain Francois <romain@rstudio.com>
Date: Thu Jul 23 17:41:05 2020 +0200
Rcpp::stop() -> cpp11::stop()
commit 6db322bd7bd473e2ff184a619aa1e3cffed8dbd2
Author: Romain Francois <romain@rstudio.com>
Date: Thu Jul 23 16:23:58 2020 +0200
Retire special Rcpp::traits that powered Rcpp::wrap<std::shared_ptr<>>
Use cpp11::as_sexp() explicitly where appropriate
commit d21b892cd769a057bd4562e15e6b3d77beecae21
Author: Romain Francois <romain@rstudio.com>
Date: Thu Jul 23 16:01:01 2020 +0200
Going through `char` confuses as_cpp()
commit 69df6c6f681446013c369a5d7f2cffb3baf8068d
Author: Romain Francois <romain@rstudio.com>
Date: Thu Jul 23 15:37:13 2020 +0200
No longer expecting Rcpp::not_compatible exceptions
commit 29500906a73bf4cf268508439ea6bbd50180c776
Author: Romain Francois <romain@rstudio.com>
Date: Thu Jul 23 14:38:44 2020 +0200
as_index() as a stop gap until cpp11::as_cpp<int>() handles NA_logical_ r-lib/cpp11#53
commit c7f3020f1d562b9bba9001f8077e44e1b802fffa
Author: Romain Francois <romain@rstudio.com>
Date: Thu Jul 23 14:37:55 2020 +0200
using BEGIN_CPP11/END_CPP11
commit a06a0f4c6b3268bbbc8da77521f4a229d77a9c94
Author: Sagnik Chakraborty <sagnikc@dremio.com>
Date: Thu Jul 23 17:57:28 2020 +0530
ARROW-9328: [C++][Gandiva] Add LTRIM, RTRIM, BTRIM functions for string
Closes apache#7641 from sagnikc-dremio/master and squashes the following commits:
4a9985f <Sagnik Chakraborty> ARROW-9328: Add LTRIM, RTRIM, BTRIM functions for string
Authored-by: Sagnik Chakraborty <sagnikc@dremio.com>
Signed-off-by: Praveen <praveen@dremio.com>
commit cc875e7
Author: Neal Richardson <neal.p.richardson@gmail.com>
Date: Wed Jul 22 11:36:08 2020 -0700
ARROW-6982: [R] Add bindings for compare and boolean kernels
The scope of this has grown to something larger than the description. In addition to adding bindings to boolean kernels, it also changes how the dplyr filter expressions are generated and evaluated for RecordBatch and Table. Previously, any R function could be used to `filter()` because evaluation happened in R by calling `as.vector` on any Arrays referenced. Now, `filter()` translates R function names to Arrow function names, and evaluation passes the function and arguments to `call_function`. The benefit is that filtering a RecordBatch/Table happens all in Arrow, no pulling data into R and then sending back to Arrow to filter it. The cost is that only functions supported in Arrow can be used now.
In addition to these improvements, the patch includes some extra validation, testing, and print method upgrades.
There are a number of less-than-ideal design choices in here. Some are related to https://issues.apache.org/jira/browse/ARROW-9001 because we have to track/make a guess as to whether the result of `call_function` should be an Array, ChunkedArray, etc.
There's also a bit of duplication here between the two Arrow expression classes, this R-specific parse tree of array/compute expressions and the other Dataset filter expressions. I think that's unavoidable at this time but we should and I expect we will rationalize this in the near future.
Closes apache#7668 from nealrichardson/r-kernels
Authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
commit 5df1cab6ae5ac771878450661901bec8330bf1f8
Author: Romain Francois <romain@rstudio.com>
Date: Wed Jul 22 17:22:00 2020 +0200
using cpp11::as_sexp() instead of Rcpp::wrap()
commit b8e3e3c88cd819ea49da1825a020fdebf8ab5c75
Author: Romain Francois <romain@rstudio.com>
Date: Wed Jul 22 16:18:43 2020 +0200
lint
commit 962be82f99d99bae1ef1ee61b4fdf61585882b82
Author: Romain Francois <romain@rstudio.com>
Date: Wed Jul 22 16:18:06 2020 +0200
Support for as_cpp<Enum>. Related to r-lib/cpp11#52
commit c27f782
Author: Jorge C. Leitao <jorgecarleitao@gmail.com>
Date: Wed Jul 22 07:43:41 2020 -0600
ARROW-9534: [Rust] [DataFusion] Added support for lit to all supported rust types.
@andygrove fyi
Closes apache#7811 from jorgecarleitao/lit
Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Andy Grove <andygrove73@gmail.com>
commit 126c0eb054447bc3e8f0762f08253c4652301a62
Author: Romain Francois <romain@rstudio.com>
Date: Wed Jul 22 14:38:43 2020 +0200
More uses of cpp11::as_cpp()
commit c52a3e2521bc22ee71ed04af7c0d869bff3ff4aa
Author: Romain Francois <romain@rstudio.com>
Date: Wed Jul 22 14:30:53 2020 +0200
remove ConstReferenceSmartPtrInputParameter, and use cpp11::as_cpp<shared_ptr<T>> to extract it from an R6 host
commit a32c3dbd06b4a4ce645412a240773c800bc5365f
Author: Romain Francois <romain@rstudio.com>
Date: Wed Jul 22 10:54:29 2020 +0200
no longer need ConstReferenceVectorSmartPtrInputParameter
commit c269d17937aea00ee5228c1cb7efac9df8ea238c
Author: Romain Francois <romain@rstudio.com>
Date: Wed Jul 22 10:49:06 2020 +0200
arrow::r::input<> specializations for
- const std::shared_ptr<T>&
- const std::unique_ptr<T>&
- const std::vector<std::shared_ptr<T>>&
steering away from Rcpp::traits::input_parameter<>
commit 852ad6562c6a3c9ca49473fc88a2e9eb456b13bd
Author: Romain Francois <romain@rstudio.com>
Date: Tue Jul 21 14:56:01 2020 +0200
Remove obsolete functions
commit 8640f2e75acae6182439816f7e68be50f7abb320
Author: Romain Francois <romain@rstudio.com>
Date: Tue Jul 21 14:46:13 2020 +0200
using arrow::r::input<T> instead of Rcpp::traits::input_parameter<T> to ease transition
commit 3f44a467246c9eba89c297f37ad00adefcf8d613
Author: Romain Francois <romain@rstudio.com>
Date: Tue Jul 21 14:45:20 2020 +0200
LinkingTo: cpp11
commit aa51b5a
Author: Uwe L. Korn <uwe.korn@quantco.com>
Date: Tue Jul 21 13:18:30 2020 +0200
ARROW-9535: [Python] Remove symlink fixes from conda recipe
Closes apache#7810 from xhochy/test-windows-fix
Authored-by: Uwe L. Korn <uwe.korn@quantco.com>
Signed-off-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
commit c09a82a
Author: Sutou Kouhei <kou@clear-code.com>
Date: Tue Jul 21 06:54:25 2020 +0900
ARROW-9508: [Release][APT][Yum] Enable verification for arm64 binaries
Closes apache#7791 from kou/release-verify-binaries
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>