Commit c9588c5

ARROW-13834: [R][Documentation] Document the process of creating R bindings for compute kernels and rationale behind conventions
Closes #11915 from thisisnic/ARROW-13834_bindings Authored-by: Nic Crane <thisisnic@gmail.com> Signed-off-by: Nic Crane <thisisnic@gmail.com>
1 parent f165370 commit c9588c5

5 files changed

Lines changed: 227 additions & 0 deletions

r/_pkgdown.yml

Lines changed: 2 additions & 0 deletions
@@ -84,6 +84,8 @@ navbar:
           href: articles/developers/install_details.html
         - text: Docker
           href: articles/developers/docker.html
+        - text: Writing Bindings
+          href: articles/developers/bindings.html
 reference:
 - title: Multi-file datasets
   contents:
Lines changed: 225 additions & 0 deletions
@@ -0,0 +1,225 @@
# Writing Bindings

```{r, include=FALSE}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
```

When writing bindings between C++ compute functions and R functions, the aim is
to expose the C++ functionality via the same interface as existing R functions.
The syntax and functionality should match those of the existing R functions
(though there are some exceptions) so that users can keep using existing tidyverse
or base R syntax while taking advantage of the speed and functionality of the
underlying arrow package.

One of the main ways in which users interact with arrow is via
[dplyr](https://dplyr.tidyverse.org/) syntax called on Arrow objects. For
example, when a user calls `dplyr::mutate()` on an Arrow Tabular,
Dataset, or arrow data query object, the Arrow implementation of `mutate()` is
used and, under the hood, it translates the dplyr code into Arrow C++ code.

When using `dplyr::mutate()` or `dplyr::filter()`, you may want to use functions
from other packages. The example below uses `stringr::str_detect()`.

```{r}
library(dplyr)
library(stringr)
starwars %>%
  filter(str_detect(name, "Darth"))
```
This functionality has also been implemented in Arrow, e.g.:

```{r}
library(arrow)
arrow_table(starwars) %>%
  filter(str_detect(name, "Darth")) %>%
  collect()
```

This is possible because a **binding** has been created between the call to the
stringr function `str_detect()` and the Arrow C++ code, here as a direct mapping
to `match_substring_regex`. You can see this for yourself by inspecting the
arrow data query object without retrieving the results via `collect()`.


```{r}
arrow_table(starwars) %>%
  filter(str_detect(name, "Darth"))
```

In the following sections, we'll walk through how to create a binding between an
R function and an Arrow C++ function.

# Walkthrough

Imagine you are writing the bindings for the C++ function
[`starts_with()`](https://arrow.apache.org/docs/cpp/compute.html#containment-tests)
and want to bind it to the (base) R function `startsWith()`.

First, take a look at the docs for both of those functions.

## Examining the R function

Here are the docs for R's `startsWith()` (also available at https://stat.ethz.ch/R-manual/R-devel/library/base/html/startsWith.html):

```{r, echo=FALSE, out.width="50%"}
knitr::include_graphics("./startswithdocs.png")
```

It takes two parameters: `x`, the input, and `prefix`, the characters to check
whether `x` starts with.
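
As a quick illustration of the behaviour the binding will need to reproduce, the sketch below calls base R's `startsWith()` on a small character vector. The vector is made-up example data, not something taken from the arrow sources.

```{r}
# Made-up example data
x <- c("Foo", "bar", "baz", "qux")

# `x` is the data; `prefix` is the string looked for at the start of each element
startsWith(x, prefix = "b")

# Matching is case-sensitive, so "Foo" does not start with "f"
startsWith(x, prefix = "f")
```

The result is a logical vector with one element per element of `x`.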

## Examining the C++ function

Now, go to
[the compute function documentation](https://arrow.apache.org/docs/cpp/compute.html#containment-tests)
and look for the Arrow C++ library's `starts_with()` function:

```{r, echo=FALSE, out.width="100%"}
knitr::include_graphics("./starts_with_docs.png")
```

The docs show that `starts_with()` is a unary function, which means that it takes a
single data input. The data input must be a string-like class, and the returned
value is boolean, both of which match up to R's `startsWith()`.

There is an options class associated with `starts_with()`, called
[`MatchSubstringOptions`](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute21MatchSubstringOptionsE),
so let's take a look at that.

```{r, echo=FALSE, out.width="100%"}
knitr::include_graphics("./matchsubstringoptions.png")
```

Options classes allow the user to control the behaviour of the function. In
this case, there are two possible options which can be supplied, `pattern` and
`ignore_case`, which are described in the docs shown above.

## Comparing the R and C++ functions

What conclusions can be drawn from what you've seen so far?

Base R's `startsWith()` and Arrow's `starts_with()` operate on equivalent data
types and return equivalent data types. As there are no options implemented in
R that Arrow doesn't have, this should be fairly simple to map without a great
deal of extra work.

As `starts_with()` has an options class associated with it, we'll need to make
sure that it's linked up with this in the R code.

In case you're wondering about the difference between arguments in R and options
in Arrow: in R, arguments to functions can include the actual data to be
analysed as well as options governing how the function works, whereas in the
C++ compute functions, the arguments are the data to be analysed and the
options specify exactly how the function works.

So let's get started.

## Step 1 - Add unit tests

We recommend a test-driven-development approach: write failing tests first,
check that they fail, and then write the code needed to make them pass.
Thinking up front about the behavior which needs testing can make it easier to
reason about the code which needs writing later.

Look up the R function that you want to bind the compute kernel to, and write a
set of unit tests that use a dplyr pipeline and `compare_dplyr_binding()` (and
perhaps even `compare_dplyr_error()` if necessary). These functions compare the
output of the original function with the dplyr bindings and make sure they match.
We recommend looking at the documentation next to the source code for these
functions to get a better understanding of how they work.

You should make sure you're testing all parameters of the R function in your
tests.

Below is a possible example test for `startsWith()`.

```{r, eval = FALSE}
test_that("startsWith behaves identically in dplyr and Arrow", {
  df <- tibble(x = c("Foo", "bar", "baz", "qux"))
  compare_dplyr_binding(
    .input %>%
      filter(startsWith(x, "b")) %>%
      collect(),
    df
  )
})
```
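
To make concrete what a test like this checks, the sketch below performs a similar comparison by hand: it runs the same pipeline once on a regular data frame and once on an Arrow Table, then compares the two results. This is only a rough illustration of the idea, not the implementation of `compare_dplyr_binding()`, and it assumes an installed version of arrow in which the `startsWith()` binding already exists.

```{r, eval = FALSE}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
library(testthat)

df <- tibble(x = c("Foo", "bar", "baz", "qux"))

# Reference result: plain dplyr on an R data frame
expected <- df %>%
  filter(startsWith(x, "b"))

# Result under test: the same pipeline evaluated by Arrow, then pulled into R
actual <- arrow_table(df) %>%
  filter(startsWith(x, "b")) %>%
  collect()

# The binding is behaving if both routes give the same rows
expect_equal(actual, expected)
```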

## Step 2 - Hook up the compute function with its options class if necessary

If the C++ compute function can have options specified, make sure that the
function is linked with its options class in `make_compute_options()` in the
file `arrow/r/src/compute.cpp`. You can find out whether a compute function requires
options by looking in the docs here: https://arrow.apache.org/docs/cpp/compute.html

In the case of `starts_with()`, it looks something like this:

```cpp
if (func_name == "starts_with") {
  using Options = arrow::compute::MatchSubstringOptions;
  // Default used when the R side does not supply `ignore_case`
  bool ignore_case = false;
  if (!Rf_isNull(options["ignore_case"])) {
    ignore_case = cpp11::as_cpp<bool>(options["ignore_case"]);
  }
  // `pattern` is passed straight through from the R options list
  return std::make_shared<Options>(cpp11::as_cpp<std::string>(options["pattern"]),
                                   ignore_case);
}
```

You can usually copy and paste from a similar existing example. In this case,
as the option `ignore_case` doesn't map to any parameter of `startsWith()`, we
give it a default value of `false`, but if it has been set, we use the set value
instead. As the `pattern` option maps directly to `prefix` in `startsWith()`,
we can pass it straight through.
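
Once the options class is hooked up, both options can be exercised from R. The sketch below is an illustration rather than part of the binding: it uses `call_function()` to call `starts_with` directly, once relying on the default for `ignore_case` and once overriding it. The input strings are made up, and it assumes that `ignore_case` is read from the options list exactly as in the C++ snippet above.

```{r}
library(arrow, warn.conflicts = FALSE)

# Made-up input data
strings <- Array$create(c("Apache", "arrow", "R", "package"))

# `ignore_case` is not supplied, so the C++ side falls back to `false`
default_case <- call_function(
  "starts_with",
  strings,
  options = list(pattern = "a")
)

# Supplying `ignore_case` overrides the default
any_case <- call_function(
  "starts_with",
  strings,
  options = list(pattern = "a", ignore_case = TRUE)
)

# Convert the Arrow results to R logical vectors for inspection
as.vector(default_case)
as.vector(any_case)
```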

## Step 3 - Map the R function to the C++ kernel

The next task is writing the code which binds the R function to the C++ kernel.

### Step 3a - See if direct mapping is appropriate
Compare the C++ function and R function. If they are simple functions with no
options, it might be possible to map directly between the C++ and R functions in
`unary_function_map` (for compute functions that operate on a single
column of data) or `binary_function_map` (for those which operate on two columns
of data).

As `starts_with()` requires options, direct mapping is not appropriate.
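
For intuition, a direct mapping can be thought of as a lookup from an R function name to the name of a compute function. The sketch below is a simplified, hypothetical illustration of that idea: `unary_map_sketch` and `lookup_compute_name()` are made-up names, and the two entries are assumptions chosen for illustration rather than copied from the arrow source.

```{r}
# Hypothetical lookup table: R function name -> Arrow compute function name.
# The entries are illustrative assumptions, not copied from the arrow source.
unary_map_sketch <- c(
  "toupper" = "utf8_upper",
  "tolower" = "utf8_lower"
)

# Hypothetical helper: return the compute function name mapped to an R
# function name, or NA if there is no direct mapping
lookup_compute_name <- function(r_name) {
  unname(unary_map_sketch[r_name])
}

lookup_compute_name("toupper")
lookup_compute_name("startsWith")
```

A lookup like this has nowhere to put options, which is why a compute function whose options must be supplied needs more than a direct mapping.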

### Step 3b - If direct mapping is not possible, try a modified implementation
If the function cannot be mapped directly, some extra work may be needed to
ensure that calling the arrow version of the function gives the same result
as calling the R version of the function. In this case, the function will need
to be added to the `nse_funcs` list in `arrow/r/R/dplyr-functions.R`. Here is how
this might look for `startsWith()`:

```{r, eval = FALSE}
nse_funcs$startsWith <- function(x, prefix) {
  Expression$create(
    "starts_with",
    x,
    options = list(pattern = prefix)
  )
}
```

Hint: you can use `call_function()` to call a compute function directly from R.
This might be useful if you want to experiment with a compute function while
you're writing bindings for it, e.g.

```{r}
call_function(
  "starts_with",
  Array$create(c("Apache", "Arrow", "R", "package")),
  options = list(pattern = "A")
)
```

## Step 4 - Run (and potentially add to) your tests

In the process of implementing the function, you may end up writing more
tests, for example if you discover unusual edge cases. This is fine: add them
to the ones you wrote originally, and run them all. If they pass, you're done!
Submit a PR. If you've modified the C++ code in the
R package (for example, when hooking up a binding to its options class), you
should make sure to run `arrow/r/lint.sh` to lint the code.