| 1 | +# Writing Bindings |
| 2 | + |
| 3 | +```{r, include=FALSE} |
| 4 | +library(arrow, warn.conflicts = FALSE) |
| 5 | +library(dplyr, warn.conflicts = FALSE) |
| 6 | +``` |
| 7 | + |
| 8 | +When writing bindings between C++ compute functions and R functions, the aim is |
| 9 | +to expose the C++ functionality via the same interface as existing R functions. The syntax and |
| 10 | +functionality should match that of the existing R functions |
| 11 | +(though there are some exceptions) so that users are able to use existing tidyverse |
| 12 | +or base R syntax, whilst taking advantage of the speed and functionality of the |
| 13 | +underlying arrow package. |
| 14 | + |
| 15 | +One of the main ways in which users interact with arrow is via |
| 16 | +[dplyr](https://dplyr.tidyverse.org/) syntax called on Arrow objects. For |
| 17 | +example, when a user calls `dplyr::mutate()` on an Arrow Tabular, |
| 18 | +Dataset, or arrow data query object, the Arrow implementation of `mutate()` is |
| 19 | +used and, under the hood, translates the dplyr code into Arrow C++ code. |
| 20 | + |
| 21 | +When using `dplyr::mutate()` or `dplyr::filter()`, you may want to use functions |
| 22 | +from other packages. The example below uses `stringr::str_detect()`. |
| 23 | + |
| 24 | +```{r} |
| 25 | +library(dplyr) |
| 26 | +library(stringr) |
| 27 | +starwars %>% |
| 28 | + filter(str_detect(name, "Darth")) |
| 29 | +``` |
| 30 | +This functionality has also been implemented in Arrow, e.g.: |
| 31 | + |
| 32 | +```{r} |
| 33 | +library(arrow) |
| 34 | +arrow_table(starwars) %>% |
| 35 | + filter(str_detect(name, "Darth")) %>% |
| 36 | + collect() |
| 37 | +``` |
| 38 | + |
| 39 | +This is possible because a **binding** has been created between the call to the |
| 40 | +stringr function `str_detect()` and the Arrow C++ code, here as a direct mapping |
| 41 | +to `match_substring_regex`. You can see this for yourself by inspecting the |
| 42 | +arrow data query object without retrieving the results via `collect()`. |
| 43 | + |
| 44 | + |
| 45 | +```{r} |
| 46 | +arrow_table(starwars) %>% |
| 47 | + filter(str_detect(name, "Darth")) |
| 48 | +``` |
| 49 | + |
| 50 | +In the following sections, we'll walk through how to create a binding between an |
| 51 | +R function and an Arrow C++ function. |
| 52 | + |
| 53 | +# Walkthrough |
| 54 | + |
| 55 | +Imagine you are writing the bindings for the C++ function |
| 56 | +[`starts_with()`](https://arrow.apache.org/docs/cpp/compute.html#containment-tests) |
| 57 | +and want to bind it to the (base) R function `startsWith()`. |
| 58 | + |
| 59 | +First, take a look at the docs for both of those functions. |
| 60 | + |
| 61 | +## Examining the R function |
| 62 | + |
| 63 | +Here are the docs for R's `startsWith()` (also available at https://stat.ethz.ch/R-manual/R-devel/library/base/html/startsWith.html): |
| 64 | + |
| 65 | +```{r, echo=FALSE, out.width="50%"} |
| 66 | +knitr::include_graphics("./startswithdocs.png") |
| 67 | +``` |
| 68 | + |
| 69 | +It takes two parameters: `x`, the input, and `prefix`, the characters to look |
| 70 | +for at the start of `x`. |
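
As a quick illustration of these two parameters, the base R call below checks each element of a character vector against a single prefix. The input vector is made up for this example and is not taken from the docs.

```r
# Example input (invented for illustration)
x <- c("Foo", "bar", "baz", "qux")

# startsWith() is vectorised over x: one logical value per element
result <- startsWith(x, "b")
print(result)  # FALSE TRUE TRUE FALSE
```

The comparison is literal and case-sensitive, so `"Foo"` would not match a prefix of `"f"`.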
| 71 | + |
| 72 | +## Examining the C++ function |
| 73 | + |
| 74 | +Now, go to |
| 75 | +[the compute function documentation](https://arrow.apache.org/docs/cpp/compute.html#containment-tests) |
| 76 | +and look for the Arrow C++ library's `starts_with()` function: |
| 77 | + |
| 78 | +```{r, echo=FALSE, out.width="100%"} |
| 79 | +knitr::include_graphics("./starts_with_docs.png") |
| 80 | +``` |
| 81 | + |
| 82 | +The docs show that `starts_with()` is a unary function, which means that it takes a |
| 83 | +single data input. The data input must be a string-like class, and the returned |
| 84 | +value is boolean, both of which match up to R's `startsWith()`. |
| 85 | + |
| 86 | +There is an options class associated with `starts_with()`, called [`MatchSubstringOptions`](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute21MatchSubstringOptionsE), |
| 87 | +so let's take a look at that. |
| 88 | + |
| 89 | +```{r, echo=FALSE, out.width="100%"} |
| 90 | +knitr::include_graphics("./matchsubstringoptions.png") |
| 91 | +``` |
| 92 | + |
| 93 | +Options classes allow the user to control the behaviour of the function. In |
| 94 | +this case, there are two possible options which can be supplied - `pattern` and |
| 95 | +`ignore_case`, which are described in the docs shown above. |
| 96 | + |
| 97 | +## Comparing the R and C++ functions |
| 98 | + |
| 99 | +What conclusions can be drawn from what you've seen so far? |
| 100 | + |
| 101 | +Base R's `startsWith()` and Arrow's `starts_with()` operate on equivalent data |
| 102 | +types, return equivalent data types, and as there are no options implemented in |
| 103 | +R that Arrow doesn't have, this should be fairly simple to map without a great |
| 104 | +deal of extra work. |
| 105 | + |
| 106 | +As `starts_with()` has an options class associated with it, we'll need to make |
| 107 | +sure that it's linked up with this in the R code. |
| 108 | + |
| 109 | +In case you're wondering about the difference between arguments in R and options |
| 110 | +in Arrow: in R, arguments to functions can include the actual data to be |
| 111 | +analysed as well as options governing how the function works, whereas in the |
| 112 | +C++ compute functions, the arguments are the data to be analysed and the |
| 113 | +options specify how exactly the function works. |
| 114 | + |
| 115 | +So let's get started. |
| 116 | + |
| 117 | +## Step 1 - Add unit tests |
| 118 | + |
| 119 | +We recommend a test-driven-development approach - write failing tests first, |
| 120 | +then check that they fail, and then write the code needed to make them pass. |
| 121 | +Thinking up-front about the behavior which needs testing can make it easier to |
| 122 | +reason about the code which needs writing later. |
| 123 | + |
| 124 | +Look up the R function that you want to bind the compute kernel to, and write a |
| 125 | +set of unit tests that use a dplyr pipeline and `compare_dplyr_binding()` (and |
| 126 | +perhaps even `compare_dplyr_error()` if necessary). These functions compare the |
| 127 | +output of the original function with the dplyr bindings and make sure they match. |
| 128 | +We recommend looking at the documentation next to the source code for these |
| 129 | +functions to get a better understanding of how they work. |
| 130 | + |
| 131 | +You should make sure you're testing all parameters of the R function in your |
| 132 | +tests. |
| 133 | + |
| 134 | +Below is a possible example test for `startsWith()`. |
| 135 | + |
| 136 | +```{r, eval = FALSE} |
| 137 | +test_that("startsWith behaves identically in dplyr and Arrow", { |
| 138 | + df <- tibble(x = c("Foo", "bar", "baz", "qux")) |
| 139 | + compare_dplyr_binding( |
| 140 | + .input %>% |
| 141 | + filter(startsWith(x, "b")) %>% |
| 142 | + collect(), |
| 143 | + df |
| 144 | + ) |
| 145 | + |
| 146 | +}) |
| 147 | +``` |
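
When deciding which further cases to test, it can help to pin down how the R function itself behaves at the edges. The calls below show base R's reference behaviour for a missing value and an empty prefix. The input vector is invented for illustration, and whether the Arrow binding already agrees in each case is for the tests to establish.

```r
# Reference behaviour of base R's startsWith() that a binding would need to match
x <- c("Foo", "bar", NA, "")

# A missing input gives a missing result
na_case <- startsWith(x, "b")

# An empty prefix matches every non-missing string, including ""
empty_prefix <- startsWith(x, "")

print(na_case)
print(empty_prefix)
```

Each of these could become an extra `compare_dplyr_binding()` expectation alongside the basic test above.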
| 148 | + |
| 149 | +## Step 2 - Hook up the compute function with options class if necessary |
| 150 | + |
| 151 | +If the C++ compute function can have options specified, make sure that the |
| 152 | +function is linked with its options class in `make_compute_options()` in the |
| 153 | +file `arrow/r/src/compute.cpp`. You can find out if a compute function requires |
| 154 | +options by looking in the docs here: https://arrow.apache.org/docs/cpp/compute.html |
| 155 | + |
| 156 | +In the case of `starts_with()`, it looks something like this: |
| 157 | + |
| 158 | +```cpp |
| 159 | + if (func_name == "starts_with") { |
| 160 | + using Options = arrow::compute::MatchSubstringOptions; |
| 161 | + bool ignore_case = false; |
| 162 | + if (!Rf_isNull(options["ignore_case"])) { |
| 163 | + ignore_case = cpp11::as_cpp<bool>(options["ignore_case"]); |
| 164 | + } |
| 165 | + return std::make_shared<Options>(cpp11::as_cpp<std::string>(options["pattern"]), |
| 166 | + ignore_case); |
| 167 | + } |
| 168 | +``` |
| 169 | + |
| 170 | +You can usually copy and paste from a similar existing example. In this case, |
| 171 | +as the option `ignore_case` doesn't map to any parameters of `startsWith()`, we |
| 172 | +give it a default value of `false`, but if it's been set, we use the set value |
| 173 | +instead. As the `pattern` option maps directly to `prefix` in `startsWith()`, |
| 174 | +we can pass it straight through. |
| 175 | + |
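
The default-then-override logic in the C++ snippet above can be sketched in plain R to make the behaviour easier to see. The helper `option_or_default()` is hypothetical and is not part of arrow. It only mirrors the pattern of checking whether an option was supplied before falling back to a default.

```r
# Hypothetical helper, not part of arrow: return options[[name]] if it was
# supplied, otherwise fall back to the given default
option_or_default <- function(options, name, default) {
  value <- options[[name]]
  if (is.null(value)) default else value
}

# The options list a startsWith() binding would pass: only `pattern` is set
opts <- list(pattern = "b")

ignore_case <- option_or_default(opts, "ignore_case", FALSE)
pattern <- option_or_default(opts, "pattern", NULL)

print(ignore_case)
print(pattern)
```

Because `startsWith()` has no case-insensitivity argument, a binding for it never sets `ignore_case`, so the default is always the value used on that code path.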
| 176 | +## Step 3 - Map the R function to the C++ kernel |
| 177 | + |
| 178 | +The next task is writing the code which binds the R function to the C++ kernel. |
| 179 | + |
| 180 | +### Step 3a - See if direct mapping is appropriate |
| 181 | +Compare the C++ function and R function. If they are simple functions with no |
| 182 | +options, it might be possible to directly map between the C++ and R functions in |
| 183 | +`unary_function_map`, in the case of compute functions that operate on single |
| 184 | +columns of data, or `binary_function_map` for those which operate on two columns |
| 185 | +of data. |
| 186 | + |
| 187 | +As `starts_with()` requires an options class, direct mapping is not appropriate. |
| 188 | + |
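
Conceptually, a direct mapping is a lookup from an R function name to a compute function name, with no arguments rewritten along the way. The sketch below illustrates that idea only. The map contents and `lookup_compute_name()` are hypothetical and do not reproduce the package's actual list or internals.

```r
# Illustrative map only: these entries are examples, not the package's real list
example_unary_map <- c(
  "toupper" = "utf8_upper",
  "tolower" = "utf8_lower"
)

# Hypothetical lookup: return the compute function name for an R function name,
# or NA when no direct mapping exists
lookup_compute_name <- function(r_name, map) {
  if (r_name %in% names(map)) map[[r_name]] else NA_character_
}

mapped <- lookup_compute_name("toupper", example_unary_map)
unmapped <- lookup_compute_name("startsWith", example_unary_map)

print(mapped)
print(unmapped)
```

A function such as `startsWith()` has no entry in a map like this, because its second argument has to be turned into an option rather than passed through as data.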
| 189 | +### Step 3b - If direct mapping is not possible, try a modified implementation |
| 190 | +If the function cannot be mapped directly, some extra work may be needed to |
| 191 | +ensure that calling the arrow version of the function gives the same result |
| 192 | +as calling the R version of the function. In this case, the function will need |
| 193 | +to be added to the `nse_funcs` list in `arrow/r/R/dplyr-functions.R`. Here is how |
| 194 | +this might look for `startsWith()`: |
| 195 | + |
| 196 | +```{r, eval = FALSE} |
| 197 | +nse_funcs$startsWith <- function(x, prefix) { |
| 198 | + Expression$create( |
| 199 | + "starts_with", |
| 200 | + x, |
| 201 | + options = list(pattern = prefix) |
| 202 | + ) |
| 203 | +} |
| 204 | +``` |
| 205 | + |
| 206 | +Hint: you can use `call_function()` to call a compute function directly from R. |
| 207 | +This might be useful if you want to experiment with a compute function while |
| 208 | +you're writing bindings for it, e.g. |
| 209 | + |
| 210 | +```{r} |
| 211 | +call_function( |
| 212 | + "starts_with", |
| 213 | + Array$create(c("Apache", "Arrow", "R", "package")), |
| 214 | + options = list(pattern = "A") |
| 215 | +) |
| 216 | +``` |
| 217 | + |
| 218 | +## Step 4 - Run (and potentially add to) your tests |
| 219 | + |
| 220 | +In the process of implementing the function, you may end up implementing more |
| 221 | +tests, for example if you discover unusual edge cases. This is fine - add them |
| 222 | +to the ones you wrote originally, and run them all. If they pass, you're done! |
| 223 | +Submit a PR. If you've modified the C++ code in the |
| 224 | +R package (for example, when hooking up a binding to its options class), you |
| 225 | +should make sure to run `arrow/r/lint.sh` to lint the code. |