---
title: "Partitioning Mutate, Example 2"
author: "John Mount, Win-Vector LLC"
date: "2017-11-24"
output:
  tufte::tufte_html: default
  tufte::tufte_handout:
    citation_package: natbib
    latex_engine: xelatex
  tufte::tufte_book:
    citation_package: natbib
    latex_engine: xelatex
#bibliography: skeleton.bib
link-citations: yes
---
```{r setupa, include=FALSE}
library("tufte")
# invalidate cache when the tufte version changes
knitr::opts_chunk$set(tidy = FALSE, cache.extra = packageVersion('tufte'))
options(htmltools.dir.version = FALSE)
```
[`Sparklyr`](https://spark.rstudio.com), with its [`dplyr`](https://CRAN.R-project.org/package=dplyr) translations, allows [`R`](https://www.r-project.org) to perform the heavy lifting that has traditionally been the exclusive domain of proprietary systems such as `SAS`.
In general, `dplyr` is good at handling intermediate variables in the mutate function, so users don't need to think about them.
However, some of that breaks down when the processing is done on the `Apache Spark` side. [Win-Vector LLC](http://win-vector.com)
developed the [`seplyr`](https://winvector.github.io/seplyr/) package to use with
consulting clients to mitigate some of these situations.^[And we distribute the package as open-source to give back to the `R` community.]
In this article we will demonstrate two `seplyr` functions: `if_else_device()` and `partition_mutate_qt()`.
This is a follow-on example building on our ["Partitioning Mutate" article](http://winvector.github.io/FluidData/partition_mutate.html),
showing a larger block sequence based on swaps.^[The source code for this article can be found [here](https://github.com/WinVector/WinVector.github.io/blob/master/FluidData/partition_mutate_ex2.Rmd).] For more motivation and context please see
[the first article](http://winvector.github.io/FluidData/partition_mutate.html).
Please consider the following example data (on a remote `Spark` cluster).
```{r sed, include=FALSE}
library("seplyr")
sc <-
sparklyr::spark_connect(version = '2.2.0',
master = "local")
dL <- data.frame(rowNum = 1:5,
a_1 = "",
a_2 = "",
b_1 = "",
b_2 = "",
c_1 = "",
c_2 = "",
d_1 = "",
d_2 = "",
e_1 = "",
e_2 = "",
stringsAsFactors = FALSE)
d <- dplyr::copy_to(sc, dL,
name = 'd',
overwrite = TRUE,
temporary = TRUE)
```
```{r p1}
class(d)
d %.>%
# avoid https://github.com/tidyverse/dplyr/issues/3216
dplyr::collect(.) %.>%
knitr::kable(.)
```
We find in non-trivial projects it is often necessary to simulate block-`if(){}else{}`
structures in `dplyr` pipelines.
For our example: suppose we wish to assign columns in a complementary
treatment and control design.^[Abraham Wald designed some sequential analysis procedures
in this way, as Nina Zumel [remarked](https://github.com/WinVector/ODSCWest2017/tree/master/MythsOfDataScience). Another strong example is conditionals where you are trying to vary on a per-row basis which column is assigned to, instead of varying which value is assigned.]
To write such a procedure in terms of `dplyr` operations we might simulate the block with code such
as the following.^[Only showing work on the `a` group right now. We are assuming
we want to perform this task on all the grouped letter columns.]
```{r dL}
library("seplyr")
packageVersion("seplyr")
plan <- if_else_device(
  testexpr =
    "rand()>=0.5",
  thenexprs = c(
    "a_1" := "'treatment'",
    "a_2" := "'control'"),
  elseexprs = c(
    "a_1" := "'control'",
    "a_2" := "'treatment'")) %.>%
  partition_mutate_se(.)
```
We are using the indent notation to indicate the code-blocks we are simulating
with row-wise `if(){}else{}` blocks.^[For more on this concept, please see: [the `if_else_device` reference](https://winvector.github.io/seplyr/reference/if_else_device.html).] The `if_else_device` is also using quoted expressions (or value-oriented standard notation).^[One can over-worry about this, but in the end all a
[non-standard evaluation](http://adv-r.had.co.nz/Computing-on-the-language.html) scheme saves
you is a few quote marks (at the cost of transparency, and a lot of down-stream headaches).
Our advice is to compose the expressions using your smart `R`-code
editor of choice and then throw on the additional quote marks after you have the statements
as you want them.]
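To make the intended row-wise semantics concrete, here is a minimal local sketch in base `R` (no `Spark` connection). The names `d_local` and `cond` are illustrative assumptions and not part of `seplyr`, and `runif()` stands in for the `Spark` `rand()` function. The test is evaluated once per row and stored, and each branch is then written as a guarded assignment, which is the pattern a simulated `if(){}else{}` block relies on.

```r
# Local, base-R sketch of a row-wise block if/else (illustrative only).
set.seed(2017)
d_local <- data.frame(rowNum = 1:5,
                      a_1 = "",
                      a_2 = "",
                      stringsAsFactors = FALSE)

# Evaluate the test once per row and keep it, so that both
# branches see the same outcome for a given row.
cond <- runif(nrow(d_local)) >= 0.5

# "then" block: only rows where cond is TRUE are changed.
d_local$a_1 <- ifelse(cond, "treatment", d_local$a_1)
d_local$a_2 <- ifelse(cond, "control", d_local$a_2)

# "else" block: only rows where cond is FALSE are changed.
d_local$a_1 <- ifelse(!cond, "control", d_local$a_1)
d_local$a_2 <- ifelse(!cond, "treatment", d_local$a_2)

print(d_local)
```

Because both branches read the same stored `cond` value, every row ends up with exactly one `'treatment'` label and one `'control'` label.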
In the end we can examine and execute the mutate plan:
```{r dLex}
print(plan)
d %.>%
mutate_seb(., plan) %.>%
select_se(., grepdf('^ifebtest_.*', ., invert=TRUE)) %.>%
dplyr::collect(.) %.>%
knitr::kable(.)
```
Our larger goal was to perform this same operation on each of the 5 letter groups.
We do this easily as follows:^[A better overall design would be to use
[`cdata::rowrecs_to_blocks_q()`](https://winvector.github.io/cdata/reference/rowrecs_to_blocks_q.html),
then perform a single bulk operation on rows, and then pivot/transpose back
with [`cdata::blocks_to_rowrecs_q()`](https://winvector.github.io/cdata/reference/blocks_to_rowrecs_q.html).
But let's see how we can simply work with the problem at hand.]
```{r dLong}
plan <- lapply(c('a', 'b', 'c', 'd', 'e'),
               function(gi) {
                 if_else_device(
                   "rand()>=0.5",
                   thenexprs = c(
                     paste0(gi, "_1") := "'treatment'",
                     paste0(gi, "_2") := "'control'"),
                   elseexprs = c(
                     paste0(gi, "_1") := "'control'",
                     paste0(gi, "_2") := "'treatment'"))
               }) %.>%
  unlist(.) %.>%
  partition_mutate_se(.)
d %.>%
  mutate_seb(., plan) %.>%
  select_se(., grepdf('^ifebtest_.*', ., invert=TRUE)) %.>%
  dplyr::collect(.) %.>%
  knitr::kable(.)
```
Please keep in mind: we are using a very simple and regular sequence only
for purposes of illustration. The intent is to show the types of issues
one runs into when standing-up non-trivial applications in `Sparklyr`.
The purpose of [`seplyr::partition_mutate_qt()`](https://github.com/WinVector/seplyr)
is to re-arrange statements and break them into blocks of non-dependent statements (no
statement in a block depends on any other in the same block, and all value dependencies
are respected by the block order). `seplyr::partition_mutate_qt()` is further designed to do this
in a performant manner.^[That is, to pick a small number of blocks; in our case the plan consisted of
`r length(plan)` blocks. The simple method of introducing a block boundary at each first use
of a derived value (without statement re-ordering) would create a much larger set of blocks
(which causes problems of its own). In particular, the code and comments
of the [upcoming `dplyr` fix](https://github.com/tidyverse/dbplyr/commit/36a44cd4b5f70bc06fb004e7787157165766d091) give
the impression of an undesirable large-number-of-blocks solution.]
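The partitioning idea can be illustrated with a simplified greedy sketch in base `R`. This is not the `seplyr` implementation: `partition_sketch()` and the `steps` vector are hypothetical names introduced here, and the sketch uses `all.vars()` to find the names each expression reads. Each statement is placed in the earliest block that comes strictly after any block that wrote a value it reads, wrote its target, or read its target.

```r
# Hypothetical helper (not seplyr's algorithm): greedily assign named
# expression strings to blocks so that no statement in a block depends
# on another statement in the same block.
partition_sketch <- function(exprs) {
  targets <- names(exprs)
  last_set <- integer(0)  # named: block of the latest write to each name
  last_use <- integer(0)  # named: block of the latest read of each name
  block_of <- integer(length(exprs))
  lookup <- function(v, keys) {
    hits <- v[intersect(keys, names(v))]
    if (length(hits) == 0) 0L else max(hits)
  }
  for (i in seq_along(exprs)) {
    lhs <- targets[[i]]
    used <- all.vars(parse(text = exprs[[i]]))
    # Go strictly after earlier writes to anything we read or write,
    # and strictly after earlier reads of the name we overwrite.
    b <- 1L + max(lookup(last_set, c(used, lhs)), lookup(last_use, lhs))
    block_of[[i]] <- b
    last_set[[lhs]] <- b
    for (u in used) {
      last_use[[u]] <- max(b, lookup(last_use, u))
    }
  }
  split(exprs, block_of)
}

# The five statements of a single simulated if/else block.
steps <- c(
  cond = "rand() >= 0.5",
  a_1 = "ifelse(cond, 'treatment', a_1)",
  a_2 = "ifelse(cond, 'control', a_2)",
  a_1 = "ifelse(!cond, 'control', a_1)",
  a_2 = "ifelse(!cond, 'treatment', a_2)")

blocks <- partition_sketch(steps)
print(blocks)
```

In this sketch the five statements fall into three blocks: the test, the two "then" assignments, and the two "else" assignments. Statements for further letter groups would join the same three blocks rather than adding new ones, which is the kind of saving that statement re-ordering makes possible.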
Without such partition planning, in the current version of `dplyr` (`r packageVersion("dplyr")`)
the results of `dplyr::mutate()` do not seem to be well-defined when values are created
and re-used in the same `dplyr::mutate()` block. This is not a currently documented limitation, but
it is present:
```{r bad}
ex <- dplyr::mutate(d,
                    condition_tmp = rand()>=0.5,
                    a_1 = ifelse( condition_tmp,
                                  'treatment',
                                  a_1),
                    a_2 = ifelse( condition_tmp,
                                  'control',
                                  a_2),
                    a_1 = ifelse( !( condition_tmp ),
                                  'control',
                                  a_1),
                    a_2 = ifelse( !( condition_tmp ),
                                  'treatment',
                                  a_2))
knitr::kable(dplyr::collect(dplyr::select(ex, a_1, a_2)))
```
Notice above the many `NA` columns, which are errors.^[Note: no mere re-ordering of the statements would give this result.]
```{r sq}
dplyr::show_query(ex)
```
Looking at the query we see that one of the conditional statements is missing (notice only 3 case statements, not 4).^[Likely
the `dplyr` `SQL` generator does not perform a correct live-value analysis and therefore gets fooled into thinking a statement can
safely be eliminated (when it cannot). `seplyr::partition_mutate_qt()` performs a correct live-value calculation and makes sure
`dplyr::mutate()` only sees trivial blocks (blocks where no value depends on any calculation in the same block).]
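As a local illustration of why trivial blocks are safe to hand to a query generator, the sketch below runs a three-block plan in base `R`. The helper `apply_block()`, the list `plan_blocks`, and the column `u` (fixed draws standing in for `rand()`) are hypothetical and not part of `seplyr` or `dplyr`. Each block is evaluated against the table as it stood before the block, much as a single `SQL` `SELECT` reads only its input columns.

```r
# Hypothetical helper: evaluate every expression in a block against the
# table as it was BEFORE the block, then write all results at once.
apply_block <- function(df, block) {
  new_vals <- lapply(block, function(e) {
    eval(parse(text = e), envir = df, enclos = baseenv())
  })
  for (nm in names(block)) {
    df[[nm]] <- new_vals[[nm]]
  }
  df
}

# A hand-written plan: within a block no expression reads a value
# that another expression in the same block writes.
plan_blocks <- list(
  c(cond = "u >= 0.5"),
  c(a_1 = "ifelse(cond, 'treatment', a_1)",
    a_2 = "ifelse(cond, 'control', a_2)"),
  c(a_1 = "ifelse(!cond, 'control', a_1)",
    a_2 = "ifelse(!cond, 'treatment', a_2)"))

d_local <- data.frame(u = c(0.1, 0.7, 0.4, 0.9, 0.5),
                      a_1 = "",
                      a_2 = "",
                      stringsAsFactors = FALSE)

# Apply the blocks in order, one staged step per block.
result <- Reduce(apply_block, plan_blocks, d_local)
print(result[, c("a_1", "a_2")])
```

Since no expression depends on a value created in its own block, evaluating a block all at once or statement by statement gives the same answer, so there is no intermediate value for a translator to drop by mistake.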
Conclusion
----------
Capabilities such as `seplyr::if_else_device()` and `seplyr::partition_mutate_qt()` are
essential for executing non-trivial code at scale in `Sparklyr`. For more on the `if_else_device`
we suggest reading the [function reference example](https://winvector.github.io/seplyr/reference/if_else_device.html), and for a review
of the `partition_mutate` variations we suggest the ["Partitioning Mutate" article](http://winvector.github.io/FluidData/partition_mutate.html).
Links
-----
[Win-Vector LLC](http://www.win-vector.com/) supplies a number of open-source
[`R`](https://www.r-project.org) packages for working effectively with big data.
These include:
* **[wrapr](https://winvector.github.io/wrapr/)**: supplies code re-writing tools that make coding *over* ["non standard evaluation"](http://adv-r.had.co.nz/Computing-on-the-language.html) interfaces (such as `dplyr`) *much* easier.
* **[cdata](https://winvector.github.io/cdata/)**: supplies pivot/un-pivot functionality at big data scale.
* **[rquery](https://github.com/WinVector/rquery)**: (in development) big data scale relational data operators.
* **[seplyr](https://winvector.github.io/seplyr/)**: supplies improved interfaces for many data manipulation tasks.
* **[replyr](https://winvector.github.io/replyr/)**: supplies tools and patches for using `dplyr` on big data.
Partitioning mutate articles:
* **[Partitioning Mutate](http://winvector.github.io/FluidData/partition_mutate.html)**: basic example.
* **[Partitioning Mutate, Example 2](http://winvector.github.io/FluidData/partition_mutate_ex2.html)**: `ifelse` example.
* **[Partitioning Mutate, Example 3](http://winvector.github.io/FluidData/partition_mutate_ex3.html)**: [`rquery`](https://github.com/WinVector/rquery) example.
Topics such as the above are often discussed on the [Win-Vector blog](http://www.win-vector.com/blog/).
```{r cleanup, include=FALSE}
sparklyr::spark_disconnect(sc)
```