WinVector.github.io/FluidData/partition_mutate_ex2.Rmd at master · WinVector/WinVector.github.io

History

230 lines (190 loc) · 10.5 KB

Raw

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

148

149

150

151

152

153

154

155

156

157

158

159

160

161

162

163

164

165

166

167

168

169

170

171

172

173

174

175

176

177

178

179

180

181

182

183

184

185

186

187

188

189

190

191

192

193

194

195

196

197

198

199

200

201

202

203

204

205

206

207

208

209

210

211

212

213

214

215

216

217

218

219

220

221

222

223

224

225

226

227

228

229

230

---

title: "Partitioning Mutate, Example 2"

author: "John Mount, Win-Vector LLC"

date: "2017-11-24"

output:

tufte::tufte_html: default

tufte::tufte_handout:

citation_package: natbib

latex_engine: xelatex

tufte::tufte_book:

citation_package: natbib

latex_engine: xelatex

#bibliography: skeleton.bib

link-citations: yes

---

```{r setupa, include=FALSE}

library("tufte")

# invalidate cache when the tufte version changes

knitr::opts_chunk$set(tidy = FALSE, cache.extra = packageVersion('tufte'))

options(htmltools.dir.version = FALSE)

```

[`Sparklyr`](https://spark.rstudio.com), with its [`dplyr`](https://CRAN.R-project.org/package=dplyr) translations allows [`R`](https://www.r-project.org), to perform the heavy lifting that has traditionally been the exclusive domain of proprietary systems such as `SAS`.

In general, `dplyr` is good at handling intermediate variables in the mutate function so users don't need to think about it.

However, some of that breaks down when the processing is done on the `Apache Spark` side. [Win-Vector LLC](http://win-vector.com)

developed the [`seplyr`](https://winvector.github.io/seplyr/) package to use with

consulting clients to mitigate some of these situations.^[And we distribute the package as open-source to give back to the `R` community.]

In this article we will demonstrate we `seplyr` functions: `if_else_device()` and `partition_mutate_qt()`.

This is a follow-on example building on our ["Partitioning Mutate" article](http://winvector.github.io/FluidData/partition_mutate.html),

showing a larger block sequence based on swaps.^[The source code for this article can be found [here](https://github.com/WinVector/WinVector.github.io/blob/master/FluidData/partition_mutate_ex2.Rmd).] For more motivation and context please see

[the first article](http://winvector.github.io/FluidData/partition_mutate.html).

Please consider the following example data (on a remote `Spark` cluster).

```{r sed, include=FALSE}

library("seplyr")

sc <-

sparklyr::spark_connect(version = '2.2.0',

master = "local")

dL <- data.frame(rowNum = 1:5,

a_1 = "",

a_2 = "",

b_1 = "",

b_2 = "",

c_1 = "",

c_2 = "",

d_1 = "",

d_2 = "",

e_1 = "",

e_2 = "",

stringsAsFactors = FALSE)

d <- dplyr::copy_to(sc, dL,

name = 'd',

overwrite = TRUE,

temporary = TRUE)

```

```{r p1}

class(d)

d %.>%

# avoid https://github.com/tidyverse/dplyr/issues/3216

dplyr::collect(.) %.>%

knitr::kable(.)

```

We find in non-trivial projects it is often necessary to simulate block-`if(){}else{}`

structures in `dplyr` pipelines.

For our example: suppose we wish to assign columns in a complementary

to treatment and control design^[Abraham Wald designed some sequential analysis procedures

in this way as Nina Zumel [remarked](https://github.com/WinVector/ODSCWest2017/tree/master/MythsOfDataScience). Another string example is conditionals where you are trying to vary on a per-row basis which column is assigned to, instead of varying what value is assigned from.]

To write such a procedure in pure `dplyr` we might simulate block with code such

as the following^[Only showing work on the `a` group right now. We are assuming

we want to perform this task on all the grouped letter columns.]

```{r dL}

library("seplyr")

packageVersion("seplyr")

plan <- if_else_device(

testexpr =

"rand()>=0.5",

thenexprs = c(

"a_1" := "'treatment'",

"a_2" := "'control'"),

elseexprs = c(

"a_1" := "'control'",

"a_2" := "'treatment'")) %.>%

partition_mutate_se(.)

```

We are using the indent notation to indicate the code-blocks we are simulating

with row-wise `if(){}else{}` blocks.^[For more on this concept, please see: [the `if_else_device` reference](https://winvector.github.io/seplyr/reference/if_else_device.html).] The `if_else_device` is also using quoted expressions (or value-oriented standard notation).^[One can over-worry about this, but in the end all a

[non-standard evaluation](http://adv-r.had.co.nz/Computing-on-the-language.html) scheme saves

you is a few quote marks (at the cost of transparency, and a lot of down-stream headaches).

Our advice is to compose the expressions using your smart `R`-code

editor of choice and then throw on the additional quote marks after you have the statements

as you want them.]

In the end we can examine and execute the mutate plan:

```{r dLex}

print(plan)

d %.>%

mutate_seb(., plan) %.>%

select_se(., grepdf('^ifebtest_.*', ., invert=TRUE)) %.>%

dplyr::collect(.) %.>%

knitr::kable(.)

```

Our larger goal was to perform this same operation on each of the 5 letter groups.

We do this easily as follows:^[A better overall design would be to use

[`cdata::rowrecs_to_blocks_q()`](https://winvector.github.io/cdata/reference/rowrecs_to_blocks_q.html),

then perform a single bulk operation on rows, and then pivot/transpose back

with [`cdata::blocks_to_rowrecs_q()`](https://winvector.github.io/cdata/reference/blocks_to_rowrecs_q.html).

But let's see how we simply work with a problem at hand.]

```{r dLong}

plan <- lapply(c('a', 'b', 'c', 'd', 'e'),

function(gi) {

if_else_device(

"rand()>=0.5",

thenexprs = c(

paste0(gi, "_1") := "'treatment'",

paste0(gi, "_2") := "'control'"),

elseexprs = c(

paste0(gi, "_1") := "'control'",

paste0(gi, "_2") := "'treatment'"))

}) %.>%

unlist(.) %.>%

partition_mutate_se(.)

d %.>%

mutate_seb(., plan) %.>%

select_se(., grepdf('^ifebtest_.*', ., invert=TRUE)) %.>%

dplyr::collect(.) %.>%

knitr::kable(.)

```

Please keep in mind: we are using a very simple and regular sequence only

for purposes of illustration. The intent is to show the types of issues

one runs into when standing-up non-trivial applications in `Sparklyr`.

The purpose of [`seplyr::partition_mutate_qt()`](https://github.com/WinVector/seplyr)

is to re-arrange statements and break them into blocks of non-dependent statements (no

statement in a block depends on any other in the same block, and all value dependencies

are respected by the block order). `seplyr::partition_mutate_qt()` if further defined to do this

in a performant manner.^[That is to pick a small number of blocks, in our case the plan consisted of

`r length(plan)` blocks. The simple method of introducing a block boundary at each first use

of derived value (without statement re-ordering) would create a very much larger set of blocks

(which cause problems of their own). In particular the impression code and comments

of [upcoming `dplyr` fix](https://github.com/tidyverse/dbplyr/commit/36a44cd4b5f70bc06fb004e7787157165766d091) appear

to indicate an undesirable large number of blocks solution.]

Without such partition planning the current version of `dplyr` (`r packageVersion("dplyr")`)

the results of `dplyr::mutate()` do not seem to be well-defined when values are created

and re-used in the same `dplyr::mutate()` block. This is not a currently documented limitation, but

it is present:

```{r bad}

ex <- dplyr::mutate(d,

condition_tmp = rand()>=0.5,

a_1 = ifelse( condition_tmp,

'treatment',

a_1),

a_2 = ifelse( condition_tmp,

'control',

a_2),

a_1 = ifelse( !( condition_tmp ),

'control',

a_1),

a_2 = ifelse( !( condition_tmp ),

'treatment',

a_2))

knitr::kable(dplyr::collect(dplyr::select(ex, a_1, a_2)))

```

Notice above the many `NA` columns, which are errors.^[Note: no mere re-ordering of the statements would give this result.]

```{r sq}

dplyr::show_query(ex)

```

Looking at the query we see that one of the conditional statements is missing (notice only 3 case statements, not 4):^[Likely

the `dplyr` `SQL` generator does not perform a correct live-value analysis and therefor gets fooled into thinking a statement can

safely be eliminated (when it can not). `seplyr::partition_mutate_qt()` performs a correct live value calculation and make sure

`dplyr::mutate()` is only seeing trivial blocks (blocks where no value depends on any calculation in the same block).]

Conclusion

----------

`seplyr::if_else_device()` and `seplyr::partition_mutate_qt()` type capability is

essential for executing non-trivial code at scale in `Sparklyr`. For more on the `if_else_device`

we suggest reading up on the [function reference example](https://winvector.github.io/seplyr/reference/if_else_device.html), and for a review

on the `partition_mutate` variations we suggest the ["Partitioning Mutate" article](http://winvector.github.io/FluidData/partition_mutate.html).

Links

-----

[Win-Vector LLC](http://www.win-vector.com/) supplies a number of open-source

[`R`](https://www.r-project.org) packages for working effectively with big data.

These include:

* **[wrapr](https://winvector.github.io/wrapr/)**: supplies code re-writing tools that make coding *over* ["non standard evaluation"](http://adv-r.had.co.nz/Computing-on-the-language.html) interfaces (such as `dplyr`) *much* easier.

* **[cdata](https://winvector.github.io/cdata/)**: supplies pivot/un-pivot functionality at big data scale.

* **[rquery](https://github.com/WinVector/rquery)**: (in development) big data scale relational data operators.

* **[seplyr](https://winvector.github.io/seplyr/)**: supplies improved interfaces for many data manipulation tasks.

* **[replyr](https://winvector.github.io/replyr/)**: supplies tools and patches for using `dplyr` on big data.

Partitioning mutate articles:

* **[Partitioning Mutate](http://winvector.github.io/FluidData/partition_mutate.html)**: basic example.

* **[Partitioning Mutate, Example 2](http://winvector.github.io/FluidData/partition_mutate_ex2.html)**: `ifelse` example.

* **[Partitioning Mutate, Example 3](http://winvector.github.io/FluidData/partition_mutate_ex3.html)** [`rquery`](https://github.com/WinVector/rquery) example.

Topics such as the above are often discussed on the [Win-Vector blog](http://www.win-vector.com/blog/).

```{r cleanup, include=FALSE}

sparklyr::spark_disconnect(sc)

```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FilesExpand file tree

partition_mutate_ex2.Rmd

Latest commit

History

partition_mutate_ex2.Rmd

File metadata and controls