Skip to content

Improve performance of case_when() and case_match()#6711

Merged
DavisVaughan merged 8 commits intotidyverse:mainfrom
DavisVaughan:feature/positional-names-case-when
Feb 13, 2023
Merged

Improve performance of case_when() and case_match()#6711
DavisVaughan merged 8 commits intotidyverse:mainfrom
DavisVaughan:feature/positional-names-case-when

Conversation

@DavisVaughan
Copy link
Copy Markdown
Member

@DavisVaughan DavisVaughan commented Feb 10, 2023

Closes #6674

Three improvements to make these decently fast again:

  • Minor: Calls to current_env() and caller_env() are now lazy (only evaluated when needed) (needs Can as_function() avoid materializing call ? r-lib/rlang#1558)

  • Major: We no longer call as_label() to turn the expressions into labels which were being used as informative names that showed up in error messages. This stinks, but as_label() is just too slow to be used eagerly like this. We can revisit this again in the future if we add a lazy ALTREP character vector which could materialize the as_label() calls on demand (i.e. at error time). In the meantime, we now use positional names that also include LHS/RHS information, like ..2 (RHS), which ends up being decent.

  • Major: I was previously wrapping the evaluation of the lhs/rhs of the formulas in with_case_errors() to give an informative indexed error about which formula failed to evaluate. This got slow as you increased the number of expressions in case_when() (2 * N expressions = number of with_case_errors() calls) because it uses try_fetch(), which is built on tryCatch(), which is in R not C (probably a good Luke project!). I've changed it up so 1 call to with_case_errors() now wraps the entire evaluation loop, and we track some index information to use in the error if one occurs.

Sadly the private option added in rlang (r-lib/rlang@33db700) wasn't enough to make as_label() fast enough to be used eagerly like this. But the infix_overflows() call there is a big part of the performance degradation we see in my benchmarks below.

When I work on #6678 that will end up also knocking off another 50us or so from switching away from vec_assert() in vec_case_when() and if_else().

After that, any other improvements will come from moving vec_case_when() to C in vctrs.

Right now we are in the 2-4x slower range. After #6678 I expect to be 2x slower most of the time. I think we can do a release with that. Then the C implementation in vctrs should fix the rest.

library(dplyr)

# Size 1
x <- 1L

bench::mark(
  one = case_when(x == 1L ~ TRUE), 
  two = case_when(x == 1L ~ TRUE, x == 2L ~ NA),
  three = case_when(x == 1L ~ TRUE, x == 2L ~ NA, TRUE ~ FALSE),
  iterations = 50000,
  check = FALSE
)

# dplyr 1.0.10
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 one          56.8µs   74.7µs    12802.     168KB     18.2
#> 2 two          75.5µs   95.4µs    10514.        0B     22.5
#> 3 three        97.4µs  120.2µs     8390.        0B     23.9

# This PR
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 one           209µs    278µs     3388.   503.8KB     11.3
#> 2 two           242µs    326µs     2879.     1.3KB     12.2
#> 3 three         272µs    350µs     2787.     1.3KB     14.3

# dplyr 1.1.0 (smacked by the infix-overflow `as_label()` performance issue)
# (note that i had to go down to only 5,000 iterations here)
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 one          1.67ms   2.12ms      477.    1.01MB     16.6
#> 2 two          3.14ms   3.86ms      261.    3.36KB     18.2
#> 3 three        3.27ms   4.32ms      228.    3.36KB     16.5
library(dplyr)

# Size 1000
x <- 1:1000

bench::mark(
  one = case_when(x < 200 ~ TRUE), 
  two = case_when(x < 200 ~ TRUE, x < 500 ~ NA),
  three = case_when(x < 200 ~ TRUE, x < 500 ~ NA, TRUE ~ FALSE),
  iterations = 50000,
  check = FALSE
)

# dplyr 1.0.10
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 one          73.2µs   92.2µs    10873.    60.1KB     17.0
#> 2 two         108.6µs  136.9µs     7383.   100.9KB     17.3
#> 3 three       136.1µs  171.2µs     5914.   130.6KB     18.3

# This PR
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 one           222µs    271µs     3720.    44.1KB     14.4
#> 2 two           279µs    334µs     3011.    60.1KB     15.4
#> 3 three         318µs    382µs     2649.      70KB     16.5

# dplyr 1.1.0
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 one          2.03ms   2.39ms      400.      45KB     15.0
#> 2 two          3.98ms   4.45ms      222.    62.2KB     16.6
#> 3 three         4.1ms   4.64ms      211.      72KB     16.6

@DavisVaughan DavisVaughan marked this pull request as ready for review February 10, 2023 17:54
@DavisVaughan DavisVaughan requested a review from lionel- February 10, 2023 17:58
Comment thread R/case-when.R
Comment on lines -174 to -177
.size <- vec_size_common(!!!conditions, !!!values, .size = .size, .call = error_call)
.size <- vec_size_common(!!!conditions, !!!values, .size = .size)

conditions <- vec_recycle_common(!!!conditions, .size = .size, .call = error_call)
values <- vec_recycle_common(!!!values, .size = .size, .call = error_call)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These were superfluous

Comment thread R/case-when.R
dots_env = dots_env,
error_call = error_call
)
env_error_info <- new_environment()
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An environment to track information used by with_case_errors()

Comment thread tests/testthat/_snaps/case-when.md Outdated
Comment on lines +76 to +83
! Can't recycle `..1 (LHS)` (size 2) to match `..1 (RHS)` (size 3).
Code
(expect_error(case_when(c(TRUE, FALSE) ~ 1, c(FALSE, TRUE, FALSE) ~ 2, c(FALSE,
TRUE, FALSE, NA) ~ 3)))
Output
<error/vctrs_error_incompatible_size>
Error in `case_when()`:
! Can't recycle `c(TRUE, FALSE)` (size 2) to match `c(FALSE, TRUE, FALSE)` (size 3).
! Can't recycle `..1 (LHS)` (size 2) to match `..2 (LHS)` (size 3).
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Particularly important to show LHS vs RHS in these errors, otherwise it can be very confusing

Comment thread NEWS.md Outdated
Comment thread R/case-match.R
default_env = default_env,
dots_env = dots_env,
error_call = error_call
default_env = caller_env(),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's interesting that this is expensive enough to care about.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is super minor honestly, but we've worked hard in vctrs to make env evaluation lazy and it seems a waste to throw that away

Comment thread R/case-when.R Outdated
}

with_case_errors <- function(expr, side, i, error_call) {
with_case_errors <- function(expr, error_call, env_error_info) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you taking advantage of try_fetch() features here? Is it worth changing to a bare withCallingHandlers()?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, withCallingHandlers() is as fast as it gets. Probably still better to wrap from the outside though.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it still worth having this as a separate function? It's kind of hard to follow now that the lexical envs are intertwined.

Comment thread R/case-when.R Outdated
Comment on lines +254 to +255
names(lhs) <- paste0("..", seq_args, " (LHS)")
names(rhs) <- paste0("..", seq_args, " (RHS)")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that people don't seem to general understand LHS/RHS might be better to just use left/right.

Comment thread R/case-when.R Outdated
}

with_case_errors <- function(expr, side, i, error_call) {
with_case_errors <- function(expr, error_call, env_error_info) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, withCallingHandlers() is as fast as it gets. Probably still better to wrap from the outside though.

Comment thread R/case-when.R Outdated
}

with_case_errors <- function(expr, side, i, error_call) {
with_case_errors <- function(expr, error_call, env_error_info) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it still worth having this as a separate function? It's kind of hard to follow now that the lexical envs are intertwined.

Rather than wrapping `eval_tidy()`, we now wrap the whole for loop, assuming that the only errors that can happen in the loop are related to expression evaluation. This avoids repeated calls to `try_fetch()`, which can be expensive due to `tryCatch()` being implemented in R rather than C.

This is particularly important when there are many conditions, as the repeated `try_fetch()` calls have a noticeable performance impact
@DavisVaughan DavisVaughan force-pushed the feature/positional-names-case-when branch from 6b92b11 to 1d18013 Compare February 13, 2023 20:09
@DavisVaughan DavisVaughan merged commit 688bcde into tidyverse:main Feb 13, 2023
@DavisVaughan DavisVaughan deleted the feature/positional-names-case-when branch February 13, 2023 20:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

v1.1.0 runtime for case_when with grouping variable is slow

3 participants