ARROW-14817 [R] Implement bindings for `lubridate::tz()` #12357

dragosmg · 2022-02-07T15:34:35Z

This PR should make the following snippets of code equivalent:

df <- tibble::tibble(
  x = as.POSIXct(c("2022-02-07", "2022-03-04"), tz = "Pacific/Marquesas")  
)

df %>%
mutate(timezone = tz(x))
# A tibble: 2 × 2
  x                   timezone         
  <dttm>              <chr>            
1 2022-02-07 00:00:00 Pacific/Marquesas
2 2022-03-04 00:00:00 Pacific/Marquesas

and

dr %>%
  record_batch() %>%
  mutate(timezone = tz(x) %>%
  collect()

…ore tests

github-actions · 2022-02-07T15:35:02Z

https://issues.apache.org/jira/browse/ARROW-14817

github-actions · 2022-02-07T15:35:04Z

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

dragosmg · 2022-02-07T16:26:53Z

lubridate::tz() supports a bunch of types of inputs, for example character:

suppressMessages(library(lubridate))
suppressMessages(library(dplyr))
tibble(
  a = "2022-02-07"
) %>% 
  mutate(timezone = tz(a))
#> # A tibble: 1 × 2
#>   a          timezone
#>   <chr>      <chr>   
#> 1 2022-02-07 UTC

^{Created on 2022-02-07 by the reprex package (v2.0.1)}

Not sure we want to support all of them. Albeit, it defaults to "UTC" when extracting the timezone from a character.

jonkeane

Thanks for this. This is looking good (and nice and simple, which is fantastic!) I have a few questions for you so far

r/R/dplyr-funcs-datetime.R

dragosmg · 2022-02-10T10:01:30Z

To add to the conversation above. I think the right way to go is to extract timezone from the timestamp metadata and not look at each individual value. However, that could result in a situation where a value is NA, but the column is a timestamp which will result in something like:

suppressMessages(library(arrow))
suppressMessages(library(dplyr))
df <- tibble(
  x = as.POSIXct(c("2022-02-07", NA), tz = "Pacific/Marquesas")
)

df %>% 
  record_batch() %>% 
  mutate(timezone_x = tz(x)) %>% 
  collect()
#> A tibble: 2 × 2
#>   x                   timezone_x       
#>   <dttm>              <chr>            
#> 1 2022-02-07 00:00:00 Pacific/Marquesas
#> 2 NA                  Pacific/Marquesas

^{Created on 2022-02-10 by the reprex package (v2.0.1)}

…snapshot()` for the error chunk

jonkeane · 2022-02-10T13:52:46Z

I'm going to answer these two points below together, since they are similar. Both have tradeoffs, but what I do in circumstances like this is to look through our other bindings and see if I can find examples that do similar things.

To add to the conversation above. I think the right way to go is to extract timezone from the timestamp metadata and not look at each individual value. However, that could in a situation where a value is NA

I agree that we should use the column metadata here (which is what it looks like you're doing), but then wrap it in an if_else detection if any elements are NA (one example is how we implemented if_else()(😂)

arrow/r/R/dplyr-funcs-conditional.R

Lines 51 to 61 in af33dd1

    
           if_else_binding <- function(condition, true, false, missing = NULL) { 
        
             if (!is.null(missing)) { 
        
               return(if_else_binding( 
        
                 call_binding("is.na", (condition)), 
        
                 missing, 
        
                 if_else_binding(condition, true, false) 
        
               )) 
        
             } 
        
             build_expr("if_else", condition, true, false) 
        
           }

we also do similar with is.finite() and is.infinite()

arrow/r/R/dplyr-funcs-type.R

Lines 239 to 249 in af33dd1

    
           register_binding("is.finite", function(x) { 
        
             is_fin <- Expression$create("is_finite", x) 
        
             # for compatibility with base::is.finite(), return FALSE for NA_real_ 
        
             is_fin & !call_binding("is.na", is_fin) 
        
           }) 
        
           register_binding("is.infinite", function(x) { 
        
             is_inf <- Expression$create("is_inf", x) 
        
             # for compatibility with base::is.infinite(), return FALSE for NA_real_ 
        
             is_inf & !call_binding("is.na", is_inf) 
        
           })

I can't figure out how to surface an error from tz() as it gets caught by the dplyr query and then the dplyr-equivalent expression still gets evaluated in R.

One example of having an error in a binding is

arrow/r/R/dplyr-funcs-conditional.R

Lines 86 to 88 in af33dd1

    
           if (!call_binding("is.logical", query[[i]])) { 
        
             abort("Left side of each formula in case_when() must be a logical expression") 
        
           }

…NA`, lubridate returns `"UTC"`

dragosmg · 2022-02-10T21:09:57Z

@jonkeane Thanks for your comments. On surfacing errors: for example, I used abort() when working out how to implement format() (https://github.com/dragosmg/arrow/blob/9c99c03d4bdd5b80dea8acf88a3949808662bd32/r/R/dplyr-funcs-type.R#L259), but that error never surfaces as it is intercepted by the mutate() "context" and data is pulled into R.

dragosmg · 2022-02-10T21:15:09Z

Worth noting that the current implementation diverges from lubridate::tz() as indicated in the conversation, namely it does not return "UTC" for strings, numerics or NAs.

suppressMessages(library(lubridate))
tz(NA)
#> [1] "UTC"
tz(2)
#> Warning: tz(): Don't know how to compute timezone for object of class numeric;
#> returning "UTC". This warning will become an error in the next major version of
#> lubridate.
#> [1] "UTC"
tz(2.2)
#> Warning: tz(): Don't know how to compute timezone for object of class numeric;
#> returning "UTC". This warning will become an error in the next major version of
#> lubridate.
#> [1] "UTC"
tz("string")
#> [1] "UTC"

^{Created on 2022-02-10 by the reprex package (v2.0.1)}

dragosmg · 2022-02-10T21:21:45Z

NA will be "contagious" in arrow's tz() binding (as opposed to lubridate::tz() where they aren't):

suppressMessages(library(arrow))
suppressMessages(library(lubridate))
suppressMessages(library(dplyr))

df <- tibble(
  x = as.POSIXct(c("2022-02-07", NA), tz = "Pacific/Marquesas")
)

df %>% 
  record_batch() %>% 
  mutate(timezone_x = tz(x)) %>%
  collect()
#> # A tibble: 2 × 2
#>   x                   timezone_x
#>   <dttm>              <chr>     
#> 1 2022-02-07 00:00:00 UTC       
#> 2 NA                  NA
 
df %>% 
  mutate(timezone_x = tz(x))
#> # A tibble: 2 × 2
#>   x                   timezone_x       
#>   <dttm>              <chr>            
#> 1 2022-02-07 00:00:00 Pacific/Marquesas
#> 2 NA                  Pacific/Marquesas

^{Created on 2022-02-10 by the reprex package (v2.0.1)}

jonkeane · 2022-02-10T22:03:12Z

Worth noting that the current implementation diverges from lubridate::tz() as indicated in the conversation, namely it does not return "UTC" for strings, numerics or NAs.

I think this is totally fine. For at least the numerics, lubridate will soon be doing what we do. For the others (strings, and bare NAs) it doesn't look like they are going to deprecate them, but I don't think we need to also follow that. We should note that (at least in a comment in the code, or in our documentation for the bindings if there's a place to put it).

NA will be "contagious" in arrow's tz() binding (as opposed to lubridate::tz() where they aren't):

Yeah, we have a choice here. I'm actually a bit surprised that lubridate::tz() returns a timezone that's not UTC there given the docs(!). But actually that behavior seems right. A bare NA either shouldn't produce a timezone (or produce a default UTC for backwards compatibility). But if the NA has a timezone attribute, that should be used.

Given that, we probably should take out the is.na() bit in our binding since that will match what lubridate::tz() here. I only suggested adding it if we were aiming to mimic the lubridate::tz(NA) behavior here (which I thought would be UTC, but turns out, it's not!). That has the added benefit that we don't do by-element comparison, which is nice!

…nit test

r/tests/testthat/_snaps/dplyr-funcs-datetime.md

r/tests/testthat/test-dplyr-funcs-datetime.R

jonkeane

This is getting there! A few more comments and suggestions.

r/R/dplyr-funcs-datetime.R

r/tests/testthat/test-dplyr-funcs-datetime.R

jonkeane · 2022-02-15T19:46:54Z

r/tests/testthat/test-dplyr-funcs-datetime.R

+        collect(),
+      df
+    ),
+    error = TRUE


I think what you're looking for here (and below too) is:

Suggested change

error = TRUE

warning = TRUE

That will ensure that the warning we're getting is the "not supported" kind, and then compare the results after that.

arrow/r/tests/testthat/helper-expectation.R

Lines 81 to 85 in 9678566

#' @param warning The expected warning from the RecordBatch and Table comparison

#' paths, passed to `expect_warning()`. Special values:

#' * `NA` (the default) for ensuring no warning message

#' * `TRUE` is a special case to mean to check for the

#' "not supported in Arrow; pulling data into R" message.

error = TRUE is being passed via ... to expect_equal() which also has ... which ends up being silently ignored.

Not sure what to say:

error is an argument to expect_snapshot() and warning is to compare_dplyr_equal()

I like the expect_shapshot(..., error = TRUE) a bit better since with the change we would lose some of the message cover:

OH, sorry I was off one level of the nesting. I might be misunderstanding the output that you're seeing there, but I interpret that as the snapshot is recording that there's no output (i.e. the expectation has passed). Which is what we expect with warning = TRUE

We could use warning = TRUE and compare_dplyr_binding() (and no longer use expect_snapshot()), but we will no longer be testing the error messages (which we worked to improve only because they were visible).

I would say that if we want to check the error messages, we should do that directly with expect_error() or expect_snapshot() wrapped around something like:

call_binding("tz", "this is a string!")

Or we might need to turn that R string into an Arrow Scalar or Array...)

What I was testing was the warning in compare_dplyr_warning() being promoted to error level. It is the message for this error we improved which otherwise would not have been visible.

I don't think something like call_binding("tz", ...) works because whatever we pass via the ... needs to be an Expression. Which I think is the reason why arrow::tz() only works in a dplyr context, i.e. it doesn't work on standalone Arrays. That was why I was testing errors with dplyr pipelines.

…g else)

r/R/dplyr-funcs-datetime.R

…ore tests

…snapshot()` for the error chunk

…NA`, lubridate returns `"UTC"`

…nit test

…g else)

jonkeane · 2022-02-19T14:42:54Z

This is very very very close! I rebased off of the default branch since #12447 merged to get that fix.

I also updated the tests to mostly be the more simple R object in the call_binding but kept one that uses Expression$create(...) since Expressions are the type that will show up in the dplyr query.

I did notice when testing locally (and I assume this'll pop up in the CI too), I think there might be a slight problem with #12447. It looks like type({Expression object}) is returning the function for getting the type and not the Type object itself (hence everyone's favorite error message: object of type closure is not subsetable)

I think that

arrow/r/R/type.R

Line 81 in f9f2c08

type.Expression <- function(x) x$type

should actually be type.Expression <- function(x) x$type()

It might be nice to add a test that asserts not just that type(x) and x$type are the same, but that type(x) produces the right Type object. So like this line:

arrow/r/tests/testthat/test-type.R

Line 238 in f9f2c08

expect_equal(x$type, type(x))

could also be tested with something like expect_equal(type(x), int32()) which will make sure that what's coming out of type(x) there is definitely a type.

It's a bit muddying this PR, but I'm happy for you to make those small adjustments here rather than a whole new Jira + commit chain + another wait to rebase after that's fixed.

Other than that, this PR looks ready to go!

into lubridate_tz_binding

… unit tests

thisisnic

With the update to type.Expression, looks like this is all done now.

jonkeane

Thanks for this!

ursabot · 2022-02-22T14:01:21Z

Benchmark runs are scheduled for baseline = 5216c2b and contender = 0eaafe8. 0eaafe8 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️7.49% ⬆️5.58%] test-mac-arm
[Failed ⬇️1.81% ⬆️3.25%] ursa-i9-9960x
[Finished ⬇️0.52% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

first pass at implementing the binding for lubridate::tz(). needs m…

d6b66d2

…ore tests

github-actions bot added the Component: R label Feb 7, 2022

dragosmg added 2 commits February 8, 2022 10:22

tz() returns "UTC" for strings, date and NAs + more tests

ebc1443

updated NEWS

104c779

dragosmg marked this pull request as ready for review February 8, 2022 10:33

jonkeane reviewed Feb 8, 2022

View reviewed changes

r/R/dplyr-funcs-datetime.R Outdated Show resolved Hide resolved

r/R/dplyr-funcs-datetime.R Show resolved Hide resolved

dragosmg added 2 commits February 10, 2022 10:41

fall back to the simpler implementation for tz()

9710ffd

split good and error behaviour in 2 different chunks and use `expect_…

38e38e0

…snapshot()` for the error chunk

dragosmg added 2 commits February 10, 2022 21:02

update tz() binding to return NA when x is NA

d88e4c9

update tests to reflect different behaviour for NA: arrow returns `…

ff23580

…NA`, lubridate returns `"UTC"`

revert to first / original implementation and remove NA propagation u…

6014efa

…nit test

jonkeane reviewed Feb 10, 2022

View reviewed changes

r/tests/testthat/_snaps/dplyr-funcs-datetime.md Outdated Show resolved Hide resolved

jonkeane reviewed Feb 10, 2022

View reviewed changes

r/tests/testthat/test-dplyr-funcs-datetime.R Outdated Show resolved Hide resolved

dragosmg added 2 commits February 15, 2022 10:17

updated unit tests and snapshots for tz()

baaed40

update tz() with more meaningful error messages

badcdb6

dragosmg requested a review from jonkeane February 15, 2022 10:19

dragosmg changed the title ~~ARROW-14817 [R] Implement bindings for lubridate::tz~~ ARROW-14817 [R] Implement bindings for lubridate::tz() Feb 15, 2022

jonkeane requested changes Feb 15, 2022

View reviewed changes

dragosmg and others added 3 commits February 15, 2022 20:43

meaningful names for the test dataframe columns + update snapshots

f03ab7d

simplify the implementation (start with timestamp & error for anythin…

022070a

…g else)

some clean up

d47fd4d

jonkeane reviewed Feb 19, 2022

View reviewed changes

r/R/dplyr-funcs-datetime.R Outdated Show resolved Hide resolved

dragosmg and others added 17 commits February 19, 2022 08:22

first pass at implementing the binding for lubridate::tz(). needs m…

d1d5946

…ore tests

tz() returns "UTC" for strings, date and NAs + more tests

91c5e66

updated NEWS

46080f6

fall back to the simpler implementation for tz()

0233ec5

split good and error behaviour in 2 different chunks and use `expect_…

de7f992

…snapshot()` for the error chunk

update tz() binding to return NA when x is NA

07c2a65

update tests to reflect different behaviour for NA: arrow returns `…

0936441

…NA`, lubridate returns `"UTC"`

revert to first / original implementation and remove NA propagation u…

de648f4

…nit test

updated unit tests and snapshots for tz()

165031c

update tz() with more meaningful error messages

30e2cea

meaningful names for the test dataframe columns + update snapshots

384fddc

simplify the implementation (start with timestamp & error for anythin…

af96552

…g else)

some clean up

3cee34a

removed snapshot md file

fa75502

class extraction

4bf48f6

simplify class extraction once apache#12447 is merged

554345b

a few test additions|changes

3bff651

jonkeane force-pushed the lubridate_tz_binding branch from 792f4ae to 3bff651 Compare February 19, 2022 14:32

dragosmg added 2 commits February 21, 2022 16:08

Merge branch 'lubridate_tz_binding' of https://github.com/dragosmg/arrow

ea3a395

into lubridate_tz_binding

update the type method for Expression to return a function + more…

85c079c

… unit tests

dragosmg marked this pull request as ready for review February 21, 2022 16:22

dragosmg requested a review from jonkeane February 21, 2022 17:45

thisisnic approved these changes Feb 22, 2022

View reviewed changes

jonkeane approved these changes Feb 22, 2022

View reviewed changes

jonkeane closed this in 0eaafe8 Feb 22, 2022

dragosmg deleted the lubridate_tz_binding branch February 24, 2022 09:24

asfimport mentioned this pull request Feb 22, 2022

[R] Implement bindings for lubridate::tz #18954

Closed

	#' @param warning The expected warning from the RecordBatch and Table comparison
	#' paths, passed to `expect_warning()`. Special values:
	#' * `NA` (the default) for ensuring no warning message
	#' * `TRUE` is a special case to mean to check for the
	#' "not supported in Arrow; pulling data into R" message.

ARROW-14817 [R] Implement bindings for lubridate::tz() #12357

ARROW-14817 [R] Implement bindings for lubridate::tz() #12357

Uh oh!

Conversation

dragosmg commented Feb 7, 2022

Uh oh!

github-actions bot commented Feb 7, 2022

Uh oh!

github-actions bot commented Feb 7, 2022

Uh oh!

dragosmg commented Feb 7, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jonkeane left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

dragosmg commented Feb 10, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jonkeane commented Feb 10, 2022

Uh oh!

dragosmg commented Feb 10, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dragosmg commented Feb 10, 2022

Uh oh!

dragosmg commented Feb 10, 2022

Uh oh!

jonkeane commented Feb 10, 2022

Uh oh!

Uh oh!

Uh oh!

jonkeane left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jonkeane Feb 15, 2022

Choose a reason for hiding this comment

Uh oh!

dragosmg Feb 15, 2022

Choose a reason for hiding this comment

Uh oh!

jonkeane Feb 15, 2022

Choose a reason for hiding this comment

Uh oh!

dragosmg Feb 15, 2022

Choose a reason for hiding this comment

Uh oh!

jonkeane Feb 15, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dragosmg Feb 15, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dragosmg Feb 16, 2022

Choose a reason for hiding this comment

Uh oh!

Uh oh!

jonkeane commented Feb 19, 2022

Uh oh!

thisisnic left a comment

Choose a reason for hiding this comment

Uh oh!

jonkeane left a comment

Choose a reason for hiding this comment

Uh oh!

ursabot commented Feb 22, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

ARROW-14817 [R] Implement bindings for `lubridate::tz()` #12357

ARROW-14817 [R] Implement bindings for `lubridate::tz()` #12357

dragosmg commented Feb 7, 2022 •

edited

Loading

dragosmg commented Feb 10, 2022 •

edited

Loading

dragosmg commented Feb 10, 2022 •

edited

Loading

jonkeane Feb 15, 2022 •

edited

Loading

dragosmg Feb 15, 2022 •

edited

Loading

ursabot commented Feb 22, 2022 •

edited

Loading