Skip to content

Conversation

@dragosmg
Copy link
Contributor

@dragosmg dragosmg commented Feb 7, 2022

This PR should make the following snippets of code equivalent:

df <- tibble::tibble(
  x = as.POSIXct(c("2022-02-07", "2022-03-04"), tz = "Pacific/Marquesas")  
)

df %>%
mutate(timezone = tz(x))
# A tibble: 2 × 2
  x                   timezone         
  <dttm>              <chr>            
1 2022-02-07 00:00:00 Pacific/Marquesas
2 2022-03-04 00:00:00 Pacific/Marquesas

and

dr %>%
  record_batch() %>%
  mutate(timezone = tz(x) %>%
  collect()

@github-actions
Copy link

github-actions bot commented Feb 7, 2022

@github-actions
Copy link

github-actions bot commented Feb 7, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@dragosmg
Copy link
Contributor Author

dragosmg commented Feb 7, 2022

lubridate::tz() supports a bunch of types of inputs, for example character:

suppressMessages(library(lubridate))
suppressMessages(library(dplyr))
tibble(
  a = "2022-02-07"
) %>% 
  mutate(timezone = tz(a))
#> # A tibble: 1 × 2
#>   a          timezone
#>   <chr>      <chr>   
#> 1 2022-02-07 UTC

Created on 2022-02-07 by the reprex package (v2.0.1)

Not sure we want to support all of them. Albeit, it defaults to "UTC" when extracting the timezone from a character.

@dragosmg dragosmg marked this pull request as ready for review February 8, 2022 10:33
Copy link
Member

@jonkeane jonkeane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this. This is looking good (and nice and simple, which is fantastic!) I have a few questions for you so far

@dragosmg
Copy link
Contributor Author

dragosmg commented Feb 10, 2022

To add to the conversation above. I think the right way to go is to extract timezone from the timestamp metadata and not look at each individual value. However, that could result in a situation where a value is NA, but the column is a timestamp which will result in something like:

suppressMessages(library(arrow))
suppressMessages(library(dplyr))
df <- tibble(
  x = as.POSIXct(c("2022-02-07", NA), tz = "Pacific/Marquesas")
)

df %>% 
  record_batch() %>% 
  mutate(timezone_x = tz(x)) %>% 
  collect()
#> A tibble: 2 × 2
#>   x                   timezone_x       
#>   <dttm>              <chr>            
#> 1 2022-02-07 00:00:00 Pacific/Marquesas
#> 2 NA                  Pacific/Marquesas

Created on 2022-02-10 by the reprex package (v2.0.1)

@jonkeane
Copy link
Member

I'm going to answer these two points below together, since they are similar. Both have tradeoffs, but what I do in circumstances like this is to look through our other bindings and see if I can find examples that do similar things.

To add to the conversation above. I think the right way to go is to extract timezone from the timestamp metadata and not look at each individual value. However, that could in a situation where a value is NA

I agree that we should use the column metadata here (which is what it looks like you're doing), but then wrap it in an if_else detection if any elements are NA (one example is how we implemented if_else()(😂)

if_else_binding <- function(condition, true, false, missing = NULL) {
if (!is.null(missing)) {
return(if_else_binding(
call_binding("is.na", (condition)),
missing,
if_else_binding(condition, true, false)
))
}
build_expr("if_else", condition, true, false)
}
we also do similar with is.finite() and is.infinite()
register_binding("is.finite", function(x) {
is_fin <- Expression$create("is_finite", x)
# for compatibility with base::is.finite(), return FALSE for NA_real_
is_fin & !call_binding("is.na", is_fin)
})
register_binding("is.infinite", function(x) {
is_inf <- Expression$create("is_inf", x)
# for compatibility with base::is.infinite(), return FALSE for NA_real_
is_inf & !call_binding("is.na", is_inf)
})

I can't figure out how to surface an error from tz() as it gets caught by the dplyr query and then the dplyr-equivalent expression still gets evaluated in R.

One example of having an error in a binding is

if (!call_binding("is.logical", query[[i]])) {
abort("Left side of each formula in case_when() must be a logical expression")
}

@dragosmg
Copy link
Contributor Author

dragosmg commented Feb 10, 2022

@jonkeane Thanks for your comments. On surfacing errors: for example, I used abort() when working out how to implement format() (https://github.com/dragosmg/arrow/blob/9c99c03d4bdd5b80dea8acf88a3949808662bd32/r/R/dplyr-funcs-type.R#L259), but that error never surfaces as it is intercepted by the mutate() "context" and data is pulled into R.

@dragosmg
Copy link
Contributor Author

Worth noting that the current implementation diverges from lubridate::tz() as indicated in the conversation, namely it does not return "UTC" for strings, numerics or NAs.

suppressMessages(library(lubridate))
tz(NA)
#> [1] "UTC"
tz(2)
#> Warning: tz(): Don't know how to compute timezone for object of class numeric;
#> returning "UTC". This warning will become an error in the next major version of
#> lubridate.
#> [1] "UTC"
tz(2.2)
#> Warning: tz(): Don't know how to compute timezone for object of class numeric;
#> returning "UTC". This warning will become an error in the next major version of
#> lubridate.
#> [1] "UTC"
tz("string")
#> [1] "UTC"

Created on 2022-02-10 by the reprex package (v2.0.1)

@dragosmg
Copy link
Contributor Author

NA will be "contagious" in arrow's tz() binding (as opposed to lubridate::tz() where they aren't):

suppressMessages(library(arrow))
suppressMessages(library(lubridate))
suppressMessages(library(dplyr))

df <- tibble(
  x = as.POSIXct(c("2022-02-07", NA), tz = "Pacific/Marquesas")
)

df %>% 
  record_batch() %>% 
  mutate(timezone_x = tz(x)) %>%
  collect()
#> # A tibble: 2 × 2
#>   x                   timezone_x
#>   <dttm>              <chr>     
#> 1 2022-02-07 00:00:00 UTC       
#> 2 NA                  NA
 
df %>% 
  mutate(timezone_x = tz(x))
#> # A tibble: 2 × 2
#>   x                   timezone_x       
#>   <dttm>              <chr>            
#> 1 2022-02-07 00:00:00 Pacific/Marquesas
#> 2 NA                  Pacific/Marquesas


Created on 2022-02-10 by the reprex package (v2.0.1)

@jonkeane
Copy link
Member

Worth noting that the current implementation diverges from lubridate::tz() as indicated in the conversation, namely it does not return "UTC" for strings, numerics or NAs.

I think this is totally fine. For at least the numerics, lubridate will soon be doing what we do. For the others (strings, and bare NAs) it doesn't look like they are going to deprecate them, but I don't think we need to also follow that. We should note that (at least in a comment in the code, or in our documentation for the bindings if there's a place to put it).

NA will be "contagious" in arrow's tz() binding (as opposed to lubridate::tz() where they aren't):

Yeah, we have a choice here. I'm actually a bit surprised that lubridate::tz() returns a timezone that's not UTC there given the docs(!). But actually that behavior seems right. A bare NA either shouldn't produce a timezone (or produce a default UTC for backwards compatibility). But if the NA has a timezone attribute, that should be used.

Given that, we probably should take out the is.na() bit in our binding since that will match what lubridate::tz() here. I only suggested adding it if we were aiming to mimic the lubridate::tz(NA) behavior here (which I thought would be UTC, but turns out, it's not!). That has the added benefit that we don't do by-element comparison, which is nice!

@dragosmg dragosmg requested a review from jonkeane February 15, 2022 10:19
@dragosmg dragosmg changed the title ARROW-14817 [R] Implement bindings for lubridate::tz ARROW-14817 [R] Implement bindings for lubridate::tz() Feb 15, 2022
Copy link
Member

@jonkeane jonkeane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is getting there! A few more comments and suggestions.

collect(),
df
),
error = TRUE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what you're looking for here (and below too) is:

Suggested change
error = TRUE
warning = TRUE

That will ensure that the warning we're getting is the "not supported" kind, and then compare the results after that.

#' @param warning The expected warning from the RecordBatch and Table comparison
#' paths, passed to `expect_warning()`. Special values:
#' * `NA` (the default) for ensuring no warning message
#' * `TRUE` is a special case to mean to check for the
#' "not supported in Arrow; pulling data into R" message.

error = TRUE is being passed via ... to expect_equal() which also has ... which ends up being silently ignored.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what to say:

  • error is an argument to expect_snapshot() and warning is to compare_dplyr_equal()
  • I like the expect_shapshot(..., error = TRUE) a bit better since with the change we would lose some of the message cover:

image

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OH, sorry I was off one level of the nesting. I might be misunderstanding the output that you're seeing there, but I interpret that as the snapshot is recording that there's no output (i.e. the expectation has passed). Which is what we expect with warning = TRUE

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could use warning = TRUE and compare_dplyr_binding() (and no longer use expect_snapshot()), but we will no longer be testing the error messages (which we worked to improve only because they were visible).

Copy link
Member

@jonkeane jonkeane Feb 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say that if we want to check the error messages, we should do that directly with expect_error() or expect_snapshot() wrapped around something like:

call_binding("tz", "this is a string!")

Or we might need to turn that R string into an Arrow Scalar or Array...)

Copy link
Contributor Author

@dragosmg dragosmg Feb 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I was testing was the warning in compare_dplyr_warning() being promoted to error level. It is the message for this error we improved which otherwise would not have been visible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think something like call_binding("tz", ...) works because whatever we pass via the ... needs to be an Expression. Which I think is the reason why arrow::tz() only works in a dplyr context, i.e. it doesn't work on standalone Arrays. That was why I was testing errors with dplyr pipelines.

@jonkeane jonkeane force-pushed the lubridate_tz_binding branch from 792f4ae to 3bff651 Compare February 19, 2022 14:32
@jonkeane
Copy link
Member

This is very very very close! I rebased off of the default branch since #12447 merged to get that fix.

I also updated the tests to mostly be the more simple R object in the call_binding but kept one that uses Expression$create(...) since Expressions are the type that will show up in the dplyr query.

I did notice when testing locally (and I assume this'll pop up in the CI too), I think there might be a slight problem with #12447. It looks like type({Expression object}) is returning the function for getting the type and not the Type object itself (hence everyone's favorite error message: object of type closure is not subsetable)

I think that

type.Expression <- function(x) x$type

should actually be type.Expression <- function(x) x$type()

It might be nice to add a test that asserts not just that type(x) and x$type are the same, but that type(x) produces the right Type object. So like this line:

expect_equal(x$type, type(x))

could also be tested with something like expect_equal(type(x), int32()) which will make sure that what's coming out of type(x) there is definitely a type.

It's a bit muddying this PR, but I'm happy for you to make those small adjustments here rather than a whole new Jira + commit chain + another wait to rebase after that's fixed.

Other than that, this PR looks ready to go!

@dragosmg dragosmg marked this pull request as ready for review February 21, 2022 16:22
@dragosmg dragosmg requested a review from jonkeane February 21, 2022 17:45
Copy link
Member

@thisisnic thisisnic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the update to type.Expression, looks like this is all done now.

Copy link
Member

@jonkeane jonkeane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this!

@jonkeane jonkeane closed this in 0eaafe8 Feb 22, 2022
@ursabot
Copy link

ursabot commented Feb 22, 2022

Benchmark runs are scheduled for baseline = 5216c2b and contender = 0eaafe8. 0eaafe8 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️7.49% ⬆️5.58%] test-mac-arm
[Failed ⬇️1.81% ⬆️3.25%] ursa-i9-9960x
[Finished ⬇️0.52% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@dragosmg dragosmg deleted the lubridate_tz_binding branch February 24, 2022 09:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants