Skip to content

Conversation

@dragosmg
Copy link
Contributor

@dragosmg dragosmg commented May 5, 2022

Update: this PR will only improve the message users get when trying to pass multiple formats to tryFormats. The new message will include an advice to use the dedicated parsing functions. This conclusion / recommendation can be found here (first comment): https://issues.apache.org/jira/browse/ARROW-15805

=================

Old description: This PR will allow users to try parsing with several formats via the tryFormats argument to as.Date() and as_date()

tryFormats - character vector of format strings to try if format is not specified.

as_date() does not have the tryFormats argument, but will assume an ISO 8601 format if nothing is passed via the format argument.

https://github.com/tidyverse/lubridate/blob/c4714046ec412e669906ab8d1cd9a8e4fb125e13/R/coercion.r#L725

I implemented tryFormats with the parse_date_time() binding to coalesce through the format strings. One implication is that as.Date() is no longer erroring when not being able to parse, but rather return an NA.

A thought: should we only support tryFormats only when RE2 is available? As it currently is, this PR will remove the as.Date functionality if format is NULL for Windows R 3.6 users.

@github-actions
Copy link

github-actions bot commented May 5, 2022

@dragosmg dragosmg force-pushed the asDate_with_tryFormats branch from 710034e to 74b0108 Compare May 10, 2022 08:14
@dragosmg dragosmg marked this pull request as ready for review May 10, 2022 10:17
@dragosmg dragosmg force-pushed the asDate_with_tryFormats branch from 3eb3489 to 381242a Compare May 10, 2022 17:08
Copy link
Member

@jonkeane jonkeane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't look in too much detail at the operational code, though I did comment about a few additional tests that I suspect will unearth some issues.

A thought: should we only support tryFormats only when RE2 is available? As it currently is, this PR will remove the as.Date functionality if format is NULL for Windows R 3.6 users.

This is ok if it's the only way, though we should generally avoid it. Could we provide a fallback? Or is there a way to avoid the RE2 dependency for this?

I implemented tryFormats with the parse_date_time() binding to coalesce through the format strings. One implication is that as.Date() is no longer erroring when not being able to parse, but rather return an NA.

Like I mention with the test addition, I suspect coalesce isn't quite the right behavior here. As for returning NA: is that the behavior of the function we are binding? If so, that's great + good. If not, we should think if we can fix that

@dragosmg
Copy link
Contributor Author

dragosmg commented May 11, 2022

Like I mention with the test addition, I suspect coalesce isn't quite the right behavior here. As for returning NA: is that the behavior of the function we are binding? If so, that's great + good. If not, we should think if we can fix that.

I agree with you that coalesce() doesn't seem to be the right behaviour (i.e. not giving us perfect overlap with base::as.Date()). Do you have any ideas on this? I"m struggling a bit to replicate the base behaviour (subsetting the column ...).

On the returning NA. It depends. We don't align with the behaviour of as.Date() only when all formats fail to parse and optional = FALSE (which is the default) , in which case we align with as.Date(optional = TRUE). It's a decent, but again, not perfect overlap.

x <- c(NA, "2022-01-01", "2022/01/01")

as.Date(x, tryFormats = c("%y-%m-%d", "%y/%m/%d"), optional = FALSE) # default
#> Error in charToDate(x): character string is not in a standard unambiguous format
as.Date(x, tryFormats = c("%y-%m-%d", "%y/%m/%d"), optional = TRUE)
#> [1] NA NA NA

Created on 2022-05-11 by the reprex package (v2.0.1)

@jonkeane
Copy link
Member

Do you have any ideas on this? I"m struggling a bit to replicate the base behaviour (subsetting the column ...).

What have you tried so far? This will be tricky + it's totally possible that it is not feasible to emulate this behavior with the way we build these.

On the returning NA. It depends. We don't align with the behaviour of as.Date() only when all formats fail to parse and optional = FALSE (which is the default) , in which case we align with as.Date(optional = TRUE). It's a decent, but again, not perfect overlap.

Since we have the option to either emit an NA or error on parsing fail, could we capture this behavior?

Tangentially, the example you use works, though it's a bit confusing IMO %y fails because it's year-without-century as opposed to %Y. And the NA at the beginning also made me think that was what you were relying on for failing the guessing, but that's doesn't actually make the parsing fail (which I got from reading the source of as.Date.character...). Here's a slightly more minimal | possibly less confusing example of that same phenomenon:

> x <- c("2022-01-01", "2022/01/01")
> as.Date(x, tryFormats = c("%Y--", "%m--"), optional = FALSE)
Error in charToDate(x) : 
  character string is not in a standard unambiguous format
> as.Date(x, tryFormats = c("%Y--", "%m--"), optional = TRUE)
[1] NA NA

@dragosmg
Copy link
Contributor Author

What have you tried so far? This will be tricky + it's totally possible that it is not feasible to emulate this behavior with the way we build these.

Initially I started by trying to apply an identical logic to as.Date.character(): find the first non-null element and attempt to parse it with the first format, then move on to the next format, etc. Computationally it should be relatively efficient since we would try the formats on a single element, but I couldn't figure out how to do this with expressions (find the first non-null, apply the format, force the computation, if NA, try the next one, etc).

@dragosmg
Copy link
Contributor Author

Since we have the option to either emit an NA or error on parsing fail, could we capture this behavior?

We could, but if we use coalesce() (not saying we should, but it's the best I was able to come up with), then we'd need to build the iteration something like:

coalesce(
build_expr("strptime", ..., error_is_null = TRUE)
...
build_expr("strptime", ..., error_is_null = FALSE)
)

Because we'd still try the various formats and we want to error or return NA only if all of them fail to parse.

@dragosmg
Copy link
Contributor Author

dragosmg commented May 12, 2022

@jonkeane in terms of what have I tried, I went down this route to identify the fist non-NULL element of x.

library(arrow, warn.conflicts = FALSE)
b <- Array$create(c(NA, "2022-01-01", "2022/01/01", NA))

for (i in 1:b$length()) {
  if (b$IsValid(i - 1)) {
    xx <- b$Slice(i - 1, 1)
    break()
  }
}
xx
#> Array
#> <string>
#> [
#>   "2022-01-01"
#> ]

which doesn't work when x is an Expression, but it does when it is an Array. If only there were a way to anticipatively get the length of an Expression output (and other info) once evaluated.

@jonkeane
Copy link
Member

Initially I started by trying to apply an identical logic to as.Date.character(): find the first non-null element and attempt to parse it with the first format, then move on to the next format, etc. Computationally it should be relatively efficient since we would try the formats on a single element, but I couldn't figure out how to do this with expressions (find the first non-null, apply the format, force the computation, if NA, try the next one, etc).

Your description of how base R accomplishes this is good — and like you mention: iterating through data (even to get a single element) might be complicate or problematic.

The next step I would take is to explore what functionality is available (you seem to have found IsValid() which is good. You'll need to look at both Arrays and Expressions, since you could get either in a dplyr query like this (are there any other object types you might encounter there?). And then I would try and construct a way to get the format then apply it. In this exploration it's ok to have unreasonable or code that we might not want to ship to see if it's possible. This type of data introspection is not common in our bindings, but you might look around to see if there are any others that do similar (I don't know of any off the top of my head, but it's possible that one exists).

Another approach might be to try other methods of constructing expressions that don't use coalesce but produce multiple columns and you select one at the end of the expression.

It's entirely possible that this type of introspection + format selection will not be feasible in R-only code too. But as always, it's helpful to explore and see + have something that ~works even if it's not what we want to create tests + implement it in the C++ layer.

@dragosmg dragosmg force-pushed the asDate_with_tryFormats branch from 53d302d to 7344bf5 Compare May 16, 2022 14:28
@dragosmg
Copy link
Contributor Author

dragosmg commented May 16, 2022

You'll need to look at both Arrays and Expressions, since you could get either in a dplyr query like this (are there any other object types you might encounter there?

That was one of my issues. I figured how to do it for Arrays (i.e. the logic of the selection), but not for Expressions - which are the types we have there. I had looked at functionalities for both.

@dragosmg dragosmg marked this pull request as draft May 18, 2022 09:03
@dragosmg dragosmg requested a review from jonkeane May 25, 2022 15:30
@dragosmg
Copy link
Contributor Author

dragosmg commented May 26, 2022

I think there is something I'd like to try here. An approach similar to the case_when() binding. It might be computationally costly, but it might also work (and won't require relying on coalesce()).
Short sketch:

  • build strptime Expressions for each tryFormat -> this should give us a list of Expressions
  • force evaluation of each Expression with arrow_eval in a mask probably 2-3 caller environments higher -> this should might give us a list of values (not sure if this step is actually going to give me what I need)
  • reduce the list of values to the position of the first NA / first non-NA elements.
  • compare first NA / non-NA positions and choose the format

@dragosmg dragosmg force-pushed the asDate_with_tryFormats branch from e008fdf to d929b8b Compare June 30, 2022 15:23
@dragosmg dragosmg marked this pull request as ready for review July 4, 2022 09:59
@rok
Copy link
Member

rok commented Jul 4, 2022

Given the runtime kernel limitations I think throwing an error makes most sense here.

Copy link
Member

@thisisnic thisisnic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the change in the outcome, it might also be worth updating the ticket name so we don't get confused about the nature of this change in the release notes.

@rok
Copy link
Member

rok commented Jul 4, 2022

Given the change in the outcome, it might also be worth updating the ticket name so we don't get confused about the nature of this change in the release notes.

+1 on the jira name.

@dragosmg
Copy link
Contributor Author

dragosmg commented Jul 4, 2022

I updated the Jira ticket name. Great point! Thanks.

@dragosmg dragosmg changed the title ARROW-15804: [R] Update as.Date() to support several tryFormats ARROW-15804: [R] Improve as.Date() error message when supplying several tryFormats Jul 4, 2022
@dragosmg dragosmg requested a review from thisisnic July 4, 2022 14:46
Copy link
Member

@thisisnic thisisnic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Provided the CI passes, LGTM, thanks for updating this!

@thisisnic thisisnic closed this in 3f205d4 Jul 5, 2022
@ursabot
Copy link

ursabot commented Jul 5, 2022

Benchmark runs are scheduled for baseline = e44f79a and contender = 3f205d4. 3f205d4 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.25% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.27% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.07% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 3f205d49 ec2-t3-xlarge-us-east-2
[Finished] 3f205d49 test-mac-arm
[Finished] 3f205d49 ursa-i9-9960x
[Finished] 3f205d49 ursa-thinkcentre-m75q
[Finished] e44f79af ec2-t3-xlarge-us-east-2
[Finished] e44f79af test-mac-arm
[Finished] e44f79af ursa-i9-9960x
[Finished] e44f79af ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot
Copy link

ursabot commented Jul 5, 2022

['Python', 'R'] benchmarks have high level of regressions.
test-mac-arm

drin pushed a commit to drin/arrow that referenced this pull request Jul 5, 2022
…al tryFormats

**Update**: this PR will only improve the message users get when trying to pass multiple formats to `tryFormats`. The new message will include an advice to use the dedicated parsing functions. This conclusion / recommendation can be found here (first comment): https://issues.apache.org/jira/browse/ARROW-15805

=================

**Old description**: This PR will allow users to try parsing with several formats via the `tryFormats` argument to `as.Date()` and `as_date()`

> _tryFormats_ - character vector of format strings to try if format is not specified.

`as_date()` does not have the `tryFormats` argument, but will assume an ISO 8601 format if nothing is passed via the `format` argument.

https://github.com/tidyverse/lubridate/blob/c4714046ec412e669906ab8d1cd9a8e4fb125e13/R/coercion.r#L725

I implemented `tryFormats` with the `parse_date_time()` binding to coalesce through the format strings. One implication is that `as.Date()` is no longer erroring when not being able to parse, but rather return an `NA`.

**A thought**: should we only support `tryFormats` only when `RE2` is available? As it currently is, this PR will remove the `as.Date` functionality if `format` is `NULL` for Windows R 3.6 users.

Closes apache#13070 from dragosmg/asDate_with_tryFormats

Authored-by: Dragoș Moldovan-Grünfeld <dragos.mold@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants