-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-15804: [R] Improve as.Date() error message when supplying several tryFormats #13070
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
710034e to
74b0108
Compare
3eb3489 to
381242a
Compare
jonkeane
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't look in too much detail at the operational code, though I did comment about a few additional tests that I suspect will unearth some issues.
A thought: should we only support tryFormats only when RE2 is available? As it currently is, this PR will remove the as.Date functionality if format is NULL for Windows R 3.6 users.
This is ok if it's the only way, though we should generally avoid it. Could we provide a fallback? Or is there a way to avoid the RE2 dependency for this?
I implemented tryFormats with the parse_date_time() binding to coalesce through the format strings. One implication is that as.Date() is no longer erroring when not being able to parse, but rather return an NA.
Like I mention with the test addition, I suspect coalesce isn't quite the right behavior here. As for returning NA: is that the behavior of the function we are binding? If so, that's great + good. If not, we should think if we can fix that
I agree with you that On the returning x <- c(NA, "2022-01-01", "2022/01/01")
as.Date(x, tryFormats = c("%y-%m-%d", "%y/%m/%d"), optional = FALSE) # default
#> Error in charToDate(x): character string is not in a standard unambiguous format
as.Date(x, tryFormats = c("%y-%m-%d", "%y/%m/%d"), optional = TRUE)
#> [1] NA NA NACreated on 2022-05-11 by the reprex package (v2.0.1) |
What have you tried so far? This will be tricky + it's totally possible that it is not feasible to emulate this behavior with the way we build these.
Since we have the option to either emit an Tangentially, the example you use works, though it's a bit confusing IMO |
Initially I started by trying to apply an identical logic to |
We could, but if we use coalesce(
build_expr("strptime", ..., error_is_null = TRUE)
...
build_expr("strptime", ..., error_is_null = FALSE)
)Because we'd still try the various formats and we want to error or return NA only if all of them fail to parse. |
|
@jonkeane in terms of what have I tried, I went down this route to identify the fist non-NULL element of library(arrow, warn.conflicts = FALSE)
b <- Array$create(c(NA, "2022-01-01", "2022/01/01", NA))
for (i in 1:b$length()) {
if (b$IsValid(i - 1)) {
xx <- b$Slice(i - 1, 1)
break()
}
}
xx
#> Array
#> <string>
#> [
#> "2022-01-01"
#> ]which doesn't work when |
Your description of how base R accomplishes this is good — and like you mention: iterating through data (even to get a single element) might be complicate or problematic. The next step I would take is to explore what functionality is available (you seem to have found Another approach might be to try other methods of constructing expressions that don't use coalesce but produce multiple columns and you select one at the end of the expression. It's entirely possible that this type of introspection + format selection will not be feasible in R-only code too. But as always, it's helpful to explore and see + have something that ~works even if it's not what we want to create tests + implement it in the C++ layer. |
53d302d to
7344bf5
Compare
That was one of my issues. I figured how to do it for Arrays (i.e. the logic of the selection), but not for Expressions - which are the types we have there. I had looked at functionalities for both. |
|
I think there is something I'd like to try here. An approach similar to the
|
…dated unit tests
…E2 is not available
e008fdf to
d929b8b
Compare
|
Given the runtime kernel limitations I think throwing an error makes most sense here. |
thisisnic
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the change in the outcome, it might also be worth updating the ticket name so we don't get confused about the nature of this change in the release notes.
+1 on the jira name. |
|
I updated the Jira ticket name. Great point! Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Provided the CI passes, LGTM, thanks for updating this!
|
Benchmark runs are scheduled for baseline = e44f79a and contender = 3f205d4. 3f205d4 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
|
['Python', 'R'] benchmarks have high level of regressions. |
…al tryFormats **Update**: this PR will only improve the message users get when trying to pass multiple formats to `tryFormats`. The new message will include an advice to use the dedicated parsing functions. This conclusion / recommendation can be found here (first comment): https://issues.apache.org/jira/browse/ARROW-15805 ================= **Old description**: This PR will allow users to try parsing with several formats via the `tryFormats` argument to `as.Date()` and `as_date()` > _tryFormats_ - character vector of format strings to try if format is not specified. `as_date()` does not have the `tryFormats` argument, but will assume an ISO 8601 format if nothing is passed via the `format` argument. https://github.com/tidyverse/lubridate/blob/c4714046ec412e669906ab8d1cd9a8e4fb125e13/R/coercion.r#L725 I implemented `tryFormats` with the `parse_date_time()` binding to coalesce through the format strings. One implication is that `as.Date()` is no longer erroring when not being able to parse, but rather return an `NA`. **A thought**: should we only support `tryFormats` only when `RE2` is available? As it currently is, this PR will remove the `as.Date` functionality if `format` is `NULL` for Windows R 3.6 users. Closes apache#13070 from dragosmg/asDate_with_tryFormats Authored-by: Dragoș Moldovan-Grünfeld <dragos.mold@gmail.com> Signed-off-by: Nic Crane <thisisnic@gmail.com>
Update: this PR will only improve the message users get when trying to pass multiple formats to
tryFormats. The new message will include an advice to use the dedicated parsing functions. This conclusion / recommendation can be found here (first comment): https://issues.apache.org/jira/browse/ARROW-15805=================
Old description: This PR will allow users to try parsing with several formats via the
tryFormatsargument toas.Date()andas_date()as_date()does not have thetryFormatsargument, but will assume an ISO 8601 format if nothing is passed via theformatargument.https://github.com/tidyverse/lubridate/blob/c4714046ec412e669906ab8d1cd9a8e4fb125e13/R/coercion.r#L725
I implemented
tryFormatswith theparse_date_time()binding to coalesce through the format strings. One implication is thatas.Date()is no longer erroring when not being able to parse, but rather return anNA.A thought: should we only support
tryFormatsonly whenRE2is available? As it currently is, this PR will remove theas.Datefunctionality ifformatisNULLfor Windows R 3.6 users.