-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-13448: [R] Bindings for strftime #11105
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
5568c27 to
8ad7ce3
Compare
|
Thanks for the review @thisisnic! |
|
Another thing to note here is that behavior of the
This is already tested in c++ and we test it in Python like so. I'm not sure there's a point in testing it in R but it could look like this: arr <- Array$create(c(lubridate::ymd_hms("2018-10-07 19:04:05"), NA), type = timestamp("s"))
expect_dplyr_equal(
input %>%
mutate(x = strftime(x, format = "%S")) %>%
collect(),
arr
)Except this wouldn't work for type reasons: Error (test-dplyr-string-functions.R:756:3): strftime
Error: no applicable method for 'mutate' applied to an object of class "c('Array', 'ArrowDatum', 'ArrowObject', 'R6')" |
thisisnic
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please can you also run the styler, so that the linter stops complaining?
So, the R "%S" returns an integer, so you're right in highlighting the difference between Arrow and R. We have other time-related functions which return slightly different values, which we allow for, e.g. arrow/r/tests/testthat/test-dplyr-lubridate.R Lines 155 to 164 in 5260fd5
The function I suppose the difficulty lies in the fact that the output is a string not a number, but there will be workarounds. |
768a940 to
ba9fe66
Compare
Co-authored-by: Nic <thisisnic@gmail.com>
| expect_dplyr_equal( | ||
| input %>% | ||
| mutate(x = strftime(x, format = formats_minus_c, tz = "Pacific/Marquesas")) %>% | ||
| collect(), | ||
| times | ||
| ) | ||
|
|
||
| withr::with_locale(new = c("LC_TIME" = "C"), | ||
| expect_dplyr_equal( | ||
| input %>% | ||
| mutate(x = strftime(x, format = "%c", tz = "Pacific/Marquesas")) %>% | ||
| collect(), | ||
| times | ||
| ) | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't quite explain why, but it happens on my laptop. Could it be that strftime is using locale different from the one we use here (Sys.getlocale("LC_TIME"))?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It appears R's strftime interprets %c differently as date.h we use in arrow in certain locales and timezones?
Sometimes it's "%a %b %e %H:%M:%S %Y" and sometimes "%a %b %e %H:%M:%S %Y %Z".
Could also just be my laptop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused here - can you elaborate a bit more on your comments above please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, it was late :).
These tests are failing locally for me on the %c flag in Pacific/Marquesas, but not for say UTC. I am not sure why - it could be a difference between the definition of %c in date.h we're using and in R's strftime. Or it could just be some setting on my laptop. Either way it's very strange to see an error on only this flag with this locale and timezone.
I'm more and more suspicious of my laptop really.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another possibility is that the timezone database your system uses that may/may not be quite the same in different environments (cf https://stat.ethz.ch/R-manual/R-devel/library/base/html/timezones.html)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be. FWIW I'm on Ubuntu 21.04 and have LC_TIME=nl_NL.UTF-8 and Sys.timezone()-> "Europe/Amsterdam".
I'll remove the check and see what happens in the CI.
Should we add checks in for different locales into the tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're worried that there might be differences / bugs not respecting locales, having something non-default using withr::with_locale (like you do above) would be a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I think I finally understand what the issue is here. Arrow's strftime behaves differently than R's.
Arrow strftime(x, format = "%c %Z"):
Sat 01 Jan 2022 00:00:00 AM CET EST
R strftime(x, format = "%c"):
Sat 01 Jan 2022 12:00:00 AM EST EST
It's probably a bug in the C++ library we use so I opened an issue and added a reference to it in the comment HowardHinnant/date#704
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've disabled %c in R for now and I propose we merge this and revisit later.
@pitrou shall we block %c in c++ as well? See HowardHinnant/date#704
|
@thisisnic @jonkeane sorry this took a while. Could you please do another pass? |
thisisnic
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, but please can you give it a final check @jonkeane in case I've missed anything?
One other thing to note @rok - there was a comment on a duplicate ticket of this one which mentions implementing the lubridate package's format_ISO8601 function (see https://lubridate.tidyverse.org/reference/format_ISO8601.html). Please could you either:
- implement that here, or
- create a new ticket to implement it (and as you've done most of the hard work here, it could be a good candidate to label as
good-first-issueorgood-second-issue)
|
@thisisnic good catch on Yet another thing to note: the c++ |
|
Ok, added |
Co-authored-by: Jonathan Keane <jkeane@gmail.com>
thisisnic
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of minor things, otherwise looking good!
Co-authored-by: Nic <thisisnic@gmail.com>
thisisnic
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for implementing format_ISO8601 as well!
jonkeane
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks for this
|
Thanks @thisisnic @jonkeane @edponce ! |
This is to resolve [ARROW-13448](https://issues.apache.org/jira/browse/ARROW-13448). Closes apache#11105 from rok/ARROW-13448 Lead-authored-by: Rok <rok@mihevc.org> Co-authored-by: Rok Mihevc <rok@mihevc.org> Signed-off-by: Jonathan Keane <jkeane@gmail.com>
This is to resolve ARROW-13448.