Skip to content

Conversation

@rok
Copy link
Member

@rok rok commented Sep 7, 2021

This is to resolve ARROW-13448.

@github-actions
Copy link

github-actions bot commented Sep 7, 2021

@github-actions
Copy link

github-actions bot commented Sep 7, 2021

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@rok rok force-pushed the ARROW-13448 branch 2 times, most recently from 5568c27 to 8ad7ce3 Compare September 13, 2021 16:14
@rok rok marked this pull request as ready for review September 13, 2021 18:48
@rok
Copy link
Member Author

rok commented Sep 16, 2021

Thanks for the review @thisisnic!

@rok
Copy link
Member Author

rok commented Sep 16, 2021

Another thing to note here is that behavior of the %S flag is different in arrow:

Output precision of %S (seconds) flag depends on the input timestamp precision. Timestamps with second precision are represented as integers while milliseconds, microsecond and nanoseconds are represented as fixed floating point numbers with 3, 6 and 9 decimal places respectively. To obtain integer seconds, cast to timestamp with second resolution. The character for the decimal point is localized according to the locale. See detailed formatting documentation for descriptions of other flags.

This is already tested in c++ and we test it in Python like so. I'm not sure there's a point in testing it in R but it could look like this:

  arr <- Array$create(c(lubridate::ymd_hms("2018-10-07 19:04:05"), NA), type = timestamp("s"))
  expect_dplyr_equal(
    input %>%
      mutate(x = strftime(x, format = "%S")) %>%
      collect(),
    arr
  )

Except this wouldn't work for type reasons:

Error (test-dplyr-string-functions.R:756:3): strftime
Error: no applicable method for 'mutate' applied to an object of class "c('Array', 'ArrowDatum', 'ArrowObject', 'R6')"

Copy link
Member

@thisisnic thisisnic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please can you also run the styler, so that the linter stops complaining?

@thisisnic
Copy link
Member

thisisnic commented Sep 16, 2021

behavior of the %S flag is different in arrow

So, the R "%S" returns an integer, so you're right in highlighting the difference between Arrow and R. We have other time-related functions which return slightly different values, which we allow for, e.g.

test_that("extract second from timestamp", {
expect_dplyr_equal(
input %>%
mutate(x = second(datetime)) %>%
collect(),
test_df,
# arrow supports nanosecond resolution but lubridate does not
tolerance = 1e-6
)
})

The function expect_dplyr_equal(), as you've pointed out, won't work on Array objects. However, you could write a test like the one shown above, which just adds a tolerance (given that the user is being returned a more precise answer than they would have, the tolerance is justified).

I suppose the difficulty lies in the fact that the output is a string not a number, but there will be workarounds.

@rok rok force-pushed the ARROW-13448 branch 5 times, most recently from 768a940 to ba9fe66 Compare September 18, 2021 22:15
Co-authored-by: Nic <thisisnic@gmail.com>
Comment on lines 748 to 762
expect_dplyr_equal(
input %>%
mutate(x = strftime(x, format = formats_minus_c, tz = "Pacific/Marquesas")) %>%
collect(),
times
)

withr::with_locale(new = c("LC_TIME" = "C"),
expect_dplyr_equal(
input %>%
mutate(x = strftime(x, format = "%c", tz = "Pacific/Marquesas")) %>%
collect(),
times
)
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't quite explain why, but it happens on my laptop. Could it be that strftime is using locale different from the one we use here (Sys.getlocale("LC_TIME"))?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears R's strftime interprets %c differently as date.h we use in arrow in certain locales and timezones?
Sometimes it's "%a %b %e %H:%M:%S %Y" and sometimes "%a %b %e %H:%M:%S %Y %Z".
Could also just be my laptop.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused here - can you elaborate a bit more on your comments above please?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, it was late :).
These tests are failing locally for me on the %c flag in Pacific/Marquesas, but not for say UTC. I am not sure why - it could be a difference between the definition of %c in date.h we're using and in R's strftime. Or it could just be some setting on my laptop. Either way it's very strange to see an error on only this flag with this locale and timezone.

I'm more and more suspicious of my laptop really.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another possibility is that the timezone database your system uses that may/may not be quite the same in different environments (cf https://stat.ethz.ch/R-manual/R-devel/library/base/html/timezones.html)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be. FWIW I'm on Ubuntu 21.04 and have LC_TIME=nl_NL.UTF-8 and Sys.timezone()-> "Europe/Amsterdam".
I'll remove the check and see what happens in the CI.
Should we add checks in for different locales into the tests?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're worried that there might be differences / bugs not respecting locales, having something non-default using withr::with_locale (like you do above) would be a good idea.

We do hardcode this for at least some of our CI builds

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I think I finally understand what the issue is here. Arrow's strftime behaves differently than R's.
Arrow strftime(x, format = "%c %Z"):
Sat 01 Jan 2022 00:00:00 AM CET EST

R strftime(x, format = "%c"):
Sat 01 Jan 2022 12:00:00 AM EST EST

It's probably a bug in the C++ library we use so I opened an issue and added a reference to it in the comment HowardHinnant/date#704

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've disabled %c in R for now and I propose we merge this and revisit later.
@pitrou shall we block %c in c++ as well? See HowardHinnant/date#704

@rok rok requested a review from thisisnic September 18, 2021 23:15
@thisisnic thisisnic requested a review from jonkeane September 20, 2021 12:28
@rok
Copy link
Member Author

rok commented Sep 21, 2021

@thisisnic @jonkeane sorry this took a while. Could you please do another pass?

@thisisnic thisisnic requested a review from jonkeane September 22, 2021 10:21
Copy link
Member

@thisisnic thisisnic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, but please can you give it a final check @jonkeane in case I've missed anything?

One other thing to note @rok - there was a comment on a duplicate ticket of this one which mentions implementing the lubridate package's format_ISO8601 function (see https://lubridate.tidyverse.org/reference/format_ISO8601.html). Please could you either:

  1. implement that here, or
  2. create a new ticket to implement it (and as you've done most of the hard work here, it could be a good candidate to label as good-first-issue or good-second-issue)

@rok
Copy link
Member Author

rok commented Sep 22, 2021

@thisisnic good catch on format_ISO8601. If you don't have a strong preference I'll go ahead and add it.

Yet another thing to note: the c++ strfkernel only works for timestamps at the moment. When dates and times are added (ARROW-13916) we should add tests in R as well. I'll make a note on Jira.

@rok
Copy link
Member Author

rok commented Sep 22, 2021

Ok, added format_ISO8601 and moved %c check into c++. I suppose another review is in order @thisisnic :)

Co-authored-by: Jonathan Keane <jkeane@gmail.com>
Copy link
Member

@thisisnic thisisnic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of minor things, otherwise looking good!

Co-authored-by: Nic <thisisnic@gmail.com>
Copy link
Member

@thisisnic thisisnic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for implementing format_ISO8601 as well!

@rok rok requested a review from jonkeane September 23, 2021 16:03
Copy link
Member

@jonkeane jonkeane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks for this

@jonkeane jonkeane closed this in 953ac8a Sep 23, 2021
@rok
Copy link
Member Author

rok commented Sep 23, 2021

Thanks @thisisnic @jonkeane @edponce !

ViniciusSouzaRoque pushed a commit to s1mbi0se/arrow that referenced this pull request Oct 20, 2021
This is to resolve [ARROW-13448](https://issues.apache.org/jira/browse/ARROW-13448).

Closes apache#11105 from rok/ARROW-13448

Lead-authored-by: Rok <rok@mihevc.org>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Jonathan Keane <jkeane@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants