Skip to content

Conversation

@rok
Copy link
Member

@rok rok commented Jun 4, 2021

@github-actions
Copy link

github-actions bot commented Jun 4, 2021

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@rok rok changed the title [ARROW-12980 ARROW-12980: [C++] Kernels to extract datetime components should be timezone aware Jun 4, 2021
@github-actions
Copy link

github-actions bot commented Jun 4, 2021

@rok
Copy link
Member Author

rok commented Jun 6, 2021

Errors appear to be due to windows builds not finding tz database. See.

@rok rok force-pushed the ARROW-12980 branch 2 times, most recently from 7975c67 to c0b0bc7 Compare June 9, 2021 18:09
@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jun 10, 2021

I think it would be good to write some tests in python as well, as currently the C++ tests are very hard to verify since we don't yet have the ability to parse strings localized in the timezone (I mean: the strings in the tests are interpreted as UTC and not the "Australia/Broken_Hill" timezone. And thus as a result, the expected values can't be read/verified from the strings).
(while in python we could create the localized input timestamps with pandas)

Given that, it might also make sense to first add a "localize" kernel for converting timestamps from naive to a certain timezone.

@rok
Copy link
Member Author

rok commented Jun 10, 2021

I think it would be good to write some tests in python as well, as currently the C++ tests are very hard to verify since we don't yet have the ability to parse strings localized in the timezone (I mean: the strings in the tests are interpreted as UTC and not the "Australia/Broken_Hill" timezone. And thus as a result, the expected values can't be read/verified from the strings).
(while in python we could create the localized input timestamps with pandas)

Indeed! I've used that approach to generate data in ARROW-11759 by pandas already and I'll just adopt that code here.
It did lead me to an odd issue? in pandas that I cant quite explain. We'll either need to address it or avoid it in tests ..

Given that, it might also make sense to first add a "localize" kernel for converting timestamps from naive to a certain timezone.

Localize kernel would be great. I suppose we'd need a scalar and a vector one depending if timezone is shared between rows or not? Vector version would apply here.

Also strptime kernel ignores timezones at the moment.

@jorisvandenbossche
Copy link
Member

Localize kernel would be great. I suppose we'd need a scalar and a vector one depending if timezone is shared between rows or not?

I think a scalar kernel is the most important (with signature like localize(array[timestamp], string tz) -> array[timestamp, tz], where array can also be chunked array or scalar of course). Mixed timezones is a rather corner case (and in principle also could be done as a scalar kernel if the second tz argument is also an array). I think ARROW-5912 is more a bug in the python->arrow conversion code, and not necessarily related / fixable by a localize kernel.

I opened https://issues.apache.org/jira/browse/ARROW-13033 for this.

@jorisvandenbossche
Copy link
Member

Another question: is the locate_zone configurable in some way to give some hints where to find the tz database?

The database is not always available on the system (as you noticed for windows). But depending on the application / binding language, it might already be available elsewhere. For example, in recent Python versions, there is now support for this in the standard library, automatically using the tzdata package if there is no system database. I am wondering if pyarrow should be able to instruct the kernel where to look for the database (https://www.python.org/dev/peps/pep-0615/#search-path-configuration in python).
Or does the vendored date.h also include functionalities to automatically download the data if not available on the system?

@rok
Copy link
Member Author

rok commented Jun 10, 2021

I opened https://issues.apache.org/jira/browse/ARROW-13033 for this.

Nice, I was also writing it just now :)
The strptime timzone ignoring issue is here.

Or does the vendored date.h also include functionalities to automatically download the data if not available on the system?

As per HowardHinnant's answer:

The library can be configured to not go to the internet for tzdata. tzdata can either be downloaded manually, and this lib can find it, or (on non-Windows systems), the OS-supplied tzdata can be used.

I suppose we could have the kernel try getting data online. If that fails try OS and as a final fallback use arrow bundled tz db (which we would have to add in this PR).

@rok
Copy link
Member Author

rok commented Jun 10, 2021

@jorisvandenbossche added the python tests. Will add some more for timezone naive and if needed make another PR :).

@rok rok force-pushed the ARROW-12980 branch 2 times, most recently from d460b7f to 62a7915 Compare June 10, 2021 18:02
@rok
Copy link
Member Author

rok commented Jun 10, 2021

Added the timezone-naive test for component extractions.
I'll wait for the consensus to build on the timezone handling discussions before closing the PR and moving the python tests to a new PR.

@jorisvandenbossche
Copy link
Member

I'll wait for the consensus to build on the timezone handling discussions before closing the PR and moving the python tests to a new PR.

I think there was no opposition for this one (i.e. that the field extraction should yield local hour/minute/etc), so I don't think there is a need to close the PR.

@rok rok force-pushed the ARROW-12980 branch 5 times, most recently from 7bcd926 to 60c0d4d Compare June 14, 2021 21:21
@rok
Copy link
Member Author

rok commented Jun 14, 2021

I think there was no opposition for this one (i.e. that the field extraction should yield local hour/minute/etc), so I don't think there is a need to close the PR.

Huh, I forget exactly what I was thinking there and you appear to be right.

I think the main remaining task is the tz db issue. I'll take a look at it tomorrow.

@rok rok force-pushed the ARROW-12980 branch 3 times, most recently from bb23f0a to 869c62a Compare August 13, 2021 02:40
@rok
Copy link
Member Author

rok commented Aug 13, 2021

@pitrou Thanks for the review! And sorry for late reply.
Could you please review again and suggest if templating could be further improved.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment here (or embed it in the name of the test) that this is a test for a invalid / non-existing timezone?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed the test.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple more comments. I'll do the changes myself.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't take a const-reference to a temporary dynamically created value.

Copy link
Member

@pitrou pitrou Aug 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can probably avoid the overhead of std::function by templating with the callable type itself.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point it should be possible to reconcile this with TemporalComponentExtract above...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. This will be useful for the Strftime and localize kernels as well.

@rok
Copy link
Member Author

rok commented Aug 16, 2021

@pitrou thanks for the review and the changes - those look much better!

@pitrou
Copy link
Member

pitrou commented Aug 16, 2021

Rebased again to pick up AppVeyor fix :-/

@pitrou pitrou closed this in 6f62649 Aug 16, 2021
@rok
Copy link
Member Author

rok commented Aug 16, 2021

Thanks @pitrou @jorisvandenbossche and @kszucs !

Copy link
Member

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question on the R changes in this PR

withr::local_timezone("UTC")

test_date <- as.POSIXct("2017-01-01 00:00:12.3456789", tz = "")
if (tolower(Sys.info()[["sysname"]]) == "windows") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is windows special-cased here? This needs an explanatory comment and possibly a link to a jira that will let us delete this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is due to the timezone db issue on windows.
A comment here could state:

# TODO: We should test on windows once ARROW-13168 is resolved.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this doesn't work on windows, why did we have to add -lole32 to r/configure.win in this patch?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without -lole32 flag all zoned time logic was causing errors at compile time (if I remember correctly) so it would have to be disabled for windows. Since we will eventually have zones on windows I went ahead and rather added the flag.

Here's a similar issue from date.h and a similar discussion from another PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants