-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-12980: [C++] Kernels to extract datetime components should be timezone aware #10457
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename pull request title in the following format? or See also: |
f6b2ddf to
7dcdb9b
Compare
|
Errors appear to be due to windows builds not finding tz database. See. |
7975c67 to
c0b0bc7
Compare
|
I think it would be good to write some tests in python as well, as currently the C++ tests are very hard to verify since we don't yet have the ability to parse strings localized in the timezone (I mean: the strings in the tests are interpreted as UTC and not the "Australia/Broken_Hill" timezone. And thus as a result, the expected values can't be read/verified from the strings). Given that, it might also make sense to first add a "localize" kernel for converting timestamps from naive to a certain timezone. |
Indeed! I've used that approach to generate data in ARROW-11759 by pandas already and I'll just adopt that code here.
Localize kernel would be great. I suppose we'd need a scalar and a vector one depending if timezone is shared between rows or not? Vector version would apply here. Also strptime kernel ignores timezones at the moment. |
I think a scalar kernel is the most important (with signature like I opened https://issues.apache.org/jira/browse/ARROW-13033 for this. |
|
Another question: is the The database is not always available on the system (as you noticed for windows). But depending on the application / binding language, it might already be available elsewhere. For example, in recent Python versions, there is now support for this in the standard library, automatically using the |
Nice, I was also writing it just now :)
As per HowardHinnant's answer:
I suppose we could have the kernel try getting data online. If that fails try OS and as a final fallback use arrow bundled tz db (which we would have to add in this PR). |
|
@jorisvandenbossche added the python tests. Will add some more for timezone naive and if needed make another PR :). |
d460b7f to
62a7915
Compare
|
Added the timezone-naive test for component extractions. |
I think there was no opposition for this one (i.e. that the field extraction should yield local hour/minute/etc), so I don't think there is a need to close the PR. |
7bcd926 to
60c0d4d
Compare
Huh, I forget exactly what I was thinking there and you appear to be right. I think the main remaining task is the tz db issue. I'll take a look at it tomorrow. |
bb23f0a to
869c62a
Compare
|
@pitrou Thanks for the review! And sorry for late reply. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment here (or embed it in the name of the test) that this is a test for a invalid / non-existing timezone?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed the test.
1e87ac3 to
29d12f8
Compare
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple more comments. I'll do the changes myself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can't take a const-reference to a temporary dynamically created value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can probably avoid the overhead of std::function by templating with the callable type itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this point it should be possible to reconcile this with TemporalComponentExtract above...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. This will be useful for the Strftime and localize kernels as well.
|
@pitrou thanks for the review and the changes - those look much better! |
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
|
Rebased again to pick up AppVeyor fix :-/ |
|
Thanks @pitrou @jorisvandenbossche and @kszucs ! |
nealrichardson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One question on the R changes in this PR
| withr::local_timezone("UTC") | ||
|
|
||
| test_date <- as.POSIXct("2017-01-01 00:00:12.3456789", tz = "") | ||
| if (tolower(Sys.info()[["sysname"]]) == "windows") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is windows special-cased here? This needs an explanatory comment and possibly a link to a jira that will let us delete this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is due to the timezone db issue on windows.
A comment here could state:
# TODO: We should test on windows once ARROW-13168 is resolved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this doesn't work on windows, why did we have to add -lole32 to r/configure.win in this patch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without -lole32 flag all zoned time logic was causing errors at compile time (if I remember correctly) so it would have to be disabled for windows. Since we will eventually have zones on windows I went ahead and rather added the flag.
Here's a similar issue from date.h and a similar discussion from another PR.
https://issues.apache.org/jira/browse/ARROW-12980