-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-13549: [C++] Add casts from timestamp to date/time #10933
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Looks like this will conflict with #10457/ARROW-12980 so we may want to hold off on this one, as that one looks close. |
Indeed. Looks great! The only thing I would recommend would be to use |
|
ARROW-12980 looks like it should be close so I'll rebase on top of that now. |
|
@lidavidm ARROW-12980 was merged. |
|
Thanks for the heads up! I'll get this rebased soon. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After ARROW-12980, this now works with timezones? (or can work)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doh, I forgot to update the docstring. Yes, they all work with timezones and there are tests. I've updated all the docstrings.
|
Why aren't these implemented as cast kernels instead? |
|
Ah, that makes sense. I'll update this to be a cast instead, though I think we are going to need some refactoring of the utilities. |
|
There's already a cast from timestamp to date32/date64, however, placing this implementation there would change semantics a little bit: Also Python doesn't expose a way to set only allow_time_truncate (though maybe we should just allow the cast if we refactor things here). But that does raise the question of whether a cast is appropriate for this operation (since it seems like casting is generally interpreted more like reinterpret_cast, while this is an actual conversion). Also for instance a cast of a timestamp-with-timezone right now is quite different than what this does. |
|
Date extraction could also be thought of as a rounding to a day interval. |
|
Personally, I think having a separate (non-cast) kernel to extract those components make sense from a user perspective (but can of course share implementation), and complementing the other timestamp component extraction kernels we already have.
I would say that the current casting semantics are wrong and should be fixed? (in any case, the "extracted" date is clearly wrong, whether that's seen as a consequence of the "unsafe" cast or not is to be discussed I suppose) There are actually two "unsafe" steps in this conversion it seems (which explains the wrong part): arr = pa.array(["1970-01-01 00:00:59.123456789","2000-02-29 23:23:23.999999999","1899-01-01 00:59:20.001001001"]).cast(pa.timestamp("ns"))
>>> arr.cast(pa.date64())
...
ArrowInvalid: Casting from timestamp[ns] to date64[ms] would lose data: 59123456789
../src/arrow/compute/kernels/scalar_cast_temporal.cc:178 (ShiftTime<int64_t, int64_t>(ctx, conversion.first, conversion.second, input, output))
# that error is actually coming from a conversion to milliseconds
# (and you don't really care about the part being lost for conversion to date ..)
>>> arr.cast(pa.timestamp("ms"))
...
ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 59123456789
# when ignoring this lost part in conversion to ms, then casting to date gives another error:
>>> arr.cast(pa.timestamp("ms"), safe=False).cast(pa.date64())
...
ArrowInvalid: Timestamp value had non-zero intraday millisecondsAnd I suppose that when those intraday milliseconds are ignored by doing arrow/cpp/src/arrow/compute/kernels/scalar_cast_temporal.cc Lines 198 to 202 in d3af6d4
By subtracting the remainder we basically round towards zero (it seems C++ module operator behaves differently as Python when involving negative integers |
I'm not sure I understand what you mean with the reinterpret_cast comment. Our casts are definitely conversions (see the Decimal -> Decimal casts for example). |
My problem is that I don't even understand the difference they're supposed to make to the "normal" casts. Are those (supposedly) different semantics really desired? I would favour fixing/improving the currently implemented casts, if necessary. |
I guess I was wondering if the current behavior (just 'reinterpreting' the timestamp) is still useful, it sounds like not. I think the path here is to make the kernels in this PR into safe casts, so that users don't have to specify an unsafe cast. (In theory you could get away with just allow_time_truncate but I think there's no way to pass that in Python.) |
2b9b2a2 to
bda1eae
Compare
|
This is now implemented as a cast and is rebased. For casting timestamp->time, we do check the truncation/overflow flags in some scenarios (e.g. if you want to cast a nanosecond timestamp to a time32). |
|
BTW, I stumbled on https://issues.apache.org/jira/browse/ARROW-10213, so it seems @lidavidm you already opened an issue about this buggy (round instead of extract, see above #10933 (comment)) behaviour a while ago .. :-) |
|
Whoops! Yeah, let me link/close-as-duplicate the issues and update the description. Thanks for finding this. |
I think there can be some value in the current meaning of "safe" cast of timestamp to date. For example, it would allow you to convert timestamps-which-are-actually-dates safely to dates, without loosing any time information. While if we make the safe cast to ignore the time values by default, the only way to do this is by first checking if all hour/minute/second/subsecond components are zero. |
bda1eae to
2ed0770
Compare
|
I've rebased this again. |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just one suggestion
| } | ||
|
|
||
| template <typename T> | ||
| enable_if_timestamp<T, const std::string> GetInputTimezone(const DataType& type) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this conflict with the non-template GetInputTimezone(const DataType&) above? At least it seems there's a potential for confusion. Perhaps we can simply reconcile both implementations? For example:
static inline const std::string& GetInputTimezone(const DataType& type) {
static const std::string no_timezone = "";
switch (type.id()) {
case Type::TIMESTAMP:
return checked_cast<const TimestampType&>(type).timezone();
default:
return no_timezone;
}
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, fixed. I suppose it worked before since it was called as GetInputTimezone<T>(...).
|
@jorisvandenbossche Any further comments on this? |
Closes apache#10933 from lidavidm/arrow-13549 Authored-by: David Li <li.davidm96@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
No description provided.