Skip to content

Conversation

@lidavidm
Copy link
Member

No description provided.

@github-actions
Copy link

@lidavidm
Copy link
Member Author

Looks like this will conflict with #10457/ARROW-12980 so we may want to hold off on this one, as that one looks close.

@rok
Copy link
Member

rok commented Aug 15, 2021

Looks like this will conflict with #10457/ARROW-12980 so we may want to hold off on this one, as that one looks close.

Indeed. Looks great! The only thing I would recommend would be to use MakeTemporal from ARROW-12980 instead of MakeTimeTemporal.

@lidavidm
Copy link
Member Author

ARROW-12980 looks like it should be close so I'll rebase on top of that now.

@rok
Copy link
Member

rok commented Aug 17, 2021

@lidavidm ARROW-12980 was merged.

@lidavidm
Copy link
Member Author

Thanks for the heads up! I'll get this rebased soon.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After ARROW-12980, this now works with timezones? (or can work)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doh, I forgot to update the docstring. Yes, they all work with timezones and there are tests. I've updated all the docstrings.

@pitrou
Copy link
Member

pitrou commented Aug 18, 2021

Why aren't these implemented as cast kernels instead?

@lidavidm
Copy link
Member Author

Ah, that makes sense. I'll update this to be a cast instead, though I think we are going to need some refactoring of the utilities.

@lidavidm lidavidm marked this pull request as draft August 18, 2021 16:14
@lidavidm
Copy link
Member Author

There's already a cast from timestamp to date32/date64, however, placing this implementation there would change semantics a little bit:

>>> timestamps
<pyarrow.lib.TimestampArray object at 0x7f676cd86fa0>
[
  1970-01-01 00:00:59.123456789,
  2000-02-29 23:23:23.999999999,
  1899-01-01 00:59:20.001001001
]
>>> timestamps.cast(pa.date64(), safe=False)
<pyarrow.lib.Date64Array object at 0x7f676cf277c0>
[
  1970-01-01,
  2000-02-29,
  1899-01-02
]
>>> pc.date64(timestamps)
<pyarrow.lib.Date64Array object at 0x7f676cf27d00>
[
  1970-01-01,
  2000-02-29,
  1899-01-01
]

Also Python doesn't expose a way to set only allow_time_truncate (though maybe we should just allow the cast if we refactor things here). But that does raise the question of whether a cast is appropriate for this operation (since it seems like casting is generally interpreted more like reinterpret_cast, while this is an actual conversion). Also for instance a cast of a timestamp-with-timezone right now is quite different than what this does.

@rok
Copy link
Member

rok commented Aug 18, 2021

Date extraction could also be thought of as a rounding to a day interval.

@jorisvandenbossche
Copy link
Member

Personally, I think having a separate (non-cast) kernel to extract those components make sense from a user perspective (but can of course share implementation), and complementing the other timestamp component extraction kernels we already have.
As @lidavidm also mentions, this is in general "unsafe" cast, but when you explicitly want to extract a time/date component, it is obvious you want this and having to allow an unsafe cast feels unnecessarily.

There's already a cast from timestamp to date32/date64, however, placing this implementation there would change semantics a little bit:

I would say that the current casting semantics are wrong and should be fixed? (in any case, the "extracted" date is clearly wrong, whether that's seen as a consequence of the "unsafe" cast or not is to be discussed I suppose)

There are actually two "unsafe" steps in this conversion it seems (which explains the wrong part):

arr = pa.array(["1970-01-01 00:00:59.123456789","2000-02-29 23:23:23.999999999","1899-01-01 00:59:20.001001001"]).cast(pa.timestamp("ns"))

>>> arr.cast(pa.date64())
...
ArrowInvalid: Casting from timestamp[ns] to date64[ms] would lose data: 59123456789
../src/arrow/compute/kernels/scalar_cast_temporal.cc:178  (ShiftTime<int64_t, int64_t>(ctx, conversion.first, conversion.second, input, output))

# that error is actually coming from a conversion to milliseconds
# (and you don't really care about the part being lost for conversion to date ..)
>>> arr.cast(pa.timestamp("ms"))
...
ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 59123456789

# when ignoring this lost part in conversion to ms, then casting to date gives another error:
>>> arr.cast(pa.timestamp("ms"), safe=False).cast(pa.date64())
...
ArrowInvalid: Timestamp value had non-zero intraday milliseconds

And I suppose that when those intraday milliseconds are ignored by doing safe=False, we do a simple round to get rid of those:

const int64_t remainder = out_data[i] % kMillisecondsInDay;
if (ARROW_PREDICT_FALSE(!options.allow_time_truncate && remainder > 0)) {
return Status::Invalid("Timestamp value had non-zero intraday milliseconds");
}
out_data[i] -= remainder;

By subtracting the remainder we basically round towards zero (it seems C++ module operator behaves differently as Python when involving negative integers -12 % 10 = -2 in C++ and -12 % 10 = 8 in Python), which means that for negative values we are rounding up, and we should round down instead? (that could be seen as a bug fix?)

@pitrou
Copy link
Member

pitrou commented Aug 19, 2021

But that does raise the question of whether a cast is appropriate for this operation (since it seems like casting is generally interpreted more like reinterpret_cast, while this is an actual conversion).

I'm not sure I understand what you mean with the reinterpret_cast comment. Our casts are definitely conversions (see the Decimal -> Decimal casts for example).

@pitrou
Copy link
Member

pitrou commented Aug 19, 2021

I think having a separate (non-cast) kernel to extract those components make sense from a user perspective (but can of course share implementation), and complementing the other timestamp component extraction kernels we already have

My problem is that I don't even understand the difference they're supposed to make to the "normal" casts. Are those (supposedly) different semantics really desired?

I would favour fixing/improving the currently implemented casts, if necessary.

@lidavidm
Copy link
Member Author

But that does raise the question of whether a cast is appropriate for this operation (since it seems like casting is generally interpreted more like reinterpret_cast, while this is an actual conversion).

I'm not sure I understand what you mean with the reinterpret_cast comment. Our casts are definitely conversions (see the Decimal -> Decimal casts for example).

I guess I was wondering if the current behavior (just 'reinterpreting' the timestamp) is still useful, it sounds like not.

I think the path here is to make the kernels in this PR into safe casts, so that users don't have to specify an unsafe cast. (In theory you could get away with just allow_time_truncate but I think there's no way to pass that in Python.)

@lidavidm lidavidm marked this pull request as ready for review August 23, 2021 13:29
@lidavidm lidavidm marked this pull request as draft September 9, 2021 12:33
@lidavidm lidavidm force-pushed the arrow-13549 branch 2 times, most recently from 2b9b2a2 to bda1eae Compare September 9, 2021 13:29
@lidavidm lidavidm marked this pull request as ready for review September 9, 2021 13:31
@lidavidm
Copy link
Member Author

lidavidm commented Sep 9, 2021

This is now implemented as a cast and is rebased. For casting timestamp->time, we do check the truncation/overflow flags in some scenarios (e.g. if you want to cast a nanosecond timestamp to a time32).

@jorisvandenbossche
Copy link
Member

BTW, I stumbled on https://issues.apache.org/jira/browse/ARROW-10213, so it seems @lidavidm you already opened an issue about this buggy (round instead of extract, see above #10933 (comment)) behaviour a while ago .. :-)
So I think this PR is now closing that issue as well?

@lidavidm
Copy link
Member Author

lidavidm commented Sep 9, 2021

Whoops! Yeah, let me link/close-as-duplicate the issues and update the description. Thanks for finding this.

@jorisvandenbossche
Copy link
Member

I guess I was wondering if the current behavior (just 'reinterpreting' the timestamp) is still useful, it sounds like not.

I think there can be some value in the current meaning of "safe" cast of timestamp to date. For example, it would allow you to convert timestamps-which-are-actually-dates safely to dates, without loosing any time information. While if we make the safe cast to ignore the time values by default, the only way to do this is by first checking if all hour/minute/second/subsecond components are zero.
In the end, this is very similar to our casting rule of floats to ints: by default a safe cast only allows it for "round" floats without decimals, and forcing an unsafe cast actually rounds / discards decimals).

@lidavidm lidavidm changed the title ARROW-13549: [C++] Add date/time extraction functions ARROW-13549: [C++] Add casts from timestamp to date/time Sep 16, 2021
@lidavidm
Copy link
Member Author

I've rebased this again.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just one suggestion

}

template <typename T>
enable_if_timestamp<T, const std::string> GetInputTimezone(const DataType& type) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this conflict with the non-template GetInputTimezone(const DataType&) above? At least it seems there's a potential for confusion. Perhaps we can simply reconcile both implementations? For example:

static inline const std::string& GetInputTimezone(const DataType& type) {
  static const std::string no_timezone = "";
  switch (type.id()) {
    case Type::TIMESTAMP:
      return checked_cast<const TimestampType&>(type).timezone();
    default:
      return no_timezone;
  }
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, fixed. I suppose it worked before since it was called as GetInputTimezone<T>(...).

@pitrou
Copy link
Member

pitrou commented Sep 27, 2021

@jorisvandenbossche Any further comments on this?

@pitrou pitrou closed this in 2bf4421 Sep 27, 2021
ViniciusSouzaRoque pushed a commit to s1mbi0se/arrow that referenced this pull request Oct 20, 2021
Closes apache#10933 from lidavidm/arrow-13549

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants