Skip to content

Conversation

@lidavidm
Copy link
Member

@lidavidm lidavidm commented Aug 18, 2021

Implements kernels to get the number of {years, months, days and millis, days, hours, minutes, seconds, millis, micros, nanos} between two timestamps. Everything returns either Int64 or DayTimeInterval.

@lidavidm
Copy link
Member Author

I think Python doesn't have bindings for the interval types - that should be fixed in a separate issue I would think.

@github-actions
Copy link

@ianmcook
Copy link
Member

Would it be straightforward to add a week variant of this that takes DayOfWeekOptions?

@lidavidm
Copy link
Member Author

Would it be straightforward to add a week variant of this that takes DayOfWeekOptions?

It should be doable. I'll update the PR.

@lidavidm lidavidm force-pushed the arrow-13548 branch 2 times, most recently from 5bcb371 to f4ce70e Compare September 9, 2021 15:34
@nealrichardson
Copy link
Member

@rok can you take a look here?

@lidavidm
Copy link
Member Author

lidavidm commented Sep 15, 2021

I'll *rebase this in a little while

@lidavidm lidavidm force-pushed the arrow-13548 branch 2 times, most recently from 55dcad6 to 342e6d8 Compare September 15, 2021 15:50
Copy link
Member

@rok rok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @lidavidm !
Couple of comments :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a better suggestion, but boundaries sounds very implementation-like.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about this?

const FunctionDoc years_between_doc{
    "Compute the number of years between two timestamps",
    ("Returns the number of year boundaries crossed from the first timestamp to the\n"
     "second. That is, the difference is calculated as if the timestamps were\n"
     "truncated to the year.\n"
     "Null values emit null."),
    {"start", "end"}};

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better. How about using 'complete years passed/elapsed/..' instead of 'year boundaries crossed'?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose the reason why I'm using 'boundaries' is because the difference between 1998-12-01 and 1999-01-1 is '1 year' here even though a complete year hasn't passed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I thought the entire year needs to pass. Right, boundaries are better yeah.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps you could find the first week starting weekday after the starting date and then calculate the difference // 7?

@lidavidm
Copy link
Member Author

Changes:

  • Simplified the week kernel (thanks for the suggestion)
  • Added docs
  • Added kernels to get differences as month interval and as month-day-nano interval
    Note the interval types are not bound in Python, see ARROW-14018.

@rok
Copy link
Member

rok commented Sep 16, 2021

I'll take another look tomorrow. But this already looks pretty good.

Comment on lines 1386 to 1396
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sometimes you use _interval_ and sometimes you don't. I realize it's due to the type returned but it still bothers my eyes :). Probably best keep it as is.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I wasn't sure how to disambiguate these. Note that month_interval_between and months_between are basically redundant, except MonthIntervalType is int32_t, not int64_t.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, is it useful to have a int64_t return?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not particularly, except for consistency; that said for Python, we don't have the interval types bound yet, so there'll be a period where this kernel is unavailable if we remove this version. I think overall it's not worth keeping it though so I'll remove it.

@rok
Copy link
Member

rok commented Sep 17, 2021

LGTM @lidavidm!
Perhaps it would be good to ask for another review - I might be missing something.

@lidavidm lidavidm requested a review from pitrou September 20, 2021 13:27
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a recommended format for months? Should we use "M" instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was following the existing formatting of the array printers, but M probably makes more sense/I can align these with strftime.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I suppose that doesn't make any sense. M seems preferable to m though so I'll do that.

Comment on lines 1386 to 1396
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, is it useful to have a int64_t return?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the "notes" column is empty, then can omit it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"has a defined timezone" is less weird than "is zoned".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say "because the local year is the same even though the UTC years would be different".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is MakeArrayFromScalar necessary or is the kernel able to broadcast a scalar input?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At some point, we may instead define a Validate() method on FunctionOptions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, but "Extract" refers to component extraction, which this isn't doing?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about: Return the number of year boundaries crossed from `start` to `end`.

@lidavidm lidavidm force-pushed the arrow-13548 branch 2 times, most recently from 20742d5 to 5a5421d Compare September 27, 2021 15:44
Co-authored-by: Rok Mihevc <rok@mihevc.org>
@lidavidm
Copy link
Member Author

I've rebased this.

Copy link
Member

@rok rok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great @lidavidm !

Do we want to support this for time/date/32/64 at some point?

Comment on lines 867 to 870
static int64_t GetQuarters(const year_month_day& ymd) {
return static_cast<int64_t>(static_cast<int32_t>(ymd.year())) * 4 +
(static_cast<uint32_t>(ymd.month()) - 1) / 3;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use this function for the quarter extractor and delete this logic there.

Copy link
Member

@rok rok Sep 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean this:

template <typename T, typename Arg0>
T Call(KernelContext*, Arg0 arg, Status*) const {
const auto ymd =
year_month_day(floor<days>(localizer_.template ConvertTimePoint<Duration>(arg)));
return static_cast<T>((static_cast<const uint32_t>(ymd.month()) - 1) / 3 + 1);
}

@lidavidm
Copy link
Member Author

Looks great @lidavidm !

Do we want to support this for time/date/32/64 at some point?

Thanks for the review. Yeah, let me add support for the other temporal types - I suppose all that's merged by now.

@lidavidm
Copy link
Member Author

I've added the support for date/time types (years, months, etc. don't accept time types).

Copy link
Member

@rok rok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

&default_day_of_week_options, DayOfWeekState::Init);
DCHECK_OK(registry->AddFunction(std::move(weeks_between)));

auto day_time_between =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: day_time_interval_between would be more consistent. Same for month_day_nano_interval_between. I don't feel strongly about it, just noticing :).

@lidavidm
Copy link
Member Author

Thanks, fixed those issues.

@rok
Copy link
Member

rok commented Sep 28, 2021

I think this is ready to go.

@pitrou pitrou closed this in 661c7d7 Sep 29, 2021
@lidavidm lidavidm deleted the arrow-13548 branch September 29, 2021 11:57
ViniciusSouzaRoque pushed a commit to s1mbi0se/arrow that referenced this pull request Oct 20, 2021
Implements kernels to get the number of {years, months, days and millis, days, hours, minutes, seconds, millis, micros, nanos} between two timestamps. Everything returns either Int64 or DayTimeInterval.

Closes apache#10960 from lidavidm/arrow-13548

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants