-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-14821: [R] Implement bindings for lubridate's floor_date, ceiling_date, and round_date #12154
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
|
This looks great @djnavarro thanks for doing it! Regarding your TODOs:
Perhaps you can put in an R shim to get lubridate matching logic and open a Jira to fix this in c++ later on.
I found dates around 1900s quite problematic when testing against Pandas as they don't apply some historic offset changes. I'd suggest aiming at more modern times first.
This would probably be a good c++ Jira candidate too. Provide some examples if you have them handy :). |
|
Another thing regarding timezones - only timestamps in arrow have timezones. Dates do not. |
paleolimbot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few notes! I'm new to creating bindings, too, so take everything with a grain of salt. I noted a few style guide things...I think the linter check will give you notes on this when it runs as well.
r/R/util.R
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if statements elsewhere in arrow R code are always followed by a space (if (...) { ... })
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can test this more directly with:
expect_error(
call_binding("round_date", Expression$scalar(Sys.time()), "61 seconds"),
"Rounding with second > 60 is not supported"
)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! I hadn't realised call_binding() existed 🤦🏻♀️ 😁
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would be easier to find at the top of the file next to test_df
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The right way to go, I think!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, I agree here. It might be nice to add a small comment in those test cases about this (so future-us remembers why we have as.Date() at the end there
r/R/dplyr-funcs-datetime.R
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| } | |
| }) |
|
@jonkeane Can you approve the workflows to run? |
I approved them |
r/R/util.R
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch on these! What do you think about rephrasing a little to say what the user should do, and use the word "must" in there somewhere, kinda like this: https://style.tidyverse.org/error-messages.html ?
(note: feel free to say "no"; we don't do this in every single case)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good thought. That said, I'm tempted to go with the suggestion @rok makes and allow these cases. It's not inconceivable that people might have a good reason to round to 90 minutes. I can't think of a reason why existing code would depend on this lubridate error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see where you are coming from here. I guess, generally, the aim when implementing the functions for non-standard evaluation code blocks is to (as closely as possible unless there's good reason not to!) replicate the original. However, there are some rare cases when I've seen us not do that; for example, our mapping to lubridate::second() supports nanoseconds as we didn't see the harm in having more precise results, and so the unit test for that binding looks like this:
arrow/r/tests/testthat/test-dplyr-funcs-datetime.R
Lines 474 to 483 in 0c1fd88
| test_that("extract second from timestamp", { | |
| compare_dplyr_binding( | |
| .input %>% | |
| mutate(x = second(datetime)) %>% | |
| collect(), | |
| test_df, | |
| # arrow supports nanosecond resolution but lubridate does not | |
| tolerance = 1e-6 | |
| ) | |
| }) |
That said, I also can't think of a reason someone else's code would depend on raising an error but I'm not confident that that would mean it won't - people have all sorts of unusual workflows.
There are multiple ways of calling Arrow C++ compute functions from R via the dplyr/non-standard evaluation code - either using the bindings to the lubridate etc equivalent, or calling the Arrow C++ compute function directly by prefixing the function name with arrow_ (so in this case it'd be arrow_round_temporal; if you've not come across this before, check it out here: https://arrow.apache.org/cookbook/r/manipulating-data---tables.html#use-arrow-functions-in-dplyr-verbs-in-arrow).
So if people use that directly they'd still have access to the full functionality. I'm not sure what to suggest here - would be good to get input from @jonkeane or @nealrichardson
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a really good point. Maybe erring on the side of caution and mirroring lubridate as closely as possible would be best? It's going to come up again as an issue once I try to sort out the issue associated with the change_on_boundary argument, actually
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I'd err on the side of caution but I can see the argument for it not mattering - let's see what the more experienced R devs say!
Just to check; what's the issue with change_on_boundary? If it's the fact that it's not supported in the C++ yet, it's fine to open a C++ ticket for that to be implemented, raise an error in the R code saying that parameter is not yet supported, and open an R ticket for implementing it later once the C++ functionality is in place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thisisnic: So yeah... I haven't finished my dive into the lubridate code on the change_on_boundary issue, but (I think!) the TL;DR is that it's an edge case for how ceiling_date() handles time points that sit at the start of a day. Should the ceiling of 2022-01-17 00:00:00 be 2022-01-18 00:00 or 2022-01-17 00:00? Conceptually, "January 17" is an interval of time that starts at 00:00 and ends at 23:59, so a strict interpretation suggests that all time points within that interval should ceiling to midnight of January 18. That's what lubridate currently does, but...
Prior to lubridate 1.6.0 ceiling_date() took a different approach, and AFAICT the C++ round_temporal() function does the same thing. If we were to ignore the strict interpretation of ISO 8601, we can think of "midnight moments" as analogous to integer values spaced along the real line, and round accordingly. Under this approach, which I think is more intuitive but does violate ISO 8601, we don't ceiling an integer up to the next integer, and so 2022-01-17 00:00 stays 2022-01-17 00:00 because it's already an "integer valued date".
The downside to violating ISO 8601 -- in addition to the obvious problem that it violates ISO 8601 -- is that it means we're no longer in agreement with lubridate. The downside to adhering to it is that you get counterintuitive result that taking a ceiling of a Date object always returns the next day:
ceiling_date(ymd("2022-01-17"), "day") # returns 2022-01-18Long story short, the change_on_boundary argument exists so that lubridate can revert to its older behaviour. Given that the C++ round_temporal() only matches the older lubridate behaviour (again, assuming I've understood the code properly), I'll need to write a shim that supports the change_on_boundary anyway. If I don't we're not mirroring lubridate.
I think the right approach is probably the same one @rok suggested for week_start. Short term the solution is to write a shim to support it on the R side, and open a JIRA ticket for the C++ library
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What y'all have already said is great, and totally agree with this being an edge case for if extending just a bit is fine or if we should error in the same way.
How easy/hard would it be to construct the arrow_round_temporal() call if someone did want to do that?
r/R/util.R
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about we allow such cases (minute > 60)? C++ implementation supports them and I can imagine I'd want to round to e.g. 90 minutes or 36 hours sometimes.
Users would probably not mind as we would not reduce capabilities.
|
|
|
I think I've taken this as far as I can right now? I had been hoping to write shims to support the week_start_shim <- function(function_name, x, week_start, opts) {
offset <- Expression$create(Scalar$create(as.difftime(week_start - 4, units = "days")))
x <- Expression$create("add", x, offset)
x <- Expression$create(function_name, x, options = opts)
return(Expression$create("subtract", x, offset))
} I can't get anything like this to work because (as far as I can tell) we don't currently have the ability to add or subtract durations from date/time data in Arrow. I imagine I'm missing something obvious here! |
Would it make sense for the shims to be in R? |
If the functionality isn't there, I'd be tempted to put this PR aside for the moment - we're aiming to have a larger number of the |
|
My thinking at this point is that (if this is consistent with the general etiquette here!) I'd like to (1) work out why the build is failing two checks and fix those, (2) merge the passing version, (3) open tickets for the missing functionality, one for the C++ side and one for the R side (and assign myself to the R one 😁 ). My reasons for suggesting this are:
Possible reasons not to:
I'm happy to follow advice here. I'm totally fine with whatever approach more experienced folks think is best! 🙂 |
|
In case it helps, you could try searching for "ARROW-13168" in the test files as there are some tests we skip on Windows because of the issue mentioned in that ticket - not sure if this is related or not though. |
|
@djnavarro I've opened a jira for the |
|
@djnavarro Does this unblock this PR or do we need to solve Let me know if I can help somehow. |
|
@rok sorry for not getting to this earlier! For the this_thursday <- ymd("2022-03-10")
floor_date(this_thursday, unit = "week", week_start = 1) # returns "2022-03-07"
floor_date(this_thursday, unit = "week", week_start = 2) # returns "2022-03-08"It's still possible I can handle it at the level of the R binding (I've learned a bit more in the last couple of month!) but I'm wondering if the R level is the best place to do this or if it's better to implement it at the C++ level? For the |
|
No worries @djnavarro :) About the |
|
Well, I can't think of a good reason why anyone would set |
|
I'd raise on I'm not sure what you mean by edge cases. Could you sketch it with some pseudocode? |
|
Maybe it's easiest to illustrate with actual R code. The behaviour I need to emulate with regards to the # a datetime object at the midnight boundary, and a date object with no time
midnight_time <- as.POSIXct(strptime("2022-03-10 00:00:00", format = "%F %T"))
timeless_date <- as.Date("2022-03-10")
# ceiling date applied to the datetime
ceiling_date(midnight_time, unit = "day", change_on_boundary = NULL) # returns the 10th
ceiling_date(midnight_time, unit = "day", change_on_boundary = TRUE) # returns the 11th
ceiling_date(midnight_time, unit = "day", change_on_boundary = FALSE) # returns the 10th
# ceiling date applied to the date
ceiling_date(timeless_date, unit = "day", change_on_boundary = NULL) # returns the 11th
ceiling_date(timeless_date, unit = "day", change_on_boundary = TRUE) # returns the 11th
ceiling_date(timeless_date, unit = "day", change_on_boundary = FALSE) # returns the 10thThis pattern only affects times and dates deemed to be on the boundary with respect to the rounding unit. Any time past midnight and Currently, the call_function("add_checked", Scalar$create(midnight_time), Scalar$create(as.difftime("24:00:00")))
EDIT: Never mind, I worked it out. 🤦🏻♀️ I can handle the |
|
That's great news @djnavarro! Does solving this with subtraction work correctly around DST changes? e.g. (stealing an example from here): We should also consider performance implications here. Adding an option to the c++ kernels might be better than calling two different c++ kernels. However that should be a separate ticket. |
|
@rok Agreed! My thinking at the moment is to see how far I can push this from the R side so that we have a clear idea of exactly what behaviour we're trying to emulate, but we'll certainly want to think about how to improve performance because I can guarantee that the code I'm writing on the R side isn't very efficient. Maybe the right thing to do is to arrange a voice chat sometime soon? |
Perfect. Understanding the behaviour is enough at this stage. And current situation PR is pretty efficient IMO. |
…nternal consistency test of timezone-local rounding
…test subsecond rounding \facepalm
|
@djnavarro I've rebased to fix the C++ conflicts and removed the ARROW-16412 workarounds. |
|
Okay, I think this is actually good? The failing check doesn't appear to be related to this PR as far as I can tell, but notably one of them is timezone related so there might be something going on that I am not quite understanding |
nealrichardson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for persisting!
|
Thanks for doing this @djnavarro ! 😄 |
|
Woohoo! 🎉 |
|
Benchmark runs are scheduled for baseline = 1f03f12 and contender = b0734e6. b0734e6 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
…Hub issue numbers (#34260) Rewrite the Jira issue numbers to the GitHub issue numbers, so that the GitHub issue numbers are automatically linked to the issues by pkgdown's auto-linking feature. Issue numbers have been rewritten based on the following correspondence. Also, the pkgdown settings have been changed and updated to link to GitHub. I generated the Changelog page using the `pkgdown::build_news()` function and verified that the links work correctly. --- ARROW-6338 #5198 ARROW-6364 #5201 ARROW-6323 #5169 ARROW-6278 #5141 ARROW-6360 #5329 ARROW-6533 #5450 ARROW-6348 #5223 ARROW-6337 #5399 ARROW-10850 #9128 ARROW-10624 #9092 ARROW-10386 #8549 ARROW-6994 #23308 ARROW-12774 #10320 ARROW-12670 #10287 ARROW-16828 #13484 ARROW-14989 #13482 ARROW-16977 #13514 ARROW-13404 #10999 ARROW-16887 #13601 ARROW-15906 #13206 ARROW-15280 #13171 ARROW-16144 #13183 ARROW-16511 #13105 ARROW-16085 #13088 ARROW-16715 #13555 ARROW-16268 #13550 ARROW-16700 #13518 ARROW-16807 #13583 ARROW-16871 #13517 ARROW-16415 #13190 ARROW-14821 #12154 ARROW-16439 #13174 ARROW-16394 #13118 ARROW-16516 #13163 ARROW-16395 #13627 ARROW-14848 #12589 ARROW-16407 #13196 ARROW-16653 #13506 ARROW-14575 #13160 ARROW-15271 #13170 ARROW-16703 #13650 ARROW-16444 #13397 ARROW-15016 #13541 ARROW-16776 #13563 ARROW-15622 #13090 ARROW-18131 #14484 ARROW-18305 #14581 ARROW-18285 #14615 * Closes: #33631 Authored-by: SHIMA Tatsuya <ts1s1andn@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
This patch provides dplyr bindings to for lubridate functions
floor_date(),ceiling_date(), andround_date(). This is my first attempt at writing a patch, so my apologies if I've made any errors 🙂Supported functionality:
unit = .001 secondsas an alias forunit = 1 millisecondsec,second,secondsall matchsecond### Major problems not yet addressed:- Does not yet support theweek_startargument, and implicitly fixesweek_start = 4- Does not yet mirror lubridate handling of timezonesI'd prefer to fix these two issues before merging, but I'm uncertain how best to handle them. Any advice would be appreciated!### Minor things not yet addressed- During rounding lubridate sometimes coerces Date objects to POSIXct. This is not mirrored in the arrow bindings: date32 classes remain date32 classes. This introduces minor differences in rounding in some cases- Does not yet support thechange_on_boundaryargument toceiling_date(). It's a small discrepancy, but it means that the default behaviour of the arrow dplyr binding mirrors lubridate prior to v1.6.0EDIT: issues now addressed!