ARROW-14810 [R] Implement bindings for lubridate's `date_decimal()` and `decimal_date()` #12707

dragosmg · 2022-03-24T15:56:38Z

This would allow the following operations:

library(dplyr, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)
library(arrow, warn.conflicts = FALSE)

test_df <- tibble(
  a = c(2007.38998954347, 1970.77732069883, 2020.96061799722,
        2009.43465948477, 1975.71251467871, NA),
  b = as.POSIXct(
    c("2007-05-23 08:18:30", "1970-10-11 17:19:45", "2020-12-17 14:04:06",
      "2009-06-08 15:37:01", "1975-09-18 01:37:42", NA)
  )
)

test_df %>%
  mutate(
    decimal_date_from_date = decimal_date(b),
    date_from_decimal = date_decimal(a)
  )
#> # A tibble: 6 × 4
#>       a b                   decimal_date_from_date date_from_decimal  
#>   <dbl> <dttm>                               <dbl> <dttm>             
#> 1 2007. 2007-05-23 08:18:30                  2007. 2007-05-23 08:18:30
#> 2 1971. 1970-10-11 17:19:45                  1971. 1970-10-11 17:19:45
#> 3 2021. 2020-12-17 14:04:06                  2021. 2020-12-17 14:04:06
#> 4 2009. 2009-06-08 15:37:01                  2009. 2009-06-08 15:37:01
#> 5 1976. 1975-09-18 01:37:42                  1976. 1975-09-18 01:37:42
#> 6   NA  NA                                     NA  NA

test_df %>%
  arrow_table() %>% 
  mutate(
    decimal_date_from_date = decimal_date(b),
    date_from_decimal = date_decimal(a)
  ) %>%
  collect()
#> # A tibble: 6 × 4
#>       a b                   decimal_date_from_date date_from_decimal  
#>   <dbl> <dttm>                               <dbl> <dttm>             
#> 1 2007. 2007-05-23 08:18:30                  2007. 2007-05-23 08:18:30
#> 2 1971. 1970-10-11 17:19:45                  1971. 1970-10-11 17:19:45
#> 3 2021. 2020-12-17 14:04:06                  2021. 2020-12-17 14:04:06
#> 4 2009. 2009-06-08 15:37:01                  2009. 2009-06-08 15:37:01
#> 5 1976. 1975-09-18 01:37:42                  1976. 1975-09-18 01:37:42
#> 6   NA  NA                                     NA  NA

^{Created on 2022-03-28 by the reprex package (v2.0.1)}

jonkeane

Thanks! A few comments for you

r/R/dplyr-funcs-datetime.R

r/tests/testthat/test-dplyr-funcs-datetime.R

r/R/dplyr-funcs-datetime.R

github-actions · 2022-03-28T20:40:22Z

https://issues.apache.org/jira/browse/ARROW-14810

github-actions · 2022-03-28T20:40:24Z

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

jonkeane

A few questions, but this is looking good!

jonkeane · 2022-03-29T16:36:41Z

r/R/dplyr-funcs-datetime.R

Could these be Expression$scalar(31622400L)?

It would be nice to do a bit of benchmarking (even micro bench with {bench}) on these implementation details (honestly, even the old implementation of doing strptime more...) I would be curious if any of them are better or worse for performance (my gut says this is quicker than strptime a bunch, but I might be totally wrong about that!)

You are 100% right. The newer implementation is quicker. By how much - it depends, but always quicker. In general it takes half the time.

What kind of scale are those numbers on? Could you post or link to the code you used to run that, I would be interested to see how you got there!

https://gist.github.com/dragosmg/673a5b23a434adbe0d4e972d7b8a4620

summary(test1, relative = TRUE) #> A tibble: 2 × 13 #> expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc #> <bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list> #> 1 new_implementation 1 1 1.40 1 1.03 94 6 1.32s <tibble> <Rprofmem> <bench_tm> <tibble> #> 2 old_implementation 1.46 1.46 1 1.27 1 92 8 1.81s <tibble> <Rprofmem> <bench_tm> <tibble> test1 #> A tibble: 2 × 13 #> expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc #> <bch:expr> <bch:tm> <bch:t> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list> #> 1 new_implementation 12.7ms 13.3ms 71.1 76.1KB 4.54 94 6 1.32s <tibble> <Rprofmem> <bench_tm> <tibble> #> 2 old_implementation 18.5ms 19.4ms 50.8 96.9KB 4.42 92 8 1.81s <tibble> <Rprofmem> <bench_tm> <tibble>

I don't have any reason to believe this would be different at larger scales, but for benchmarks like these we probably should use larger sets of data than a handful of rows. It depends on what we're looking at, but 10k-100k rows is more reasonable for something like this

r/R/dplyr-funcs-datetime.R

paleolimbot

Just two tiny things! It looks like all the comments from Jon were addressed...it looks great!

r/R/dplyr-funcs-datetime.R

r/NEWS.md

ursabot · 2022-04-11T22:41:45Z

Benchmark runs are scheduled for baseline = 633687c and contender = 80bba5c. 80bba5c is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.59% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.26% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/485| 80bba5cb ec2-t3-xlarge-us-east-2>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/470| 80bba5cb test-mac-arm>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/471| 80bba5cb ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/480| 80bba5cb ursa-thinkcentre-m75q>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/484| 633687c1 ec2-t3-xlarge-us-east-2>
[Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/469| 633687c1 test-mac-arm>
[Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/470| 633687c1 ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/479| 633687c1 ursa-thinkcentre-m75q>
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

eitsupi · 2022-08-16T06:42:22Z

r/R/dplyr-funcs-datetime.R

+    # if time1 or time2 are timestamps they cannot be expressed in "s" /seconds
+    # otherwise they cannot be added subtracted with durations
+    # TODO delete the casting to "us" once
+    # https://issues.apache.org/jira/browse/ARROW-16060 is solved
+    if (inherits(time1, "Expression") &&
+        time1$type_id() %in% Type[c("TIMESTAMP")] && time1$type()$unit() != 2L) {
+      time1 <- build_expr("cast", time1, options = cast_options(to_type = timestamp("us")))
+    }
+
+    if (inherits(time2, "Expression") &&
+        time2$type_id() %in% Type[c("TIMESTAMP")] && time2$type()$unit() != 2L) {
+      time2 <- build_expr("cast", time2, options = cast_options(to_type = timestamp("us")))
+    }


Can we update this section since it looks like ARROW-16060 has already been solved?

Absolutely, feel free to open a Jira + PR to do so if we don't already have one

dragosmg force-pushed the decimal_dates branch from e5a888f to f828389 Compare March 25, 2022 10:39

github-actions bot added the Component: R label Mar 25, 2022

dragosmg marked this pull request as ready for review March 28, 2022 16:46

jonkeane requested changes Mar 28, 2022

View reviewed changes

dragosmg requested a review from jonkeane March 29, 2022 12:45

jonkeane reviewed Mar 29, 2022

View reviewed changes

dragosmg requested a review from jonkeane March 29, 2022 19:45

dragosmg added 12 commits March 30, 2022 10:29

first pass at decimal dates

bac338f

NEWS + testing df

21dd80d

workaround for subtraction between date32 and timestamp

2ee535d

unit tests

babb0fe

unit tests

c146ef3

🤦

da3a070

improved comment

99f030b

removed extra empty row

b58aebf

cleanup + more test cases

83fbde6

floor instead of safe = FALSE casting + simplified implementation

8bbe52f

added reference to follow-up ticket

6f0ed8d

create an Expression$scalar() instead of casting one

ae3114e

dragosmg force-pushed the decimal_dates branch from 8d446d4 to ae3114e Compare March 30, 2022 09:29

paleolimbot approved these changes Apr 4, 2022

View reviewed changes

r/R/dplyr-funcs-datetime.R Outdated Show resolved Hide resolved

r/NEWS.md Outdated Show resolved Hide resolved

news update + style

63445b7

jonkeane closed this in 80bba5c Apr 11, 2022

dragosmg deleted the decimal_dates branch April 12, 2022 12:46

eitsupi reviewed Aug 16, 2022

View reviewed changes

asfimport mentioned this pull request Aug 16, 2022

[R] Implement bindings for lubridate's date_decimal and decimal_date #30344

Closed

ARROW-14810 [R] Implement bindings for lubridate's date_decimal() and decimal_date() #12707

ARROW-14810 [R] Implement bindings for lubridate's date_decimal() and decimal_date() #12707

Uh oh!

Conversation

dragosmg commented Mar 24, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jonkeane left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Mar 28, 2022

Uh oh!

github-actions bot commented Mar 28, 2022

Uh oh!

jonkeane left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

paleolimbot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

ursabot commented Apr 11, 2022

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

ARROW-14810 [R] Implement bindings for lubridate's `date_decimal()` and `decimal_date()` #12707

ARROW-14810 [R] Implement bindings for lubridate's `date_decimal()` and `decimal_date()` #12707

dragosmg commented Mar 24, 2022 •

edited

Loading