Skip to content

Conversation

@dragosmg
Copy link
Contributor

@dragosmg dragosmg commented Mar 24, 2022

This would allow the following operations:

library(dplyr, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)
library(arrow, warn.conflicts = FALSE)

test_df <- tibble(
  a = c(2007.38998954347, 1970.77732069883, 2020.96061799722,
        2009.43465948477, 1975.71251467871, NA),
  b = as.POSIXct(
    c("2007-05-23 08:18:30", "1970-10-11 17:19:45", "2020-12-17 14:04:06",
      "2009-06-08 15:37:01", "1975-09-18 01:37:42", NA)
  )
)

test_df %>%
  mutate(
    decimal_date_from_date = decimal_date(b),
    date_from_decimal = date_decimal(a)
  )
#> # A tibble: 6 × 4
#>       a b                   decimal_date_from_date date_from_decimal  
#>   <dbl> <dttm>                               <dbl> <dttm>             
#> 1 2007. 2007-05-23 08:18:30                  2007. 2007-05-23 08:18:30
#> 2 1971. 1970-10-11 17:19:45                  1971. 1970-10-11 17:19:45
#> 3 2021. 2020-12-17 14:04:06                  2021. 2020-12-17 14:04:06
#> 4 2009. 2009-06-08 15:37:01                  2009. 2009-06-08 15:37:01
#> 5 1976. 1975-09-18 01:37:42                  1976. 1975-09-18 01:37:42
#> 6   NA  NA                                     NA  NA

test_df %>%
  arrow_table() %>% 
  mutate(
    decimal_date_from_date = decimal_date(b),
    date_from_decimal = date_decimal(a)
  ) %>%
  collect()
#> # A tibble: 6 × 4
#>       a b                   decimal_date_from_date date_from_decimal  
#>   <dbl> <dttm>                               <dbl> <dttm>             
#> 1 2007. 2007-05-23 08:18:30                  2007. 2007-05-23 08:18:30
#> 2 1971. 1970-10-11 17:19:45                  1971. 1970-10-11 17:19:45
#> 3 2021. 2020-12-17 14:04:06                  2021. 2020-12-17 14:04:06
#> 4 2009. 2009-06-08 15:37:01                  2009. 2009-06-08 15:37:01
#> 5 1976. 1975-09-18 01:37:42                  1976. 1975-09-18 01:37:42
#> 6   NA  NA                                     NA  NA

Created on 2022-03-28 by the reprex package (v2.0.1)

@dragosmg dragosmg marked this pull request as ready for review March 28, 2022 16:46
Copy link
Member

@jonkeane jonkeane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! A few comments for you

@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@dragosmg dragosmg requested a review from jonkeane March 29, 2022 12:45
Copy link
Member

@jonkeane jonkeane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions, but this is looking good!

Comment on lines 324 to 325
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could these be Expression$scalar(31622400L)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to do a bit of benchmarking (even micro bench with {bench}) on these implementation details (honestly, even the old implementation of doing strptime more...) I would be curious if any of them are better or worse for performance (my gut says this is quicker than strptime a bunch, but I might be totally wrong about that!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are 100% right. The newer implementation is quicker. By how much - it depends, but always quicker. In general it takes half the time.
image

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What kind of scale are those numbers on? Could you post or link to the code you used to run that, I would be interested to see how you got there!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

summary(test1, relative = TRUE)
#>  A tibble: 2 × 13
#>   expression           min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result   memory     time       gc      
#>   <bch:expr>         <dbl>  <dbl>     <dbl>     <dbl>    <dbl> <int> <dbl>   <bch:tm> <list>   <list>     <list>     <list>  
#> 1 new_implementation  1      1         1.40      1        1.03    94     6      1.32s <tibble> <Rprofmem> <bench_tm> <tibble>
#> 2 old_implementation  1.46   1.46      1         1.27     1       92     8      1.81s <tibble> <Rprofmem> <bench_tm> <tibble>

test1
#>  A tibble: 2 × 13
#>   expression              min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result   memory     time       gc      
#>   <bch:expr>         <bch:tm> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>   <list>     <list>     <list>  
#> 1 new_implementation   12.7ms  13.3ms      71.1    76.1KB     4.54    94     6      1.32s <tibble> <Rprofmem> <bench_tm> <tibble>
#> 2 old_implementation   18.5ms  19.4ms      50.8    96.9KB     4.42    92     8      1.81s <tibble> <Rprofmem> <bench_tm> <tibble>

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have any reason to believe this would be different at larger scales, but for benchmarks like these we probably should use larger sets of data than a handful of rows. It depends on what we're looking at, but 10k-100k rows is more reasonable for something like this

@dragosmg dragosmg requested a review from jonkeane March 29, 2022 19:45
Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just two tiny things! It looks like all the comments from Jon were addressed...it looks great!

@jonkeane jonkeane closed this in 80bba5c Apr 11, 2022
@ursabot
Copy link

ursabot commented Apr 11, 2022

Benchmark runs are scheduled for baseline = 633687c and contender = 80bba5c. 80bba5c is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.59% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.26% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/485| 80bba5cb ec2-t3-xlarge-us-east-2>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/470| 80bba5cb test-mac-arm>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/471| 80bba5cb ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/480| 80bba5cb ursa-thinkcentre-m75q>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/484| 633687c1 ec2-t3-xlarge-us-east-2>
[Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/469| 633687c1 test-mac-arm>
[Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/470| 633687c1 ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/479| 633687c1 ursa-thinkcentre-m75q>
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@dragosmg dragosmg deleted the decimal_dates branch April 12, 2022 12:46
Comment on lines +273 to +285
# if time1 or time2 are timestamps they cannot be expressed in "s" /seconds
# otherwise they cannot be added subtracted with durations
# TODO delete the casting to "us" once
# https://issues.apache.org/jira/browse/ARROW-16060 is solved
if (inherits(time1, "Expression") &&
time1$type_id() %in% Type[c("TIMESTAMP")] && time1$type()$unit() != 2L) {
time1 <- build_expr("cast", time1, options = cast_options(to_type = timestamp("us")))
}

if (inherits(time2, "Expression") &&
time2$type_id() %in% Type[c("TIMESTAMP")] && time2$type()$unit() != 2L) {
time2 <- build_expr("cast", time2, options = cast_options(to_type = timestamp("us")))
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we update this section since it looks like ARROW-16060 has already been solved?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely, feel free to open a Jira + PR to do so if we don't already have one

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants