Skip to content

Conversation

@dragosmg
Copy link
Contributor

@dragosmg dragosmg commented Feb 7, 2022

Once this PR is merged arrow queries would be able to parse dates, date times, and periods with hour, minute, and second components correctly. For example,

df <- tibble(x = c("09-01-01", "09-01-02", "09-01-03")) 
df %>% 
  record_batch() %>% 
  mutate(y = ymd(x)) %>% 
  collect()

would be equivalent of

df %>%
   mutate(y = lubridate::ymd(x)

@github-actions
Copy link

github-actions bot commented Feb 7, 2022

@jonkeane jonkeane changed the title ARROW-14471 [R] Implement lubridate's date/time parsing functions ARROW-14471: [R] Implement lubridate's date/time parsing functions Feb 8, 2022
Copy link
Member

@jonkeane jonkeane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this! One suggestion for a slightly different approach

Comment on lines 154 to 159
ymd_hyphen1 = "%Y-%m-%d",
ymd_hyphen2 = "%y-%m-%d",
ymd_hyphen3 = "%Y-%B-%d",
ymd_hyphen4 = "%y-%B-%d",
ymd_hyphen5 = "%Y-%b-%d",
ymd_hyphen6 = "%y-%b-%d",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use one set of 6 like this, and then pre-process the strings with a regex like was suggested in https://issues.apache.org/jira/browse/ARROW-14471?focusedCommentId=17446011&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17446011 ?

Something kind of like:

format_map <- list(
        ymd_hyphen1 = "%Y-%m-%d",
        ymd_hyphen2 = "%y-%m-%d",
        ymd_hyphen3 = "%Y-%B-%d",
        ymd_hyphen4 = "%y-%B-%d",
        ymd_hyphen5 = "%Y-%b-%d",
        ymd_hyphen6 = "%y-%b-%d"
)

x <- gsub("[^A-Za-z0-9.]", "-", x)
call_binding(
      "coalesce",
      call_binding("strptime", x, format = format_map[[1]], unit = "s"),
      call_binding("strptime", x, format = format_map[[2]], unit = "s"),
      call_binding("strptime", x, format = format_map[[3]], unit = "s"),
      call_binding("strptime", x, format = format_map[[4]], unit = "s"),
      call_binding("strptime", x, format = format_map[[5]], unit = "s"),
      call_binding("strptime", x, format = format_map[[6]], unit = "s")
    )

You might need to use call_binding("gsub", ...) instead of being able to use it directly

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would at least cut down on the number of formats we need to try and mostly get the right answers

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reducing the formats works (I might have to implement it in several steps to handle " " and multiple instances of the same separator). There are a couple of outstanding issues:

  • coalesce() needs the first call to call_binding("strptime") to return NA in order to move on to the second one and so on. Instead the first call errors and the second one never get evaluated. tryCatch() doesn't seem to work in this context.
  • short ("%y") vs long ("%Y") years. If the string contains a short year (e.g. "19") libarrrow will not error for the first format, but instead return 0019.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I will move the focus on parse_date_time() of which most other date/time parsing functions are a particular case of. ymd(x) becomes parse_date_time(x, orders = "ymd"), but the problems highlighted here still remain.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

coalesce() needs the first call to call_binding("strptime") to return NA in order to move on to the second one and so on. Instead the first call errors and the second one never get evaluated. tryCatch() doesn't seem to work in this context.

Good point! Is there a jira to make that a possibility for the strptime binding (or in the C++ itself, which is the ultimate place it'll really need to be...)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://issues.apache.org/jira/browse/ARROW-15659, but this is for the R side only.

mutate(x = ymd(string_ymd)) %>%
collect(),
test_dates
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also have a check that if the format fails, we get NAs out — so long as that's possible)

It looks like right now, it throws an error for the whole conversion: https://github.com/apache/arrow/runs/5094263843?check_suite_focus=true#step:6:22119

@dragosmg dragosmg closed this Jul 1, 2022
@dragosmg
Copy link
Contributor Author

dragosmg commented Jul 1, 2022

I closed this as, in the meantime, this has become an umbrella issue only.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants