-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-14471: [R] Implement lubridate's date/time parsing functions #12353
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-14471: [R] Implement lubridate's date/time parsing functions #12353
Conversation
jonkeane
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this! One suggestion for a slightly different approach
r/R/dplyr-funcs-datetime.R
Outdated
| ymd_hyphen1 = "%Y-%m-%d", | ||
| ymd_hyphen2 = "%y-%m-%d", | ||
| ymd_hyphen3 = "%Y-%B-%d", | ||
| ymd_hyphen4 = "%y-%B-%d", | ||
| ymd_hyphen5 = "%Y-%b-%d", | ||
| ymd_hyphen6 = "%y-%b-%d", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we use one set of 6 like this, and then pre-process the strings with a regex like was suggested in https://issues.apache.org/jira/browse/ARROW-14471?focusedCommentId=17446011&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17446011 ?
Something kind of like:
format_map <- list(
ymd_hyphen1 = "%Y-%m-%d",
ymd_hyphen2 = "%y-%m-%d",
ymd_hyphen3 = "%Y-%B-%d",
ymd_hyphen4 = "%y-%B-%d",
ymd_hyphen5 = "%Y-%b-%d",
ymd_hyphen6 = "%y-%b-%d"
)
x <- gsub("[^A-Za-z0-9.]", "-", x)
call_binding(
"coalesce",
call_binding("strptime", x, format = format_map[[1]], unit = "s"),
call_binding("strptime", x, format = format_map[[2]], unit = "s"),
call_binding("strptime", x, format = format_map[[3]], unit = "s"),
call_binding("strptime", x, format = format_map[[4]], unit = "s"),
call_binding("strptime", x, format = format_map[[5]], unit = "s"),
call_binding("strptime", x, format = format_map[[6]], unit = "s")
)
You might need to use call_binding("gsub", ...) instead of being able to use it directly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would at least cut down on the number of formats we need to try and mostly get the right answers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reducing the formats works (I might have to implement it in several steps to handle " " and multiple instances of the same separator). There are a couple of outstanding issues:
coalesce()needs the first call tocall_binding("strptime")to returnNAin order to move on to the second one and so on. Instead the first call errors and the second one never get evaluated.tryCatch()doesn't seem to work in this context.- short (
"%y") vs long ("%Y") years. If the string contains a short year (e.g. "19") libarrrow will not error for the first format, but instead return0019.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I will move the focus on parse_date_time() of which most other date/time parsing functions are a particular case of. ymd(x) becomes parse_date_time(x, orders = "ymd"), but the problems highlighted here still remain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
coalesce() needs the first call to call_binding("strptime") to return NA in order to move on to the second one and so on. Instead the first call errors and the second one never get evaluated. tryCatch() doesn't seem to work in this context.
Good point! Is there a jira to make that a possibility for the strptime binding (or in the C++ itself, which is the ultimate place it'll really need to be...)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://issues.apache.org/jira/browse/ARROW-15659, but this is for the R side only.
| mutate(x = ymd(string_ymd)) %>% | ||
| collect(), | ||
| test_dates | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also have a check that if the format fails, we get NAs out — so long as that's possible)
It looks like right now, it throws an error for the whole conversion: https://github.com/apache/arrow/runs/5094263843?check_suite_focus=true#step:6:22119
|
I closed this as, in the meantime, this has become an umbrella issue only. |
Once this PR is merged arrow queries would be able to parse dates, date times, and periods with hour, minute, and second components correctly. For example,
would be equivalent of