Skip to content

Conversation

@thisisnic
Copy link
Member

@thisisnic thisisnic commented Apr 28, 2021

This PR adds bindings for both strsplit() and stringr::str_split() for dplyr

@github-actions
Copy link

@thisisnic thisisnic force-pushed the ARROW-11515-strsplit branch from 3118928 to 6e6addc Compare April 30, 2021 11:14
@thisisnic thisisnic force-pushed the ARROW-11515-strsplit branch from b7953fa to 5615486 Compare April 30, 2021 13:38
@thisisnic thisisnic marked this pull request as draft April 30, 2021 17:02
@thisisnic thisisnic marked this pull request as ready for review May 4, 2021 14:35
@thisisnic
Copy link
Member Author

This PR now also implements stringr::str_split and adds some tests for both implemented functions

Co-authored-by: Ian Cook <ianmcook@gmail.com>
@ianmcook
Copy link
Member

ianmcook commented May 6, 2021

I'm going to do a bit more testing and polishing and then merge this after the CI is green if I don't hear any objections
cc @nealrichardson

@ianmcook
Copy link
Member

ianmcook commented May 6, 2021

I'm going to do a bit more testing and polishing and then merge this after the CI is green if I don't hear any objections
cc @nealrichardson

Update: I found a few issues when reviewing which I'm going to try to fix now; I will describe in comments soon

@ianmcook
Copy link
Member

ianmcook commented May 7, 2021

I think this is ready to merge now! The R 3.5 CI failure was happening in other recent PRs so I assume it's unrelated to any changes here.

The commits I added yesterday were mostly just minor things. @thisisnic if you have questions or uncertainty about any of the commits I added, please let me know today

@thisisnic
Copy link
Member Author

At some point (in a separate PR) it might be worth going through and checking the other functions here in compute.cpp to see if there are any others with optional arguments that are actually mandatory because we didn't set defaults for them like this. @thisisnic could you take a quick look to see if you notice any? If so, please open a Jira. Thanks!

Will do!

@ianmcook ianmcook closed this in 66cd373 May 7, 2021

if (func_name == "split_pattern") {
using Options = arrow::compute::SplitPatternOptions;
int64_t max_splits = -1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than setting defaults manually, can you do auto out = std::make_shared<Options>(Options::Defaults()); and then just set them if they're provided? (We do this above for other types.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC this didn't work when I tried it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can confirm that there is no Defaults() method in arrow::compute::SplitPatternOptions. Apparently some of the options classes just don't have one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They probably should be added in the C++ library, can you do that and/or make another JIRA?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thisisnic could you make a Jira for this please? Thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member Author

@thisisnic thisisnic May 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this, @nealrichardson and @ianmcook , it may make sense for arrow::compute::SplitPatternOptions not to have defaults, given that the pattern is one of the options - it might not make sense to pre-specify a pattern to split the string on? Neither of the R functions strsplit nor stringr::str_split have defaults for this argument.

Could one perhaps make a case for pattern moving from being an option to instead being a second input, similarly to how arrow::compute::filter and arrow::compute::take operate? At that point, having defaults for SplitPatternOptions feels more natural.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That all makes good sense to me but I'm not up to speed on what design conventions we follow in the C++ compute functions. Maybe it'd be best to start a Zulip thread to ask folks about this (Joris, Antonie, Maarten Breddels) to get a better idea of the best change to suggest (if any).

@ianmcook
Copy link
Member

ianmcook commented May 7, 2021

I opened #10271 to resolve the outstanding comments here. @nealrichardson PTAL

ianmcook added a commit that referenced this pull request May 8, 2021
This resolves a few outstanding comments raised in #10190

Closes #10271 from ianmcook/ARROW-12692

Authored-by: Ian Cook <ianmcook@gmail.com>
Signed-off-by: Ian Cook <ianmcook@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants