Skip to content

Conversation

@edponce
Copy link
Contributor

@edponce edponce commented Sep 25, 2021

This PR adds locale validation to string functions matching stringr: str_to_title, str_to_lower, and str_to_upper.
In Arrow C++, these kernels do not support a locale argument so only the default R locale ("en") is currently supported.

@github-actions
Copy link

@edponce edponce marked this pull request as draft September 26, 2021 07:59
@edponce edponce marked this pull request as ready for review September 28, 2021 21:04
@edponce edponce force-pushed the ARROW-13853-String-title-case-kernel branch 2 times, most recently from 180d12f to f6c29d0 Compare September 29, 2021 03:28
@jonkeane
Copy link
Member

@github-actions crossbow submit test-r-linux-as-cran

@github-actions
Copy link

Revision: f6c29d0379efb31fe4e41c959356bc484c4ca43c

Submitted crossbow builds: ursacomputing/crossbow @ actions-859

Task Status
test-r-linux-as-cran Github Actions

Copy link
Member

@jonkeane jonkeane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this looks good — a few comments and running our as-cran crossbow build to confirm that this doesn't error on that.

@edponce edponce force-pushed the ARROW-13853-String-title-case-kernel branch from a53004b to 42f6860 Compare September 29, 2021 18:50
Copy link
Member

@jonkeane jonkeane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something is up with the tests, I've added a suggestion that should get around that. If it persists or you want me to dig into this deeper let me know and I'll pull the branch locally and figure out what's going wrong here.

@edponce
Copy link
Contributor Author

edponce commented Sep 29, 2021

Thanks all for the great reviews!

@edponce
Copy link
Contributor Author

edponce commented Sep 29, 2021

Interesting...the failed tests are because stringr behavior for str_to_title() is different from Python's title() when string begins with an integer. The C++ implementation follows Python's behavior.

# R stringr
> str_to_title("1Foo1")  # "1foo1"
# Python
>>> "1Foo1".title()  # "1Foo1"

I conclude that this also applies for "is_title()" function because it ignores beginning integers...well...stringr does not has an is_title() function.

What to do in such a case? cc @jonkeane @nealrichardson

@jonkeane
Copy link
Member

Hmm, practically speaking I suspect this won't matter to all that many people. We have a few options:

  1. We could introduce an argument to the kernel that allows one to change the behavior to one or the other
  2. We could accept Python's behavior and note that this will produce slightly different results than {stringr}
  3. We could change the behavior to {stringr}'s (and for that matter tools::toTitleCase()) and note that the Python implementation will be slightly different.

I suspect that 2 is probably the easiest right now.

@jonkeane
Copy link
Member

And if you're looking for other ways this could be even more hairy: https://stat.ethz.ch/R-manual/R-devel/library/tools/html/toTitleCase.html

this implementation knows about conventional mixed-case words such as ‘LaTeX’ and ‘OpenBUGS’ and a few technical terms which are not usually capitalized such as ‘jar’ and ‘xls’. However, unknown technical terms will be capitalized unless they are single words enclosed in single quotes: names of packages and libraries should be quoted in titles.

@edponce
Copy link
Contributor Author

edponce commented Sep 30, 2021

Yeah, when I was implementing the title() kernels in C++, I encountered all sorts of corner cases and hurdles, but we decided to keep it simple and match Python's approach.

For this PR, the R binding calls the C++ kernels directly, so it will produce different results for extreme cases when compared to stringr. If users find issues or request more complex functionality, then this can be revisited in a future PR.

Copy link
Member

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI failure is unrelated (failure to install an optional dependency)

ViniciusSouzaRoque pushed a commit to s1mbi0se/arrow that referenced this pull request Oct 20, 2021
This PR adds locale validation to string functions matching stringr: str_to_title, str_to_lower, and str_to_upper.
In Arrow C++, these kernels do not support a locale argument so only the default R locale ("en") is currently supported.

Closes apache#11232 from edponce/ARROW-13853-String-title-case-kernel

Authored-by: Eduardo Ponce <edponce00@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants