Skip to content

Conversation

@edponce
Copy link
Contributor

@edponce edponce commented Aug 4, 2021

This PR adds scalar string compute functions for titlecasing a string, namely "ascii_title" and "utf8_title". Simple titlecasing is performed, only every cased character following an uncased character is uppercased.
Additional changes included with this PR are:

  • restructure StringTransformCodepointXXX classes to support vector string kernels using codepoint transforms
  • update capitalize kernels

@github-actions
Copy link

github-actions bot commented Aug 4, 2021

@edponce edponce force-pushed the ARROW-12714-String-title-case-kernel branch from 213d532 to 2a5d59f Compare August 5, 2021 11:27
@edponce edponce force-pushed the ARROW-12714-String-title-case-kernel branch from d83bc17 to 5a02758 Compare August 18, 2021 20:42
@pitrou
Copy link
Member

pitrou commented Aug 23, 2021

@edponce Do you know when this will be ready for review? Or do you need help on this?

@edponce
Copy link
Contributor Author

edponce commented Aug 23, 2021

@pitrou I am working on completing this PR today and would greatly appreciate your review.

@edponce edponce force-pushed the ARROW-12714-String-title-case-kernel branch from 5a02758 to 6d23d4b Compare August 27, 2021 10:37
@edponce edponce marked this pull request as ready for review August 27, 2021 10:43
@edponce
Copy link
Contributor Author

edponce commented Aug 27, 2021

The capitalize and title kernels are the first vector string kernels that perform code point transforms. The code point transforms (case changes) can grow in bytes and thus required the use of MaxCodeUnits with the 3/2 growth factor. Also, these kernels depend on EnsureLookupTablesFilled to be called in PreExec method. These two requirements existed in different classes (CaseMappingTransform and StringTransformCodepoint) and their use applied TransformCodepoint (scalar method) instead of Transform (custom). For these reasons, I created StringTransformCodepointBase class with the growth factor and lookup tables methods. Now the available classes can be chosen depending on kind of kernel and for those that apply codepoint transforms.

cc @pitrou

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this!

@edponce edponce force-pushed the ARROW-12714-String-title-case-kernel branch from 78911fd to 02957f2 Compare September 1, 2021 02:03
@edponce
Copy link
Contributor Author

edponce commented Sep 1, 2021

@ianmcook Could you revise the R binding for the titlecase kernel? stringr has a str_to_title function.

@edponce edponce force-pushed the ARROW-12714-String-title-case-kernel branch from 0fff885 to 78bd427 Compare September 1, 2021 03:18
@edponce edponce requested a review from pitrou September 1, 2021 03:36
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a couple more questions / comments

@edponce edponce force-pushed the ARROW-12714-String-title-case-kernel branch from 78bd427 to 5955c4e Compare September 1, 2021 17:46
@edponce edponce requested a review from pitrou September 1, 2021 18:09
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one question, you may or may not want to act on it.
However, can you fix the lint failure? archery lint --clang-format should do it.

@pitrou pitrou closed this in e9eeff1 Sep 1, 2021
@pitrou
Copy link
Member

pitrou commented Sep 1, 2021

Thank you very much @edponce !

ViniciusSouzaRoque pushed a commit to s1mbi0se/arrow that referenced this pull request Oct 20, 2021
This PR adds scalar string compute functions for titlecasing a string, namely "ascii_title" and "utf8_title". Simple titlecasing is performed, only every cased character following an uncased character is uppercased.
Additional changes included with this PR are:
* restructure StringTransformCodepointXXX classes to support vector string kernels using codepoint transforms
* update capitalize kernels

Closes apache#10869 from edponce/ARROW-12714-String-title-case-kernel

Authored-by: Eduardo Ponce <edponce00@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants