wesm commented May 26, 2020

There's some new code generation machinery here (which will be worth iterating on), but the relevant implementation / "developer UX" is in scalar_string_ascii.cc; take a look.

Note: the implementation of ascii_upper is far from optimal. std::toupper does more than convert ASCII to uppercase (its behavior depends on the current C locale), so it would likely be faster to replace it with a bespoke implementation that handles only the ASCII alphabetic range.
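For illustration, a bespoke ASCII-only conversion along the lines suggested in the note might look like the sketch below. The names AsciiToUpper, AsciiUpperBuffer, and AsciiUpper are hypothetical and not taken from this PR; the sketch assumes the input is a plain contiguous byte buffer and does not touch any Arrow data structures.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not from the PR): uppercase a single byte if it is an
// ASCII lowercase letter; every other byte, including bytes >= 0x80, is
// returned unchanged. No locale lookup is involved.
inline uint8_t AsciiToUpper(uint8_t c) {
  return (c >= 'a' && c <= 'z') ? static_cast<uint8_t>(c - ('a' - 'A')) : c;
}

// Hypothetical helper: apply AsciiToUpper to a contiguous byte buffer.
// `output` must have room for `length` bytes.
inline void AsciiUpperBuffer(const uint8_t* input, int64_t length, uint8_t* output) {
  for (int64_t i = 0; i < length; ++i) {
    output[i] = AsciiToUpper(input[i]);
  }
}

// Convenience wrapper over std::string, only for trying the sketch out.
inline std::string AsciiUpper(const std::string& s) {
  std::string out(s.size(), '\0');
  AsciiUpperBuffer(reinterpret_cast<const uint8_t*>(s.data()),
                   static_cast<int64_t>(s.size()),
                   reinterpret_cast<uint8_t*>(&out[0]));
  return out;
}
```

Because the output has exactly the same byte length as the input, a kernel built this way could in principle reuse the input offsets unchanged and transform the whole data buffer in one pass; whether that beats std::toupper in practice would need benchmarking.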

In [1]: import pyarrow as pa; import pyarrow.compute as pc                                                                                                                                     

In [2]: arr = pa.array(['aaa', 'bbbbbb', None, ''])                                                                                                                                            

In [3]: pc.ascii_upper(arr)                                                                                                                                                                    
Out[3]: 
<pyarrow.lib.StringArray object at 0x7f7044003e50>
[
  "AAA",
  "BBBBBB",
  null,
  ""
]

In [4]: pc.ascii_length(arr)                                                                                                                                                                   
Out[4]: 
<pyarrow.lib.Int32Array object at 0x7f7044003910>
[
  3,
  6,
  null,
  0
]

int64 offsets are respected with LargeString, so ascii_length returns an Int64Array instead of an Int32Array:

In [5]: arr = pa.array(['aaa', 'bbbbbb', None, ''], type='large_utf8')                                                                                                                         

In [6]: pc.ascii_length(arr)                                                                                                                                                                   
Out[6]: 
<pyarrow.lib.Int64Array object at 0x7f703c74cbb0>
[
  3,
  6,
  null,
  0
]
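One way to see why the result type follows the offset width: the length of each value is the difference between two consecutive offsets, so a single template over the offset type can serve both string layouts. The sketch below is an assumption about the general shape of such code, not the PR's implementation; LengthsFromOffsets is a hypothetical name, and the validity bitmap (which marks the null slot) is deliberately ignored.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch (not from the PR): derive per-value byte lengths from a
// variable-width offsets buffer. OffsetType would be int32_t for String and
// int64_t for LargeString, and the lengths come out in the same integer type.
// Nulls are not handled here; a real kernel would carry the validity bitmap
// through to the output.
template <typename OffsetType>
std::vector<OffsetType> LengthsFromOffsets(const std::vector<OffsetType>& offsets) {
  std::vector<OffsetType> lengths;
  if (offsets.empty()) return lengths;
  lengths.reserve(offsets.size() - 1);
  for (std::size_t i = 0; i + 1 < offsets.size(); ++i) {
    // Value i occupies bytes [offsets[i], offsets[i + 1]) of the data buffer.
    lengths.push_back(offsets[i + 1] - offsets[i]);
  }
  return lengths;
}
```

For the example array above, one plausible offsets buffer is {0, 3, 9, 9, 9}; the null slot then has a raw length of 0 and is reported as null only because of the validity bitmap.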

wesm commented May 26, 2020

cc @maartenbreddels, have a look at arrow/compute/kernels/scalar_string_ascii.cc for the example function implementations (valid for both 32-bit and 64-bit offset string types). As more functions are added, common structures will emerge that make implementing and testing them progressively easier.

wesm changed the title from
  ARROW-8922: [C++] Add illustrative "ascii_upper" and "ascii_length" scalar functions valid for Array and Scalar inputs
to
  ARROW-8922: [C++] Add illustrative "ascii_upper" and "ascii_length" scalar string functions valid for Array and Scalar inputs
on May 26, 2020
wesm commented May 31, 2020

+1. I'm doing some cleanup of includes (and moving ArrayData to a separate header), and it would help me to have this merged. Please leave comments and I will address them in a follow-up.
