Strategy for mixing large_string and string with chunked arrays

I'm trying to see if I can move vaex to rely fully on arrow, and I have an issue with mixing string and large_string.

In vaex I can lazily concatenate dataframes without a memory copy. If I implement this with a `pa.ChunkedArray`, users cannot concatenate a dataframe whose string column has type `pa.string` with a dataframe whose column has type `pa.large_string`, because all chunks of a chunked array must share a single type.

In short, there is no arrow data structure to handle this 'mixed chunked array', but I was wondering if this could change. The only way out seems to be casting them manually to a common type (although this is blocked by https://issues.apache.org/jira/browse/ARROW-6071).
Internally I could solve this in vaex, but feedback from building a DataFrame library with arrow might be useful. Also, it means I cannot expose the concatenated DataFrame as an arrow table.

I also wonder if having two types (large_string and string) is a good idea in the end since it makes type checking cumbersome (having to check two types each time).

PS: I guess GitHub is OK for discussions/questions, since Jira has no 'Issue type' for that.
