Description
I'm trying to see if I can move vaex to rely fully on arrow, and I have an issue with mixing string and large_string.
In vaex I can lazily concatenate dataframes without a memory copy. If I implement this with a pa.ChunkedArray, users cannot concatenate a dataframe whose string column has type pa.string with a dataframe whose column has type pa.large_string.
In short, there is no Arrow data structure for such a 'mixed chunked array', and I was wondering whether this could change. The only way out seems to be to cast the chunks manually to a common type (although that is blocked by https://issues.apache.org/jira/browse/ARROW-6071).
Internally I could solve this in vaex, but feedback from building a DataFrame library with arrow might be useful. Also, it means I cannot expose the concatenated DataFrame as an arrow table.
I also wonder if having two types (large_string and string) is a good idea in the end since it makes type checking cumbersome (having to check two types each time).
PS: I guess GitHub is ok for discussions/questions, Jira has no 'Issue type' for that