Strategy for mixing large_string and string with chunked arrays #5874

@maartenbreddels

I'm trying to see if I can move vaex to rely fully on arrow, and I have an issue with mixing string and large_string.

In vaex I can lazily concatenate dataframes without copying memory. If I implement this with a pa.ChunkedArray, users cannot concatenate a dataframe whose string column has type pa.string with a dataframe whose column has type pa.large_string.

In short, there is no arrow data structure to handle this 'mixed chunked array', but I was wondering if this could change. The only way out seems to be to cast them manually to a common type (although that is blocked by https://issues.apache.org/jira/browse/ARROW-6071).
Internally I could solve this in vaex, but feedback from building a DataFrame library with arrow might be useful. Also, it means I cannot expose the concatenated DataFrame as an arrow table.

I also wonder if having two types (large_string and string) is a good idea in the end since it makes type checking cumbersome (having to check two types each time).

PS: I guess GitHub is ok for discussions/questions, since Jira has no 'Issue type' for that.
