Logically repartition files by row splits #14607

@AdamGS

Is your feature request related to a problem or challenge?

We’re implementing a file format, Vortex, which has no “row groups” or similar concept. This means a byte range might fall completely within one column, and aligning columns is a non-trivial task. I would like to be able to express repartitioning logic that splits files only logically (by rows and not by bytes).
The existing repartitioning logic in DataFusion (specifically FileGroupPartitioner and FileScanConfig::repartitioned) assumes that files can be split logically by byte ranges (FileRange), and even the rustdoc on it seems very Parquet-specific (even though other formats do support it). This assumes some mapping/alignment between the physical layout and the logical one.
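To make "split by rows" concrete, here is a minimal sketch of what row-based repartitioning could look like. It assumes each file's row count is known up front. `RowSplit` and `split_by_rows` are hypothetical names for illustration only, not DataFusion or Vortex APIs.

```rust
/// Hypothetical half-open row range `[start, end)` within one file.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RowSplit {
    pub file: String,
    pub start: u64,
    pub end: u64,
}

/// Divide `files` (path, row count) into at most `target` groups of
/// roughly equal row counts, cutting only on row boundaries.
pub fn split_by_rows(files: &[(String, u64)], target: usize) -> Vec<Vec<RowSplit>> {
    let total: u64 = files.iter().map(|(_, rows)| *rows).sum();
    if total == 0 || target == 0 {
        return Vec::new();
    }
    // Rows per group, rounded up so that `target` groups cover every row.
    let per_group = (total + target as u64 - 1) / target as u64;

    let mut groups: Vec<Vec<RowSplit>> = Vec::new();
    let mut current: Vec<RowSplit> = Vec::new();
    let mut room = per_group; // rows still available in the current group

    for (file, rows) in files {
        let mut start = 0u64;
        while start < *rows {
            // Take as many rows as fit in the current group.
            let take = room.min(*rows - start);
            current.push(RowSplit {
                file: file.clone(),
                start,
                end: start + take,
            });
            start += take;
            room -= take;
            if room == 0 {
                // Group is full: close it and start a new one.
                groups.push(std::mem::take(&mut current));
                room = per_group;
            }
        }
    }
    if !current.is_empty() {
        groups.push(current);
    }
    groups
}
```

Nothing here depends on the physical layout of the file. The reader for each format decides how to map a row range onto bytes.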

Describe the solution you'd like

It seems like the best way would be to make FileGroupPartitioner configurable through FileSource. The other option would be to make FileRange an enum, but that would still mean we (and any other format with a similar structure) would have to maintain our own repartitioning logic.
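A rough sketch of how the two options could fit together, under the assumption that a file source can report how its files may be divided. All names here (`FileSplit`, `SplitStrategy`, `ByteAligned`, `RowAligned`) are hypothetical and do not exist in DataFusion; this only illustrates the shape of the idea.

```rust
/// Hypothetical enum form of a file range: a split is either physical
/// (bytes) or logical (rows). Both variants are half-open `[start, end)`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FileSplit {
    Bytes { start: u64, end: u64 },
    Rows { start: u64, end: u64 },
}

/// Hypothetical hook a file source could implement so that shared
/// repartitioning code knows how its files may be divided.
pub trait SplitStrategy {
    /// Cut a file of `size_bytes` bytes and `num_rows` rows into at most `n` splits.
    fn split(&self, size_bytes: u64, num_rows: u64, n: u64) -> Vec<FileSplit>;
}

/// Cut `[0, len)` into at most `n` contiguous, non-empty ranges.
fn even_ranges(len: u64, n: u64) -> Vec<(u64, u64)> {
    if len == 0 || n == 0 {
        return Vec::new();
    }
    let step = (len + n - 1) / n; // ceiling division
    (0..n)
        .map(|i| (i * step, ((i + 1) * step).min(len)))
        .filter(|(start, end)| start < end) // drop empty trailing ranges
        .collect()
}

/// For formats where any byte range can be mapped back to rows.
pub struct ByteAligned;

/// For formats without row groups, where only row ranges are meaningful.
pub struct RowAligned;

impl SplitStrategy for ByteAligned {
    fn split(&self, size_bytes: u64, _num_rows: u64, n: u64) -> Vec<FileSplit> {
        even_ranges(size_bytes, n)
            .into_iter()
            .map(|(start, end)| FileSplit::Bytes { start, end })
            .collect()
    }
}

impl SplitStrategy for RowAligned {
    fn split(&self, _size_bytes: u64, num_rows: u64, n: u64) -> Vec<FileSplit> {
        even_ranges(num_rows, n)
            .into_iter()
            .map(|(start, end)| FileSplit::Rows { start, end })
            .collect()
    }
}
```

With something like this, the shared partitioner would own the grouping logic, and each format would only choose which kind of split it can honor.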

Describe alternatives you've considered

We can keep the current state, which is maintaining our own repartitioning logic and eventually just reusing FileRange to describe row splits.

Additional context

No response

Labels: enhancement