Description
Is your feature request related to a problem or challenge?
We’re implementing a file format, Vortex, which has no “row groups” or similar concept. A byte range might therefore fall entirely within a single column, and aligning byte ranges across columns is a non-trivial task. I would like to be able to express repartitioning logic that splits files only logically (by rows, not by bytes).
The existing repartitioning logic in DataFusion (specifically FileGroupPartitioner and FileScanConfig::repartitioned) assumes that files can be split by byte ranges (FileRange), and even the rustdoc on it seems very Parquet-specific (even though other formats do support it). This assumes some mapping/alignment between the physical layout and the logical one.
Describe the solution you'd like
The best way seems to be making FileGroupPartitioner configurable through FileSource. The other option would be to make FileRange an enum, but that would still mean that we (and any other format with a similar structure) would have to maintain our own repartitioning logic.
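To make the request concrete, here is a minimal sketch of what row-based splitting could look like. Everything in it is hypothetical and is not DataFusion's API: the `FileSplit` enum stands in for the "FileRange as an enum" idea, and `split_by_rows` is an illustrative helper that assumes each file's row count is known up front.

```rust
/// Hypothetical stand-in for a `FileRange` that can describe either kind of split.
/// This type is not part of DataFusion.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum FileSplit {
    /// Byte-based split (what the existing logic assumes).
    Bytes { start: u64, end: u64 },
    /// Row-based split, half-open: rows `start..end`.
    Rows { start: u64, end: u64 },
}

/// Hypothetical helper: distribute `(file name, row count)` pairs over at most
/// `target_partitions` groups, cutting files only at row boundaries.
pub fn split_by_rows(
    files: &[(&str, u64)],
    target_partitions: usize,
) -> Vec<Vec<(String, FileSplit)>> {
    let total_rows: u64 = files.iter().map(|(_, rows)| *rows).sum();
    if target_partitions == 0 || total_rows == 0 {
        return Vec::new();
    }

    // Rows per partition, rounded up so that every row is assigned.
    let n = target_partitions as u64;
    let rows_per_partition = (total_rows + n - 1) / n;

    let mut partitions = Vec::new();
    let mut current = Vec::new();
    let mut room = rows_per_partition;

    for (name, rows) in files {
        let mut start = 0u64;
        while start < *rows {
            // Take as many rows as fit into the partition being filled.
            let take = room.min(*rows - start);
            current.push((
                name.to_string(),
                FileSplit::Rows { start, end: start + take },
            ));
            start += take;
            room -= take;
            if room == 0 {
                // Partition is full: emit it and start a new one.
                partitions.push(std::mem::take(&mut current));
                room = rows_per_partition;
            }
        }
    }
    if !current.is_empty() {
        partitions.push(current);
    }
    partitions
}
```

For example, splitting two files of 10 and 6 rows into 4 partitions gives 4 rows per partition, with one partition spanning the tail of the first file and the head of the second. A real integration would still need FileSource (or an equivalent hook) to decide which kind of split a format supports.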
Describe alternatives you've considered
We can keep the current state: maintain our own repartitioning logic and eventually just reuse FileRange to describe row splits.
Additional context
No response