Description
ScanOptions currently has a number of constraints between members, which violates the contract of a public struct:
- `filter` must be bound to `dataset_schema`
- `projection` must be bound to `dataset_schema`
- `projected_schema` must be `schema<...fields>`, where the type of projection is `struct<...fields>`

These are currently required to support `FilterAndProjectScanTask`, but after ARROW-13328 this complexity can be removed and ScanOptions can be a pure struct argument to `MakeScanNode`. Specifically, it should be possible to:
- remove the `projected_schema` field (ScanNode doesn't need to know the schemas of any subsequent nodes)
- remove the `projection` field (ScanNode doesn't need to know how or if scanned batches will be projected)
- provide a simple vector of `FieldRef` to indicate which fields should be materialized (MakeScanNode can validate that this includes every field referenced by `filter`)
- allow `filter` to be unbound (MakeScanNode can bind it to the dataset schema)

`dataset_schema` seems slightly redundant too, since MakeScanNode also takes a Dataset as an argument, but it is currently used by CsvFileFormat to derive column types.
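To make the proposed checks concrete, here is a minimal, self-contained sketch of what the validation inside MakeScanNode might look like. It does not use Arrow's real types: `SketchFilter`, `SketchScanOptions`, `Contains`, and `ValidateAndBind` are hypothetical names invented for illustration, field references are modeled as plain strings rather than `FieldRef`, and "binding" is reduced to setting a flag.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for a filter expression. Only the names of the
// fields it references and whether it has been bound are modeled here.
struct SketchFilter {
  std::vector<std::string> referenced_fields;
  bool bound = false;
};

// Hypothetical slimmed-down options: no projection and no projected_schema,
// just an (initially unbound) filter and the fields to materialize.
struct SketchScanOptions {
  SketchFilter filter;
  std::vector<std::string> materialized_fields;
};

static bool Contains(const std::vector<std::string>& names,
                     const std::string& name) {
  return std::find(names.begin(), names.end(), name) != names.end();
}

// Checks the options against the dataset's field names and, if they are
// consistent, marks the filter as bound. Returns an empty string on success
// and an error message otherwise.
std::string ValidateAndBind(SketchScanOptions& options,
                            const std::vector<std::string>& dataset_fields) {
  // Every materialized field must exist in the dataset schema.
  for (const auto& name : options.materialized_fields) {
    if (!Contains(dataset_fields, name)) {
      return "field '" + name + "' is not in the dataset schema";
    }
  }
  // Every field referenced by the filter must be materialized.
  for (const auto& name : options.filter.referenced_fields) {
    if (!Contains(options.materialized_fields, name)) {
      return "filter references '" + name + "', which is not materialized";
    }
  }
  // Stand-in for binding the filter to the dataset schema.
  options.filter.bound = true;
  return "";
}
```

With this shape, callers only fill in independent members, and any cross-member consistency is checked in one place when the scan node is constructed rather than being a precondition of the struct itself.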
Reporter: Ben Kietzman / @bkietz
Related issues:
Note: This issue was originally created as ARROW-13340. Please see the migration documentation for further details.