[C++][Dataset] Simplify ScanOptions after complexity has moved to ScanNode #29015

@asfimport

Description

ScanOptions currently has a number of constraints between members, which violates the contract of a public struct:

  • filter must be bound to dataset_schema

  • projection must be bound to dataset_schema

  • projected_schema must be schema<...fields>, where the type of projection is struct<...fields>

These constraints are currently required to support FilterAndProjectScanTask, but after ARROW-13328 this complexity can be removed and ScanOptions can become a pure struct argument to MakeScanNode. Specifically, it should be possible to:

  • remove the projected_schema field (ScanNode doesn't need to know the schemas of any subsequent nodes)

  • remove the projection field (ScanNode doesn't need to know how or if scanned batches will be projected)

  • provide a simple vector of FieldRef to indicate which fields should be materialized (MakeScanNode can validate that this includes every field referenced by filter)

  • allow filter to be unbound (MakeScanNode can bind it to the dataset schema)

dataset_schema also seems slightly redundant, since MakeScanNode already takes a Dataset as an argument. However, it is currently used by CsvFileFormat to derive column types.
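
To make the proposal concrete, here is a minimal, self-contained sketch of what the simplified options and the validation step could look like. It does not use the Arrow API: `Schema`, `Filter`, `SimpleScanOptions`, `BoundScan`, and `BindScanOptions` are all hypothetical stand-ins (plain strings take the place of FieldRef, and the filter is reduced to the list of field names it references). It assumes C++17 for `std::optional`.

```cpp
#include <algorithm>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a dataset schema; NOT arrow::Schema.
struct Schema {
  std::vector<std::string> field_names;

  // Returns the position of `name`, or std::nullopt if the field is absent.
  std::optional<int> IndexOf(const std::string& name) const {
    auto it = std::find(field_names.begin(), field_names.end(), name);
    if (it == field_names.end()) return std::nullopt;
    return static_cast<int>(it - field_names.begin());
  }
};

// Hypothetical unbound filter: it only records which fields it references.
struct Filter {
  std::vector<std::string> referenced_fields;
};

// Simplified options: no projection, no projected_schema, and no
// invariants that tie one member to another.
struct SimpleScanOptions {
  Filter filter;                                 // may be unbound
  std::vector<std::string> materialized_fields;  // stand-in for vector<FieldRef>
};

// What a hypothetical MakeScanNode could compute once it has bound the
// options to the dataset schema.
struct BoundScan {
  std::vector<int> materialized_indices;
  std::vector<int> filter_field_indices;
};

// Binds `options` to `dataset_schema`. On failure, returns std::nullopt and
// writes a message to `*error`.
std::optional<BoundScan> BindScanOptions(const Schema& dataset_schema,
                                         const SimpleScanOptions& options,
                                         std::string* error) {
  BoundScan bound;
  // Every materialized field must exist in the dataset schema.
  for (const auto& name : options.materialized_fields) {
    auto index = dataset_schema.IndexOf(name);
    if (!index) {
      *error = "materialized field not in dataset schema: " + name;
      return std::nullopt;
    }
    bound.materialized_indices.push_back(*index);
  }
  // Every field referenced by the filter must be materialized.
  const auto& materialized = options.materialized_fields;
  for (const auto& name : options.filter.referenced_fields) {
    if (std::find(materialized.begin(), materialized.end(), name) ==
        materialized.end()) {
      *error = "filter references unmaterialized field: " + name;
      return std::nullopt;
    }
    // Safe to dereference: the field was validated in the loop above.
    bound.filter_field_indices.push_back(*dataset_schema.IndexOf(name));
  }
  return bound;
}
```

In this sketch the caller fills in two independent members, and all cross-checking happens in one place at node construction, which is the property the issue asks for.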

Reporter: Ben Kietzman / @bkietz

Note: This issue was originally created as ARROW-13340. Please see the migration documentation for further details.
