New Feature Request: Predicate Pushdown support for Parquet #636

@ddutt

Description

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 19.04
  • Modin installed from (source or binary): binary, pip install modin
  • Modin version: 0.5.0
  • Python version: 3.7.3
  • Exact command to reproduce: Use filters in read_parquet

Describe the problem

One of the most important features of Parquet is predicate pushdown, which can cut down on I/O significantly. pyarrow supports it, but not through the read_pandas() code path. If you replace the existing call to read_pandas() in ray/pandas_on_ray/io.py with the following segment, predicate pushdown works automatically:

    # Proposed replacement; pq is pyarrow.parquet.
    df = (
        pq.ParquetDataset(path, **kwargs)
        .read(columns=columns)
        .to_pandas()
    )
    # df = pq.read_pandas(path, columns=columns, **kwargs).to_pandas()
    # Append the length of the index here to build it externally

I've included the original read_pandas() code commented out to provide an anchor.

Metadata

    Labels

    Performance 🚀: Performance related issues and pull requests.
    new feature/request 💬: Requests and pull requests for new features.
