read_parquet is not supported for partitioned parquet #633

@dazza-codes

Description

Split from #626

read_parquet is not supported for a partitioned Parquet data set.

System information

$ cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.2 LTS"
$ conda --version
conda 4.6.14
$ python --version
Python 3.7.3
$ pip --version
pip 19.1 from /home/dlweber/miniconda3/envs/gis-dataprocessing/lib/python3.7/site-packages/pip (python 3.7)

$ pip freeze | grep modin
modin==0.5.0
$ pip freeze | grep pandas
pandas==0.24.2
$ pip freeze | grep numpy
numpy==1.16.3

miniconda3 was used to install most of the sci-py stack, with a pip clause to add modin, e.g.

# environment.yaml
channels:
  - conda-forge
  - defaults

dependencies:
  - python>=3.7
  - affine
  - configobj
  - dask
  - numpy
  - pandas
  - pyarrow
  - rasterio
  - s3fs
  - scikit-learn
  - scipy
  - shapely
  - xarray
  - pip
  - pip:
    - modin

Describe the problem

https://modin.readthedocs.io/en/latest/pandas_supported.html lists read_parquet as supported, but it appears not to work for partitioned data.

Error traceback

  File "/home/joe/miniconda3/envs/project/lib/python3.7/site-packages/modin/backends/pandas/query_compiler.py", line 871, in _full_reduce
    mapped_parts = self.data.map_across_blocks(map_func)
  File "/home/joe/miniconda3/envs/project/lib/python3.7/site-packages/modin/engines/base/frame/partition_manager.py", line 209, in map_across_blocks
    preprocessed_map_func = self.preprocess_func(map_func)
  File "/home/joe/miniconda3/envs/project/lib/python3.7/site-packages/modin/engines/base/frame/partition_manager.py", line 100, in preprocess_func
    return self._partition_class.preprocess_func(map_func)
  File "/home/joe/miniconda3/envs/project/lib/python3.7/site-packages/modin/engines/ray/pandas_on_ray/frame/partition.py", line 108, in preprocess_func
    return ray.put(func)
  File "/home/joe/miniconda3/envs/project/lib/python3.7/site-packages/ray/worker.py", line 2216, in put
    worker.put_object(object_id, value)
  File "/home/joe/miniconda3/envs/project/lib/python3.7/site-packages/ray/worker.py", line 375, in put_object
    self.store_and_register(object_id, value)
  File "/home/joe/miniconda3/envs/project/lib/python3.7/site-packages/ray/worker.py", line 309, in store_and_register
    self.task_driver_id))
  File "/home/joe/miniconda3/envs/project/lib/python3.7/site-packages/ray/utils.py", line 475, in _wrapper
    return orig_attr(*args, **kwargs)
  File "pyarrow/_plasma.pyx", line 496, in pyarrow._plasma.PlasmaClient.put
  File "pyarrow/serialization.pxi", line 355, in pyarrow.lib.serialize
  File "pyarrow/serialization.pxi", line 150, in pyarrow.lib.SerializationContext._serialize_callback
  File "/home/joe/miniconda3/envs/project/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle.py", line 952, in dumps
    cp.dump(obj)
  File "/home/joe/miniconda3/envs/project/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle.py", line 271, in dump
    raise pickle.PicklingError(msg)
_pickle.PicklingError: Could not pickle object as excessively deep recursion required.
