to_parquet is not supported #626

@dazza-codes

Description

System information

$ cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.2 LTS"
$ conda --version
conda 4.6.14
$ python --version
Python 3.7.3
$ pip --version
pip 19.1 from /home/dlweber/miniconda3/envs/gis-dataprocessing/lib/python3.7/site-packages/pip (python 3.7)

$ pip freeze | grep modin
modin==0.5.0
$ pip freeze | grep pandas
pandas==0.24.2
$ pip freeze | grep numpy
numpy==1.16.3

Miniconda3 was used to install most of the SciPy stack, with a pip section in the environment file to add modin, e.g.:

# environment.yaml
channels:
  - conda-forge
  - defaults

dependencies:
  - python>=3.7
  - affine
  - configobj
  - dask
  - numpy
  - pandas
  - pyarrow
  - rasterio
  - s3fs
  - scikit-learn
  - scipy
  - shapely
  - xarray
  - pip
  - pip:
    - modin

Describe the problem

https://modin.readthedocs.io/en/latest/pandas_supported.html lists to_parquet as supported, but calling it falls back to the pandas implementation:

import numpy as np
import modin.pandas as pd
size = (1, 10 * 10)
column_ij = ["%04d_%04d" % (i, j) for i in range(10) for j in range(10)]
data = np.random.randint(0, 10000, size=size, dtype="uint16")
df = pd.DataFrame(data, columns=column_ij)
df.to_parquet('/tmp/tmp.parquet')
UserWarning: `DataFrame.to_parquet` defaulting to pandas implementation.

More details:

2019-05-21 16:03:46,207	WARNING worker.py:1337 -- WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.
2019-05-21 16:03:46,207	INFO node.py:469 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-05-21_16-03-46_18437/logs.
2019-05-21 16:03:46,310	INFO services.py:407 -- Waiting for redis server at 127.0.0.1:55558 to respond...
2019-05-21 16:03:46,418	INFO services.py:407 -- Waiting for redis server at 127.0.0.1:41726 to respond...
2019-05-21 16:03:46,420	INFO services.py:804 -- Starting Redis shard with 2.1 GB max memory.
2019-05-21 16:03:46,426	INFO node.py:483 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-05-21_16-03-46_18437/logs.
2019-05-21 16:03:46,427	WARNING services.py:1304 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 5238738944 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
2019-05-21 16:03:46,427	INFO services.py:1427 -- Starting the Plasma object store with 6.0 GB memory using /tmp.
UserWarning: Distributing <class 'list'> object. This may take some time.
UserWarning: `DataFrame.to_parquet` defaulting to pandas implementation.
To request implementation, send an email to feature_requests@modin.org.
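
One way to check programmatically whether a call falls back to pandas is to record warnings around it and look for the message shown above. This is only a minimal sketch: the helper name `falls_back_to_pandas` and the stand-in `fake_to_parquet` are hypothetical, and the sketch assumes the fallback is always reported as a `UserWarning` containing the text "defaulting to pandas", as in the output above.

```python
import warnings


def falls_back_to_pandas(operation):
    """Run operation() and report whether a 'defaulting to pandas' UserWarning was emitted.

    Hypothetical helper: it only inspects warning text and does not touch modin internals.
    """
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, even ones already shown once in this process.
        warnings.simplefilter("always")
        operation()
    return any(
        issubclass(w.category, UserWarning)
        and "defaulting to pandas" in str(w.message)
        for w in caught
    )


def fake_to_parquet():
    """Stand-in that emits the same warning text as in the report (not a modin call)."""
    warnings.warn(
        "`DataFrame.to_parquet` defaulting to pandas implementation.", UserWarning
    )


print(falls_back_to_pandas(fake_to_parquet))  # True
print(falls_back_to_pandas(lambda: None))     # False
```

With modin installed, the stand-in could be replaced by something like `lambda: df.to_parquet('/tmp/tmp.parquet')`, using the `df` from the snippet above.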

Maybe modin could be added to conda-forge so that conda can help with resolving version dependencies?

Labels: P0 (highest priority tasks requiring immediate fix), documentation 📜 (updates and issues with the documentation), question ❓ (questions about Modin)
