[Data] Information loss when writing parquet for numpy array with all NaNs #59087

@cqiao-cmt

Description

What happened + What you expected to happen

When writing a ray.data.Dataset (converted from a pandas DataFrame) to parquet, a column whose cells are numpy arrays filled entirely with NaN comes back as None when the parquet file is read. This happens only when the cells in that column are (1) numpy arrays, (2) filled entirely with np.nan, and (3) all of the same length. In other cases, e.g., when the cells are Python lists, when some of the arrays contain a non-NaN value, or when the arrays have different lengths, the data is written correctly.
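The three conditions can be illustrated with plain pandas/NumPy constructions (a sketch based on the description above; only the first frame is reported to trigger the bug):

```python
import numpy as np
import pandas as pd

# Triggers the bug: every cell is an ndarray, all values are NaN,
# and all arrays share the same length (2 here).
bad = pd.DataFrame({"foo": [np.array([np.nan, np.nan]),
                            np.array([np.nan, np.nan])]})

# Not affected: cells are Python lists rather than ndarrays.
ok_lists = pd.DataFrame({"foo": [[np.nan, np.nan], [np.nan, np.nan]]})

# Not affected: at least one non-NaN value is present.
ok_value = pd.DataFrame({"foo": [np.array([1.0, np.nan]),
                                 np.array([np.nan, np.nan])]})

# Not affected: the arrays have different lengths (ragged column).
ok_ragged = pd.DataFrame({"foo": [np.array([np.nan]),
                                  np.array([np.nan, np.nan])]})
```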

Versions / Dependencies

Python: 3.11.11
ray: 2.51.1
pandas: 2.3.3
numpy: 1.26.4
pyarrow: 22.0.0 (also tested 18.1.0)

Reproduction script

import ray
import pandas as pd
import numpy as np

df = pd.DataFrame({"foo": [np.array([np.nan, np.nan]), np.array([np.nan, np.nan])]})
fpath = "foo.parquet"
ds = ray.data.from_pandas(df)
ds.write_parquet(fpath, mode="overwrite")
# The "foo" cells are read back as None instead of array([nan, nan]).
print(ray.data.read_parquet(fpath).to_pandas())
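Since the report notes that list cells are written correctly, one possible workaround is to convert the ndarray cells to Python lists before building the Dataset (a sketch, untested beyond the conditions described above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"foo": [np.array([np.nan, np.nan]),
                           np.array([np.nan, np.nan])]})

# Workaround sketch: convert the ndarray cells to plain Python lists
# before calling ray.data.from_pandas, since list cells round-trip
# through parquet correctly per the description above.
df["foo"] = df["foo"].map(list)
```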

Issue Severity

Medium: It is a significant difficulty but I can work around it.

Labels

- P1: Issue that should be fixed within a few weeks
- bug: Something that is supposed to be working; but isn't
- data: Ray Data-related issues
- triage: Needs triage (eg: priority, bug/not-bug, and owning component)
