[Data] Information loss when writing parquet for numpy array with all NaNs #59087
Closed
Labels
P1, bug, data, triage
Description
What happened + What you expected to happen
When a ray.data.Dataset is created from a pandas DataFrame in which a column holds NumPy arrays filled entirely with NaN, writing it to Parquet and reading it back turns the cells into None instead of preserving the NaN values. This happens only when the cells in that column are (1) NumPy arrays, (2) filled entirely with np.nan, and (3) all of the same length. In other cases (the cells are Python lists, some arrays contain a non-NaN value, or the arrays differ in length), the data is written correctly.
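The three conditions above can be checked on the pandas side before writing. The following is only a sketch: `is_affected_column` is a hypothetical helper name, not part of Ray or pandas, and it encodes the conditions as described in this report rather than anything confirmed about Ray's internals.

```python
import numpy as np
import pandas as pd


def is_affected_column(series: pd.Series) -> bool:
    """Hypothetical helper: True if the column matches all three reported conditions."""
    cells = list(series)
    if not cells:
        return False
    # Condition (1): every cell is a numpy array.
    if not all(isinstance(c, np.ndarray) for c in cells):
        return False
    # Condition (3): all arrays have the same shape (and therefore the same length).
    if len({c.shape for c in cells}) != 1:
        return False
    # Condition (2): every array is a float array containing only NaN.
    return all(c.dtype.kind == "f" and bool(np.isnan(c).all()) for c in cells)


df = pd.DataFrame(
    {
        # Matches all three conditions from the report.
        "all_nan": [np.array([np.nan, np.nan]), np.array([np.nan, np.nan])],
        # One array has a non-NaN value, so condition (2) fails.
        "some_values": [np.array([np.nan, 1.0]), np.array([np.nan, np.nan])],
        # Cells are Python lists, so condition (1) fails.
        "lists": [[np.nan, np.nan], [np.nan, np.nan]],
    }
)
flagged = [name for name in df.columns if is_affected_column(df[name])]
print(flagged)
```

Only the `all_nan` column should be flagged here; the other two fall into the cases the report describes as being written correctly.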
Versions / Dependencies
Python: 3.11.11
ray: 2.51.1
pandas: 2.3.3
numpy: 1.26.4
pyarrow: 22.0.0 (also tested 18.1.0)
Reproduction script
import ray
import pandas as pd
import numpy as np
df = pd.DataFrame({"foo": [np.array([np.nan, np.nan]), np.array([np.nan, np.nan])]})
fpath = "foo.parquet"
ds = ray.data.from_pandas(df)
ds.write_parquet(fpath, mode="overwrite")
ray.data.read_parquet(fpath).to_pandas()
Issue Severity
Medium: It is a significant difficulty but I can work around it.
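One possible workaround, based only on the observation above that columns whose cells are Python lists are written correctly, is to convert the array cells to lists before calling `ray.data.from_pandas`. This is a sketch under that assumption: `arrays_to_lists` is a hypothetical helper name, and the Ray calls are shown as comments so the snippet runs with pandas and numpy alone.

```python
import numpy as np
import pandas as pd


def arrays_to_lists(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Hypothetical helper: return a copy of df with numpy-array cells in `column` turned into lists."""
    out = df.copy()
    # tolist() keeps NaN as float('nan'); cells that are not arrays are left unchanged.
    out[column] = out[column].map(
        lambda cell: cell.tolist() if isinstance(cell, np.ndarray) else cell
    )
    return out


df = pd.DataFrame({"foo": [np.array([np.nan, np.nan]), np.array([np.nan, np.nan])]})
fixed = arrays_to_lists(df, "foo")

# Then write as in the reproduction script (requires ray):
# ds = ray.data.from_pandas(fixed)
# ds.write_parquet("foo.parquet", mode="overwrite")
```

The conversion changes the cell type seen by Ray, so the values read back may not have the same Python type as the original cells.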