Fix P2PShuffle serialization for categorical data#7410
Fix P2PShuffle serialization for categorical data#7410jrbourbeau merged 26 commits intodask:mainfrom
P2PShuffle serialization for categorical data#7410Conversation
Unit Test ResultsSee test report for an extended history of previous test failures. This is useful for diagnosing flaky tests. 22 files + 10 22 suites +10 9h 23m 22s ⏱️ + 5h 3m 45s For more details on these failures, see this check. Results for commit 5fb858d. ± Comparison against base commit 3ac8631. ♻️ This comment has been updated with latest results. |
|
TODO: I still need to improve testing, e.g. by extending |
| # f"col{next(counter)}": pd.array( | ||
| # [np.nan, np.nan, 1.0, np.nan, np.nan] * 20, | ||
| # dtype="Sparse[float64]", | ||
| # ), |
There was a problem hiding this comment.
raises TypeError: Sparse pandas data (column col27) not supported.
There was a problem hiding this comment.
| f"col{next(counter)}": pd.array(range(100), dtype="float16"), | ||
| f"col{next(counter)}": pd.array(range(100), dtype="float32"), | ||
| f"col{next(counter)}": pd.array(range(100), dtype="float64"), | ||
| # f"col{next(counter)}": pd.array(range(100), dtype="csingle"), |
There was a problem hiding this comment.
raises pyarrow.lib.ArrowNotImplementedError: Unsupported numpy type 14
There was a problem hiding this comment.
Looks like arrow doesn't support complex dtypes https://issues.apache.org/jira/browse/ARROW-638
| f"col{next(counter)}": pd.array(range(100), dtype="float32"), | ||
| f"col{next(counter)}": pd.array(range(100), dtype="float64"), | ||
| # f"col{next(counter)}": pd.array(range(100), dtype="csingle"), | ||
| # f"col{next(counter)}": pd.array(range(100), dtype="cdouble"), |
There was a problem hiding this comment.
raises pyarrow.lib.ArrowNotImplementedError: Unsupported numpy type 15
| f"col{next(counter)}": pd.array(range(100), dtype="float64"), | ||
| # f"col{next(counter)}": pd.array(range(100), dtype="csingle"), | ||
| # f"col{next(counter)}": pd.array(range(100), dtype="cdouble"), | ||
| # f"col{next(counter)}": pd.array(range(100), dtype="clongdouble"), |
There was a problem hiding this comment.
raises pyarrow.lib.ArrowNotImplementedError: Unsupported numpy type 16
jrbourbeau
left a comment
There was a problem hiding this comment.
Thanks @hendrikmakait -- I agree that complex, sparse, and object dtypes aren't supported by pyarrow, so raising an informative error message in that case makes sense.
It's be good to also test that pyarrow-backed dtypes also work (e.g. int64[pyarrow], string[pyarrow], etc.)
| f"col{next(counter)}": pd.array(range(100), dtype="float16"), | ||
| f"col{next(counter)}": pd.array(range(100), dtype="float32"), | ||
| f"col{next(counter)}": pd.array(range(100), dtype="float64"), | ||
| # f"col{next(counter)}": pd.array(range(100), dtype="csingle"), |
There was a problem hiding this comment.
Looks like arrow doesn't support complex dtypes https://issues.apache.org/jira/browse/ARROW-638
| # f"col{next(counter)}": pd.array( | ||
| # [np.nan, np.nan, 1.0, np.nan, np.nan] * 20, | ||
| # dtype="Sparse[float64]", | ||
| # ), |
There was a problem hiding this comment.
| # FIXME: distributed#7420 | ||
| # f"col{next(counter)}": pd.array( | ||
| # ["lorem ipsum"] * 100, | ||
| # dtype="string[pyarrow]", | ||
| # ), | ||
| # f"col{next(counter)}": pd.array( | ||
| # ["lorem ipsum"] * 100, | ||
| # dtype=pd.StringDtype("pyarrow"), | ||
| # ), |
There was a problem hiding this comment.
My guess is we're running into pandas-dev/pandas#50074 here
There was a problem hiding this comment.
Do we have access to, or could easily keep track of, the original input dtypes at the point when we convert the pa.Table to a pd.DataFrame?
There was a problem hiding this comment.
Left for follow-up work.
I leave this to a follow-up PR since we currently cast all objects to |
I added a bunch of arrow-based types, hopefully didn't miss any. |
distributed/shuffle/_arrow.py
Outdated
| if str(e) == "Tried reading schema message, was null or length 0": | ||
| return pa.concat_tables(shards) |
There was a problem hiding this comment.
Parsing this error message seems brittle. Is there some other check we can do on file to make sure we're read it all?
There was a problem hiding this comment.
I forgot that we can seek the end of the open file object, I'm using that instead now.
| assert sum(map(len, out.values())) == len(df) | ||
| assert all(v.to_pandas().dtypes.equals(df.dtypes) for v in out.values()) |
There was a problem hiding this comment.
(Not meant as a blocking comment) We're checking lengths and dtypes here. Maybe we could just use pd.testing.assert_frame_equal instead?
Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com>
|
I have incorporated all feedback and CI looks good, this is ready for another review. |
P2PShuffle serialization for categorical data
Closes #7400
pre-commit run --all-files