Conversation
Unit Test ResultsSee test report for an extended history of previous test failures. This is useful for diagnosing flaky tests. 24 files ± 0 24 suites ±0 10h 33m 13s ⏱️ - 25m 23s For more details on these failures, see this check. Results for commit 50238de. ± Comparison against base commit fef78d4. ♻️ This comment has been updated with latest results. |
| if pd.api.types.is_complex_dtype(column): | ||
| raise TypeError( | ||
| f"p2p does not support data of type '{column.dtype}' found in column '{name}'." | ||
| ) | ||
| # FIXME: Serializing custom objects to PyArrow is not supported in P2P shuffling | ||
| if pd.api.types.is_object_dtype(column): | ||
| raise TypeError( | ||
| f"p2p does not support custom objects found in column '{name}'." | ||
| ) | ||
| # FIXME: PyArrow does not support sparse data: https://issues.apache.org/jira/browse/ARROW-8679 | ||
| if pd.api.types.is_sparse(column): | ||
| raise TypeError("p2p does not support sparse data found in column '{name}'") |
There was a problem hiding this comment.
Part of me likes the explicitness of these checks, but I wonder if they've got full coverage. For example, what happens if someone if using their own custom extension array dtype? My guess is this check wouldn't raise an error and shuffling would fail (though I've not confirmed this).
Maybe a there's a simple try/except we can do here instead to check that shuffling doesn't work? Would not being able to roundtrip through the shuffle-specific serialization be a sufficient check?
There was a problem hiding this comment.
You bring up a good point. Let me check what I need to do to trigger reliably trigger an error that we could utilize.
There was a problem hiding this comment.
Also, if it's not straightforward to come up with such a check, we can totally include what you have here and follow-up as needed. The current set of changes are already an improvement over the current main branch
There was a problem hiding this comment.
can't we just try to roundtrip meta? That would not be such a nice message but it would fail early
There was a problem hiding this comment.
can't we just try to roundtrip
meta? That would not be such a nice message but it would fail early
I've tried this before going out on holidays and it detected some issues but not all. IIRC, the main problem was object requiring actual data. We could combine explicit checks with the roundtrip though. That would give us nice error messages for some stuff and a fallback for anything we haven't explicitly checked.
There was a problem hiding this comment.
There is also df._meta_nonempty for cases like this. Might not catch super funky stuff but that's ok from my POV. After all, the shuffle will fail either way it's just about when it fails, isn't it?
There was a problem hiding this comment.
I'll take a look at df._meta_nonempty. It's both about failing early and providing useful feedback to the user.
fjetter
left a comment
There was a problem hiding this comment.
please ping again once we should merge. I would like to see a roundtrip test but I'm also fine with the current state
P2PShuffleserialization for categorical data #7410pre-commit run --all-files