Raise and avoid data loss of meta provided to P2P shuffle is wrong#8520
Raise and avoid data loss of meta provided to P2P shuffle is wrong#8520
Conversation
| ddf = dd.from_delayed( | ||
| [data_gen()] * 2, meta=[("a", int), ("b", int)], verify_meta=False | ||
| ) | ||
|
|
||
| with raises_with_cause( | ||
| RuntimeError, | ||
| r"shuffling \w* failed", | ||
| ValueError, | ||
| "meta", | ||
| ): | ||
| await c.gather(c.compute(ddf.shuffle(on="a"))) |
There was a problem hiding this comment.
On main this indeed causes data loss and is returning an empty frame. While the from_delayed path above might be slightly artificial I don't think it's that difficult to produce a slight meta drift in general.
There was a problem hiding this comment.
read_csv is an example where this could happen in real world use cases
Unit Test ResultsSee test report for an extended history of previous test failures. This is useful for diagnosing flaky tests. 27 files ± 0 27 suites ±0 9h 54m 19s ⏱️ - 3m 46s For more details on these failures, see this check. Results for commit 67eb545. ± Comparison against base commit a5a6e99. |
Closes #8519
The
KeyErrorwas used for control flow but I believe one should use dedicated exception types for this use case and keep the std exceptions clear for error reporting. This logic swallowed a very obvious error.I'll see if I can reproduce #8519 even with artificial input somehow. Even if I don't have a reproducer I believe this change is sensible.