Added capability to shuffle when splitting a dataframe.#5980
Added capability to shuffle when splitting a dataframe.#5980TomAugspurger merged 6 commits intodask:masterfrom
Conversation
TomAugspurger
left a comment
There was a problem hiding this comment.
Thanks for this, it seems reasonable. Can you add tests?
Thanks for the feedback @TomAugspurger. I extended the test that is closest to the changes made (the one testing |
- Set `shuffle` param's default to `False` for backward compatibility. - Changed the shuffling technique.
| np.testing.assert_array_equal(a.index, sorted(a.index)) | ||
|
|
||
| a, b = d.random_split([0.5, 0.5], 42, False) | ||
| np.testing.assert_array_equal(a.index, sorted(a.index)) |
There was a problem hiding this comment.
Move this assert up to line 1605? a and b look the same as up there.
|
|
||
| a, b = d.random_split([0.5, 0.5], 42, False) | ||
| np.testing.assert_array_equal(a.index, sorted(a.index)) | ||
|
|
There was a problem hiding this comment.
Can you also add a test to ensure the random state is passed through correctly?
a1, b1 = d.random_split([0.5, 0.5], random_state=42, shuffle=True)
a2, b2 = d.random_split([0.5, 0.5], random_state=42, shuffle=True)
assert_eq(a1, a2)
assert_eq(b1, b2)before passing it to pandas.
before passing it to pandas.
| if not isinstance(random_state, np.random.RandomState): | ||
| random_state = np.random.RandomState(random_state) |
There was a problem hiding this comment.
Can this be removed?
| if not isinstance(random_state, np.random.RandomState): | |
| random_state = np.random.RandomState(random_state) |
It seems like random_state is always supplied. If so, I'd recommend making it a required argument.
There was a problem hiding this comment.
If those lines are removed, random_state is a numpy array and pandas.core.common.random_state raises as it's expecting integer, np.random.RandomState, or None
There was a problem hiding this comment.
Gotcha, thanks. I think pandas is being too strict there. I opened pandas-dev/pandas#32503
This is the first part of our internal implementation extending dask-ml's
train_test_splitto shuffle dataframes within partitions (second commit will be in dask-ml). Thought we give it a shot at a contribution.black dask/flake8 dask