Added capability to shuffle when splitting a dataframe. by petiop · Pull Request #5980 · dask/dask

petiop · 2020-03-05T21:26:33Z

This is the first part of our internal implementation extending dask-ml's train_test_split to shuffle dataframes within partitions (the second commit will be in dask-ml). We thought we'd give contributing it a shot.

Tests added / passed
Passes black dask / flake8 dask
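To make the idea concrete, here is a minimal pandas-only sketch of splitting a single partition with an optional shuffle. It is not the code from this PR: `split_partition` is a hypothetical helper, and drawing row labels with `RandomState.choice` is an assumption made for illustration.

```python
import numpy as np
import pandas as pd


def split_partition(df, fractions, random_state=None, shuffle=False):
    """Split one pandas partition into len(fractions) pieces (hypothetical helper)."""
    # Accept an int seed, None, or an existing RandomState.
    if not isinstance(random_state, np.random.RandomState):
        random_state = np.random.RandomState(random_state)
    if shuffle:
        # Reorder the rows of this partition before assigning them to pieces.
        df = df.sample(frac=1.0, random_state=random_state)
    # Draw one piece label per row, with probabilities given by `fractions`.
    labels = random_state.choice(len(fractions), size=len(df), p=fractions)
    return [df.iloc[labels == i] for i in range(len(fractions))]


df = pd.DataFrame({"x": np.arange(100)})
a, b = split_partition(df, [0.5, 0.5], random_state=42, shuffle=True)
c, d = split_partition(df, [0.5, 0.5], random_state=42, shuffle=False)
```

With `shuffle=False` each piece keeps the original row order; with `shuffle=True` the rows inside the partition are permuted first, so the pieces still cover every row exactly once but in a different order.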

TomAugspurger

Thanks for this, it seems reasonable. Can you add tests?

dask/dataframe/core.py

petiop · 2020-03-06T15:14:51Z

Thanks for this, it seems reasonable. Can you add tests?

Thanks for the feedback @TomAugspurger.

I extended the test closest to the changes (the one testing df.random_split). I did not find any unit tests that test pd_split directly. Are those the ones you'd like me to add?

TomAugspurger · 2020-03-06T15:24:09Z

dask/dataframe/tests/test_dataframe.py

+        np.testing.assert_array_equal(a.index, sorted(a.index))
+
+    a, b = d.random_split([0.5, 0.5], 42, False)
+    np.testing.assert_array_equal(a.index, sorted(a.index))


Move this assert up to line 1605? a and b look the same as up there.
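The assertion under discussion checks that an unshuffled split keeps the index in sorted order. A rough pandas-level sketch of that property follows; the boolean-mask split is an assumption for illustration, not dask's implementation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(20)}, index=range(20))

# Unshuffled split: select rows with a random boolean mask.
rng = np.random.RandomState(42)
mask = rng.random_sample(len(df)) < 0.5
a, b = df[mask], df[~mask]

# Mask selection preserves row order, so each piece's index stays sorted.
np.testing.assert_array_equal(a.index, sorted(a.index))
np.testing.assert_array_equal(b.index, sorted(b.index))

# A shuffle permutes the rows; the set of index labels is unchanged.
shuffled = df.sample(frac=1.0, random_state=42)
```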

TomAugspurger · 2020-03-06T15:25:19Z

dask/dataframe/tests/test_dataframe.py

+
+    a, b = d.random_split([0.5, 0.5], 42, False)
+    np.testing.assert_array_equal(a.index, sorted(a.index))
+


Can you also add a test to ensure the random state is passed through correctly?

    a1, b1 = d.random_split([0.5, 0.5], random_state=42, shuffle=True)
    a2, b2 = d.random_split([0.5, 0.5], random_state=42, shuffle=True)
    assert_eq(a1, a2)
    assert_eq(b1, b2)
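The same reproducibility property can be checked at the pandas level, where the shuffle within a partition would happen. This is only a sketch: it assumes the shuffle is done with `DataFrame.sample(frac=1.0, ...)`, which the thread does not confirm.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(10)})

# Two independent RandomState objects built from the same seed
# should produce the same permutation of the rows.
s1 = df.sample(frac=1.0, random_state=np.random.RandomState(42))
s2 = df.sample(frac=1.0, random_state=np.random.RandomState(42))

pd.testing.assert_frame_equal(s1, s2)
```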

TomAugspurger · 2020-03-06T19:42:42Z

dask/dataframe/core.py

+        if not isinstance(random_state, np.random.RandomState):
+            random_state = np.random.RandomState(random_state)


Can this be removed?

Suggested change

        if not isinstance(random_state, np.random.RandomState):
            random_state = np.random.RandomState(random_state)

It seems like random_state is always supplied. If so, I'd recommend making it a required argument.

petiop · 2020-03-06

If those lines are removed, random_state is a NumPy array, and pandas.core.common.random_state raises because it expects an integer, np.random.RandomState, or None.

TomAugspurger · 2020-03-06

Gotcha, thanks. I think pandas is being too strict there. I opened pandas-dev/pandas#32503
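The exchange above comes down to normalizing the seed before handing it to pandas. A minimal sketch, assuming the per-partition state arrives as an array of unsigned integers (`as_random_state` is a hypothetical name, not a dask or pandas function):

```python
import numpy as np


def as_random_state(seed):
    """Return a RandomState for an int, None, array seed, or RandomState (hypothetical helper)."""
    if isinstance(seed, np.random.RandomState):
        return seed
    # np.random.RandomState accepts None, an int, or an array of ints as a seed.
    return np.random.RandomState(seed)


# Assumed shape of the state data: an array of unsigned 32-bit integers.
state_data = np.array([1, 2, 3], dtype=np.uint32)
rs = as_random_state(state_data)
```

Wrapping the array this way gives pandas an `np.random.RandomState`, which is one of the types the comment above says it accepted at the time.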

TomAugspurger

Thanks @petiop!

Added capability to shuffle when splitting a dataframe.

fedda30

TomAugspurger reviewed Mar 5, 2020

Addressed comments:

8c8a4ea

- Set `shuffle` param's default to `False` for backward compatibility.
- Changed the shuffling technique.

TomAugspurger reviewed Mar 6, 2020

petiop added 4 commits March 6, 2020 11:00

Fixed a bug introduced by not processing the random state properly before passing it to pandas.

0473ac9

Fixed a bug introduced by not processing the random state properly before passing it to pandas.

fd8773c

Expanded test_random_partitions test case

c3fe7d5

Cleaned up the test case

62a56d8

TomAugspurger reviewed Mar 6, 2020

TomAugspurger approved these changes Mar 6, 2020

TomAugspurger merged commit fb3203a into dask:master Mar 6, 2020

petiop mentioned this pull request Mar 11, 2020

train_test_split shuffle DataFrame partitions dask/dask-ml#625

Merged

TomAugspurger merged 6 commits into dask:master from petiop:add-shuffle-to-split
