Skip to content

Added capability to shuffle when splitting a dataframe.#5980

Merged
TomAugspurger merged 6 commits intodask:masterfrom
petiop:add-shuffle-to-split
Mar 6, 2020
Merged

Added capability to shuffle when splitting a dataframe.#5980
TomAugspurger merged 6 commits intodask:masterfrom
petiop:add-shuffle-to-split

Conversation

@petiop
Copy link
Copy Markdown
Contributor

@petiop petiop commented Mar 5, 2020

This is the first part of our internal implementation extending dask-ml's train_test_split to shuffle dataframes within partitions (second commit will be in dask-ml). Thought we give it a shot at a contribution.

  • Tests added / passed
  • Passes black dask / flake8 dask

Copy link
Copy Markdown
Member

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this, it seems reasonable. Can you add tests?

@petiop
Copy link
Copy Markdown
Contributor Author

petiop commented Mar 6, 2020

Thanks for this, it seems reasonable. Can you add tests?

Thanks for the feedback @TomAugspurger.

I extended the test that is closest to the changes made (the one testing df.random_split). I did not find any unit tests directly testing pd_split. Are those the one you'd like me to add?

- Set `shuffle` param's default to `False` for backward compatibility.
- Changed the shuffling technique.
np.testing.assert_array_equal(a.index, sorted(a.index))

a, b = d.random_split([0.5, 0.5], 42, False)
np.testing.assert_array_equal(a.index, sorted(a.index))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this assert up to line 1605? a and b look the same as up there.


a, b = d.random_split([0.5, 0.5], 42, False)
np.testing.assert_array_equal(a.index, sorted(a.index))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add a test to ensure the random state is passed through correctly?

a1, b1 = d.random_split([0.5, 0.5], random_state=42, shuffle=True)
a2, b2 = d.random_split([0.5, 0.5], random_state=42, shuffle=True)

assert_eq(a1, a2)
assert_eq(b1, b2)

Comment on lines +5456 to +5457
if not isinstance(random_state, np.random.RandomState):
random_state = np.random.RandomState(random_state)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be removed?

Suggested change
if not isinstance(random_state, np.random.RandomState):
random_state = np.random.RandomState(random_state)

It seems like random_state is always supplied. If so, I'd recommend making it a required argument.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If those lines are removed, random_state is a numpy array and pandas.core.common.random_state raises as it's expecting integer, np.random.RandomState, or None

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, thanks. I think pandas is being too strict there. I opened pandas-dev/pandas#32503

Copy link
Copy Markdown
Member

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @petiop!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants