Skip to content

train_test_split shuffle DataFrame partitions#625

Merged
TomAugspurger merged 6 commits intodask:masterfrom
petiop:train-test-split-dataframe-shuffle
Mar 20, 2020
Merged

train_test_split shuffle DataFrame partitions#625
TomAugspurger merged 6 commits intodask:masterfrom
petiop:train-test-split-dataframe-shuffle

Conversation

@petiop
Copy link
Copy Markdown
Contributor

@petiop petiop commented Mar 11, 2020

  • Passed train_test_split's shuffle parameter onto the dask.DataFrame.random_split function to take advantage of the behaviour introduced in dask/dask#5980.
  • Added error when blockwise is set to False for a DataFrame.
  • Added test case for the new behaviour of train_test_split on DataFrames.

  `dask.DataFrame.random_split` function to take advantage of the
  behaviour introduced in #5980.
- Added error when `blockwise` is set to `False` for a `DataFrame`.
- Added test case for the new behaviour of `train_test_split` on
  `DataFrame`s.
Copy link
Copy Markdown
Member

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

I think you'll need to only pass shuffle=shuffle when a new enough Dask version is detected (and perhaps raise if a non-default value is specified).

@petiop
Copy link
Copy Markdown
Contributor Author

petiop commented Mar 11, 2020

Should I wait for a new version of Dask to be release before moving forward with this?

@TomAugspurger
Copy link
Copy Markdown
Member

TomAugspurger commented Mar 11, 2020 via email

@petiop
Copy link
Copy Markdown
Contributor Author

petiop commented Mar 11, 2020

Dask 2.12 was released on Friday, which I think had the necessary changes. You'll need to check the dask version, and only pass through shuffle if it's >=2.12. You can check dask_ml._compat for similar code.

On Wed, Mar 11, 2020 at 3:37 PM petiop @.***> wrote: Should I wait for a new version of Dask to be release before moving forward with this? — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#625 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAKAOIRUTHYGQSNWLRY6PP3RG7ZBLANCNFSM4LF4GSQA .

Seems like it just missed the 2.12 release (changelog). I'll set things up according to the _compat code and push the change on the next one.

@petiop
Copy link
Copy Markdown
Contributor Author

petiop commented Mar 11, 2020

Thanks.

I think you'll need to only pass shuffle=shuffle when a new enough Dask version is detected (and perhaps raise if a non-default value is specified).

Should I also raise a warning re the behaviour change since the result with default params will be different between the versions?

@TomAugspurger
Copy link
Copy Markdown
Member

Gotcha. We do test against Dask master, so you can have your checks be >= 2.13.0.

Should I also raise a warning re the behaviour change since the result with default params will be different between the versions?

We should change the default shuffle for train_test_split to be None.

For arrays, we have no change. If it's none just set it to True and continue as usual.
For DataFrames, we'll emit a warning saying that it'll change to true in the future, and to specify shuffle=False if you want to retain the previous behavior.

petiop added 3 commits March 16, 2020 11:55
- Adds version-dependent logic for DataFrame shuffling.
- Adds future warning that `shuffle` will default to `True`.
Copy link
Copy Markdown
Member

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update.

import itertools
import logging
import numbers
from distutils.version import LooseVersion
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use packaging.version.parse.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this should all be centralized in dask_ml._compat. If something isn't there that you need (like DASK_213) go ahead and define it.

if DASK_GT_2130:
if shuffle is None:
shuffle = False
if not sys.warnoptions:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this? Typically we would just warn, and the users can filter it using the usual filters.

if shuffle is None:
shuffle = False
if not sys.warnoptions:
import warnings
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import this at the top of the file.

petiop added 2 commits March 17, 2020 10:24
- Exapands the `FutureWarning` message.
- Cleans up the DataFrame `random_split` process.
@petiop
Copy link
Copy Markdown
Contributor Author

petiop commented Mar 18, 2020

Not too sure how to interpret the failed checks. Any pointers will be greatly appreciated

@TomAugspurger
Copy link
Copy Markdown
Member

I don't know why circleci was run. We don't use it anymore.

The other failure looks like a spurious failure. Don't worry about it for now.

@TomAugspurger
Copy link
Copy Markdown
Member

I've (re)disabled circleci on this project, and the flaky test has resolved itself. Thanks @petiop

@TomAugspurger TomAugspurger merged commit f512bf1 into dask:master Mar 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants