train_test_split shuffle DataFrame partitions#625
Conversation
`dask.DataFrame.random_split` function to take advantage of the behaviour introduced in #5980. - Added error when `blockwise` is set to `False` for a `DataFrame`. - Added test case for the new behaviour of `train_test_split` on `DataFrame`s.
|
Should I wait for a new version of Dask to be release before moving forward with this? |
|
Dask 2.12 was released on Friday, which I think had the necessary changes.
You'll need to check the dask version, and only pass through shuffle if
it's >=2.12.
You can check dask_ml._compat for similar code.
…On Wed, Mar 11, 2020 at 3:37 PM petiop ***@***.***> wrote:
Should I wait for a new version of Dask to be release before moving
forward with this?
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#625 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAKAOIRUTHYGQSNWLRY6PP3RG7ZBLANCNFSM4LF4GSQA>
.
|
Seems like it just missed the 2.12 release (changelog). I'll set things up according to the _compat code and push the change on the next one. |
Should I also raise a warning re the behaviour change since the result with default params will be different between the versions? |
|
Gotcha. We do test against Dask master, so you can have your checks be >= 2.13.0.
We should change the default shuffle for train_test_split to be None. For arrays, we have no change. If it's none just set it to True and continue as usual. |
- Adds version-dependent logic for DataFrame shuffling. - Adds future warning that `shuffle` will default to `True`.
dask_ml/model_selection/_split.py
Outdated
| import itertools | ||
| import logging | ||
| import numbers | ||
| from distutils.version import LooseVersion |
There was a problem hiding this comment.
We use packaging.version.parse.
There was a problem hiding this comment.
Also, this should all be centralized in dask_ml._compat. If something isn't there that you need (like DASK_213) go ahead and define it.
dask_ml/model_selection/_split.py
Outdated
| if DASK_GT_2130: | ||
| if shuffle is None: | ||
| shuffle = False | ||
| if not sys.warnoptions: |
There was a problem hiding this comment.
What is this? Typically we would just warn, and the users can filter it using the usual filters.
dask_ml/model_selection/_split.py
Outdated
| if shuffle is None: | ||
| shuffle = False | ||
| if not sys.warnoptions: | ||
| import warnings |
There was a problem hiding this comment.
Import this at the top of the file.
- Exapands the `FutureWarning` message. - Cleans up the DataFrame `random_split` process.
|
Not too sure how to interpret the failed checks. Any pointers will be greatly appreciated |
|
I don't know why circleci was run. We don't use it anymore. The other failure looks like a spurious failure. Don't worry about it for now. |
|
I've (re)disabled circleci on this project, and the flaky test has resolved itself. Thanks @petiop |
train_test_split's shuffle parameter onto thedask.DataFrame.random_splitfunction to take advantage of the behaviour introduced in dask/dask#5980.blockwiseis set toFalsefor aDataFrame.train_test_splitonDataFrames.