Support Dask Dataframes in Hyperband by mrocklin · Pull Request #701 · dask/dask-ml

mrocklin · 2020-07-18T22:30:27Z

It doesn't look like we actually use exact chunk sizes anywhere, so this should be ok?

I may be wrong though. Regardless there is probably still some cleanup to do here. I thought I'd push up something early for feedback though.

cc @stsievert

stsievert

I don't have any issues or qualms with this PR, especially if the tests pass. The failing test looks relevant, but only because we were performing an unnecessary check.

stsievert · 2020-07-18T23:42:16Z

dask_ml/model_selection/_incremental.py

-        X_train, X_test, y_train, y_test = self._get_train_test_split(X, y)
+
+        X_train, X_test, y_train, y_test = self._get_train_test_split(
+            X, y, shuffle=True


Is this shuffle=True needed? There's a modification above where shuffle=True is hard-coded.

This method warns when given a dataframe and shuffle= is not specified. We need to choose some default. Would False be a better choice?

Ah, I see. I prefer shuffle=True because it's the right choice from an optimization perspective. For example, what if the partitions of the Dask DataFrame are CSVs, one per state? Calling partial_fit on any one CSV is not the right answer if each CSV is very different.

I think that in this function shuffle is only per-partition, not between partitions. I think that we would leave inter-partition shuffling to the user, ahead of sending the data to Hyperband.

> I think that in this function shuffle is only per-partition, not between partitions.

That's not true; it looks like the default value of blockwise depends on whether it's a Dask Array or DataFrame:

dask-ml/dask_ml/model_selection/_split.py

Lines 392 to 395 in fc6a04a

    The default behavior depends on the types in arrays. For Dask Arrays,
    the default is True (data are not shuffled between blocks). For Dask
    DataFrames, the default and only allowed value is False (data are
    shuffled between blocks).

But that's unrelated to this PR. I think shuffle=True is a good choice in case the data is ordered. Maybe we should make _get_train_test_split public?

> Maybe we should make _get_train_test_split public?

What's the motivation for that?

In case someone wants a specific train/test split. Maybe they have time series and don't want shuffle=True or blockwise=True? Or maybe they want the test set to be a specific chunk from the Dask Array?

shuffle=True, blockwise=True was the default behavior on master, and that's still what we have?

shuffle=True is definitely the behavior on master; shuffle=False isn't supported (_split.py#L493, _split.py#L477), and blockwise is False for DataFrames, and True for Arrays. I think that's the right behavior.

I need to investigate blockwise some; I'm not sure if there's a mixup in the documentation or implementation.

I think blockwise is backwards for DataFrame in the documentation & implementation. It only supports the equivalent of blockwise=True (within-block shuffling).

> I think blockwise is backwards for DataFrame in the documentation & implementation. It only supports the equivalent of blockwise=True (within-block shuffling).

I'm a little confused. Should the implementation support shuffling between chunks or within chunks? I'd like to see train_test_split support shuffling between chunks. Currently, train_test_split only supports shuffling within chunks (blockwise=True):

    import numpy as np
    import pandas as pd
    import dask.dataframe as dd
    from dask_ml.model_selection import train_test_split

    N = 1000
    df = pd.DataFrame({"x": np.arange(N), "y": np.arange(N)})
    ddf = dd.from_pandas(df, npartitions=2)

    kwargs = dict(random_state=0, train_size=0.5)
    train, test = train_test_split(ddf, blockwise=True, **kwargs)
    assert train.compute().shape == (530, 2)

    train, test = train_test_split(ddf, blockwise=False, **kwargs)  # raises NotImplementedError

> Should the implementation support shuffling between chunks or within chunks?

Ideally either. But that's distinct from this PR.
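For readers following the terminology, the two kinds of shuffling discussed in this thread can be sketched in plain Python. This is an illustration under stated assumptions, not how dask-ml implements either mode: lists stand in for partitions, and both function names are made up for the example.

```python
import random


def shuffle_within_blocks(blocks, seed=0):
    """Shuffle rows inside each block; no row ever leaves its block."""
    rng = random.Random(seed)
    return [rng.sample(block, len(block)) for block in blocks]


def shuffle_between_blocks(blocks, seed=0):
    """Pool all rows, shuffle once, then cut back into blocks of the original sizes."""
    rng = random.Random(seed)
    pooled = [row for block in blocks for row in block]
    rng.shuffle(pooled)
    out, start = [], 0
    for block in blocks:
        out.append(pooled[start:start + len(block)])
        start += len(block)
    return out


# Two internally homogeneous blocks, like one CSV per state.
blocks = [[("CA", i) for i in range(5)], [("NY", i) for i in range(5)]]
within = shuffle_within_blocks(blocks)
between = shuffle_between_blocks(blocks)
```

After the within-block shuffle, each block still holds rows from a single state, which is the situation the per-state CSV comment worries about for partial_fit. Only the between-block shuffle can mix states across blocks.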

mrocklin · 2020-07-19T02:26:48Z

> I don't have any issues or qualms with this PR, especially if the tests pass. The failing test looks relevant, but only because we were performing an unnecessary check.

Ah indeed. Fixed I think.

Interestingly, tests also pass if I remove the to_dask_array lines. I'm curious: do we know what support is like for passing pandas DataFrames around to models? Is this something that we would want to leave for the user?

TomAugspurger · 2020-07-20T13:26:08Z

The sklearn dev failure can be ignored. I need to look into it today or tomorrow.

Planning to merge later today, assuming the discussion in https://github.com/dask/dask-ml/pull/701/files/485ff530bca48eb6051591c75dbe6a5965d42333#diff-0a54ded74c8013caf0588e9a5c879e99 has been resolved.

TomAugspurger · 2020-07-21T19:02:04Z

Thanks! Apologies for the incorrect suggestion.

mrocklin added 3 commits July 18, 2020 15:28

Support Dask Dataframes in Hyperband

e94c875

co-persist training and testing data

8800d61

black

485ff53

mrocklin added a commit to coiled/coiled-examples that referenced this pull request Jul 18, 2020

Add hyper-pararmeter-optimization notebook with Hyperband

1f97cc2

Currently depends on dask/dask-ml#701. This could be improved by using an estimator that benefitted from large amounts of data.

mrocklin mentioned this pull request Jul 18, 2020

Add hyper-pararmeter-optimization notebook with Hyperband coiled/coiled-examples#1

Merged

stsievert reviewed Jul 18, 2020

Don't check arrays without need

506be91

isort

da4a9b9

TomAugspurger reviewed Jul 21, 2020

mrocklin and others added 2 commits July 21, 2020 07:35

Update dask_ml/model_selection/_incremental.py

9465a2c

Co-authored-by: Tom Augspurger <TomAugspurger@users.noreply.github.com>

add back in X, y

bcc9753

TomAugspurger merged commit 382bbb6 into dask:master Jul 21, 2020

mrocklin deleted the hyperband-dataframe branch July 21, 2020 21:42

TomAugspurger mentioned this pull request Aug 5, 2020

Support DataFrame in IncrementalSearchCV #628

Closed
