Support Dask Dataframes in Hyperband#701
Conversation
Currently depends on dask/dask-ml#701 This could be improved by using an estimator that benefitted from large amounts of data.
stsievert
left a comment
There was a problem hiding this comment.
I don't have any issues or qualms with this PR, especially if the tests pass. The failing test looks relevant, but only because we were performing an unnecessary check.
| X_train, X_test, y_train, y_test = self._get_train_test_split(X, y) | ||
|
|
||
| X_train, X_test, y_train, y_test = self._get_train_test_split( | ||
| X, y, shuffle=True |
There was a problem hiding this comment.
Is this shuffle=True needed? There's made a modification above where shuffle=True is hard-coded.
There was a problem hiding this comment.
This method warns when given a dataframe and shuffle= is not specified. We need to choose some default. Would False be a better choice?
There was a problem hiding this comment.
Ah, I see. I prefer shuffle=True because it's the right choice from an optimization perspective. For example, what if the Dask Dataframe are CSVs from each state? Calling partial_fit on any one CSV is not the right answer if each CSV is very different.
There was a problem hiding this comment.
I think that in this function shuffle is only per-partitions, not between partitions. I think that we would leave inter-partition shuffling to the user ahead of sending to Hyperband
There was a problem hiding this comment.
I think that in this function shuffle is only per-partitions, not between partitions.
That's not true; it looks like for default value of blockwise depends on whether it's a Dask Array or DataFrame:
dask-ml/dask_ml/model_selection/_split.py
Lines 392 to 395 in fc6a04a
But that's unrelated to this PR. I think shuffle=True is a good choice in case there's order. Maybe we should make _get_train_test_split public?
There was a problem hiding this comment.
Maybe we should make _get_train_test_split public?
What's the motivation for that?
In case someone wants a specific train/test split. Maybe they have time series and don't want shuffle=True or blockwise=True? Or maybe they want the test set to be a specific chunk from the Dask Array?
There was a problem hiding this comment.
shuffle=True, blockwise=True was the default behavior on master, and that's still what we have?
shuffle=True is definitely the behavior on master; shuffle=False isn't supported (_split.py#L493, _split.py#L477), and blockwise is False for DataFrames, and True for Arrays. I think that's the right behavior.
I need to investigate blockwise some; I'm not sure if there's a mixup in the documentation or implementation.
There was a problem hiding this comment.
I think blockwise is backwards for DataFrame in the documentation & implementation. It only supports the equivalent of blockwise=True (within-block shuffling).
There was a problem hiding this comment.
I think blockwise is backwards for DataFrame in the documentation & implementation. It only supports the equivalent of blockwise=True (within-block shuffling).
I'm a little confused. Should the implementation support shuffling between chunks or within chunks? I'd like to see train_test_split support shuffling between chunks. Currently, train_test_split only supports shuffling within chunks blockwise=True:
import pandas as pd
import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask_ml.model_selection import train_test_split
N = 1000
df = pd.DataFrame({"x": np.arange(N), "y": np.arange(N)})
ddf = dd.from_pandas(df, npartitions=2)
kwargs = dict(random_state=0, train_size=0.5)
train, test = train_test_split(ddf, blockwise=True, **kwargs)
assert train.compute().shape == (530, 2)
train, test = train_test_split(ddf, blockwise=False, **kwargs) # raises NotImplementedErrorThere was a problem hiding this comment.
Should the implementation support shuffling between chunks or within chunks?
Ideally either. But that's distinct from this PR.
Ah indeed. Fixed I think. Interestingly. Tests also pass if I remove the to_dask_array lines. I'm curious do we know what support is like for passing Pandas dataframes around to models is? Is this something that we would want to leave for the user? |
|
The sklearn dev failure can be ignore. I need to look into it today or tomorrow. Planning to merge later today, assuming the discussion in https://github.com/dask/dask-ml/pull/701/files/485ff530bca48eb6051591c75dbe6a5965d42333#diff-0a54ded74c8013caf0588e9a5c879e99 has been resolved. |
Co-authored-by: Tom Augspurger <TomAugspurger@users.noreply.github.com>
|
Thanks! Apologies for the incorrect suggestion. |
It doesn't look like we actually use exact chunk sizes anywhere, so this should be ok?
I may be wrong though. Regardless there is probably still some cleanup to do here. I thought I'd push up something early for feedback though.
cc @stsievert