[WIP] Wrap sklearn.model_selection routines with dask#270
[WIP] Wrap sklearn.model_selection routines with dask#270
sklearn.model_selection routines with dask#270Conversation
|
Always nice when the tests fail :) It's currently failing because of this scikit-learn/scikit-learn@106bb9e We're testing against sklearn-dev and @GaelVaroquaux renamed Since the failure comes from |
|
It's currently failing because of this ***@***.***
We're testing against sklearn-dev and @GaelVaroquaux renamed
sklearn.externals.joblib to sklearn.externals._joblib.
Indeed, and there is a half-done PR to address it in distributed:
dask/distributed#2019
That said, we are discussing in scikit-learn whether we should revert
this API change, as it is breaking too many things.
|
|
@GaelVaroquaux sorry, didn't see that one! I've closed PR 2091! |
TomAugspurger
left a comment
There was a problem hiding this comment.
This looks really good, thanks.
Could you also add these to helpers to docs/source/modules/api.rst, and add a basic test using it. Let me know if you need help writing a test.
We'll ignore the sklearn-dev failures for now, until there's a clear direction forward.
dask_ml/model_selection/_validate.py
Outdated
| scatter = [X, y] | ||
|
|
||
| if is_dask_collection(X) or is_dask_collection(y): | ||
| raise NotImplementedError('This method is not implemented ' |
There was a problem hiding this comment.
This should note that it isn't implemented for dask collections yet.
Perhaps you could indicate that users can pre-compute X and y if they fit in memory.
If your data fit in RAM on each worker, you can pre-compute the data with
X, y = dask.compute(X, y)
before calling {f.__name__}
|
I'm going to push back a bit (though please don't put to much weight here, I don't have much user experience here). We might want to avoid explicitly wrapping sklearn operations if we can avoid it by just asking users to use sklearn with joblib. This might reduce maintenance in the future, and keep their code sklearn friendly when they don't want dask involved. Or, perhaps a different way of saying this: What is the advantage of doing this vs just having the user wrap code with the joblib dask backend? |
|
That's a fair concern. These utilities would blur the line between "scikit-learn + distributed joblib for CPU bound problems" vs. Dask-ML for RAM-bound problems. @gglanzani do you have any thoughts on that? |
|
@mrocklin I see your point. This machinery might indeed make more sense once these methods are implemented also on dask collections but as it now stands it has another advantage, namely if my code now reads: from sklearn.model_selection import cross_validate
def my_fun(*args, **kwargs):
do_something()
cross_validate(...)
def my_fun2(*args, **kwargs):
do_something_else()
cross_validate(...)I don't need to change all the instances of @TomAugspurger Can you point me to a test which I could learn/copy from for this specific case? |
The commit adds ``cross_validate, cross_val_score``, and ``cross_val_predict`` methods that wrap the sklearn equivalent methods by using ``joblib`` with dask as a backend. It should close dask#251
|
In your case with the new joblib you can call the following at the top of your script And the rest of your code will work as before, keeping the sklearn import |
|
@mrocklin in that case I have to pass the ball to Tom who originally opened the issue 😄 |
|
I'll think on it for a bit. Apologies for the (potentially) unused work @gglanzani. |
|
@mrocklin @TomAugspurger one extra advantage of the above approach is that it's still easy to use joblib with dask backend only on parts of the functions in the module. But that's a minor thing. Since we're talking about this, what would the best venue to talk about how to implement the functions for dask collections as well? |
The commit adds
cross_validate, cross_val_score, andcross_val_predictmethods that wrap the sklearn equivalent methodsby using
joblibwith dask as a backend.It should close #251
@TomAugspurger I've marked it as WIP to see if this is the direction you would like the PR to go in order to close the issue. Let me know otherwise.