Skip to content

[WIP] Wrap sklearn.model_selection routines with dask#270

Open
gglanzani wants to merge 1 commit intodask:mainfrom
gglanzani:patch-251
Open

[WIP] Wrap sklearn.model_selection routines with dask#270
gglanzani wants to merge 1 commit intodask:mainfrom
gglanzani:patch-251

Conversation

@gglanzani
Copy link
Copy Markdown
Contributor

The commit adds cross_validate, cross_val_score, and
cross_val_predict methods that wrap the sklearn equivalent methods
by using joblib with dask as a backend.

It should close #251

@TomAugspurger I've marked it as WIP to see if this is the direction you would like the PR to go in order to close the issue. Let me know otherwise.

@gglanzani
Copy link
Copy Markdown
Contributor Author

Always nice when the tests fail :)

It's currently failing because of this scikit-learn/scikit-learn@106bb9e

We're testing against sklearn-dev and @GaelVaroquaux renamed sklearn.externals.joblib to sklearn.externals._joblib.

Since the failure comes from distributed, I've opened a PR there.

@GaelVaroquaux
Copy link
Copy Markdown

GaelVaroquaux commented Jul 3, 2018 via email

@gglanzani
Copy link
Copy Markdown
Contributor Author

@GaelVaroquaux sorry, didn't see that one! I've closed PR 2091!

Copy link
Copy Markdown
Member

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good, thanks.

Could you also add these to helpers to docs/source/modules/api.rst, and add a basic test using it. Let me know if you need help writing a test.

We'll ignore the sklearn-dev failures for now, until there's a clear direction forward.

scatter = [X, y]

if is_dask_collection(X) or is_dask_collection(y):
raise NotImplementedError('This method is not implemented '
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should note that it isn't implemented for dask collections yet.

Perhaps you could indicate that users can pre-compute X and y if they fit in memory.

If your data fit in RAM on each worker, you can pre-compute the data with

X, y = dask.compute(X, y)

before calling {f.__name__}

@mrocklin
Copy link
Copy Markdown
Member

mrocklin commented Jul 3, 2018

I'm going to push back a bit (though please don't put to much weight here, I don't have much user experience here). We might want to avoid explicitly wrapping sklearn operations if we can avoid it by just asking users to use sklearn with joblib. This might reduce maintenance in the future, and keep their code sklearn friendly when they don't want dask involved.

Or, perhaps a different way of saying this:

What is the advantage of doing this vs just having the user wrap code with the joblib dask backend?

@TomAugspurger
Copy link
Copy Markdown
Member

That's a fair concern. These utilities would blur the line between "scikit-learn + distributed joblib for CPU bound problems" vs. Dask-ML for RAM-bound problems.

@gglanzani do you have any thoughts on that?

@gglanzani
Copy link
Copy Markdown
Contributor Author

@mrocklin I see your point. This machinery might indeed make more sense once these methods are implemented also on dask collections but as it now stands it has another advantage, namely if my code now reads:

from sklearn.model_selection import cross_validate

def my_fun(*args, **kwargs):
    do_something()
    cross_validate(...)


def my_fun2(*args, **kwargs):
    do_something_else()
    cross_validate(...)

I don't need to change all the instances of cross_validate, but I need only to change the import to dask. So personally this has value for the end user. If this value is justified from a development perspective… I'm not qualified for this :)

@TomAugspurger Can you point me to a test which I could learn/copy from for this specific case?
I will also update the error message and update the docs.

The commit adds ``cross_validate, cross_val_score``, and
``cross_val_predict`` methods that wrap the sklearn equivalent methods
by using ``joblib`` with dask as a backend.

It should close dask#251
@mrocklin
Copy link
Copy Markdown
Member

mrocklin commented Jul 3, 2018

In your case with the new joblib you can call the following at the top of your script

joblib.parallel_backend('dask')

And the rest of your code will work as before, keeping the sklearn import

@gglanzani
Copy link
Copy Markdown
Contributor Author

@mrocklin in that case I have to pass the ball to Tom who originally opened the issue 😄

@TomAugspurger
Copy link
Copy Markdown
Member

I'll think on it for a bit. Apologies for the (potentially) unused work @gglanzani.

@gglanzani
Copy link
Copy Markdown
Contributor Author

@mrocklin @TomAugspurger one extra advantage of the above approach is that it's still easy to use joblib with dask backend only on parts of the functions in the module.

But that's a minor thing.

Since we're talking about this, what would the best venue to talk about how to implement the functions for dask collections as well?

Base automatically changed from master to main February 2, 2021 03:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add cross_validate helper

4 participants