My benchmarks show this helps, but only a little bit. I have yet to find a scikit-learn task where the distributed backend is noticeably more performant than the existing threading/multiprocessing backends (except for grid-search, which is still beat by […]).

One issue is that internal scikit-learn code may transform the input data, so the objects that make it to joblib aren't the same as the input ones (e.g. the […]).
Are the objects that make it to joblib the same as the originals, or different?
I wouldn't be surprised if joblib were near full speed on a single machine. This might have more of an impact on a cluster, where communication costs are larger.
Different objects. Usually it's a small thing like a reshape or dtype change.
What is the status here? Is this code helpful?
It's useful, but still fails to beat single-machine performance for anything except grid_search (which is still beat by […]).
Do you mean to say that if we have a cluster we should still prefer to use a single machine, or that on a single machine one should use the threaded/multiprocessing joblib backends? If the former, did you get a sense for what the limiting factor was?
The latter. The former depends on the work-to-serialization-cost ratio, and on how many tasks there are. Grid search is a good use case here, as fitting individual estimators is usually pretty expensive, and we're using the same data multiple times (which can be cached).
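The work-to-serialization trade-off described above can be sketched with a small stdlib-only helper. This is an illustration, not part of joblib or distributed; the function name is made up. The idea: if computing a task doesn't take meaningfully longer than round-tripping its arguments through pickle, shipping it to a cluster is unlikely to pay off.

```python
# Hypothetical helper (not joblib/distributed code): estimate the ratio of
# compute time to serialization time for a task's arguments.
import pickle
import time


def work_to_serialization_ratio(func, *args, repeat=3):
    """Time func(*args) against pickling its arguments; return the ratio."""
    start = time.perf_counter()
    for _ in range(repeat):
        func(*args)
    work = (time.perf_counter() - start) / repeat

    start = time.perf_counter()
    for _ in range(repeat):
        pickle.loads(pickle.dumps(args))
    ser = (time.perf_counter() - start) / repeat

    # Guard against a zero serialization time for trivially small args.
    return work / max(ser, 1e-9)


# A cheap task over big data has a low ratio: distributing it likely loses.
big = list(range(100_000))
ratio = work_to_serialization_ratio(sum, big)
```

Fitting an estimator in a grid search sits at the other end of this ratio, which is why it benefits; pre-scattering the shared training data removes the serialization term almost entirely after the first transfer.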
Are there joblib-accelerated algorithms within sklearn that have a high work-to-serialization-cost ratio, other than grid search? cc @ogrisel
Ok, I ran a benchmark of fitting using a modified version of this scikit-learn benchmark (see here for the code). This fits a […]. The threading backend was run on a single […]. As you can see here, train time for […]. I tried running without the […].

As such, I think this is a net win. I have a patch for scikit-learn that I haven't pushed yet to allow overriding the backend for fitting on these classes, which would avoid the need for the monkeypatch in the benchmark above. (cc @amueller, who recommended this benchmark a few months ago :))
To my naive understanding this seems pretty great? Hijacking joblib like this seems like the least intrusive way to accelerate scikit-learn with dask.distributed. It would be interesting to see both how well this scales and what other common computations in scikit-learn might benefit. Also cc'ing @GaelVaroquaux.

From the perspective of this PR I think it's clear that it provides an improvement. Working towards merging sounds good to me.
That looks pretty good indeed. Maybe get some error bars? It looks like the speedup for RF is slightly more than 3x, which is a bit strange ;)
Yeah, not sure why previous numbers were less great. Might have grabbed the wrong commit when running them. |
distributed/joblib.py (Outdated)

```diff
-MIN_IDEAL_BATCH_DURATION = 0.2
-MAX_IDEAL_BATCH_DURATION = 1.0
+MIN_IDEAL_BATCH_DURATION = 0.5
+MAX_IDEAL_BATCH_DURATION = 5.0
```
How did you come by these values?
Oops, leftover from debugging.
```python
self.client = Client(scheduler_host, loop=loop)
if scatter is not None:
    # Keep a reference to the scattered data to keep the ids the same
    self._scatter = list(scatter)
```
What happens if someone gives us a single numpy array?
Then it's bad. We should only accept lists here: technically we can pre-scatter any object, so it'd be hard to know whether someone wanted us to scatter the object itself or the elements inside it. Will fix to error if not list/tuple.
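A minimal sketch of the check described here, with illustrative names (this is not the actual distributed.joblib code): reject anything that isn't a list or tuple, so the caller's intent of "scatter these elements" is unambiguous.

```python
# Illustrative validation for a scatter= argument: only containers whose
# elements should be scattered are accepted; anything else raises.
def validate_scatter(scatter):
    """Return a list of items to scatter, or None if nothing was given."""
    if scatter is None:
        return None
    if not isinstance(scatter, (list, tuple)):
        raise TypeError(
            "scatter must be a list or tuple, got %r" % type(scatter).__name__
        )
    # Keep a concrete list so the scattered objects keep stable identities.
    return list(scatter)
```

Passing a bare numpy array would hit the TypeError branch rather than being silently scattered element-by-element, which is the failure mode the reviewer was asking about.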
distributed/joblib.py (Outdated)

```python
        return Batch(tasks), args2

    def apply_async(self, func, callback=None):
        key = '%s-%s' % (joblib_funcname(func), uuid4().hex)
```
Should this add "%s-batch-%s" or something to signify that this is more than one function call?
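A rough sketch of the naming scheme suggested here (a hypothetical helper, not the PR's actual code): include a batch marker in the key only when the task wraps more than one call, so batched and single tasks are distinguishable in the scheduler.

```python
# Illustrative key naming: mark batched tasks with "-batch-" so they can be
# told apart from single function calls when inspecting the scheduler.
from uuid import uuid4


def batch_key(funcname, n_tasks):
    """Build a unique task key; flag it when it covers multiple calls."""
    if n_tasks > 1:
        return '%s-batch-%s' % (funcname, uuid4().hex)
    return '%s-%s' % (funcname, uuid4().hex)
```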
Was just monkey-patching in the benchmark. Scikit-learn hardcodes the backend into the […]. There might be a better way though.
Allows pre-scattering data to the cluster. For large arguments (big arrays, etc.) that are used in more than one task, this can be more efficient, as it avoids serializing the data for every task. Also fixes the distributed backend to be an instance of `ParallelBackendBase` for both the scikit-learn and joblib modules.
I think this should be good to go.
```python
sols = [func(*args, **kwargs) for func, args, kwargs in tasks]
results = Parallel()(tasks)

ba.terminate()
```
Why is this call necessary? Should this be called as part of `__exit__`?
Backends aren't context managers; there is no `__exit__` on the backend. The call was already there for the other tests (e.g. https://github.com/dask/distributed/blob/master/distributed/tests/test_joblib.py#L42); I moved it to use `terminate`, which is more consistent with the other joblib backends.
```python
    return funcname(x)


class Batch(object):
```
Will users ever see this object? If so, should it get a one-line docstring?
No, this is completely internal.
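For readers of the thread, a stripped-down sketch of what an internal `Batch` object like the one above does (this is an illustration, not the actual implementation): it bundles several `(func, args, kwargs)` tasks into one callable, so the scheduler sees a single task per batch instead of one per function call.

```python
# Minimal illustration of batching tasks into a single callable.
class Batch(object):
    """Wrap a list of (func, args, kwargs) tasks as one callable."""

    def __init__(self, tasks):
        self.tasks = tasks

    def __call__(self):
        # Run every bundled task and return their results in order.
        return [func(*args, **kwargs) for func, args, kwargs in self.tasks]
```

Batching like this is what makes the key-naming question above relevant: one scheduler key can now stand for many underlying calls.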
+1 from me
> Backends aren't context managers; there is no `__exit__` on the backend.

I'm not opposed to changing this.
@jcrist can you raise a PR with the patch to sklearn?
I could, but wanted to wait until a decision on how best to override scikit-learn estimators was reached (see relevant issue: scikit-learn/scikit-learn#8804).