
[MRG+1] Backend hints and shared memory constraints#602

Merged
ogrisel merged 5 commits into joblib:master from ogrisel:backend-hints
Feb 7, 2018

Conversation

@ogrisel
Contributor

@ogrisel ogrisel commented Jan 24, 2018

Sorry, I deleted my remote branch by accident and it closed #595. Here is the same PR again.

This is an alternative implementation of #537. I reimplemented it from scratch because joblib had diverged a bit and I did not agree with the semantics of constraints violations in #537.

TODO:

  • test by running RandomForestClassifier on a dask-distributed cluster and check that fit runs trees in parallel (done with this branch of sklearn: scikit-learn/scikit-learn@master...ogrisel:joblib-backend-hints);
  • decide whether we should use prefer='threads' and require='sharedmem' or keep the currently implemented boolean flags for hinting and hard constraints;
  • update Parallel docstring to document the new options.
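The string-based API mentioned in the second TODO item (which is what was ultimately merged) can be illustrated with a small sketch; the calls below assume a joblib version that includes this PR:

```python
# Sketch of the string-based hint/constraint API discussed above.
from math import sqrt
from joblib import Parallel, delayed

# A soft hint: the threading backend is preferred, but an active
# parallel_backend context manager may still override it.
hinted = Parallel(n_jobs=2, prefer="threads")(
    delayed(sqrt)(i) for i in range(10))

# A hard constraint: shared-memory semantics are required, so a
# thread-based backend is always used, even inside a
# parallel_backend("dask") context.
results = Parallel(n_jobs=2, require="sharedmem")(
    delayed(sqrt)(i) for i in range(10))
```

The hint expresses a preference that can be overridden; the requirement is honored unconditionally, which is what the "hard constraints" wording above refers to.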

@ogrisel ogrisel changed the title Backend hints and shared memory constraints [MRG] Backend hints and shared memory constraints Jan 24, 2018
@codecov

codecov bot commented Jan 24, 2018

Codecov Report

Merging #602 into master will increase coverage by 0.09%.
The diff coverage is 96.58%.


@@            Coverage Diff             @@
##           master     #602      +/-   ##
==========================================
+ Coverage   94.83%   94.93%   +0.09%     
==========================================
  Files          39       39              
  Lines        5287     5389     +102     
==========================================
+ Hits         5014     5116     +102     
  Misses        273      273
| Impacted Files | Coverage Δ |
|---|---|
| joblib/parallel.py | 98.7% <100%> (+0.11%) ⬆️ |
| joblib/_parallel_backends.py | 94.82% <100%> (+0.96%) ⬆️ |
| joblib/test/test_parallel.py | 95.9% <94.52%> (-0.21%) ⬇️ |
| joblib/backports.py | 93.75% <0%> (-2.09%) ⬇️ |
| joblib/_store_backends.py | 90.47% <0%> (-0.53%) ⬇️ |
| joblib/test/test_memory.py | 98.14% <0%> (+0.37%) ⬆️ |
| joblib/memory.py | 95.54% <0%> (+0.59%) ⬆️ |

Continue to review full report at Codecov.

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update feb1188...808b27a. Read the comment docs.

@GaelVaroquaux GaelVaroquaux changed the title [MRG] Backend hints and shared memory constraints [MRG+1] Backend hints and shared memory constraints Jan 25, 2018
@GaelVaroquaux
Member

This looks great. I am in favor of merging it.

I'd be interested in having the point of view of @jcrist, given that this is work that he started. @jcrist, would you have time to review it?

@TomAugspurger

TomAugspurger commented Jan 25, 2018

Quick results from training a RandomForestClassifier and an ExtraTreesClassifier using @jcrist's benchmark, with three backends:

  • threading (4 cores I think)
  • loky (4 cores I think)
  • dask.distributed cluster with 8 workers, 4 cores each

Classification performance:

| Classifier | Backend | Train time | Test time | Error rate |
|---|---|---|---|---|
| RandomForest | dask.distributed | 11.9515s | 0.6102s | 0.0296 |
| RandomForest | threading | 25.0732s | 0.6096s | 0.0296 |
| RandomForest | loky | 30.9979s | 0.6114s | 0.0296 |
| ExtraTreesClassifier | dask.distributed | 16.1022s | 0.8105s | 0.0325 |
| ExtraTreesClassifier | threading | 27.1128s | 0.7095s | 0.0325 |
| ExtraTreesClassifier | loky | 32.8696s | 0.8143s | 0.0325 |

https://github.com/tomaugspurger/joblib-distributed-benchmark. I need to re-run the multiply-nested CV example with the correct patches applied.

@ogrisel
Contributor Author

ogrisel commented Jan 25, 2018

Thanks. This confirms that threading is optimal for RF-dominated workloads when run on a single machine. The individual tasks in this benchmark are probably too short to be run efficiently on a distributed cluster, given the amount of computing resources used (~2x speedup for 8x more machines). The data is probably transferred redundantly many times with the current state of the joblib distributed connector. Running the same benchmark with the dask-ml implementation of cross-validation and random search should make it possible to confirm this. A joblib.shelve (#593) primitive, used automatically under the hood by Parallel for redundant input arguments, might help fix this while preserving the eager numpy-oriented API of scikit-learn. This should be explored in a separate PR, though.

@TomAugspurger

> The data is probably transferred redundantly many times with the current state of the joblib distributed connector.

Indeed, I forgot to scatter the data in the outer parallel_backend call. I'll repeat the benchmark later with this added.
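The fix being referred to can be sketched roughly as follows, using the `scatter` argument of joblib's dask backend; the local in-process cluster here is purely for illustration (a real run would connect to the remote scheduler):

```python
# Sketch: pre-scattering large arrays with joblib's dask backend so
# they are shipped to the workers once, instead of being serialized
# along with every individual task.
import numpy as np
from dask.distributed import Client
from joblib import Parallel, parallel_backend, delayed

client = Client(processes=False)  # illustrative local cluster

X = np.random.randn(5000, 20)

def chunk_norm(data, i):
    # Operates on a strided slice of the shared array.
    return float(np.linalg.norm(data[i::10]))

# X is listed in `scatter`, so tasks taking it as an argument reuse
# the already-distributed copy rather than re-sending it each time.
with parallel_backend("dask", scatter=[X]):
    norms = Parallel()(delayed(chunk_norm)(X, i) for i in range(10))

client.close()
```

Without `scatter`, each of the ten tasks would carry its own serialized copy of `X`, which is exactly the redundant transfer described above.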

Contributor

@jcrist jcrist left a comment


Apologies for the delay in review. Overall this looks fine to me. I left a few points that intersect with design issues I mentioned here: #537 (comment).

One rough edge on the backend selection code as currently implemented (and in this PR) is that n_jobs is coupled to the backend, and can be specified in a number of different ways:

  • Keyword to parallel_backend
  • Keyword to Parallel
  • Global default value

The defaults for these differ: parallel_backend defaults to -1 while Parallel defaults to 1. This makes it tricky in either case to determine whether n_jobs was explicitly set or left at its default. Both default values should perhaps be changed to None to make explicit settings easier to detect. The default semantics (either -1 or 1) should probably also be unified.

I believe the behavior should be:

  • If n_jobs is explicitly set in Parallel, it should be respected even if set in parallel_backend
  • If n_jobs is not explicitly set in Parallel, then n_jobs from parallel_backend should be used. However, if the backend from parallel_backend is ignored due to requirements, then the n_jobs from parallel_backend should also be ignored.

More discussion of this was given in #52 (the first take at allowing overrides).
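The proposed precedence rules can be sketched as a small resolution function; this is a minimal illustration with hypothetical names, not joblib's actual code, and it assumes both defaults have been changed to None as suggested:

```python
# Sketch of the n_jobs precedence rules proposed above (hypothetical
# helper, assuming both defaults are None so explicit settings are
# detectable).
def resolve_n_jobs(parallel_n_jobs, context_n_jobs, context_backend_ignored):
    """Return the effective n_jobs for a Parallel call."""
    if parallel_n_jobs is not None:
        # Explicitly set on Parallel: always respected, even inside
        # a parallel_backend context.
        return parallel_n_jobs
    if context_n_jobs is not None and not context_backend_ignored:
        # Inherit from parallel_backend, unless its backend was
        # discarded because of a requirement such as 'sharedmem'.
        return context_n_jobs
    return 1  # unified library default

assert resolve_n_jobs(4, 16, False) == 4      # Parallel wins
assert resolve_n_jobs(None, 16, False) == 16  # inherited from context
assert resolve_n_jobs(None, 16, True) == 1    # backend ignored: all-or-nothing
```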

"as the latter does not provide shared memory semantics."
% (sharedmem_backend.__class__.__name__,
backend.__class__.__name__))
return sharedmem_backend, n_jobs
Contributor


When falling back because the requested backend doesn't satisfy the constraints, I think you should also fall back to the default n_jobs instead of the one from the context manager. An all-or-nothing approach seems the easiest to reason about to me.

# fallback to the default thread-based backend.
sharedmem_backend = BACKENDS[DEFAULT_THREAD_BACKEND]()
if verbose >= 10:
print("Using %s as joblib.Parallel backend instead of %s "
Contributor


This might be a project idiom, but I'd prefer a warning in this case rather than a print gated on verbose >= 10. Warnings can be silenced as needed, but if I wrote code to use a certain backend and that backend is being ignored, I'd like to know even if verbose is set to 0.
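The silenceable-warning pattern being suggested could look like this; the warning class and helper are hypothetical names for illustration, not joblib's actual code:

```python
# Illustration of the point above: unlike a verbosity-gated print, a
# warning is visible by default but can be silenced per category.
import warnings

class BackendFallbackWarning(UserWarning):
    """Hypothetical warning category for backend substitution."""

def select_backend(requested, satisfies_constraints):
    # Hypothetical selection helper: fall back to threading and warn
    # when the requested backend cannot satisfy the constraints.
    if not satisfies_constraints:
        warnings.warn(
            "Falling back to the threading backend instead of %r" % requested,
            BackendFallbackWarning)
        return "threading"
    return requested

# The fallback is reported by default...
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    backend = select_backend("dask", satisfies_constraints=False)
assert backend == "threading" and len(caught) == 1

# ...but users who do not care can opt out for this category only:
warnings.simplefilter("ignore", BackendFallbackWarning)
```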

Contributor Author

@ogrisel ogrisel Feb 7, 2018


Actually, I don't want to issue a warning in this case because the user cannot do anything to "fix" the cause of the problem. Take the example of random forests in scikit-learn: the current implementation relies on shared memory semantics in the prediction loop but not in the main fit loop. Suppose the user calls RF in a nested cross-validation loop (which can itself benefit from parallelism) and uses a context manager to select the dask-distributed backend so as to parallelize as much as possible at all levels. In that case the prediction part should stay sequential (the default behavior) while all the other parallel calls benefit from dask (which would be more than enough to saturate all the cores in typical use cases). There is no point in issuing a noisy warning that the user will not understand without knowing the details of scikit-learn's inner code.

I have changed the verbosity of the parallel call itself to make it explicit which backend is used for each individual call however.

"""

supports_timeout = True
use_threads = True
Contributor


Slight preference for uses_threads instead. use_threads feels a bit off grammatically, especially given other notations like supports_timeout.

@ogrisel
Contributor Author

ogrisel commented Jan 30, 2018 via email

@ogrisel ogrisel merged commit cf66463 into joblib:master Feb 7, 2018
@ogrisel ogrisel deleted the backend-hints branch February 7, 2018 14:06
TomAugspurger added a commit to TomAugspurger/scikit-learn that referenced this pull request Jun 22, 2018
Uses the new prefer / require keywords from joblib/joblib#602.

This allows users to control how jobs are parallelized in more situations.
For example, training a RandomForest on a cluster of machines with the dask backend.

Closes scikit-learn#8804
tomMoral pushed a commit to ogrisel/scikit-learn that referenced this pull request Jun 22, 2018
Uses the new prefer / require keywords from joblib/joblib#602.

This allows users to control how jobs are parallelized in more situations.
For example, training a RandomForest on a cluster of machines with the dask backend.

Closes scikit-learn#8804