Skip to content

[WIP] Implement backend requirements and preferences#537

Closed
jcrist wants to merge 2 commits intojoblib:masterfrom
jcrist:backend-requirements
Closed

[WIP] Implement backend requirements and preferences#537
jcrist wants to merge 2 commits intojoblib:masterfrom
jcrist:backend-requirements

Conversation

@jcrist
Copy link
Copy Markdown
Contributor

@jcrist jcrist commented Jul 19, 2017

Adds an initial set of backend preferences and requirements that a call to Parallel can specify. If the active backend meets the requirements and preferences, it will be used over the specified backend in the parallel call.

Currently only two requirements/preferences are implemented:

  • prefer_processes

    Whether to choose a backend that uses multiple processes over one that uses a single process. Note that this is only a preference and not a strict requirement.

  • require_shared_memory

    Whether shared memory (a single process) is required. If True, backends that use multiple processes are forbidden.

Supersedes #524.

Adds an initial set of backend preferences and requirements that a
call to Parallel can specify. If the active backend meets the
requirements and preferences, it will be used over the specified backend
in the parallel call.

Currently only two requirements/preferences are implemented:

- `prefer_processes`

  Whether to choose a backend that uses multiple processes over one that
  uses a single process. Note that this is only a preference and not a
  strict requirement.

- `require_shared_memory`

  Whether shared memory (a single process) is required. If True,
  backends that use multiple processes are forbidden.
@jcrist
Copy link
Copy Markdown
Contributor Author

jcrist commented Jul 19, 2017

I'm not set on the keyword names or interface, but I think the defaults make sense (and are compatible with the current implementation).

A few assorted thoughts on the design/implementation:

  • One other options I contemplated but decided against was to specify requirements as a set, e.g. backend_requirements={'shared_memory', ...}. This would help if many other requirements were added, preventing an explosion of keyword names. However, it would prevent non-boolean requirements. As is I think having individual keywords is more readable and provides greater options in the future.

  • I went with prefer_processes={True,False} to specify GIL interaction. If a function holds the GIL, we'd prefer to use processes but can use threads. If it doesn't hold the GIL we don't prefer to use processes, but still can. The default is True, matching the current joblib defaults.

  • The backend and n_jobs go together. Overriding the backend and n_jobs on a Parallel call is an all-or-nothing choice; if the active_backend is chosen, then the active_n_jobs are also chosen. This seemed like the most easy to reason about design.

  • One mildly rough edge is that you can't specify prefer_processes=True and requires_shared_memory=True together. This makes sense, as one invalidates the other. However, it does mean that to require shared memory you need to also specify prefer_processes=False. One option would be to merge the keywords together into a single keyword. Something like use_processes={'preferred', 'forbidden'}, with the default of 'preferred'.

  • We end up calling __init__ on the default backend even if it's overridden. This is cheap for all built in backends, and made the implementation more consistent, but may not be cheap for some custom backend. Not sure if that's an issue.

@codecov
Copy link
Copy Markdown

codecov bot commented Jul 19, 2017

Codecov Report

Merging #537 into master will increase coverage by <.01%.
The diff coverage is 98.75%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #537      +/-   ##
==========================================
+ Coverage   93.72%   93.72%   +<.01%     
==========================================
  Files          37       37              
  Lines        4779     4829      +50     
==========================================
+ Hits         4479     4526      +47     
- Misses        300      303       +3
Impacted Files Coverage Δ
joblib/_parallel_backends.py 91.39% <100%> (+0.17%) ⬆️
joblib/test/test_parallel.py 95.56% <100%> (+0.34%) ⬆️
joblib/parallel.py 96.87% <97.61%> (-0.17%) ⬇️
joblib/test/test_memory.py 98.09% <0%> (-0.43%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b320fa6...a4aae0d. Read the comment docs.

@codecov
Copy link
Copy Markdown

codecov bot commented Jul 19, 2017

Codecov Report

Merging #537 into master will increase coverage by 0.04%.
The diff coverage is 98.75%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #537      +/-   ##
==========================================
+ Coverage   93.72%   93.76%   +0.04%     
==========================================
  Files          37       37              
  Lines        4779     4829      +50     
==========================================
+ Hits         4479     4528      +49     
- Misses        300      301       +1
Impacted Files Coverage Δ
joblib/test/test_parallel.py 95.56% <100%> (+0.34%) ⬆️
joblib/_parallel_backends.py 91.39% <100%> (+0.17%) ⬆️
joblib/parallel.py 96.87% <97.61%> (-0.17%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b320fa6...a4aae0d. Read the comment docs.

@jcrist
Copy link
Copy Markdown
Contributor Author

jcrist commented Jul 20, 2017

Ping @GaelVaroquaux for review.

@GaelVaroquaux
Copy link
Copy Markdown
Member

I have started looking at the code, but before I comment on the specific details, here is a though that we've been having with @ogrisel:

  • It might be useful at some point to change the default backend of joblib from processes to threads, the reason being that more and more code are releasing the GIL. A more nuanced approach might be to enable setting the default backend (threads or process).

  • Hence, we were wondering whether rather than "prefer_threads" we should not have "prefer={None, 'threads', 'processes'}

What do people think? @ogrisel ?

@@ -300,6 +371,14 @@ class Parallel(Logger):
- finally, you can register backends by calling
register_parallel_backend. This will allow you to implement
a backend of your liking.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we need to revisit the docstring and the arguments to deprecate "backend" and replace it by the mechanism of backend specification that you implemented and the default backend.

What do people think?

@mrocklin
Copy link
Copy Markdown
Contributor

Hence, we were wondering whether rather than "prefer_threads" we should not have prefer={None, 'threads', 'processes'}

Not that I'm an expert in Joblib usability, but this change makes sense to me. I think it's useful to stress that this is "prefers" and not "requires" to algorithm authors.

Copy link
Copy Markdown
Contributor

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I like the prefer={None, 'threads', 'processes'} API as well. It makes it more future-proof (will not change if we change the default backend of joblib).

process backends (e.g. threading). This is useful if the function
being applied doesn't release the global interpretter lock (GIL).
require_shared_memory : bool, optional
If true, a backend supporting shared memory (e.g. a single process) is
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i.e. a single process with thread-based workers

Copy link
Copy Markdown

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anything else design or code-wise that needs to be done to move this forward? (aside from a rebase to fix the merge conflict)

Also, cc @stephen-hoover you might find this interesting. This short summary is that this PR would allow user-level code to (safely) override which parallel backend is used inside scikit-learn. I don't think this would directly affect how civis is using joblib, but I may be wrong.

for backend, n_jobs in backend_and_n_jobs:
if getattr(backend, 'uses_processes', None) is False:
return backend, n_jobs
raise ValueError("Unable to satisfy shared memory requirement "
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this case isn't hit by the test suite.

This is a silly example, but does hit the code:

choose_backend_and_n_jobs('multiprocessing', 5,
                          require_shared_memory=True, prefer_processes=False)

@stephen-hoover
Copy link
Copy Markdown

Thanks for tagging me, @TomAugspurger ! I had to read through all of the code changes to figure out exactly what was going on -- it looks to me like this change will require that custom backends add a uses_processes attribute or else risk being ignored. A custom backend which doesn't set uses_processes will always be skipped by a Parallel which uses the default prefer_processes=True. This is fine -- I'm aware that custom backends are experimental and subject to breaking interface changes, but this does look like a breaking change for them.

Any chance you could add a logging.getLogger(__name__).debug emit when overriding an active backend? I can see this causing maddening bugs if developers assume that their active backend will always be respected.

@TomAugspurger
Copy link
Copy Markdown

it looks to me like this change will require that custom backends add a uses_processes attribute or else risk being ignored.

Hmm, is https://github.com/joblib/joblib/pull/537/files#diff-12c1bd53fbe4c0df54e8edee36f2d2e9R335 the line in question? If so, changing that to if getattr(backend, 'uses_processes', True) is True: would I think fix it by assuming backends do use processes unless they say otherwise, but I haven't thought through if that's a good idea.

@stephen-hoover
Copy link
Copy Markdown

Yes, that's the line I was looking at. I really don't know what most custom backends look like, so it's hard for me to say what the assumption should be. Defaulting to True would look more like behavior in the existing code, IIUC.

@ogrisel
Copy link
Copy Markdown
Contributor

ogrisel commented Jan 22, 2018

I gave another shot at this (#595) with different semantics in case of constraints violation. See the assertions in the tests for details.

@ogrisel
Copy link
Copy Markdown
Contributor

ogrisel commented Jan 25, 2018

Closing in favor of #602.

@jcrist jcrist closed this Jan 29, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants