[Callbacks] Make clone keep callbacks#33522
jeremiedbb merged 12 commits into scikit-learn:callbacks from
Conversation
Emotional roller-coaster 😅
adrinjalali
left a comment
So it seems the things we talked about yesterday actually didn't come up in the test suite here:
- making sure there are no threads left open after a `fit` is done
- the need for a singleton main root node kept at the callback module level instead of the parent node creating it
Otherwise I'm happy with the PR, as long as our tests pass.
sklearn/base.py (Outdated)

```python
# callbacks are passed by reference because a same instance of a callback can
# be used by multiple clones of the same estimator.
new_object._skl_callbacks = estimator._skl_callbacks
```
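The pass-by-reference choice quoted above can be illustrated with a minimal plain-Python sketch. The class names here are stand-ins, not scikit-learn's actual implementation: both clones end up holding the very same callback instance, so accumulated state lives in one place and nothing large gets copied.

```python
# Hypothetical minimal illustration (not the actual scikit-learn code) of why
# passing callbacks by reference matters: clones share one callback instance.

class ProgressCallback:
    def __init__(self):
        self.events = []  # may grow large, hence the reluctance to copy

    def on_fit_iter(self, name):
        self.events.append(name)

class Estimator:
    def __init__(self):
        self._skl_callbacks = []

def clone_keeping_callbacks(est):
    new = Estimator()
    # pass by reference: no copy of potentially large callback state
    new._skl_callbacks = est._skl_callbacks
    return new

cb = ProgressCallback()
est = Estimator()
est._skl_callbacks = [cb]
c1, c2 = clone_keeping_callbacks(est), clone_keeping_callbacks(est)
assert c1._skl_callbacks[0] is cb and c2._skl_callbacks[0] is cb
```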
I'm happy to keep it as is here, but I also wouldn't be surprised if we figure we actually want to copy the callbacks here.
It doesn't really matter since clone can be called in sub-processes so we effectively get a "copy" in that case.
I think it's more efficient to pass by reference whenever we can. I suspect that callback objects grow big if they accumulate data about running operations, so better not copy if we can avoid it.
Yes, as long as we're not pickling, better to share the same reference.
In sub-processes we get new copies of it, with proxies still pointing to the same managed data structure, so they still share part of their state.
But in BaseForest and in LinearModelCV we use cloned sub-estimators in Parallel(n_jobs=self.n_jobs, prefer="threads"). Wouldn't it be a problem then?
No, when we can share the full state between threads it's even better!
For instance, progress bars can use the same manager for all shared queues.
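The thread-sharing point can be sketched like this (illustrative names, not the actual ProgressBar internals): worker threads push progress events onto one shared queue, which a single consumer drains after the work is done.

```python
import queue
import threading

# Sketch: within one process, worker threads can push progress events onto a
# single shared queue ("same manager for all shared queues"); one consumer
# then drains it. Names are illustrative, not sklearn's ProgressBar internals.

q = queue.Queue()

def fit_worker(fold):
    for i in range(3):
        q.put((fold, i))  # one progress event per iteration

threads = [threading.Thread(target=fit_worker, args=(f,)) for f in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# drain the shared queue, as a single rendering thread would
events = []
while not q.empty():
    events.append(q.get())
assert len(events) == 6
```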
Look at `test_progressbar_no_callback_support`, which I just enhanced. It covers the case you're describing.
We can add a test for that. Right now the existing tests make sure that all teardowns are called (and threads are closed in those teardowns), so it's already something.
That (it was discussed between Adrin and me during a call) was not a necessity after all, and that's good because it would have added a lot of complexity to make it work across processes.
Even though I opened this PR, my preference is still that
Actually everything's not as simple as I stated in the PR description. Reusing the cross-val example but running the cross-val in parallel this time, we only see 1 bar progressing at once, and then new bars appear that are already finished. This is because we spawn threads in the sub-processes and their stdout is not directly forwarded to the parent process, so the rendering in a parallel setting is quite bad. It also requires some small changes in #33494, but nothing too bad (still a bit annoying, like the clean setup/teardown around fit and setting the context as an attribute of the estimator).
Thank you for being open to revision, @jeremiedbb and @FrancoisPgm.
For me, that was not the expectation. I only now see that your main intention behind not cloning callbacks by default was to prevent undefined callback behaviour, whereas fixing the pickling error in `clone` (making cloning work) and other issues were only side quests. At least for me, some information was lost about why we want to define the cloning behaviour.
fc833ab shows the necessary changes to ProgressBar to make it work. The main change is that now the callback must be able to handle:
It makes implementing propagated callbacks properly a bit more complex.
I have tried to do that before, to raise a warning when a sub-estimator that supports callbacks is used in a meta-estimator that doesn't. My approach was to inspect the traceback at runtime, identifying whether it contains a call to a fit from a non
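The stack-inspection idea mentioned above can be sketched with `inspect.stack()`. This is illustrative only (hypothetical names, not the actual attempt): scan the current call stack for a frame whose function is `fit`.

```python
import inspect

# Illustrative sketch of detecting, at runtime, whether we were called from
# within some estimator's fit: scan the call stack for a frame named "fit".

def called_from_fit():
    return any(frame.function == "fit" for frame in inspect.stack())

class Meta:
    def fit(self):
        return called_from_fit()

assert Meta().fit() is True
assert called_from_fit() is False
```

In practice this kind of check is fragile (it depends on function names and on how the caller is wrapped), which is part of why the approach was hard to get right.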
Just to be sure, do you have some code just so we get an idea of how bad the behaviour is? From what you are saying, this doesn't sound horribly bad, but maybe with other callbacks it could be worse? I tried something like:

```python
from sklearn.base import clone
from joblib import delayed, Parallel
from sklearn.callback.tests._utils import MaxIterEstimator
from sklearn.callback import ProgressBar

n_jobs = 4
backend = "loky"

if __name__ == "__main__":

    def clone_and_fit(estimator):
        clone(estimator).fit()

    def func(estimator, n_fits, n_jobs, backend):
        Parallel(n_jobs=n_jobs, backend=backend)(
            delayed(clone_and_fit)(estimator) for _ in range(n_fits)
        )

    n_fits, max_iter = 5, 7
    callback = ProgressBar()
    estimator = MaxIterEstimator(max_iter=max_iter).set_callbacks(callback)
    func(estimator, n_fits, n_jobs, backend)
```

and I don't see any major issue personally.
Your snippet illustrates what I mean. It's just that you need to set a higher number of iterations and fits to actually see something (otherwise it's too fast). Say `n_fits=12` and `max_iter=100`. The final rendering is fine and shows all bars finished, but that's not what we're interested in with progress bars; we want to see progress. EDIT: Actually it looks like all bars from subprocesses are stacked during progress and un-stacked when finished.
As far as I can tell, this looks good to me now.
LGTM thanks :)
ogrisel
left a comment
Some small improvement suggestion below, but LGTM overall. I wouldn't mind merging without addressing all of my comments below.
```python
def func(estimator):
    Parallel(n_jobs=2)(delayed(clone_and_fit)(estimator) for _ in range(4))

func(MaxIterEstimator().set_callbacks(ProgressBar()))
```
Would it be possible to check that we do not leak rich-related resources after the end of the call to `func`?
I don't want to delay the merge of this PR for this, but maybe we could add a # TODO comment for this.
I added a check for leftover threads and queues.
In the case of multiprocessing, we can just assume that threads created from the worker processes are killed anyway when the worker is terminated. Actually checking that the thread is properly finished from within the worker would add a lot of complexity.
In the case of multithreading, I added checks to make sure all threads are finished and queues are empty. It also serves as a check for the case above applied to each worker.
I addressed the review comments. I'm merging this one with 2 approvals. Thanks!
This PR proposes to come back to the state before #33340 (and mostly reverts it), i.e. `clone` doesn't discard callbacks. The motivation for this turnaround comes from:
So it seems that a common expectation is that callbacks should work even when set on sub-estimators of meta-estimators or functions that don't implement callback support.
Why did we choose to make `clone` discard callbacks? My main reason was that callbacks expect the whole computation tree to work optimally. For instance, a progress bar on a bunch of cloned estimators inside a cross-val won't be able to show a progress bar for the cross-val itself, and will spawn that many threads, degrading performance.
Can callbacks work (without changing everything) even in meta-estimators that don't support callbacks?
It seems so, as long as `setup` is implemented correctly, i.e. setups from different clones should not try to override each other. It was irrelevant before because there was only one setup: before the fit of the outermost estimator.
What to do with the many threads spawned by ProgressBar?
In the end I think it's a matter of choice. In this PR I removed the warning raised in `clone`, but we could keep it and change the message to inform users that they're not using callbacks optimally, and guide them to the documentation for adding callback support. I removed it because I felt that part of the concern was the unexpected coupling between `clone` and the callback infrastructure, so better to make things as simple as possible.
Here's an example of what happens with progress bars in such a non-optimal setting:
We don't have the global progress of the cross-val and we don't know which bar corresponds to which fold, but that's still something.
@FrancoisPgm @ogrisel @StefanieSenger @lesteve @adrinjalali let's discuss this option during tomorrow's meeting.