Description
see #28760 (comment)
related to #27676
Since callbacks are expected to be process-aware, and in particular to aggregate information across processes, they more or less all need to hold a data structure (list, dict, queue, ...) that is managed by a multiprocessing.Manager(). Each manager starts its own manager process, so we don't want to accumulate them.
Some callbacks can create this data structure in on_fit_begin, but others need to create it at initialization. Since this should also work in custom user code, the most robust solution would be to have a sklearn global manager. It would be created lazily, i.e. the first time a callback requests a manager, and could then be accessed by all callbacks.
This implies a manager process that lives throughout the whole program session. Maybe that's okay.
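To make the idea concrete, here is a minimal sketch of a lazily created global manager. The names `get_shared_manager`, `_manager` and `_manager_lock` are hypothetical and not part of any existing sklearn API; the only real call is `multiprocessing.Manager()`.

```python
import multiprocessing
import threading

# Hypothetical module-level state: a single manager shared by all callbacks.
_manager = None
_manager_lock = threading.Lock()


def get_shared_manager():
    """Return the global manager, starting its process on first use only."""
    global _manager
    with _manager_lock:
        if _manager is None:
            # This is the only place where a manager process gets started.
            _manager = multiprocessing.Manager()
        return _manager
```

A callback could then call `get_shared_manager().dict()` either in `__init__` or in on_fit_begin, and every callback would end up talking to the same manager process. With the "spawn" start method, the first call has to happen under an `if __name__ == "__main__":` guard in user scripts.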
To prevent it from accumulating data without control, we should make sure that when a shared data structure is no longer needed (i.e. when a callback gets garbage collected), it's automatically destroyed. If we find that the garbage collector is not able to do this properly on its own, we could rely on weakref finalizers.
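A rough sketch of the weakref finalizer fallback, under a few assumptions: `AggregatingCallback`, `registry` and `record` are hypothetical names, and `registry` stands for a mapping owned by the global manager (e.g. the proxy returned by `manager.dict()`). A plain dict behaves the same way for the purpose of this illustration.

```python
import weakref


class AggregatingCallback:
    """Hypothetical callback that owns one entry of a shared registry."""

    def __init__(self, registry, key):
        # `registry` is assumed to be a manager-backed mapping.
        self.key = key
        self._registry = registry
        registry[key] = []
        # The finalizer must not hold a reference to `self`, otherwise the
        # callback would never become unreachable. `registry.pop` only
        # references the registry, so this is safe.
        self._finalizer = weakref.finalize(self, registry.pop, key, None)

    def record(self, value):
        # Reassign instead of mutating in place, so that the update is also
        # propagated when `registry` is a manager proxy.
        self._registry[self.key] = self._registry[self.key] + [value]
```

When the callback is garbage collected, its entry is removed from the registry, so the shared state does not outlive the callback. By default `weakref.finalize` also runs any pending finalizers at interpreter exit.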
ping @FrancoisPgm