ENH: implement __sizeof__ for estimators#15526
ENH: implement __sizeof__ for estimators#15526stsievert wants to merge 22 commits intoscikit-learn:mainfrom
Conversation
|
This PR is not exact, and doesn't not count any overhead. In the tests, it's fairly straightforward to get within ±5%, at least for decently large estimators. This is mentioned more by @mrocklin in #8642 (comment) and #8642 (comment). |
rth
left a comment
There was a problem hiding this comment.
Thanks @stsivert. It would be useful to add this to common tests #8642 (comment) which can be added e.g. to sklearn.utils.estimator_checks::check_estimator_pickle.
It may be fine if for some estimaors the size of the pickle is not equal with some tolerance to __sizeof__ result, but at least it would be good to know which ones.
Modifying *Mixin classes is necessary because many classes
override BaseEstimator's implementation with (for example)
class Birch(ClusterMixin, TransformerMixin, BaseEstimator):
This means that Birch will use ClusterMixin's implementation of
__sizeof__ and override BaseEstimator's
jnothman
left a comment
There was a problem hiding this comment.
I thought we decided we only need to implement it for C/++ data structures since the caller should be dealing with estimation for plain old objects.
|
Hi @stsievert, I'm checking the status of PRs and issues in the 0.23 milestone. |
rth
left a comment
There was a problem hiding this comment.
I thought we decided we only need to implement it for C/++ data structures since the caller should be dealing with estimation for plain old objects.
I would be OK with it in BaseEstimator, and any Cython/C objects that need it. Having a simple way to check the size of an estimator won't hurt even outside of Dask.
Please ping once CI is passing.
|
(Also it might be good to sync with master) |
Co-Authored-By: Roman Yurchak <rth.yurchak@gmail.com>
|
So it's okay that this can double count objects with shared references? |
|
As in I'd hoped to only handle cython/C implementations so that we didn't have to maintain a function that we might be tempted to make more nuanced over time |
Yeah, I think it'd be best to forget about looping over class BaseEstimator:
@property
def nbytes(self):
return sum(x.nbytes if hasattr(x, "nbytes") else sys.getsizeof(x) for x in self.__dict__.values())
...That'd resolve @jnothman's concerns and provide an accurate estimate of an estimator's memory usage. |
Reference Issues/PRs
Closes #8642
What does this implement/fix? Explain your changes.
It implements
__sizeof__for Estimators as mentioned in #8642 (comment).TODO:
__sizeof____sizeof__for trees