Skip to content

ENH: implement __sizeof__ for estimators#15526

Open
stsievert wants to merge 22 commits intoscikit-learn:mainfrom
stsievert:sizeof
Open

ENH: implement __sizeof__ for estimators#15526
stsievert wants to merge 22 commits intoscikit-learn:mainfrom
stsievert:sizeof

Conversation

@stsievert
Copy link
Contributor

Reference Issues/PRs

Closes #8642

What does this implement/fix? Explain your changes.

It implements __sizeof__ for Estimators as mentioned in #8642 (comment).

TODO:

  • implement and test __sizeof__
  • implement __sizeof__ for trees

@stsievert
Copy link
Contributor Author

This PR is not exact, and doesn't not count any overhead. In the tests, it's fairly straightforward to get within ±5%, at least for decently large estimators. This is mentioned more by @mrocklin in #8642 (comment) and #8642 (comment).

Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @stsivert. It would be useful to add this to common tests #8642 (comment) which can be added e.g. to sklearn.utils.estimator_checks::check_estimator_pickle.

It may be fine if for some estimaors the size of the pickle is not equal with some tolerance to __sizeof__ result, but at least it would be good to know which ones.

Modifying *Mixin classes is necessary because many classes
override BaseEstimator's implementation with (for example)

    class Birch(ClusterMixin, TransformerMixin, BaseEstimator):

This means that Birch will use ClusterMixin's implementation of
__sizeof__ and override BaseEstimator's
Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we decided we only need to implement it for C/++ data structures since the caller should be dealing with estimation for plain old objects.

@cmarmo
Copy link
Contributor

cmarmo commented Mar 31, 2020

Hi @stsievert, I'm checking the status of PRs and issues in the 0.23 milestone.
Do you think you could find some time to work on this one? Thanks a lot for your time!

Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we decided we only need to implement it for C/++ data structures since the caller should be dealing with estimation for plain old objects.

I would be OK with it in BaseEstimator, and any Cython/C objects that need it. Having a simple way to check the size of an estimator won't hurt even outside of Dask.

Please ping once CI is passing.

@rth
Copy link
Member

rth commented Apr 2, 2020

(Also it might be good to sync with master)

@jnothman
Copy link
Member

jnothman commented Apr 2, 2020

So it's okay that this can double count objects with shared references?

@jnothman
Copy link
Member

jnothman commented Apr 2, 2020

As in I'd hoped to only handle cython/C implementations so that we didn't have to maintain a function that we might be tempted to make more nuanced over time

@stsievert
Copy link
Contributor Author

So it's okay that this can double count objects with shared references?

Yeah, I think it'd be best to forget about looping over __dict__ in __sizeof__. Instead, how 'bout this?

class BaseEstimator:

    @property
    def nbytes(self):
        return sum(x.nbytes if hasattr(x, "nbytes") else sys.getsizeof(x) for x in self.__dict__.values())

    ...

That'd resolve @jnothman's concerns and provide an accurate estimate of an estimator's memory usage.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement nbytes or __sizeof__ (or equivalent) for estimators

5 participants