Skip to content

Faster info via meta class#8998

Merged
taldcroft merged 9 commits intoastropy:masterfrom
mhvk:faster-info-via-meta-class
Sep 14, 2019
Merged

Faster info via meta class#8998
taldcroft merged 9 commits intoastropy:masterfrom
mhvk:faster-info-via-meta-class

Conversation

@mhvk
Copy link
Copy Markdown
Contributor

@mhvk mhvk commented Jul 15, 2019

@taldcroft - I decided to try the metaclass approach to DataInfo, where for every attribute one automatically generates a descriptor that either gets/sets in _attrs or from the parent. It is a bit of magic in the meta-class, but not really that much more than there was before in __getattr__ and __setattr__. Note that I had to move to __slots__ in order to continue to disallow setting arbitrary attributes; not completely clear whether that is worth it. I have some other comments in-line.

Comparing this PR to current master:

from astropy.table import MaskedColumn
m = MaskedColumn([1, 2])
%timeit m.info
# 1000000 loops, best of 5: 290 ns per loop
# on current master: 1.24 µs
%timeit m.info.name
# 1000000 loops, best of 5: 659 ns per loop
# on current master: 2.12 µs

from astropy import units as u
q = [1, 2] * u.m
%timeit q.info
# 1000000 loops, best of 5: 319 ns per loop
# on current master: 1.82 µs
%timeit q.info.name
# 1000000 loops, best of 5: 463 ns per loop 
# on current master: 2.46 µs

The last item is the most relevant for possibly moving everything to info - 463 ns is still about 6 times slower than just %timeit m.name. But it is not obvious to me that with the general speed-up here, it would still be a limiting factor.

Copy link
Copy Markdown
Contributor Author

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inline comments

with pytest.raises(AttributeError) as err:
c[:].info.name
assert 'failed access "info" attribute' in str(err.value)
# No problem any more?
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually don't know why this test started failing...

@mhvk mhvk requested a review from taldcroft July 15, 2019 01:43
@taldcroft
Copy link
Copy Markdown
Member

@mhvk - this looks great from the description, but clearly needs a decent chunk of time to dig into and evaluate. So it's on my radar, but probably not in the next few days...

@mhvk
Copy link
Copy Markdown
Contributor Author

mhvk commented Jul 15, 2019

No worries - but if you do have a small amount of time, it would be lovely to get #8854 out of the way...

@taldcroft
Copy link
Copy Markdown
Member

@mvhk - have you had a look at https://www.python.org/dev/peps/pep-0487/ ? This is available as of Python 3.6. Not clear if the __slots__ stuff is possible, but it would be nice to avoid metaclasses if possible.

@mhvk
Copy link
Copy Markdown
Contributor Author

mhvk commented Jul 22, 2019

(Taking this out of the inline comments as I think it goes to the heart of the matter.)

@taldcroft - The approach with just defining descriptors on the class is indeed a lot clearer!

The main reason I went for the meta-class approach was to keep the API similar, with its attr_names and attrs_from_parent - which is why there are relatively few changes to Info subclasses.

The question then is whether we are happy to change it, which in turn depends on whether anybody outside of astropy relies on it. In principle, I guess the metaclass could help here, by doing the auto-generation while emitting a deprecation warning. But of course easier if we can do without it...

@mhvk
Copy link
Copy Markdown
Contributor Author

mhvk commented Jul 22, 2019

p.s. Have been meaning to look at the effects of PEP 487... Definitely useful for simple registration like Time fornats.

@taldcroft
Copy link
Copy Markdown
Member

taldcroft commented Jul 23, 2019

.. was to keep the API similar, with its attr_names and attrs_from_parent

Ah yes, I didn't immediately think of the API. That's possibly thorny... So what about some radical thoughts just for the sake of conversation:

  • What if we left the existing DataInfo hierarchy in place for external usage (but deprecate) and then made a brand new infrastructure that keeps most of the public API but is not constrained to implement in the same way. For the sake of discussion call it XDataInfo or XMixinInfo.
  • attr_names and attrs_from_parents can be done (for reading) with __set_name__ in the descriptors.
  • Basically saying, Info is a honking good idea, let's make it better and faster with a reboot. In practice it probably won't be radically different, but just a chance to take stock of what is there and make it nicer.
  • I think that in the rest of astropy there are not too many places we would need to change to reflect this additional XDataInfo class hierarchy (but I haven't tried looking).
  • Our Astropy classes would have e.g. class QuantityInfoBase(XParentDtypeInfo), class TimeInfo(XMixinInfo), etc.

# arbitrary attributes could be set (as a by-product, save
# some memory).
dct.setdefault('__slots__', [])
# Set a default __doc__, mostly to avoid InheritDocstrings
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea what InheritDocstrings does, but will #9024 help?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It copied __doc__ to attributes - now that we do not use that anymore internally (#8881), this should indeed not be needed any more!

... testing ...

Indeed! So, removed. Nice catch!

self._represent_as_dict_attrs = attrs

out = super()._represent_as_dict()
out = {}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use data_info._get_obj_attrs_map. (Which maybe needs to be made public?).

Or maybe _represent_as_dict could take an optional attrs to allow overriding self._represent_as_dict_attrs?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much better. Here, represent_as_dict itself is overridden, but, yes, a super().represent_as_dict(attrs) would work. I'm not quite sure it is worth changing, especially if we're considering a larger overhaul.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, darn, I now see why you suggested it: _get_obj_attrs_map is not on DataInfo but in the data_info module...

@taldcroft
Copy link
Copy Markdown
Member

Having given a decent first pass through this code and seen the timing, it looks great! So it might be a reasonable strategy to get this in shape to merge and separately think about a bigger redesign to make the code more elegant / readable and maybe even faster.

@mhvk mhvk force-pushed the faster-info-via-meta-class branch from 354b7bf to 6b28365 Compare July 23, 2019 21:20
@mhvk
Copy link
Copy Markdown
Contributor Author

mhvk commented Jul 23, 2019

OK, I've implemented the suggested changes. I like the idea of merging this, and worrying about a simpler design separately. The one outstanding question, I think, is whether it is worth the hassle with __slots__ to prevent people from sticking arbitrary attributes on info.

@taldcroft
Copy link
Copy Markdown
Member

Here are some timing results on my laptop with your branch and master. I get qualitatively similar results. This also explores the speedup of a different strategy (attrs are stored as object attributes, not dict elements).

from weakref import ref as weakref
from astropy.utils.data_info import DataInfo
from astropy.table import MaskedColumn
%astro
astropy=4.0.dev25224
DataInfo._attr_defaults  # Shows this is the faster-info-via-meta-class branch
{'dtype': dtype('O')}
class A:
    info = DataInfo()
a = A()
a.info.description = 'asdf'
timeit a.info
463 ns ± 3.27 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

MASTER: 2.5 µs ± 43.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

timeit a.info.description
845 ns ± 13.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

MASTER: 3.39 µs ± 64.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

a.__dict__
{'info': dtype = object
 shape = --
 description = asdf
 class = A
 n_bad = 0}
class SimpleDataInfo:
    """Simplest possible data info container"""
    __slots__ = ['name', 'description', 'dtype', 'meta', 'unit', '_parent_ref']
class B:
    """Use SimpleDataInfo via a class property"""
    @property
    def info(self):
        try:
            info = self._info
        except AttributeError:
            info = self._info = SimpleDataInfo()
        finally:
            # info._parent_ref = self # weakref(self)
            return info            
b = B()
b.info.name = 'asdf'
timeit b.info.name
207 ns ± 6.02 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
class FDataInfo:
    """Descriptor that roughly matches the property above"""
    def __get__(self, instance, owner_cls):
        try:
            info = instance._info
        except AttributeError:
            if instance is None:
                # This is an unbound descriptor on the class
                self._parent_cls = owner_cls
                return self
            else:
                info = instance._info = SimpleDataInfo()
        finally:
            # info._parent_ref = weakref.ref(instance)
            return info
class C:
    info = FDataInfo()
c = C()
c.info.description = 'hello'
timeit c.info.description
231 ns ± 4.75 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
m = MaskedColumn([1, 2])
m.info.description = 'asdf'
timeit m.description
50 ns ± 0.801 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
timeit m.info.description
1.01 µs ± 10.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
junk = SimpleDataInfo()
m.junk = junk
timeit m.junk
58.1 ns ± 0.271 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
m.junk.name = 'name'
timeit m.junk.name
86.1 ns ± 0.644 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

@taldcroft
Copy link
Copy Markdown
Member

@mvhk - planning to get back to this now that you are returned from vacation. My recollection is that there were some things to explore, but that this PR itself was in reasonably good shape and definitely helps performance.

@mhvk
Copy link
Copy Markdown
Contributor Author

mhvk commented Aug 28, 2019

Yes, I think the only real question here is whether we want to explicitly prevent people from setting arbitrary attributes on info (as is the case currently, as which requires __slots__) or whether we actually do not care (in which case we can get rid of __slots__).

@taldcroft
Copy link
Copy Markdown
Member

👍 on explicitly preventing people from setting arbitrary attributes on info. That matches the current release behavior and I think it prevents mistakes far more often than being a limitation. Along those lines, I have often done things like q.name = name on a Quantity when I meant q.info.name.

The current implementation as slots is thus fine to me. If we even move to a different implementation that does the same thing then that will also be fine.

Copy link
Copy Markdown
Member

@taldcroft taldcroft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mhvk - I gave this another go through today and don't see any issues apart from the one trivial (unrelated) fix and the issue with the lost parent. With those (and a rebase to fix the CHANGES conflict), this should be good to merge!

out['mask'] = col.mask
self._represent_as_dict_attrs += ('mask',)

elif method is 'null_value':
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how this got there, but it should be method == 'null_value'.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also not sure, but coincidentally being fixed in #9225, so I'll leave it here.


else:
raise AttributeError("""\
except TypeError:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something is amok in this logic because it is not correctly catching the case below (which is exactly the same as the test_lost_parent_error unit test that got deleted). This should print the informative message instead of an obscure AttributeError.

In [17]: c = Column([1,2,3], name='a')

In [18]: c[:].info.name
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-18-6440802ca069> in <module>()
----> 1 c[:].info.name

~/git/astropy/astropy/utils/data_info.py in __get__(self, instance, owner_cls)
    212             return self
    213 
--> 214         return getattr(instance._parent, self.attr)
    215 
    216     def __set__(self, instance, value):

AttributeError: 'NoneType' object has no attribute 'name'

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very weird - there must have been an intermediate state where it didn't happen, since the test was failing. In any case, I'll fix and reinstate the test!

Also cleans up some other code, in particular by now making use
of the fact that dicts follow insertion order.
Makes code a bit clearer, though not much of a speed-up.
And put back the test that that gives an informative error message.
@mhvk mhvk force-pushed the faster-info-via-meta-class branch from 6b28365 to 30a07db Compare September 13, 2019 23:45
@mhvk
Copy link
Copy Markdown
Contributor Author

mhvk commented Sep 14, 2019

OK, I reinstated the test and fixed the code so that the appropriate message is given again. The difference was that now there always is a weakref - it can just be dead, in which case _parent_ref() returns None.

Copy link
Copy Markdown
Member

@taldcroft taldcroft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks @mhvk!

@taldcroft taldcroft merged commit 7480e0d into astropy:master Sep 14, 2019
@mhvk mhvk deleted the faster-info-via-meta-class branch September 14, 2019 13:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants