Phase 1 of improvements to I/O infrastructure by astrofrog · Pull Request #3090 · astropy/astropy

astrofrog · 2014-11-07T23:54:47Z

Background

This PR implements delayed loading of read/write/identifier functions by allowing strings to be registered instead of functions, and when the functions are needed they are loaded. This should speed up the import of astropy.table by not importing all the rest of astropy when Table is needed.

(I had already started working on improving the I/O infrastructure in #2063 but decided to tackle it in smaller pieces)

Issues

One of the issues I came across is that astropy.io.ascii auto-registers many ASCII formats via a meta-class. This is a problem because it means that in order to register the readers/writers, io.ascii needs to be imported which we want to avoid. But just to see if it would work, in the first commit (dcaf1a6) I implemented lazy loading for all readers/writers except io.ascii, but this doesn't work because we get circular imports when doing:

python -c 'import astropy.io.ascii'

because then it calls the meta-class in io.ascii, registers readers/writers, but during the registration of the registration, _update__doc__ gets called, and then _register_builtins gets called for the first time and tries to import io.ascii, raising an error because io.ascii isn't loaded yet.

The second commit (a9b1d9e) and the PR in its current state therefore hard-codes the ASCII readers/writers by hand in registry.py (but using some partial magic to avoid having to make any changes in io.ascii related to how the readers/writers are handled).

Benefits

One benefit of the new system is that all the readers/writers are defined in one place and it provides a unique place for developers to go when adding/removing built-in formats.

Because of the new deferred importing infrastructure, importing astropy.table is now faster. Before:

$ time python -c 'from astropy.table import Table'

real    0m0.921s
user    0m0.843s
sys 0m0.056s

After:

$ time python -c 'from astropy.table import Table'

real    0m0.511s
user    0m0.468s
sys 0m0.039s

and part of this comes from importing astropy itself:

$ time python -c 'import astropy'

real    0m0.237s
user    0m0.202s
sys 0m0.033s

so importing astropy.table is 2-3 times faster.

Next steps for this PR

If we go ahead with this PR I can add tests that try and ensure that the list of readers/writers in registry.py is up to date with the readers/writers in io.ascii.

Phase 2 (separate PR)

Note that in Phase 2, I plan to work on the idea that was suggesting before of making use of I/O classes instead of functions, but wanted to do things one step at a time. Note also that the above issue with io.ascii demonstrates that self-registering classes may not actually be a good idea if we also want delayed registering/importing of modules - that is, in the present PR, the HDF5 readers/writers are actually only accessed if absolutely needed. However, if they are in a self-registering class, that class can't self-register unless the code is executed. If we use delayed imports of that class as in this PR, there is nothing gained from making the class self-registering. We can still use non-self-registering classes to encapsulate the reading/writing/identifying functionality as well as any documentation for those.

@taldcroft - it would be great if you could share your thoughts on this PR.

cc @embray @eteq @mdboom

embray · 2014-11-10T19:41:19Z

I think I like the general idea of this (I believe I may have even suggested it), but I'm not happy with the implementation details. I'm sorry I don't have anything more specific to say at this time, but I have to give it more thought.

embray · 2014-11-10T19:42:17Z

I think based on your explanation I'd want to see the end result of "phase 1" + "phase 2"

embray · 2014-11-10T19:43:39Z

This could just be changed to

if _builtin_registered: return

and then the rest can drop a level of indentation.

astrofrog · 2014-11-10T20:39:26Z

@embray - I'll just describe the plan I had rather than implement it all, so we can discuss it. The idea was that one format (for a specific class, such as Table) would correspond to one class, so for example FITSTableIO. This would then define for example (but not always all of) a reader, a writer, and an identifier method:

class FITSTableIO(...):

    def identify(...):
        ...

    def read(...):
        ...

    def write(...):
        ...

in the registry, we'd then replace the separate calls to register_* with:

register_io_class('fits', Table, 'astropy.io.fits.connect.FITSTableIO')

Now we could also consider changing this a little so that every sub-package that defines at least one IO reader/writer class also defines a 'hook' function that does all the calls to register_io_class. Then in registry we would only need a single call for a sub-package, even if it contains many IO classes, though we'd still have to list the format names explicitly, e.g.:

register_hook(['ascii', 'ascii.ipac', 'ascii.csv', ...], 'astropy.io.ascii.connect.hook')

One other thing that was planned is that when calling Table.read? one would be presented with a list of formats (I think this already works), and @taldcroft's idea was that to get help on a specific format one would do Table.read(format='fits', help=True). In a few words, the motivation for this is that one could not concatenate the docstrings from all formats into Table.read? because (a) it would be really long, and (b) one would need to import all the IO classes to get the content of the docstring. So then having Table.read(format='fits', help=True) dynamically loads the help for that format on-the-fly.

embray · 2014-11-10T21:05:51Z

Yeah, that mostly sounds good. I think also specifying Table via its class may not be necessary either but I haven't investigated. Don't get me wrong, this is all in a good direction and I think my previous comment was unnecessarily negative; I was mostly just bothered by some of the details. I also don't have a lot of time/energy to spend looking this over right now, unfortunately.

astrofrog · 2014-11-12T09:05:45Z

@embray @taldcroft - I've been working on this more and one alternative (based on @embray's comments) is to basically simply list the formats in the registry and for each of them to specify the sub-package to import, e.g.:

BUILTIN_FORMATS = {
    'fits': 'astropy.io.fits',
    'votable': 'astropy.io.votable'
}

In addition we move the calls to register_reader etc back inside the format-specific sub-packages. If one of the formats is needed, the specified sub-package is imported and its readers/writers/etc will be registered. One issue however with this is that we can't know a-priori which formats apply to which classes and also which formats have readers/writers as opposed to say just readers. So if someone calls NDData.write(..., format='votable') we don't want to be importing astropy.io.votable because that format only applies to Table. So we can then do e.g.:

BUILTIN_FORMATS = {
    ('fits', NDData): 'astropy.io.fits',
    ('fits', Table): 'astropy.io.fits',
    ('votable', Table): 'astropy.io.votable'
}

We still don't know which formats are just readers and which are readers/writers though, so we could consider adding this information too. At this point we've duplicated a lot of information in the above structure and in the sub-packages (in the register* calls). So this may argue against the idea of having of leaving the sub-packages to register readers/writers because all that information has to be duplicated in the registry anyway.

So one solution is instead to go with something like we have in this PR, but tidied up a bit, so that the list of formats/functions is listed at the top in a module-level constant. But the readers/writers are still registered directly in registry.py so that we can avoid duplication of information. Does this sound reasonable?

astrofrog · 2014-11-26T22:12:46Z

@embray @taldcroft - do you have any thoughts on my comments above?

taldcroft · 2014-11-30T18:03:12Z

@astrofrog @embray - what about a hybrid approach where we keep all the self-registration within the sub-packages, but include some additional infrastructure that can auto-generate the appropriate code for astropy.io.registry? In this way there will only ever be one master source, but if you add/change things that would affect the registry to you need to a setup like python setup.py make_registry. This would import all the sub-packages and create the registry code, potentially complete with keyword docs and the whole kaboodle.

It wouldn't necessarily need to create code, since all the info could be encoded in a JSON file for instance. What would be nice is a data structure for each read or write format that has the necessary method names etc, but also has all available format-specific keyword options along with docs.

astrofrog · 2014-12-01T10:00:03Z

@taldcroft - this is an interesting idea. So would this be generated at build/install time? It is related to my suggestion in #3159 that we may want to consider the ability to have generic pre-processing steps for packages that would get run first before even any extensions are built. I can start up a discussion about this on the mailing list.

embray · 2014-12-01T22:05:30Z

I don't know that such a thing has to be "generated"--can't it just be a hard-coded file that lives in the source tree?

astrofrog · 2014-12-01T22:18:47Z

@embray - what @taldcroft is suggesting is that all the readers, parameters, documentation would live in their own sub-packages, but then we'd have that information (at least which readers/writers are available, and parameters for readers/writers) available in the registry. However, for it to be useful it would have to duplicate information that exists in the sub-packages, so we are suggesting that it would be generated to not have to maintain two copies of the same information (if we don't do that, it will be a nightmare to sync).

taldcroft · 2014-12-01T22:21:54Z

@embray - certainly for io.ascii it is more convenient to generate than to hard code. By "generating" you can have the relevant information like format name and description defined locally in the actual class that defines the format. Moreover you can programmatically generate the list of available keyword options which rely on class inheritance to get fully defined. Finally, by maintaining the idea of registration via the base metaclass, it allows for user-defined formats that fully (and automatically) integrate with the I/O connection system.

astrofrog · 2014-12-07T21:44:54Z

@taldcroft - with these changes all the tests pass locally (let's see what Travis says). This includes a backward compatibility layer for the old register_reader/writer/identifier functions. But this should not be merged yet. In particular:

Need to add deprecation warnings for the old functions
I'm not sure if what I'm doing in io.ascii is optimal - in particular there may be a way of actually making the real format classes inherit from BaseIO rather than have wrapper classes that call the ui functions that then call the ASCII IO classes (seems inefficient). Maybe you could take a look at this @taldcroft?
For now I removed the deprecated ASCII formats but can add them back if needed
How do we want to deal with the make_registry.py script? Do we want to include it in astropy/io/registry/ and run it manually as needed for now?
Add registry tests for new API (currently only being tested via backward-compatibility layer)

embray · 2014-12-08T19:30:03Z

I'm sorry I haven't really had a chance to look this over yet, but just at a glance it looks like it's in a good direction. I'll get back to it soon as I can.

astrofrog · 2014-12-15T22:33:56Z

@embray @taldcroft - I think we are going to have to postpone this until 1.1 because I think there are other priorities to focus on by the end of the week. As this PR shows, we can maintain backward compatibility relatively easily, so it's not as crucial as I thought.

hamogu · 2014-12-16T02:33:54Z

How would this deal with formats where the same format can be read as a Table, but also as NDData and maybe a Spectrum, e.g. a fits file? I'm sure it's in there, I just cannot find it.

taldcroft · 2014-12-16T04:34:18Z

I think that postponing this is reasonable.

taldcroft · 2014-12-16T04:50:30Z

@hamogu - as a user you differentiate with the class, i.e. Table.read('f.fits') or NDData.read('f.fits'). In the connector classes there is the _supported_class attribute. It would be nice to see this in action for a class other than Table. I guess this will come after NDData becomes a reality.

astrofrog · 2015-04-25T20:15:11Z

@taldcroft - oops I wanted to talk to you about this PR at the workshop... It would be great if you could try and review what is here currently at some point in the next few weeks and if it seems reaonable to you, I can continue to move ahead with it. The rebase is going to be bit a bit of a nightmare, so I will wait until you've had a chance to look over this (and will rebase if you are ok with it)

astrofrog · 2015-06-06T13:07:41Z

@taldcroft - do you have any initial thoughts on this PR? I would like to try and make sure it gets sufficient testing prior to 1.1 if we are going ahead with it.

taldcroft · 2015-06-08T12:50:30Z

Sorry, I'll have a look this week. It's been a little while...

taldcroft · 2015-06-15T17:04:37Z

It looks like this keyword is being ignored. I would have expected this at least to be used in front of line 132 at the end to decide whether to resolve IO classes.

taldcroft · 2015-06-15T17:22:33Z

@astrofrog - I got myself back up to speed on this and it looks great so far. I haven't dug into the nitty gritty but as far as the big concept I don't see any reason to not proceed.

The next thing that would be really nice is to get some code demonstrating the real user-facing benefit of this whole refactor, namely better docs. The first thing I would love to see is for Table.read and Table.write to have a dynamically generated list of available formats and extensions. The next is the help keyword to output the docstring for the reader/writer for the selected format.

taldcroft · 2015-06-15T17:28:58Z

Kind of a trivial point, but I wonder if these should be considered public attributes. For the BaseIO class and subclasses, these attributes feel like part of the key public functionality. (Though I don't understand what, if anything, the _flexible attribute does since it doesn't seem to be used anywhere).

pllim · 2016-12-02T18:28:52Z

Digging through old FITS PRs. Is this one still relevant? It has merge conflict.

astrofrog · 2016-12-02T18:31:14Z

@pllim - I still want to do this, but I'll likely need to start from scratch. I don't mind if we close it since I'll likely never finish this PR as-is.

astrofrog · 2016-12-02T18:33:34Z

We have an issue for this: #1046 - so this can be closed if others agree

pllim · 2016-12-02T18:54:59Z

I agree. 😄

astrofrog · 2016-12-02T18:58:03Z

Sometimes, you have to know when to give up on a pull request! 😆

astrofrog force-pushed the improvements-to-table-io branch from 5281316 to a9b1d9e Compare November 8, 2014 08:41

astrofrog added io.registry io.ascii io.fits io.misc table io.votable labels Nov 8, 2014

astrofrog added this to the v1.0.0 milestone Nov 8, 2014

astrofrog changed the title ~~WIP: Phase 1 of improvements to I/O infrastructure~~ Phase 1 of improvements to I/O infrastructure Nov 8, 2014

embray reviewed Nov 10, 2014
View reviewed changes

embray mentioned this pull request Nov 24, 2014

Circular import subtlety in astropy.table.__init__ #1402

Closed

astrofrog added 7 commits December 4, 2014 12:25

Only load required I/O functions at the last minute

f9da5be

Hard-code ASCII types in deferred types

9953a86

Added back deprecated ASCII formats for now

1b5e4aa

Fix tests

aaf3a29

Fix use of importlib on Python 2.6

5d1f52c

Switch to using classes for I/O formats

98eb435

Add a JSON registry of formats

5ecbfa6

Fixed test in jsviewer

2701b85

astrofrog added 2 commits December 7, 2014 23:19

Fixed doctest

5c9226c

PEP8

60d7d16

hamogu mentioned this pull request Dec 16, 2014

io.registry: '.csv' is not longer sufficient to select reader. #3189

Merged

astrofrog modified the milestones: v1.1, v1.0.0 Jan 13, 2015

This was referenced Jan 15, 2015

WIP: Use classes for readers/writers and use a metaclass for auto-registering #2063

Closed

Refactor/rethink aspects of the Table I/O #1046

Closed

taldcroft reviewed Jun 15, 2015
View reviewed changes

astrofrog removed this from the v1.1.0 milestone Sep 29, 2015

embray force-pushed the master branch from 63540e0 to 34f4c29 Compare October 5, 2015 21:18

taldcroft mentioned this pull request Sep 8, 2016

make docstring-update optional for io.registry and some additional performance changes #5275

Merged

pllim closed this Dec 2, 2016

bsipocz mentioned this pull request Feb 17, 2017

Table.read and Table.write should provide useful docstrings #1011

Closed

Uh oh!

Uh oh!

Conversation

astrofrog commented Nov 7, 2014

Background

Issues

Benefits

Next steps for this PR

Phase 2 (separate PR)

Uh oh!

embray commented Nov 10, 2014

Uh oh!

embray commented Nov 10, 2014

Uh oh!

embray Nov 10, 2014

Choose a reason for hiding this comment

Uh oh!

astrofrog commented Nov 10, 2014

Uh oh!

embray commented Nov 10, 2014

Uh oh!

astrofrog commented Nov 12, 2014

Uh oh!

astrofrog commented Nov 26, 2014

Uh oh!

taldcroft commented Nov 30, 2014

Uh oh!

astrofrog commented Dec 1, 2014

Uh oh!

embray commented Dec 1, 2014

Uh oh!

astrofrog commented Dec 1, 2014

Uh oh!

taldcroft commented Dec 1, 2014

Uh oh!

astrofrog commented Dec 7, 2014

Uh oh!

embray commented Dec 8, 2014

Uh oh!

astrofrog commented Dec 15, 2014

Uh oh!

hamogu commented Dec 16, 2014

Uh oh!

taldcroft commented Dec 16, 2014

Uh oh!

taldcroft commented Dec 16, 2014

Uh oh!

astrofrog commented Apr 25, 2015

Uh oh!

astrofrog commented Jun 6, 2015

Uh oh!

taldcroft commented Jun 8, 2015

Uh oh!

taldcroft Jun 15, 2015

Choose a reason for hiding this comment

Uh oh!

taldcroft commented Jun 15, 2015

Uh oh!

taldcroft Jun 15, 2015

Choose a reason for hiding this comment

Uh oh!

pllim commented Dec 2, 2016

Uh oh!

astrofrog commented Dec 2, 2016

Uh oh!

astrofrog commented Dec 2, 2016

Uh oh!

pllim commented Dec 2, 2016

Uh oh!

astrofrog commented Dec 2, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants