Skip to content

Phase 1 of improvements to I/O infrastructure#3090

Closed
astrofrog wants to merge 19 commits into
astropy:masterfrom
astrofrog:improvements-to-table-io
Closed

Phase 1 of improvements to I/O infrastructure#3090
astrofrog wants to merge 19 commits into
astropy:masterfrom
astrofrog:improvements-to-table-io

Conversation

@astrofrog

Copy link
Copy Markdown
Member

Background

This PR implements delayed loading of read/write/identifier functions by allowing strings to be registered instead of functions, and when the functions are needed they are loaded. This should speed up the import of astropy.table by not importing all the rest of astropy when Table is needed.

(I had already started working on improving the I/O infrastructure in #2063 but decided to tackle it in smaller pieces)

Issues

One of the issues I came across is that astropy.io.ascii auto-registers many ASCII formats via a meta-class. This is a problem because it means that in order to register the readers/writers, io.ascii needs to be imported which we want to avoid. But just to see if it would work, in the first commit (dcaf1a6) I implemented lazy loading for all readers/writers except io.ascii, but this doesn't work because we get circular imports when doing:

python -c 'import astropy.io.ascii'

because then it calls the meta-class in io.ascii, registers readers/writers, but during the registration of the registration, _update__doc__ gets called, and then _register_builtins gets called for the first time and tries to import io.ascii, raising an error because io.ascii isn't loaded yet.

The second commit (a9b1d9e) and the PR in its current state therefore hard-codes the ASCII readers/writers by hand in registry.py (but using some partial magic to avoid having to make any changes in io.ascii related to how the readers/writers are handled).

Benefits

One benefit of the new system is that all the readers/writers are defined in one place and it provides a unique place for developers to go when adding/removing built-in formats.

Because of the new deferred importing infrastructure, importing astropy.table is now faster. Before:

$ time python -c 'from astropy.table import Table'

real    0m0.921s
user    0m0.843s
sys 0m0.056s

After:

$ time python -c 'from astropy.table import Table'

real    0m0.511s
user    0m0.468s
sys 0m0.039s

and part of this comes from importing astropy itself:

$ time python -c 'import astropy'

real    0m0.237s
user    0m0.202s
sys 0m0.033s

so importing astropy.table is 2-3 times faster.

Next steps for this PR

If we go ahead with this PR I can add tests that try and ensure that the list of readers/writers in registry.py is up to date with the readers/writers in io.ascii.

Phase 2 (separate PR)

Note that in Phase 2, I plan to work on the idea that was suggesting before of making use of I/O classes instead of functions, but wanted to do things one step at a time. Note also that the above issue with io.ascii demonstrates that self-registering classes may not actually be a good idea if we also want delayed registering/importing of modules - that is, in the present PR, the HDF5 readers/writers are actually only accessed if absolutely needed. However, if they are in a self-registering class, that class can't self-register unless the code is executed. If we use delayed imports of that class as in this PR, there is nothing gained from making the class self-registering. We can still use non-self-registering classes to encapsulate the reading/writing/identifying functionality as well as any documentation for those.

@taldcroft - it would be great if you could share your thoughts on this PR.

cc @embray @eteq @mdboom

@astrofrog astrofrog force-pushed the improvements-to-table-io branch from 5281316 to a9b1d9e Compare November 8, 2014 08:41
@astrofrog astrofrog added this to the v1.0.0 milestone Nov 8, 2014
@astrofrog astrofrog changed the title WIP: Phase 1 of improvements to I/O infrastructure Phase 1 of improvements to I/O infrastructure Nov 8, 2014
@embray

embray commented Nov 10, 2014

Copy link
Copy Markdown
Member

I think I like the general idea of this (I believe I may have even suggested it), but I'm not happy with the implementation details. I'm sorry I don't have anything more specific to say at this time, but I have to give it more thought.

@embray

embray commented Nov 10, 2014

Copy link
Copy Markdown
Member

I think based on your explanation I'd want to see the end result of "phase 1" + "phase 2"

Comment thread astropy/io/registry.py Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could just be changed to

if _builtin_registered:
    return

and then the rest can drop a level of indentation.

@astrofrog

Copy link
Copy Markdown
Member Author

@embray - I'll just describe the plan I had rather than implement it all, so we can discuss it. The idea was that one format (for a specific class, such as Table) would correspond to one class, so for example FITSTableIO. This would then define for example (but not always all of) a reader, a writer, and an identifier method:

class FITSTableIO(...):

    def identify(...):
        ...

    def read(...):
        ...

    def write(...):
        ...

in the registry, we'd then replace the separate calls to register_* with:

register_io_class('fits', Table, 'astropy.io.fits.connect.FITSTableIO')

Now we could also consider changing this a little so that every sub-package that defines at least one IO reader/writer class also defines a 'hook' function that does all the calls to register_io_class. Then in registry we would only need a single call for a sub-package, even if it contains many IO classes, though we'd still have to list the format names explicitly, e.g.:

register_hook(['ascii', 'ascii.ipac', 'ascii.csv', ...], 'astropy.io.ascii.connect.hook')

One other thing that was planned is that when calling Table.read? one would be presented with a list of formats (I think this already works), and @taldcroft's idea was that to get help on a specific format one would do Table.read(format='fits', help=True). In a few words, the motivation for this is that one could not concatenate the docstrings from all formats into Table.read? because (a) it would be really long, and (b) one would need to import all the IO classes to get the content of the docstring. So then having Table.read(format='fits', help=True) dynamically loads the help for that format on-the-fly.

@embray

embray commented Nov 10, 2014

Copy link
Copy Markdown
Member

Yeah, that mostly sounds good. I think also specifying Table via its class may not be necessary either but I haven't investigated. Don't get me wrong, this is all in a good direction and I think my previous comment was unnecessarily negative; I was mostly just bothered by some of the details. I also don't have a lot of time/energy to spend looking this over right now, unfortunately.

@astrofrog

Copy link
Copy Markdown
Member Author

@embray @taldcroft - I've been working on this more and one alternative (based on @embray's comments) is to basically simply list the formats in the registry and for each of them to specify the sub-package to import, e.g.:

BUILTIN_FORMATS = {
    'fits': 'astropy.io.fits',
    'votable': 'astropy.io.votable'
}

In addition we move the calls to register_reader etc back inside the format-specific sub-packages. If one of the formats is needed, the specified sub-package is imported and its readers/writers/etc will be registered. One issue however with this is that we can't know a-priori which formats apply to which classes and also which formats have readers/writers as opposed to say just readers. So if someone calls NDData.write(..., format='votable') we don't want to be importing astropy.io.votable because that format only applies to Table. So we can then do e.g.:

BUILTIN_FORMATS = {
    ('fits', NDData): 'astropy.io.fits',
    ('fits', Table): 'astropy.io.fits',
    ('votable', Table): 'astropy.io.votable'
}

We still don't know which formats are just readers and which are readers/writers though, so we could consider adding this information too. At this point we've duplicated a lot of information in the above structure and in the sub-packages (in the register* calls). So this may argue against the idea of having of leaving the sub-packages to register readers/writers because all that information has to be duplicated in the registry anyway.

So one solution is instead to go with something like we have in this PR, but tidied up a bit, so that the list of formats/functions is listed at the top in a module-level constant. But the readers/writers are still registered directly in registry.py so that we can avoid duplication of information. Does this sound reasonable?

@astrofrog

Copy link
Copy Markdown
Member Author

@embray @taldcroft - do you have any thoughts on my comments above?

@taldcroft

Copy link
Copy Markdown
Member

@astrofrog @embray - what about a hybrid approach where we keep all the self-registration within the sub-packages, but include some additional infrastructure that can auto-generate the appropriate code for astropy.io.registry? In this way there will only ever be one master source, but if you add/change things that would affect the registry to you need to a setup like python setup.py make_registry. This would import all the sub-packages and create the registry code, potentially complete with keyword docs and the whole kaboodle.

It wouldn't necessarily need to create code, since all the info could be encoded in a JSON file for instance. What would be nice is a data structure for each read or write format that has the necessary method names etc, but also has all available format-specific keyword options along with docs.

@astrofrog

Copy link
Copy Markdown
Member Author

@taldcroft - this is an interesting idea. So would this be generated at build/install time? It is related to my suggestion in #3159 that we may want to consider the ability to have generic pre-processing steps for packages that would get run first before even any extensions are built. I can start up a discussion about this on the mailing list.

@embray

embray commented Dec 1, 2014

Copy link
Copy Markdown
Member

I don't know that such a thing has to be "generated"--can't it just be a hard-coded file that lives in the source tree?

@astrofrog

Copy link
Copy Markdown
Member Author

@embray - what @taldcroft is suggesting is that all the readers, parameters, documentation would live in their own sub-packages, but then we'd have that information (at least which readers/writers are available, and parameters for readers/writers) available in the registry. However, for it to be useful it would have to duplicate information that exists in the sub-packages, so we are suggesting that it would be generated to not have to maintain two copies of the same information (if we don't do that, it will be a nightmare to sync).

@taldcroft

Copy link
Copy Markdown
Member

@embray - certainly for io.ascii it is more convenient to generate than to hard code. By "generating" you can have the relevant information like format name and description defined locally in the actual class that defines the format. Moreover you can programmatically generate the list of available keyword options which rely on class inheritance to get fully defined. Finally, by maintaining the idea of registration via the base metaclass, it allows for user-defined formats that fully (and automatically) integrate with the I/O connection system.

@astrofrog

Copy link
Copy Markdown
Member Author

@taldcroft - with these changes all the tests pass locally (let's see what Travis says). This includes a backward compatibility layer for the old register_reader/writer/identifier functions. But this should not be merged yet. In particular:

  • Need to add deprecation warnings for the old functions
  • I'm not sure if what I'm doing in io.ascii is optimal - in particular there may be a way of actually making the real format classes inherit from BaseIO rather than have wrapper classes that call the ui functions that then call the ASCII IO classes (seems inefficient). Maybe you could take a look at this @taldcroft?
  • For now I removed the deprecated ASCII formats but can add them back if needed
  • How do we want to deal with the make_registry.py script? Do we want to include it in astropy/io/registry/ and run it manually as needed for now?
  • Add registry tests for new API (currently only being tested via backward-compatibility layer)

@embray

embray commented Dec 8, 2014

Copy link
Copy Markdown
Member

I'm sorry I haven't really had a chance to look this over yet, but just at a glance it looks like it's in a good direction. I'll get back to it soon as I can.

@astrofrog

Copy link
Copy Markdown
Member Author

@embray @taldcroft - I think we are going to have to postpone this until 1.1 because I think there are other priorities to focus on by the end of the week. As this PR shows, we can maintain backward compatibility relatively easily, so it's not as crucial as I thought.

@hamogu

hamogu commented Dec 16, 2014

Copy link
Copy Markdown
Member

How would this deal with formats where the same format can be read as a Table, but also as NDData and maybe a Spectrum, e.g. a fits file? I'm sure it's in there, I just cannot find it.

@taldcroft

Copy link
Copy Markdown
Member

I think that postponing this is reasonable.

@taldcroft

Copy link
Copy Markdown
Member

@hamogu - as a user you differentiate with the class, i.e. Table.read('f.fits') or NDData.read('f.fits'). In the connector classes there is the _supported_class attribute. It would be nice to see this in action for a class other than Table. I guess this will come after NDData becomes a reality.

@astrofrog

Copy link
Copy Markdown
Member Author

@taldcroft - oops I wanted to talk to you about this PR at the workshop... It would be great if you could try and review what is here currently at some point in the next few weeks and if it seems reaonable to you, I can continue to move ahead with it. The rebase is going to be bit a bit of a nightmare, so I will wait until you've had a chance to look over this (and will rebase if you are ok with it)

@astrofrog

Copy link
Copy Markdown
Member Author

@taldcroft - do you have any initial thoughts on this PR? I would like to try and make sure it gets sufficient testing prior to 1.1 if we are going ahead with it.

@taldcroft

Copy link
Copy Markdown
Member

Sorry, I'll have a look this week. It's been a little while...

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this keyword is being ignored. I would have expected this at least to be used in front of line 132 at the end to decide whether to resolve IO classes.

@taldcroft

Copy link
Copy Markdown
Member

@astrofrog - I got myself back up to speed on this and it looks great so far. I haven't dug into the nitty gritty but as far as the big concept I don't see any reason to not proceed.

The next thing that would be really nice is to get some code demonstrating the real user-facing benefit of this whole refactor, namely better docs. The first thing I would love to see is for Table.read and Table.write to have a dynamically generated list of available formats and extensions. The next is the help keyword to output the docstring for the reader/writer for the selected format.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kind of a trivial point, but I wonder if these should be considered public attributes. For the BaseIO class and subclasses, these attributes feel like part of the key public functionality. (Though I don't understand what, if anything, the _flexible attribute does since it doesn't seem to be used anywhere).

@pllim

pllim commented Dec 2, 2016

Copy link
Copy Markdown
Member

Digging through old FITS PRs. Is this one still relevant? It has merge conflict.

@astrofrog

Copy link
Copy Markdown
Member Author

@pllim - I still want to do this, but I'll likely need to start from scratch. I don't mind if we close it since I'll likely never finish this PR as-is.

@astrofrog

Copy link
Copy Markdown
Member Author

We have an issue for this: #1046 - so this can be closed if others agree

@pllim

pllim commented Dec 2, 2016

Copy link
Copy Markdown
Member

I agree. 😄

@pllim pllim closed this Dec 2, 2016
@astrofrog

Copy link
Copy Markdown
Member Author

Sometimes, you have to know when to give up on a pull request! 😆

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants