Phase 1 of improvements to I/O infrastructure#3090
Conversation
5281316 to
a9b1d9e
Compare
|
I think I like the general idea of this (I believe I may have even suggested it), but I'm not happy with the implementation details. I'm sorry I don't have anything more specific to say at this time, but I have to give it more thought. |
|
I think based on your explanation I'd want to see the end result of "phase 1" + "phase 2" |
There was a problem hiding this comment.
This could just be changed to
if _builtin_registered:
return
and then the rest can drop a level of indentation.
|
@embray - I'll just describe the plan I had rather than implement it all, so we can discuss it. The idea was that one format (for a specific class, such as Table) would correspond to one class, so for example in the registry, we'd then replace the separate calls to Now we could also consider changing this a little so that every sub-package that defines at least one IO reader/writer class also defines a 'hook' function that does all the calls to One other thing that was planned is that when calling |
|
Yeah, that mostly sounds good. I think also specifying |
|
@embray @taldcroft - I've been working on this more and one alternative (based on @embray's comments) is to basically simply list the formats in the registry and for each of them to specify the sub-package to import, e.g.: In addition we move the calls to We still don't know which formats are just readers and which are readers/writers though, so we could consider adding this information too. At this point we've duplicated a lot of information in the above structure and in the sub-packages (in the register* calls). So this may argue against the idea of having of leaving the sub-packages to register readers/writers because all that information has to be duplicated in the registry anyway. So one solution is instead to go with something like we have in this PR, but tidied up a bit, so that the list of formats/functions is listed at the top in a module-level constant. But the readers/writers are still registered directly in |
|
@embray @taldcroft - do you have any thoughts on my comments above? |
|
@astrofrog @embray - what about a hybrid approach where we keep all the self-registration within the sub-packages, but include some additional infrastructure that can auto-generate the appropriate code for It wouldn't necessarily need to create code, since all the info could be encoded in a JSON file for instance. What would be nice is a data structure for each read or write format that has the necessary method names etc, but also has all available format-specific keyword options along with docs. |
|
@taldcroft - this is an interesting idea. So would this be generated at build/install time? It is related to my suggestion in #3159 that we may want to consider the ability to have generic pre-processing steps for packages that would get run first before even any extensions are built. I can start up a discussion about this on the mailing list. |
|
I don't know that such a thing has to be "generated"--can't it just be a hard-coded file that lives in the source tree? |
|
@embray - what @taldcroft is suggesting is that all the readers, parameters, documentation would live in their own sub-packages, but then we'd have that information (at least which readers/writers are available, and parameters for readers/writers) available in the registry. However, for it to be useful it would have to duplicate information that exists in the sub-packages, so we are suggesting that it would be generated to not have to maintain two copies of the same information (if we don't do that, it will be a nightmare to sync). |
|
@embray - certainly for |
|
@taldcroft - with these changes all the tests pass locally (let's see what Travis says). This includes a backward compatibility layer for the old
|
|
I'm sorry I haven't really had a chance to look this over yet, but just at a glance it looks like it's in a good direction. I'll get back to it soon as I can. |
|
@embray @taldcroft - I think we are going to have to postpone this until 1.1 because I think there are other priorities to focus on by the end of the week. As this PR shows, we can maintain backward compatibility relatively easily, so it's not as crucial as I thought. |
|
How would this deal with formats where the same format can be read as a |
|
I think that postponing this is reasonable. |
|
@hamogu - as a user you differentiate with the class, i.e. |
|
@taldcroft - oops I wanted to talk to you about this PR at the workshop... It would be great if you could try and review what is here currently at some point in the next few weeks and if it seems reaonable to you, I can continue to move ahead with it. The rebase is going to be bit a bit of a nightmare, so I will wait until you've had a chance to look over this (and will rebase if you are ok with it) |
|
@taldcroft - do you have any initial thoughts on this PR? I would like to try and make sure it gets sufficient testing prior to 1.1 if we are going ahead with it. |
|
Sorry, I'll have a look this week. It's been a little while... |
There was a problem hiding this comment.
It looks like this keyword is being ignored. I would have expected this at least to be used in front of line 132 at the end to decide whether to resolve IO classes.
|
@astrofrog - I got myself back up to speed on this and it looks great so far. I haven't dug into the nitty gritty but as far as the big concept I don't see any reason to not proceed. The next thing that would be really nice is to get some code demonstrating the real user-facing benefit of this whole refactor, namely better docs. The first thing I would love to see is for |
There was a problem hiding this comment.
Kind of a trivial point, but I wonder if these should be considered public attributes. For the BaseIO class and subclasses, these attributes feel like part of the key public functionality. (Though I don't understand what, if anything, the _flexible attribute does since it doesn't seem to be used anywhere).
|
Digging through old FITS PRs. Is this one still relevant? It has merge conflict. |
|
@pllim - I still want to do this, but I'll likely need to start from scratch. I don't mind if we close it since I'll likely never finish this PR as-is. |
|
We have an issue for this: #1046 - so this can be closed if others agree |
|
I agree. 😄 |
|
Sometimes, you have to know when to give up on a pull request! 😆 |
Background
This PR implements delayed loading of read/write/identifier functions by allowing strings to be registered instead of functions, and when the functions are needed they are loaded. This should speed up the import of astropy.table by not importing all the rest of astropy when Table is needed.
(I had already started working on improving the I/O infrastructure in #2063 but decided to tackle it in smaller pieces)
Issues
One of the issues I came across is that
astropy.io.asciiauto-registers many ASCII formats via a meta-class. This is a problem because it means that in order to register the readers/writers, io.ascii needs to be imported which we want to avoid. But just to see if it would work, in the first commit (dcaf1a6) I implemented lazy loading for all readers/writers except io.ascii, but this doesn't work because we get circular imports when doing:because then it calls the meta-class in io.ascii, registers readers/writers, but during the registration of the registration,
_update__doc__gets called, and then_register_builtinsgets called for the first time and tries to importio.ascii, raising an error becauseio.asciiisn't loaded yet.The second commit (a9b1d9e) and the PR in its current state therefore hard-codes the ASCII readers/writers by hand in
registry.py(but using somepartialmagic to avoid having to make any changes inio.asciirelated to how the readers/writers are handled).Benefits
One benefit of the new system is that all the readers/writers are defined in one place and it provides a unique place for developers to go when adding/removing built-in formats.
Because of the new deferred importing infrastructure, importing
astropy.tableis now faster. Before:After:
and part of this comes from importing astropy itself:
so importing astropy.table is 2-3 times faster.
Next steps for this PR
If we go ahead with this PR I can add tests that try and ensure that the list of readers/writers in
registry.pyis up to date with the readers/writers inio.ascii.Phase 2 (separate PR)
Note that in Phase 2, I plan to work on the idea that was suggesting before of making use of I/O classes instead of functions, but wanted to do things one step at a time. Note also that the above issue with
io.asciidemonstrates that self-registering classes may not actually be a good idea if we also want delayed registering/importing of modules - that is, in the present PR, the HDF5 readers/writers are actually only accessed if absolutely needed. However, if they are in a self-registering class, that class can't self-register unless the code is executed. If we use delayed imports of that class as in this PR, there is nothing gained from making the class self-registering. We can still use non-self-registering classes to encapsulate the reading/writing/identifying functionality as well as any documentation for those.@taldcroft - it would be great if you could share your thoughts on this PR.
cc @embray @eteq @mdboom