Skip to content

Lazy loading of HDUs on fits.open#5065

Merged
eteq merged 11 commits intoastropy:masterfrom
embray:fits/lazy-load-hdus
Dec 9, 2016
Merged

Lazy loading of HDUs on fits.open#5065
eteq merged 11 commits intoastropy:masterfrom
embray:fits/lazy-load-hdus

Conversation

@embray
Copy link
Member

@embray embray commented Jun 10, 2016

This is something I've wanted to do a long time. And while this isn't exactly how I had in mind to do it, it lays a lot of groundwork. This is my alternative to #4634

The idea here is that fits.open() currently immediately seeks through an entire FITS file, and finds all HDUs, as well as creates HDU objects and more importantly Header objects for them. In a typical FITS file this overhead is negligible. But for files containing hundreds of HDUs this can have a serious impact. Further this can have a serious impact even when working with files of just a few HDUs, but when there are thousands of them that need to be looped over (and in particular when only the PRIMARY header is needed which is frequently the case).

This is an experimental attempt to add lazy-loading of HDUs, so HDUs are only read on demand. For example if you open a file with hdul = fits.open(filename) and immediately do something like hdul[3] it read all HDUs up through the fourth one and then stop any further reading until necessary.

Of course some operations, most notably len(hdul) and hdul.info() necessitate reading the entire file. But these are not the most common cases in scripts which is what this change really concerns itself with (as opposed to command-line interaction).

The biggest shortcoming to this patch as it stands is the issue I mentioned in #4634 regarding hdulist.close(). Because fits.open() used to load all HDUs, and in particular their headers, I have no doubt that there is code in the wild that follows this sort of pattern:

>>> hdulist = fits.open(filename)
>>> # do some stuff
>>> hdulist.close()
>>> print(hdulist[3].header)

Because of the lazy-loading, hdulist[3] may not have been loaded yet before hdulist.close() was called, and so this will fail. I'm of the opinion that it should fail--you shouldn't close() the file object until you're actually done doing everything you're going to do with it. But unfortunately PyFITS has played loose on this forever.

So currently, to maintain backwards compatibility, this patch ensures that all HDUs are loaded before the file is closed. This is of course kills most benefit from lazy-loading. But not in all cases! In special cases like convenience functions (getheader, getdata, etc.) in which the user never has direct access to the HDUList object we can still take full advantage of lazy-loading. Furthermore, attempts to access a previously not loaded HDU after the file has been closed will result in a DeprecationWarning, with the aim of disallowing this usage in the long term, so that lazy-loading can be fully taken advantage of.

@embray
Copy link
Member Author

embray commented Jun 10, 2016

It would great to get some feedback on how this holds up in real-world, non-trivial usage.

@eteq eteq added this to the v1.3.0 milestone Jun 10, 2016
@astrofrog
Copy link
Member

This sounds great to me - I will try and test it out shortly

@embray
Copy link
Member Author

embray commented Aug 16, 2016

A future enhancement to this that I've had in mind for a long time would be an LRU cache of headers from recently opened files. This would allow, for example, skipping directly to specific HDUs of known, recently used files (that have not been modified of course).

@juliantaylor
Copy link

str and repr just return return [primaryhdu] which can be confusing.
It should probably either just load everything or print itself as a LazyHDUList.

@embray
Copy link
Member Author

embray commented Sep 23, 2016

@juliantaylor Thanks for having a look at this! And I'm glad it addressed the issue for you (modulo this minor issue with printing). I hope to get this updated soon.

@juliantaylor
Copy link

note the feature freeze is dec 2. Will this get ready by then? I do not want the fix for my issue to miss another release.

@embray
Copy link
Member Author

embray commented Nov 9, 2016

Is there anything this needs other than to be rebased, and to fix the repr issue?

@embray embray force-pushed the fits/lazy-load-hdus branch from 6f7dfa5 to fe681de Compare November 9, 2016 11:42
@pllim
Copy link
Member

pllim commented Nov 9, 2016

Add a change log? Also maybe a rebase will fix the Appveyor build.

@embray
Copy link
Member Author

embray commented Nov 21, 2016

I added a fix for the __repr__ issue. For now __repr__ just loads all the HDUs, so no dramatic changes (the HDUList repr is fairly useless anyways; it would be great if it displayed something useful--I think I already have an issue open about that somewhere).

Anyways, IIRC I designed this change to be as minimally disruptive as possible to the existing usage, but to be available as an optimization for the corner cases where it does apply.

Now merely opening a FITS file does not cause it to seek through the entire
FITS file for HDUs.  Instead it reads just the primary HDU to ensure that it is
a valid FITS file, and in some cases it also reads the first extension HDU
(just to see if there are any extensions).

Otherwise it does not parse the file for any further HDUs until an operation
requesting a specific HDU is performed (in which case HDUs are loaded until the
requested HDU is found or the end of the file or invalid data is reached).

Operations that require knowing about all the HDUs in the file (including,
unfortunately, len()) result in parsing all the HDUs in the file.  This can
unfortunately be slow with files containing a large number of HDUs.

Further speedup may be achievable by implementing a fast header reader that
parses out only the information from FITS headers necessary to to locate valid
extensions (without parsing the entire header).
…return the correct HDU. For example, ensures that hdul[-1] returns the *actual* last HDU. This of course means all HDUs have to be loaded before we can find the last. Therefore negative indices can have poor performance implications if used carelessly.
… a needless search for duplicates. This behavior was previously not tested for anyways, and is somewhat user-hostile anyways.
… this case no operation should attempt to read from the file because it is only open for writing.
… after a file has been closed.

Now you can open a file, close it, and still read the headers (but not the data---that had been disabled for a long time) after it was closed.  This of course means that before the file is closed it reads in any HDUs that haven't already been read, killing much of the benefit gained from lazy-loading in the first place.

However this is *not* performed in certain special cases where a user would never manually (either by calling close() or using a with statement) close the file--namely the convenience functions.

However, use of the old behavior does raise a deprecation warning.  This should give users the chance to fix poorly written code that 'closes' and HDUList before it's done being used.  Then we can take fuller advantage of the lazy loading.

It might also be worth adding an option to disable the backwards-compat mode so that one can take full advantage of lazy-loading immediately.
Miraculously it didn't break any tests, and that worries me.  But I'm almost certain it wasn't intentional so putting this back the way they were for now just in case...
…oad all the HDUs for now.

(The info in HDUList.__repr__ is fairly useless though; that should be improved in a separate issue.)
@embray embray force-pushed the fits/lazy-load-hdus branch from bf6d0f2 to 9f00347 Compare November 21, 2016 11:32
@embray
Copy link
Member Author

embray commented Nov 21, 2016

Rebased again

@embray
Copy link
Member Author

embray commented Nov 21, 2016

Strange about the AppVeyor failure. I don't understand why that is. I'll have to load an AppVeyor VM and see if I can reproduce it, I guess.

@juliantaylor
Copy link

so freeze is today, will this still go in?
if its to complicated to get reviewed, maybe put in the simpler alternative instead?
I don't want this to miss another release.

@eteq eteq merged commit 6ab68e2 into astropy:master Dec 9, 2016
eteq added a commit that referenced this pull request Dec 9, 2016
Lazy loading of HDUs on fits.open
@astrobot
Copy link

astrobot commented Dec 9, 2016

@eteq - thanks for merging this! However, I noticed the following issue with this pull request:

  • Changelog entry not present (or pull request number missing) and neither the Affects-dev nor the no-changelog-entry-needed label are set

Would it be possible to fix this? Thanks!

If you believe the above to be incorrect (which I - @astrobot - very much doubt) you can ping @astrofrog

eteq added a commit that referenced this pull request Dec 9, 2016
@eteq
Copy link
Member

eteq commented Dec 9, 2016

Hah. Typical. changelog entry added in a72fdfb . Need to remember that when backporting, though! (mostly that's a note-to-self)

@nden
Copy link
Contributor

nden commented Dec 10, 2016

I didn't follow this PR but as a side note - it broke code here which looks into the file to get specific extensions, for example, getting the "SCI" extensions in an HST file.

f = fits.open('j94f05bgq_flt.fits')
[f[("SCI", img.header['EXTVER'])] for img in f[1:] if img.header['EXTNAME'].lower() == 'sci']
Out[13]: []
In [14]: f.info()
In [15]: [f[("SCI", img.header['EXTVER'])] for img in f[1:] if img.header['EXTNAME'].lower() == 'sci']
Out[15]: 
[<astropy.io.fits.hdu.image.ImageHDU at 0x31f0f90>,
 <astropy.io.fits.hdu.image.ImageHDU at 0x3407890>]

If this was the intended functionality, we'll fix it on our side but it can be a surprise to others as well.

@mwcraig
Copy link
Member

mwcraig commented Dec 11, 2016

@nden can you please open an issue for this? The intent was not to change functionality,ai understand it, it was just to lazily load.

@embray
Copy link
Member Author

embray commented Dec 12, 2016

Ideally the HDUList class would never have been a subclass of list in the first place, and would have been designed somewhat differently, but that would be difficult to exorcise.

The problem this PR solves is the same problem you were trying to solve, but without limiting the solution (seemingly arbitrarily, though understandably from a technical standpoint) to the "convenience functions" the use of which I try to discourage for code that makes significant in-place manipulation of FITS files, or really any code outside of simple interactive scripting. In other words, it makes performance improvements even for code that's using the "lower level" API.

I admit it is complicated though.

warnings.warn(textwrap.dedent(
"""\
Accessing an HDU after an HDUList is closed, where
that HDU was no read while the HDUList was open
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: "no" -> "not"

@juliantaylor
Copy link

the only thing that is actually improved compared to the alternative is:

fits.open(file)[0].header

That's it. This is the perfect case for the convenience functions.
As soon as you touch the data, the performance of header parsing does not matter, as soon as you modify the file it doesn't matter as you have to rewrite the thing anyway.
It is better for only looking at a subset of headers of a very large fits fits file, but is that a practical use case?
Of course being able to use open for only looking at the primary is nice, I'd still rather had this issue solved sufficiently 10 month ago and improved upon latter.

eteq added a commit that referenced this pull request Dec 13, 2016
Lazy loading of HDUs on fits.open
eteq added a commit that referenced this pull request Dec 13, 2016
@eteq
Copy link
Member

eteq commented Dec 13, 2016

backported to v1.3.x in f096868 (and changelog in 1d4cb4d)

@embray
Copy link
Member Author

embray commented Dec 13, 2016

When you say "That's it" I don't think you appreciate how extremely common that is (and it usually does not appear in isolation--it may be followed up with other operations on the same open file, so it's not directly replaceable by the convenience functions).

@embray
Copy link
Member Author

embray commented Dec 13, 2016

Of course being able to use open for only looking at the primary is nice, I'd still rather had this issue solved sufficiently 10 month ago and improved upon latter.

I accept cash.

@embray embray deleted the fits/lazy-load-hdus branch December 13, 2016 15:31
@juliantaylor
Copy link

When you say "That's it" I don't think you appreciate how extremely common that is (and it usually does not appear in isolation--it may be followed up with other operations on the same open file, so it's not directly replaceable by the convenience functions).

That is my point, as soon as it does not appear in isolation anymore lazy loading becomes much less relevant.
Parsing the required headers of the full file is not expensive compared to most stuff you do on the file.
It is only expensive when the only thing you are doing is looking at the primary header and then doing very cheap operations based on that like moving the file (not copying) or updating some database keywords.
This is a very marginal performance improvement for edge cases, which is why I have argued for a simple solution to the problem.

@embray
Copy link
Member Author

embray commented Dec 15, 2016

I think you're still missing my point. First of all, the only gains are not just from reading the primary header, though that's certainly where the biggest gain comes from. But even so, even if that were the only advantage, there's tons of existing code out there that is already using the full API and not the "convenience functions" (which have always been advertised as wrappers around the lower-level API mainly for interactive convenience), and I think there's a strange cognitive overhead involved in trying to explain "Well, if you use the 'convenience functions' you'll get performance improvements, but not if you use the lower-level API on top of which the convenience functions are built".

The way this patch does things--loading HDUs as needed--is the obvious way, I think, to write a FITS reader in the first place. IIRC it's more or less how CFITSIO works too--CFITSIO was designed better in the first place, I think, insofar as separating the API better from the details of how the file is being read, which are all squirreled away in the mostly-opaque Fitsfile struct. The only reason this design seems complicated is because it's trying to rectify past sins without breaking backwards compatibility. In this case the past sin is the HDUList object which is the primary user interface to the FITS file, but is too close to the underlying file reader implementation (in fact one in the same). That's where the trouble comes in. A cleaner implementation might separate it entirely, but you'd still be left with an object that allows a list-like interface to the FITS file, unless that were deprecated and replaced at some point. But I don't see much reason to do that since:

a) I think it's actually a reasonably popular interface, it makes sense to users (and in terms of how one thinks of a FITS file as an ordered collection of HDUs) and
b) The current design works, and is not that hard to understand IMO

saimn added a commit to saimn/astropy that referenced this pull request May 16, 2017
Closes astropy#5582. astropy#5065 implemented lazy loading of HDUs in fits files. But
it did it in a way that maintains backwards compatibility at the expense
of performance, loading the remaining HDUs to allow read the headers
after the file was closed. Basically this reverts
5c66c2d
saimn added a commit to saimn/astropy that referenced this pull request May 17, 2017
Closes astropy#5582. astropy#5065 implemented lazy loading of HDUs in fits files. But
it did it in a way that maintains backwards compatibility at the expense
of performance, loading the remaining HDUs to allow read the headers
after the file was closed. Basically this reverts
5c66c2d
dhomeier pushed a commit to dhomeier/astropy that referenced this pull request May 24, 2017
saimn added a commit to saimn/astropy that referenced this pull request May 30, 2017
Closes astropy#5582. astropy#5065 implemented lazy loading of HDUs in fits files. But
it did it in a way that maintains backwards compatibility at the expense
of performance, loading the remaining HDUs to allow read the headers
after the file was closed. Basically this reverts
5c66c2d
saimn added a commit to saimn/astropy that referenced this pull request Jun 6, 2017
Closes astropy#5582. astropy#5065 implemented lazy loading of HDUs in fits files. But
it did it in a way that maintains backwards compatibility at the expense
of performance, loading the remaining HDUs to allow read the headers
after the file was closed. Basically this reverts
5c66c2d
saimn added a commit to saimn/astropy that referenced this pull request Jun 9, 2017
Closes astropy#5582. astropy#5065 implemented lazy loading of HDUs in fits files. But
it did it in a way that maintains backwards compatibility at the expense
of performance, loading the remaining HDUs to allow read the headers
after the file was closed. Basically this reverts
5c66c2d
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants