Lazy loading of HDUs on fits.open #5065
Conversation
|
It would be great to get some feedback on how this holds up in real-world, non-trivial usage. |
|
This sounds great to me - I will try and test it out shortly |
|
A future enhancement to this that I've had in mind for a long time would be an LRU cache of headers from recently opened files. This would allow, for example, skipping directly to specific HDUs of known, recently used files (that have not been modified of course). |
|
`str` and `repr` just return `[primaryhdu]`, which can be confusing. |
|
@juliantaylor Thanks for having a look at this! And I'm glad it addressed the issue for you (modulo this minor issue with printing). I hope to get this updated soon. |
|
Note the feature freeze is Dec 2. Will this be ready by then? I do not want the fix for my issue to miss another release. |
|
Is there anything this needs other than to be rebased, and to fix the repr issue? |
Force-pushed from 6f7dfa5 to fe681de
|
Add a change log? Also maybe a rebase will fix the Appveyor build. |
|
I added a fix for the repr issue. Anyways, IIRC I designed this change to be as minimally disruptive as possible to the existing usage, but to be available as an optimization for the corner cases where it does apply. |
Now merely opening a FITS file does not cause it to seek through the entire file for HDUs. Instead it reads just the primary HDU to ensure that it is a valid FITS file, and in some cases it also reads the first extension HDU (just to see if there are any extensions). Otherwise it does not parse the file for any further HDUs until an operation requesting a specific HDU is performed (in which case HDUs are loaded until the requested HDU is found, or the end of the file or invalid data is reached).

Operations that require knowing about all the HDUs in the file (including, unfortunately, `len()`) result in parsing all the HDUs in the file, which can be slow with files containing a large number of HDUs. Further speedup may be achievable by implementing a fast header reader that parses out only the information from FITS headers necessary to locate valid extensions (without parsing the entire header).
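The "fast header reader" idea works because the size of an HDU's data segment is fully determined by a handful of header keywords, so a scanner only needs `BITPIX`, `NAXIS`/`NAXISn`, `PCOUNT`, and `GCOUNT` to seek past the data to the next HDU. A minimal sketch of that size calculation (taking the keywords as a plain dict rather than doing any actual card parsing; `hdu_data_size` is an invented name, not astropy API):

```python
BLOCK = 2880  # FITS block size in bytes

def hdu_data_size(header):
    """Bytes of data (padded to whole 2880-byte blocks) after a header.

    Uses the standard FITS data-size formula:
        size = |BITPIX|/8 * GCOUNT * (PCOUNT + NAXIS1 * ... * NAXISn)
    `header` is a dict of the few keywords a fast scanner would pull out.
    """
    naxis = header["NAXIS"]
    if naxis == 0:
        return 0  # e.g. an empty primary HDU has no data segment
    npix = 1
    for i in range(1, naxis + 1):
        npix *= header[f"NAXIS{i}"]
    pcount = header.get("PCOUNT", 0)  # defaults per the FITS standard
    gcount = header.get("GCOUNT", 1)
    nbytes = abs(header["BITPIX"]) // 8 * gcount * (pcount + npix)
    # the data segment is padded out to a whole number of FITS blocks
    nblocks = (nbytes + BLOCK - 1) // BLOCK
    return nblocks * BLOCK
```

A scanner would read 2880-byte header blocks until it sees the `END` card, extract just these keywords, then seek forward `hdu_data_size()` bytes to land on the next HDU, never constructing full Header objects.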
…return the correct HDU. For example, ensures that hdul[-1] returns the *actual* last HDU. This of course means all HDUs have to be loaded before we can find the last. Therefore negative indices can have poor performance implications if used carelessly.
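The indexing behavior described above can be illustrated with a toy lazy sequence (all names here are hypothetical, not astropy's actual classes): a positive index pulls HDUs from the file only as far as needed, while a negative index has to exhaust the file first, since the last HDU can't be known otherwise.

```python
class LazyHDUs:
    """Toy model of lazy HDU loading: `loader` yields HDUs one at a
    time (in the real case, by parsing the file as it goes)."""

    def __init__(self, loader):
        self._loader = loader   # iterator producing HDUs on demand
        self._loaded = []       # HDUs read so far
        self._exhausted = False

    def _read_until(self, n):
        # Pull HDUs until index n exists; n < 0 forces reading
        # everything, because we can't know the last HDU otherwise.
        while not self._exhausted and (n < 0 or len(self._loaded) <= n):
            try:
                self._loaded.append(next(self._loader))
            except StopIteration:
                self._exhausted = True

    def __getitem__(self, n):
        self._read_until(n)
        return self._loaded[n]

    @property
    def loaded_count(self):
        return len(self._loaded)
```

So `hdus[1]` stops reading after two HDUs, while `hdus[-1]` scans the whole file; this is why careless negative indexing defeats the optimization.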
… a needless search for duplicates. This behavior was previously untested, and is somewhat user-hostile anyway.
… this case no operation should attempt to read from the file because it is only open for writing.
… after a file has been closed. Now you can open a file, close it, and still read the headers (but not the data---that had been disabled for a long time) after it was closed. This of course means that before the file is closed it reads in any HDUs that haven't already been read, killing much of the benefit gained from lazy-loading in the first place. However this is *not* performed in certain special cases where a user would never manually close the file (either by calling close() or using a with statement)--namely the convenience functions. However, use of the old behavior does raise a deprecation warning. This should give users the chance to fix poorly written code that 'closes' an HDUList before it's done being used. Then we can take fuller advantage of the lazy loading. It might also be worth adding an option to disable the backwards-compat mode so that one can take full advantage of lazy-loading immediately.
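The backwards-compat behavior described above can be modeled with a toy class (again hypothetical names, not astropy internals): `close()` still slurps any unread HDUs so their headers remain accessible afterwards, but touching one of those late-loaded HDUs after close emits the `DeprecationWarning`.

```python
import warnings

class ToyHDUList:
    """Toy model of the backwards-compat close() behavior."""

    def __init__(self, loader):
        self._loader = loader          # iterator yielding HDUs on demand
        self._loaded = []
        self._read_before_close = 0
        self._closed = False

    def __getitem__(self, n):
        if not self._closed:
            # lazy path: read only as far as needed
            while len(self._loaded) <= n:
                self._loaded.append(next(self._loader))
        elif n >= self._read_before_close:
            warnings.warn(
                "Accessing an HDU after an HDUList is closed, where that "
                "HDU was not read while the HDUList was open",
                DeprecationWarning,
            )
        return self._loaded[n]

    def close(self):
        self._read_before_close = len(self._loaded)
        # back-compat: load everything else so headers stay readable,
        # which is exactly what kills most of the lazy-loading benefit
        self._loaded.extend(self._loader)
        self._closed = True
```

An opt-out of this compat mode would simply skip the `extend` in `close()` and raise instead of warning in `__getitem__`.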
Miraculously it didn't break any tests, and that worries me. But I'm almost certain it wasn't intentional, so I'm putting this back the way it was for now just in case...
…oad all the HDUs for now. (The info in HDUList.__repr__ is fairly useless though; that should be improved in a separate issue.)
Force-pushed from bf6d0f2 to 9f00347
|
Rebased again |
|
Strange about the AppVeyor failure. I don't understand why that is. I'll have to load an AppVeyor VM and see if I can reproduce it, I guess. |
|
so freeze is today, will this still go in? |
|
@eteq - thanks for merging this! However, I noticed the following issue with this pull request:
Would it be possible to fix this? Thanks! If you believe the above to be incorrect (which I - @astrobot - very much doubt) you can ping @astrofrog |
|
Hah. Typical. Changelog entry added in a72fdfb. Need to remember that when backporting, though! (Mostly that's a note-to-self.) |
|
I didn't follow this PR but as a side note - it broke code here which looks into the file to get specific extensions, for example, getting the "SCI" extensions in an HST file. If this was the intended functionality, we'll fix it on our side but it can be a surprise to others as well. |
|
@nden can you please open an issue for this? The intent was not to change functionality, as I understand it; it was just to lazily load. |
|
The problem this PR solves is the same problem you were trying to solve, but without limiting the solution (seemingly arbitrarily, though understandably from a technical standpoint) to the "convenience functions", the use of which I try to discourage for code that makes significant in-place manipulation of FITS files, or really for any code outside of simple interactive scripting. In other words, it makes performance improvements even for code that's using the "lower level" API. I admit it is complicated though. |
    warnings.warn(textwrap.dedent(
        """\
        Accessing an HDU after an HDUList is closed, where
        that HDU was not read while the HDUList was open
|
the only thing that is actually improved compared to the alternative is: That's it. This is the perfect case for the convenience functions. |
|
When you say "That's it" I don't think you appreciate how extremely common that is (and it usually does not appear in isolation--it may be followed up with other operations on the same open file, so it's not directly replaceable by the convenience functions). |
I accept cash. |
That is my point, as soon as it does not appear in isolation anymore lazy loading becomes much less relevant. |
|
I think you're still missing my point. First of all, the gains are not just from reading the primary header, though that's certainly where the biggest gain comes from.

But even if that were the only advantage, there's tons of existing code out there that is already using the full API and not the "convenience functions" (which have always been advertised as wrappers around the lower-level API, mainly for interactive convenience), and I think there's a strange cognitive overhead involved in trying to explain "Well, if you use the 'convenience functions' you'll get performance improvements, but not if you use the lower-level API on top of which the convenience functions are built."

The way this patch does things--loading HDUs as needed--is the obvious way, I think, to write a FITS reader in the first place. IIRC it's more or less how CFITSIO works too--CFITSIO was designed better in the first place, I think, insofar as it separates the API from the details of how the file is being read, which are all squirreled away in the mostly-opaque fitsfile struct.

The only reason this design seems complicated is because it's trying to rectify past sins without breaking backwards compatibility. In this case the past sin is the existing interface: a) I think it's actually a reasonably popular interface, it makes sense to users (and in terms of how one thinks of a FITS file as an ordered collection of HDUs) and |
Closes astropy#5582. astropy#5065 implemented lazy loading of HDUs in FITS files. But it did it in a way that maintains backwards compatibility at the expense of performance, loading the remaining HDUs to allow reading the headers after the file was closed. Basically this reverts 5c66c2d
This is something I've wanted to do for a long time. And while this isn't exactly how I'd envisioned doing it, it lays a lot of groundwork. This is my alternative to #4634
The idea here is that `fits.open()` currently immediately seeks through an entire FITS file, finds all HDUs, and creates HDU objects and, more importantly, Header objects for them. In a typical FITS file this overhead is negligible. But for files containing hundreds of HDUs this can have a serious impact. It can also have a serious impact even when working with files of just a few HDUs, but when there are thousands of such files that need to be looped over (and in particular when only the PRIMARY header is needed, which is frequently the case).

This is an experimental attempt to add lazy-loading of HDUs, so HDUs are only read on demand. For example, if you open a file with `hdul = fits.open(filename)` and immediately do something like `hdul[3]`, it will read all HDUs up through the fourth one and then stop any further reading until necessary.

Of course some operations, most notably `len(hdul)` and `hdul.info()`, necessitate reading the entire file. But these are not the most common cases in scripts, which is what this change really concerns itself with (as opposed to command-line interaction).

The biggest shortcoming of this patch as it stands is the issue I mentioned in #4634 regarding `hdulist.close()`. Because `fits.open()` used to load all HDUs, and in particular their headers, I have no doubt that there is code in the wild that follows this sort of pattern:

Because of the lazy-loading, `hdulist[3]` may not have been loaded yet before `hdulist.close()` was called, and so this will fail. I'm of the opinion that it *should* fail--you shouldn't `close()` the file object until you're actually done doing everything you're going to do with it. But unfortunately PyFITS has played loose on this forever.

So currently, to maintain backwards compatibility, this patch ensures that all HDUs are loaded before the file is closed. This of course kills most of the benefit from lazy-loading. But not in all cases! In special cases like the convenience functions (`getheader`, `getdata`, etc.), in which the user never has direct access to the `HDUList` object, we can still take full advantage of lazy-loading. Furthermore, attempts to access a previously not-loaded HDU after the file has been closed will result in a `DeprecationWarning`, with the aim of disallowing this usage in the long term, so that lazy-loading can be fully taken advantage of.