io.fits: allow limiting the number of read extensions #4634
juliantaylor wants to merge 1 commit into astropy:master from
Conversation
In the convenience functions, only read as many extensions as necessary to satisfy the user's request. This can greatly increase performance when one is only interested in an early subset of extensions. It adds the `ext` keyword to the fitsopen function so it will also be accepted by the top-level fits.open function, but this is an implementation detail and not documented. Closes astropy gh-4589
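The early-exit idea the PR describes can be sketched without astropy at all. The toy format and helper names below are invented for illustration (real FITS parsing is far more involved): a reader over a stream of length-prefixed "extensions" stops as soon as the requested one has been read, instead of parsing the whole file.

```python
# Conceptual sketch only (not astropy internals): stop reading the
# stream once the requested extension index has been reached.
import io
import struct

def write_ext(buf, payload):
    # each toy "extension" is a 4-byte big-endian length plus payload
    buf.write(struct.pack(">I", len(payload)))
    buf.write(payload)

def read_ext(stream, ext):
    """Return extension number `ext`, reading nothing past it."""
    for _ in range(ext + 1):
        (size,) = struct.unpack(">I", stream.read(4))
        payload = stream.read(size)
    return payload

buf = io.BytesIO()
for name in (b"PRIMARY", b"SCI", b"ERR", b"DQ"):
    write_ext(buf, name)
total = buf.tell()
buf.seek(0)
print(read_ext(buf, 1))    # b'SCI'
print(buf.tell() < total)  # True: the later extensions were never read
```

The win is largest when the interesting extension sits early in a file with many (or very large) later extensions, which is exactly the case this PR targets.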
|
should the private new
|
ping
    data = hdu.data
    if data is None and extidx == 0:
        try:
            hdulist, extidx = _getext(filename, mode, ext=1, *args, **kwargs)
I am against hardcoding ext=1 here. There are all sorts of data formats out there where the data is not always in extension 1. This is bound to break someone else's code.
Extension 1 is already hardcoded; before, it just loaded all extensions and threw away everything except the first one, see the line below.
It is slightly less efficient, though, if you only have one extension, since you load the first one twice. I didn't consider changing that worthwhile, but it could be done.
|
c/c @embray (in case you are interested)
    saved_compression_enabled = compressed.COMPRESSION_ENABLED

    # (extname, extver) tuple
    ext = kwargs.pop("ext", None)
I would suggest renaming "ext" to something that better conveys the meaning of this parameter (stop reading data at this extension): stop_ext, last_read_ext, etc.
I'd actually like a different approach altogether, but I'll update more about that in a separate comment.
|
All the tests seem to pass, so I guess at least this didn't break anything that we are testing for. I suspect you can get rid of the Sphinx warning by rebasing to the latest master. Since your motivation in this PR is performance, do you have any profiling data?
    return self._file.name
    return None

@classmethod
This doesn't use self (and for classmethods the "self" is typically spelled "cls"). But in any case this can be a staticmethod instead. I'd also prefer a different name...maybe something like "_ext_match".
|
This has been on my todo list for a long time, so thanks for tackling it. Rather than just supporting this in the

A better approach (which you've already made some progress on by modifying

In other words, opening a FITS file should not read any HDUs at all (though the Primary, at least, will almost always be read). For example, if I do:

The

If an HDU is given by name like

Some operations, like

But doing this generally at the HDU level will get this functionality for free in all the "convenience" functions, as well as all other use cases.
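A minimal sketch of the lazy approach described above, with invented names and a toy line-based "file" standing in for real FITS parsing: nothing is read at open time, and HDUs are parsed only when first accessed, by index or by name.

```python
# Toy illustration of lazy HDU loading (not astropy code): each line
# of the stream stands in for one parsed HDU.
import io

class LazyHDUList:
    def __init__(self, stream):
        self._stream = stream
        self._loaded = []     # HDUs read so far, in file order
        self._done = False

    def _read_next(self):
        line = self._stream.readline().strip()
        if not line:
            self._done = True
            return None
        hdu = line.decode()   # stand-in for a real parsed HDU
        self._loaded.append(hdu)
        return hdu

    def __getitem__(self, key):
        if isinstance(key, int):
            # read forward only as far as the requested index
            while not self._done and len(self._loaded) <= key:
                self._read_next()
            return self._loaded[key]
        # lookup by EXTNAME: check what is already loaded, then keep
        # reading until a match or end of file
        for hdu in self._loaded:
            if hdu == key:
                return hdu
        while not self._done:
            hdu = self._read_next()
            if hdu == key:
                return hdu
        raise KeyError(key)

f = io.BytesIO(b"PRIMARY\nSCI\nERR\nDQ\n")
hdul = LazyHDUList(f)
print(hdul[1])       # 'SCI' -- only two "HDUs" were read
print(hdul["ERR"])   # 'ERR' -- reads one more, not the whole file
```

Because the laziness lives in the list itself, every caller (convenience functions included) benefits without needing a special keyword.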
|
A fully lazy HDU list might be interesting but is more complicated to do. I heard today is the feature freeze for the next astropy version; can this be considered for merge if the remaining minor comments are fixed?
|
Oh, never mind, the feature freeze was last week ... well, I'll stick to my branch for now then.
|
It wouldn't be significantly more complicated than what you've already done, and would work more transparently. It would mostly be a matter of taking the work you've already done and going through the rest of the
        # 'PRIMARY' should always work as a reference to the first HDU
        if ((name == _key or (_key == 'PRIMARY' and idx == 0)) and
                (_ver is None or _ver == hdu.ver)):
        if self._ext_equal((_key, _ver), hdu, idx):
I'm not sure this is the best way to refactor this either, since now _ext_equal ends up duplicating a lot of work already done in index_of. Need to rethink this a bit...
Yes, I tried to avoid as much of it as I could think of, but it's either having two functions that are 90% the same or some duplicate work.
The extra work is negligible in terms of CPU time, so I opted for less code.
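For illustration, here is a standalone version of the kind of matching helper under discussion. The name and signature are hypothetical, not astropy's actual `_ext_equal`; it mirrors the quoted condition: the key must match the HDU's EXTNAME (with 'PRIMARY' always matching index 0), and EXTVER must match if one was given.

```python
# Hypothetical stand-in for the suggested "_ext_match" helper.
def ext_match(key, ver, hdu_name, hdu_ver, idx):
    # 'PRIMARY' should always work as a reference to the first HDU
    name_ok = hdu_name == key or (key == "PRIMARY" and idx == 0)
    # an unspecified EXTVER matches any version
    return name_ok and (ver is None or ver == hdu_ver)

print(ext_match("PRIMARY", None, "SCI", 1, 0))  # True: index-0 alias
print(ext_match("SCI", 2, "SCI", 1, 3))         # False: EXTVER differs
```

Factoring the condition out like this is what lets both `index_of` and the early-exit read loop share one definition of "this HDU matches the requested extension".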
|
I'll see if I can come up with an alternative.
|
In fairness, though I shouldn't be surprised, it's a little trickier to get the lazy loading just right than I thought. The trickiest aspect is that there's code out there in the wild that does things (I think incorrectly, but it used to work) like:

Just for example. That still needs to work, which means

Need to think more about what to do about that. It could still be optimized for the specific case of getdata or getheader.
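The compatibility problem can be seen in a toy lazy sequence (invented names, no astropy involved): operations like `len()` or negative indexing cannot be answered without scanning to the end of the file, so a lazy list must still fall back to a full read there, keeping old code working at the cost of the speedup.

```python
# Toy lazy sequence: positive indexing reads only as far as needed,
# while len() and negative indexing force a full scan of the source.
class LazySeq:
    def __init__(self, source):
        self._it = iter(source)
        self._items = []
        self._done = False

    def _fill(self, n=None):
        # read until n items are cached, or everything if n is None
        while not self._done and (n is None or len(self._items) < n):
            try:
                self._items.append(next(self._it))
            except StopIteration:
                self._done = True

    def __getitem__(self, idx):
        self._fill(None if idx < 0 else idx + 1)
        return self._items[idx]

    def __len__(self):
        self._fill()  # forced full scan
        return len(self._items)

reads = []
def source():
    for name in ("PRIMARY", "SCI", "ERR", "DQ"):
        reads.append(name)  # record what was actually read
        yield name

s = LazySeq(source())
print(s[1], len(reads))   # SCI 2 -- still lazy
print(s[-1], len(reads))  # DQ 4 -- negative index read everything
```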
|
Laziness is always more complicated than you first think, especially for old software where every imaginable quirk is used as a feature in some program. I also don't think it is that useful: it helps when you use open and just look at a subset, but when that is too slow, it is quite obvious why. And it is not obvious at all that getheader will read all headers even when you only want the first. In a bad case (and one existing in the real world! well, at least my world) the behavior is pathologically bad. That is what this PR fixes.
|
In relation to the #4634 (comment) example, I think that @juliantaylor's proposal does not affect that behavior at all. @juliantaylor's fix is for "convenience" functions such as "get primary header". The old code should not be affected by @juliantaylor's changes. For "set" functions the situation may be different. Also, it seems that @juliantaylor's fix should work fine with
|
My implementation is almost ready and is backwards compatible. Just have a couple of Python 3 bugs to iron out.
Yes, but
|
There is a very good reason: open gives you a list of all HDUs, while getheader only returns a single header. They can have the same performance, but there is no reason one would expect them to, given their current behavior (which includes their repr()). It might not be a problem here because of the GIL and the append nature of FITS, but lazy stuff also complicates parallel read access, as suddenly your read-only paths start writing.
|
That's not a bad point about the

I don't think

There's lots of code out there that does not use the "convenience" functions, and in fact I regularly discourage their use outside of limited cases (such as when you really do only need to read one header and not update anything).

@juliantaylor I've heard of cases like yours before with files containing a large number of HDUs, and the bottleneck that causes. But even outside that case, with files that have a "reasonable" number of HDUs like 5, I've heard of cases where thousands of FITS files are being looped over, and the major bottleneck is in the

I don't think the performance should depend on which high-level interface the user is going through. It's very non-obvious.

Not sure what you're talking about regarding parallel reads. It can be designed not to read past the original end of the file even if the file has been appended to while reading.
|
Perhaps as a compromise the lazy loading could be optional. I would tend to leave it on by default, since it would benefit the majority of users. But certainly allow all HDUs to be loaded upon
|
I'm not against lazy loading; I just think the benefit is not worth the effort and the potential unforeseen issues involved. With files in the thousands of extensions, I have almost the worst case for pyfits there can be, and I don't need lazy loading. I either just read a few sections (getheader/getdata) or I read all of it.
|
Perhaps one day I'll get around to writing that proposal for a FITS standard extension for actually listing the extension HDUs in the primary header. Then this wouldn't even be an issue (for files that use that convention, at least :/)
|
It's been a while; maybe it's time to merge the perfectly adequate solution instead of waiting for the perfect solution that may never come.
|
Oh, or maybe I should read the references before I post, there is actually some code now...
|
The lazy HDU PR solves my problem too, so I'll close this one.
In the convenience functions only read as many extensions as necessary
to satisfy the user's request. This can greatly increase performance when
one is only interested in an early subset of extensions.

It adds the `ext` keyword to the fitsopen function so it will also be
accepted by the top-level fits.open function, but this is an
implementation detail and not documented.

Closes gh-4589