
io.fits: allow limiting the number of read extensions #4634

Closed

juliantaylor wants to merge 1 commit into astropy:master from juliantaylor:fits-load-as-needed

Conversation

@juliantaylor

In the convenience functions, only read as many extensions as necessary to satisfy the user's request. This can greatly increase performance when one is only interested in an early subset of extensions.
It adds the `ext` keyword to the fitsopen function so it will also be accepted by the top-level fits.open function, but this is an implementation detail and not documented.

Closes gh-4589
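The core idea of the change can be sketched in plain Python (this is a hypothetical illustration, not astropy's actual code): stop parsing a multi-extension file once the requested extension has been read, instead of reading every HDU and discarding the rest.

```python
from itertools import islice

# Hypothetical sketch of the PR's idea. `hdu_reader` stands in for an
# iterator that parses HDUs from a FITS file in order; the name and
# signature are illustrative, not astropy's API.
def read_up_to(hdu_reader, ext):
    """Consume HDUs from `hdu_reader` only up to index `ext`."""
    # islice stops the underlying reader after ext + 1 items, so
    # extensions past the requested one are never parsed.
    hdus = list(islice(hdu_reader, ext + 1))
    if len(hdus) <= ext:
        raise IndexError("extension %d not found" % ext)
    return hdus
```

For a file with thousands of extensions where only the first header is wanted, this avoids parsing everything past `ext`, which is where the reported speedup comes from.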
@juliantaylor
Author

Should the new private ext keyword hidden in kwargs be named _ext?

@juliantaylor
Author

ping

data = hdu.data
if data is None and extidx == 0:
    try:
        hdulist, extidx = _getext(filename, mode, ext=1, *args, **kwargs)
Member

I am against hardcoding ext=1 here. There are all sorts of data formats out there where the data is not always in Extension 1. This is bound to break someone else's code.

Author

Extension 1 is already hardcoded; before, it just loaded everything and threw away all but the first extension, see the line below.

Author

It is slightly less efficient if you only have one extension, though, as you load the first one twice. I didn't consider changing that worthwhile, but it could be done.

@pllim
Member

pllim commented Apr 14, 2016

c/c @embray (in case you are interested)

saved_compression_enabled = compressed.COMPRESSION_ENABLED

# (extname, extver) tuple
ext = kwargs.pop("ext", None)
Contributor

I would suggest renaming "ext" to something that better conveys the meaning of this parameter: stop_reading_data_at_ext, stop_ext, last_read_ext, etc.

Member

I'd actually like a different approach altogether, but I'll update more about that in a separate comment.

@pllim
Member

pllim commented Apr 15, 2016

All the tests seem to pass, so I guess at least this didn't break anything that we are testing for. I suspect you can get rid of the Sphinx warning by rebasing to the latest master.

Since your motivation in this PR is performance, do you have any profiling data?

return self._file.name
return None

@classmethod
Member

This doesn't use self (and for classmethods the "self" is typically spelled "cls"). But in any case this can be a staticmethod instead. I'd also prefer a different name...maybe something like "_ext_match".

@embray
Member

embray commented Apr 18, 2016

This has been on my todo list for a long time, so thanks for tackling it.

Rather than just supporting this in the getdata "convenience" function, this sort of enhancement should benefit all modes of accessing a FITS file.

A better approach (which you've already made some progress on by modifying HDUList._readfrom) would be to make HDUList completely lazy when it comes to accessing HDUs.

In other words, opening a FITS file should not read any HDUs at all (though the Primary at least will almost always be read). For example if I do:

with fits.open(filename) as hdul:
    print(hdul[2].header)

The hdul[2] subscripting will first check to see how many HDUs are already loaded. If it's less than 3 it knows it needs to seek to the end of the last known HDU and then try to read the next one. If no further HDUs exist then it would raise an IndexError like usual.

If an HDU is given by name like hdul['DQ'] then it would have to load HDUs until it finds one named 'DQ' and then stop.

Some operations, like len(hdulist) or hdulist.info() necessarily require loading all HDUs and that's fine.

But doing this generally at the HDU level will also get this functionality for free in all the "convenience" functions, but also all other use cases.
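The lazy-access scheme described above can be sketched with a minimal stand-in class (names, the reader interface, and the dict-based "HDUs" are all illustrative, not astropy's implementation):

```python
# Minimal sketch of a lazily-loaded HDU list. `reader` is assumed to be
# an iterator yielding HDU-like objects (here just dicts with a "name"
# key) in file order -- a hypothetical stand-in for parsing the file.
class LazyHDUList:
    def __init__(self, reader):
        self._reader = reader
        self._hdus = []          # HDUs read so far
        self._exhausted = False  # True once the reader hits end-of-file

    def _read_until(self, index):
        # Read forward until `index` is available or the file ends.
        while not self._exhausted and len(self._hdus) <= index:
            try:
                self._hdus.append(next(self._reader))
            except StopIteration:
                self._exhausted = True

    def __getitem__(self, index):
        if isinstance(index, str):
            # Lookup by name: load HDUs one at a time until a match.
            i = 0
            while True:
                self._read_until(i)
                if i >= len(self._hdus):
                    raise KeyError(index)
                if self._hdus[i].get("name") == index:
                    return self._hdus[i]
                i += 1
        self._read_until(index)
        return self._hdus[index]  # IndexError past the end, as usual

    def __len__(self):
        # len() necessarily loads all remaining HDUs.
        self._read_until(float("inf"))
        return len(self._hdus)
```

With this shape, `hdul[2]` or `hdul['DQ']` reads only as far into the file as needed, while `len(hdul)` deliberately reads to the end.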

@juliantaylor
Author

A fully lazy HDU list might be interesting but is more complicated to do. I probably won't have time to attempt that anytime soon.

I heard today is the feature freeze for the next astropy version; can this be considered for merge if the remaining minor comments are fixed?
It is a purely internal change and does not prevent future, even better improvements to the other APIs.
This performance issue makes astropy unusable for some header-only reading tasks, which is a shame. E.g. the files I work with can have several thousand extensions, but often I am only interested in the first header. This change speeds that task up by one or two orders of magnitude.

@juliantaylor
Author

Oh, never mind, the feature freeze was last week... well, I'll stick to my branch for now.

@embray
Member

embray commented May 20, 2016

It wouldn't be significantly more complicated than what you've already done, and would work more transparently. It would mostly be a matter of taking the work you've already done, and go through the rest of the HDUList methods (including __len__ and __getitem__) and apply the logic to only read up to a given HDU (where in the case of __len__ and info this would be -1 indicating to read to the last HDU).

# 'PRIMARY' should always work as a reference to the first HDU
if ((name == _key or (_key == 'PRIMARY' and idx == 0)) and
(_ver is None or _ver == hdu.ver)):
if self._ext_equal((_key, _ver), hdu, idx):
Member

I'm not sure this is the best way to refactor this either, since now _ext_equal ends up duplicating a lot of work already done in index_of. Need to rethink this a bit...

Author

Yes, I tried to avoid as much of it as I could, but it's either two functions that are 90% the same or some duplicate work.
The extra work is negligible in terms of CPU time, so I opted for less code.

@embray
Member

embray commented May 20, 2016

I'll see if I can come up with an alternative.

@embray
Member

embray commented May 23, 2016

In fairness, though I shouldn't be surprised, it's a little trickier to get the lazy loading just right than I thought. The trickiest aspect is that there's code out there in the wild that does things (I think incorrectly, but it used to work) like:

hdul = fits.open(filename)
hdul.close()
hdul[1].header

Just for example. That still needs to work, which means HDUList.close() still needs to check for additional headers in the file before closing, which is annoying. Maybe it should be changed so that the above example does not work.

Need to think more about what to do about that. It could still be optimized for the specific cases of "getdata" or "getheader".
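The backwards-compatibility wrinkle described above can be illustrated with a small sketch (names and the reader interface are hypothetical): if users index the list after close(), then close() must first finish reading everything that is still unread.

```python
# Hypothetical sketch of why close() complicates lazy loading: code in
# the wild indexes the list *after* closing, so close() has to finish
# any pending lazy reads while the "file" is still available.
class LazyListWithClose:
    def __init__(self, reader):
        self._reader = reader  # iterator over HDU-like items
        self._items = []

    def close(self):
        # Finish lazy reads before the file handle goes away, so that
        # post-close indexing keeps working as it did pre-laziness.
        self._items.extend(self._reader)
        self._reader = iter(())  # file now considered closed

    def __getitem__(self, i):
        # Lazily read forward on demand.
        while len(self._items) <= i:
            try:
                self._items.append(next(self._reader))
            except StopIteration:
                break
        return self._items[i]
```

The annoyance is exactly the eager read inside close(): supporting the legacy access pattern forces the close path to do the work laziness was meant to avoid.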

@juliantaylor
Author

juliantaylor commented May 23, 2016

Laziness is always more complicated than you first think, especially for old software where every imaginable quirk is used as a feature in some program. I also don't think it is that useful; it helps when you use open and just look at a subset, but when that is too slow it is quite obvious why.
And a simple operation like len() will already break the laziness.

And it is not at all obvious that getheader will read all headers even when you only want the first. And in a bad case (which exists in the real world, or at least in my world) the behavior is pathologically bad. This is what this PR fixes.

@mcara
Contributor

mcara commented May 23, 2016

In relation to the #4634 (comment) example, I think that @juliantaylor's proposal does not affect that behavior at all. @juliantaylor's fix is for "convenience" functions such as "get primary header". The old code should not be affected by @juliantaylor's changes. For "set" functions the situation may be different. Also, it seems that @juliantaylor's fix should work fine with HDUList.close() as long as the file was opened in read-only mode.

@embray
Member

embray commented May 24, 2016

My implementation is almost ready and is backwards compatible. Just have a couple Python 3 bugs to iron out.

@embray
Member

embray commented May 24, 2016

And it is not obvious at all that getheader will read all headers even when you only want the first.

Yes, but getheader(filename, ext) is just a special case of hdul = fits.open(filename); hdul[ext].header. There's no reason the former should be more efficient than the latter.

@juliantaylor
Author

There is a very good reason: open gives you a list of all HDUs, while getheader only returns a single header. They can have the same performance, but there is no reason one would expect them to, given their current behaviour (which includes their repr()).

It might not be a problem here because of the GIL and the append-only nature of FITS, but lazy loading also complicates parallel read access, as suddenly your read-only paths start writing.

@embray
Member

embray commented May 25, 2016

That's not a bad point about the repr, though I've been meaning to overhaul that too. The current __repr__ for HDUList (the default) is pretty terrible and useless (in part because the same is true for the HDUs themselves). I'm glad you brought it up though--for my lazy-read patch I should override __repr__ so it does not result in all HDUs being loaded.

I don't think open should give a list of all HDUs. I think ideally open should do next to nothing until you try to actually read something from a file (currently I have it so that at least the first HDU is read to confirm that it's a valid FITS file).

There's lots of code out there that does not use the "convenience" functions, and in fact I regularly discourage their use outside of limited cases (such as when you really do only need to read one header and not update anything).

@juliantaylor I've heard of cases like yours before with files containing a large number of HDUs, and the bottleneck that causes. But even outside that case--files with a "reasonable" number of HDUs like 5, I've heard of cases where thousands of FITS files are being looped over, and the major bottleneck is in the open() operation, even if only, say, the primary HDUs are being looked at (maybe one header keyword is being updated).

I don't think the performance should depend on which high-level interface the user is going through. It's very non-obvious.

Not sure what you're talking about regarding parallel reads. It can be designed not to read past the original end of the file even if the file has been appended to while reading.

@embray
Member

embray commented May 26, 2016

Perhaps as a compromise the lazy-loading could be optional. I would tend to leave it on by default, since it would benefit the majority of users. But certainly allow all HDUs to be loaded upon open (this could even be extended to loading all the data into memory too, overriding the mmap settings).

@juliantaylor
Author

I'm not against lazy loading; I just think the benefit is not worth the effort and the potential unforeseen issues involved. With files that have thousands of extensions, I have almost the worst case for pyfits there can be, and I don't need lazy loading: I either just read a few sections (getheader/getdata) or I read all of it.
But if you want to implement it and it works, that's great. I'll definitely make use of it.

@embray
Member

embray commented May 27, 2016

Perhaps one day I'll get around to writing that proposal for a FITS standard extension for actually listing the extension HDUs in the primary header. Then this wouldn't even be an issue (for files that use that convention at least :/)

@juliantaylor
Author

It's been a while; maybe it's time to merge the perfectly adequate solution instead of waiting for the perfect solution that may never come.

@juliantaylor
Author

Oh, or maybe I should read the references before I post; there is actually some code now...
(GitHub doesn't send mails for references; a ping would have been nice.)

@juliantaylor
Author

juliantaylor commented Sep 15, 2016

the lazy hdu PR solves my problem too, so I'll close this one.
