Add LZW file support to io.fits, io.ascii and utils.data #17960

saimn merged 4 commits into astropy:main
|
Thank you for your contribution to Astropy! 🌌 This checklist is meant to remind the package maintainers who will review this pull request of some common things to look for.
|
|
@kYwzor - before anything else, really nice to have written that new package! Your code changes here look very reasonable, although I'll let @saimn look in more detail. Obviously, we'd want some tests - maybe we can use some of the ESO files you mentioned in the issue? (A smaller one, ideally!) As for suggestions for how to deal with backward seeks: since LZW is stream-based, would it be an option to keep some kind of checkpoints so you don't have to start from scratch? Maybe at a given spacing? E.g., for FITS files, with their native 2880-byte blocks, I could imagine a dict indexed by uncompressed block giving the corresponding compressed file location and decompressor state. (Obviously, without having looked at your code, I have no idea whether this makes sense!) Separately, the fact that we apparently skip backward relatively frequently suggests that there are some improvements to make in astropy! |
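The dict-of-checkpoints idea above could be sketched roughly as follows. This is purely illustrative (all names are hypothetical, and the 2880-byte spacing is just one choice): every `BLOCK` uncompressed bytes, remember the compressed-file position and a deep copy of the decompressor's state, keyed by uncompressed offset, so a later backward seek can resume from the nearest checkpoint instead of from the start of the file.

```python
# Hypothetical sketch of the checkpointing suggestion; not uncompresspy's API.
import copy
import io

BLOCK = 2880  # FITS block size; using it as the checkpoint spacing is an assumption


class CheckpointingReader:
    def __init__(self, compressed_file, decompressor):
        self.compressed_file = compressed_file
        self.decompressor = decompressor  # hypothetical stateful decoder object
        self.checkpoints = {}             # uncompressed offset -> (compressed offset, state)
        self.pos = 0                      # current uncompressed offset

    def record_checkpoint(self):
        # Snapshot the compressed position and decoder state at block boundaries.
        if self.pos % BLOCK == 0:
            self.checkpoints[self.pos] = (
                self.compressed_file.tell(),
                copy.deepcopy(self.decompressor),
            )


reader = CheckpointingReader(io.BytesIO(b"\x1f\x9d"), decompressor=object())
reader.record_checkpoint()
print(sorted(reader.checkpoints))  # → [0]
```

A real implementation would restore the saved state and re-decode forward from the checkpoint nearest the seek target.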
|
p.s. Where does the source code live? I didn't see it on your github page. |
|
@mhvk thanks for the comments. Oops, I forgot to make the repository public, nice catch! It should be public now here.

Regarding the test file, yes, ESO files are probably fine if they are accessible from the public ESO archive. I did some digging and the file provided by the original poster on #10714 is available on the ESO archive here, or with a direct download link here. The file is about 4 MB, which I think is reasonable (in my testing, uncompresspy needed less than a second to fully decompress it). Let me know if this is what you had in mind.

Agreed on it being puzzling that io.fits backtracks so much; I did not investigate what causes it exactly, I just assumed there was a legitimate reason. Maybe someone can chime in?

As for the checkpoints, that actually sounds like a pretty promising idea. I don't want to "marry" uncompresspy to FITS files specifically, so tying it to 2880-byte blocks doesn't sound great; however, all LZW files I've found in the wild use so-called "block mode", which means the decompression dictionary gets reset from time to time, whenever a special CLEAR code is encountered. A reset of the dictionary means that, at that point in the code stream, you don't need any knowledge of the prior code stream (other than knowing exactly where the new block starts). Therefore, these would be very "natural" checkpoints to jump to. For the typical file with a maximum code size of 16 bits, CLEAR codes are spaced a minimum of 122656 bytes apart (i.e. roughly 120 KB, though they typically go a bit beyond that; it varies depending on how good the compression ratio is at that point in the file); this seems good enough to me. This would definitely allow us to go backwards without requiring a full restart of the decompression or forcing us to keep the full decompression buffer in memory. I will probably not have the time to attempt an implementation of the checkpoint idea this week, but I could give it a shot in the coming weeks if you think that's worth chasing.
I still don't love this approach, because we're still decoding previously decoded parts of the file (which is obviously sub-optimal) and thus it could result in significantly worse performance than just keeping the buffer in memory... but I also don't love keeping the buffer in memory, so maybe we really have no alternative and this is the best of two bad options. I guess the only way out is if we could avoid rewinding the file to begin with, but that's likely not possible. |
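The 122656-byte figure quoted above can be reproduced with a little arithmetic. Assuming (as that number implies) that codes grow from 9 to 16 bits wide and that 2**(w-1) codes are emitted at each width w before it grows, the minimum compressed distance between CLEAR codes is:

```python
# Reproduce the minimum spacing between CLEAR codes for a 16-bit max code size:
# 2**(w-1) codes of width w bits, for each width w from 9 to 16 inclusive.
bits = sum(2 ** (w - 1) * w for w in range(9, 17))
print(bits // 8)  # → 122656 bytes, i.e. roughly 120 KB
```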
|
Had another look at the code base. I'm not familiar with this part of the code, but I suspect we might also need to check for LZW files somewhere around here: Lines 387 to 448 in 2382a99. I'll try to address this in a later commit. I also searched for the tests being done for bz2, and it seems they are really spread across a few test files; I'm not entirely sure what the logic is, but I'll leave what I found below: [a whole bunch of code snippets]
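For context on what such a check would look for: LZW-compressed `.Z` files begin with the magic bytes `0x1F 0x9D`, analogous to the gzip (`0x1F 0x8B`) and bzip2 (`BZh`) signatures. A minimal sniffing helper (the function name is hypothetical, not astropy's actual code) might look like:

```python
# Hypothetical sketch of signature-based detection; not the actual astropy code.
LZW_MAGIC = b"\x1f\x9d"    # compress(1) .Z files
GZIP_MAGIC = b"\x1f\x8b"   # gzip, for comparison
BZIP2_MAGIC = b"BZh"       # bzip2, for comparison


def looks_like_lzw(first_bytes: bytes) -> bool:
    """Return True if the buffer starts with the .Z (LZW) magic number."""
    return first_bytes.startswith(LZW_MAGIC)


print(looks_like_lzw(b"\x1f\x9d\x90"))  # → True
print(looks_like_lzw(b"\x1f\x8b\x08"))  # → False
```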
|
|
Added support inside get_readable_fileobj; does anyone know an example use case to test it? |
By default we load HDUs lazily and use memmap, so it's not possible to predict if and when data will be loaded.
4 MB is too big to store in the astropy repo, so either we find a smaller file (<50 KB) that we can store here, or another option would be a remote data test (we would probably need to host it on astropy-data instead of using the ESO link directly).
We already have a lot of parameters, and this one seems a bit too specific. It would also probably not apply to other compression methods, where we might not control this behavior.
Yes I think so. |
You can probably extend some of the tests from #3667 ? |
|
Regarding the test file, it seems ESO's programmatic access allows you to filter by file size, so I searched for files smaller than 20 KB. There are obviously tons, so I just picked the oldest observation (why not!), which is a 13.4 KB file (33.7 KB uncompressed). This is the details link and direct download link. My thinking is we could put both the compressed and uncompressed files in the repository, so that we could then test the decompression and compare with the uncompressed file. Let me know if this is ok for you. |
|
@kYwzor - I'd store only the compressed version in the repository, and just test a few header keywords and a few data points. For better or worse, all those test data get installed with the package for everyone who downloads it, so keeping things small is important! |
|
I changed the title of this PR to better reflect the fact that this change also impacts io.ascii and utils.data (and added towncrier files accordingly). I believe I've also added all the relevant tests. I've also changed the documentation to mention the LZW capabilities where relevant. I believe this commit is now in good shape to be reviewed. In parallel, I've been updating uncompresspy to improve some interfaces. Before merging this, I'd still like to remove the kludgy "keep_buffer" option and instead implement a solution for seeking backwards based on checkpoints, as previously discussed here. |
|
I've updated uncompresspy and now it is able to keep checkpoints in the file, which enable much better seeking performance. Updated the dependencies and removed the "keep_buffer" option accordingly. I believe this commit is ready for final review. |
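The part of checkpoint-based seeking that makes backward seeks cheap is just the lookup of the nearest recorded checkpoint at or before the target offset. The snippet below is illustrative only (the offsets and function name are assumptions, not uncompresspy's internals):

```python
# Illustrative lookup of the nearest checkpoint at or before a seek target.
import bisect


def nearest_checkpoint(checkpoint_offsets, target):
    """Return the largest recorded offset <= target (offsets must be sorted)."""
    i = bisect.bisect_right(checkpoint_offsets, target) - 1
    return checkpoint_offsets[max(i, 0)]


# Example offsets, e.g. positions of CLEAR-code dictionary resets (made up here).
offsets = [0, 122656, 245312, 367968]
print(nearest_checkpoint(offsets, 300000))  # → 245312
```

Decompression then resumes from that checkpoint and discards output until the target is reached, instead of restarting from the beginning of the file.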
Force-pushed from e07d4f8 to f31b06c.
|
Had to make changes to solve merge conflicts resulting from #17968. Commit history was getting really messy so I took the chance to clean it up and squash the commits. |
mhvk left a comment:
Nice! I have only the most nitpicky comments.
|
@mhvk thanks for the comments; I agree with them and they should now be addressed in my latest commit. Regarding the tests in test_compressed.py, indeed I just copied what the others were doing. I edited my test as you proposed, but kept the others as-is (I think it would be out of scope to touch them). Regarding lzma and bz2, my understanding is that they are indeed in the standard library, but minimal Python installations may not include them (in other words, they are technically an optional part of the Python installation). I think in practice this never happens on Windows, but if you compile Python from source on Linux you may be missing some dependencies, and thus support for lzma/bz2 is disabled. |
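Because lzma and bz2 can be absent from minimal builds (e.g. CPython compiled without liblzma or libbz2 headers available), the usual way code accounts for this is a guarded import; a minimal sketch:

```python
# lzma is in the standard library but optional at build time, so guard the import.
try:
    import lzma
    HAS_LZMA = True
except ImportError:
    HAS_LZMA = False

print(HAS_LZMA)  # True on a full Python build
```

Tests can then be skipped when the flag is False rather than erroring out.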
|
Had to do a rebase to fix merge conflicts caused by #17984; should be fine now. |
Description
This pull request addresses the lack of LZW support in Astropy. Fixes #10714.
For performance discussion please see here. For disclosure, I am the sole developer of uncompresspy, which I am using in this PR.
There are a few open questions for this PR: