Add LZW file support to io.fits, io.ascii and utils.data #17960

Merged
saimn merged 4 commits into astropy:main from kYwzor:lzwfiles on Apr 10, 2025

Conversation

@kYwzor
Member

@kYwzor kYwzor commented Mar 31, 2025

Description

This pull request addresses the lack of LZW support in Astropy. Fixes #10714.

For performance discussion please see here. For disclosure, I am the sole developer of uncompresspy, which I am using in this PR.

There are a few open questions for this PR:

  • This implies a new optional dependency for Astropy. I believe I've handled this correctly, but there may be some tricky stuff I missed.
  • I consider uncompresspy to still be in beta, so I'm fully open to suggestions for changing its interfaces if you find the existing ones unsatisfying in some way. Ultimately, I built uncompresspy mostly for the Astropy use-case, so I'm willing to adapt whatever is needed to best fit it.
  • I have not added any tests, though this new feature would surely need some tests. Please advise on how to best handle this.
  • I also suspect documentation needs to be updated somewhere to address LZW support.
  • The implementation in uncompresspy allows reading the uncompressed stream byte by byte, without keeping the decompressed buffer in memory. However, in my testing I've noticed that Astropy very frequently seeks the compressed file backwards, even when we're accessing entirely sequential data (e.g. reading the header and then the data of the first HDU, then the header and data of the second HDU, and so forth, results in Astropy frequently going backwards in the file). This is a huge performance killer, since LZW decompression is purely sequential, so going backwards implies restarting from the very beginning of the file. To reduce this effect, I created a "keep_buffer" option for uncompresspy, which makes the LZWFile object keep an internal buffer of all previously decompressed data and leverage it to speed up backwards seeks (i.e. avoid repeated decoding). This is frankly unsatisfying, as it results in significantly larger memory consumption for large files, but I'm not sure there's a better alternative. Very much open to suggestions on this.
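To make the tradeoff concrete, here is a generic sketch of what a keep-buffer strategy amounts to. This is a stand-in illustration, not uncompresspy's actual implementation: `ReplayBuffer` is invented for the example, and a `BytesIO` stands in for the sequential decompressor.

```python
import io

class ReplayBuffer:
    """Wrap a forward-only stream and keep every byte ever produced,
    so backwards seeks become memory reads instead of re-decoding."""

    def __init__(self, raw):
        self._raw = raw            # forward-only source, e.g. a decompressor
        self._cache = bytearray()  # everything decoded so far
        self._pos = 0

    def read(self, n):
        end = self._pos + n
        if end > len(self._cache):  # extend the cache from the source
            self._cache += self._raw.read(end - len(self._cache))
        data = bytes(self._cache[self._pos:end])
        self._pos += len(data)
        return data

    def seek(self, pos):
        if pos > len(self._cache):  # seeking forward past the cache
            self._cache += self._raw.read(pos - len(self._cache))
        self._pos = pos

# a BytesIO stands in for the sequential LZW decompressor
buf = ReplayBuffer(io.BytesIO(b"abcdefghij"))
print(buf.read(5))  # → b'abcde'
buf.seek(1)         # backwards seek, served entirely from the cache
print(buf.read(3))  # → b'bcd'
```

The upside is that backwards seeks are free; the downside, as noted above, is that the cache grows to the full uncompressed size of whatever has been read.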

@github-actions
Contributor

Thank you for your contribution to Astropy! 🌌 This checklist is meant to remind the package maintainers who will review this pull request of some common things to look for.

  • Do the proposed changes actually accomplish desired goals?
  • Do the proposed changes follow the Astropy coding guidelines?
  • Are tests added/updated as required? If so, do they follow the Astropy testing guidelines?
  • Are docs added/updated as required? If so, do they follow the Astropy documentation guidelines?
  • Is rebase and/or squash necessary? If so, please provide the author with appropriate instructions. Also see instructions for rebase and squash.
  • Did the CI pass? If no, are the failures related? If you need to run daily and weekly cron jobs as part of the PR, please apply the "Extra CI" label. Codestyle issues can be fixed by the bot.
  • Is a change log needed? If yes, did the change log check pass? If no, add the "no-changelog-entry-needed" label. If this is a manual backport, use the "skip-changelog-checks" label unless special changelog handling is necessary.
  • Is this a big PR that makes a "What's new?" entry worthwhile and if so, is (1) a "what's new" entry included in this PR and (2) the "whatsnew-needed" label applied?
  • At the time of adding the milestone, if the milestone set requires a backport to release branch(es), apply the appropriate "backport-X.Y.x" label(s) before merge.

@pllim pllim added this to the v7.1.0 milestone Mar 31, 2025
kYwzor added a commit to kYwzor/astropy that referenced this pull request Mar 31, 2025
@mhvk
Contributor

mhvk commented Mar 31, 2025

@kYwzor - before anything else, really nice to have written that new package! Your code changes here look very reasonable, although I'll let @saimn look in more detail. Obviously, we'd want some tests - maybe we can use some of the ESO files you mentioned in the issue? (A smaller one, ideally!).

As for suggestions for how to deal with backwards seeks: since LZW is stream-based, would it be an option to keep some kind of checkpoints so you don't have to start from scratch? Maybe at a given spacing? E.g., I could imagine for FITS files, with their native 2880-byte blocks, having a dict indexed by uncompressed block that gives the corresponding compressed file location and decompressor state. (Obviously, without having looked at your code, I have no idea whether this makes sense!)

Separately, the fact that we apparently skip backward relatively frequently suggests that there are some improvements to make in astropy!

@mhvk
Contributor

mhvk commented Mar 31, 2025

p.s. Where does the source code live? I didn't see it on your github page.

@kYwzor
Member Author

kYwzor commented Apr 1, 2025

@mhvk thanks for the comments. Oops, I forgot to make the repository public, nice catch! It should be public now here.

Regarding the test file, yes, ESO files are probably fine if they are accessible from the public ESO archive. I did some digging, and the file provided by the original poster on #10714 is available on the ESO archive here, or with a direct download link here. The file is about 4 MB, which I think is reasonable (in my testing, uncompresspy needed less than a second to fully decompress it). Let me know if this is what you had in mind.

Agreed on it being puzzling that io.fits backtracks so much; I did not investigate what causes it exactly, I just assumed there was a legitimate reason. Maybe someone can chime in?

As for the checkpoints, that actually sounds like a pretty promising idea. I don't want to "marry" uncompresspy to FITS files specifically, so tying it to 2880-byte blocks doesn't sound great; however, all LZW files I've found in the wild use the so-called "block mode", which means the decompression dictionary gets reset from time to time when a special "CLEAR" code is encountered. A reset of the dictionary means that at that point in the code stream you don't need any knowledge of the prior code stream (other than knowing exactly where the new block starts). These would therefore be very "natural" checkpoints to jump to. For the typical file with a max code size of 16 bits, CLEAR codes are spaced a minimum of 122656 bytes apart (i.e. roughly 120 KB, though they typically go a bit beyond that; it varies depending on how good the compression ratio is at that point in the file), which seems good enough to me. This would definitely allow us to go backwards without requiring a full restart of the decompression or forcing us to keep the full decompression buffer in memory.
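As a generic illustration of the checkpoint idea (the class, names, and offsets below are hypothetical, not uncompresspy's actual API): keep a sorted table mapping uncompressed CLEAR-code positions to compressed offsets, and on a backwards seek resume decoding from the nearest checkpoint at or before the target instead of from the start of the file.

```python
import bisect

class CheckpointIndex:
    """Map uncompressed offsets to compressed-stream restart points.

    Each checkpoint marks a CLEAR code, where the LZW dictionary resets
    and decoding can resume with no knowledge of earlier codes.
    """

    def __init__(self):
        self._uncomp = []  # uncompressed offsets, kept sorted
        self._comp = []    # matching compressed offsets

    def add(self, uncompressed_offset, compressed_offset):
        # checkpoints are discovered in stream order, so append keeps sorting
        self._uncomp.append(uncompressed_offset)
        self._comp.append(compressed_offset)

    def nearest(self, target):
        """Return (uncompressed, compressed) for the closest checkpoint at
        or before ``target``, or (0, 0) to restart from the beginning."""
        i = bisect.bisect_right(self._uncomp, target)
        if i == 0:
            return (0, 0)
        return (self._uncomp[i - 1], self._comp[i - 1])

idx = CheckpointIndex()
idx.add(122656, 61000)   # hypothetical CLEAR-code positions
idx.add(251300, 127500)
print(idx.nearest(200000))  # → (122656, 61000): re-decode ~77 KB, not ~200 KB
```

The memory cost is two integers per ~120 KB of uncompressed data, which is negligible compared to keeping the full buffer.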

I will probably not have the time to attempt an implementation of the checkpoint idea this week, but I could give it a shot in the coming weeks if you think it's worth chasing. I still don't love this approach, because we would still be decoding previously decoded parts of the file (which is obviously sub-optimal), so it could result in significantly worse performance than just keeping the buffer in memory... but I also don't love keeping the buffer in memory, so maybe we really have no alternative and this is the better of two bad options. The only way out would be avoiding rewinding the file to begin with, but that's likely not possible.

Side note, the decompress_in_memory flag on io.fits makes it so the entire file is decompressed at once into memory. I wonder if it would be reasonable to have a different flag in which we only decompress what's requested, but we keep in memory whatever has already been decompressed (i.e. lazy decompressing but storing it in memory).

@kYwzor kYwzor changed the title Add LZW file support Add LZW file support to io.fits Apr 1, 2025
@kYwzor
Member Author

kYwzor commented Apr 1, 2025

Had another look at the code base. I'm not familiar with this part of the code, but I suspect we might also need to check for LZW files somewhere around here:

signature = fileobj.read(4)
fileobj.seek(0)

if signature[:3] == b"\x1f\x8b\x08":  # gzip
    import struct

    try:
        import gzip

        fileobj_new = gzip.GzipFile(fileobj=fileobj, mode="rb")
        fileobj_new.read(1)  # need to check that the file is really gzip
    except (OSError, EOFError, struct.error):  # invalid gzip file
        fileobj.seek(0)
        fileobj_new.close()
    else:
        fileobj_new.seek(0)
        fileobj = fileobj_new
elif signature[:3] == b"BZh":  # bzip2
    if not HAS_BZ2:
        for fd in close_fds:
            fd.close()
        raise ModuleNotFoundError(
            "This Python installation does not provide the bz2 module."
        )
    import bz2

    try:
        # bz2.BZ2File does not support file objects, only filenames, so we
        # need to write the data to a temporary file
        with NamedTemporaryFile("wb", delete=False) as tmp:
            tmp.write(fileobj.read())
            tmp.close()
            fileobj_new = bz2.BZ2File(tmp.name, mode="rb")
            fileobj_new.read(1)  # need to check that the file is really bzip2
    except OSError:  # invalid bzip2 file
        fileobj.seek(0)
        fileobj_new.close()
        # raise
    else:
        fileobj_new.seek(0)
        close_fds.append(fileobj_new)
        fileobj = fileobj_new
elif signature[:3] == b"\xfd7z":  # xz
    if not HAS_LZMA:
        for fd in close_fds:
            fd.close()
        raise ModuleNotFoundError(
            "This Python installation does not provide the lzma module."
        )
    import lzma

    try:
        fileobj_new = lzma.LZMAFile(fileobj, mode="rb")
        fileobj_new.read(1)  # need to check that the file is really xz
    except (OSError, EOFError):  # invalid xz file
        fileobj.seek(0)
        fileobj_new.close()
        # should we propagate this to the caller to signal bad content?
        # raise ValueError(e)
    else:
        fileobj_new.seek(0)
        fileobj = fileobj_new

I'll try to address this in a later commit.

I also searched for the tests being done for bz2, and it seems they are spread across a few test files. I'm not entirely sure what the logic is here, but I'll leave what I found below:

  • Straightforward decompression test for io.fits. I could replicate this for LZW, but there's an added challenge: we don't really have a way to mimic _make_bzip2_file, as we can't easily create an LZW file on demand. I guess we would just have to store the file somewhere. Similarly, writing to LZW will just have to remain unsupported (frankly, why would you want this 😅).

    @pytest.mark.skipif(not HAS_BZ2, reason="Python built without bz2 module")
    def test_open_bzipped(self):
        bzip_file = self._make_bzip2_file()
        with fits.open(bzip_file) as fits_handle:
            assert fits_handle._file.compression == "bzip2"
            assert len(fits_handle) == 5
        with fits.open(bzip_file, decompress_in_memory=True) as fits_handle:
            assert fits_handle._file.compression == "bzip2"
            assert len(fits_handle) == 5
        with fits.open(bz2.BZ2File(bzip_file)) as fits_handle:
            assert fits_handle._file.compression == "bzip2"
            assert len(fits_handle) == 5

    @pytest.mark.skipif(not HAS_BZ2, reason="Python built without bz2 module")
    def test_open_bzipped_from_handle(self):
        with open(self._make_bzip2_file(), "rb") as handle:
            with fits.open(handle) as fits_handle:
                assert fits_handle._file.compression == "bzip2"
                assert len(fits_handle) == 5

    @pytest.mark.skipif(not HAS_BZ2, reason="Python built without bz2 module")
    def test_detect_bzipped(self):
        """Test detection of a bzip2 file when the extension is not .bz2."""
        with fits.open(self._make_bzip2_file("test0.xx")) as fits_handle:
            assert fits_handle._file.compression == "bzip2"
            assert len(fits_handle) == 5

    @pytest.mark.skipif(not HAS_BZ2, reason="Python built without bz2 module")
    def test_writeto_bzip2_fileobj(self):
        """Test writing to a bz2.BZ2File file like object"""
        fileobj = bz2.BZ2File(self.temp("test.fits.bz2"), "w")
        h = fits.PrimaryHDU()
        try:
            h.writeto(fileobj)
        finally:
            fileobj.close()
        with fits.open(self.temp("test.fits.bz2")) as hdul:
            assert hdul[0].header == h.header

    @pytest.mark.skipif(not HAS_BZ2, reason="Python built without bz2 module")
    def test_writeto_bzip2_filename(self):
        """Test writing to a bzip2 file by name"""
        filename = self.temp("testname.fits.bz2")
        h = fits.PrimaryHDU()
        h.writeto(filename)
        with fits.open(self.temp("testname.fits.bz2")) as hdul:
            assert hdul[0].header == h.header

  • Straightforward decompression test, but for io.ascii. My current PR only addresses io.fits, but it could probably be expanded to io.ascii without much effort (though I haven't been able to quickly find exactly where io.ascii handles decompression... might it be the get_readable_fileobj I linked above?)

    @pytest.mark.xfail(not HAS_BZ2, reason="requires bz2")
    @pytest.mark.parametrize("filename", ["data/short.rdb.bz2", "data/ipac.dat.bz2"])
    def test_bzip2(filename):
        t_comp = read(get_pkg_data_filename(filename))
        t_uncomp = read(get_pkg_data_filename(filename.replace(".bz2", "")))
        assert t_comp.dtype.names == t_uncomp.dtype.names
        assert np.all(t_comp.as_array() == t_uncomp.as_array())

  • This one doesn't seem to be testing bz2 specifically, rather just using it as an example, so it doesn't seem like we would need to replicate it for LZW.

    @pytest.mark.xfail(not HAS_BZ2, reason="requires bz2")
    def test_guessing_file_object():
        """
        Test guessing a file object. Fixes #3013 and similar issue noted in #3019.
        """
        with open("data/ipac.dat.bz2", "rb") as fd:
            t = ascii.read(fd)
        assert t.colnames == ["ra", "dec", "sai", "v2", "sptype"]

  • And finally, there are a few tests in utils/tests/test_data.py whose purpose I don't fully understand

    # Package data functions
    @pytest.mark.filterwarnings("ignore:unclosed:ResourceWarning")
    @pytest.mark.parametrize(
        "filename", ["local.dat", "local.dat.gz", "local.dat.bz2", "local.dat.xz"]
    )
    def test_local_data_obj(filename):
        if (not HAS_BZ2 and "bz2" in filename) or (not HAS_LZMA and "xz" in filename):
            with pytest.raises(ValueError, match=r" format files are not supported"):
                with get_pkg_data_fileobj(
                    os.path.join("data", filename), encoding="binary"
                ) as f:
                    f.readline()
                    # assert f.read().rstrip() == b'CONTENT'
        else:
            with get_pkg_data_fileobj(
                os.path.join("data", filename), encoding="binary"
            ) as f:
                f.readline()
                assert f.read().rstrip() == b"CONTENT"

    @pytest.mark.filterwarnings("ignore:unclosed:ResourceWarning")
    def test_local_data_obj_invalid(bad_compressed):
        is_bz2 = bad_compressed.endswith(".bz2")
        is_xz = bad_compressed.endswith(".xz")
        # Note, since these invalid files are created on the fly in order to avoid
        # problems with detection by antivirus software
        # (see https://github.com/astropy/astropy/issues/6520), it is no longer
        # possible to use ``get_pkg_data_fileobj`` to read the files. Technically,
        # they're not local anymore: they just live in a temporary directory
        # created by pytest. However, we can still use get_readable_fileobj for the
        # test.
        if (not HAS_BZ2 and is_bz2) or (not HAS_LZMA and is_xz):
            with pytest.raises(
                ModuleNotFoundError, match=r"does not provide the [lb]z[2m]a? module\."
            ):
                with get_readable_fileobj(bad_compressed, encoding="binary") as f:
                    f.read()
        else:
            with get_readable_fileobj(bad_compressed, encoding="binary") as f:
                assert f.read().rstrip().endswith(b"invalid")

    @pytest.mark.filterwarnings("ignore:unclosed:ResourceWarning")
    @pytest.mark.parametrize(
        "encoding, expected_type, expected_lines",
        [
            pytest.param("utf-8", str, ["האסטרונומי פייתון"], id="utf-8"),
            pytest.param(
                "binary",
                bytes,
                [
                    b"\xd7\x94\xd7\x90\xd7\xa1\xd7\x98\xd7\xa8\xd7\x95\xd7\xa0\xd7\x95"
                    b"\xd7\x9e\xd7\x99 \xd7\xa4\xd7\x99\xd7\x99\xd7\xaa\xd7\x95\xd7\x9f"
                ],
                id="binary",
            ),
        ],
    )
    @pytest.mark.parametrize(
        "filename",
        [
            "unicode.txt",
            "unicode.txt.gz",
            pytest.param(
                "unicode.txt.bz2",
                marks=pytest.mark.xfail(not HAS_BZ2, reason="no bz2 support"),
            ),
            pytest.param(
                "unicode.txt.xz",
                marks=pytest.mark.xfail(not HAS_LZMA, reason="no lzma support"),
            ),
        ],
    )
    def test_read_unicode(filename, encoding, expected_type, expected_lines):
        contents = get_pkg_data_contents(os.path.join("data", filename), encoding=encoding)
        assert type(contents) is expected_type
        # Using splitlines() instead of split("\n") here for portability as
        # newlines can be represented differently in different OSes.
        assert contents.splitlines() == expected_lines

@kYwzor
Member Author

kYwzor commented Apr 1, 2025

Added support inside get_readable_fileobj, does anyone know an example use-case to test it?

@saimn
Contributor

saimn commented Apr 3, 2025

Agreed on it being puzzling that io.fits backtracks so much; I did not investigate what causes it exactly, I just assumed there was a legitimate reason. Maybe someone can chime in?

By default we load HDUs lazily and use memmap, so it's not possible to predict if and when data will be loaded.
Maybe try with memmap=False (but it depends on what you are testing).

Regarding the test file, yes ESO files are probably fine if they are accessible from the public ESO archive. [...] The file has about 4MB

4MB is too big to store it in the astropy repo, so either we have a smaller file (<50 KB) that we can store here or another option would be to have a remote data test (we probably need to host it on astropy-data instead of using directly the ESO link).

I wonder if it would be reasonable to have a different flag in which we only decompress what's requested, but we keep in memory whatever has already been decompressed (i.e. lazy decompressing but storing it in memory).

We already have a lot of parameters, and this one seems a bit too specific. Also, it would probably not apply to other compression methods, where we might not control this behavior.

I haven't been able to quickly find exactly where io.ascii is handling decompression... might be the get_readable_fileobj I linked above?

Yes I think so.

@saimn
Contributor

saimn commented Apr 3, 2025

Added support inside get_readable_fileobj, does anyone know an example use-case to test it?

You can probably extend some of the tests from #3667 ?

@kYwzor
Member Author

kYwzor commented Apr 3, 2025

Regarding the test file, it seems ESO's programmatic access allows you to filter by file size, so I searched for files smaller than 20KB. There's obviously tons, so I just picked the oldest observation (why not!) which is a 13.4KB file (33.7KB uncompressed). This is the details link and direct download link. My thinking is we could put both the compressed and uncompressed files in the repository, so that we could then test the decompression and compare with the uncompressed file. Let me know if this is ok for you.

@mhvk
Contributor

mhvk commented Apr 3, 2025

@kYwzor - I'd store only the compressed version in the repository, and just test a few header keywords and a few data points. For better or worse, all those test data get installed with the package for everyone who downloads it, so keeping things small is important!

@kYwzor kYwzor changed the title Add LZW file support to io.fits Add LZW file support to io.fits, io.ascii and utils.data Apr 7, 2025
@kYwzor
Member Author

kYwzor commented Apr 7, 2025

I changed the title of this PR to better reflect that this change also impacts io.ascii and utils.data (and added towncrier files accordingly). I believe I've added all the relevant tests, and I've changed the documentation to mention the LZW capabilities where relevant. I believe this PR is now in good shape to be reviewed.

In parallel, I've been updating uncompresspy to improve some interfaces. Before merging this, I'd still like to remove the kludgy "keep_buffer" option and instead implement a solution for seeking backwards based on checkpoints, as previously discussed here.

@kYwzor
Member Author

kYwzor commented Apr 8, 2025

I've updated uncompresspy, and it is now able to keep checkpoints in the file, which enables much better seeking performance. I updated the dependencies and removed the "keep_buffer" option accordingly.

I believe this PR is ready for final review.

@kYwzor kYwzor force-pushed the lzwfiles branch 3 times, most recently from e07d4f8 to f31b06c on April 9, 2025
@kYwzor
Member Author

kYwzor commented Apr 9, 2025

Had to make changes to solve merge conflicts resulting from #17968. Commit history was getting really messy so I took the chance to clean it up and squash the commits.

Contributor

@mhvk mhvk left a comment

Nice! I have only the most nitpicky comments.

@kYwzor
Member Author

kYwzor commented Apr 9, 2025

@mhvk thanks for the comments, I agree with them and they should now be addressed in my latest commit.

Regarding the tests on test_compressed.py, indeed I just copied what the others were doing. I edited my test as you proposed, but kept the others as-is (I think it would be out of scope to touch them).

Regarding lzma and bz2, my understanding is that they are indeed in the standard library, but minimal Python installations may not include them (in other words, they are technically an optional part of a Python installation). I think in practice this never happens on Windows, but if you compile from source on Linux you may be missing some dependencies, in which case support for lzma/bz2 is disabled.

@kYwzor
Member Author

kYwzor commented Apr 9, 2025

Had to do a rebase to fix merge conflicts caused by #17984; should be fine now.

Contributor

@mhvk mhvk left a comment

Looks all good to me, but obviously @saimn should approve too.

Contributor

@saimn saimn left a comment

LGTM, great to see this issue fixed, thanks @kYwzor !

@saimn saimn merged commit 92b38ae into astropy:main Apr 10, 2025
27 checks passed
Successfully merging this pull request may close these issues.

fits.Z cannot be opened using automatic decompression