ENH: added lzma support to data file handler by dhomeier · Pull Request #3667 · astropy/astropy

dhomeier · 2015-04-06T19:38:18Z

This PR adds capability to transparently detect and read files with LZMA compression (.xz) to get_readable_fileobj(), using either the builtin lzma module of Python 3.3+, or the backports.lzma package, when installed - the latter should thus only be relevant for 2.6+2.7 among the still-supported versions.

Only potential issue I could think of is that backports.lzma needs to open the file by its name, but since a new file handle is returned in place of the original one anyway, I don't really see a problem with that.

Added a line about the decompression capabilities to the unified file read/write interface doc, with a link to get_readable_fileobj(); maybe debatable if the latter is getting in too deep.

embray · 2015-04-06T20:46:09Z

Sounds okay to me.

embray · 2015-04-06T20:50:57Z

astropy/utils/tests/test_data.py

This test now seems to be leaving open a file handle somewhere--some of the build configurations test that all tests close any files they open. It's not obvious to me yet where that's happening, but you can try it yourself locally by passing the --open-files flag to ./setup.py test.

I was able to confirm this locally, so I don't think it's just a transient failure. And it is indeed happening when 'local.dat.xz' is passed to the filename fixture.

Just to give one more data point though, it turns out my Python (3.4) is not compiled with lzma support:

[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux Type "help", "copyright", "credits" or "license" for more information. >>> import lzma Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/internal/1/root/usr/local/lib/python3.4/lzma.py", line 26, in <module> from _lzma import * ImportError: No module named '_lzma' >>>

despite this, the test is still doing something (I'm not sure what yet) and leaving an open file. That might be what's happening on Travis too.

dhomeier · 2015-04-06T20:59:50Z

Couldn't run the test --open-files here, but I'm trying to install psutil now. I suspect it's the original file handle being abandoned by backports.lzma, probably just needs a fileobj.close().

dhomeier · 2015-04-06T22:08:50Z

Right, that was quite unexpected, but I think the underlying problem is that get_readable_fileobj raises the ValueError without closing fileobj and then returns the file handle just like for any other binary file.
Did not expect this because I followed exactly the way the bzip2 case was setup, but perhaps the case without bzip2 support was never really tested as there are no installations without it anymore?
I did notice that higher-level callers like ascii.read just read the binary data despite the ValueError, which did not really seem like the desired behaviour.
In any case, just forced the import bz2 to fail here, and now the local.dat.bz2 is also reported as left open, as well as a bunch of InconsistentTableError and the like from test_compressed and others. test_read e.g. also does not check for bzip2 support at all.

dhomeier · 2015-04-06T22:16:11Z

About the test philosophy, I did not like the original way in test_data of raising the error and then intercepting it with the

        if not HAS_BZ2 and 'bz2' in filename:
             pass

and replaced it with the with pytest.raises(ValueError) as e: in the first instance. Skipping would be an alternative, too, but I think essentially this is an xfail.

dhomeier · 2015-04-06T22:35:46Z

OK, I think the problem is in ascii.read intercepting any exception here:

    if guess:
        # If `table` is a filename or readable file object then read in the
        # file now.  This prevents problems in Python 3 with the file object
        # getting closed or left at the file end.  See #3132, #3013, #3109,
        # #2001.  If a `readme` arg was passed that implies CDS format, in
        # which case the original `table` as the data filename must be left
        # intact.
        if 'readme' not in new_kwargs:
            try:
                with get_readable_fileobj(table) as fileobj:
                    table = fileobj.read()
            except:
                pass

Which makes no distinction between binary files that could not be decoded at all and text files with a not-yet-understood structure.
Actually I don't understand the logic of that test at all - it should fail exactly when there is nothing to read from fileobj whatsoever, in which case it doesn't seem to make much sense to go ahead and try to guess its format... But I also don't quite see why data.py bypasses the errors from invalid gzip/bzip2 files. Would seem more logical to me to raise IOError there and then have ascii.read explicitly checking for (and honouring) that IOError!

dhomeier · 2015-04-07T01:17:59Z

Well that seems to work better; left it at the ValueError now.
There was one open-file error remaining from test_hdf5 which is not really a fault of the module; since test_read_h5py_objects() directly calls f = h5py.File(test_file) it should be reasonable to demand a f.close() as well. Travis apparently has no h5py, so this has probably slipped through.

dhomeier · 2015-04-07T14:33:10Z

Oh, had missed the original except ValueError in test_data.py you did not like - they are now all marked as xfails, only test_local_data_obj* are explicitly testing for the right ValueError.
Also fixed the open file failure in test_hdf5.py.

mdboom · 2015-04-07T14:36:21Z

astropy/utils/data.py

There's a number of situations where Python doesn't have bzip2 support (i.e. it's in the stdlib but not compiled). Maybe say "any of which can be compressed in gzip, bzip2 or xz if the appropriate compression libraries are available."

mdboom · 2015-04-07T14:55:17Z

Thanks for taking this on. As I'm sure you're discovering, file handling in Python is full of all kinds of fun inconsistencies and gotchas -- particularly when trying to support as many different versions of Python as we are. Don't take the amount of feedback as saying anything negative about your work...

dhomeier · 2015-04-07T16:47:26Z

You're welcome; I did get some exposure to this when doing a little work on the numpy io routines, sometimes similarly frustrating. It's not so easy to get your hands on an installation without bzip2 compiled in, or Python 3.3 without lzma.
My principal remaining concern is whether the IOError and EOFError for invalid compressed files should really be left as they are (so they will be ignored by ascii.read()). Since there is a file in those cases, and it has content, but not valid g|b|xz data, I think passing this on as a ValueError would make some sense...

hamogu · 2015-04-26T01:26:15Z

@taldcroft : I just saw this issue and thought it's worth pointing out to you, too. The way io.ascii guessing goes through files, but then just ignores any error without checking if the error really indicates a bad ascii file or an IO problem, is causing issues in other parts of astropy. Not something I thought about when I argued that we should catch all possible errors in guessing in #3346.

embray · 2015-07-07T22:27:47Z

This seems to have gotten mixed up with some unrelated commits (possibly due to a botched merge or something). @dhomeier, do you think you'll have a chance any time soon to finish work on this? If not, I can try to see what I can do to tie up the loose ends.

dhomeier · 2015-07-13T20:31:58Z

Sorry I've left this and my other pending PR open this long as I was busy changing to a new job and moving... I will have a look at the merge conflicts here; apart from this there were no intrinsic open issues with this one afaiac, just the general question how many exceptions should be caught and silenced by io.ascii. Gonna have a closer look at #3346 as well.

embray · 2015-07-14T15:46:10Z

No problem, thanks!

dhomeier · 2015-08-10T15:46:39Z

Later than I hoped; the actual conflicts are pretty much trivial and easily fixed – don't know why they don't merge in the first place. Not sure what exactly is the correct procedure now; do I merge the current HEAD of master and push to this branch?
And there is one remaining issue driving me nuts (on most switches from master to older branches actually): I am always getting a conflict like this for astropy_helpers without any actual content:

diff --git a/astropy_helpers b/astropy_helpers
index 1ecac98..161773f 160000
--- a/astropy_helpers
+++ b/astropy_helpers
@@ -1 +1 @@
-Subproject commit 1ecac98d547729393ca54bcbe4f03cd4e1bfcd9a
+Subproject commit 161773fa72d916c498e0a2a513ecc24460244ac8-dirty

How can I convince git this is resolved or just take the bloody version that's in the current HEAD when there's no file to edit? I tried to git rm astropy_helpers, but then the entire directory is gone for good (I did not git rm -r)...

dhomeier · 2015-08-10T17:18:07Z

OK, no idea what's going on now; I have managed to auto-merge master with this branch using the recursive theirs strategy and managed to keep git silent enough about astropy_helpers to push the update back. Failing check is apparently due to the

try:
    import bz2
except ImportError:
    HAS_BZ2 = False
else:
    HAS_BZ2 = True

in astropy/io/ascii/tests/test_read.py being refused to merge, but I have no idea what to change on this side to make that happen. Also don't know how to find any further information on what the details of
This branch has conflicts that must be resolved are...

embray · 2015-08-10T17:30:30Z

You're messing yourself up by trying to merge the master branch into your branch. Try this instead. First go back to the last commit you made on your branch (before the merging mess):

$ git reset --hard 6a47e95e5125324bb1c5f6040db88b46f8883d48

Then follow the instructions for performing a rebase: http://docs.astropy.org/en/stable/development/workflow/development_workflow.html#rebase-but-only-if-asked

If at any point you get a merge conflict with astropy_helpers just enter:

$ git reset astropy_helpers
$ git rebase --continue  # After resolving any other conflicts on that particular commit

Finish the rebase, resolving any issues along the way. Push the updated branch:

$ git push --force origin <branchname>

You should end up with a pull request that has your commits and only your commits applied as diffs to the current upstream/master branch.

added capability to detect and read files with LZMA compression (.xz) to get_readable_fileobj(), using either the builtin lzma module of Python 3.3+, or the backports.lzma package, when installed

changed xz file opening with backports.lzma to use NamedTemporaryFile; added xz cases to test_compressed and check for bzip2 support to test_read; ascii.read() raises ValueError on unreadable files

switched the remaining checks for bz2/lzma support to pytest.mark.xfail

dhomeier · 2015-08-10T17:48:17Z

Thanks, the git reset astropy_helpers is very helpful to know about. Was a bit worried about forcing the push.

embray · 2015-08-10T19:26:11Z

Great, thanks @dhomeier that seems to have fixed it up correctly. If you'll just add a changelog entry for this in v1.1.0 under New Features/astropy.utils we can merge this. I'm not worried about a .01% drop in coverage.

dhomeier · 2015-08-10T22:54:18Z

Which appeared in astropy/wcs/src/str_list_proxy.c, so I am not worried either...
You want it under astropy.utils, not in astropy.io.ascii?.

embray · 2015-08-11T22:01:15Z

@dhomeier Yes, in astropy.utils since the change is in get_readable_fileobj (which io.ascii happens to use). There are some dark murky corners where this isn't used yet, such as in io.fits, but should be. You could add a note that this means ASCII tables can be read from lzma compressed files, but I don't really see that as the main point (even if that was the original motivation :)

embray · 2015-08-11T22:03:02Z

@dhomeier Or, if it's important to you to call out, it's fine to add a changelog entry to both sections. Under New Features/io.ascii say that ASCII tables can be read from LZMA compressed files. Then also note under New Features/utils that get_readable_fileobj supports them as well. Reference this issue in both changelog entries.

dhomeier · 2015-08-11T22:50:51Z

Sensible, io.ascii just was the only instance actually using this that I was aware of (I did notice the independent implementation of bzip2 compression in io.fits, which of course raises the issue of making LZMA available for FITS files as well). Maybe for another PR.

embray · 2015-08-11T23:10:54Z

Yeah, I need to refactor the FITS stuff to use the code from astropy.utils. A lot of the code in io.fits predates Astropy.

embray · 2015-08-11T23:11:12Z

In some cases that has already been done, such as the use of download_file to open FITS files from URLs.

dhomeier · 2015-08-12T00:22:21Z

Done

embray · 2015-08-12T13:57:49Z

Great, thanks! Sorry for the rebase hassles.

ENH: added lzma support to data file handler

dhomeier · 2015-08-12T14:13:51Z

Thanks, just my learning curve with pull requests, I guess!

embray added utils Affects-release labels Apr 6, 2015

embray added the Enhancement label Apr 6, 2015

embray reviewed Apr 6, 2015
View reviewed changes

mdboom reviewed Apr 7, 2015
View reviewed changes

embray self-assigned this Jul 7, 2015

Derek Homeier added 2 commits August 10, 2015 19:38

ENH: added lzma support to data file handler

f28d1a5

added capability to detect and read files with LZMA compression (.xz) to get_readable_fileobj(), using either the builtin lzma module of Python 3.3+, or the backports.lzma package, when installed

fixes for open files and compression support

7e06391

changed xz file opening with backports.lzma to use NamedTemporaryFile; added xz cases to test_compressed and check for bzip2 support to test_read; ascii.read() raises ValueError on unreadable files

Derek Homeier added 3 commits August 10, 2015 19:40

TST: mark.xfail used throughout test_data; fixed open file in test_hdf5

98cc956

switched the remaining checks for bz2/lzma support to pytest.mark.xfail

BUG: cleanup of open files after ImportError in data.py, whitespace

c1a67a6

data.py doc string and file opening changes suggested by mdboom

1ab3582

dhomeier force-pushed the lzmafile branch from 84253ec to 1ab3582 Compare August 10, 2015 17:45

embray added this to the v1.1.0 milestone Aug 10, 2015

embray mentioned this pull request Aug 11, 2015

Fortran-exponents #3655

Closed

Changelog entries for LZMA support

5e5ca16

embray added a commit that referenced this pull request Aug 12, 2015

Merge pull request #3667 from dhomeier/lzmafile

90d3d72

ENH: added lzma support to data file handler

embray merged commit 90d3d72 into astropy:master Aug 12, 2015

saimn mentioned this pull request Apr 3, 2025

Add LZW file support to io.fits, io.ascii and utils.data #17960

Merged

pllim mentioned this pull request Apr 8, 2025

Fix LZMA invalid file checking in utils #17984

Merged

Uh oh!

Conversation

dhomeier commented Apr 6, 2015

Uh oh!

embray commented Apr 6, 2015

Uh oh!

embray Apr 6, 2015

Choose a reason for hiding this comment

Uh oh!

embray Apr 6, 2015

Choose a reason for hiding this comment

Uh oh!

embray Apr 6, 2015

Choose a reason for hiding this comment

Uh oh!

dhomeier commented Apr 6, 2015

Uh oh!

dhomeier commented Apr 6, 2015

Uh oh!

dhomeier commented Apr 6, 2015

Uh oh!

dhomeier commented Apr 6, 2015

Uh oh!

dhomeier commented Apr 7, 2015

Uh oh!

dhomeier commented Apr 7, 2015

Uh oh!

mdboom Apr 7, 2015

Choose a reason for hiding this comment

Uh oh!

mdboom commented Apr 7, 2015

Uh oh!

dhomeier commented Apr 7, 2015

Uh oh!

hamogu commented Apr 26, 2015

Uh oh!

embray commented Jul 7, 2015

Uh oh!

dhomeier commented Jul 13, 2015

Uh oh!

embray commented Jul 14, 2015

Uh oh!

dhomeier commented Aug 10, 2015

Uh oh!

dhomeier commented Aug 10, 2015

Uh oh!

embray commented Aug 10, 2015

Uh oh!

dhomeier commented Aug 10, 2015

Uh oh!

embray commented Aug 10, 2015

Uh oh!

dhomeier commented Aug 10, 2015

Uh oh!

embray commented Aug 11, 2015

Uh oh!

embray commented Aug 11, 2015

Uh oh!

dhomeier commented Aug 11, 2015

Uh oh!

embray commented Aug 11, 2015

Uh oh!

embray commented Aug 11, 2015

Uh oh!

dhomeier commented Aug 12, 2015

Uh oh!

embray commented Aug 12, 2015

Uh oh!

dhomeier commented Aug 12, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants