Skip to content

ENH: added lzma support to data file handler#3667

Merged
embray merged 6 commits intoastropy:masterfrom
dhomeier:lzmafile
Aug 12, 2015
Merged

ENH: added lzma support to data file handler#3667
embray merged 6 commits intoastropy:masterfrom
dhomeier:lzmafile

Conversation

@dhomeier
Copy link
Copy Markdown
Contributor

@dhomeier dhomeier commented Apr 6, 2015

This PR adds capability to transparently detect and read files with LZMA compression (.xz) to get_readable_fileobj(), using either the builtin lzma module of Python 3.3+, or the backports.lzma package, when installed - the latter should thus only be relevant for 2.6+2.7 among the still-supported versions.

Only potential issue I could think of is that backports.lzma needs to open the file by its name, but since a new file handle is returned in place of the original one anyway, I don't really see a problem with that.

Added a line about the decompression capabilities to the unified file read/write interface doc, with a link to get_readable_fileobj(); maybe debatable if the latter is getting in too deep.

@embray
Copy link
Copy Markdown
Member

embray commented Apr 6, 2015

Sounds okay to me.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test now seems to be leaving open a file handle somewhere--some of the build configurations test that all tests close any files they open. It's not obvious to me yet where that's happening, but you can try it yourself locally by passing the --open-files flag to ./setup.py test.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was able to confirm this locally, so I don't think it's just a transient failure. And it is indeed happening when 'local.dat.xz' is passed to the filename fixture.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to give one more data point though, it turns out my Python (3.4) is not compiled with lzma support:

[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import lzma
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/internal/1/root/usr/local/lib/python3.4/lzma.py", line 26, in <module>
    from _lzma import *
ImportError: No module named '_lzma'
>>>

despite this, the test is still doing something (I'm not sure what yet) and leaving an open file. That might be what's happening on Travis too.

@dhomeier
Copy link
Copy Markdown
Contributor Author

dhomeier commented Apr 6, 2015

Couldn't run the test --open-files here, but I'm trying to install psutil now. I suspect it's the original file handle being abandoned by backports.lzma, probably just needs a fileobj.close().

@dhomeier
Copy link
Copy Markdown
Contributor Author

dhomeier commented Apr 6, 2015

Right, that was quite unexpected, but I think the underlying problem is that get_readable_fileobj raises the ValueError without closing fileobj and then returns the file handle just like for any other binary file.
Did not expect this because I followed exactly the way the bzip2 case was setup, but perhaps the case without bzip2 support was never really tested as there are no installations without it anymore?
I did notice that higher-level callers like ascii.read just read the binary data despite the ValueError, which did not really seem like the desired behaviour.
In any case, just forced the import bz2 to fail here, and now the local.dat.bz2 is also reported as left open, as well as a bunch of InconsistentTableError and the like from test_compressed and others. test_read e.g. also does not check for bzip2 support at all.

@dhomeier
Copy link
Copy Markdown
Contributor Author

dhomeier commented Apr 6, 2015

About the test philosophy, I did not like the original way in test_data of raising the error and then intercepting it with the

        if not HAS_BZ2 and 'bz2' in filename:
             pass

and replaced it with the with pytest.raises(ValueError) as e: in the first instance. Skipping would be an alternative, too, but I think essentially this is an xfail.

@dhomeier
Copy link
Copy Markdown
Contributor Author

dhomeier commented Apr 6, 2015

OK, I think the problem is in ascii.read intercepting any exception here:

    if guess:
        # If `table` is a filename or readable file object then read in the
        # file now.  This prevents problems in Python 3 with the file object
        # getting closed or left at the file end.  See #3132, #3013, #3109,
        # #2001.  If a `readme` arg was passed that implies CDS format, in
        # which case the original `table` as the data filename must be left
        # intact.
        if 'readme' not in new_kwargs:
            try:
                with get_readable_fileobj(table) as fileobj:
                    table = fileobj.read()
            except:
                pass

Which makes no distinction between binary files that could not be decoded at all and text files with a not-yet-understood structure.
Actually I don't understand the logic of that test at all - it should fail exactly when there is nothing to read from fileobj whatsoever, in which case it doesn't seem to make much sense to go ahead and try to guess its format... But I also don't quite see why data.py bypasses the errors from invalid gzip/bzip2 files. Would seem more logical to me to raise IOError there and then have ascii.read explicitly checking for (and honouring) that IOError!

@dhomeier
Copy link
Copy Markdown
Contributor Author

dhomeier commented Apr 7, 2015

Well that seems to work better; left it at the ValueError now.
There was one open-file error remaining from test_hdf5 which is not really a fault of the module; since test_read_h5py_objects() directly calls f = h5py.File(test_file) it should be reasonable to demand a f.close() as well. Travis apparently has no h5py, so this has probably slipped through.

@dhomeier
Copy link
Copy Markdown
Contributor Author

dhomeier commented Apr 7, 2015

Oh, had missed the original except ValueError in test_data.py you did not like - they are now all marked as xfails, only test_local_data_obj* are explicitly testing for the right ValueError.
Also fixed the open file failure in test_hdf5.py.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a number of situations where Python doesn't have bzip2 support (i.e. it's in the stdlib but not compiled). Maybe say "any of which can be compressed in gzip, bzip2 or xz if the appropriate compression libraries are available."

@mdboom
Copy link
Copy Markdown
Contributor

mdboom commented Apr 7, 2015

Thanks for taking this on. As I'm sure you're discovering, file handling in Python is full of all kinds of fun inconsistencies and gotchas -- particularly when trying to support as many different versions of Python as we are. Don't take the amount of feedback as saying anything negative about your work...

@dhomeier
Copy link
Copy Markdown
Contributor Author

dhomeier commented Apr 7, 2015

You're welcome; I did get some exposure to this when doing a little work on the numpy io routines, sometimes similarly frustrating. It's not so easy to get your hands on an installation without bzip2 compiled in, or Python 3.3 without lzma.
My principal remaining concern is whether the IOError and EOFError for invalid compressed files should really be left as they are (so they will be ignored by ascii.read()). Since there is a file in those cases, and it has content, but not valid g|b|xz data, I think passing this on as a ValueError would make some sense...

@hamogu
Copy link
Copy Markdown
Member

hamogu commented Apr 26, 2015

@taldcroft : I just saw this issue and thought it's worth pointing out to you, too. The way io.ascii guessing goes through files, but then just ignores any error without checking if the error really indicates a bad ascii file or an IO problem, is causing issues in other parts of astropy. Not something I thought about when I argued that we should catch all possible errors in guessing in #3346.

@embray embray self-assigned this Jul 7, 2015
@embray
Copy link
Copy Markdown
Member

embray commented Jul 7, 2015

This seems to have gotten mixed up with some unrelated commits (possibly due to a botched merge or something). @dhomeier, do you think you'll have a chance any time soon to finish work on this? If not, I can try to see what I can do to tie up the loose ends.

@dhomeier
Copy link
Copy Markdown
Contributor Author

Sorry I've left this and my other pending PR open this long as I was busy changing to a new job and moving... I will have a look at the merge conflicts here; apart from this there were no intrinsic open issues with this one afaiac, just the general question how many exceptions should be caught and silenced by io.ascii. Gonna have a closer look at #3346 as well.

@embray
Copy link
Copy Markdown
Member

embray commented Jul 14, 2015

No problem, thanks!

@dhomeier
Copy link
Copy Markdown
Contributor Author

Later than I hoped; the actual conflicts are pretty much trivial and easily fixed – don't know why they don't merge in the first place. Not sure what exactly is the correct procedure now; do I merge the current HEAD of master and push to this branch?
And there is one remaining issue driving me nuts (on most switches from master to older branches actually): I am always getting a conflict like this for astropy_helpers without any actual content:

diff --git a/astropy_helpers b/astropy_helpers
index 1ecac98..161773f 160000
--- a/astropy_helpers
+++ b/astropy_helpers
@@ -1 +1 @@
-Subproject commit 1ecac98d547729393ca54bcbe4f03cd4e1bfcd9a
+Subproject commit 161773fa72d916c498e0a2a513ecc24460244ac8-dirty

How can I convince git this is resolved or just take the bloody version that's in the current HEAD when there's no file to edit? I tried to git rm astropy_helpers, but then the entire directory is gone for good (I did not git rm -r)...

@dhomeier
Copy link
Copy Markdown
Contributor Author

OK, no idea what's going on now; I have managed to auto-merge master with this branch using the recursive theirs strategy and managed to keep git silent enough about astropy_helpers to push the update back. Failing check is apparently due to the

try:
    import bz2
except ImportError:
    HAS_BZ2 = False
else:
    HAS_BZ2 = True

in astropy/io/ascii/tests/test_read.py being refused to merge, but I have no idea what to change on this side to make that happen. Also don't know how to find any further information on what the details of
This branch has conflicts that must be resolved are...

@embray
Copy link
Copy Markdown
Member

embray commented Aug 10, 2015

You're messing yourself up by trying to merge the master branch into your branch. Try this instead. First go back to the last commit you made on your branch (before the merging mess):

$ git reset --hard 6a47e95e5125324bb1c5f6040db88b46f8883d48

Then follow the instructions for performing a rebase: http://docs.astropy.org/en/stable/development/workflow/development_workflow.html#rebase-but-only-if-asked

If at any point you get a merge conflict with astropy_helpers just enter:

$ git reset astropy_helpers
$ git rebase --continue  # After resolving any other conflicts on that particular commit

Finish the rebase, resolving any issues along the way. Push the updated branch:

$ git push --force origin <branchname>

You should end up with a pull request that has your commits and only your commits applied as diffs to the current upstream/master branch.

Derek Homeier added 2 commits August 10, 2015 19:38
added capability to detect and read files with LZMA compression (.xz)
to get_readable_fileobj(), using either the builtin lzma module of
Python 3.3+, or the backports.lzma package, when installed
changed xz file opening with backports.lzma to use NamedTemporaryFile;
added xz cases to test_compressed and check for bzip2 support to
test_read; ascii.read() raises ValueError on unreadable files
@dhomeier
Copy link
Copy Markdown
Contributor Author

Thanks, the git reset astropy_helpers is very helpful to know about. Was a bit worried about forcing the push.

@embray embray added this to the v1.1.0 milestone Aug 10, 2015
@embray
Copy link
Copy Markdown
Member

embray commented Aug 10, 2015

Great, thanks @dhomeier that seems to have fixed it up correctly. If you'll just add a changelog entry for this in v1.1.0 under New Features/astropy.utils we can merge this. I'm not worried about a .01% drop in coverage.

@dhomeier
Copy link
Copy Markdown
Contributor Author

Which appeared in astropy/wcs/src/str_list_proxy.c, so I am not worried either...
You want it under astropy.utils, not in astropy.io.ascii?.

@embray embray mentioned this pull request Aug 11, 2015
@embray
Copy link
Copy Markdown
Member

embray commented Aug 11, 2015

@dhomeier Yes, in astropy.utils since the change is in get_readable_fileobj (which io.ascii happens to use). There are some dark murky corners where this isn't used yet, such as in io.fits, but should be. You could add a note that this means ASCII tables can be read from lzma compressed files, but I don't really see that as the main point (even if that was the original motivation :)

@embray
Copy link
Copy Markdown
Member

embray commented Aug 11, 2015

@dhomeier Or, if it's important to you to call out, it's fine to add a changelog entry to both sections. Under New Features/io.ascii say that ASCII tables can be read from LZMA compressed files. Then also note under New Features/utils that get_readable_fileobj supports them as well. Reference this issue in both changelog entries.

@dhomeier
Copy link
Copy Markdown
Contributor Author

Sensible, io.ascii just was the only instance actually using this that I was aware of (I did notice the independent implementation of bzip2 compression in io.fits, which of course raises the issue of making LZMA available for FITS files as well). Maybe for another PR.

@embray
Copy link
Copy Markdown
Member

embray commented Aug 11, 2015

Yeah, I need to refactor the FITS stuff to use the code from astropy.utils. A lot of the code in io.fits predates Astropy.

@embray
Copy link
Copy Markdown
Member

embray commented Aug 11, 2015

In some cases that has already been done, such as the use of download_file to open FITS files from URLs.

@dhomeier
Copy link
Copy Markdown
Contributor Author

Done

@embray
Copy link
Copy Markdown
Member

embray commented Aug 12, 2015

Great, thanks! Sorry for the rebase hassles.

embray added a commit that referenced this pull request Aug 12, 2015
ENH: added lzma support to data file handler
@embray embray merged commit 90d3d72 into astropy:master Aug 12, 2015
@dhomeier
Copy link
Copy Markdown
Contributor Author

Thanks, just my learning curve with pull requests, I guess!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants