Skip to content

[MRG] simplify loading of compressed files#400

Merged
ogrisel merged 5 commits intojoblib:masterfrom
aabadie:compressor_detection
Oct 12, 2016
Merged

[MRG] simplify loading of compressed files#400
ogrisel merged 5 commits intojoblib:masterfrom
aabadie:compressor_detection

Conversation

@aabadie
Copy link
Copy Markdown
Contributor

@aabadie aabadie commented Sep 28, 2016

This is a small optimization. Maybe we could check if the passed fileobj is seekable before detecting the compressor and raise an error if it's not the case.

Related to #398.

@aabadie aabadie changed the title skip compressor detection if the fileobj is in the kwnon compressor [MRG] skip compressor detection if the fileobj is in the kwnon compressor Sep 28, 2016
@aabadie aabadie changed the title [MRG] skip compressor detection if the fileobj is in the kwnon compressor [MRG] skip compressor detection if the fileobj is a knwon compressor Sep 29, 2016
@aabadie aabadie changed the title [MRG] skip compressor detection if the fileobj is a knwon compressor [MRG] skip compressor detection if the fileobj is a known compressor Sep 29, 2016
@aabadie aabadie force-pushed the compressor_detection branch 2 times, most recently from 92548f0 to 5b566f9 Compare October 1, 2016 16:54
if PY3_OR_LATER and not fileobj.seekable():
# With Python 3 file object have to be able to seek explicitely.
raise ValueError("Input fileobject must be seekable.")
compressor = _detect_compressor(fileobj)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am +1 for moving the _detect_compressor here, but I am not sure what the added value is of requiring that .seekable() return True.

What kind of error do we get with a file that is not seekable?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed I don't understand why we check this here and how it can possibly work in Python 2.

Copy link
Copy Markdown
Member

@lesteve lesteve Oct 12, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed I don't understand why we check this here and how it can possibly work in Python 2.

seekable may not exist or return False (as it was the case for BinaryZlibFile before the PR just merged) and yet .seek be completely functional.

An example for Python 2.7:

f = open('/tmp/bla', 'w')
f.seekable()  # AttributeError: 'file' object has no attribute 'seekable'
f.seek(3)  # works fine

# the very beginning of the file.
if PY3_OR_LATER and not fileobj.seekable():
# With Python 3 file object have to be able to seek explicitely.
raise ValueError("Input fileobject must be seekable.")
Copy link
Copy Markdown
Contributor

@ogrisel ogrisel Oct 12, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please include the string representation of the fileobject in the message to make it more informative:

raise ValueError("Cannot load from non-seekable fileobject %r" % fileobj)

np.testing.assert_array_equal(array_reloaded, test_array)


def test_pickling_in_non_seekable_fileobject_raise_error():
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More accurate test name: test_loading_from_non_seekable_fileobject_raise_error

if PY3_OR_LATER and not fileobj.seekable():
# With Python 3 file object have to be able to seek explicitely.
raise ValueError("Input fileobject must be seekable.")
compressor = _detect_compressor(fileobj)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed I don't understand why we check this here and how it can possibly work in Python 2.

@aabadie aabadie changed the title [MRG] skip compressor detection if the fileobj is a known compressor [MRG] simplify loading of compressed files Oct 12, 2016
@aabadie aabadie force-pushed the compressor_detection branch from 5b566f9 to 0a261fb Compare October 12, 2016 14:43
@aabadie
Copy link
Copy Markdown
Contributor Author

aabadie commented Oct 12, 2016

After some discussion IRL, I came out to another refactoring of the compression detection when loading pickles:

  • when applicable, use peek instead of seek to determine the magic number of the files
  • better detect and handle cases where mmap_mode is not applicable (compressor detected in stream, non raw file passed to read from, use of memory buffers, eg io.BytesIO)

It doesn't change the actual behavior, just make the code cleaner.

# objects (GzipFile, BZ2File, LzmaFile), we simply return it.
if isinstance(fileobj, tuple(_COMPRESSOR_CLASSES)):
yield fileobj
pass
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I think we should remove this case now. This would allow for nested compression no?

Copy link
Copy Markdown
Contributor

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. The logic is much more straightforward to read and understand.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants