[MRG] Uses gzip when caching in fetch_openml#11830
[MRG] Uses gzip when caching in fetch_openml#11830jnothman merged 5 commits intoscikit-learn:masterfrom
Conversation
c20f9ef to
97dd729
Compare
|
Thanks for your PR! Could you please add "openml" to the title somewhere? Currently it's not really clear from the title or description what this PR is about. (That would help attracting reviewers). Also tests (and in particular Travis CI) are failing.. |
|
Ah I see, the test mocks have been updated to account for the gzip feature. |
jnothman
left a comment
There was a problem hiding this comment.
Can you confirm that this results in a faster fetch of MNIST when cache=False?
sklearn/datasets/openml.py
Outdated
| fsrc = urlopen(_OPENML_PREFIX + openml_path) | ||
| with open(local_path, 'wb') as fdst: | ||
| req.add_header('Accept-encoding', 'gzip') | ||
| fsrc = urlopen(req) |
There was a problem hiding this comment.
This should probably double check that the http response says it is actually gzipped.
There was a problem hiding this comment.
When the responses' s Content-Encoding is not gzip, I see two options to handle this:
- Do another request for without
gzip. - Raise an exception.
Which do you prefer?
|
With gzip enabled the response body size is 19.8 MB, without gzip enabled, it is 127.9 MB. With the MNIST datasets, there is little to none speed difference when running |
|
Okay.
Raise an exception please
|
Can't you allways set |
|
@rth I like the idea. I will update this PR according. |
|
@jnothman Previously, when |
rth
left a comment
There was a problem hiding this comment.
This is looking nice, but tests are failing with,
AttributeError: MockHTTPResponse instance has no attribute 'tell'
on python 2.7
| return MockHTTPResponse(fp, True) | ||
| else: | ||
| fp = read_fn(path, 'rb') | ||
| return MockHTTPResponse(fp, False) |
There was a problem hiding this comment.
This branch is not tested, can we add a test for it?
There was a problem hiding this comment.
This PR has been updated to test both branches.
| path_suffix = '.gz' | ||
| read_fn = gzip.open | ||
|
|
||
| class MockHTTPResponse(): |
There was a problem hiding this comment.
MockHTTPResponse(object) to use new-style classes on Py2, not that it matters much..
|
I update this PR to address the API differences of |
|
@jnothman Do you have any other comments about this? Should we merge it? |
|
I tried it out in our slow Australian internet and was very pleased :) |
|
Thanks yet again for your great work, @thomasjpfan! |
Reference Issues/PRs
Fixes #11822
What does this implement/fix? Explain your changes.
Adds the HTTP header:
Accept-encoding: gzipwhendata_homeis notNone.