ENH Verify md5-checksums received from openml arff file metadata #14800
Conversation
Force-pushed from 6281ea2 to df4c049
Most files in the local tests sklearn.datasets.tests.data.openml folder
do not match their checksums. This may be because of differences between
the metadata versions and the actual files downloaded.
Those files were artificially shortened to fit within the package. Feel
free to calculate and store their checksums rather than those on OpenML
@thomasjpfan ready for your review, incorporates what we discussed last time
sklearn/datasets/openml.py
Outdated
BytesIO stream with the same content as input_stream for consumption
"""
with closing(input_stream):
    bytes_content = input_stream.read()
Can we do this without consuming the whole stream? https://docs.python.org/3/library/hashlib.html?highlight=hashlib#examples
I don't think so. I could only think of this: to verify the hash we need to read the stream, so we read it and then return a new stream to keep the method's API consistent.
Did you mean consuming the stream in chunks and updating the hashlib.md5 instance with each chunk instead of reading it all in one go? I don't think that would make any difference.
Yes I mean in chunks.
Currently, with this PR the stream will be read twice: once during the md5 check and again by the caller of _open_openml_url.
Another option is to have _open_openml_url return the byte content such that the callers do not need to call read.
Done: implemented the second option, so the stream is now read once and its content is passed along.
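The approach being discussed can be sketched as follows. This is a minimal illustration, not the PR's actual code: the helper name `read_and_verify` and the error message are hypothetical; the idea is that the stream is read exactly once, verified, and the raw bytes are handed back so callers never call `read()` themselves.

```python
import hashlib
from io import BytesIO


def read_and_verify(stream, expected_md5=None):
    # Read the stream exactly once; callers receive the raw bytes
    # instead of a stream, so nothing needs to read it a second time.
    content = stream.read()
    if expected_md5 is not None:
        actual_md5 = hashlib.md5(content).hexdigest()
        if actual_md5 != expected_md5:
            raise ValueError(
                "md5 checksum of local file does not match expected: "
                f"{actual_md5} != {expected_md5}"
            )
    return content


payload = b"openml arff payload"
checksum = hashlib.md5(payload).hexdigest()
content = read_and_verify(BytesIO(payload), checksum)
```

A mismatched checksum raises `ValueError` instead of silently returning corrupted content.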
@ingrid88 @kellycarmody
Force-pushed from d53fe3d to 831d78b
@thomasjpfan changes ready for your review. Thanks.
sklearn/datasets/openml.py
Outdated
-------
bytes
"""
actual_md5_checksum = hashlib.md5(bytes_content).hexdigest()
I'm wondering if we should checksum in chunks. Could you please take a medium sized OpenML dataset (e.g. MNIST https://www.openml.org/d/554) and see whether computing the checksum the above way is still reasonable with respect to memory usage (see memory_profiler)?
It does not make any difference (see test gist)
I guess it makes sense, from the wiki page:
The main algorithm then uses each 512-bit message block in turn to modify the state
It sounds like a state machine that uses fixed memory (states) irrespective of the size of the stream.
The only optimization that can be done here, afaik, is that if the md5checksum flag is enabled (default), we could download the arff file in chunks and keep updating the md5, which would save us some time. It might add some code complexity though (we would have to manage the chunking with urllib). Do the maintainers think it is worth it?
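The fixed-state point above is easy to demonstrate: `hashlib.md5` produces the same digest whether fed the whole buffer at once or in chunks, and the incremental path holds only the hash's internal state in memory (a self-contained sketch, not code from the PR):

```python
import hashlib

data = b"x" * (1 << 20)  # 1 MiB of sample data

# One-shot digest over the whole buffer.
one_shot = hashlib.md5(data).hexdigest()

# Incremental digest: md5 keeps a fixed-size internal state, so
# feeding it 512-byte chunks uses constant memory regardless of
# how large the input stream is.
incremental = hashlib.md5()
for start in range(0, len(data), 512):
    incremental.update(data[start:start + 512])

assert incremental.hexdigest() == one_shot
```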
Currently the data is traversed twice: first when the stream is read into memory, second when hashlib.md5 goes through the data.
Would the goal of this optimization be to avoid reading the data twice?
In your testing, do you find that the checksum increases the time it takes to load the data?
Added the optimization (it now reads the stream only once and keeps updating the md5 while reading).
In terms of memory, it will not add any significant overhead, as I mentioned, since md5 uses a state machine.
In terms of time / cpu, here is a quick test result for a medium size dataset:
Existing openml (master branch)
$ python -m timeit -c "from sklearn.datasets import fetch_openml; fetch_openml(data_id=554, data_home=None)"
1 loop, best of 5: 12.4 sec per loop
This PR:
$ python -m timeit -c "from sklearn.datasets import fetch_openml; fetch_openml(data_id=554, data_home=None, verify_checksum=False)"
1 loop, best of 5: 12.3 sec per loop
$ python -m timeit -c "from sklearn.datasets import fetch_openml; fetch_openml(data_id=554, data_home=None, verify_checksum=True)"
1 loop, best of 5: 12.6 sec per loop
This is a single iteration, but it looks like the checksum is not slowing things down significantly.
What was the time before the optimization? (831d78b)
I do not think this actually is better than before. The way to get a benefit from chunking is to verify the data at the same time it is being decoded by the arff parser.
The chunking this PR is doing still ends up reading the data twice: once for checking, and again when the data is parsed by the arff parser.
Data was already being read twice, once here and once again here.
This PR maintains the number of times the stream is read.
Making this validation occur during arff decoding would:
- Be wasteful if done each time decode is called (currently validation is done only when data is downloaded from the internet; future decodes are from a local cached file)
- Involve adding a check to the arff decode logic, which is in the externals module.
Would we prefer to modify this?
I would prefer not to change the arff module. I suspect parsing the arff file needs the whole thing in memory. I am happy with the previous non-chunked version.
831d78b would make the stream be read 3 times: once at download (non-chunked), once to validate (all in one go), and once to decode (arff).
In my testing the validation overhead is insignificant in terms of memory or cpu, so I would not worry too much about it. If you agree, I will push a revert commit.
Okay, let's keep the current version.
sklearn/datasets/openml.py
Outdated
fsrc = gzip.GzipFile(fileobj=fsrc, mode='rb')
bytes_content = fsrc.read()
if expected_md5_checksum:
    # validating checksum reads and consumes the stream
Force-pushed from 2b18909 to 10ecf9a
sklearn/datasets/openml.py
Outdated
    break
if expected_md5:
    file_md5.update(data)
fsrc_bytes += data
I am concerned that every assignment here will create a new bytes object, since bytes is immutable. Would bytearray and bytearray.extend be better in this case?
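The suggested shape of the loop can be sketched as below. The function name `read_chunked` is hypothetical; the point is that `bytearray.extend` appends in place, whereas `bytes += data` would copy the accumulated buffer on every iteration:

```python
import hashlib
from io import BytesIO


def read_chunked(fsrc, expected_md5=None, chunk_size=512):
    # bytearray is mutable: extend() appends in place, avoiding a
    # fresh bytes object (and a full copy) on every iteration.
    fsrc_bytes = bytearray()
    file_md5 = hashlib.md5() if expected_md5 is not None else None
    while True:
        data = fsrc.read(chunk_size)
        if not data:
            break
        if file_md5 is not None:
            file_md5.update(data)
        fsrc_bytes.extend(data)
    if file_md5 is not None and file_md5.hexdigest() != expected_md5:
        raise ValueError("md5 checksum mismatch")
    # Convert back to immutable bytes once, at the end.
    return bytes(fsrc_bytes)
```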
I don't know why the CI is failing now (some installation step in py35_ubuntu_atlas). Has anyone seen this before?
return bytearray instead of new bytes Co-Authored-By: Thomas J Fan <thomasjpfan@gmail.com>
Any thoughts on this PR? I have updated with master...
@cmarmo Can we add this PR to your radar as well?
@rth @thomasjpfan are all the comments addressed here? Are you happy enough? :)
Seeing the memory usage:
On master:
from sklearn.datasets import fetch_openml
%memit fetch_openml(data_id=554, data_home="data_home")
# peak memory: 1266.97 MiB, increment: 1175.53 MiB
This PR:
%memit fetch_openml(data_id=554, data_home="data_home")
# peak memory: 1403.14 MiB, increment: 1313.07 MiB
sklearn/datasets/_openml.py
Outdated
if expected_md5 does not match actual md5-checksum of stream
"""
fsrc_bytes = bytearray()
file_md5 = hashlib.md5() if expected_md5_checksum else None
With the early exiting above:
- file_md5 = hashlib.md5() if expected_md5_checksum else None
+ file_md5 = hashlib.md5()
sklearn/datasets/_openml.py
Outdated
------
ValueError :
    if expected_md5 does not match actual md5-checksum of stream
"""
We can early exit here:
if expected_md5 is None:
return fsrc.read()
sklearn/datasets/_openml.py
Outdated
data = fsrc.read(chunk_size)
if not data:
    break
if expected_md5:
With the early exiting:
file_md5.update(data)

modified_gzip.write(data)
# simulate request to return modified file
mocked_openml_url = sklearn.datasets._openml.urlopen
Leave a comment that sklearn.datasets._openml.urlopen is already mocked by _monkey_patch_webbased_functions.
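The test pattern being referenced, patching a module-level `urlopen` so the loader reads prepared (here, deliberately corrupted) bytes instead of hitting the network, can be sketched generically. Everything below is a stand-in, not the scikit-learn test itself; `fetch_bytes` is a hypothetical loader:

```python
from io import BytesIO
from unittest import mock
import urllib.request


def fetch_bytes(url):
    # Stand-in for a loader that reaches the network through a
    # module-level urlopen, which tests can patch out.
    return urllib.request.urlopen(url).read()


# The patch replaces urlopen for the duration of the with-block, so the
# loader receives our fake payload instead of a real HTTP response.
with mock.patch("urllib.request.urlopen",
                return_value=BytesIO(b"corrupted payload")):
    content = fetch_bytes("https://example.invalid/data.arff")
```

A checksum test would then assert that feeding this corrupted payload into the validation path raises `ValueError`.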
Please add an entry to the change log at
flake8 Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
…h28/scikit-learn into md5checksum_validate_openml
assert exc.match("1666876")

# cleanup fake local file
os.remove(corrupt_copy)
The removal of the tmpdir is handled by pytest. (Also, this is causing the Windows builds to fail.)
- os.remove(corrupt_copy)
Force-pushed from 8c049a0 to c55f64a
#17053 is merged now!
I have merged master in to resolve conflicts and will try to review in the next few days. (Though don't wait for merging if there are enough reviewers.) Thanks for your patience @shashanksingh28!
rth
left a comment
Actually LGTM, thanks @shashanksingh28 !
thomasjpfan
left a comment
LGTM
@shashanksingh28 Thank you for your patience on this!
This is an accomplishment! This PR is one of the last two from the Aug 2019 NYC sprint. (I just pinged on the other one.) Congrats, everyone. @thomasjpfan can we remove the "Waiting for Reviewer" label on this one? Thanks!
jnothman
left a comment
Nice! Great work, @shashanksingh28, and a much neater solution than some of those we tried on the way!!
Thanks @jnothman, @thomasjpfan and @rth for your help with this :) Feels nice to have my first contribution to scikit-learn!
…it-learn#14800) Co-authored-by: Thomas J Fan <thomasjpfan@gmail.com> Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com> Co-authored-by: Roman Yurchak <rth.yurchak@gmail.com>
Reference Issues/PRs
Fixes #11816
What does this implement/fix? Explain your changes.
When fetching an OpenML dataset, metadata about the dataset is fetched via https://openml.org/api/v1/json/data/, which includes the latest file version to download, its md5-checksum, etc. This PR adds functionality to verify the md5-checksum of the downloaded file against the one provided via the API. If the validation fails, it raises a ValueError stating the same.

Any other comments?

Most files in the local tests sklearn.datasets.tests.data.openml folder do not match their checksums. This may be because of differences between the metadata versions and the actual files downloaded.
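Since the artificially shortened test files need freshly computed checksums stored locally, a small helper like the following could generate them. This is a hypothetical utility for illustration, not part of the PR; it hashes a file in chunks so memory use stays constant:

```python
import hashlib


def file_md5(path, chunk_size=8192):
    # Stream the file in fixed-size chunks; md5's internal state is
    # constant-size, so large files add no memory overhead.
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()
```

Running this over each file in the test data folder yields the checksums to store in place of the OpenML-provided ones.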