ENH Verify md5-checksums received from openml arff file metadata #14800
Conversation
Force-pushed from 6281ea2 to df4c049
Most files in the local tests sklearn.datasets.tests.data.openml folder
do not match their checksums. This may be because of differences between
the metadata versions and the actual files downloaded.
Those files were artificially shortened to fit within the package. Feel
free to calculate and store their checksums rather than those on OpenML
@thomasjpfan ready for your review, incorporates what we discussed last time
sklearn/datasets/openml.py
Outdated
BytesIO stream with the same content as input_stream for consumption
"""
with closing(input_stream):
    bytes_content = input_stream.read()
Can we do this without consuming the whole stream? https://docs.python.org/3/library/hashlib.html?highlight=hashlib#examples
I don't think so. I could only think of this: to verify the hash we need to read the stream, so we read it and then return a new stream to keep the method's API consistent.
Did you mean consuming the stream in chunks and updating the hashlib.md5 instance with each chunk instead of reading it all in one go? I don't think that would make any difference.
Yes I mean in chunks.
Currently, with this PR the stream will be read twice: once during the md5 check and again by the caller of _open_openml_url.
Another option is to have _open_openml_url return the byte content such that the callers do not need to call read.
Done: implemented the second option, so the stream is now read once and its content is passed along.
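The approach being discussed can be sketched as follows. This is a minimal illustration, not the PR's actual code: the helper name `read_and_verify` and the error message are hypothetical; the idea is that the stream is read exactly once, verified, and the raw bytes are handed back so callers never call `read()` themselves.

```python
import hashlib
from io import BytesIO


def read_and_verify(stream, expected_md5=None):
    # Read the stream exactly once; callers receive the raw bytes
    # instead of a stream, so nothing needs to read it a second time.
    content = stream.read()
    if expected_md5 is not None:
        actual_md5 = hashlib.md5(content).hexdigest()
        if actual_md5 != expected_md5:
            raise ValueError(
                "md5 checksum of local file does not match expected: "
                f"{actual_md5} != {expected_md5}"
            )
    return content


payload = b"openml arff payload"
checksum = hashlib.md5(payload).hexdigest()
content = read_and_verify(BytesIO(payload), checksum)
```

A mismatched checksum raises `ValueError` instead of silently returning corrupted content.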
@ingrid88 @kellycarmody
Force-pushed from d53fe3d to 831d78b
@thomasjpfan changes ready for your review. Thanks.
sklearn/datasets/openml.py
Outdated
-------
bytes
"""
actual_md5_checksum = hashlib.md5(bytes_content).hexdigest()
I'm wondering if we should checksum in chunks. Could you please take a medium sized OpenML dataset (e.g. MNIST https://www.openml.org/d/554) and see whether computing the checksum the above way is still reasonable with respect to memory usage (see memory_profiler)?
It does not make any difference (see test gist)
I guess it makes sense, from the wiki page:
The main algorithm then uses each 512-bit message block in turn to modify the state
It sounds like a state machine that uses fixed memory (states) irrespective of the size of the stream.
The only optimization that can be done here, afaik, is that if the md5checksum flag is enabled (default), we could download the arff file in chunks and keep updating the md5, which would save us some time. It might add some code complexity though (we would have to manage the chunking with urllib). Do the maintainers think it is worth it?
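The fixed-state point above is easy to demonstrate: `hashlib.md5` produces the same digest whether fed the whole buffer at once or in chunks, and the incremental path holds only the hash's internal state in memory (a self-contained sketch, not code from the PR):

```python
import hashlib

data = b"x" * (1 << 20)  # 1 MiB of sample data

# One-shot digest over the whole buffer.
one_shot = hashlib.md5(data).hexdigest()

# Incremental digest: md5 keeps a fixed-size internal state, so
# feeding it 512-byte chunks uses constant memory regardless of
# how large the input stream is.
incremental = hashlib.md5()
for start in range(0, len(data), 512):
    incremental.update(data[start:start + 512])

assert incremental.hexdigest() == one_shot
```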
Currently the data is traversed twice: first when the stream is read into memory, second when hashlib.md5 goes through the data.
Would the goal of this optimization be to avoid reading the data twice?
In your testing, do you find that the checksum increases the time it takes to load the data?
Added the optimization (it now reads the stream only once and keeps updating the md5 while reading).
In terms of memory, it will not add any significant overhead, as I mentioned, since md5 uses a state machine.
In terms of time / cpu, here is a quick test result for a medium size dataset:
Existing openml (master branch)
$ python -m timeit -c "from sklearn.datasets import fetch_openml; fetch_openml(data_id=554, data_home=None)"
1 loop, best of 5: 12.4 sec per loop
This PR:
$ python -m timeit -c "from sklearn.datasets import fetch_openml; fetch_openml(data_id=554, data_home=None, verify_checksum=False)"
1 loop, best of 5: 12.3 sec per loop
$ python -m timeit -c "from sklearn.datasets import fetch_openml; fetch_openml(data_id=554, data_home=None, verify_checksum=True)"
1 loop, best of 5: 12.6 sec per loop
This is a single iteration, but it looks like the checksum is not slowing things down significantly.
What was the time before the optimization? (831d78b)
I do not think this actually is better than before. The way to get a benefit from chunking is to verify the data at the same time it is being decoded by the arff parser.
The chunking this PR is doing still ends up reading the data twice: once for checking, and again when the data is parsed by the arff parser.
Data was already being read twice, once here and once again here.
This PR maintains the number of times the stream is read.
Making this validation occur during arff decoding would:
- Be wasteful if done each time decode is called (currently validation is done only when data is downloaded from the internet; future decodes are from a local cached file)
- Involve adding a check to the arff decode logic, which is in the externals module.
Would we prefer to modify this?
I would prefer not to change the arff module. I suspect parsing the arff file needs the whole thing in memory. I am happy with the previous non-chunked version.
831d78b would make the stream be read 3 times: once at download (non-chunked), once to validate (all in one go), and once to decode (arff).
In my testing the validation overhead is insignificant in terms of memory or cpu, so I would not worry too much about it. If you agree, I will push a revert commit.
Okay, let's keep the current version.
sklearn/datasets/openml.py
Outdated
fsrc = gzip.GzipFile(fileobj=fsrc, mode='rb')
bytes_content = fsrc.read()
if expected_md5_checksum:
    # validating checksum reads and consumes the stream
Force-pushed from 2b18909 to 10ecf9a
sklearn/datasets/openml.py
Outdated
    break
if expected_md5:
    file_md5.update(data)
fsrc_bytes += data
I am concerned that every assignment here will create a new bytes object, since bytes is immutable. Would bytearray and bytearray.extend be better in this case?
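The suggested shape of the loop can be sketched as below. The function name `read_chunked` is hypothetical; the point is that `bytearray.extend` appends in place, whereas `bytes += data` would copy the accumulated buffer on every iteration:

```python
import hashlib
from io import BytesIO


def read_chunked(fsrc, expected_md5=None, chunk_size=512):
    # bytearray is mutable: extend() appends in place, avoiding a
    # fresh bytes object (and a full copy) on every iteration.
    fsrc_bytes = bytearray()
    file_md5 = hashlib.md5() if expected_md5 is not None else None
    while True:
        data = fsrc.read(chunk_size)
        if not data:
            break
        if file_md5 is not None:
            file_md5.update(data)
        fsrc_bytes.extend(data)
    if file_md5 is not None and file_md5.hexdigest() != expected_md5:
        raise ValueError("md5 checksum mismatch")
    # Convert back to immutable bytes once, at the end.
    return bytes(fsrc_bytes)
```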
I don't know why the CI is failing now (some installation step in py35_ubuntu_atlas). Has anyone seen this before?
return bytearray instead of new bytes Co-Authored-By: Thomas J Fan <thomasjpfan@gmail.com>
Any thoughts on this PR? I have updated with master...
@cmarmo Can we add this PR to your radar as well?
@rth @thomasjpfan are all the comments addressed here? Are you happy enough? :)
Seeing the memory usage:
On master:
from sklearn.datasets import fetch_openml
%memit fetch_openml(data_id=554, data_home="data_home")
# peak memory: 1266.97 MiB, increment: 1175.53 MiB
This PR:
%memit fetch_openml(data_id=554, data_home="data_home")
# peak memory: 1403.14 MiB, increment: 1313.07 MiB
sklearn/datasets/_openml.py
Outdated
if expected_md5 does not match actual md5-checksum of stream
"""
fsrc_bytes = bytearray()
file_md5 = hashlib.md5() if expected_md5_checksum else None
With the early exiting above:
- file_md5 = hashlib.md5() if expected_md5_checksum else None
+ file_md5 = hashlib.md5()
sklearn/datasets/_openml.py
Outdated
------
ValueError :
    if expected_md5 does not match actual md5-checksum of stream
"""
We can early exit here:
if expected_md5 is None:
return fsrc.read()
sklearn/datasets/_openml.py
Outdated
data = fsrc.read(chunk_size)
if not data:
    break
if expected_md5:
With the early exiting:
file_md5.update(data)

modified_gzip.write(data)
# simulate request to return modified file
mocked_openml_url = sklearn.datasets._openml.urlopen
Leave a comment that sklearn.datasets._openml.urlopen is already mocked by _monkey_patch_webbased_functions.
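The test pattern being referenced, patching a module-level `urlopen` so the loader reads prepared (here, deliberately corrupted) bytes instead of hitting the network, can be sketched generically. Everything below is a stand-in, not the scikit-learn test itself; `fetch_bytes` is a hypothetical loader:

```python
from io import BytesIO
from unittest import mock
import urllib.request


def fetch_bytes(url):
    # Stand-in for a loader that reaches the network through a
    # module-level urlopen, which tests can patch out.
    return urllib.request.urlopen(url).read()


# The patch replaces urlopen for the duration of the with-block, so the
# loader receives our fake payload instead of a real HTTP response.
with mock.patch("urllib.request.urlopen",
                return_value=BytesIO(b"corrupted payload")):
    content = fetch_bytes("https://example.invalid/data.arff")
```

A checksum test would then assert that feeding this corrupted payload into the validation path raises `ValueError`.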
Please add an entry to the change log at
flake8 Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
…h28/scikit-learn into md5checksum_validate_openml
assert exc.match("1666876")

# cleanup fake local file
os.remove(corrupt_copy)
The removal of the tmpdir is handled by pytest. (Also, this is causing the Windows builds to fail.)
- os.remove(corrupt_copy)
Force-pushed from 8c049a0 to c55f64a
#17053 is merged now!
I have merged master in to resolve conflicts and will try to review in the next few days. (Though don't wait for merging if there are enough reviewers.) Thanks for your patience @shashanksingh28!
rth
left a comment
Actually LGTM, thanks @shashanksingh28 !
thomasjpfan
left a comment
LGTM
@shashanksingh28 Thank you for your patience on this!
This is an accomplishment! This PR is one of the last two from the Aug 2019 NYC sprint. (I just pinged on the other one.) Congrats, everyone. @thomasjpfan can we remove the "Waiting for Reviewer" label on this one? Thanks!
jnothman
left a comment
Nice! Great work, @shashanksingh28, and a much neater solution than some of those we tried on the way!!
Thanks @jnothman, @thomasjpfan and @rth for your help with this :) Feels nice to have my first contribution to scikit-learn!
…it-learn#14800) Co-authored-by: Thomas J Fan <thomasjpfan@gmail.com> Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com> Co-authored-by: Roman Yurchak <rth.yurchak@gmail.com>
Reference Issues/PRs
Fixes #11816
What does this implement/fix? Explain your changes.
When fetching an OpenML dataset, metadata about the dataset is fetched via https://openml.org/api/v1/json/data/, which includes the latest file version to download, its md5-checksum, etc. This PR adds functionality to verify the md5-checksum of the downloaded file against the one provided via the API. If the validation fails, it raises a ValueError stating the same.

Any other comments?

Most files in the local tests sklearn.datasets.tests.data.openml folder do not match their checksums. This may be because of differences between the metadata versions and the actual files downloaded.
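Since the artificially shortened test files need freshly computed checksums stored locally, a small helper like the following could generate them. This is a hypothetical utility for illustration, not part of the PR; it hashes a file in chunks so memory use stays constant:

```python
import hashlib


def file_md5(path, chunk_size=8192):
    # Stream the file in fixed-size chunks; md5's internal state is
    # constant-size, so large files add no memory overhead.
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()
```

Running this over each file in the test data folder yields the checksums to store in place of the OpenML-provided ones.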