[MRG] ENH: dataset-fetching using figshare, partial download and checksum #7429
nelson-liu wants to merge 25 commits into scikit-learn:master
Conversation
Sorry for not following up on this. Will review shortly.

@amueller no problem, this sort of thing is pretty tedious to check.
jnothman
left a comment
From a code perspective this otherwise LGTM
sklearn/datasets/covtype.py
Outdated
-URL = ('http://archive.ics.uci.edu/ml/'
-       'machine-learning-databases/covtype/covtype.data.gz')
+URL = ('https://ndownloader.figshare.com/files/5976039')
no need to keep parentheses. They were there for line-wrapped strings only.
logger = logging.getLogger(__name__)

BASE_URL = "http://vis-www.cs.umass.edu/lfw/"
Hmm. Can we assume no backwards compatibility issues with these names? I'm fine with this.
Hmm, I believe so. I specifically tried to maintain backwards compatibility when changing the associated variable names.
sklearn/datasets/kddcup99.py
Outdated
-URL10 = ('http://archive.ics.uci.edu/ml/'
-         'machine-learning-databases/kddcup99-mld/kddcup.data_10_percent.gz')
+URL10 = ('https://ndownloader.figshare.com/files/5976042')
sklearn/datasets/kddcup99.py
Outdated
-URL = ('http://archive.ics.uci.edu/ml/'
-       'machine-learning-databases/kddcup99-mld/kddcup.data.gz')
+URL = ('https://ndownloader.figshare.com/files/5976045')
sklearn/datasets/rcv1.py
Outdated
    'https://ndownloader.figshare.com/files/5976060',
    'https://ndownloader.figshare.com/files/5976057'
]
URL_topics = ('https://ndownloader.figshare.com/files/5976048')
-URL = ("http://people.csail.mit.edu/jrennie/"
-       "20Newsgroups/20news-bydate.tar.gz")
+URL = ("https://ndownloader.figshare.com/files/5975967")
I've then compared the MD5 sums produced and found them identical except for generated images. However, I've realised that those MD5 sum checks should be part of the fetch; please add this. I'm also concerned about what we lose by throwing away the original URLs. Should we use a mapping rather than replacing them?
@jnothman good idea, but I'm not too sure where you want me to add the MD5 checks to. Could you elaborate a bit more?

Hmm... perhaps I'm being slow, but how would such a mapping work?
I would create a helper function that fetches a file from a particular URL and stores it at a path relative to data home, with an MD5 check. So it would be passed the URL and the expected MD5 among other args. Not sure if that's sufficient.
Not sure if a mapping is necessary.
@jnothman what I ended up doing instead was running md5 on the downloaded archives rather than the raw datafiles. I basically just wrote a general fetch_dataset function that, given a URL, path, and md5, downloads the URL to a temp location, checks the archive / raw data with the given MD5, then moves it to the proper location. Do you think this would be suitable, or is it greatly preferred to run md5 on the extracted files?

That's okay, except that it doesn't catch the case where files later get corrupted.
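The fetch_dataset approach described above (download to a temp location, verify the checksum, then move into place) could look roughly like this sketch; everything beyond the fetch_dataset name and its URL/path/md5 arguments is my own assumption, not the PR's actual code:

```python
import hashlib
import shutil
import tempfile
from os.path import join
from urllib.request import urlretrieve


def fetch_dataset(url, target_path, expected_md5):
    """Download url to a temp file, verify its MD5, then move it into place.

    The move happens only after the checksum passes, so a partial or
    corrupted download never ends up at target_path.
    """
    tmp_dir = tempfile.mkdtemp()
    tmp_path = join(tmp_dir, "download.tmp")
    try:
        urlretrieve(url, tmp_path)
        md5 = hashlib.md5()
        with open(tmp_path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                md5.update(chunk)
        if md5.hexdigest() != expected_md5:
            raise ValueError("MD5 mismatch for %s" % url)
        shutil.move(tmp_path, target_path)
    finally:
        # clean up the temp dir whether or not the download succeeded
        shutil.rmtree(tmp_dir, ignore_errors=True)
    return target_path
```

As jnothman notes, checking only at download time does not catch files that get corrupted on disk later, which argues for re-verifying cached files on each fetch.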
@jnothman I initially had written a function to just check the archives, but I decided to check the hashes of the generated files as well. However, it seems like I get different md5 hashes than you do. For example,
I agree. We should aim to merge this soon and certainly ensure it's available for the next release. Are you able to wrap it up @nelson-liu? Fix merge conflicts and I'll give it a final look.
When were you planning for the next release? In a bit of a crunch right now, but I should be able to get to this next weekend (unfortunately not the coming one).
Current vague release plan is "before June". I hope we can actually make it happen. So this is urgent only from the perspective of users seeking solutions in master to current broken functionality. We can get someone else to put in any finishing touches if you'd rather.
Bump?
Oops, this fell off my radar; I'll do it today. Thanks for the reminder @jnothman.
@jnothman ok, I think it should be good now (pending CI)... I manually reran some of the
CircleCI error seems genuine:
@lesteve thanks! I think it'd be best to remove the validation on the actual processed file, since I also realized that the hash of the generated file would change depending on Python version (2 vs 3)...
AppVeyor failure looks spurious... does someone want to rerun that test, or should I just push a spurious commit?
I'm sure we'll find something to nitpick about to make your next commit not spurious.
jnothman
left a comment
I would appreciate it if there were a bit more consistency across fetch functions as to how the checksum, URL and local path are related in code. We could of course just store this in a table globally, but I'm not sure that's so elegant either.
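One way to make the checksum/URL/path relationship uniform across fetchers is a small record type. This is only a sketch of the idea (the RemoteFile name and the placeholder checksum are mine; a similar construct was what ultimately landed in the follow-up PR):

```python
from collections import namedtuple

# One record per remote file: its local filename, where it lives
# remotely, and the checksum a fresh download must match.
RemoteFile = namedtuple("RemoteFile", ["filename", "url", "checksum"])

COVTYPE = RemoteFile(
    filename="covtype.data.gz",
    url="https://ndownloader.figshare.com/files/5976039",
    checksum="<md5-of-archive>",  # placeholder, not the real digest
)
```

Each dataset module would then expose one or more such records, and a single shared download helper would consume them, so the three pieces of information can never drift apart.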
if expected_checksum != _md5(path):
    # remove the corrupted file
    remove(path)
    raise ValueError("{} has an MD5 hash differing "
        "{}-").format(existing_size)):
    raise IOError("Server does not support the HTTP Range "
                  "header, cannot resume download.")
except:
except Exception. Otherwise it blocks KeyboardInterrupt
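To illustrate the point: KeyboardInterrupt derives from BaseException, not Exception, so `except Exception` lets Ctrl-C propagate while a bare `except` swallows it. A small self-contained demonstration (the interrupted function just stands in for a download loop):

```python
def interrupted():
    # stands in for a download loop the user interrupts with Ctrl-C
    raise KeyboardInterrupt


swallowed = False
try:
    try:
        interrupted()
    except Exception:
        # never reached: KeyboardInterrupt is not an Exception subclass
        swallowed = True
except KeyboardInterrupt:
    # the interrupt escapes the inner handler, as it should
    pass

assert swallowed is False
assert not issubclass(KeyboardInterrupt, Exception)
```

With a bare `except` in the inner block, swallowed would become True and the user's interrupt would be silently eaten.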
# get the amount of path_temp we've downloaded
existing_size = getsize(path_temp)
print("Resuming download from previous temp file, "
      "already have {} bytes".format(existing_size))
except:
    # delete the temp file and retry download of whole file
    remove(path_temp)
    print("Attempting to re-download file.")
mention "after {!r}".format(exc)?
temp_file = open(path_temp, "ab")
# get the amount of path_temp we've downloaded
existing_size = getsize(path_temp)
print("Resuming download from previous temp file, "
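The resume logic being reviewed here hinges on the HTTP Range header. A minimal sketch of building such a resuming request with the standard library (the function name is illustrative, not from the PR):

```python
from urllib.request import Request


def make_resume_request(url, existing_size):
    """Build a request asking the server for bytes from existing_size on.

    A server that honours Range replies 206 Partial Content; a plain
    200 reply means Range is unsupported and the whole file must be
    downloaded again from zero.
    """
    return Request(url, headers={"Range": "bytes={}-".format(existing_size)})


req = make_resume_request("https://example.com/data.gz", 1024)
```

The bytes fetched with this request would then be appended to the temp file opened in "ab" mode above.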
from ..externals import joblib

DATA_URL = "http://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.tgz"
Perhaps it is appropriate to leave the original URL in a comment.
(Or indeed to have it available as a mirror??)
print('downloading Cal. housing from %s to %s' % (DATA_URL, data_home))
archive_fileobj = BytesIO(urlopen(DATA_URL).read())
archive_path = join(data_home, "cal_housing.tgz")
expected_checksum = "130d0eececf165046ec4dc621d121d80"
It seems appropriate to me for this to appear as a global with the URL.
URL = ('http://archive.ics.uci.edu/ml/'
       'machine-learning-databases/kddcup99-mld/kddcup.data.gz')
URL10 = 'https://ndownloader.figshare.com/files/5976042'
I think it is now less clear what 10 means. Rename the variable to URL_10_PERCENT
-file_urls.append("%s_train.dat.gz" % URL)
-files = []
-for file_url in file_urls:
+for file_name, file_url in zip(FILE_NAMES, FILE_URLS):
If you already have zip, why not zip the checksums in too, rather than using a different data structure?
Also, a needless inconsistency with lfw.
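Zipping the checksums in alongside the names and URLs, as suggested, keeps all three sequences in lock-step. A sketch with placeholder values (the list names and contents are illustrative, not the module's real data):

```python
# hypothetical parallel sequences; values are placeholders
FILE_NAMES = ["samples.dat.gz", "topics.dat.gz"]
FILE_URLS = ["https://ndownloader.figshare.com/files/<id1>",
             "https://ndownloader.figshare.com/files/<id2>"]
FILE_CHECKSUMS = ["<md5-1>", "<md5-2>"]

for file_name, file_url, checksum in zip(FILE_NAMES, FILE_URLS,
                                         FILE_CHECKSUMS):
    # each iteration sees one coherent (name, url, checksum) triple,
    # so the download-and-verify call needs no index bookkeeping
    print(file_name, file_url, checksum)
```

This avoids carrying a separate dict of checksums that can fall out of sync with the URL list.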
Good work, @nelson-liu
Would you be able to fix this up to merge for the upcoming release, @nelson-liu? Or should we get someone else to do the final touch-ups?
Sorry for being so flaky --- it'd be good to get someone else to finish the last bits. I'll see if I can pitch in, though...
Closing in favor of #9240.
Reference Issue
addresses #7425, #8089
What does this implement/fix? Explain your changes.
Moved the dataset fetchers (fetch_...) to figshare and changed their URLs accordingly.
Any other comments?
I'm pretty sure CI doesn't test datasets that aren't already downloaded to the machine. Is there a way to test this PR besides going into all of the tests and setting download_if_missing=True?
@amueller I emailed the figshare details to your gmail (t3kcit)