Check mirror cache before attempting download #6987

Merged
eteq merged 2 commits into astropy:master from pllim:fix-utils-cache
Dec 23, 2017

Conversation

@pllim
Member

@pllim pllim commented Dec 14, 2017

Fix #6982

@pllim pllim added this to the v2.0.4 milestone Dec 14, 2017
@pllim pllim requested a review from adrn December 14, 2017 19:44
@astropy-bot

astropy-bot bot commented Dec 14, 2017

Hi there @pllim 👋 - thanks for the pull request! I'm just a friendly 🤖 that checks for issues related to the changelog and making sure that this pull request is milestoned and labeled correctly. This is mainly intended for the maintainers, so if you are not a maintainer you can ignore this, and a maintainer will let you know if any action is required on your part 😃.

Everything looks good from my point of view! 👍

If there are any issues with this message, please report them here.

        return url2hash[url_key]
    # If there is a cached copy from mirror, use it.
    else:
        for cur_url in (conf.dataurl, conf.dataurl_alias):
Member Author

Note: I am open to a more elegant way to do this.

Member

You can decide if it's really "elegant", but this at least has the advantage of not requiring that dataurl_alias be specified:

dataurls_to_alias = {}  # this could probably just sit at module-level global scope?

# do this the first time the dataurl is actually *accessed*
from urllib.request import urlopen

u = urlopen(conf.dataurl)
u.close()
dataurls_to_alias[conf.dataurl] = u.geturl()

# then at the line this comment is at, do this:
urls_to_try = [conf.dataurl]
if conf.dataurl in dataurls_to_alias:
    urls_to_try.append(dataurls_to_alias[conf.dataurl])
for cur_url in urls_to_try:
    ...

Member Author


Okay, I'll give it a try. Thank you!

Member Author


Looking at this more, this solution will be harder to backport. I'll implement Python 3-only code in this PR and comment on how to make it Python 2 compatible.



@pytest.mark.remote_data('astropy')
def test_get_cached_urls():
Member Author


Note: This is now indirectly tested in test_download_parallel above. No need to have duplicate tests that do the same thing.

Member


Maybe just note in a comment in test_download_parallel that it does this? I just want to make sure that someone who might decide to edit test_download_parallel in the future knows that it should also be checking this.

@adrn
Member

adrn commented Dec 14, 2017

Nice! Thanks for this. I have to wrap my head around whether the test you have is sufficient...but looks good otherwise.

_get_download_cache_locs, get_cached_urls)

main_url = 'http://data.astropy.org/intersphinx/README'
mirror_url = 'http://www.astropy.org/astropy-data/intersphinx/README'
Member


Is astropy.org really a mirror? Isn't it more foolproof to link directly to GitHub as the mirror here?

Member Author


It is the mirror listed in astropy.utils.data.conf. I could use GitHub, but it would be inconsistent with the actual config -- @astrofrog?

Member


Oh, OK, then never mind my comment.

Member

@eteq eteq left a comment


I like the general idea, @pllim, but I don't like requiring the user to know what the alias is. In the inline comments I offer an idea for how to determine it, though. I tested and it seems to work just as you'd like for the astropy data case!

Configuration parameters for `astropy.utils.data`.
"""

# NOTE: Make sure all the dataurl values have trailing "/".
Member


Maybe put this in the "help" string (on line 48) instead? While unlikely, a random user might decide to re-assign the data site to some local copy, and they should know the trailing / is necessary.



Member Author

@pllim pllim left a comment


@eteq , I think I have addressed your comments.

astropy.utils
^^^^^^^^^^^^^

- ``download_file`` function will check for cache downloaded from mirror URL
Member Author


NOTE: If we decide not to backport, this milestone needs to be moved.

    # Check if URL is Astropy data server, which has alias, and cache it.
    if (url_key.startswith(conf.dataurl) and
            conf.dataurl not in _dataurls_to_alias):
        with urllib.request.urlopen(conf.dataurl, timeout=timeout) as remote:
Member Author


NOTE: Use contextlib for PY2 compat (see example in 2.x code just a few lines below this one).
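For reference, a minimal sketch of that contextlib pattern, which works on Python 2 because `contextlib.closing` wraps any object that has a `close()` method. The `FakeResponse` class below is a hypothetical stand-in for the response object `urlopen` returns, so the sketch runs without network access:

```python
import contextlib


class FakeResponse:
    """Hypothetical stand-in for the response object urlopen returns."""

    def __init__(self, url):
        self._url = url
        self.closed = False

    def geturl(self):
        # The real response reports the final URL after any redirects.
        return self._url

    def close(self):
        self.closed = True


# closing() guarantees close() is called on exit, mirroring the
# 2.x-compatible pattern suggested for the real urlopen call, whose
# return value on Python 2 is not a context manager.
with contextlib.closing(FakeResponse('http://data.astropy.org/')) as remote:
    resolved = remote.geturl()
```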

# Now test that download_file looks in mirror's cache before download.
# https://github.com/astropy/astropy/issues/6982
dldir, urlmapfn = _get_download_cache_locs()
with shelve.open(urlmapfn) as url2hash:
Member Author


NOTE: Use _open_shelve for PY2 compat (see example in 2.x code in data.py).
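The same `contextlib.closing` trick covers the shelve case, since `shelve.Shelf` only became a context manager in Python 3. A sketch under made-up paths (a temporary directory stands in for the real download cache location):

```python
import contextlib
import os
import shelve
import tempfile

# Hypothetical stand-in paths; the real code uses the download cache dir
# returned by _get_download_cache_locs().
tmpdir = tempfile.mkdtemp()
urlmapfn = os.path.join(tmpdir, 'urlmap')

# Write a URL-to-hash entry, then read it back.  contextlib.closing makes
# shelve usable in a with-statement even where Shelf is not a context
# manager (as on Python 2).
with contextlib.closing(shelve.open(urlmapfn)) as url2hash:
    url2hash['http://data.astropy.org/file'] = 'abc123'

with contextlib.closing(shelve.open(urlmapfn)) as url2hash:
    cached = url2hash['http://data.astropy.org/file']
```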

@eteq eteq merged commit 2203fa5 into astropy:master Dec 23, 2017
@eteq
Member

eteq commented Dec 23, 2017

Merged!

@pllim - I think we do want to backport this, if we can. Do you think you can write a PR that does the backporting since it requires significant code changes?

What I'm not sure about is the best way to do the backporting given that it involves substantial changes. Probably the best thing is to make a branch from 2.0.x, starting with a git cherry-pick 2203fa5f4c1868555f156929d1c06290d3548989, and then add a second commit and create a PR against the 2.0.x branch? But perhaps @bsipocz or @astrofrog know if this will break the release-related scripts?

@bsipocz
Member

bsipocz commented Dec 23, 2017

Yes, if we don't cherry-pick directly to the branch but go through a feature branch, then the scripts don't recognize it (currently; hopefully they will one day). But we can always flag this for a manual backport, so the script skips the checks.

@pllim
Member Author

pllim commented Dec 23, 2017

If it's easier to do a manual backport, I don't mind opening another PR, but it'll have to be next year. ;)

bsipocz pushed a commit that referenced this pull request Dec 31, 2017
Check mirror cache before attempting download
@pllim pllim deleted the fix-utils-cache branch January 8, 2018 15:41
@pllim
Member Author

pllim commented Jan 10, 2018

Looks like @bsipocz beat me to it (481b67c). Thank you!
