Check mirror cache before attempting download#6987
Conversation
|
Hi there @pllim 👋 - thanks for the pull request! I'm just a friendly 🤖 that checks for issues related to the changelog and making sure that this pull request is milestoned and labeled correctly. This is mainly intended for the maintainers, so if you are not a maintainer you can ignore this, and a maintainer will let you know if any action is required on your part 😃. Everything looks good from my point of view! 👍 If there are any issues with this message, please report them here. |
b8313c5 to
1556b28
Compare
astropy/utils/data.py
Outdated
| return url2hash[url_key] | ||
| # If there is a cached copy from mirror, use it. | ||
| else: | ||
| for cur_url in (conf.dataurl, conf.dataurl_alias): |
There was a problem hiding this comment.
Note: I am open to a more elegant way to do this.
There was a problem hiding this comment.
You can decide if it's really "elegant", but this at least has the advantage of not requiring that dataurl_alias be specified:
dataurls_to_alias = {} # this could probably just sit at module-level global cope?
# do this whenever the first time is the dataurl is actually *accessed*
from urllib.requests import urlopen
u = urlopen(conf.dataurl)
u.close()
dataurls_to_alias[conf.dataurl] = u.geturl()
# then at the line this comment is at, do this:
urls_to_try = [conf.dataurl]
if conf.dataurl in dataurls_to_alias:
urls_to_try.append(dataurls_to_alias[conf.dataurl])
for cur_url in urls_to_try:
...
There was a problem hiding this comment.
Okay, I'll give it a try. Thank you!
There was a problem hiding this comment.
Looking at this more, this solution will be harder to backport. I'll implement Python 3 only code in this PR and comment on how to make it Python 2 compatible.
|
|
||
|
|
||
| @pytest.mark.remote_data('astropy') | ||
| def test_get_cached_urls(): |
There was a problem hiding this comment.
Note: This is now indirectly tested in test_download_parallel above. No need to have duplicate tests that do the same thing.
There was a problem hiding this comment.
Maybe just note in a comment in test_download_parallel that it does this? Just want to make sure someone in the future who might decide to edit test_download_parallel knows that it should also be checking this
|
Nice! Thanks for this. I have to wrap my head around whether the test you have is sufficient...but looks good otherwise. |
| _get_download_cache_locs, get_cached_urls) | ||
|
|
||
| main_url = 'http://data.astropy.org/intersphinx/README' | ||
| mirror_url = 'http://www.astropy.org/astropy-data/intersphinx/README' |
There was a problem hiding this comment.
is astropy.org really a mirror, isn't it more fool proof to link directly to github as the mirror here?
There was a problem hiding this comment.
It is the mirror listed in astropy.utils.data.conf. I could use GitHub but it would be inconsistent with actual config -- @astrofrog ?
There was a problem hiding this comment.
oh, OK, then never mind my comment.
There was a problem hiding this comment.
I like the general idea, @pllim, but I don't like requiring the user to know what the alias is. In the inline comments I offer an idea for how to determine it, though. I tested and it seems to work just as you'd like for the astropy data case!
astropy/utils/data.py
Outdated
| Configuration parameters for `astropy.utils.data`. | ||
| """ | ||
|
|
||
| # NOTE: Make sure all the dataurl values have trailing "/". |
There was a problem hiding this comment.
Maybe put this in the "help" string (on line 48) instead? While unlikely, random user might decide to re-assign the data site to some local copy and they should know the trailing / is necessary.
|
|
||
|
|
||
| @pytest.mark.remote_data('astropy') | ||
| def test_get_cached_urls(): |
There was a problem hiding this comment.
Maybe just note in a comment in test_download_parallel that it does this? Just want to make sure someone in the future who might decide to edit test_download_parallel knows that it should also be checking this
astropy/utils/data.py
Outdated
| return url2hash[url_key] | ||
| # If there is a cached copy from mirror, use it. | ||
| else: | ||
| for cur_url in (conf.dataurl, conf.dataurl_alias): |
There was a problem hiding this comment.
You can decide if it's really "elegant", but this at least has the advantage of not requiring that dataurl_alias be specified:
dataurls_to_alias = {} # this could probably just sit at module-level global cope?
# do this whenever the first time is the dataurl is actually *accessed*
from urllib.requests import urlopen
u = urlopen(conf.dataurl)
u.close()
dataurls_to_alias[conf.dataurl] = u.geturl()
# then at the line this comment is at, do this:
urls_to_try = [conf.dataurl]
if conf.dataurl in dataurls_to_alias:
urls_to_try.append(dataurls_to_alias[conf.dataurl])
for cur_url in urls_to_try:
...
1556b28 to
9c731e5
Compare
| astropy.utils | ||
| ^^^^^^^^^^^^^ | ||
|
|
||
| - ``download_file`` function will check for cache downloaded from mirror URL |
There was a problem hiding this comment.
NOTE: If we decide not to backport, this milestone needs to be moved.
| # Check if URL is Astropy data server, which has alias, and cache it. | ||
| if (url_key.startswith(conf.dataurl) and | ||
| conf.dataurl not in _dataurls_to_alias): | ||
| with urllib.request.urlopen(conf.dataurl, timeout=timeout) as remote: |
There was a problem hiding this comment.
NOTE: Use contextlib for PY2 compat (see example in 2.x code just a few lines below this one).
| # Now test that download_file looks in mirror's cache before download. | ||
| # https://github.com/astropy/astropy/issues/6982 | ||
| dldir, urlmapfn = _get_download_cache_locs() | ||
| with shelve.open(urlmapfn) as url2hash: |
There was a problem hiding this comment.
NOTE: Use _open_shelve for PY2 compat (see example in 2.x code in data.py).
|
Merged! @pllim - I think we do want to backport this, if we can. Do you think you can write a PR that does the backporting since it requires significant code changes? What I'm not sure about is the best way to do the backporting given that it involves substantial changes. Probably the best thing is to make a branch from |
|
yes, if we don't directly cherry-pick to the branch but through a feature branch, then the scripts are not recognizing it (currently, hopefully they will one day). But, we can always flag this up for manual backport, so then the script skips the checks. |
|
If it's easier to do manual backport, I don't mind opening another PR but it'll have to be next year. ;) |
Check mirror cache before attempting download
Fix #6982