
Cache preload #9182

Merged
pllim merged 12 commits into astropy:master from aarchiba:cache-preload
Oct 30, 2019

Conversation

@aarchiba
Contributor

Various tools for managing the astropy download cache. Perhaps most useful for people wanting disconnected operation.

It is possible to export part or all of the contents of the download cache as a ZIP file, which can then be loaded into astropy on another machine to populate its cache without requiring internet access. This may be of interest for machines running CI, so they don't re-download everything they need on every run - if the ZIP file can be provided to them somehow.

Also there's a new function to consistency-check the contents of a cache, in case there are unused items or it has become damaged somehow. Recall that the cache is user-wide, shared across multiple virtualenvs (possibly including a development one), and accessed concurrently by multiple processes.

Finally, the ZIP file mechanism is used to make sure that tests that require purging the cache don't leave the user with an empty cache needing to download all their downloadable files again. Probably only a minor inconvenience for most users with Net connections.

@jbcurtin

jbcurtin commented Sep 3, 2019

This seems like a step in the right direction. When performing CI builds with preloaded data, it would be nice to have an additional sha256 checksum for the ZIP file taken from os.environ. Combined with rechecking the hashes available inside the ZIP file, we'd know for certain the ZIP file's integrity hasn't been compromised.

If the environment variable fails to match the ZIP file's hash.hexdigest(), please raise an IOError.
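A minimal sketch of what that check could look like. The environment-variable name, function name, and IOError-on-mismatch behavior are all illustrative, not part of this PR:

```python
import hashlib
import os
import zipfile

def load_verified_zip(path, env_var="ASTROPY_CACHE_ZIP_SHA256"):
    """Open a cache ZIP only if its sha256 matches the env var (names made up)."""
    expected = os.environ.get(env_var)
    if expected is None:
        raise IOError(f"{env_var} is not set; cannot verify {path}")
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Hash the file in chunks so large ZIPs don't need to fit in memory.
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    if h.hexdigest() != expected.lower():
        raise IOError(f"sha256 mismatch for {path}")
    return zipfile.ZipFile(path)
```

The hashes recorded inside the ZIP would still be rechecked separately when the contents are imported into the cache.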

@saimn
Contributor

saimn commented Sep 4, 2019

One idea while quickly reading the recent cache issues: it could be convenient to have a command-line script to list the files in the cache(s), delete one or all files, export, import, get a copy of a file, etc. (astropy-cache [list|export|import|get|delete] ?)

edit: obviously this could be done by wrapping some of the functions added here ;)
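As a rough sketch of how such a wrapper might look, assuming the cache functions in astropy.utils.data (including the export/import functions this PR adds); the subcommand names and the exact mapping to those functions are guesses:

```python
import argparse

def build_parser():
    # Hypothetical "astropy-cache" CLI, per the suggestion above.
    parser = argparse.ArgumentParser(prog="astropy-cache")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("list", help="list cached URLs")
    exp = sub.add_parser("export", help="export the cache to a ZIP file")
    exp.add_argument("zipfile")
    imp = sub.add_parser("import", help="load a ZIP file into the cache")
    imp.add_argument("zipfile")
    get = sub.add_parser("get", help="copy one cached file out")
    get.add_argument("url")
    get.add_argument("dest")
    delete = sub.add_parser("delete", help="remove one URL, or everything")
    delete.add_argument("url", nargs="?")
    return parser

def main(argv=None):
    args = build_parser().parse_args(argv)
    # Deferred import; assumes the functions added/used by this PR.
    from astropy.utils import data
    if args.command == "list":
        for url in data.get_cached_urls():
            print(url)
    elif args.command == "export":
        data.export_download_cache(args.zipfile)
    elif args.command == "import":
        data.import_download_cache(args.zipfile)
    elif args.command == "get":
        import shutil
        shutil.copy(data.download_file(args.url, cache=True), args.dest)
    elif args.command == "delete":
        data.clear_download_cache(args.url)
```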

@mhvk
Contributor

mhvk commented Sep 5, 2019

Now looked a bit beyond my general 👍 at the top. Overall, the idea seems sensible.

However, the big question I have is how this is going to be discovered and used. Right now, most of utils is meant for internal astropy use, exposed mostly as one hopes things might be useful for others too (see http://docs.astropy.org/en/latest/utils/index.html; iers is the one exception and it has been argued that it in fact doesn't belong in utils partially for that reason...). Certainly, the way I have found things in utils is by looking at astropy code, not because of going through its scattered documentation (though as a developer I'm hardly a typical case...).

Is the idea to follow up this PR to use it in CI runs? (Would be great!) Or to use it in external packages like PINT (where you mentioned this PR)? If the latter, we'd need to think about how to make things more discoverable, perhaps even go the route suggested by @saimn above of using them in scripts.

@pllim
Member

pllim commented Sep 5, 2019

@aarchiba , what is your use case or workflow that needs such a feature? Is it because of astroquery downloads? Or is it something else?

@aarchiba
Contributor Author

aarchiba commented Sep 5, 2019

@aarchiba , what is your use case or workflow that needs such a feature? Is it because of astroquery downloads? Or is it something else?

I'm coming at this from PINT, which needs files not necessarily obtained through astropy but that are conveniently downloaded through the astropy mechanism, avoiding the need for PINT to implement its own caching mechanism. (The files are things like other solar system ephemerides.)

We ran into trouble because the JPL server blocked the Travis machines, which were hammering it with requests for the same files over and over. NANOGrav created a mirror, but that seems to go down with some regularity, so CI builds were failing with non-obvious messages. If we had been able to just load a ZIP file into the machines' astropy caches, that would have resolved the issue. Unfortunately I couldn't find any reasonable way to provide the ZIP file to Travis that didn't involve hosting a server; if we were able to do that, it could replace the NANOGrav server. (Git LFS from GitHub can work, but the limit is 1 GB storage and 1 GB downloads per month, and anyone who forks your project can use up your quota for both.)

We also had requests from users wanting to use machines offline.

@aarchiba
Contributor Author

aarchiba commented Sep 5, 2019

However, the big question I have is how this is going to be discovered and used. Right now, most of utils is meant for internal astropy use, exposed mostly as one hopes things might be useful for others too (see http://docs.astropy.org/en/latest/utils/index.html; iers is the one exception and it has been argued that it in fact doesn't belong in utils partially for that reason...). Certainly, the way I have found things in utils is by looking at astropy code, not because of going through its scattered documentation (though as a developer I'm hardly a typical case...).

Well, I find the astropy documentation somewhat hard to navigate, but surely discoverability is largely a documentation problem? And while the IERS stuff is clearly astronomical and maybe belongs in time or coords, downloading files into a cache is not. The static nature of (some) astronomical data is somewhat unusual, which helps make a case for it being in astropy.

Is the idea to follow up this PR to use it in CI runs? (Would be great!) Or to use it in external packages like PINT (where you mentioned this PR)? If the latter, we'd need to think about how to make things more discoverable, perhaps even go the route suggested by @saimn above of using them in scripts.

I think it's something of a niche use - the mechanism isn't really suitable for users keeping their wikipedia mirrors or podcast episodes or whatever - but I do think it makes sense for packages that build on astropy to have this handy mechanism. I see less use for individual users, but it is in the public docs.

@aarchiba
Contributor Author

aarchiba commented Sep 5, 2019

One idea while quickly reading the recent cache issues: it could be convenient to have a command-line script to list the files in the cache(s), delete one or all files, export, import, get a copy of a file, etc. (astropy-cache [list|export|import|get|delete] ?)

edit: obviously this could be done by wrapping some of the functions added here ;)

Would it really help much to have a command-line script, compared to running the relevant commands from an ipython prompt? All of the operations you list are already available using fairly straightforward functions in astropy.utils.data (with this PR).

@aarchiba
Contributor Author

aarchiba commented Sep 5, 2019

I should say, the code as it stands is not really able to handle concurrency, and that problem is baked into the UI: download_file returns a filename. If another process, maybe in another virtualenv, does a clear_download_cache on the url, the file might be gone before it can be opened. (Hooray for race conditions!) So we don't want to hammer on the cache too hard, particularly on clearing it. (Simultaneous downloads are probably fine.)
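The race baked into the filename-returning API can be simulated in a single process; the names below are stand-ins, not astropy's actual code:

```python
import os
import tempfile

def fake_download_file(cache_dir):
    # Stand-in for download_file: it writes the "downloaded" data and
    # returns a filename rather than an open handle, which is the root
    # of the race described above.
    path = os.path.join(cache_dir, "contents")
    with open(path, "w") as f:
        f.write("downloaded data")
    return path

def demonstrate_race():
    cache_dir = tempfile.mkdtemp()
    path = fake_download_file(cache_dir)
    os.remove(path)  # simulates a concurrent clear_download_cache
    try:
        with open(path):
            return False  # no race this time
    except FileNotFoundError:
        return True  # the cached file vanished before it could be opened
```

In the real multi-process case the deletion happens in another process, between the return of download_file and the caller's open().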

@aarchiba
Contributor Author

aarchiba commented Sep 5, 2019

I also have some code to handle cases where the same file is also available from several mirrors (get it from the first one that works but index it under the original URL) and to handle cases where, like the IERS A/B tables, the online data is updated regularly and we want to pull in new data without wiping out the old if the new can't be obtained. The code needs some work and some test cases, so I was waiting to see if people think filling out the cache abilities at all is a good idea.
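The mirror idea reduces to a small piece of logic: try each source in turn, but index the result under the canonical URL. A sketch under that assumption, with the downloader injected so it is testable (this is not the actual code mentioned above):

```python
def download_with_mirrors(canonical_url, mirrors, fetch, cache):
    """Fetch canonical_url, falling back to mirrors, caching under canonical_url.

    fetch(url) -> bytes is an injected downloader; cache is a dict-like
    mapping URL -> bytes. All names here are illustrative.
    """
    if canonical_url in cache:
        return cache[canonical_url]
    errors = []
    for url in [canonical_url] + list(mirrors):
        try:
            data = fetch(url)
        except OSError as exc:
            errors.append((url, exc))
            continue
        cache[canonical_url] = data  # indexed under the original URL
        return data
    raise OSError(f"all sources failed for {canonical_url}: {errors}")
```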

@aarchiba
Contributor Author

aarchiba commented Sep 5, 2019

@aarchiba , what is your use case or workflow that needs such a feature? Is it because of astroquery downloads? Or is it something else?

Let me also add that as it currently stands, running the astropy test suite with remote tests enabled completely wipes out the user's astropy cache, forcing them to re-download everything that's in there. This PR fixes that by stashing the cache contents while those tests are run.

@aarchiba
Contributor Author

aarchiba commented Sep 5, 2019

Let me add also that the Travis failure is some kind of Travis misbehaviour - it reports a flake8 failure on a file that I haven't touched and that doesn't have the spaces that flake8 complains about. I assume it'll go away when I push a few more changes, but I've been waiting on some interest from the project for that.

@pllim
Member

pllim commented Sep 5, 2019

Thank you for the clarifications, @aarchiba . Lots to think about here.

Re: Travis failure -- Usually a git rebase against upstream master would fix it. But okay to wait as well until your next push.

@mhvk
Contributor

mhvk commented Sep 5, 2019

Thanks for the detailed replies - the PINT use case is certainly a reasonable one, and so is avoiding hammering servers during CI tests (which only become flaky as a result anyway). The latter means it is a good idea to have this in astropy rather than in PINT, as long as we can make sure other packages can find it - i.e., I agree that documentation is key! Indeed, my comments were partly an indirect question of how to do that: you already have good docstrings, but that alone doesn't make it very discoverable... Maybe utils.data should at least gain its own page (or a better module docstring)? Note that I'm not asking you to do all that - we can make it a new issue, but hopefully we can at least agree on what approach might be best!

Also, your point about user caches being wiped by using --remote-data would seem a good reason. Let me also ping @astrofrog, since I think he is more familiar with those options, and has thought about the cache a lot more than I have.

@aarchiba
Contributor Author

aarchiba commented Sep 5, 2019

Also, your point about user caches being wiped by using --remote-data would seem a good reason. Let me also ping @astrofrog, since I think he is more familiar with those options, and has thought about the cache a lot more than I have.

Actually this enables more stringent local as well as CI testing: the test suite could reset the cache location to somewhere under /tmp, then load up a standard ZIP file and mock urllib to raise an exception: now any test that tries to request anything unexpected from the net can show up in the test suite, and tests that request expected data can be run without Net access.

(I have a hackish stub to mock urllib to raise exceptions so it's not hard.)
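Such a stub can be quite small; a hedged sketch of the idea (the exception name and helper are made up, and the real astropy code paths may go through other openers as well):

```python
import urllib.request
from unittest import mock

class NetworkAccessError(RuntimeError):
    """Raised when a test unexpectedly touches the network (name made up)."""

def _no_network(*args, **kwargs):
    raise NetworkAccessError("network access is disabled during tests")

def forbid_network():
    # Patch urllib.request.urlopen so any code path that reaches it
    # fails loudly instead of silently hitting the net; a preloaded
    # cache supplies the data the tests actually need.
    return mock.patch("urllib.request.urlopen", side_effect=_no_network)
```

Used as a context manager (`with forbid_network(): ...`) or as a test decorator, any unexpected download then shows up as a test failure.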

@pllim
Member

pllim commented Sep 6, 2019

I still have a slight concern about the security implication of passing around ZIP files. xref #9182 (comment)

@aarchiba
Contributor Author

aarchiba commented Sep 6, 2019

I still have a slight concern about the security implication of passing around ZIP files. xref #9182 (comment)

At the moment there's no actual passing around; the user can make them and the user can load them in. I don't think any of the files that end up in astropy's cache contain code, but in any case, if their contents pose security issues, those are if anything made worse by fetching them from the Web instead of from a checked ZIP file.

If you're worried about malevolently crafted ZIPs, the index internal to the ZIP file is in JSON, not pickle, so it should not be an attack vector; it is used only to associate URLs with file contents (I considered skipping the index by naming the files inside the ZIP based on urlencoded versions of their URLs but this seemed cumbersome). New file hashes are computed before inserting them in the index, though I could check the hashes against the contents if that would be a useful precaution. ZIP file contents are never extracted to the filesystem, and strange paths inside the ZIP file do not affect the ultimate location of the data. But really, making sure you have the right ZIP file is not an inside-astropy problem.

CI environments should indeed be provided with a shasum of the file to check after downloading. This improves security by preventing anyone from intercepting our outgoing file requests and returning bogus contents - the ZIP file contains pre-downloaded correct contents, checked by shasum. But this PR doesn't do anything with CI integration; that has to wait for someone to be willing to serve up the ZIP file to our CI servers on every run. This PR just makes it possible to import and export data from the cache.
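To make the format described above concrete, here is a toy pack/unpack pair: files stored under their content hash, with a JSON index mapping URLs to hashes, and hashes re-verified on read. This loosely mirrors the layout described, not the actual astropy implementation:

```python
import hashlib
import io
import json
import zipfile

def pack_cache(entries):
    """Pack {url: bytes} into an in-memory ZIP with a JSON index (toy format)."""
    buf = io.BytesIO()
    index = {}
    with zipfile.ZipFile(buf, "w") as z:
        for url, content in entries.items():
            digest = hashlib.sha256(content).hexdigest()
            z.writestr(digest, content)  # file named by its own hash
            index[url] = digest
        z.writestr("index.json", json.dumps(index))
    return buf.getvalue()

def unpack_cache(blob):
    """Read the ZIP back, verifying each file against its recorded hash."""
    entries = {}
    with zipfile.ZipFile(io.BytesIO(blob)) as z:
        index = json.loads(z.read("index.json"))
        for url, digest in index.items():
            content = z.read(digest)
            if hashlib.sha256(content).hexdigest() != digest:
                raise ValueError(f"hash mismatch for {url}")
            entries[url] = content
    return entries
```

Because the index is plain JSON and the contents are only ever read, not extracted to the filesystem, a crafted ZIP can at worst supply wrong data, which the hash check then catches.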

@aarchiba
Contributor Author

It may or may not be helpful here, but as an experiment I turned on Travis' "cache" feature for PINT. Specifically I had it cache ~/.astropy/cache. So the contents are stored in some cloud something or other and downloaded into each Travis run before it begins. Any updates to the cache are then uploaded to the cloud whatzit at the end. So the IERS server doesn't get hammered by our CI runs.

Now, the cache isn't really intended for this, so Travis keeps separate versions for each entry in the build matrix and for each branch. But if a branch has no cached data, Travis will pull in the cache from master. So once master gets a copy for each entry in the build matrix, the individual runs operate out of cache.
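For reference, enabling that looks roughly like the following .travis.yml fragment, per Travis' documented cache feature (reconstructed here, not copied from PINT's repo):

```yaml
# Cache the astropy download cache between Travis runs so the IERS
# (and other) servers aren't hit on every build.
cache:
  directories:
    - $HOME/.astropy/cache
```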

@aarchiba
Contributor Author

Fixes #9187
Fixes #8676
Makes solution of #9162 more straightforward, but doesn't decide for itself when the cache is invalid

@aarchiba
Contributor Author

I have to admit I don't understand the problem with the Windows build. I've just created some temporary files and I can't access them with urllib?

- All operations that act on the astropy download cache now take an argument
``pkgname`` that allows one to specify which package's cache to use.
[#8327]
[#8237]

Good catch!

@pllim left a comment

LGTM. I am betting that nothing weird crept in during all the rebases you had to do. Again, thank you very much for contributing your expertise to improving our caching system and for bearing with all the reviews (353 comments logged in "Conversation"!). I dare say that utils.data might just be one of the best test-covered modules in the core now!

🐴 💀 🚀

@pllim pllim merged commit 87dd101 into astropy:master Oct 30, 2019
@pllim pllim mentioned this pull request Oct 30, 2019
@mhvk
Contributor

mhvk commented Oct 30, 2019

🎆

@aarchiba
Contributor Author

Thank all of you for your careful code reviews - the code is much better for them!

pllim added a commit to pllim/astropy that referenced this pull request Nov 7, 2019
pllim added a commit to pllim/astropy that referenced this pull request Nov 10, 2019
pllim added a commit to pllim/astropy that referenced this pull request Nov 16, 2019
maxnoe pushed a commit to maxnoe/astropy that referenced this pull request Jan 6, 2020

9 participants