
Cache preload #9182

Merged
pllim merged 12 commits into astropy:master from aarchiba:cache-preload
Oct 30, 2019

Conversation

@aarchiba
Contributor

Various tools for managing the astropy download cache. Perhaps most useful for people wanting disconnected operation.

It is possible to export part or all of the contents of the download cache as a ZIP file, which can then be loaded into astropy on another machine to populate its cache without requiring internet access. This may be of interest for machines running CI, so they don't re-download everything they need on every run - if the ZIP file can be provided to them somehow.

Also there's a new function to consistency-check the contents of a cache, in case there are unused items or it has become damaged somehow. Recall that the cache is user-wide, shared across multiple virtualenvs (possibly including a development one), and accessed concurrently by multiple processes.

Finally, the ZIP file mechanism is used to make sure that tests that require purging the cache don't leave the user with an empty cache needing to download all their downloadable files again. Probably only a minor inconvenience for most users with Net connections.

@jbcurtin

jbcurtin commented Sep 3, 2019

This seems like a step in the right direction. When performing CI builds with preloaded data, it would be nice to have an additional sha256 checksum for the ZIP file taken from os.environ. Combined with rechecking the hashes available inside the ZIP file, we'd know for certain the ZIP file's integrity hasn't been compromised.

If the environment variable fails to match the ZIP file's hash.hexdigest(), please raise an IOError.
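A minimal sketch of what that check could look like. The environment-variable name, function name, and IOError-on-mismatch behavior are all illustrative, not part of this PR:

```python
import hashlib
import os
import zipfile

def load_verified_zip(path, env_var="ASTROPY_CACHE_ZIP_SHA256"):
    """Open a cache ZIP only if its sha256 matches the env var (names made up)."""
    expected = os.environ.get(env_var)
    if expected is None:
        raise IOError(f"{env_var} is not set; cannot verify {path}")
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Hash the file in chunks so large ZIPs don't need to fit in memory.
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    if h.hexdigest() != expected.lower():
        raise IOError(f"sha256 mismatch for {path}")
    return zipfile.ZipFile(path)
```

The hashes recorded inside the ZIP would still be rechecked separately when the contents are imported into the cache.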

@saimn
Contributor

saimn commented Sep 4, 2019

One idea while quickly reading the recent cache issues: it could be convenient to have a command-line script to list the files in the cache(s), delete one or all files, export, import, get a copy of a file, etc. (astropy-cache [list|export|import|get|delete] ?)

edit: obviously this could be done by wrapping some of the functions added here ;)
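As a rough sketch of how such a wrapper might look, assuming the cache functions in astropy.utils.data (including the export/import functions this PR adds); the subcommand names and the exact mapping to those functions are guesses:

```python
import argparse

def build_parser():
    # Hypothetical "astropy-cache" CLI, per the suggestion above.
    parser = argparse.ArgumentParser(prog="astropy-cache")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("list", help="list cached URLs")
    exp = sub.add_parser("export", help="export the cache to a ZIP file")
    exp.add_argument("zipfile")
    imp = sub.add_parser("import", help="load a ZIP file into the cache")
    imp.add_argument("zipfile")
    get = sub.add_parser("get", help="copy one cached file out")
    get.add_argument("url")
    get.add_argument("dest")
    delete = sub.add_parser("delete", help="remove one URL, or everything")
    delete.add_argument("url", nargs="?")
    return parser

def main(argv=None):
    args = build_parser().parse_args(argv)
    # Deferred import; assumes the functions added/used by this PR.
    from astropy.utils import data
    if args.command == "list":
        for url in data.get_cached_urls():
            print(url)
    elif args.command == "export":
        data.export_download_cache(args.zipfile)
    elif args.command == "import":
        data.import_download_cache(args.zipfile)
    elif args.command == "get":
        import shutil
        shutil.copy(data.download_file(args.url, cache=True), args.dest)
    elif args.command == "delete":
        data.clear_download_cache(args.url)
```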

@mhvk
Contributor

mhvk commented Sep 5, 2019

Now looked a bit beyond my general 👍 at the top. Overall, the idea seems sensible.

However, the big question I have is how this is going to be discovered and used. Right now, most of utils is meant for internal astropy use, exposed mostly as one hopes things might be useful for others too (see http://docs.astropy.org/en/latest/utils/index.html; iers is the one exception and it has been argued that it in fact doesn't belong in utils partially for that reason...). Certainly, the way I have found things in utils is by looking at astropy code, not because of going through its scattered documentation (though as a developer I'm hardly a typical case...).

Is the idea to follow up this PR to use it in CI runs? (Would be great!) Or to use it in external packages like PINT (where you mentioned this PR)? If the latter, we'd need to think about how to make things more discoverable, perhaps even go the route suggested by @saimn above of using them in scripts.

@pllim
Member

pllim commented Sep 5, 2019

@aarchiba , what is your use case or workflow that needs such a feature? Is it because of astroquery downloads? Or is it something else?

@aarchiba
Contributor Author

aarchiba commented Sep 5, 2019

@aarchiba , what is your use case or workflow that needs such a feature? Is it because of astroquery downloads? Or is it something else?

I'm coming at this from PINT, which needs files not necessarily obtained through astropy but that are conveniently downloaded through the astropy mechanism, avoiding the need for PINT to implement its own caching mechanism. (The files are things like other solar system ephemerides.)

We ran into trouble because the JPL server blocked the Travis machines, which were hammering it with requests for the same files over and over. NANOGrav created a mirror, but that seems to go down with some regularity, so CI builds were failing with non-obvious messages. If we had been able to just load a ZIP file into the machines' astropy caches, that would have resolved the issue. Unfortunately I couldn't find any reasonable way to provide the ZIP file to Travis that didn't involve hosting a server; if we were able to do that, it could replace the NANOGrav server. (Git LFS from GitHub can work, but the limit is 1 GB storage and 1 GB downloads per month, and anyone who forks your project can use up your quota for both.)

We also had requests from users wanting to use machines offline.

@aarchiba
Contributor Author

aarchiba commented Sep 5, 2019

However, the big question I have is how this is going to be discovered and used. Right now, most of utils is meant for internal astropy use, exposed mostly as one hopes things might be useful for others too (see http://docs.astropy.org/en/latest/utils/index.html; iers is the one exception and it has been argued that it in fact doesn't belong in utils partially for that reason...). Certainly, the way I have found things in utils is by looking at astropy code, not because of going through its scattered documentation (though as a developer I'm hardly a typical case...).

Well, I find the astropy documentation somewhat hard to navigate, but surely discoverability is largely a documentation problem? And while the IERS stuff is clearly astronomical and maybe belongs in time or coords, downloading files into a cache is not. The static nature of (some) astronomical data is somewhat unusual, which helps make a case for it being in astropy.

Is the idea to follow up this PR to use it in CI runs? (Would be great!) Or to use it in external packages like PINT (where you mentioned this PR)? If the latter, we'd need to think about how to make things more discoverable, perhaps even go the route suggested by @saimn above of using them in scripts.

I think it's something of a niche use - the mechanism isn't really suitable for users keeping their wikipedia mirrors or podcast episodes or whatever - but I do think it makes sense for packages that build on astropy to have this handy mechanism. I see less use for individual users, but it is in the public docs.

@aarchiba
Contributor Author

aarchiba commented Sep 5, 2019

One idea while quickly reading the recent cache issues: it could be convenient to have a command-line script to list the files in the cache(s), delete one or all files, export, import, get a copy of a file, etc. (astropy-cache [list|export|import|get|delete] ?)

edit: obviously this could be done by wrapping some of the functions added here ;)

Would it really help much to have a command-line script, compared to running the relevant commands from an ipython prompt? All of the operations you list are already available using fairly straightforward functions in astropy.utils.data (with this PR).

@aarchiba
Contributor Author

aarchiba commented Sep 5, 2019

I should say, the code as it stands is not really able to handle concurrency, and that problem is baked into the UI: download_file returns a filename. If another process, maybe in another virtualenv, does a clear_download_cache on the url, the file might be gone before it can be opened. (Hooray for race conditions!) So we don't want to hammer on the cache too hard, particularly on clearing it. (Simultaneous downloads are probably fine.)
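The race baked into the filename-returning API can be simulated in a single process; the names below are stand-ins, not astropy's actual code:

```python
import os
import tempfile

def fake_download_file(cache_dir):
    # Stand-in for download_file: it writes the "downloaded" data and
    # returns a filename rather than an open handle, which is the root
    # of the race described above.
    path = os.path.join(cache_dir, "contents")
    with open(path, "w") as f:
        f.write("downloaded data")
    return path

def demonstrate_race():
    cache_dir = tempfile.mkdtemp()
    path = fake_download_file(cache_dir)
    os.remove(path)  # simulates a concurrent clear_download_cache
    try:
        with open(path):
            return False  # no race this time
    except FileNotFoundError:
        return True  # the cached file vanished before it could be opened
```

In the real multi-process case the deletion happens in another process, between the return of download_file and the caller's open().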

@aarchiba
Contributor Author

aarchiba commented Sep 5, 2019

I also have some code to handle cases where the same file is also available from several mirrors (get it from the first one that works but index it under the original URL) and to handle cases where, like the IERS A/B tables, the online data is updated regularly and we want to pull in new data without wiping out the old if the new can't be obtained. The code needs some work and some test cases, so I was waiting to see if people think filling out the cache abilities at all is a good idea.
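The mirror idea reduces to a small piece of logic: try each source in turn, but index the result under the canonical URL. A sketch under that assumption, with the downloader injected so it is testable (this is not the actual code mentioned above):

```python
def download_with_mirrors(canonical_url, mirrors, fetch, cache):
    """Fetch canonical_url, falling back to mirrors, caching under canonical_url.

    fetch(url) -> bytes is an injected downloader; cache is a dict-like
    mapping URL -> bytes. All names here are illustrative.
    """
    if canonical_url in cache:
        return cache[canonical_url]
    errors = []
    for url in [canonical_url] + list(mirrors):
        try:
            data = fetch(url)
        except OSError as exc:
            errors.append((url, exc))
            continue
        cache[canonical_url] = data  # indexed under the original URL
        return data
    raise OSError(f"all sources failed for {canonical_url}: {errors}")
```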

@aarchiba
Contributor Author

aarchiba commented Sep 5, 2019

@aarchiba , what is your use case or workflow that needs such a feature? Is it because of astroquery downloads? Or is it something else?

Let me also add that as it currently stands, running the astropy test suite with remote tests enabled completely wipes out the user's astropy cache, forcing them to re-download everything that's in there. This PR fixes that by stashing the cache contents while those tests are run.

@aarchiba
Contributor Author

aarchiba commented Sep 5, 2019

Let me add also that the Travis failure is some kind of Travis misbehaviour - it reports a flake8 failure on a file that I haven't touched and that doesn't have the spaces that flake8 complains about. I assume it'll go away when I push a few more changes, but I've been waiting on some interest from the project for that.

@pllim
Member

pllim commented Sep 5, 2019

Thank you for the clarifications, @aarchiba . Lots to think about here.

Re: Travis failure -- Usually a git rebase against upstream master would fix it. But okay to wait as well until your next push.

@mhvk
Contributor

mhvk commented Sep 5, 2019

Thanks for the detailed replies - the PINT use case is certainly a reasonable one, and so is avoiding hammering servers during CI tests (which only become flaky as a result anyway). The latter means it is a good idea to have this in astropy rather than in PINT, as long as we can make sure other packages can find it - i.e., I agree that documentation is key! Indeed, my comments were partly an indirect question of how to do that: you already have good docstrings, but that alone doesn't make it very discoverable... Maybe utils.data should at least gain its own page (or a better module docstring)? Note that I'm not asking you to do all that - we can make it a new issue, but hopefully we can at least agree on what approach might be best!

Also, your point about user caches being wiped by using --remote-data would seem a good reason. Let me also ping @astrofrog, since I think he is more familiar with those options, and has thought about the cache a lot more than I have.

@aarchiba
Contributor Author

aarchiba commented Sep 5, 2019

Also, your point about user caches being wiped by using --remote-data would seem a good reason. Let me also ping @astrofrog, since I think he is more familiar with those options, and has thought about the cache a lot more than I have.

Actually this enables more stringent local as well as CI testing: the test suite could reset the cache location to somewhere under /tmp, then load up a standard ZIP file and mock urllib to raise an exception: now any test that tries to request anything unexpected from the net can show up in the test suite, and tests that request expected data can be run without Net access.

(I have a hackish stub to mock urllib to raise exceptions so it's not hard.)
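Such a stub can be quite small; a hedged sketch of the idea (the exception name and helper are made up, and the real astropy code paths may go through other openers as well):

```python
import urllib.request
from unittest import mock

class NetworkAccessError(RuntimeError):
    """Raised when a test unexpectedly touches the network (name made up)."""

def _no_network(*args, **kwargs):
    raise NetworkAccessError("network access is disabled during tests")

def forbid_network():
    # Patch urllib.request.urlopen so any code path that reaches it
    # fails loudly instead of silently hitting the net; a preloaded
    # cache supplies the data the tests actually need.
    return mock.patch("urllib.request.urlopen", side_effect=_no_network)
```

Used as a context manager (`with forbid_network(): ...`) or as a test decorator, any unexpected download then shows up as a test failure.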

@pllim
Member

pllim commented Sep 6, 2019

I still have a slight concern about the security implication of passing around ZIP files. xref #9182 (comment)

@aarchiba
Contributor Author

aarchiba commented Sep 6, 2019

I still have a slight concern about the security implication of passing around ZIP files. xref #9182 (comment)

At the moment there's no actual passing around; the user can make them and the user can load them in. I don't think any of the files that end up in astropy's cache contain code, but in any case, if their contents pose security issues, those are if anything made worse by fetching them from the Web instead of from a checked ZIP file.

If you're worried about malevolently crafted ZIPs, the index internal to the ZIP file is in JSON, not pickle, so it should not be an attack vector; it is used only to associate URLs with file contents (I considered skipping the index by naming the files inside the ZIP based on urlencoded versions of their URLs but this seemed cumbersome). New file hashes are computed before inserting them in the index, though I could check the hashes against the contents if that would be a useful precaution. ZIP file contents are never extracted to the filesystem, and strange paths inside the ZIP file do not affect the ultimate location of the data. But really, making sure you have the right ZIP file is not an inside-astropy problem.

CI environments should indeed be provided with a shasum of the file to check after downloading. This improves security by preventing anyone from intercepting our outgoing file requests and returning bogus contents - the ZIP file contains pre-downloaded correct contents, checked by shasum. But this PR doesn't do anything with CI integration; that has to wait for someone to be willing to serve up the ZIP file to our CI servers on every run. This PR just makes it possible to import and export data from the cache.
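To make the format described above concrete, here is a toy pack/unpack pair: files stored under their content hash, with a JSON index mapping URLs to hashes, and hashes re-verified on read. This loosely mirrors the layout described, not the actual astropy implementation:

```python
import hashlib
import io
import json
import zipfile

def pack_cache(entries):
    """Pack {url: bytes} into an in-memory ZIP with a JSON index (toy format)."""
    buf = io.BytesIO()
    index = {}
    with zipfile.ZipFile(buf, "w") as z:
        for url, content in entries.items():
            digest = hashlib.sha256(content).hexdigest()
            z.writestr(digest, content)  # file named by its own hash
            index[url] = digest
        z.writestr("index.json", json.dumps(index))
    return buf.getvalue()

def unpack_cache(blob):
    """Read the ZIP back, verifying each file against its recorded hash."""
    entries = {}
    with zipfile.ZipFile(io.BytesIO(blob)) as z:
        index = json.loads(z.read("index.json"))
        for url, digest in index.items():
            content = z.read(digest)
            if hashlib.sha256(content).hexdigest() != digest:
                raise ValueError(f"hash mismatch for {url}")
            entries[url] = content
    return entries
```

Because the index is plain JSON and the contents are only ever read, not extracted to the filesystem, a crafted ZIP can at worst supply wrong data, which the hash check then catches.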

@aarchiba
Contributor Author

It may or may not be helpful here, but as an experiment I turned on Travis' "cache" feature for PINT. Specifically I had it cache ~/.astropy/cache. So the contents are stored in some cloud something or other and downloaded into each Travis run before it begins. Any updates to the cache are then uploaded to the cloud whatzit at the end. So the IERS server doesn't get hammered by our CI runs.

Now, the cache isn't really intended for this, so Travis keeps separate versions for each entry in the build matrix and for each branch. But if a branch has no cached data, Travis will pull in the cache from master. So once master gets a copy for each entry in the build matrix, the individual runs operate out of cache.
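For reference, enabling that looks roughly like the following .travis.yml fragment, per Travis' documented cache feature (reconstructed here, not copied from PINT's repo):

```yaml
# Cache the astropy download cache between Travis runs so the IERS
# (and other) servers aren't hit on every build.
cache:
  directories:
    - $HOME/.astropy/cache
```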

@aarchiba
Contributor Author

Fixes #9187
Fixes #8676
Makes solution of #9162 more straightforward, but doesn't decide for itself when the cache is invalid

@aarchiba
Contributor Author

I have to admit I don't understand the problem with the Windows build. I've just created some temporary files and I can't access them with urllib?

- All operations that act on the astropy download cache now take an argument
``pkgname`` that allows one to specify which package's cache to use.
[#8327]
[#8237]

Good catch!

@pllim left a comment

LGTM. I am betting that nothing weird crept in during all the rebases you had to do. Again, thank you very much for contributing your expertise to improving our caching system and for bearing with all the reviews (353 comments logged in "Conversation"!). I dare say that utils.data might just be one of the best test-covered modules in the core now!

🐴 💀 🚀

@pllim pllim merged commit 87dd101 into astropy:master Oct 30, 2019
@pllim pllim mentioned this pull request Oct 30, 2019
@mhvk
Contributor

mhvk commented Oct 30, 2019

🎆

@aarchiba
Contributor Author

Thank all of you for your careful code reviews - the code is much better for them!

pllim added a commit to pllim/astropy that referenced this pull request Nov 7, 2019
pllim added a commit to pllim/astropy that referenced this pull request Nov 10, 2019
pllim added a commit to pllim/astropy that referenced this pull request Nov 16, 2019
maxnoe pushed a commit to maxnoe/astropy that referenced this pull request Jan 6, 2020

9 participants