Conversation
This seems like a step in the right direction. When performing CI builds with preloaded data, it would be nice to have an additional ...
One idea from quickly reading through the recent cache issues: it could be convenient to have a command-line script to list the files in the cache(s), delete one or all files, export, import, get a copy of a file, etc. (Edit: obviously this could be done by wrapping some of the functions added here ;)
Now I've looked a bit beyond my general 👍 at the top. Overall, the idea seems sensible. However, the big question I have is how this is going to be discovered and used. Right now, most of ... Is the idea for a follow-up to this PR to use it in CI runs? (That would be great!) Or to use it in external packages like PINT (where you mentioned this PR)? If the latter, we'd need to think about how to make things more discoverable, perhaps even going the route suggested by @saimn above of exposing them in scripts.
@aarchiba, what is your use case or workflow that needs such a feature? Is it because of ...
I'm coming at this from PINT, which needs files that are not necessarily obtained through astropy but are conveniently downloaded through the astropy mechanism, avoiding the need for PINT to implement its own caching. (The files are things like other solar system ephemerides.) We ran into trouble because the JPL server blocked the Travis machines, which were hammering it with requests for the same files over and over. NANOGrav created a mirror, but that seems to go down with some regularity, so CI builds were failing with non-obvious messages. If we had been able to just load a ZIP file into the machines' astropy caches, that would have resolved the issue. Unfortunately I couldn't find any reasonable way to provide the ZIP file to Travis that didn't involve hosting a server - and if we were able to do that, it could replace the NANOGrav server. (Git LFS from GitHub can work, but the limit is 1 GB of storage and 1 GB of downloads per month, and anyone who forks your project can use up your quota for both.) We also had requests from users wanting to use machines offline.
Well, I find the astropy documentation somewhat hard to navigate, but surely discoverability is largely a documentation problem? And while the IERS stuff is clearly astronomical and maybe belongs in ...
I think it's something of a niche use - the mechanism isn't really suitable for users keeping their wikipedia mirrors or podcast episodes or whatever - but I do think it makes sense for packages that build on astropy to have this handy mechanism available. I see less use for individual users, but it is in the public docs.
Would it really help much to have a command-line script, compared to running the relevant commands from an IPython prompt? All of the operations you list are already available using fairly straightforward functions in astropy.utils.data (with this PR).
I should say, the code as it stands is not really able to handle concurrency, and that problem is baked into the UI: download_file returns a filename. If another process, maybe in another virtualenv, does a clear_download_cache on the URL, the file might be gone before it can be opened. (Hooray for race conditions!) So we don't want to hammer on the cache too hard, particularly when clearing it. (Simultaneous downloads are probably fine.)
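To make the race concrete, here is a minimal stdlib-only sketch (not astropy code): the temporary file stands in for the path returned by download_file, and the remove stands in for another process running clear_download_cache on the same URL.

```python
import os
import tempfile

# Hypothetical illustration of the race described above.
fd, path = tempfile.mkstemp()
os.close(fd)
# path is what download_file() would hand back to the caller.

os.remove(path)  # "another process" clears the cache entry here

try:
    with open(path) as f:  # the original caller only now opens it
        f.read()
except FileNotFoundError:
    print("file vanished between download_file() and open()")
```

The window between returning the path and opening it is exactly where a concurrent cache clear can bite.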
I also have some code to handle cases where the same file is available from several mirrors (get it from the first one that works, but index it under the original URL) and cases where, like the IERS A/B tables, the online data is updated regularly and we want to pull in new data without wiping out the old if the new can't be obtained. The code needs some work and some test cases, so I was waiting to see whether people think filling out the cache abilities at all is a good idea.
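The mirror logic can be sketched roughly like this (a toy version, not the actual code in this PR; `fetch` is a placeholder for whatever actually retrieves bytes):

```python
def download_with_mirrors(url, mirrors, fetch):
    """Try the canonical URL, then each mirror in order; on success,
    return the data keyed under the *canonical* URL so the cache index
    stays stable no matter which source answered.
    """
    errors = []
    for source in [url] + list(mirrors):
        try:
            return url, fetch(source)
        except OSError as e:
            errors.append((source, e))
    raise OSError(f"all sources failed for {url}: {errors}")
```

In the PINT scenario above, the NANOGrav mirror would be passed as a fallback for the JPL URL, and the cache entry would still live under the JPL URL.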
Let me also add that, as it currently stands, running the astropy test suite with remote tests enabled completely wipes out the user's astropy cache, forcing them to re-download everything in it. This PR fixes that by stashing the cache contents while those tests run.
Let me also add that the Travis failure is some kind of Travis misbehaviour - it reports a flake8 failure in a file that I haven't touched and that doesn't contain the spaces flake8 complains about. I assume it'll go away when I push a few more changes, but I've been waiting for some interest from the project before doing that.
Thank you for the clarifications, @aarchiba. Lots to think about here. Re: Travis failure - usually a git rebase against upstream master would fix it, but it's okay to wait until your next push.
Thanks for the detailed replies - the PINT use case is certainly a reasonable one, and so is the fact that one should avoid hammering servers in CI tests (which will only become flaky as a result anyway). The latter means it is a good idea to have this in astropy rather than in PINT, as long as we can make sure other packages can find it - i.e., I agree that documentation is key! Indeed, my comments were partially an indirect question of how to do that: you already have good docstrings, but that alone doesn't make it very discoverable... Maybe at the least the ... Also, your point about user caches being wiped by using ...
(force-pushed from 5f0a4b2 to c29c20e)
Actually this enables more stringent local as well as CI testing: the test suite could reset the cache location to somewhere under /tmp, load up a standard ZIP file, and mock urllib to raise an exception. Then any test that tries to request anything unexpected from the net shows up in the test suite, and tests that request expected data can run without net access. (I have a hackish stub that mocks urllib to raise exceptions, so it's not hard.)
I still have a slight concern about the security implications of passing around ZIP files. xref #9182 (comment)
At the moment there's no actual passing around; the user can make them and the user can load them in. I don't think any of the files that end up in astropy's cache contain code, but in any case, if their contents pose security issues, those are if anything made worse by getting them from the web instead of from a checked ZIP file.

If you're worried about malevolently crafted ZIPs: the index internal to the ZIP file is JSON, not pickle, so it should not be an attack vector; it is used only to associate URLs with file contents. (I considered skipping the index by naming the files inside the ZIP after urlencoded versions of their URLs, but this seemed cumbersome.) New file hashes are computed before inserting them in the index, though I could check the hashes against the contents if that would be a useful precaution. ZIP file contents are never extracted to the filesystem, and strange paths inside the ZIP file do not affect the ultimate location of the data.

But really, making sure you have the right ZIP file is not an inside-astropy problem. CI environments should indeed be provided with a shasum of the file to check after downloading. This actually improves security, by preventing anyone from intercepting our outgoing file requests and returning bogus contents - the ZIP file contains pre-downloaded correct contents, verified by shasum. But this PR doesn't do anything with CI integration; that has to wait for someone willing to serve the ZIP file to our CI servers on every run. This PR just makes it possible to import and export data from the cache.
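The CI-side check could look something like this (a sketch; the function name, path, and digest are placeholders, not part of this PR):

```python
import hashlib

def verify_sha256(path, expected_hexdigest):
    """Refuse to use a downloaded cache ZIP whose sha256 doesn't match
    the digest published alongside it."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    if h.hexdigest() != expected_hexdigest:
        raise ValueError(f"{path}: sha256 mismatch, refusing to import")
    return path
```

The CI script would call this on the fetched ZIP before loading it into the cache, so a tampered or truncated download fails the build immediately.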
It may or may not be helpful here, but as an experiment I turned on Travis' "cache" feature for PINT. Specifically I had it cache ~/.astropy/cache. So the contents are stored in some cloud something or other and downloaded into each Travis run before it begins. Any updates to the cache are then uploaded to the cloud whatzit at the end. So the IERS server doesn't get hammered by our CI runs. Now, the cache isn't really intended for this, so Travis keeps separate versions for each entry in the build matrix and for each branch. But if a branch has no cached data, Travis will pull in the cache from master. So once master gets a copy for each entry in the build matrix, the individual runs operate out of cache. |
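For reference, enabling this takes only a few lines of Travis configuration (a sketch of what PINT's .travis.yml would contain, assuming the default astropy cache location):

```yaml
# .travis.yml (sketch): persist the astropy download cache between runs
cache:
  directories:
    - $HOME/.astropy/cache
```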
(force-pushed from c29c20e to 0bfdd5e)
I have to admit I don't understand the problem with the Windows build. I've just created some temporary files and I can't access them with urllib? |
- All operations that act on the astropy download cache now take an argument ``pkgname`` that allows one to specify which package's cache to use. [#8327] [#8237]
pllim left a comment
LGTM. I am betting that nothing weird crept in during all the rebases you had to do. Again, thank you very much for contributing your expertise to improving our caching system and for bearing with all the reviews (353 comments logged in "Conversation"!). I dare say that utils.data might just be one of the best test-covered modules in the core now!
🐴 💀 🚀
🎆
Thank you all for your careful code reviews - the code is much better for them!
Various tools for managing the astropy download cache, perhaps most useful for people wanting disconnected operation.
It is possible to export part or all of the contents of the download cache as a ZIP file, which can then be loaded into astropy on another machine to populate its cache without requiring internet access. This may be of interest for machines running CI, so they don't re-download everything they need on every run - provided the ZIP file can be delivered to them somehow.
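As a rough illustration of the format described in this thread (a stdlib-only toy, not astropy's actual implementation): a ZIP containing the downloaded files plus a JSON index mapping URLs to member names. The URL and contents below are made up.

```python
import io
import json
import zipfile

def export_cache(entries):
    """entries: dict mapping URL -> downloaded bytes. Returns ZIP bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        index = {}
        for i, (url, data) in enumerate(entries.items()):
            member = f"contents/{i}"
            z.writestr(member, data)
            index[url] = member
        # The index is JSON, not pickle, so it is not an attack vector.
        z.writestr("index.json", json.dumps(index))
    return buf.getvalue()

def import_cache(blob):
    """Read the index and return {url: bytes}; nothing is extracted to disk."""
    with zipfile.ZipFile(io.BytesIO(blob)) as z:
        index = json.loads(z.read("index.json"))
        return {url: z.read(member) for url, member in index.items()}

blob = export_cache({"https://example.org/de421.bsp": b"ephemeris bytes"})
restored = import_cache(blob)
```

A machine with net access would build the ZIP once; offline or CI machines would just import it to populate their caches.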
There is also a new function to consistency-check the contents of a cache, in case it contains unused items or has become damaged somehow. Recall that the cache is user-wide, shared across multiple virtualenvs (possibly including a development one), and accessed concurrently by multiple processes.
Finally, the ZIP file mechanism is used to make sure that tests that require purging the cache don't leave the user with an empty cache, needing to re-download all their files - probably only a minor inconvenience for most users with net connections, but an inconvenience nonetheless.