
Cache name resolve results in astropy.coordinates#9162

Merged
mhvk merged 12 commits into astropy:master from
adrn:coordinates/cache-name-resolve
May 3, 2020

Conversation

@adrn
Member

@adrn adrn commented Aug 23, 2019

This PR adds support for caching results from the name resolve (Sesame) functionality in astropy.coordinates (primarily used via SkyCoord.from_name).

The motivator here was that, in astropy-tutorials (and in my own scripts), we use .from_name() to retrieve coordinates of some objects every time the build is run, but this unnecessarily hammers the Sesame servers and occasionally fails because of connection issues.

Before I add a changelog entry, I figured I would open the PR to see if anyone has strong opinions about this.

@adrn adrn requested a review from eteq August 23, 2019 20:45
@eteq
Member

eteq commented Aug 23, 2019

I wanted to say this seems dangerous... But honestly when I think on it I realize this would indeed be super useful for me in my own science some of the time - often I grumble that I can't keep going with something using from_name when the internet goes out. So generally I like it (haven't reviewed in detail, but that's my first reaction).

I do think it needs a way to bust the cache easily though. I.e., "I know Sesame was updated last week and I want the newest coordinates". Right now that requires knowing how to clear the cache via a URL the user never sees... Maybe just have an option for cache to be 'overwrite' or similar.

@adrn
Member Author

adrn commented Aug 24, 2019

Good point! I added an overwrite argument - I personally don't like having boolean arguments that also accept magic string values, so I opted for another argument instead.

@aarchiba
Contributor

Why is it necessary to manually handle locking and caching? Would astropy.utils.data.download_file do what you want if it had an overwrite argument?

@mhvk
Contributor

mhvk commented Aug 31, 2019

Like the idea and wonder if we can at least make a session-local cache the default...

But agree with @aarchiba that the actual cache handling should be in utils.data, not here.

@aarchiba aarchiba mentioned this pull request Sep 16, 2019
@bsipocz bsipocz added the utils label Oct 4, 2019
@bsipocz
Member

bsipocz commented Oct 27, 2019

Needs a rebase, and the review comments addressed. Most likely also a milestone change.

@adrn adrn removed this from the v4.0 milestone Oct 28, 2019
@bsipocz bsipocz added this to the v4.1 milestone Oct 28, 2019
@adrn adrn force-pushed the coordinates/cache-name-resolve branch from 0d95f33 to 994253d Compare December 11, 2019 18:19
@adrn
Member Author

adrn commented Dec 11, 2019

I rebased and switched to using the new download_file functionality. However, this currently doesn't work because of the way CDS delivers data and the way we generate hashes for the cache. Each response from CDS may not be identical, because it includes (amongst other things) the amount of time the query took and the date/time, even when repeatedly querying with the same search string. For example, here are two responses for the same query, "Castor":

# castor	#Q62208733
#!Vl=VizieR (local): No table found for: castor

#(Q62208730)

#=Sa=Simbad@CfA:    1     0ms (from cache)
%@ 983633
%I.0 * alf Gem
%C.0 **
%C.N0 12.13.00.00
%J 113.649471640 +31.888282216 = 07:34:35.87 +31:53:17.8
%J.E [34.72 25.95 90] A 2007A&A...474..653V
%P -191.45 -145.19 [3.95 2.95 0] A 2007A&A...474..653V
%X 64.12 [3.75] A 2007A&A...474..653V
%V v 5.40 ~ [0.5] A 2006AstL...32..759G
%S A1V+A2Vm =0.0000D200.0030.0110000000100000 E ~
%#B 235


#!Vl=VizieR (local): No table found for: castor

#(Q62208730)

#=Sa=Simbad@CfA:    1     0ms (from cache)
%@ 983633
%I.0 * alf Gem
%C.0 **
%C.N0 12.13.00.00
%J 113.649471640 +31.888282216 = 07:34:35.87 +31:53:17.8
%J.E [34.72 25.95 90] A 2007A&A...474..653V
%P -191.45 -145.19 [3.95 2.95 0] A 2007A&A...474..653V
%X 64.12 [3.75] A 2007A&A...474..653V
%V v 5.40 ~ [0.5] A 2006AstL...32..759G
%S A1V+A2Vm =0.0000D200.0030.0110000000100000 E ~
%#B 235



#====Done (2019-Dec-11,18:16:23z)====

and

# castor	#Q62208730
#!Vl=VizieR (local): No table found for: castor
#=Sa=Simbad@CfA:    1    52ms
%@ 983633
%I.0 * alf Gem
%C.0 **
%C.N0 12.13.00.00
%J 113.649471640 +31.888282216 = 07:34:35.87 +31:53:17.8
%J.E [34.72 25.95 90] A 2007A&A...474..653V
%P -191.45 -145.19 [3.95 2.95 0] A 2007A&A...474..653V
%X 64.12 [3.75] A 2007A&A...474..653V
%V v 5.40 ~ [0.5] A 2006AstL...32..759G
%S A1V+A2Vm =0.0000D200.0030.0110000000100000 E ~
%#B 235
#!Vl=VizieR (local): No table found for: castor

#(Q62208730)

#=Sa=Simbad@CfA:    1    14ms
%@ 983633
%I.0 * alf Gem
%C.0 **
%C.N0 12.13.00.00
%J 113.649471640 +31.888282216 = 07:34:35.87 +31:53:17.8
%J.E [34.72 25.95 90] A 2007A&A...474..653V
%P -191.45 -145.19 [3.95 2.95 0] A 2007A&A...474..653V
%X 64.12 [3.75] A 2007A&A...474..653V
%V v 5.40 ~ [0.5] A 2006AstL...32..759G
%S A1V+A2Vm =0.0000D200.0030.0110000000100000 E ~
%#B 235

#====Done (2019-Dec-11,18:16:23z)====

In the previous set of commits (overwritten with the rebase), this is why I was manually interfacing with the cache: I needed to set the hash to the URL instead of the file content.

So, in order to get this PR to work, we would need to add functionality to download_file and related machinery to specify or overwrite the hash...

@pllim
Member

pllim commented Dec 11, 2019

I think you already can. See "If they are concerned that the primary source of the data..." in https://astropy.readthedocs.io/en/latest/utils/data.html

@aarchiba
Contributor

I rebased and switched to using the new download_file functionality. However, this currently doesn't work because of the way CDS delivers data and the way we generate hashes for the cache. Each response from CDS may not be identical, because it includes (amongst other things) the amount of time the query took and the date/time, even when repeatedly querying with the same search string. For example, here are two responses for the same query, "Castor":

Indeed, the result is going to be different every time, and if the cache works, you will get the same result as the first successful try (unless you choose to use cache="update" and re-download). Why is this a problem?

In the previous set of commits (overwritten with the rebase), this is why I was manually interfacing with the cache: I needed to set the hash to the URL instead of the file content.

Why? The hash need not be user-visible.

So, in order to get this PR to work, we would need to add functionality to download_file and related machinery to specify or overwrite the hash...

Why? I don't understand.

If I understand what you want (maybe I don't?) you should be able to write:

response = get_file_contents(download_file("http://cds.query/goes?here", cache=True))

This will get you a fresh request the first time and a copy of that result every later time. Isn't that what you want? The hash value appears only in the filename returned from download_file and the user never needs to care about it. (I considered just replacing the hash with a UUID in this context).
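To make the URL-keyed behaviour concrete: the cache key is derived from the URL alone, never from the response body, so the varying timestamps and query ids in the Sesame responses above cannot defeat the cache. Here is a minimal stand-alone sketch of that idea (cached_fetch and its signature are invented for illustration; astropy's real cache lives in utils.data):

```python
import hashlib
import os
import tempfile

def cached_fetch(url, fetch, cache_dir=None):
    # Sketch of a URL-keyed download cache (illustration only, not
    # astropy's code): the key is a hash of the URL, so byte-level
    # differences between successive responses never matter.
    if cache_dir is None:
        cache_dir = os.path.join(tempfile.gettempdir(), "url_cache_sketch")
    os.makedirs(cache_dir, exist_ok=True)
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, key)
    if not os.path.exists(path):
        # Network hit happens only on a cache miss; every later call
        # with the same URL returns the stored copy of the first result.
        with open(path, "w") as f:
            f.write(fetch(url))
    with open(path) as f:
        return f.read()
```

The second call with the same URL reads the stored file and never invokes the fetcher again, which is exactly the "same result as the first successful try" behaviour described above.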

@astrofrog
Member

This seems worthwhile, so just wanted to give a 👍 on the idea!

@adrn adrn force-pushed the coordinates/cache-name-resolve branch from 994253d to 03c246e Compare May 1, 2020 20:17
@adrn
Member Author

adrn commented May 1, 2020

Oh right, this PR! Updated and rebased.

@adrn adrn force-pushed the coordinates/cache-name-resolve branch from 03c246e to ee51f5b Compare May 1, 2020 22:10
@adrn
Member Author

adrn commented May 1, 2020

RE: @aarchiba's comments - I don't remember why I was confused before. On revisiting this, it seemed pretty straightforward to get working, so I think the new commits address the remaining issues. The test right now is pretty "weak," so I'm open to suggestions there, but I don't think it's critical.

Member

@eteq eteq left a comment

I'm basically happy with this, but one minor consideration: I think overwrite is a bit ambiguous as a keyword on its own, which is why I was proposing cache='overwrite'. So what about overwrite->overwrite_cache?

I don't want this to block merging, though, so I'm approving since everything else looks good. And the test failure is a coverage red herring, because it's tested via remote_data.

@adrn
Member Author

adrn commented May 2, 2020

OK, sure!

Comment on lines +164 to +165
download_file(url, cache=cache, show_progress=False,
timeout=data.conf.remote_timeout))
Member

Suggested change
download_file(url, cache=cache, show_progress=False,
timeout=data.conf.remote_timeout))
download_file(url, cache=cache, show_progress=False))

timeout setting here is redundant, as it is already the default behavior:

astropy/astropy/utils/data.py

Lines 1138 to 1139 in 986d5b7

if timeout is None:
timeout = conf.remote_timeout

"you are using caching, i.e. if `cache=True`.")

if overwrite_cache and cache:
overwrite_cache = "update"
Member

A bug?

Suggested change
overwrite_cache = "update"
cache = "update"
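The suggestion above swaps which variable receives the "update" value: a truthy overwrite_cache should switch cache itself to "update" (the value download_file understands as "re-download and refresh the cache"), rather than reassigning overwrite_cache, which is never read again. A stand-alone sketch of the corrected mapping (resolve_cache_option is a hypothetical helper name, not part of the PR):

```python
def resolve_cache_option(cache, overwrite_cache):
    # Corrected logic: overwrite_cache promotes *cache* to "update";
    # the buggy version assigned to overwrite_cache instead, so the
    # value passed on to download_file never changed.
    if overwrite_cache and cache:
        cache = "update"
    return cache
```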

sub-arcsecond accuracy for coordinates.
cache : bool, optional
Determines whether to cache the results or not.
overwrite_cache : bool, optional
Member

I don't understand the need to have two different keywords to do this. Why not ask user to use cache='update' directly like in get_icrs_coordinates above?

Member Author

Sigh, fine. I just hate string arguments like this so much!!

Contributor

Maybe we can slowly move to Enum - https://docs.python.org/3/library/enum.html
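For reference, such an Enum-based option might look like the following sketch (CacheOption and normalize_cache are invented names, not anything in astropy; the values mirror download_file's bool/str cache argument):

```python
from enum import Enum

class CacheOption(Enum):
    # Hypothetical mapping of the existing bool/str values onto an Enum.
    OFF = False
    ON = True
    UPDATE = "update"

def normalize_cache(value):
    # Accept a CacheOption or a raw bool/str and return a CacheOption,
    # so a typo like cache="updtae" fails loudly with a ValueError
    # instead of being silently treated as a truthy string.
    if isinstance(value, CacheOption):
        return value
    return CacheOption(value)
```

Existing call sites passing True/False/"update" would keep working, while the Enum gives tab completion and early validation.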


icrs1 = get_icrs_coordinates("castor", cache=True)

# This is a weak test: we just check to see that a url is added to the
Member

Why not have a stronger test with astropy.utils.data.get_cached_urls?

Member Author

Because I didn't know about that :) - I'll add a new check

with shelve.open(urlmapfn) as url2hash:
assert len(url2hash) == 1

# Try reloading coordinates, now should just reload cached data:
Member

I wonder if you can temporarily disable internet with no_internet to really test this. 🤔
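The general technique behind such a helper can be sketched in plain Python by monkeypatching socket.socket.connect; this is only an illustration of the idea (fake_no_internet is an invented name), not astropy's actual no_internet utility:

```python
import socket

class fake_no_internet:
    # Context manager that makes all outgoing connections fail, so a
    # test that passes inside it proves the code under test never
    # touched the network (i.e. it really read from the cache).
    def __enter__(self):
        self._orig = socket.socket.connect
        def _refuse(*args, **kwargs):
            raise OSError("internet disabled for this test")
        socket.socket.connect = _refuse
        return self

    def __exit__(self, *exc_info):
        # Restore normal connectivity on exit.
        socket.socket.connect = self._orig
        return False
```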

@adrn
Member Author

adrn commented May 2, 2020

Question: Do we want the default for from_name() to be cache=False or cache=True?

@adrn adrn force-pushed the coordinates/cache-name-resolve branch from 739f825 to 1f76352 Compare May 2, 2020 16:40
@adrn
Member Author

adrn commented May 2, 2020

Thanks @pllim - I addressed your comments.

@adrn adrn force-pushed the coordinates/cache-name-resolve branch from f5ebb48 to 000bec4 Compare May 2, 2020 18:13
@adrn
Member Author

adrn commented May 2, 2020

Note to self: rebase when #10274 is merged (assuming that fixes the failing test)

@mhvk
Contributor

mhvk commented May 2, 2020

@adrn - on whether to cache or not by default - my preference would be something that expires, but I don't think we have that yet! Failing that, I think True makes the most sense - the use case @eteq mentioned, of finding oneself without internet and wanting to continue to work, is I think one of the strongest ones. Plus, if one is using from_name anyway, clearly one is not too worried about small changes in position.
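The expiry idea (not available in astropy at the time, as noted above) boils down to treating a cached file as absent once it passes a maximum age. A minimal sketch with invented names and an arbitrary default policy:

```python
import os
import time

def is_fresh(path, max_age_days=30):
    # Sketch of cache expiry: a cached file counts as present only if
    # it is younger than max_age_days; otherwise the caller would
    # re-download as if it were a cache miss. Names and the 30-day
    # default are hypothetical, not an astropy API.
    if not os.path.exists(path):
        return False
    age_seconds = time.time() - os.path.getmtime(path)
    return age_seconds < max_age_days * 86400
```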

@adrn adrn force-pushed the coordinates/cache-name-resolve branch from 000bec4 to 0e9644c Compare May 3, 2020 13:20
@adrn
Member Author

adrn commented May 3, 2020

Thanks @mhvk - I also think that cache=True makes sense as the default. I updated and pushed.

Contributor

@mhvk mhvk left a comment

The changelog text is out of date so I fear you'll need a ci skip update, but otherwise it all looks great.

@adrn
Member Author

adrn commented May 3, 2020

@bsipocz This is not critical and could be moved to 4.2. On the other hand, I think it is ready to be merged now. I leave it up to you!

@mhvk
Contributor

mhvk commented May 3, 2020

Well, my goal this morning was to empty the slate of 4.1-milestoned issues and PRs, so I'm going to continue to do so and merge... (hoping that I do more good than bad - though my inner dictator is having a blast, with the part of me that stifles it distracted by the beautiful weather - on a morning like this, with the sky pure blue, without brown haze on the horizon and unspoiled by contrails, the air pure and the ears filled with bird song and wind blowing through new, light green leaves, and the nose smelling flowers rather than traffic - I am hoping some things from the lockdown will stay...).

@mhvk mhvk merged commit 5e1e0c7 into astropy:master May 3, 2020
@adrn
Member Author

adrn commented May 3, 2020

@mhvk OK, thanks. Yes, about to head outside myself - my eyes aren't focusing at infinity as much as they need to be! BTW: that message should be archived under #githubpoetry!

@pllim
Member

pllim commented May 4, 2020

@adrn -- I created a #githubpoetry channel on Slack...
