
Cache name resolve results in astropy.coordinates#9162

Merged
mhvk merged 12 commits into astropy:master from
adrn:coordinates/cache-name-resolve
May 3, 2020

Conversation

@adrn
Member

@adrn adrn commented Aug 23, 2019

This PR adds support for caching results from the name resolve (Sesame) functionality in astropy.coordinates (primarily used via SkyCoord.from_name).

The motivator here was that, in astropy-tutorials (and in my own scripts), we use .from_name() to retrieve coordinates of some objects every time the build is run, but this unnecessarily hammers the Sesame servers and occasionally fails because of connection issues.

Before I add a changelog entry, I figured I would open the PR to see if anyone has strong opinions about this.

@adrn adrn requested a review from eteq August 23, 2019 20:45
@eteq
Member

eteq commented Aug 23, 2019

I wanted to say this seems dangerous... But honestly when I think on it I realize this would indeed be super useful for me in my own science some of the time - often I grumble that I can't keep going with something using from_name when the internet goes out. So generally I like it (haven't reviewed in detail, but that's my first reaction).

I do think it needs a way to bust the cache easily though. I.e., "I know Sesame was updated last week and I want the newest coordinates". Right now that requires knowing how to clear the cache via a URL the user never sees... Maybe just have an option for cache to be 'overwrite' or similar.

@adrn
Member Author

adrn commented Aug 24, 2019

Good point! I added an overwrite argument - I personally don't like having boolean arguments that also accept magic string values, so I opted for another argument instead.

@aarchiba
Contributor

Why is it necessary to manually handle locking and caching? Would astropy.utils.data.download_file do what you want if it had an overwrite argument?

@mhvk
Contributor

mhvk commented Aug 31, 2019

Like the idea and wonder if we can at least make a session-local cache the default...

But agree with @aarchiba that the actual cache handling should be in utils.data, not here.

@aarchiba aarchiba mentioned this pull request Sep 16, 2019
@bsipocz bsipocz added the utils label Oct 4, 2019
@bsipocz
Member

bsipocz commented Oct 27, 2019

Needs a rebase, and the review comments addressed. Most likely also a milestone change.

@adrn adrn removed this from the v4.0 milestone Oct 28, 2019
@bsipocz bsipocz added this to the v4.1 milestone Oct 28, 2019
@adrn adrn force-pushed the coordinates/cache-name-resolve branch from 0d95f33 to 994253d Compare December 11, 2019 18:19
@adrn
Member Author

adrn commented Dec 11, 2019

I rebased and switched to using the new download_file functionality. However, this currently doesn't work because of the way CDS delivers data and the way we generate hashes for the cache. Each response from CDS may not be identical, because it includes (amongst other things) the amount of time the query took and the date/time, even when repeatedly querying with the same search string. For example, here are two responses for the same query, "Castor":

# castor	#Q62208733
#!Vl=VizieR (local): No table found for: castor

#(Q62208730)

#=Sa=Simbad@CfA:    1     0ms (from cache)
%@ 983633
%I.0 * alf Gem
%C.0 **
%C.N0 12.13.00.00
%J 113.649471640 +31.888282216 = 07:34:35.87 +31:53:17.8
%J.E [34.72 25.95 90] A 2007A&A...474..653V
%P -191.45 -145.19 [3.95 2.95 0] A 2007A&A...474..653V
%X 64.12 [3.75] A 2007A&A...474..653V
%V v 5.40 ~ [0.5] A 2006AstL...32..759G
%S A1V+A2Vm =0.0000D200.0030.0110000000100000 E ~
%#B 235


#!Vl=VizieR (local): No table found for: castor

#(Q62208730)

#=Sa=Simbad@CfA:    1     0ms (from cache)
%@ 983633
%I.0 * alf Gem
%C.0 **
%C.N0 12.13.00.00
%J 113.649471640 +31.888282216 = 07:34:35.87 +31:53:17.8
%J.E [34.72 25.95 90] A 2007A&A...474..653V
%P -191.45 -145.19 [3.95 2.95 0] A 2007A&A...474..653V
%X 64.12 [3.75] A 2007A&A...474..653V
%V v 5.40 ~ [0.5] A 2006AstL...32..759G
%S A1V+A2Vm =0.0000D200.0030.0110000000100000 E ~
%#B 235



#====Done (2019-Dec-11,18:16:23z)====

and

# castor	#Q62208730
#!Vl=VizieR (local): No table found for: castor
#=Sa=Simbad@CfA:    1    52ms
%@ 983633
%I.0 * alf Gem
%C.0 **
%C.N0 12.13.00.00
%J 113.649471640 +31.888282216 = 07:34:35.87 +31:53:17.8
%J.E [34.72 25.95 90] A 2007A&A...474..653V
%P -191.45 -145.19 [3.95 2.95 0] A 2007A&A...474..653V
%X 64.12 [3.75] A 2007A&A...474..653V
%V v 5.40 ~ [0.5] A 2006AstL...32..759G
%S A1V+A2Vm =0.0000D200.0030.0110000000100000 E ~
%#B 235
#!Vl=VizieR (local): No table found for: castor

#(Q62208730)

#=Sa=Simbad@CfA:    1    14ms
%@ 983633
%I.0 * alf Gem
%C.0 **
%C.N0 12.13.00.00
%J 113.649471640 +31.888282216 = 07:34:35.87 +31:53:17.8
%J.E [34.72 25.95 90] A 2007A&A...474..653V
%P -191.45 -145.19 [3.95 2.95 0] A 2007A&A...474..653V
%X 64.12 [3.75] A 2007A&A...474..653V
%V v 5.40 ~ [0.5] A 2006AstL...32..759G
%S A1V+A2Vm =0.0000D200.0030.0110000000100000 E ~
%#B 235

#====Done (2019-Dec-11,18:16:23z)====

In the previous set of commits (overwritten with the rebase), this is why I was manually interfacing with the cache: I needed to set the hash to the URL instead of the file content.

So, in order to get this PR to work, we would need to add functionality to download_file and related machinery to specify or overwrite the hash...

@pllim
Member

pllim commented Dec 11, 2019

I think you already can. See "If they are concerned that the primary source of the data..." in https://astropy.readthedocs.io/en/latest/utils/data.html

@aarchiba
Contributor

I rebased and switched to using the new download_file functionality. However, this currently doesn't work because of the way CDS delivers data and the way we generate hashes for the cache. Each response from CDS may not be identical, because it includes (amongst other things) the amount of time the query took and the date/time, even when repeatedly querying with the same search string. For example, here are two responses for the same query, "Castor":

Indeed, the result is going to be different every time, and if the cache works, you will get the same result as the first successful try (unless you choose to use cache="update" and re-download). Why is this a problem?

In the previous set of commits (overwritten with the rebase), this is why I was manually interfacing with the cache: I needed to set the hash to the URL instead of the file content.

Why? The hash need not be user-visible.

So, in order to get this PR to work, we would need to add functionality to download_file and related machinery to specify or overwrite the hash...

Why? I don't understand.

If I understand what you want (maybe I don't?) you should be able to write:

response = get_file_contents(download_file("http://cds.query/goes?here", cache=True))

This will get you a fresh request the first time and a copy of that result every later time. Isn't that what you want? The hash value appears only in the filename returned from download_file and the user never needs to care about it. (I considered just replacing the hash with a UUID in this context).
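To make the URL-keyed behaviour concrete: the cache key is derived from the URL alone, never from the response body, so the varying timestamps and query ids in the Sesame responses above cannot defeat the cache. Here is a minimal stand-alone sketch of that idea (cached_fetch and its signature are invented for illustration; astropy's real cache lives in utils.data):

```python
import hashlib
import os
import tempfile

def cached_fetch(url, fetch, cache_dir=None):
    # Sketch of a URL-keyed download cache (illustration only, not
    # astropy's code): the key is a hash of the URL, so byte-level
    # differences between successive responses never matter.
    if cache_dir is None:
        cache_dir = os.path.join(tempfile.gettempdir(), "url_cache_sketch")
    os.makedirs(cache_dir, exist_ok=True)
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, key)
    if not os.path.exists(path):
        # Network hit happens only on a cache miss; every later call
        # with the same URL returns the stored copy of the first result.
        with open(path, "w") as f:
            f.write(fetch(url))
    with open(path) as f:
        return f.read()
```

The second call with the same URL reads the stored file and never invokes the fetcher again, which is exactly the "same result as the first successful try" behaviour described above.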

@astrofrog
Member

This seems worthwhile, so just wanted to give a 👍 on the idea!

@adrn adrn force-pushed the coordinates/cache-name-resolve branch from 994253d to 03c246e Compare May 1, 2020 20:17
@adrn
Member Author

adrn commented May 1, 2020

Oh right, this PR! Updated and rebased.

@adrn adrn force-pushed the coordinates/cache-name-resolve branch from 03c246e to ee51f5b Compare May 1, 2020 22:10
@adrn
Member Author

adrn commented May 1, 2020

RE: @aarchiba's comments - I don't remember why I was confused before. On revisiting this, it seemed pretty straightforward to get working, so I think the new commits address the remaining issues. The test right now is pretty "weak," so I'm open to suggestions there, but I don't think it's critical.

Member

@eteq eteq left a comment

I'm basically happy with this, but one minor consideration: I think overwrite is a bit ambiguous as a keyword on its own, which is why I was proposing cache='overwrite'. So what about overwrite->overwrite_cache?

I don't want this to block merging, though, so I'm approving since everything else looks good. And the test failure is a coverage red herring, because it's tested via remote_data.

@adrn
Member Author

adrn commented May 2, 2020

OK, sure!

Comment on lines +164 to +165
download_file(url, cache=cache, show_progress=False,
timeout=data.conf.remote_timeout))
Member

Suggested change
download_file(url, cache=cache, show_progress=False,
timeout=data.conf.remote_timeout))
download_file(url, cache=cache, show_progress=False))

timeout setting here is redundant, as it is already the default behavior:

astropy/astropy/utils/data.py

Lines 1138 to 1139 in 986d5b7

if timeout is None:
timeout = conf.remote_timeout

"you are using caching, i.e. if `cache=True`.")

if overwrite_cache and cache:
overwrite_cache = "update"
Member

A bug?

Suggested change
overwrite_cache = "update"
cache = "update"
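The suggestion above swaps which variable receives the "update" value: a truthy overwrite_cache should switch cache itself to "update" (the value download_file understands as "re-download and refresh the cache"), rather than reassigning overwrite_cache, which is never read again. A stand-alone sketch of the corrected mapping (resolve_cache_option is a hypothetical helper name, not part of the PR):

```python
def resolve_cache_option(cache, overwrite_cache):
    # Corrected logic: overwrite_cache promotes *cache* to "update";
    # the buggy version assigned to overwrite_cache instead, so the
    # value passed on to download_file never changed.
    if overwrite_cache and cache:
        cache = "update"
    return cache
```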

sub-arcsecond accuracy for coordinates.
cache : bool, optional
Determines whether to cache the results or not.
overwrite_cache : bool, optional
Member

I don't understand the need to have two different keywords to do this. Why not ask user to use cache='update' directly like in get_icrs_coordinates above?

Member Author

Sigh, fine. I just hate string arguments like this so much!!

Contributor

Maybe we can slowly move to Enum - https://docs.python.org/3/library/enum.html
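For reference, such an Enum-based option might look like the following sketch (CacheOption and normalize_cache are invented names, not anything in astropy; the values mirror download_file's bool/str cache argument):

```python
from enum import Enum

class CacheOption(Enum):
    # Hypothetical mapping of the existing bool/str values onto an Enum.
    OFF = False
    ON = True
    UPDATE = "update"

def normalize_cache(value):
    # Accept a CacheOption or a raw bool/str and return a CacheOption,
    # so a typo like cache="updtae" fails loudly with a ValueError
    # instead of being silently treated as a truthy string.
    if isinstance(value, CacheOption):
        return value
    return CacheOption(value)
```

Existing call sites passing True/False/"update" would keep working, while the Enum gives tab completion and early validation.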


icrs1 = get_icrs_coordinates("castor", cache=True)

# This is a weak test: we just check to see that a url is added to the
Member

Why not have a stronger test with astropy.utils.data.get_cached_urls?

Member Author

Because I didn't know about that :) - I'll add a new check

with shelve.open(urlmapfn) as url2hash:
assert len(url2hash) == 1

# Try reloading coordinates, now should just reload cached data:
Member

I wonder if you can temporarily disable internet with no_internet to really test this. 🤔
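The general technique behind such a helper can be sketched in plain Python by monkeypatching socket.socket.connect; this is only an illustration of the idea (fake_no_internet is an invented name), not astropy's actual no_internet utility:

```python
import socket

class fake_no_internet:
    # Context manager that makes all outgoing connections fail, so a
    # test that passes inside it proves the code under test never
    # touched the network (i.e. it really read from the cache).
    def __enter__(self):
        self._orig = socket.socket.connect
        def _refuse(*args, **kwargs):
            raise OSError("internet disabled for this test")
        socket.socket.connect = _refuse
        return self

    def __exit__(self, *exc_info):
        # Restore normal connectivity on exit.
        socket.socket.connect = self._orig
        return False
```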

@adrn
Member Author

adrn commented May 2, 2020

Question: Do we want the default for from_name() to be cache=False or cache=True?

@adrn adrn force-pushed the coordinates/cache-name-resolve branch from 739f825 to 1f76352 Compare May 2, 2020 16:40
@adrn
Member Author

adrn commented May 2, 2020

Thanks @pllim - I addressed your comments.

@adrn adrn force-pushed the coordinates/cache-name-resolve branch from f5ebb48 to 000bec4 Compare May 2, 2020 18:13
@adrn
Member Author

adrn commented May 2, 2020

Note to self: rebase when #10274 is merged (assuming that fixes the failing test)

@mhvk
Contributor

mhvk commented May 2, 2020

@adrn - on whether to cache or not by default - my preference would be something that expires, but I don't think we have that yet! Failing that, I think True makes the most sense - the use case @eteq mentioned, of finding oneself without internet and wanting to continue to work, is I think one of the strongest ones. Plus, if one is using from_name anyway, clearly one is not too worried about small changes in position.
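The expiry idea (not available in astropy at the time, as noted above) boils down to treating a cached file as absent once it passes a maximum age. A minimal sketch with invented names and an arbitrary default policy:

```python
import os
import time

def is_fresh(path, max_age_days=30):
    # Sketch of cache expiry: a cached file counts as present only if
    # it is younger than max_age_days; otherwise the caller would
    # re-download as if it were a cache miss. Names and the 30-day
    # default are hypothetical, not an astropy API.
    if not os.path.exists(path):
        return False
    age_seconds = time.time() - os.path.getmtime(path)
    return age_seconds < max_age_days * 86400
```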

@adrn adrn force-pushed the coordinates/cache-name-resolve branch from 000bec4 to 0e9644c Compare May 3, 2020 13:20
@adrn
Member Author

adrn commented May 3, 2020

Thanks @mhvk - I also think that cache=True makes sense as the default. I updated and pushed.

Contributor

@mhvk mhvk left a comment

The changelog text is out of date so I fear you'll need a ci skip update, but otherwise it all looks great.

@adrn
Member Author

adrn commented May 3, 2020

@bsipocz This is not critical and could be moved to 4.2. On the other hand, I think it is ready to be merged now. I leave it up to you!

@mhvk
Contributor

mhvk commented May 3, 2020

Well, my goal this morning was to empty the slate of 4.1-milestoned issues and PRs, so I'm going to continue to do so and merge... (hoping that I do more good than bad - though my inner dictator is having a blast, with the part of me that stifles it distracted by the beautiful weather - on a morning like this, with the sky pure blue, without brown haze on the horizon and unspoiled by contrails, the air pure and the ears filled with bird song and wind blowing through new, light green leaves, and the nose smelling flowers rather than traffic - I am hoping some things from the lockdown will stay...).

@mhvk mhvk merged commit 5e1e0c7 into astropy:master May 3, 2020
@adrn
Member Author

adrn commented May 3, 2020

@mhvk OK, thanks. Yes, about to head outside myself - my eyes aren't focusing at infinity as much as they need to be! BTW: that message should be archived under #githubpoetry!

@pllim
Member

pllim commented May 4, 2020

@adrn -- I created a #githubpoetry channel on Slack...
