Fix condition when cache keeps refreshing even error is received with RPC response by yurkeen · Pull Request #8541 · hashicorp/consul

yurkeen · 2020-08-20T09:54:32Z

Hello,

While working on the issue described and fixed in #8218 I discovered another interesting side effect.

Simple Explanation

When a DNS or HTTP query is issued to service catalog for a nonexistent datacenter, and cache is enabled, Consul agent keeps trying to refresh its cache for the request. This results in a situation, where new goroutines are being continuously created per every such request. These goroutines keep re-trying fetching the request and eventually end up in the sleeping state due to the back-off mechanism in the algorithm refreshing agent cache. However, until the affected agent restarts, refreshing goroutines keep going. Such behaviour produces a spike of RPC errors on the affected Consul client agent and elevated RPC error rate on Consul server agents. It can also enable the cache flooding attack vector to Consul agent process.

This PR prevents cache to spawn refresh goroutines for cache entries, where such entries are not valid RPC responses, letting those entries simply expire.

How To Read This PR

Commit 9860e3b: Before making further changes, I felt like the code needed some tidying up for better logic comprehension. In the Cache.fetch() method I rearranged conditionals for the part checking cache hit/miss (if ok {...} else {...}) as well as for the part manipulating the error returned from request.Fetch() (if err != nil {...} else {...}). The logic and the resulting values should not change in both cases.
Commit ae04c35: This is the next step and it does change some of the logic by removing the bool entryExists value returned from the Cache.getEntryLocked() method.
Commit 2afd0b0: Finally removes one more return value from the values returned by Cache.getEntryLocked() – bool entryValid.

See the Reasoning For Changes section below for more details.

Full Explanation And Reasoning

I. Walkthorugh On Consul Cache

When cache is enabled for a DNS or HTTP request to obtain service nodes from a Consul agent, instead of directly issuing an RPC request using agent.RPC("Health.ServiceNodes", &args, &out) (or agent.RPC("Catalog.ServiceNodes", &args, &out) respectively), Consul is trying to obtain the request from its cache mechanism with the agent.cache.Get(context.TODO(), cachetype..., &args) call.
Such request can be issued using Consul's DNS interface:

dig -p 8600 +noall +answer +comment srv consul.service.NONEXISTENT.consul-testing @127.0.0.201

Or, alternatively, using the HTTP interface:

curl 'http://127.0.0.201:8500/v1/catalog/service/consul?dc=NONEXISTENT&cached=true'

The logs visible on client agent:

2020-08-20T05:08:13.867+0100 [ERROR] agent.client: RPC failed to server: method=Health.ServiceNodes server=127.0.0.103:8300 error="rpc error making call: No path to datacenter"
2020-08-20T05:08:16.527+0100 [ERROR] agent.client: RPC failed to server: method=Health.ServiceNodes server=127.0.0.101:8300 error="rpc error making call: No path to datacenter"
2020-08-20T05:08:17.617+0100 [ERROR] agent.client: RPC failed to server: method=Health.ServiceNodes server=127.0.0.102:8300 error="rpc error making call: No path to datacenter"
... on and on ...

The logs visible on server agents:

...
2020-08-20T07:52:49.536+0100 [WARN]  agent.server.rpc: RPC request for DC is currently failing as no path was found: datacenter=nonexistent method=Health.ServiceNodes
2020-08-20T07:52:49.537+0100 [WARN]  agent.server.rpc: RPC request for DC is currently failing as no path was found: datacenter=nonexistent method=Health.ServiceNodes
2020-08-20T07:52:49.537+0100 [WARN]  agent.server.rpc: RPC request for DC is currently failing as no path was found: datacenter=nonexistent method=Health.ServiceNodes
2020-08-20T07:52:49.538+0100 [WARN]  agent.server.rpc: RPC request for DC is currently failing as no path was found: datacenter=nonexistent method=Health.ServiceNodes
... on and on ...

By default, and for the most of cache types, refreshing is disabled:

consul/agent/cache/cache.go

Lines 189 to 192 in be01c42

    
           // Refresh configures whether the data is actively refreshed or if 
        
           // the data is only refreshed on an explicit Get. The default (false) 
        
           // is to only request data on explicit Get. 
        
           Refresh bool

However, it is explicitly enabled for the HealthServices and CatalogServices cache types with the RegisterOptionsBlockingRefresh options set:

consul/agent/cache-types/options.go

Lines 15 to 23 in 12acdd7

    
           func (r RegisterOptionsBlockingRefresh) RegisterOptions() cache.RegisterOptions { 
        
           	return cache.RegisterOptions{ 
        
           		// Maintain a blocking query, retry dropped connections quickly 
        
           		Refresh:          true, 
        
           		SupportsBlocking: true, 
        
           		RefreshTimer:     0 * time.Second, 
        
           		QueryTimeout:     10 * time.Minute, 
        
           	} 
        
           }

The logic for fetching items from cache is wrapped into three layers.

First is the Cache.getWithIndex() method. This is where cache is checked using the Cache.getEntryLocked() method. When this method returns with a valid value, i.e. cache hit, such values is forwarded back to the caller. On cache miss the Cache.fetch() method is called, which re-tries checking the cache and returns the waiterCh channel if it's a miss. Once waiterCh is closed, the logic follows back to re-check the cache again as the value should be there by that moment.
The second layer, Cache.fetch() method is directly responsible for background refresh of cache entries. Initially, it calls the Cache.getEntryLocked() method which checks whether there is a valid entry in cache, and when there is – a closed waiterCh channel is returned back to Cache.getWithIndex(). Otherwise, a goroutine, which calls Cache.fetch() recursively, is started. This is where I believe the problem occurs: an invalid/empty cache entry keeps refreshing because the logic is to continue spawning a new Cache.fetch() goroutine with having entryExists (but !entryValid at the same time) instead of returning the waiterCh early without spawning a new Cache.fetch().
The third layer is the Cache.getEntryLocked() method (original). It checks for and validates cache entries.

II. Reasoning For Changes

Removing the bool entryExists return value from the Cache.getEntryLocked() method. A nonexistent and invalid entries share the same set of properties: entry.Valid set to false and entry.Value being an empty interface{}. Effectively this fact makes it a single case. In the codeentryExists was used exactly in two cases:
- First, when creating an empty cache entry if one does not exist or invalid:
  https://github.com/hashicorp/consul/blob/9860e3be4896c29438210ba9e621ee84cf95c554/agent/cache/cache.go#L477-L487
- Second, when a cache entry does not exist and we are not allowing new values with the !allowNew flag:
  https://github.com/hashicorp/consul/blob/9860e3be4896c29438210ba9e621ee84cf95c554/agent/cache/cache.go#L489-L498
  In both the above cases, the cache entry{} received from Cache.getEntryLocked() resulted in the same set of flags set and is treated the same way down the logic in the code.
Removing the bool entryValid return value from the Cache.getEntryLocked() method. Through the code entryValid return value simply reflects the value of the entry.Valid field. Removing duplicate variable simplifies the code by reusing existing and immediate value of entry.Valid.
Now will logic simplified, the Cache.fetch() method returns on an invalid (same as missing) entry and does not spawn additional goroutines for cache refresh.

P. S.
I sincerely hope my thought train makes some sense. Possibly I overlooked a thing or a few, so I'm ready to work this PR out where necessary.

yurkeen · 2020-08-20T11:01:23Z

Looking into the broken tests now, might need some help here.

dnephin

Thank you for working on this problem and opening this PR! Sorry it has taken some time for us to respond.

I believe the cache refreshing even on RPC error is the desired behaviour. If the RPC failure is due to a temporary network failure, or because something was deleted and is about to be recreated, then continuing to refresh seems correct.

Maybe we should instead add some kind of maximum attempts? That way if the failures continue for a long period of time, we stop refreshing sooner. Currently the refresh will happen until the 3day TTL is hit, which is a pretty long time to be refreshing something with errors.

dnephin · 2021-01-06T00:05:54Z

agent/cache/cache.go

-	key string,
-	info RequestInfo,
-) (entryExists bool, entryValid bool, entry cacheEntry) {
+func (c *Cache) getEntryLocked(tEntry typeEntry, key string, info RequestInfo) (entry cacheEntry) {


I attempted to make a similar change at some point, but ran into the same test failures you have here. I believe there is a subtle reason that both exists and valid are required.

I ended up taking a bit of a different approach in this branch dnephin/agent-cache-fetch...dnephin/agent-cache-fetch-2 by replacing the function with isEntryValid(), which I think improves things a bit, because only one of the call sites cares about the existence of the entry.

Unfortunately that was mixed up with some other changes and never got into a state that was ready for a merge.

dnephin · 2021-01-28T18:32:59Z

Thank you again for working on this issue! I've created #9588 to track the change that I think will fix this problem. Since this PR has conflicts and does not pass tests I'm going to close it. If you are still interested in contributing a fix for this problem I would be happy to help out and discuss further in #9588.

yurkeen added 4 commits September 8, 2020 19:02

Re-group ok and err condition blocks for easier code comprehension

67c8872

Remove entryExists value returned from getEntryLocked()

e586155

Remove entryValid value returned from getEntryLocked()

b8324a3

Fixing tests

58d4604

yurkeen force-pushed the fix-refreshing-nonexistent-entries branch from 2afd0b0 to 58d4604 Compare September 11, 2020 08:59

dnephin added theme/agent-cache Agent Cache type/bug Feature does not function as expected labels Jan 6, 2021

dnephin reviewed Jan 6, 2021

View reviewed changes

dnephin mentioned this pull request Jan 19, 2021

agent/cache: stop refreshing a cache entry after some number of failed attempts #9588

Open

dnephin closed this Jan 28, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix condition when cache keeps refreshing even error is received with RPC response #8541

Fix condition when cache keeps refreshing even error is received with RPC response #8541
yurkeen wants to merge 4 commits intohashicorp:masterfrom
yurkeen:fix-refreshing-nonexistent-entries

yurkeen commented Aug 20, 2020

Uh oh!

yurkeen commented Aug 20, 2020

Uh oh!

dnephin left a comment

Uh oh!

dnephin Jan 6, 2021

Uh oh!

dnephin commented Jan 28, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	// Refresh configures whether the data is actively refreshed or if
	// the data is only refreshed on an explicit Get. The default (false)
	// is to only request data on explicit Get.
	Refresh bool

	func (r RegisterOptionsBlockingRefresh) RegisterOptions() cache.RegisterOptions {
	return cache.RegisterOptions{
	// Maintain a blocking query, retry dropped connections quickly
	Refresh: true,
	SupportsBlocking: true,
	RefreshTimer: 0 * time.Second,
	QueryTimeout: 10 * time.Minute,
	}
	}

Conversation

yurkeen commented Aug 20, 2020

Simple Explanation

How To Read This PR

Full Explanation And Reasoning

I. Walkthorugh On Consul Cache

II. Reasoning For Changes

Uh oh!

yurkeen commented Aug 20, 2020

Uh oh!

dnephin left a comment

Choose a reason for hiding this comment

Uh oh!

dnephin Jan 6, 2021

Choose a reason for hiding this comment

Uh oh!

dnephin commented Jan 28, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants