Skip to content

Fix condition when cache keeps refreshing even error is received with RPC response #8541

Closed
yurkeen wants to merge 4 commits intohashicorp:masterfrom
yurkeen:fix-refreshing-nonexistent-entries
Closed

Fix condition when cache keeps refreshing even error is received with RPC response #8541
yurkeen wants to merge 4 commits intohashicorp:masterfrom
yurkeen:fix-refreshing-nonexistent-entries

Conversation

@yurkeen
Copy link
Copy Markdown
Contributor

@yurkeen yurkeen commented Aug 20, 2020

Hello,

While working on the issue described and fixed in #8218 I discovered another interesting side effect.

Simple Explanation

When a DNS or HTTP query is issued to service catalog for a nonexistent datacenter, and cache is enabled, Consul agent keeps trying to refresh its cache for the request. This results in a situation, where new goroutines are being continuously created per every such request. These goroutines keep re-trying fetching the request and eventually end up in the sleeping state due to the back-off mechanism in the algorithm refreshing agent cache. However, until the affected agent restarts, refreshing goroutines keep going. Such behaviour produces a spike of RPC errors on the affected Consul client agent and elevated RPC error rate on Consul server agents. It can also enable the cache flooding attack vector to Consul agent process.

This PR prevents cache to spawn refresh goroutines for cache entries, where such entries are not valid RPC responses, letting those entries simply expire.


How To Read This PR

  • Commit 9860e3b: Before making further changes, I felt like the code needed some tidying up for better logic comprehension. In the Cache.fetch() method I rearranged conditionals for the part checking cache hit/miss (if ok {...} else {...}) as well as for the part manipulating the error returned from request.Fetch() (if err != nil {...} else {...}). The logic and the resulting values should not change in both cases.
  • Commit ae04c35: This is the next step and it does change some of the logic by removing the bool entryExists value returned from the Cache.getEntryLocked() method.
  • Commit 2afd0b0: Finally removes one more return value from the values returned by Cache.getEntryLocked()bool entryValid.

See the Reasoning For Changes section below for more details.


Full Explanation And Reasoning

I. Walkthorugh On Consul Cache

When cache is enabled for a DNS or HTTP request to obtain service nodes from a Consul agent, instead of directly issuing an RPC request using agent.RPC("Health.ServiceNodes", &args, &out) (or agent.RPC("Catalog.ServiceNodes", &args, &out) respectively), Consul is trying to obtain the request from its cache mechanism with the agent.cache.Get(context.TODO(), cachetype..., &args) call.
Such request can be issued using Consul's DNS interface:

dig -p 8600 +noall +answer +comment srv consul.service.NONEXISTENT.consul-testing @127.0.0.201

Or, alternatively, using the HTTP interface:

curl 'http://127.0.0.201:8500/v1/catalog/service/consul?dc=NONEXISTENT&cached=true'

The logs visible on client agent:

2020-08-20T05:08:13.867+0100 [ERROR] agent.client: RPC failed to server: method=Health.ServiceNodes server=127.0.0.103:8300 error="rpc error making call: No path to datacenter"
2020-08-20T05:08:16.527+0100 [ERROR] agent.client: RPC failed to server: method=Health.ServiceNodes server=127.0.0.101:8300 error="rpc error making call: No path to datacenter"
2020-08-20T05:08:17.617+0100 [ERROR] agent.client: RPC failed to server: method=Health.ServiceNodes server=127.0.0.102:8300 error="rpc error making call: No path to datacenter"
... on and on ...

The logs visible on server agents:

...
2020-08-20T07:52:49.536+0100 [WARN]  agent.server.rpc: RPC request for DC is currently failing as no path was found: datacenter=nonexistent method=Health.ServiceNodes
2020-08-20T07:52:49.537+0100 [WARN]  agent.server.rpc: RPC request for DC is currently failing as no path was found: datacenter=nonexistent method=Health.ServiceNodes
2020-08-20T07:52:49.537+0100 [WARN]  agent.server.rpc: RPC request for DC is currently failing as no path was found: datacenter=nonexistent method=Health.ServiceNodes
2020-08-20T07:52:49.538+0100 [WARN]  agent.server.rpc: RPC request for DC is currently failing as no path was found: datacenter=nonexistent method=Health.ServiceNodes
... on and on ...

By default, and for the most of cache types, refreshing is disabled:

consul/agent/cache/cache.go

Lines 189 to 192 in be01c42

// Refresh configures whether the data is actively refreshed or if
// the data is only refreshed on an explicit Get. The default (false)
// is to only request data on explicit Get.
Refresh bool

However, it is explicitly enabled for the HealthServices and CatalogServices cache types with the RegisterOptionsBlockingRefresh options set:
func (r RegisterOptionsBlockingRefresh) RegisterOptions() cache.RegisterOptions {
return cache.RegisterOptions{
// Maintain a blocking query, retry dropped connections quickly
Refresh: true,
SupportsBlocking: true,
RefreshTimer: 0 * time.Second,
QueryTimeout: 10 * time.Minute,
}
}

The logic for fetching items from cache is wrapped into three layers.

  • First is the Cache.getWithIndex() method. This is where cache is checked using the Cache.getEntryLocked() method. When this method returns with a valid value, i.e. cache hit, such values is forwarded back to the caller. On cache miss the Cache.fetch() method is called, which re-tries checking the cache and returns the waiterCh channel if it's a miss. Once waiterCh is closed, the logic follows back to re-check the cache again as the value should be there by that moment.
  • The second layer, Cache.fetch() method is directly responsible for background refresh of cache entries. Initially, it calls the Cache.getEntryLocked() method which checks whether there is a valid entry in cache, and when there is – a closed waiterCh channel is returned back to Cache.getWithIndex(). Otherwise, a goroutine, which calls Cache.fetch() recursively, is started. This is where I believe the problem occurs: an invalid/empty cache entry keeps refreshing because the logic is to continue spawning a new Cache.fetch() goroutine with having entryExists (but !entryValid at the same time) instead of returning the waiterCh early without spawning a new Cache.fetch().
  • The third layer is the Cache.getEntryLocked() method (original). It checks for and validates cache entries.

II. Reasoning For Changes

  1. Removing the bool entryExists return value from the Cache.getEntryLocked() method. A nonexistent and invalid entries share the same set of properties: entry.Valid set to false and entry.Value being an empty interface{}. Effectively this fact makes it a single case. In the codeentryExists was used exactly in two cases:
  2. Removing the bool entryValid return value from the Cache.getEntryLocked() method. Through the code entryValid return value simply reflects the value of the entry.Valid field. Removing duplicate variable simplifies the code by reusing existing and immediate value of entry.Valid.
  3. Now will logic simplified, the Cache.fetch() method returns on an invalid (same as missing) entry and does not spawn additional goroutines for cache refresh.

P. S.
I sincerely hope my thought train makes some sense. Possibly I overlooked a thing or a few, so I'm ready to work this PR out where necessary.

@yurkeen
Copy link
Copy Markdown
Contributor Author

yurkeen commented Aug 20, 2020

Looking into the broken tests now, might need some help here.

@yurkeen yurkeen force-pushed the fix-refreshing-nonexistent-entries branch from 2afd0b0 to 58d4604 Compare September 11, 2020 08:59
@dnephin dnephin added theme/agent-cache Agent Cache type/bug Feature does not function as expected labels Jan 6, 2021
Copy link
Copy Markdown
Contributor

@dnephin dnephin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this problem and opening this PR! Sorry it has taken some time for us to respond.

I believe the cache refreshing even on RPC error is the desired behaviour. If the RPC failure is due to a temporary network failure, or because something was deleted and is about to be recreated, then continuing to refresh seems correct.

Maybe we should instead add some kind of maximum attempts? That way if the failures continue for a long period of time, we stop refreshing sooner. Currently the refresh will happen until the 3day TTL is hit, which is a pretty long time to be refreshing something with errors.

key string,
info RequestInfo,
) (entryExists bool, entryValid bool, entry cacheEntry) {
func (c *Cache) getEntryLocked(tEntry typeEntry, key string, info RequestInfo) (entry cacheEntry) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I attempted to make a similar change at some point, but ran into the same test failures you have here. I believe there is a subtle reason that both exists and valid are required.

I ended up taking a bit of a different approach in this branch dnephin/agent-cache-fetch...dnephin/agent-cache-fetch-2 by replacing the function with isEntryValid(), which I think improves things a bit, because only one of the call sites cares about the existence of the entry.

Unfortunately that was mixed up with some other changes and never got into a state that was ready for a merge.

@dnephin
Copy link
Copy Markdown
Contributor

dnephin commented Jan 28, 2021

Thank you again for working on this issue! I've created #9588 to track the change that I think will fix this problem. Since this PR has conflicts and does not pass tests I'm going to close it. If you are still interested in contributing a fix for this problem I would be happy to help out and discuss further in #9588.

@dnephin dnephin closed this Jan 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

theme/agent-cache Agent Cache type/bug Feature does not function as expected

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants