Fix condition when cache keeps refreshing even error is received with RPC response #8541
Fix condition when cache keeps refreshing even error is received with RPC response #8541yurkeen wants to merge 4 commits intohashicorp:masterfrom
Conversation
|
Looking into the broken tests now, might need some help here. |
2afd0b0 to
58d4604
Compare
dnephin
left a comment
There was a problem hiding this comment.
Thank you for working on this problem and opening this PR! Sorry it has taken some time for us to respond.
I believe the cache refreshing even on RPC error is the desired behaviour. If the RPC failure is due to a temporary network failure, or because something was deleted and is about to be recreated, then continuing to refresh seems correct.
Maybe we should instead add some kind of maximum attempts? That way if the failures continue for a long period of time, we stop refreshing sooner. Currently the refresh will happen until the 3day TTL is hit, which is a pretty long time to be refreshing something with errors.
| key string, | ||
| info RequestInfo, | ||
| ) (entryExists bool, entryValid bool, entry cacheEntry) { | ||
| func (c *Cache) getEntryLocked(tEntry typeEntry, key string, info RequestInfo) (entry cacheEntry) { |
There was a problem hiding this comment.
I attempted to make a similar change at some point, but ran into the same test failures you have here. I believe there is a subtle reason that both exists and valid are required.
I ended up taking a bit of a different approach in this branch dnephin/agent-cache-fetch...dnephin/agent-cache-fetch-2 by replacing the function with isEntryValid(), which I think improves things a bit, because only one of the call sites cares about the existence of the entry.
Unfortunately that was mixed up with some other changes and never got into a state that was ready for a merge.
|
Thank you again for working on this issue! I've created #9588 to track the change that I think will fix this problem. Since this PR has conflicts and does not pass tests I'm going to close it. If you are still interested in contributing a fix for this problem I would be happy to help out and discuss further in #9588. |
Hello,
While working on the issue described and fixed in #8218 I discovered another interesting side effect.
Simple Explanation
When a DNS or HTTP query is issued to service catalog for a nonexistent datacenter, and cache is enabled, Consul agent keeps trying to refresh its cache for the request. This results in a situation, where new goroutines are being continuously created per every such request. These goroutines keep re-trying fetching the request and eventually end up in the sleeping state due to the back-off mechanism in the algorithm refreshing agent cache. However, until the affected agent restarts, refreshing goroutines keep going. Such behaviour produces a spike of RPC errors on the affected Consul client agent and elevated RPC error rate on Consul server agents. It can also enable the cache flooding attack vector to Consul agent process.
This PR prevents cache to spawn refresh goroutines for cache entries, where such entries are not valid RPC responses, letting those entries simply expire.
How To Read This PR
Cache.fetch()method I rearranged conditionals for the part checking cache hit/miss (if ok {...} else {...}) as well as for the part manipulating the error returned fromrequest.Fetch()(if err != nil {...} else {...}). The logic and the resulting values should not change in both cases.entryExistsvalue returned from theCache.getEntryLocked()method.Cache.getEntryLocked()– boolentryValid.See the Reasoning For Changes section below for more details.
Full Explanation And Reasoning
I. Walkthorugh On Consul Cache
When cache is enabled for a DNS or HTTP request to obtain service nodes from a Consul agent, instead of directly issuing an RPC request using
agent.RPC("Health.ServiceNodes", &args, &out)(oragent.RPC("Catalog.ServiceNodes", &args, &out)respectively), Consul is trying to obtain the request from its cache mechanism with theagent.cache.Get(context.TODO(), cachetype..., &args)call.Such request can be issued using Consul's DNS interface:
Or, alternatively, using the HTTP interface:
curl 'http://127.0.0.201:8500/v1/catalog/service/consul?dc=NONEXISTENT&cached=true'The logs visible on client agent:
The logs visible on server agents:
By default, and for the most of cache types, refreshing is disabled:
consul/agent/cache/cache.go
Lines 189 to 192 in be01c42
However, it is explicitly enabled for the
HealthServicesandCatalogServicescache types with theRegisterOptionsBlockingRefreshoptions set:consul/agent/cache-types/options.go
Lines 15 to 23 in 12acdd7
The logic for fetching items from cache is wrapped into three layers.
Cache.getEntryLocked()method. When this method returns with a valid value, i.e. cache hit, such values is forwarded back to the caller. On cache miss theCache.fetch()method is called, which re-tries checking the cache and returns thewaiterChchannel if it's a miss. OncewaiterChis closed, the logic follows back to re-check the cache again as the value should be there by that moment.Cache.getEntryLocked()method which checks whether there is a valid entry in cache, and when there is – a closedwaiterChchannel is returned back toCache.getWithIndex(). Otherwise, a goroutine, which callsCache.fetch()recursively, is started. This is where I believe the problem occurs: an invalid/empty cache entry keeps refreshing because the logic is to continue spawning a newCache.fetch()goroutine with havingentryExists(but!entryValid at the same time) instead of returning thewaiterChearly without spawning a newCache.fetch().II. Reasoning For Changes
entryExistsreturn value from theCache.getEntryLocked()method. A nonexistent and invalid entries share the same set of properties:entry.Validset tofalseandentry.Valuebeing an emptyinterface{}. Effectively this fact makes it a single case. In the codeentryExistswas used exactly in two cases:https://github.com/hashicorp/consul/blob/9860e3be4896c29438210ba9e621ee84cf95c554/agent/cache/cache.go#L477-L487
!allowNewflag:https://github.com/hashicorp/consul/blob/9860e3be4896c29438210ba9e621ee84cf95c554/agent/cache/cache.go#L489-L498
In both the above cases, the cache
entry{}received fromCache.getEntryLocked()resulted in the same set of flags set and is treated the same way down the logic in the code.entryValidreturn value from theCache.getEntryLocked()method. Through the codeentryValidreturn value simply reflects the value of theentry.Validfield. Removing duplicate variable simplifies the code by reusing existing and immediate value ofentry.Valid.Cache.fetch()method returns on an invalid (same as missing) entry and does not spawn additional goroutines for cache refresh.P. S.
I sincerely hope my thought train makes some sense. Possibly I overlooked a thing or a few, so I'm ready to work this PR out where necessary.