Overview of the Issue
With many services registered whose health-check output changes frequently, we noticed very high CPU usage on the Consul servers. After restarting the Consul cluster, CPU usage starts to climb gradually again after check_update_interval (24h) / 2, eventually reaching roughly the same high level as before the restart.
The Raft index is also advancing at a rate of ~5000 per minute, which causes problems for health-check blocking queries as well.
Causes
After investigation, we noticed that DeferCheck timers are leaking. We confirmed this with the following patch applied to agent/local/state.go:
Patch
--- state.go 2023-08-10 15:18:29.734073200 +0800
+++ state.go.bk-evidence 2023-08-10 15:20:22.963909000 +0800
@@ -26,6 +26,8 @@
"github.com/mitchellh/copystructure"
)
+var deferCheckCount atomic.Int64
+
var StateCounters = []prometheus.CounterDefinition{
{
Name: []string{"acl", "blocked", "service", "registration"},
@@ -231,6 +233,13 @@
agentEnterpriseMeta: *structs.NodeEnterpriseMetaInPartition(c.Partition),
}
l.SetDiscardCheckOutput(c.DiscardCheckOutput)
+
+ go func() {
+ for {
+ time.Sleep(time.Second * 10)
+ l.logger.Info("COUNT deferCheck", "count", deferCheckCount.Load())
+ }
+ }()
return l
}
@@ -710,12 +719,16 @@
if l.config.CheckUpdateInterval > 0 && c.Check.Status == status {
c.Check.Output = output
if c.DeferCheck == nil {
+ deferCheckCount.Add(1)
d := l.config.CheckUpdateInterval
intv := time.Duration(uint64(d)/2) + lib.RandomStagger(d)
c.DeferCheck = time.AfterFunc(intv, func() {
l.Lock()
defer l.Unlock()
+ deferCheckCount.Add(-1)
+ cCreated := c
+
c := l.checks[id]
if c == nil {
return
@@ -724,6 +737,9 @@
if c.Deleted {
return
}
+
+ l.logger.Info("UpdateCheck deferfunc triggered", "check", c.Check.CheckID, "ref", cCreated)
+
c.InSync = false
l.TriggerSyncChanges()
})
@@ -834,6 +850,10 @@
existing := l.checks[id]
if existing != nil {
c.InSync = c.Check.IsSame(existing.Check)
+
+ if existing.DeferCheck != nil && c.DeferCheck == nil {
+ l.logger.Error("FATAL!!! overwriting existing check with defered check to nil, deferCheck leaked!!!", "check", id)
+ }
}
l.checks[id] = c
@@ -1276,6 +1296,7 @@
if c.DeferCheck != nil {
c.DeferCheck.Stop()
c.DeferCheck = nil
+ deferCheckCount.Add(-1)
}
err = l.syncCheck(id)
default:
@@ -1392,6 +1413,7 @@
c := l.checks[id]
if c != nil && c.DeferCheck != nil {
c.DeferCheck.Stop()
+ deferCheckCount.Add(-1)
}
delete(l.checks, id)
}
With the patch applied, we saw the following messages in the log:
2023-08-10T03:14:48.911Z [ERROR] agent: FATAL!!! overwriting existing check with defered check to nil, deferCheck leaked!!!: check=service:_nomad-task-68721c9c-cc5b-205b-bf53-9e937ca0f75c-group-search-hub-api-ldd-1651-group-search-hub-api-ldd-1651-9090-sidecar-proxy:1
2023-08-10T03:14:48.911Z [ERROR] agent: FATAL!!! overwriting existing check with defered check to nil, deferCheck leaked!!!: check=service:_nomad-task-68721c9c-cc5b-205b-bf53-9e937ca0f75c-group-search-hub-api-ldd-1651-group-search-hub-api-ldd-1651-9090-sidecar-proxy:2
2023-08-10T03:14:48.962Z [ERROR] agent: FATAL!!! overwriting existing check with defered check to nil, deferCheck leaked!!!: check=_nomad-check-caf1abcff4b2365c91dd87054358af72e34816b3
2023-08-10T03:14:48.976Z [ERROR] agent: FATAL!!! overwriting existing check with defered check to nil, deferCheck leaked!!!: check=service:_nomad-task-397dad29-b7d0-456e-651a-d6757fc9c763-group-device-sampling-api-ldd-875-group-device-sampling-api-ldd-875-10000-sidecar-proxy:1
2023-08-10T03:14:48.976Z [ERROR] agent: FATAL!!! overwriting existing check with defered check to nil, deferCheck leaked!!!: check=service:_nomad-task-397dad29-b7d0-456e-651a-d6757fc9c763-group-device-sampling-api-ldd-875-group-device-sampling-api-ldd-875-10000-sidecar-proxy:2
2023-08-10T03:14:49.043Z [INFO] agent: COUNT deferCheck: count=828
On the particular node where the log was captured, only ~50 health checks were registered. Yet more than 800 DeferCheck timers had been queued after running for only ~5 minutes, and the count kept increasing, eventually putting CPU pressure on the servers when the deferred checks fire.
Resolution
func (l *State) setCheckStateLocked(c *CheckState) {
id := c.Check.CompoundCheckID()
existing := l.checks[id]
if existing != nil {
c.InSync = c.Check.IsSame(existing.Check)
+
+ if existing.DeferCheck != nil && c.DeferCheck == nil {
+ c.DeferCheck = existing.DeferCheck
+ }
}
l.checks[id] = c
// If this is a check for an aliased service, then notify the waiters.
l.notifyIfAliased(c.Check.CompoundServiceID())
l.TriggerSyncChanges()
}
Adding the above lines to setCheckStateLocked resolved the issue.
Verified by running the patched agent for some time: the number of registered DeferCheck timers is stable, dropping from an ever-increasing number to roughly the same count as the health checks on the node:
2023-08-10T05:45:02.171Z [INFO] agent: COUNT deferCheck: count=51
2023-08-10T05:45:12.172Z [INFO] agent: COUNT deferCheck: count=51
2023-08-10T05:45:22.173Z [INFO] agent: COUNT deferCheck: count=51
Too lazy to file a proper PR with tests...
Consul info for both Client and Server
Consul v1.15.4
Operating system and Environment details
Linux
