DeferCheck is leaking, causing high CPU usage and raft index churn. #18429

@fredwangwang

Description

Overview of the Issue

With many services registered whose healthcheck output changes frequently, we noticed very high CPU usage on the Consul servers. After restarting the Consul cluster, CPU usage starts to climb again roughly check_update_interval (24h) / 2 after the restart, eventually reaching about the same high level as before the restart.


The Raft index is also increasing at a rate of ~5000 per minute, which causes problems for healthcheck blocking queries as well.


Causes

After investigation we noticed that DeferCheck timers are leaking. With the following diagnostic patch applied to agent/local/state.go:

Patch
--- state.go	2023-08-10 15:18:29.734073200 +0800
+++ state.go.bk-evidence	2023-08-10 15:20:22.963909000 +0800
@@ -26,6 +26,8 @@
 	"github.com/mitchellh/copystructure"
 )
 
+var deferCheckCount atomic.Int64
+
 var StateCounters = []prometheus.CounterDefinition{
 	{
 		Name: []string{"acl", "blocked", "service", "registration"},
@@ -231,6 +233,13 @@
 		agentEnterpriseMeta: *structs.NodeEnterpriseMetaInPartition(c.Partition),
 	}
 	l.SetDiscardCheckOutput(c.DiscardCheckOutput)
+
+	go func() {
+		for {
+			time.Sleep(time.Second * 10)
+			l.logger.Info("COUNT deferCheck", "count", deferCheckCount.Load())
+		}
+	}()
 	return l
 }
 
@@ -710,12 +719,16 @@
 	if l.config.CheckUpdateInterval > 0 && c.Check.Status == status {
 		c.Check.Output = output
 		if c.DeferCheck == nil {
+			deferCheckCount.Add(1)
 			d := l.config.CheckUpdateInterval
 			intv := time.Duration(uint64(d)/2) + lib.RandomStagger(d)
 			c.DeferCheck = time.AfterFunc(intv, func() {
 				l.Lock()
 				defer l.Unlock()
 
+				deferCheckCount.Add(-1)
+				cCreated := c
+
 				c := l.checks[id]
 				if c == nil {
 					return
@@ -724,6 +737,9 @@
 				if c.Deleted {
 					return
 				}
+
+				l.logger.Info("UpdateCheck deferfunc triggered", "check", c.Check.CheckID, "ref", cCreated)
+
 				c.InSync = false
 				l.TriggerSyncChanges()
 			})
@@ -834,6 +850,10 @@
 	existing := l.checks[id]
 	if existing != nil {
 		c.InSync = c.Check.IsSame(existing.Check)
+
+		if existing.DeferCheck != nil && c.DeferCheck == nil {
+			l.logger.Error("FATAL!!! overwriting existing check with defered check to nil, deferCheck leaked!!!", "check", id)
+		}
 	}
 
 	l.checks[id] = c
@@ -1276,6 +1296,7 @@
 			if c.DeferCheck != nil {
 				c.DeferCheck.Stop()
 				c.DeferCheck = nil
+				deferCheckCount.Add(-1)
 			}
 			err = l.syncCheck(id)
 		default:
@@ -1392,6 +1413,7 @@
 	c := l.checks[id]
 	if c != nil && c.DeferCheck != nil {
 		c.DeferCheck.Stop()
+		deferCheckCount.Add(-1)
 	}
 	delete(l.checks, id)
 }

we saw the following messages in the log:

2023-08-10T03:14:48.911Z [ERROR] agent: FATAL!!! overwriting existing check with defered check to nil, deferCheck leaked!!!: check=service:_nomad-task-68721c9c-cc5b-205b-bf53-9e937ca0f75c-group-search-hub-api-ldd-1651-group-search-hub-api-ldd-1651-9090-sidecar-proxy:1
2023-08-10T03:14:48.911Z [ERROR] agent: FATAL!!! overwriting existing check with defered check to nil, deferCheck leaked!!!: check=service:_nomad-task-68721c9c-cc5b-205b-bf53-9e937ca0f75c-group-search-hub-api-ldd-1651-group-search-hub-api-ldd-1651-9090-sidecar-proxy:2
2023-08-10T03:14:48.962Z [ERROR] agent: FATAL!!! overwriting existing check with defered check to nil, deferCheck leaked!!!: check=_nomad-check-caf1abcff4b2365c91dd87054358af72e34816b3
2023-08-10T03:14:48.976Z [ERROR] agent: FATAL!!! overwriting existing check with defered check to nil, deferCheck leaked!!!: check=service:_nomad-task-397dad29-b7d0-456e-651a-d6757fc9c763-group-device-sampling-api-ldd-875-group-device-sampling-api-ldd-875-10000-sidecar-proxy:1
2023-08-10T03:14:48.976Z [ERROR] agent: FATAL!!! overwriting existing check with defered check to nil, deferCheck leaked!!!: check=service:_nomad-task-397dad29-b7d0-456e-651a-d6757fc9c763-group-device-sampling-api-ldd-875-group-device-sampling-api-ldd-875-10000-sidecar-proxy:2
2023-08-10T03:14:49.043Z [INFO]  agent: COUNT deferCheck: count=828

On the node where this log was captured, only ~50 health checks were registered. Yet more than 800 DeferCheck timers had been queued after running for ~5 minutes, and the count keeps increasing, eventually putting CPU pressure on the servers when the DeferChecks fire.

Resolution

func (l *State) setCheckStateLocked(c *CheckState) {
	id := c.Check.CompoundCheckID()
	existing := l.checks[id]
	if existing != nil {
		c.InSync = c.Check.IsSame(existing.Check)
+
+		if existing.DeferCheck != nil && c.DeferCheck == nil {
+			c.DeferCheck = existing.DeferCheck
+		}
	}

	l.checks[id] = c

	// If this is a check for an aliased service, then notify the waiters.
	l.notifyIfAliased(c.Check.CompoundServiceID())

	l.TriggerSyncChanges()
}

Adding the highlighted lines to setCheckStateLocked resolved the issue.
After running the patched agent for some time, we verified that the number of registered DeferCheck timers is stable, dropping from an ever-increasing count to roughly the number of health checks on the node:

2023-08-10T05:45:02.171Z [INFO]  agent: COUNT deferCheck: count=51
2023-08-10T05:45:12.172Z [INFO]  agent: COUNT deferCheck: count=51
2023-08-10T05:45:22.173Z [INFO]  agent: COUNT deferCheck: count=51

Too lazy to file a proper PR with tests...

Consul info for both Client and Server

Consul v1.15.4

Operating system and Environment details

Linux
