DeferCheck is leaking, causing high CPU usage and raft index churn. #18429

@fredwangwang

Description

Overview of the Issue

With many services registered whose healthcheck output changes frequently, we noticed very high CPU usage on the Consul servers. After restarting the Consul cluster, CPU usage starts to climb again roughly check_update_interval (24h) / 2 after the restart, eventually reaching about the same high level as before the restart.


The Raft index is also increasing at a rate of ~5000 per minute, which causes problems for healthcheck blocking queries as well.


Causes

After investigation we noticed that DeferCheck timers are leaking. With the following diagnostic patch applied to agent/local/state.go:

Patch
--- state.go	2023-08-10 15:18:29.734073200 +0800
+++ state.go.bk-evidence	2023-08-10 15:20:22.963909000 +0800
@@ -26,6 +26,8 @@
 	"github.com/mitchellh/copystructure"
 )
 
+var deferCheckCount atomic.Int64
+
 var StateCounters = []prometheus.CounterDefinition{
 	{
 		Name: []string{"acl", "blocked", "service", "registration"},
@@ -231,6 +233,13 @@
 		agentEnterpriseMeta: *structs.NodeEnterpriseMetaInPartition(c.Partition),
 	}
 	l.SetDiscardCheckOutput(c.DiscardCheckOutput)
+
+	go func() {
+		for {
+			time.Sleep(time.Second * 10)
+			l.logger.Info("COUNT deferCheck", "count", deferCheckCount.Load())
+		}
+	}()
 	return l
 }
 
@@ -710,12 +719,16 @@
 	if l.config.CheckUpdateInterval > 0 && c.Check.Status == status {
 		c.Check.Output = output
 		if c.DeferCheck == nil {
+			deferCheckCount.Add(1)
 			d := l.config.CheckUpdateInterval
 			intv := time.Duration(uint64(d)/2) + lib.RandomStagger(d)
 			c.DeferCheck = time.AfterFunc(intv, func() {
 				l.Lock()
 				defer l.Unlock()
 
+				deferCheckCount.Add(-1)
+				cCreated := c
+
 				c := l.checks[id]
 				if c == nil {
 					return
@@ -724,6 +737,9 @@
 				if c.Deleted {
 					return
 				}
+
+				l.logger.Info("UpdateCheck deferfunc triggered", "check", c.Check.CheckID, "ref", cCreated)
+
 				c.InSync = false
 				l.TriggerSyncChanges()
 			})
@@ -834,6 +850,10 @@
 	existing := l.checks[id]
 	if existing != nil {
 		c.InSync = c.Check.IsSame(existing.Check)
+
+		if existing.DeferCheck != nil && c.DeferCheck == nil {
+			l.logger.Error("FATAL!!! overwriting existing check with defered check to nil, deferCheck leaked!!!", "check", id)
+		}
 	}
 
 	l.checks[id] = c
@@ -1276,6 +1296,7 @@
 			if c.DeferCheck != nil {
 				c.DeferCheck.Stop()
 				c.DeferCheck = nil
+				deferCheckCount.Add(-1)
 			}
 			err = l.syncCheck(id)
 		default:
@@ -1392,6 +1413,7 @@
 	c := l.checks[id]
 	if c != nil && c.DeferCheck != nil {
 		c.DeferCheck.Stop()
+		deferCheckCount.Add(-1)
 	}
 	delete(l.checks, id)
 }

we saw the following messages in the log:

2023-08-10T03:14:48.911Z [ERROR] agent: FATAL!!! overwriting existing check with defered check to nil, deferCheck leaked!!!: check=service:_nomad-task-68721c9c-cc5b-205b-bf53-9e937ca0f75c-group-search-hub-api-ldd-1651-group-search-hub-api-ldd-1651-9090-sidecar-proxy:1
2023-08-10T03:14:48.911Z [ERROR] agent: FATAL!!! overwriting existing check with defered check to nil, deferCheck leaked!!!: check=service:_nomad-task-68721c9c-cc5b-205b-bf53-9e937ca0f75c-group-search-hub-api-ldd-1651-group-search-hub-api-ldd-1651-9090-sidecar-proxy:2
2023-08-10T03:14:48.962Z [ERROR] agent: FATAL!!! overwriting existing check with defered check to nil, deferCheck leaked!!!: check=_nomad-check-caf1abcff4b2365c91dd87054358af72e34816b3
2023-08-10T03:14:48.976Z [ERROR] agent: FATAL!!! overwriting existing check with defered check to nil, deferCheck leaked!!!: check=service:_nomad-task-397dad29-b7d0-456e-651a-d6757fc9c763-group-device-sampling-api-ldd-875-group-device-sampling-api-ldd-875-10000-sidecar-proxy:1
2023-08-10T03:14:48.976Z [ERROR] agent: FATAL!!! overwriting existing check with defered check to nil, deferCheck leaked!!!: check=service:_nomad-task-397dad29-b7d0-456e-651a-d6757fc9c763-group-device-sampling-api-ldd-875-group-device-sampling-api-ldd-875-10000-sidecar-proxy:2
2023-08-10T03:14:49.043Z [INFO]  agent: COUNT deferCheck: count=828

On the node where this log was captured, only ~50 health checks were registered. Yet more than 800 DeferCheck timers had been queued after running for ~5 minutes, and the count keeps increasing, eventually putting CPU pressure on the servers when the DeferChecks fire.

Resolution

func (l *State) setCheckStateLocked(c *CheckState) {
	id := c.Check.CompoundCheckID()
	existing := l.checks[id]
	if existing != nil {
		c.InSync = c.Check.IsSame(existing.Check)
+
+		if existing.DeferCheck != nil && c.DeferCheck == nil {
+			c.DeferCheck = existing.DeferCheck
+		}
	}

	l.checks[id] = c

	// If this is a check for an aliased service, then notify the waiters.
	l.notifyIfAliased(c.Check.CompoundServiceID())

	l.TriggerSyncChanges()
}

Adding the highlighted lines to setCheckStateLocked resolved the issue.
After running the patched agent for some time, we verified that the number of registered DeferCheck timers is stable, dropping from an ever-increasing count to roughly the number of health checks on the node:

2023-08-10T05:45:02.171Z [INFO]  agent: COUNT deferCheck: count=51
2023-08-10T05:45:12.172Z [INFO]  agent: COUNT deferCheck: count=51
2023-08-10T05:45:22.173Z [INFO]  agent: COUNT deferCheck: count=51

Too lazy to file a proper PR with tests...

Consul info for both Client and Server

Consul v1.15.4

Operating system and Environment details

Linux
