storeliveness: smear storeliveness heartbeat messages to reduce goroutine spikes at heartbeat interval tick#156830
Conversation
Potential Bug(s) DetectedThe three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation. Next Steps: Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary. After you review the findings, please tag the issue as follows:
|
44f2c61 to
fd0c485
Compare
Potential Bug(s) DetectedThe three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation. Next Steps: Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary. After you review the findings, please tag the issue as follows:
|
fd0c485 to
a065c78
Compare
miraradeva
left a comment
There was a problem hiding this comment.
Looks great. A few nits and one question for a potential simplification.
@miraradeva reviewed 8 of 8 files at r1, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @iskettaneh)
pkg/kv/kvserver/storeliveness/support_manager.go line 38 at r1 (raw file):
type MessageSender interface { EnqueueMessage(ctx context.Context, msg slpb.Message) (sent bool) SendAllEnqueuedMessages(ctx context.Context)
Question: if we always call SendAllEnqueueMessages right after EnqueueMessages, can we just include that call inside EnqueueMessage and let the support manager be agnostic to this signaling? It will mimic the old behavior: as soon as the first message is enqueued, we start building the batch (in the new logic, we start the smearing).
If this works, it will also help consolidate the cluster-setting checking logic in one place: in EnqueueMessage, if the cluster setting is on, call SendAllEnqueueMessages (or directly signal all the sendMessage channels); otherwise, signal all the directMessage channels.
pkg/kv/kvserver/storeliveness/transport.go line 110 at r1 (raw file):
// Each node maintains a single instance of Transport that handles sending and // receiving messages on behalf of all stores on the node. //
nit: we want to keep the empty lines between paragraphs.
pkg/kv/kvserver/storeliveness/transport.go line 60 at r1 (raw file):
"otherwise heartbeat sends are sent when they are enqueued "+ "at the sendQueue, bypassing heartbeat smearing",
nit: stray new line.
pkg/kv/kvserver/storeliveness/transport.go line 61 at r1 (raw file):
"at the sendQueue, bypassing heartbeat smearing", true,
You can set this metamorphically to false, to get some test coverage for the off case.
pkg/kv/kvserver/storeliveness/transport.go line 64 at r1 (raw file):
) var HeartbeatSmearingRefreshInterval = settings.RegisterDurationSetting(
nit: all these global cluster settings should have a comment.
pkg/kv/kvserver/storeliveness/transport.go line 99 at r1 (raw file):
// sendQueue is a queue of outgoing Messages. type sendQueue struct {
nit: stray empty line
pkg/kv/kvserver/storeliveness/transport.go line 236 at r1 (raw file):
} // Start background goroutine to act as the transport smearing sender.
nit: maybe this can live in a separate function, so it doesn't make the transport constructor so big.
pkg/kv/kvserver/storeliveness/transport.go line 578 at r1 (raw file):
defer idleTimer.Stop() batch := &slpb.MessageBatch{}
nit: stray new line.
pkg/kv/kvserver/storeliveness/transport_test.go line 733 at r1 (raw file):
ctx := context.Background() st := cluster.MakeTestingClusterSettings() HeartbeatSmearingEnabled.Override(ctx, &st.SV, true)
nit: it's on by default, right?
pkg/kv/kvserver/storeliveness/transport_test.go line 799 at r1 (raw file):
"all messages should be delivered within reasonable time") // Disable smearing by setting the pacing refresh interval to 0 and ensure messages
nit: maybe the refresh interval = 0 case can be a separate test? It does test important behavior.
iskettaneh
left a comment
There was a problem hiding this comment.
Looks good 🔥
Added a couple of nits and one question.
Can you check the bug detector? Is this scenario (at the very end) possible: https://github.com/cockroachdb/cockroach/actions/runs/19071904390 ?
@iskettaneh reviewed 7 of 8 files at r1, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @dodeca12 and @miraradeva)
pkg/kv/kvserver/storeliveness/transport.go line 236 at r1 (raw file):
Previously, miraradeva (Mira Radeva) wrote…
nit: maybe this can live in a separate function, so it doesn't make the transport constructor so big.
+1
pkg/kv/kvserver/storeliveness/transport.go line 60 at r1 (raw file):
"otherwise heartbeat sends are sent when they are enqueued "+ "at the sendQueue, bypassing heartbeat smearing",
nit: extra line
pkg/kv/kvserver/storeliveness/transport.go line 99 at r1 (raw file):
// sendQueue is a queue of outgoing Messages. type sendQueue struct {
nit: extra line
pkg/kv/kvserver/storeliveness/transport.go line 632 at r1 (raw file):
t.metrics.SendQueueBytes.Dec(batchMessagesSize) if len(batch.Messages) > 0 {
I don't think this check is needed, since we call continue a few lines above if we don't have messages in the batch.
pkg/kv/kvserver/storeliveness/testutils.go line 105 at r1 (raw file):
func (tms *testMessageSender) SendAllEnqueuedMessages(_ context.Context) { // No-op for testing
nit: end the comment with a dot.
pkg/kv/kvserver/storeliveness/transport_test.go line 904 at r1 (raw file):
// the setting on the next iteration when it processes a message, so no // dummy message is needed to synchronize. testutils.SucceedsSoon(t, func() error {
Is this check needed? Didn't we verify that there were 2 received messages above?
a065c78 to
c9914d4
Compare
dodeca12
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @iskettaneh and @miraradeva)
pkg/kv/kvserver/storeliveness/support_manager.go line 38 at r1 (raw file):
Previously, miraradeva (Mira Radeva) wrote…
Question: if we always call
SendAllEnqueueMessagesright afterEnqueueMessages, can we just include that call insideEnqueueMessageand let the support manager be agnostic to this signaling? It will mimic the old behavior: as soon as the first message is enqueued, we start building the batch (in the new logic, we start the smearing).If this works, it will also help consolidate the cluster-setting checking logic in one place: in
EnqueueMessage, if the cluster setting is on, callSendAllEnqueueMessages(or directly signal all thesendMessagechannels); otherwise, signal all thedirectMessagechannels.
Done.
pkg/kv/kvserver/storeliveness/transport.go line 61 at r1 (raw file):
Previously, miraradeva (Mira Radeva) wrote…
You can set this metamorphically to false, to get some test coverage for the off case.
I'm not clear on what you mean here; do you want me to add tests that cover the off case? If set to off, most of the tests pass, except a few that deal with smearing behaviour and logic
pkg/kv/kvserver/storeliveness/transport.go line 632 at r1 (raw file):
Previously, iskettaneh wrote…
I don't think this check is needed, since we call continue a few lines above if we don't have messages in the batch.
Done.
pkg/kv/kvserver/storeliveness/transport_test.go line 733 at r1 (raw file):
Previously, miraradeva (Mira Radeva) wrote…
nit: it's on by default, right?
it is, but I thought I'd be more explicit in this test. Will remove
pkg/kv/kvserver/storeliveness/transport_test.go line 904 at r1 (raw file):
Previously, iskettaneh wrote…
Is this check needed? Didn't we verify that there were 2 received messages above?
Done.
miraradeva
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @dodeca12 and @iskettaneh)
pkg/kv/kvserver/storeliveness/transport.go line 61 at r1 (raw file):
Previously, dodeca12 (Swapneeth Gorantla) wrote…
I'm not clear on what you mean here; do you want me to add tests that cover the off case? If set to off, most of the tests pass, except a few that deal with smearing behaviour and logic
When you have a cluster setting, in addition to setting it to true of false, you can also set it to randomly pick a value for a given test. So, for the tests where you verify the smearing behavior, you'd need to set it explicitly to true. In the rest of the tests in the code base (where storeliveness is just a low-level component), having random values for the cluster setting will expand the test coverage.
Here's an example:
.c9914d4 to
b91d4a7
Compare
TIL! I was only aware of |
dodeca12
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @iskettaneh and @miraradeva)
pkg/kv/kvserver/storeliveness/transport.go line 61 at r1 (raw file):
Previously, miraradeva (Mira Radeva) wrote…
When you have a cluster setting, in addition to setting it to true of false, you can also set it to randomly pick a value for a given test. So, for the tests where you verify the smearing behavior, you'd need to set it explicitly to true. In the rest of the tests in the code base (where storeliveness is just a low-level component), having random values for the cluster setting will expand the test coverage.
Here's an example:
.
Done.
b91d4a7 to
c740ef3
Compare
9a564c4 to
b0ce908
Compare
|
The subtle race condition as described here: https://github.com/cockroachdb/cockroach/actions/runs/19071904390 already exists in the code without any of my changes when viewing master. I have a fix for this; will put up the fix in a new PR so as to logically separate changes. |
miraradeva
left a comment
There was a problem hiding this comment.
Nice catch! I think we probably are (have been?) aware of this. Raft transport has the same issue. Both protocols can handle message loss by design, and the leaked queue reference gets freed up (and should be GCed) at the end of EnqueueMessage, right? Unless you and others see bigger implications of this, I don't think you necessarily need to fix this as part of this work.
@miraradeva reviewed 2 of 5 files at r2, 1 of 3 files at r3, 2 of 2 files at r5, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @dodeca12 and @iskettaneh)
pkg/kv/kvserver/storeliveness/transport_test.go line 733 at r1 (raw file):
Previously, dodeca12 (Swapneeth Gorantla) wrote…
it is, but I thought I'd be more explicit in this test. Will remove
Yeah, you might need it with the metamorphic cluster setting.
pkg/kv/kvserver/storeliveness/support_manager.go line 332 at r5 (raw file):
} }
nit: stray new line.
pkg/kv/kvserver/storeliveness/support_manager.go line 430 at r5 (raw file):
_ = sm.sender.EnqueueMessage(ctx, response) }
nit: stray new line.
pkg/kv/kvserver/storeliveness/transport.go line 56 at r5 (raw file):
// // HeartbeatSmearingEnabled controls whether heartbeat sends are distributed over // time to avoid spking the number of runnable goroutines. When enabled,
nit: "spiking"
pkg/kv/kvserver/storeliveness/transport.go line 226 at r5 (raw file):
ctx context.Context, stopper *stop.Stopper, settings *clustersettings.Settings, ) error { return stopper.RunAsyncTask(
To help with tracing and readability, check out how the startLoop goroutine is kicked off in the support manager constructor.
pkg/kv/kvserver/storeliveness/transport.go line 493 at r5 (raw file):
// Signal the processQueue goroutine if in direct mode (smearing disabled). if !t.SendHeartbeatsSmeared() {
Now that this cluster setting is only checked here, maybe you don't need the separate function anymore and you can inline it here?
I like how this turned out btw.
dodeca12
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @iskettaneh, @miraradeva, and @stevendanna)
pkg/kv/kvserver/storeliveness/transport.go line 226 at r5 (raw file):
Previously, miraradeva (Mira Radeva) wrote…
To help with tracing and readability, check out how the
startLoopgoroutine is kicked off in the support manager constructor.
Done.
pkg/kv/kvserver/storeliveness/transport.go line 493 at r5 (raw file):
Previously, miraradeva (Mira Radeva) wrote…
Now that this cluster setting is only checked here, maybe you don't need the separate function anymore and you can inline it here?
I like how this turned out btw.
Done.
Previously, heartbeat messages were sent immediately when enqueued via `EnqueueMessage`. In large clusters, many stores might all send heartbeats simultaneously at the same tick interval, causing a spike in runnable goroutines that caused issues elsewhere in the database process. This patch introduces heartbeat smearing logic that batches and smears heartbeat sends over a configurable duration using a taskpacer. The smearing logic is enabled by default via the `kv.store_liveness.heartbeat_smearing.enabled` cluster setting (defaults to true), and can be configured via `kv.store_liveness.heartbeat_smearing.refresh` (default: 10ms) and `kv.store_liveness.heartbeat_smearing.smear` (default: 1ms) settings. When enabled, messages are enqueued but not sent immediately. Instead, they wait in per-node queues until `SendAllEnqueuedMessages()` is called, which signals the coordinator. The coordinator waits briefly (`batchDuration`) to collect messages from multiple stores, then paces the signaling to each queue's `processQueue` goroutine using the pacer to spread sends over time. Fixes: cockroachdb#148210 Release note: None
b0ce908 to
824bdef
Compare
miraradeva
left a comment
There was a problem hiding this comment.
Thanks for your hard work on this!
Ship it, pending @iskettaneh's approval.
iskettaneh
left a comment
There was a problem hiding this comment.
Great work! 🔥
@iskettaneh reviewed 1 of 5 files at r2, 1 of 2 files at r6.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @miraradeva and @stevendanna)
pkg/kv/kvserver/storeliveness/transport.go line 299 at r6 (raw file):
} } toSignal = toSignal[:0]
Curious: Did some test break and you needed to add this? Or is it just to make things explicit?
just to be more explicit. |
|
bors r=miraradeva,iskettaneh |
`storeliveness/doc.go` is outdated with respect to heartbeat smearing changes. In particular, the `Transport` section doesn't have information about heartbeat smearing. Additionally, the `Configuration` section didn't detail the heartbeat smearing cluster setting `kv.store_liveness.heartbeat_smearing.enabled`. Added a dedicated bullet point in the `Transport` section (`5.2`) that explicitly describes heartbeat smearing as a feature to avoid goroutine spikes. Also updated the `Configuration` section (`5.3`) to detail that `kv.store_liveness.heartbeat_smearing.enabled` is available, and describes the behaviour of heartbeat sends when the cluster setting is enabled or disabled. Informs: cockroachdb#156830 Release note: None
`storeliveness/doc.go` is outdated with respect to heartbeat smearing changes. In particular, the `Transport` section doesn't have information about heartbeat smearing. Additionally, the `Configuration` section didn't detail the heartbeat smearing cluster setting `kv.store_liveness.heartbeat_smearing.enabled`. Added a dedicated bullet point in the `Transport` section (`5.2`) that explicitly describes heartbeat smearing as a feature to avoid goroutine spikes. Also updated the `Configuration` section (`5.3`) to detail that `kv.store_liveness.heartbeat_smearing.enabled` is available, and describes the behaviour of heartbeat sends when the cluster setting is enabled or disabled. Informs: cockroachdb#156830 Release note: None
157131: catalog/lease: fix error handling for purgeOldVersions r=fqazi a=fqazi Previously, when purging old versions we would acquire a lease on the latest version, which was introduced when we added logic for acquiring leases on the previous version. The logic to acquire the previous version would clear the error, preventing errors from correctly surfacing and causing issues with dropped / offline descriptors. To address this, ensure the acquisition logic for the previous version is only executed if a clean up will occur. Informs: #156176 Release note: None 157792: storelivenss: update `storeliveness/doc.go` with heartbeat smearing info r=miraradeva a=dodeca12 `storeliveness/doc.go` is outdated with respect to heartbeat smearing changes. In particular, the `Transport` section doesn't have information about heartbeat smearing. Additionally, the `Configuration` section didn't detail the heartbeat smearing cluster setting `kv.store_liveness.heartbeat_smearing.enabled`. Added a dedicated bullet point in the `Transport` section (`5.2`) that explicitly describes heartbeat smearing as a feature to avoid goroutine spikes. Also updated the `Configuration` section (`5.3`) to detail that `kv.store_liveness.heartbeat_smearing.enabled` is available, and describes the behaviour of heartbeat sends when the cluster setting is enabled or disabled. Informs: #156830 Release note: None 157915: kvfollowerreadsccl: maybe deflake TestBoundedStalenessDataDriven r=stevendanna a=stevendanna This test determines what events occur by parsing the trace. In some cases, the parsing it was using to determine a "local read followed by remote leaseholder read" didn't account for changes in the potential trace messages encountered when leader leases are enabled. Here, I widen the scope of the trace parsing. Locally under stress this elliminated the previously encountered failure: ``` datadriven.go:357: ... SNIP ... boundedstaleness/single_row:24: still running after 10.000889738s ... SNIP ... boundedstaleness_test.go:405: condition failed to evaluate within 45s: from boundedstaleness_test.go:436: not yet a match, output: 1 events (1 found): * event 1: colbatchscan trace on node_idx 2: local read datadriven.go:343: ``` Fixes #154710 Release note: None Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com> Co-authored-by: Swapneeth Gorantla <swapneeth.gorantla@cockroachlabs.com> Co-authored-by: Steven Danna <danna@cockroachlabs.com>
This PR introduces heartbeat smearing logic that batches and smears Store Liveness heartbeat sends across destination nodes to prevent thundering herd of goroutine spikes.
Changes
Core changes are within these files:
Background
Previously, all stores in a cluster sent heartbeats immediately at each heartbeat interval tick. In large clusters with multi-store nodes, this created synchronized bursts of goroutine spikes that caused issues in other parts of the running CRDB process.
Commits
Commit: Introduce heartbeat smearing
transport.gothat batches enqueued messagestaskpacerto spread traffic over timeHow it works:
EnqueueMessage()into per-node queuesSendAllEnqueuedMessages()is called, transport's smearing sender goroutine waits briefly to batch messagestaskpacerto pace signaling to each queue over a smear durationprocessQueuegoroutine drains its queue and sends when signalledNew Cluster Settings
kv.store_liveness.heartbeat_smearing.enabled(default: true) - Enable/disable smearingkv.store_liveness.heartbeat_smearing.refresh(default: 10ms) - Batching window durationkv.store_liveness.heartbeat_smearing.smear(default: 1ms) - Time to spread sends across queuesBackward Compatibility
kv.store_liveness.heartbeat_smearing.enabled=falseTesting
All existing tests updated to call
SendAllEnqueuedMessages()after enqueuing when smearing is enabled.Roachprod testing
Prototype #1
SupportManagergoroutines being put to sleep; this current design ensures thatSupportManagergoroutines do not get blocked) on a roachprod with 150 node cluster to verify smearing works.Current changes
Ran a roachprod test with current changes but without the check for empty queues (more info - https://reviewable.io/reviews/cockroachdb/cockroach/156378#-). This check did end up proving vital, as the test results didn't show the expected smearing behaviour.
Ran a mini-roachprod test on this prototype commit (where the only real difference between my changes is the inclusion of the length check on the queues that have messages on that commit) showed expected smearing behaviour.
Fixes: #148210
Release note: None