Skip to content

storeliveness: smear storeliveness heartbeat messages to reduce goroutine spikes at heartbeat interval tick#156830

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
dodeca12:sg_smeared_heartbeats_for_storeliveness
Nov 11, 2025
Merged

storeliveness: smear storeliveness heartbeat messages to reduce goroutine spikes at heartbeat interval tick#156830
craig[bot] merged 1 commit intocockroachdb:masterfrom
dodeca12:sg_smeared_heartbeats_for_storeliveness

Conversation

@dodeca12
Copy link
Copy Markdown

@dodeca12 dodeca12 commented Nov 4, 2025

This PR introduces heartbeat smearing logic that batches and smears Store Liveness heartbeat sends across destination nodes to prevent thundering herd of goroutine spikes.

Changes

Core changes are within these files:

pkg/kv/kvserver/storeliveness
├── support_manager.go  # Rename SendAsync→EnqueueMessage, add smearing settings
└── transport.go        # Add smearing sender goroutine to transport.go which takes care of smearing when enabled

Background

Previously, all stores in a cluster sent heartbeats immediately at each heartbeat interval tick. In large clusters with multi-store nodes, this created synchronized bursts of goroutine spikes that caused issues in other parts of the running CRDB process.

Commits

Commit: Introduce heartbeat smearing

  • Adds a smearing sender goroutine to transport.go that batches enqueued messages
  • Smears send signals across queues using taskpacer to spread traffic over time
  • Configurable via cluster settings (default: enabled)

How it works:

  1. Messages are enqueued via EnqueueMessage() into per-node queues
  2. When SendAllEnqueuedMessages() is called, transport's smearing sender goroutine waits briefly to batch messages
  3. Transport's smearing sender goroutine uses taskpacer to pace signaling to each queue over a smear duration
  4. Each processQueue goroutine drains its queue and sends when signalled

New Cluster Settings

  • kv.store_liveness.heartbeat_smearing.enabled (default: true) - Enable/disable smearing
  • kv.store_liveness.heartbeat_smearing.refresh (default: 10ms) - Batching window duration
  • kv.store_liveness.heartbeat_smearing.smear (default: 1ms) - Time to spread sends across queues

Backward Compatibility

  • Feature is disabled by setting kv.store_liveness.heartbeat_smearing.enabled=false
  • When disabled, messages are sent immediately via the existing path (non-smearing mode)

Testing

  • Added comprehensive unit tests verifying:
    • Messages batch correctly across multiple destinations
    • Smearing spreads signals over the configured time window
    • Smearing mode vs immediate mode behaviour
    • Message ordering and reliability

All existing tests updated to call SendAllEnqueuedMessages() after enqueuing when smearing is enabled.

Roachprod testing

Prototype #1
  • Ran a prototype with a similar design (TL;DR of prototype, the heartbeats were smeared via SupportManager goroutines being put to sleep; this current design ensures that SupportManager goroutines do not get blocked) on a roachprod with 150 node cluster to verify smearing works.
Before changes (current behaviour on master) After changes (prototype)
image image
Current changes
  • Ran a roachprod test with current changes but without the check for empty queues (more info - https://reviewable.io/reviews/cockroachdb/cockroach/156378#-). This check did end up proving vital, as the test results didn't show the expected smearing behaviour.

  • Ran a mini-roachprod test on this prototype commit (where the only real difference between my changes is the inclusion of the length check on the queues that have messages on that commit) showed expected smearing behaviour.

image

Fixes: #148210

Release note: None

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@dodeca12 dodeca12 requested review from a team, iskettaneh and miraradeva and removed request for a team November 4, 2025 13:49
@dodeca12 dodeca12 marked this pull request as ready for review November 4, 2025 13:50
@dodeca12 dodeca12 requested a review from a team as a code owner November 4, 2025 13:50
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Nov 4, 2025

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

  • If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
  • If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful

@github-actions github-actions bot added the o-AI-Review-Potential-Issue-Detected AI reviewer found potential issue. Never assign manually—auto-applied by GH action only. label Nov 4, 2025
@dodeca12 dodeca12 force-pushed the sg_smeared_heartbeats_for_storeliveness branch 2 times, most recently from 44f2c61 to fd0c485 Compare November 4, 2025 14:25
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Nov 4, 2025

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

  • If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
  • If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful

@dodeca12 dodeca12 force-pushed the sg_smeared_heartbeats_for_storeliveness branch from fd0c485 to a065c78 Compare November 6, 2025 22:26
Copy link
Copy Markdown
Contributor

@miraradeva miraradeva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. A few nits and one question for a potential simplification.

@miraradeva reviewed 8 of 8 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @iskettaneh)


pkg/kv/kvserver/storeliveness/support_manager.go line 38 at r1 (raw file):

type MessageSender interface {
	EnqueueMessage(ctx context.Context, msg slpb.Message) (sent bool)
	SendAllEnqueuedMessages(ctx context.Context)

Question: if we always call SendAllEnqueueMessages right after EnqueueMessages, can we just include that call inside EnqueueMessage and let the support manager be agnostic to this signaling? It will mimic the old behavior: as soon as the first message is enqueued, we start building the batch (in the new logic, we start the smearing).

If this works, it will also help consolidate the cluster-setting checking logic in one place: in EnqueueMessage, if the cluster setting is on, call SendAllEnqueueMessages (or directly signal all the sendMessage channels); otherwise, signal all the directMessage channels.


pkg/kv/kvserver/storeliveness/transport.go line 110 at r1 (raw file):

// Each node maintains a single instance of Transport that handles sending and
// receiving messages on behalf of all stores on the node.
//

nit: we want to keep the empty lines between paragraphs.


pkg/kv/kvserver/storeliveness/transport.go line 60 at r1 (raw file):

		"otherwise heartbeat sends are sent when they are enqueued "+
		"at the sendQueue, bypassing heartbeat smearing",

nit: stray new line.


pkg/kv/kvserver/storeliveness/transport.go line 61 at r1 (raw file):

		"at the sendQueue, bypassing heartbeat smearing",

	true,

You can set this metamorphically to false, to get some test coverage for the off case.


pkg/kv/kvserver/storeliveness/transport.go line 64 at r1 (raw file):

)

var HeartbeatSmearingRefreshInterval = settings.RegisterDurationSetting(

nit: all these global cluster settings should have a comment.


pkg/kv/kvserver/storeliveness/transport.go line 99 at r1 (raw file):

// sendQueue is a queue of outgoing Messages.
type sendQueue struct {

nit: stray empty line


pkg/kv/kvserver/storeliveness/transport.go line 236 at r1 (raw file):

	}

	// Start background goroutine to act as the transport smearing sender.

nit: maybe this can live in a separate function, so it doesn't make the transport constructor so big.


pkg/kv/kvserver/storeliveness/transport.go line 578 at r1 (raw file):

	defer idleTimer.Stop()
	batch := &slpb.MessageBatch{}

nit: stray new line.


pkg/kv/kvserver/storeliveness/transport_test.go line 733 at r1 (raw file):

	ctx := context.Background()
	st := cluster.MakeTestingClusterSettings()
	HeartbeatSmearingEnabled.Override(ctx, &st.SV, true)

nit: it's on by default, right?


pkg/kv/kvserver/storeliveness/transport_test.go line 799 at r1 (raw file):

		"all messages should be delivered within reasonable time")

	// Disable smearing by setting the pacing refresh interval to 0 and ensure messages

nit: maybe the refresh interval = 0 case can be a separate test? It does test important behavior.

Copy link
Copy Markdown
Contributor

@iskettaneh iskettaneh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good 🔥
Added a couple of nits and one question.

Can you check the bug detector? Is this scenario (at the very end) possible: https://github.com/cockroachdb/cockroach/actions/runs/19071904390 ?

@iskettaneh reviewed 7 of 8 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dodeca12 and @miraradeva)


pkg/kv/kvserver/storeliveness/transport.go line 236 at r1 (raw file):

Previously, miraradeva (Mira Radeva) wrote…

nit: maybe this can live in a separate function, so it doesn't make the transport constructor so big.

+1


pkg/kv/kvserver/storeliveness/transport.go line 60 at r1 (raw file):

		"otherwise heartbeat sends are sent when they are enqueued "+
		"at the sendQueue, bypassing heartbeat smearing",

nit: extra line


pkg/kv/kvserver/storeliveness/transport.go line 99 at r1 (raw file):

// sendQueue is a queue of outgoing Messages.
type sendQueue struct {

nit: extra line


pkg/kv/kvserver/storeliveness/transport.go line 632 at r1 (raw file):

		t.metrics.SendQueueBytes.Dec(batchMessagesSize)

		if len(batch.Messages) > 0 {

I don't think this check is needed, since we call continue a few lines above if we don't have messages in the batch.


pkg/kv/kvserver/storeliveness/testutils.go line 105 at r1 (raw file):

func (tms *testMessageSender) SendAllEnqueuedMessages(_ context.Context) {
	// No-op for testing

nit: end the comment with a dot.


pkg/kv/kvserver/storeliveness/transport_test.go line 904 at r1 (raw file):

	// the setting on the next iteration when it processes a message, so no
	// dummy message is needed to synchronize.
	testutils.SucceedsSoon(t, func() error {

Is this check needed? Didn't we verify that there were 2 received messages above?

@dodeca12 dodeca12 force-pushed the sg_smeared_heartbeats_for_storeliveness branch from a065c78 to c9914d4 Compare November 10, 2025 12:07
Copy link
Copy Markdown
Author

@dodeca12 dodeca12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @iskettaneh and @miraradeva)


pkg/kv/kvserver/storeliveness/support_manager.go line 38 at r1 (raw file):

Previously, miraradeva (Mira Radeva) wrote…

Question: if we always call SendAllEnqueueMessages right after EnqueueMessages, can we just include that call inside EnqueueMessage and let the support manager be agnostic to this signaling? It will mimic the old behavior: as soon as the first message is enqueued, we start building the batch (in the new logic, we start the smearing).

If this works, it will also help consolidate the cluster-setting checking logic in one place: in EnqueueMessage, if the cluster setting is on, call SendAllEnqueueMessages (or directly signal all the sendMessage channels); otherwise, signal all the directMessage channels.

Done.


pkg/kv/kvserver/storeliveness/transport.go line 61 at r1 (raw file):

Previously, miraradeva (Mira Radeva) wrote…

You can set this metamorphically to false, to get some test coverage for the off case.

I'm not clear on what you mean here; do you want me to add tests that cover the off case? If set to off, most of the tests pass, except a few that deal with smearing behaviour and logic


pkg/kv/kvserver/storeliveness/transport.go line 632 at r1 (raw file):

Previously, iskettaneh wrote…

I don't think this check is needed, since we call continue a few lines above if we don't have messages in the batch.

Done.


pkg/kv/kvserver/storeliveness/transport_test.go line 733 at r1 (raw file):

Previously, miraradeva (Mira Radeva) wrote…

nit: it's on by default, right?

it is, but I thought I'd be more explicit in this test. Will remove


pkg/kv/kvserver/storeliveness/transport_test.go line 904 at r1 (raw file):

Previously, iskettaneh wrote…

Is this check needed? Didn't we verify that there were 2 received messages above?

Done.

Copy link
Copy Markdown
Contributor

@miraradeva miraradeva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dodeca12 and @iskettaneh)


pkg/kv/kvserver/storeliveness/transport.go line 61 at r1 (raw file):

Previously, dodeca12 (Swapneeth Gorantla) wrote…

I'm not clear on what you mean here; do you want me to add tests that cover the off case? If set to off, most of the tests pass, except a few that deal with smearing behaviour and logic

When you have a cluster setting, in addition to setting it to true of false, you can also set it to randomly pick a value for a given test. So, for the tests where you verify the smearing behavior, you'd need to set it explicitly to true. In the rest of the tests in the code base (where storeliveness is just a low-level component), having random values for the cluster setting will expand the test coverage.

Here's an example:

metamorphic.ConstantWithTestChoice("kv.raft.leader_fortification.fraction_enabled",
.

@dodeca12 dodeca12 force-pushed the sg_smeared_heartbeats_for_storeliveness branch from c9914d4 to b91d4a7 Compare November 10, 2025 15:15
@dodeca12
Copy link
Copy Markdown
Author

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dodeca12 and @iskettaneh)

pkg/kv/kvserver/storeliveness/transport.go line 61 at r1 (raw file):
Previously, dodeca12 (Swapneeth Gorantla) wrote…

When you have a cluster setting, in addition to setting it to true of false, you can also set it to randomly pick a value for a given test. So, for the tests where you verify the smearing behavior, you'd need to set it explicitly to true. In the rest of the tests in the code base (where storeliveness is just a low-level component), having random values for the cluster setting will expand the test coverage.

Here's an example:

metamorphic.ConstantWithTestChoice("kv.raft.leader_fortification.fraction_enabled",

.

TIL! I was only aware of RunTrueAndFalse() which seemed like overkill and too many LOC changes to implement.

Copy link
Copy Markdown
Author

@dodeca12 dodeca12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @iskettaneh and @miraradeva)


pkg/kv/kvserver/storeliveness/transport.go line 61 at r1 (raw file):

Previously, miraradeva (Mira Radeva) wrote…

When you have a cluster setting, in addition to setting it to true of false, you can also set it to randomly pick a value for a given test. So, for the tests where you verify the smearing behavior, you'd need to set it explicitly to true. In the rest of the tests in the code base (where storeliveness is just a low-level component), having random values for the cluster setting will expand the test coverage.

Here's an example:

metamorphic.ConstantWithTestChoice("kv.raft.leader_fortification.fraction_enabled",
.

Done.

@dodeca12 dodeca12 requested a review from miraradeva November 10, 2025 15:18
@dodeca12 dodeca12 force-pushed the sg_smeared_heartbeats_for_storeliveness branch from b91d4a7 to c740ef3 Compare November 10, 2025 15:51
@dodeca12 dodeca12 force-pushed the sg_smeared_heartbeats_for_storeliveness branch 2 times, most recently from 9a564c4 to b0ce908 Compare November 10, 2025 21:11
@dodeca12
Copy link
Copy Markdown
Author

The subtle race condition as described here: https://github.com/cockroachdb/cockroach/actions/runs/19071904390 already exists in the code without any of my changes when viewing master. I have a fix for this; will put up the fix in a new PR so as to logically separate changes.

Copy link
Copy Markdown
Contributor

@miraradeva miraradeva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch! I think we probably are (have been?) aware of this. Raft transport has the same issue. Both protocols can handle message loss by design, and the leaked queue reference gets freed up (and should be GCed) at the end of EnqueueMessage, right? Unless you and others see bigger implications of this, I don't think you necessarily need to fix this as part of this work.

@miraradeva reviewed 2 of 5 files at r2, 1 of 3 files at r3, 2 of 2 files at r5, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dodeca12 and @iskettaneh)


pkg/kv/kvserver/storeliveness/transport_test.go line 733 at r1 (raw file):

Previously, dodeca12 (Swapneeth Gorantla) wrote…

it is, but I thought I'd be more explicit in this test. Will remove

Yeah, you might need it with the metamorphic cluster setting.


pkg/kv/kvserver/storeliveness/support_manager.go line 332 at r5 (raw file):

		}
	}

nit: stray new line.


pkg/kv/kvserver/storeliveness/support_manager.go line 430 at r5 (raw file):

		_ = sm.sender.EnqueueMessage(ctx, response)
	}

nit: stray new line.


pkg/kv/kvserver/storeliveness/transport.go line 56 at r5 (raw file):

//
// HeartbeatSmearingEnabled controls whether heartbeat sends are distributed over
// time to avoid spking the number of runnable goroutines. When enabled,

nit: "spiking"


pkg/kv/kvserver/storeliveness/transport.go line 226 at r5 (raw file):

	ctx context.Context, stopper *stop.Stopper, settings *clustersettings.Settings,
) error {
	return stopper.RunAsyncTask(

To help with tracing and readability, check out how the startLoop goroutine is kicked off in the support manager constructor.


pkg/kv/kvserver/storeliveness/transport.go line 493 at r5 (raw file):

	// Signal the processQueue goroutine if in direct mode (smearing disabled).
	if !t.SendHeartbeatsSmeared() {

Now that this cluster setting is only checked here, maybe you don't need the separate function anymore and you can inline it here?

I like how this turned out btw.

Copy link
Copy Markdown
Author

@dodeca12 dodeca12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @iskettaneh, @miraradeva, and @stevendanna)


pkg/kv/kvserver/storeliveness/transport.go line 226 at r5 (raw file):

Previously, miraradeva (Mira Radeva) wrote…

To help with tracing and readability, check out how the startLoop goroutine is kicked off in the support manager constructor.

Done.


pkg/kv/kvserver/storeliveness/transport.go line 493 at r5 (raw file):

Previously, miraradeva (Mira Radeva) wrote…

Now that this cluster setting is only checked here, maybe you don't need the separate function anymore and you can inline it here?

I like how this turned out btw.

Done.

Previously, heartbeat messages were sent immediately when enqueued via
`EnqueueMessage`. In large clusters, many stores might all send heartbeats
simultaneously at the same tick interval, causing a spike in runnable
goroutines that caused issues elsewhere in the database process.

This patch introduces heartbeat smearing logic that batches and smears
heartbeat sends over a configurable duration using a taskpacer. The
smearing logic is enabled by default via the
`kv.store_liveness.heartbeat_smearing.enabled` cluster setting
(defaults to true), and can be configured via
`kv.store_liveness.heartbeat_smearing.refresh` (default: 10ms) and
`kv.store_liveness.heartbeat_smearing.smear` (default: 1ms) settings.

When enabled, messages are enqueued but not sent immediately. Instead,
they wait in per-node queues until `SendAllEnqueuedMessages()` is called,
which signals the coordinator. The coordinator waits briefly
(`batchDuration`) to collect messages from multiple stores, then paces the
signaling to each queue's `processQueue` goroutine using the pacer to
spread sends over time.

Fixes: cockroachdb#148210
Release note: None
@dodeca12 dodeca12 force-pushed the sg_smeared_heartbeats_for_storeliveness branch from b0ce908 to 824bdef Compare November 11, 2025 05:46
@dodeca12 dodeca12 requested a review from miraradeva November 11, 2025 05:54
Copy link
Copy Markdown
Contributor

@miraradeva miraradeva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your hard work on this!

Ship it, pending @iskettaneh's approval.

Copy link
Copy Markdown
Contributor

@iskettaneh iskettaneh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! 🔥

@iskettaneh reviewed 1 of 5 files at r2, 1 of 2 files at r6.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miraradeva and @stevendanna)


pkg/kv/kvserver/storeliveness/transport.go line 299 at r6 (raw file):

				}
			}
			toSignal = toSignal[:0]

Curious: Did some test break and you needed to add this? Or is it just to make things explicit?

@dodeca12
Copy link
Copy Markdown
Author

Great work! 🔥

@iskettaneh reviewed 1 of 5 files at r2, 1 of 2 files at r6.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miraradeva and @stevendanna)

pkg/kv/kvserver/storeliveness/transport.go line 299 at r6 (raw file):

				}
			}
			toSignal = toSignal[:0]

Curious: Did some test break and you needed to add this? Or is it just to make things explicit?

just to be more explicit.

@dodeca12
Copy link
Copy Markdown
Author

bors r=miraradeva,iskettaneh

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Nov 11, 2025

@craig craig bot merged commit a5660a6 into cockroachdb:master Nov 11, 2025
33 of 34 checks passed
dodeca12 pushed a commit to dodeca12/cockroach that referenced this pull request Nov 13, 2025
`storeliveness/doc.go` is outdated with respect to heartbeat smearing
changes.

In particular, the `Transport` section doesn't have information about
heartbeat smearing. Additionally, the `Configuration` section didn't
detail the heartbeat smearing cluster setting
`kv.store_liveness.heartbeat_smearing.enabled`.

Added a dedicated bullet point in the `Transport` section (`5.2`) that
explicitly describes heartbeat smearing as a feature to avoid
goroutine spikes. Also updated the `Configuration` section (`5.3`) to
detail that `kv.store_liveness.heartbeat_smearing.enabled` is
available, and describes the behaviour of heartbeat sends when the
cluster setting is enabled or disabled.

Informs: cockroachdb#156830
Release note: None
dodeca12 pushed a commit to dodeca12/cockroach that referenced this pull request Nov 17, 2025
`storeliveness/doc.go` is outdated with respect to heartbeat smearing
changes.

In particular, the `Transport` section doesn't have information about
heartbeat smearing. Additionally, the `Configuration` section didn't
detail the heartbeat smearing cluster setting
`kv.store_liveness.heartbeat_smearing.enabled`.

Added a dedicated bullet point in the `Transport` section (`5.2`) that
explicitly describes heartbeat smearing as a feature to avoid
goroutine spikes. Also updated the `Configuration` section (`5.3`) to
detail that `kv.store_liveness.heartbeat_smearing.enabled` is
available, and describes the behaviour of heartbeat sends when the
cluster setting is enabled or disabled.

Informs: cockroachdb#156830
Release note: None
craig bot pushed a commit that referenced this pull request Nov 17, 2025
157131: catalog/lease: fix error handling for purgeOldVersions r=fqazi a=fqazi

Previously, when purging old versions we would acquire a lease on the latest version, which was introduced when we added logic for acquiring leases on the previous version. The logic to acquire the previous version would clear the error, preventing errors from correctly surfacing and causing issues with dropped / offline descriptors. To address this, ensure the acquisition logic for the previous version is only executed if a clean up will occur.

Informs: #156176
Release note: None

157792: storelivenss: update `storeliveness/doc.go` with heartbeat smearing info r=miraradeva a=dodeca12

`storeliveness/doc.go` is outdated with respect to heartbeat smearing changes.

In particular, the `Transport` section doesn't have information about heartbeat smearing. Additionally, the `Configuration` section didn't detail the heartbeat smearing cluster setting
`kv.store_liveness.heartbeat_smearing.enabled`.

Added a dedicated bullet point in the `Transport` section (`5.2`) that explicitly describes heartbeat smearing as a feature to avoid goroutine spikes. Also updated the `Configuration` section (`5.3`) to detail that `kv.store_liveness.heartbeat_smearing.enabled` is available, and describes the behaviour of heartbeat sends when the cluster setting is enabled or disabled.

Informs: #156830
Release note: None

157915: kvfollowerreadsccl: maybe deflake TestBoundedStalenessDataDriven r=stevendanna a=stevendanna

This test determines what events occur by parsing the trace. In some cases, the parsing it was using to determine a "local read followed by remote leaseholder read" didn't account for changes in the potential trace messages encountered when leader leases are enabled.

Here, I widen the scope of the trace parsing. Locally under stress this elliminated the previously encountered failure:

```
    datadriven.go:357: ... SNIP ... boundedstaleness/single_row:24: still running after 10.000889738s
    ... SNIP ...
    boundedstaleness_test.go:405: condition failed to evaluate within 45s: from boundedstaleness_test.go:436: not yet a match, output:
        1
        events (1 found):
         * event 1: colbatchscan trace on node_idx 2: local read
    datadriven.go:343:
```

Fixes #154710
Release note: None

Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
Co-authored-by: Swapneeth Gorantla <swapneeth.gorantla@cockroachlabs.com>
Co-authored-by: Steven Danna <danna@cockroachlabs.com>
@nicktrav nicktrav added the O-AI-Review-Real-Issue-Found AI reviewer found real issue label Jan 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

o-AI-Review-Potential-Issue-Detected AI reviewer found potential issue. Never assign manually—auto-applied by GH action only. O-AI-Review-Real-Issue-Found AI reviewer found real issue v26.1.0-prerelease

Projects

None yet

Development

Successfully merging this pull request may close these issues.

storeliveness: pace liveness heartbeats

6 participants