
kv: configure leading closed timestamp target for global_read ranges#61386

Merged
craig[bot] merged 2 commits into cockroachdb:master from nvb:nvanbenschoten/leadPolicy
Mar 5, 2021

Conversation

@nvb
Contributor

@nvb nvb commented Mar 3, 2021

Informs #59680.
Informs #52745.

This commit updates `closedts.TargetForPolicy` to calculate a target closed
timestamp that leads present time for ranges with the LEAD_FOR_GLOBAL_READS
closed timestamp policy. This is needed for non-blocking transactions, which
require ranges to close timestamps in the future.

TargetForPolicy's LEAD_FOR_GLOBAL_READS calculation is more complex than its
LAG_BY_CLUSTER_SETTING calculation. Instead of the policy defining an offset
from the publisher's perspective, the policy defines a goal from the consumer's
perspective - the goal being that present time reads (with a possible
uncertainty interval) can be served from all followers. To accomplish this, we
must work backwards to establish a lead time to publish closed timestamps at.

The calculation looks something like the following:
```
// this should be sufficient for any present-time transaction,
// because its global uncertainty limit should be <= this time.
// For more, see (*Transaction).RequiredFrontier.
closed_ts_at_follower = now + max_offset

// the sender must account for the time it takes to propagate a
// closed timestamp update to its followers.
closed_ts_at_sender = closed_ts_at_follower + propagation_time

// closed timestamps propagate in two ways. Both need to make it to
// followers in time.
propagation_time = max(raft_propagation_time, side_propagation_time)

// raft propagation takes 3 network hops to go from a leader proposing
// a write (with a closed timestamp update) to the write being applied.
// 1. leader sends MsgProp with entry
// 2. followers send MsgPropResp with vote
// 3. leader sends MsgProp with higher commit index
//
// we also add on a small bit of overhead for request evaluation, log
// sync, and state machine apply latency.
raft_propagation_time = max_network_rtt * 1.5 + raft_overhead

// side-transport propagation takes 1 network hop, as there is no voting.
// However, it is delayed by the full side_transport_close_interval in
// the worst-case.
side_propagation_time = max_network_rtt * 0.5 + side_transport_close_interval

// put together, we get the following result
closed_ts_at_sender = now + max_offset + max(
	max_network_rtt * 1.5 + raft_overhead,
	max_network_rtt * 0.5 + side_transport_close_interval,
)
```

While writing this, I explored what it would take to use dynamic network latency
measurements in this calculation to complete #59680. The code for that wasn't
too bad, but brought up a number of questions, including how far into the tail
we care about and whether we place upper and lower bounds on this value. To
avoid needing to immediately answer these questions, the commit hardcodes a
maximum network RTT of 150ms, which should be an overestimate for almost any
cluster we expect to run on.

The commit also adds a new `kv.closed_timestamp.lead_for_global_reads_override`
cluster setting, which, if nonzero, overrides the lead time that global_read
ranges use to publish closed timestamps. The cluster setting is hidden, but
should provide an escape hatch for cases where we get the calculation
(especially when it becomes dynamic) wrong.

Release justification: needed for new functionality

@nvb nvb requested review from a team and andreimatei March 3, 2021 03:22
@cockroach-teamcity
Member

This change is Reviewable

@nvb
Contributor Author

nvb commented Mar 3, 2021

=== RUN   TestShowCreateTable/t2
    show_create_table.go:102: pq: count-leases: AS OF SYSTEM TIME: cannot specify timestamp in the future (1614742550.694082642,3 > 1614742550.464732682,0)

It returns! I guess we do need to support AOST with future-time timestamps after all. I'm going to patch in what I had for that to see if CI likes it, but I'll pull out the real thing into a separate PR.

@nvb
Contributor Author

nvb commented Mar 3, 2021

Supporting AOST with future-time timestamps fixed the testing failures.

However, I am seeing some serious slowness in some tests like TestAlterTableLocalityRegionalByRowError/GLOBAL. I took a peek and:

I210303 03:53:48.841583 1630 kv/kvclient/kvcoord/txn_coord_sender.go:613  [n1,job=637905202446663681,scExec,id=55,mutation=1,ColumnBackfiller=55] 119  completed commit-wait sleep, took 797.894ms
I210303 03:53:48.844382 1630 kv/kvclient/kvcoord/txn_coord_sender.go:605  [n1,job=637905202446663681,scExec,id=55,mutation=1,ColumnBackfiller=55] 120  performing commit-wait sleep for ~797.414ms
I210303 03:53:49.641855 1630 kv/kvclient/kvcoord/txn_coord_sender.go:613  [n1,job=637905202446663681,scExec,id=55,mutation=1,ColumnBackfiller=55] 121  completed commit-wait sleep, took 797.446ms
I210303 03:53:49.644950 1630 kv/kvclient/kvcoord/txn_coord_sender.go:605  [n1,job=637905202446663681,scExec,id=55,mutation=1,ColumnBackfiller=55] 122  performing commit-wait sleep for ~797.27ms
I210303 03:53:50.442235 1630 kv/kvclient/kvcoord/txn_coord_sender.go:613  [n1,job=637905202446663681,scExec,id=55,mutation=1,ColumnBackfiller=55] 123  completed commit-wait sleep, took 797.28ms
I210303 03:53:50.445280 1630 kv/kvclient/kvcoord/txn_coord_sender.go:605  [n1,job=637905202446663681,scExec,id=55,mutation=1,ColumnBackfiller=55] 124  performing commit-wait sleep for ~797.304ms
I210303 03:53:51.242599 1630 kv/kvclient/kvcoord/txn_coord_sender.go:613  [n1,job=637905202446663681,scExec,id=55,mutation=1,ColumnBackfiller=55] 125  completed commit-wait sleep, took 797.317ms
I210303 03:53:51.245295 1630 kv/kvclient/kvcoord/txn_coord_sender.go:605  [n1,job=637905202446663681,scExec,id=55,mutation=1,ColumnBackfiller=55] 126  performing commit-wait sleep for ~797.51ms
I210303 03:53:52.042853 1630 kv/kvclient/kvcoord/txn_coord_sender.go:613  [n1,job=637905202446663681,scExec,id=55,mutation=1,ColumnBackfiller=55] 127  completed commit-wait sleep, took 797.555ms
I210303 03:53:52.045930 1630 kv/kvclient/kvcoord/txn_coord_sender.go:605  [n1,job=637905202446663681,scExec,id=55,mutation=1,ColumnBackfiller=55] 128  performing commit-wait sleep for ~797.102ms
I210303 03:53:52.843093 1630 kv/kvclient/kvcoord/txn_coord_sender.go:613  [n1,job=637905202446663681,scExec,id=55,mutation=1,ColumnBackfiller=55] 129  completed commit-wait sleep, took 797.163ms
I210303 03:53:52.845928 1630 kv/kvclient/kvcoord/txn_coord_sender.go:605  [n1,job=637905202446663681,scExec,id=55,mutation=1,ColumnBackfiller=55] 130  performing commit-wait sleep for ~797.844ms
I210303 03:53:53.643790 1630 kv/kvclient/kvcoord/txn_coord_sender.go:613  [n1,job=637905202446663681,scExec,id=55,mutation=1,ColumnBackfiller=55] 131  completed commit-wait sleep, took 797.859ms
I210303 03:53:53.646408 1630 kv/kvclient/kvcoord/txn_coord_sender.go:605  [n1,job=637905202446663681,scExec,id=55,mutation=1,ColumnBackfiller=55] 132  performing commit-wait sleep for ~797.819ms
I210303 03:53:54.444274 1630 kv/kvclient/kvcoord/txn_coord_sender.go:613  [n1,job=637905202446663681,scExec,id=55,mutation=1,ColumnBackfiller=55] 133  completed commit-wait sleep, took 797.861ms
I210303 03:53:54.446973 1630 kv/kvclient/kvcoord/txn_coord_sender.go:605  [n1,job=637905202446663681,scExec,id=55,mutation=1,ColumnBackfiller=55] 134  performing commit-wait sleep for ~797.548ms

That would do the trick. I don't think we want the ColumnBackfiller to perform a commit-wait for the txn in columnBackfiller.runChunk, but it's not yet clear to me how best to express this. Even if individual chunks do not wait on commit (i.e. don't ensure linearizability with reads that follow immediately afterward), do we need to commit-wait after all chunks have been backfilled?

@dt do you have any time tomorrow for a quick discussion on this? I'd like to make sure I'm not dramatically slowing down any bulk operations that break work into many small transactions.

@nvb
Contributor Author

nvb commented Mar 3, 2021

FWIW if I don't commit-wait in columnBackfiller.runChunk's txn, pkg/ccl/multiregionccl drops from 441.866s to 49.003s.

@nvb nvb force-pushed the nvanbenschoten/leadPolicy branch 2 times, most recently from 93a34e9 to e131973 on March 3, 2021 20:35
@nvb nvb force-pushed the nvanbenschoten/leadPolicy branch from e131973 to e14b64e on March 4, 2021 02:49
@nvb
Contributor Author

nvb commented Mar 4, 2021

This is ready for a look now. I updated the PR with a commit to allow future-time AOST queries.

I also opened #61444, which tracks the egregious slowdown in the ColumnBackfiller when writing to GLOBAL tables. I've confirmed that the proposed fix in that issue will address the problem. For now, we speed the test up by lowering how far in the future the test writes using the new kv.closed_timestamp.lead_for_global_reads_override setting. We can remove that hack once we address #61444.

Contributor

@andreimatei andreimatei left a comment


:lgtm: but I wouldn't merge until we mess with the code around the assertion in #61443

Reviewed 2 of 9 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/ccl/multiregionccl/regional_by_row_test.go, line 370 at r2 (raw file):

							// Drop the closed timestamp target lead for GLOBAL tables.
							// The test passes with it configured to its default, but it
							// is very slow due to #61444 (2.5s vs. 35s).

please make a note on the issue to come back to this code


pkg/kv/kvserver/closedts/policy.go, line 34 at r2 (raw file):

		// Simple calculation: lag now by desired duration.
		return now.ToTimestamp().Add(-lagTargetDuration.Nanoseconds(), 0)
	case roachpb.LEAD_FOR_GLOBAL_READS:

please comment what the default ends up being


pkg/kv/kvserver/closedts/setting.go, line 53 at r2 (raw file):

// LEAD_FOR_GLOBAL_READS closed timestamp policy use to publish close
// timestamps, if it is set to a non-zero value. Meant as an escape hatch.
var LeadForGlobalReadsOverride = settings.RegisterDurationSetting(

pls link to where the target is computed when this setting is not set


pkg/sql/sem/tree/as_of.go, line 112 at r1 (raw file):

		// Parse synthetic flag.
		syn := false
		if strings.HasSuffix(s, "?") && evalCtx.Settings.Version.IsActive(context.Background(), clusterversion.PriorReadSummaries) {

doesn't the EvalContext have a ctx in it?


pkg/sql/sem/tree/as_of.go, line 112 at r1 (raw file):

		// Parse synthetic flag.
		syn := false
		if strings.HasSuffix(s, "?") && evalCtx.Settings.Version.IsActive(context.Background(), clusterversion.PriorReadSummaries) {

pls explain why we refuse to parse if the version is old. And consider generating an explicit error here directly when we don't accept it

This commit updates the handling of AS OF SYSTEM TIME clauses to accept
future-time synthetic timestamps. This functionality is needed when performing
certain kinds of schema changes on GLOBAL tables, as the operating transaction
will get its commit timestamp bumped into the future and will then try to call
into `CountLeases` from `CheckTwoVersionInvariant` with the commit timestamp.

We're already protected from users shooting themselves in the foot too badly
with this, as we only allow timestamps to lead present time by a few seconds
before they are rejected by KV.

Release justification: Needed for new functionality
@nvb nvb force-pushed the nvanbenschoten/leadPolicy branch from e14b64e to d0eb544 on March 4, 2021 03:53
Contributor Author

@nvb nvb left a comment


TFTR! I'll wait on #61443.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei)


pkg/ccl/multiregionccl/regional_by_row_test.go, line 370 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

please make a note on the issue to come back to this code

Done.


pkg/kv/kvserver/closedts/policy.go, line 34 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

please comment what the default ends up being

Done.


pkg/kv/kvserver/closedts/setting.go, line 53 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

pls link to where the target is computed when this setting is not set

Done.


pkg/sql/sem/tree/as_of.go, line 112 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

doesn't the EvalContext have a ctx in it?

Done.


pkg/sql/sem/tree/as_of.go, line 112 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

pls explain why we refuse to parse if the version is old. And consider generating an explicit error here directly when we don't accept it

Done.

@nvb nvb force-pushed the nvanbenschoten/leadPolicy branch from d0eb544 to b320c03 on March 4, 2021 06:07
@nvb
Contributor Author

nvb commented Mar 5, 2021

Merging, because it sounds like we understand #61443 and that it's an artifact of the test bumping the clock, so this won't make it worse.

bors r+

@craig
Contributor

craig bot commented Mar 5, 2021

Build failed (retrying...):

@nvb
Contributor Author

nvb commented Mar 5, 2021

bors r-

@cockroachdb cockroachdb deleted a comment from craig bot Mar 5, 2021
@nvb
Contributor Author

nvb commented Mar 5, 2021

TestFollowersDontRejectClockUpdateWithJump flaked with the error clock synchronization error: this node is more than 500ms away from at least half of the known nodes (1 of 3 are within the offset). I don't understand why the test doesn't always hit this, as it's manually bumping the clock by more than 500ms. The branch should have nothing to do with this, and I can't repro under stress.

bors r+

@craig
Contributor

craig bot commented Mar 5, 2021

Build succeeded:

@craig craig bot merged commit 90d6ba4 into cockroachdb:master Mar 5, 2021
@nvb nvb deleted the nvanbenschoten/leadPolicy branch March 23, 2021 03:18