
kv: configure leading closed timestamp target for global_read ranges#61386

Merged
craig[bot] merged 2 commits into cockroachdb:master from nvb:nvanbenschoten/leadPolicy
Mar 5, 2021

Conversation

@nvb
Contributor

@nvb nvb commented Mar 3, 2021

Informs #59680.
Informs #52745.

This commit updates `closedts.TargetForPolicy` to calculate a target closed
timestamp that leads present time for ranges with the LEAD_FOR_GLOBAL_READS
closed timestamp policy. This is needed for non-blocking transactions, which
require ranges to close timestamps in the future.

TargetForPolicy's LEAD_FOR_GLOBAL_READS calculation is more complex than its
LAG_BY_CLUSTER_SETTING calculation. Instead of the policy defining an offset
from the publisher's perspective, the policy defines a goal from the consumer's
perspective - the goal being that present time reads (with a possible
uncertainty interval) can be served from all followers. To accomplish this, we
must work backwards to establish a lead time to publish closed timestamps at.

The calculation looks something like the following:
```
// this should be sufficient for any present-time transaction,
// because its global uncertainty limit should be <= this time.
// For more, see (*Transaction).RequiredFrontier.
closed_ts_at_follower = now + max_offset

// the sender must account for the time it takes to propagate a
// closed timestamp update to its followers.
closed_ts_at_sender = closed_ts_at_follower + propagation_time

// closed timestamps propagate in two ways. Both need to make it to
// followers in time.
propagation_time = max(raft_propagation_time, side_propagation_time)

// raft propagation takes 3 network hops to go from a leader proposing
// a write (with a closed timestamp update) to the write being applied.
// 1. leader sends MsgProp with entry
// 2. followers send MsgPropResp with vote
// 3. leader sends MsgProp with higher commit index
//
// we also add on a small bit of overhead for request evaluation, log
// sync, and state machine apply latency.
raft_propagation_time = max_network_rtt * 1.5 + raft_overhead

// side-transport propagation takes 1 network hop, as there is no voting.
// However, it is delayed by the full side_transport_close_interval in
// the worst-case.
side_propagation_time = max_network_rtt * 0.5 + side_transport_close_interval

// put together, we get the following result
closed_ts_at_sender = now + max_offset + max(
	max_network_rtt * 1.5 + raft_overhead,
	max_network_rtt * 0.5 + side_transport_close_interval,
)
```

While writing this, I explored what it would take to use dynamic network latency
measurements in this calculation to complete #59680. The code for that wasn't
too bad, but brought up a number of questions, including how far into the tail
we care about and whether we place upper and lower bounds on this value. To
avoid needing to immediately answer these questions, the commit hardcodes a
maximum network RTT of 150ms, which should be an overestimate for almost any
cluster we expect to run on.

The commit also adds a new `kv.closed_timestamp.lead_for_global_reads_override`
cluster setting, which, if nonzero, overrides the lead time that global_read
ranges use to publish closed timestamps. The cluster setting is hidden, but
should provide an escape hatch for cases where we get the calculation
(especially when it becomes dynamic) wrong.

Release justification: needed for new functionality

@nvb nvb requested review from a team and andreimatei March 3, 2021 03:22
@cockroach-teamcity
Member

This change is Reviewable

@nvb
Contributor Author

nvb commented Mar 3, 2021

=== RUN   TestShowCreateTable/t2
    show_create_table.go:102: pq: count-leases: AS OF SYSTEM TIME: cannot specify timestamp in the future (1614742550.694082642,3 > 1614742550.464732682,0)

It returns! I guess we do need to support AOST with future-time timestamps after all. I'm going to patch in what I had for that to see if CI likes it, but I'll pull out the real thing into a separate PR.

@nvb
Contributor Author

nvb commented Mar 3, 2021

Supporting AOST with future-time timestamps fixed the testing failures.

However, I am seeing some serious slowness in some tests like TestAlterTableLocalityRegionalByRowError/GLOBAL. I took a peek and:

I210303 03:53:48.841583 1630 kv/kvclient/kvcoord/txn_coord_sender.go:613  [n1,job=637905202446663681,scExec,id=55,mutation=1,ColumnBackfiller=55] 119  completed commit-wait sleep, took 797.894ms
I210303 03:53:48.844382 1630 kv/kvclient/kvcoord/txn_coord_sender.go:605  [n1,job=637905202446663681,scExec,id=55,mutation=1,ColumnBackfiller=55] 120  performing commit-wait sleep for ~797.414ms
I210303 03:53:49.641855 1630 kv/kvclient/kvcoord/txn_coord_sender.go:613  [n1,job=637905202446663681,scExec,id=55,mutation=1,ColumnBackfiller=55] 121  completed commit-wait sleep, took 797.446ms
I210303 03:53:49.644950 1630 kv/kvclient/kvcoord/txn_coord_sender.go:605  [n1,job=637905202446663681,scExec,id=55,mutation=1,ColumnBackfiller=55] 122  performing commit-wait sleep for ~797.27ms
I210303 03:53:50.442235 1630 kv/kvclient/kvcoord/txn_coord_sender.go:613  [n1,job=637905202446663681,scExec,id=55,mutation=1,ColumnBackfiller=55] 123  completed commit-wait sleep, took 797.28ms
I210303 03:53:50.445280 1630 kv/kvclient/kvcoord/txn_coord_sender.go:605  [n1,job=637905202446663681,scExec,id=55,mutation=1,ColumnBackfiller=55] 124  performing commit-wait sleep for ~797.304ms
I210303 03:53:51.242599 1630 kv/kvclient/kvcoord/txn_coord_sender.go:613  [n1,job=637905202446663681,scExec,id=55,mutation=1,ColumnBackfiller=55] 125  completed commit-wait sleep, took 797.317ms
I210303 03:53:51.245295 1630 kv/kvclient/kvcoord/txn_coord_sender.go:605  [n1,job=637905202446663681,scExec,id=55,mutation=1,ColumnBackfiller=55] 126  performing commit-wait sleep for ~797.51ms
I210303 03:53:52.042853 1630 kv/kvclient/kvcoord/txn_coord_sender.go:613  [n1,job=637905202446663681,scExec,id=55,mutation=1,ColumnBackfiller=55] 127  completed commit-wait sleep, took 797.555ms
I210303 03:53:52.045930 1630 kv/kvclient/kvcoord/txn_coord_sender.go:605  [n1,job=637905202446663681,scExec,id=55,mutation=1,ColumnBackfiller=55] 128  performing commit-wait sleep for ~797.102ms
I210303 03:53:52.843093 1630 kv/kvclient/kvcoord/txn_coord_sender.go:613  [n1,job=637905202446663681,scExec,id=55,mutation=1,ColumnBackfiller=55] 129  completed commit-wait sleep, took 797.163ms
I210303 03:53:52.845928 1630 kv/kvclient/kvcoord/txn_coord_sender.go:605  [n1,job=637905202446663681,scExec,id=55,mutation=1,ColumnBackfiller=55] 130  performing commit-wait sleep for ~797.844ms
I210303 03:53:53.643790 1630 kv/kvclient/kvcoord/txn_coord_sender.go:613  [n1,job=637905202446663681,scExec,id=55,mutation=1,ColumnBackfiller=55] 131  completed commit-wait sleep, took 797.859ms
I210303 03:53:53.646408 1630 kv/kvclient/kvcoord/txn_coord_sender.go:605  [n1,job=637905202446663681,scExec,id=55,mutation=1,ColumnBackfiller=55] 132  performing commit-wait sleep for ~797.819ms
I210303 03:53:54.444274 1630 kv/kvclient/kvcoord/txn_coord_sender.go:613  [n1,job=637905202446663681,scExec,id=55,mutation=1,ColumnBackfiller=55] 133  completed commit-wait sleep, took 797.861ms
I210303 03:53:54.446973 1630 kv/kvclient/kvcoord/txn_coord_sender.go:605  [n1,job=637905202446663681,scExec,id=55,mutation=1,ColumnBackfiller=55] 134  performing commit-wait sleep for ~797.548ms

That would do the trick. I don't think we want the ColumnBackfiller to perform a commit-wait for the txn in columnBackfiller.runChunk, but it's not yet clear to me how best to express this. Even if individual chunks do not wait on commit (i.e. don't ensure linearizability with reads that follow immediately afterward), do we need to commit-wait after all chunks have been backfilled?

@dt do you have any time tomorrow for a quick discussion on this? I'd like to make sure I'm not dramatically slowing down any bulk operations that break work into many small transactions.

@nvb
Contributor Author

nvb commented Mar 3, 2021

FWIW if I don't commit-wait in columnBackfiller.runChunk's txn, pkg/ccl/multiregionccl drops from 441.866s to 49.003s.

@nvb nvb force-pushed the nvanbenschoten/leadPolicy branch 2 times, most recently from 93a34e9 to e131973 on March 3, 2021 20:35
@nvb nvb force-pushed the nvanbenschoten/leadPolicy branch from e131973 to e14b64e on March 4, 2021 02:49
@nvb
Contributor Author

nvb commented Mar 4, 2021

This is ready for a look now. I updated the PR with a commit to allow future-time AOST queries.

I also opened #61444, which tracks the egregious slowdown in the ColumnBackfiller when writing to GLOBAL tables. I've confirmed that the proposed fix in that issue will address the problem. For now, we speed the test up by lowering how far in the future the test writes using the new kv.closed_timestamp.lead_for_global_reads_override setting. We can remove that hack once we address #61444.

Contributor

@andreimatei andreimatei left a comment


:lgtm: but I wouldn't merge until we mess with the code around the assertion in #61443

Reviewed 2 of 9 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/ccl/multiregionccl/regional_by_row_test.go, line 370 at r2 (raw file):

							// Drop the closed timestamp target lead for GLOBAL tables.
							// The test passes with it configured to its default, but it
							// is very slow due to #61444 (2.5s vs. 35s).

please make a note on the issue to come back to this code


pkg/kv/kvserver/closedts/policy.go, line 34 at r2 (raw file):

		// Simple calculation: lag now by desired duration.
		return now.ToTimestamp().Add(-lagTargetDuration.Nanoseconds(), 0)
	case roachpb.LEAD_FOR_GLOBAL_READS:

please comment what the default ends up being


pkg/kv/kvserver/closedts/setting.go, line 53 at r2 (raw file):

// LEAD_FOR_GLOBAL_READS closed timestamp policy use to publish close
// timestamps, if it is set to a non-zero value. Meant as an escape hatch.
var LeadForGlobalReadsOverride = settings.RegisterDurationSetting(

pls link to where the target is computed when this setting is not set


pkg/sql/sem/tree/as_of.go, line 112 at r1 (raw file):

		// Parse synthetic flag.
		syn := false
		if strings.HasSuffix(s, "?") && evalCtx.Settings.Version.IsActive(context.Background(), clusterversion.PriorReadSummaries) {

doesn't the EvalContext have a ctx in it?


pkg/sql/sem/tree/as_of.go, line 112 at r1 (raw file):

		// Parse synthetic flag.
		syn := false
		if strings.HasSuffix(s, "?") && evalCtx.Settings.Version.IsActive(context.Background(), clusterversion.PriorReadSummaries) {

pls explain why we refuse to parse if the version is old. And consider generating an explicit error here directly when we don't accept it

This commit updates the handling of AS OF SYSTEM TIME clauses to accept
future-time synthetic timestamps. This functionality is needed when performing
certain kinds of schema changes on GLOBAL tables, as the operating transaction
will get its commit timestamp bumped into the future and will then try to call
into `CountLeases` from `CheckTwoVersionInvariant` with the commit timestamp.

We're already protected from users shooting themselves in the foot too badly
with this, as we only allow timestamps to lead present time by a few seconds
before they are rejected by KV.

Release justification: Needed for new functionality
@nvb nvb force-pushed the nvanbenschoten/leadPolicy branch from e14b64e to d0eb544 on March 4, 2021 03:53
Contributor Author

@nvb nvb left a comment


TFTR! I'll wait on #61443.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei)


pkg/ccl/multiregionccl/regional_by_row_test.go, line 370 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

please make a note on the issue to come back to this code

Done.


pkg/kv/kvserver/closedts/policy.go, line 34 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

please comment what the default ends up being

Done.


pkg/kv/kvserver/closedts/setting.go, line 53 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

pls link to where the target is computed when this setting is not set

Done.


pkg/sql/sem/tree/as_of.go, line 112 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

doesn't the EvalContext have a ctx in it?

Done.


pkg/sql/sem/tree/as_of.go, line 112 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

pls explain why we refuse to parse if the version is old. And consider generating an explicit error here directly when we don't accept it

Done.

@nvb nvb force-pushed the nvanbenschoten/leadPolicy branch from d0eb544 to b320c03 on March 4, 2021 06:07
@nvb
Contributor Author

nvb commented Mar 5, 2021

Merging, because it sounds like we understand #61443 and that it's an artifact of the test bumping the clock, so this won't make it worse.

bors r+

@craig
Contributor

craig bot commented Mar 5, 2021

Build failed (retrying...):

@nvb
Contributor Author

nvb commented Mar 5, 2021

bors r-

@cockroachdb cockroachdb deleted a comment from craig bot Mar 5, 2021
@nvb
Contributor Author

nvb commented Mar 5, 2021

TestFollowersDontRejectClockUpdateWithJump flaked with the error clock synchronization error: this node is more than 500ms away from at least half of the known nodes (1 of 3 are within the offset). I don't understand why the test doesn't always hit this, as it's manually bumping the clock by more than 500ms. The branch should have nothing to do with this, and I can't repro under stress.

bors r+

@craig
Contributor

craig bot commented Mar 5, 2021

Build succeeded:

@craig craig bot merged commit 90d6ba4 into cockroachdb:master Mar 5, 2021
@nvb nvb deleted the nvanbenschoten/leadPolicy branch March 23, 2021 03:18