
kvserver: rationalize timestamps in proposals #63045

Merged
craig[bot] merged 7 commits into cockroachdb:master from andreimatei:replicatedEvalResult.Timestamp
Apr 22, 2021
Conversation

@andreimatei
Contributor

see individual commits

@andreimatei andreimatei requested review from bdarnell, nvb and tbg April 2, 2021 22:27
@cockroach-teamcity
Member

This change is Reviewable

@andreimatei andreimatei force-pushed the replicatedEvalResult.Timestamp branch from d28ce5a to aae1695 Compare April 2, 2021 22:28
@andreimatei
Contributor Author

@nvanbenschoten I'm curious if you would have done something differently in the last commit other than reading the clock for !isIntentWrite requests.

Contributor

@bdarnell bdarnell left a comment


Reviewed 9 of 9 files at r2, 2 of 2 files at r3, 5 of 5 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @nvanbenschoten, and @tbg)


pkg/kv/kvserver/replica_application_state_machine.go, line 464 at r4 (raw file):

		wts := cmd.raftCmd.ReplicatedEvalResult.WriteTimestamp
		if !wts.IsEmpty() && wts.LessEq(b.state.RaftClosedTimestamp) {

The preceding comments are now out of date.

We no longer have any kind of assertion on un-migrated ranges; is that a problem for the transition?

Using the existence of RaftClosedTimestamp as a proxy for "migrated to 21.1" is subtle; can we add a method with a name that indicates this more clearly?


pkg/kv/kvserver/replica_application_state_machine.go, line 498 at r4 (raw file):

		b.maxTS.Forward(clockTS)
	}
	clockTS := cmd.replicatedResult().ClockSignal.UnsafeToClockTimestamp()

This method is documented as test-only. Shouldn't this use the TryTo version as above? (I'm not familiar with these changes to the hlc package)
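The distinction Ben is asking about — a checked conversion versus the test-only unsafe one — can be sketched like this. This is a simplified stand-in, not the real pkg/util/hlc API: the type fields and method body are illustrative, modeled only on what the review discussion implies (a `TryTo` variant that refuses to treat a synthetic timestamp as a clock reading).

```go
package main

import "fmt"

// Timestamp/ClockTimestamp are simplified stand-ins for hlc.Timestamp and
// hlc.ClockTimestamp; fields and logic are illustrative, not the real package.
type Timestamp struct {
	WallTime  int64
	Synthetic bool // a synthetic timestamp carries no real clock signal
}
type ClockTimestamp Timestamp

// tryToClockTimestamp is the checked conversion the review suggests: it
// declines to treat a synthetic timestamp as a reading from a physical clock.
func (t Timestamp) tryToClockTimestamp() (ClockTimestamp, bool) {
	if t.Synthetic {
		return ClockTimestamp{}, false
	}
	return ClockTimestamp(t), true
}

func main() {
	if ct, ok := (Timestamp{WallTime: 100}).tryToClockTimestamp(); ok {
		fmt.Println(ct.WallTime) // safe to forward the local clock by ct
	}
	_, ok := (Timestamp{WallTime: 100, Synthetic: true}).tryToClockTimestamp()
	fmt.Println(ok) // false: never forward the clock from a synthetic timestamp
}
```

The unsafe variant would skip the `Synthetic` check entirely, which is why it is only appropriate where the caller already knows the timestamp came from a clock.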


pkg/kv/kvserver/replica_proposal.go, line 811 at r4 (raw file):

		// side-effects that are to be replicated to all replicas.
		res.Replicated.IsLeaseRequest = ba.IsLeaseRequest()
		// Set the timestamp fields. If this range has not "migrated" to 21.1 (as

This means that you won't be able to do rolling upgrades from the previous 21.1 beta to this one. That's not great and it's pretty late in the cycle to be introducing a version barrier like this, although it's hard enough to do range-level migrations that we don't have another practical option if we want to get this in to 21.1.


pkg/kv/kvserver/replica_proposal.go, line 821 at r4 (raw file):

			res.Replicated.WriteTimestamp = ba.WriteTimestamp()
		} else {
			res.Replicated.ClockSignal = r.store.Clock().Now()

What takes the else branch here? For lease transfers, should we be using the new lease's start time instead of the current time? For things other than lease transfers, do we need to include a clock signal at all?

Contributor Author

@andreimatei andreimatei left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @bdarnell, @nvanbenschoten, and @tbg)


pkg/kv/kvserver/replica_proposal.go, line 811 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This means that you won't be able to do rolling upgrades from the previous 21.1 beta to this one. That's not great and it's pretty late in the cycle to be introducing a version barrier like this, although it's hard enough to do range-level migrations that we don't have another practical option if we want to get this in to 21.1.

Yeah... I think requiring a stop-the-world upgrade away from a beta is not the worst thing in the world, as long as the consequences of skipping it aren't the worst. And in this case I think they, unfortunately, would be: if a new node proposes a command without a WriteTimestamp to a combination of new and old nodes, the old ones will reject it and the new ones will apply it -> divergence.
I'll think about what to do here.

@andreimatei andreimatei force-pushed the replicatedEvalResult.Timestamp branch from aae1695 to f4c6328 Compare April 6, 2021 19:22
Contributor Author

@andreimatei andreimatei left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @nvanbenschoten, and @tbg)


pkg/kv/kvserver/replica_application_state_machine.go, line 464 at r4 (raw file):

The preceding comments are now out of date.

done

We no longer have any kind of assertion on un-migrated ranges; is that a problem for the transition?

We never really had the assertion on un-migrated ranges (it was never firing) because RaftClosedTimestamp is zero.

Using the existence of RaftClosedTimestamp as a proxy for "migrated to 21.1" is subtle; can we add a method with a name that indicates this more clearly?

done


pkg/kv/kvserver/replica_application_state_machine.go, line 498 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This method is documented as test-only. Shouldn't this use the TryTo version as above? (I'm not familiar with these changes to the hlc package)

gone


pkg/kv/kvserver/replica_proposal.go, line 821 at r4 (raw file):

What takes the else branch here?

Proposals corresponding to single EndTxns, inline DeleteRangeRequest, ClearRange, RevertRange, HeartbeatTxn, GC, PushTxn, RecoverTxn, RecomputeStats, Migrate.

For lease transfers, should we be using the new lease's start time instead of the current time?

Would that be better? From a clock sync perspective, it would be slightly worse because we're missing the opportunity to sync the clocks better. Are you thinking of code clarity or of the cost of reading the clock?

For things other than lease transfers, do we need to include a clock signal at all?

Propagating a clock signal never hurts... Other than the cost of reading the clock?

Contributor Author

@andreimatei andreimatei left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @nvanbenschoten, and @tbg)


pkg/kv/kvserver/replica_proposal.go, line 811 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Yeah... I think requiring a stop-the-world upgrade away from a beta is not the worst thing in the world, as long as the consequences of skipping it aren't the worst. And in this case I think they, unfortunately, would be: if a new node proposes a command without a WriteTimestamp to a combination of new and old nodes, the old ones will reject it and the new ones will apply it -> divergence.
I'll think about what to do here.

I've gone back to a single WriteTimestamp field, and used another field to discriminate "regular writes" from other misc requests.

Contributor

@bdarnell bdarnell left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @bdarnell, @nvanbenschoten, and @tbg)


pkg/kv/kvserver/replica_proposal.go, line 821 at r4 (raw file):

Would that be better? From a clock sync perspective, it would be slightly worse because we're missing the opportunity to sync the clocks better. Are you thinking of code clarity or of the cost of reading the clock?

I'm thinking of code clarity - sending Now() is slightly better for clock sync purposes, but then why not send the ClockSignal=Now() field unconditionally? If it's dependent on the proposal type then it seems like it should be based on the semantics of the proposal instead of just a conditional inclusion of a current-time clock signal.

Anyway, I don't feel strongly about this. I'm fine with including the current time signal (even for proposals that write an intent). I don't think it's worth the brainpower to try to limit this to proposal types where it's strictly necessary or to provide the minimum timestamp required for correctness.

@andreimatei andreimatei force-pushed the replicatedEvalResult.Timestamp branch 3 times, most recently from a743f34 to 4ac5c39 Compare April 12, 2021 14:43
Contributor

@nvb nvb left a comment


Reviewed 9 of 9 files at r5, 9 of 9 files at r6, 2 of 2 files at r7, 9 of 9 files at r8, 10 of 10 files at r9, 2 of 2 files at r10, 2 of 2 files at r11.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @bdarnell, and @tbg)


pkg/kv/kvserver/replica.go, line 579 at r11 (raw file):

		tenantID roachpb.TenantID // Set when first initialized, not modified after

		// Historical information about the command that set the closed timestamp.

Is this last commit just for debugging? Or were you intending on landing it?


pkg/kv/kvserver/replica_application_state_machine.go, line 332 at r9 (raw file):

	isIntentWrite := raftCmd.ReplicatedEvalResult.IsIntentWrite()
	wts := raftCmd.ReplicatedEvalResult.WriteTimestamp
	if isIntentWrite && !wts.IsEmpty() && wts.LessEq(*replicaState.GCThreshold) {

Are the changes here backward compatible? What would happen in a mixed v20.2/v21.1 cluster if a request has a timestamp before the GC threshold but is not an intent write? Wouldn't a v20.2 replica reject the request while the v21.1 node applies it?


pkg/kv/kvserver/replica_application_state_machine.go, line 459 at r10 (raw file):

assertNoCmdClosedTimestampRegression

Did you mean to call this twice? Once here, and once in stageTrivialReplicatedEvalResult?


pkg/kv/kvserver/replica_application_state_machine.go, line 1022 at r10 (raw file):

// down assertions.
func raftClosedTimestampAssertionsEnabled() bool {
	return envutil.EnvOrDefaultBool("COCKROACH_RAFT_CLOSEDTS_ASSERTIONS_ENABLED", true)

Pull this into a static variable so that it is only computed once. There's no need for the function.
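The suggested change amounts to hoisting the env lookup into package initialization. A minimal sketch, with an illustrative stand-in for the envutil helper (the helper name and parsing are assumptions, not CockroachDB's actual implementation):

```go
package main

import (
	"fmt"
	"os"
)

// envOrDefaultBool mimics an envutil-style helper: read a boolean environment
// variable, falling back to a default when unset. Illustrative only.
func envOrDefaultBool(name string, def bool) bool {
	v, ok := os.LookupEnv(name)
	if !ok {
		return def
	}
	return v == "true" || v == "1"
}

// Computed exactly once, at package initialization, instead of re-reading the
// environment on every call — the change the review asks for.
var raftClosedTimestampAssertionsEnabled = envOrDefaultBool(
	"COCKROACH_RAFT_CLOSEDTS_ASSERTIONS_ENABLED", true)

func main() {
	fmt.Println(raftClosedTimestampAssertionsEnabled)
}
```

Since the environment doesn't change during the process's lifetime, caching the result in a package-level var is both cheaper and semantically equivalent.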


pkg/kv/kvserver/replica_application_state_machine.go, line 1042 at r10 (raw file):

		return errors.AssertionFailedf(
			"raft closed timestamp regression in cmd: %x; batch state: %s, command: %s, lease: %s, req: %s",
			cmd.idKey, existingClosed.String(), newClosed.String(), b.state.Lease, req)

Have you confirmed that these two assertion errors aren't redacting all of the interesting information? That would be a shame.


pkg/kv/kvserver/replica_application_state_machine.go, line 412 at r11 (raw file):

type closedTimestampSetterInfo struct {
	cmd   *replicatedCmd

Will this play well with the memory recycling we perform in replicatedCmdBuf?


pkg/kv/kvserver/kvserverpb/proposer_kv.proto, line 113 at r8 (raw file):

// EvalResultFlags represent the bits in ReplicatedEvalResult.flags.
enum EvalResultFlags {

ReplicatedEvalResultFlags?


pkg/kv/kvserver/kvserverpb/proposer_kv.proto, line 137 at r8 (raw file):

  // flags indicates various attributes of the command. The interpretation of
  // the bits is given by EvalResultFlags.
  int64 flags = 6;

Should this be typed with EvalResultFlags?


pkg/kv/kvserver/kvserverpb/proposer_kv.proto, line 121 at r9 (raw file):

  // intents (unless done outside of a txn or the command corresponds to a 1PC
  // batch). See ba.IsIntentWrite().
  IntentWrite = 2;

nit: Do you mind adding the following to ensure that whoever adds the next flag recognizes that these variants are used in a bitfield?

// NextFlag = 4;
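The nit above is about bitfield hygiene: each enum variant must be a distinct power of two so variants can be OR-ed into one integer field, and the `// NextFlag = 4;` comment steers the next contributor away from naively adding `3`. A small Go sketch of the pattern (flag names are illustrative, loosely modeled on the proto discussion):

```go
package main

import "fmt"

// Illustrative bitfield variants mirroring the proto suggestion: powers of two
// so they can be combined in a single int64 flags field.
const (
	FlagNone           int64 = 0
	FlagIsLeaseRequest int64 = 1
	FlagIntentWrite    int64 = 2
	// NextFlag = 4 — the next variant must be the next power of two, not 3.
)

// hasFlag reports whether flag f is set in flags.
func hasFlag(flags, f int64) bool { return flags&f != 0 }

func main() {
	flags := FlagIntentWrite | FlagIsLeaseRequest
	fmt.Println(hasFlag(flags, FlagIntentWrite)) // true
	fmt.Println(hasFlag(flags, 4))               // false: NextFlag not set
}
```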

pkg/kv/kvserver/kvserverpb/proposer_kv.proto, line 146 at r9 (raw file):

  // the command against the GC threshold. For all other commands, the timestamp
  // is to be interpreted simply as a clock signal from the proposer - the field is
  // also used to bump the followers' clocks.

also

@andreimatei andreimatei force-pushed the replicatedEvalResult.Timestamp branch from 4ac5c39 to 6972805 Compare April 14, 2021 21:49
Contributor Author

@andreimatei andreimatei left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @nvanbenschoten, and @tbg)


pkg/kv/kvserver/replica.go, line 579 at r11 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is this last commit just for debugging? Or were you intending on landing it?

Intending to land it... At least until we figure out the failures.


pkg/kv/kvserver/replica_application_state_machine.go, line 332 at r9 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Are the changes here backward compatible? What would happen in a mixed v20.2/v21.1 cluster if a request has a timestamp before the GC threshold but is not an intent write? Wouldn't a v20.2 replica reject the request while the v21.1 node applies it?

Like we discussed, I've inverted the flag so that all proposals from 20.2 pass the check here, preserving deterministic behavior in mixed 20.2/21.1 clusters.


pkg/kv/kvserver/replica_application_state_machine.go, line 459 at r10 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
assertNoCmdClosedTimestampRegression

Did you mean to call this twice? Once here, and once in stageTrivialReplicatedEvalResult?

It was intentional, but you're right - it's silly. I've removed the call in stageTrivialReplicatedEvalResult.


pkg/kv/kvserver/replica_application_state_machine.go, line 1022 at r10 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Pull this into a static variable so that it is only computed once. There's no need for the function.

done


pkg/kv/kvserver/replica_application_state_machine.go, line 1042 at r10 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Have you confirmed that these two assertion errors aren't redacting all of the interesting information? That would be a shame.

I've appended a commit to make the types involved not redact.


pkg/kv/kvserver/replica_application_state_machine.go, line 412 at r11 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Will this play well with the memory recycling we perform in replicatedCmdBuf?

ah, I didn't know about that. Good catch. See now.


pkg/kv/kvserver/replica_proposal.go, line 821 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Would that be better? From a clock sync perspective, it would be slightly worse because we're missing the opportunity to sync the clocks better. Are you thinking of code clarity or of the cost of reading the clock?

I'm thinking of code clarity - sending Now() is slightly better for clock sync purposes, but then why not send the ClockSignal=Now() field unconditionally? If it's dependent on the proposal type then it seems like it should be based on the semantics of the proposal instead of just a conditional inclusion of a current-time clock signal.

Anyway, I don't feel strongly about this. I'm fine with including the current time signal (even for proposals that write an intent). I don't think it's worth the brainpower to try to limit this to proposal types where it's strictly necessary or to provide the minimum timestamp required for correctness.

In the meantime, I've moved back to a single timestamp field. It has different meanings for different requests... But I don't think it's worth introducing a new field.


pkg/kv/kvserver/kvserverpb/proposer_kv.proto, line 113 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

ReplicatedEvalResultFlags?

done


pkg/kv/kvserver/kvserverpb/proposer_kv.proto, line 137 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should this be typed with EvalResultFlags?

I don't think so... Cause this is a bit map. Should it?


pkg/kv/kvserver/kvserverpb/proposer_kv.proto, line 121 at r9 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: Do you mind adding the following to ensure that whoever adds the next flag recognizes that these variants are used in a bitfield?

// NextFlag = 4;

done


pkg/kv/kvserver/kvserverpb/proposer_kv.proto, line 146 at r9 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

also

done

Contributor Author

@andreimatei andreimatei left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @nvanbenschoten, and @tbg)


pkg/kv/kvserver/replica_proposal.go, line 849 at r16 (raw file):

			// its lease.
			res.Replicated.WriteTimestamp = r.store.Clock().Now()
			res.Replicated.SetNonMVCC()

There is a problem here, caught by the version-upgrade test. We assign (or don't assign) the new flag based on r.mu.state.MigratedToRaftClosedTimestamp() here; it's possible that "the migration" is in flight at this point - the range is not migrated yet, but the command will be applied after a preceding command does the migration. In that case, we might not set the flag when the application side would like it to be set, and application fails because it appears that an "intent write" command is writing below the closed timestamp. I'm not sure what to do about it yet; I'm thinking we should push the setting of this flag inside the proposal buffer, and do it after the request has been sequenced with preceding requests that might be doing the migration.

@andreimatei andreimatei force-pushed the replicatedEvalResult.Timestamp branch 3 times, most recently from 461d2ac to 8282ce8 Compare April 19, 2021 20:15
Contributor Author

@andreimatei andreimatei left a comment


I've backtracked on adding a field to proposals to discriminate IsIntentWrite commands from non-MVCC ones; it seems too hard to make below-Raft code use it correctly. It seems that we'd need two flags to be properly backwards-compatible - one telling you about the character of a proposal and one telling you whether you can trust the first flag.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @nvanbenschoten, and @tbg)

Contributor

@nvb nvb left a comment


Reviewed 16 of 16 files at r17, 9 of 9 files at r18, 1 of 1 files at r19, 4 of 4 files at r20, 8 of 8 files at r21, 8 of 8 files at r22, 6 of 6 files at r23.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @bdarnell, and @tbg)


pkg/kv/kvserver/replica_application_state_machine.go, line 1041 at r21 (raw file):

}

// Assert that the current command is not writing under the closed timestamp.

nit: consider moving this definition and call to this method up above the corresponding ones for assertNoCmdClosedTimestampRegression, as an additional subtle hint that this check does not take into consideration the new closed timestamp, just the previous one.


pkg/kv/kvserver/replica_application_state_machine.go, line 384 at r22 (raw file):

	// stageTrivialReplicatedEvalResult.
	state                 kvserverpb.ReplicaState
	closedTimestampSetter closedTimestampSetterInfo

Give this guy a comment as well.


pkg/kv/kvserver/replica_application_state_machine.go, line 410 at r22 (raw file):

}

// closedTimestampSetterInfo contains information about the command that last

Let's move this down to the bottom of this file.


pkg/kv/kvserver/replica_application_state_machine.go, line 417 at r22 (raw file):

	// NOTE(andrei): We're going to hold on to the request after its application
	// is complete, which is illegal. But cloning it would be too expensive. We're
	// onlt using the save requests in case of assertions firing, so let's take

"only" "saved"


pkg/kv/kvserver/replica_application_state_machine.go, line 419 at r22 (raw file):

	// onlt using the save requests in case of assertions firing, so let's take
	// the risk.
	req *roachpb.BatchRequest

I'm a little concerned that we're now going to leak the last BatchRequest run on each Range. On a node with 10k ranges and batches of 10 10KB writes, that's 1GB of leaked memory.

How much do we get out of recording the entire BatchRequest? Maybe we just record the lease, the lease index, and whether the batch was a lease request or not? And if it was a lease request, maybe the previous lease?


pkg/kv/kvserver/replica_application_state_machine.go, line 845 at r22 (raw file):

	if cts := cmd.raftCmd.ClosedTimestamp; cts != nil && !cts.IsEmpty() {
		b.state.RaftClosedTimestamp = *cts
		b.closedTimestampSetter.leaseIdx = ctpb.LAI(cmd.leaseIndex)

Let's encapsulate this in a method on closedTimestampSetterInfo.


pkg/kv/kvserver/kvserverpb/proposer_kv.proto, line 130 at r20 (raw file):

  // The timestamp at which this command is writing. Used to verify the validity
  // of the command against the GC threshold and to update the followers'
  // clocks. If the request that produced this command is not an IntentWrite

If we're using this field for a clock signal again, does the rename to write_timestamp make sense? Should we revert that and keep it as timestamp?


pkg/roachpb/data.go, line 1857 at r23 (raw file):

// String implements the fmt.Stringer interface.
func (s LeaseSequence) String() string {
	return strconv.FormatInt(int64(s), 10)

Was this just useless?

@andreimatei andreimatei force-pushed the replicatedEvalResult.Timestamp branch from 8282ce8 to b9500dc Compare April 20, 2021 18:36
@andreimatei andreimatei force-pushed the replicatedEvalResult.Timestamp branch 2 times, most recently from c1dd16e to 07fd1ca Compare April 20, 2021 18:39
Contributor Author

@andreimatei andreimatei left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @bdarnell, @nvanbenschoten, and @tbg)


pkg/kv/kvserver/replica_application_state_machine.go, line 1041 at r21 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: consider moving this definition and call to this method up above the corresponding ones for assertNoCmdClosedTimestampRegression, as an additional subtle hint that this check does not take into consideration the new closed timestamp, just the previous one.

I've moved up the definition, but the call order was intentional. If a command is both a regression and a write below the previous closed ts, I want the regression to take priority (since that probably explains the low write too).


pkg/kv/kvserver/replica_application_state_machine.go, line 384 at r22 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Give this guy a comment as well.

done


pkg/kv/kvserver/replica_application_state_machine.go, line 410 at r22 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Let's move this down to the bottom of this file.

done


pkg/kv/kvserver/replica_application_state_machine.go, line 417 at r22 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"only" "saved"

done


pkg/kv/kvserver/replica_application_state_machine.go, line 419 at r22 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I'm a little concerned that we're now going to leak the last BatchRequest run on each Range. On a node with 10k ranges and batches of 10 10KB writes, that's 1GB of leaked memory.

How much do we get out of recording the entire BatchRequest? Maybe we just record the lease, the lease index, and whether the batch was a lease request or not? And if it was a lease request, maybe the previous lease?

see now; I'm only saving lease requests, plus a flag for splits and merges. Will separately look into including the raw raft log in these assertion messages.


pkg/kv/kvserver/replica_application_state_machine.go, line 845 at r22 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Let's encapsulate this in a method on closedTimestampSetterInfo.

done


pkg/kv/kvserver/kvserverpb/proposer_kv.proto, line 130 at r20 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

If we're using this field for a clock signal again, does the rename to write_timestamp make sense? Should we revert that and keep it as timestamp?

I prefer write_timestamp; it's better for "usual" requests, and just as good IMO for the others.


pkg/roachpb/data.go, line 1857 at r23 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Was this just useless?

mostly useless. It made LeaseSequence format properly with "%s" (now a %s results in an ugly message telling you you're using the wrong verb for an int). Notice the diff below where a %s is changed to %d.

Contributor

@nvb nvb left a comment


Reviewed 4 of 9 files at r25, 3 of 4 files at r27, 1 of 2 files at r28, 1 of 2 files at r29, 7 of 7 files at r30.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @bdarnell, and @tbg)


pkg/kv/kvserver/replica_application_state_machine.go, line 1358 at r30 (raw file):

	// requests would be too expensive: cloning the request is expensive and also
	// requests can be large in memory.
	leaseReq *roachpb.RequestLeaseRequest

Do we need to clear this when the previous request was not a lease request?


pkg/kv/kvserver/replica_application_state_machine.go, line 1361 at r30 (raw file):

	// split and merge are set if the request was an EndTxn with the respective
	// commit trigger set.
	split, merge bool

Same thing here. Should we be clearing these fields at some point?

It was lying about helping with lease protection.

Release note: None
The ReplicatedEvalResult.Timestamp used to carry the read timestamp of
the request that proposed the respective command. This patch renames it
to WriteTimestamp and sets it accordingly.

The field is used for two purposes: for bumping the clock on followers
such that they don't have values above their clock, and for checking
that we're not writing below the GC threshold. Both of these use cases
actually want the write timestamp, so they were broken. The second use
seems dubious to me (I don't think it's actually needed) but this patch
does not take a position on it beyond adding a TODO.

Beyond fixing the existing uses of this field, putting the write
timestamp in every Raft command has the added benefit that we can use it
to assert below Raft that nobody tries to write below the closed
timestamp. Before this patch, only the leaseholder was able to do this
assertion (because it was the only replica that had access to the write
timestamp) and so the leaseholder had to do it as a non-deterministic
error (and non-deterministic errors are nasty, even when they're
assertion failures). This patch will help turn the assertion into a
deterministic one in the future.

In addition, by clarifying the meaning of this field, this patch opens
the door for cockroachdb#62569 to muddy the field back and give it a special
meaning for lease transfers - the lease start time.

Also mentioning cockroachdb#55293 because this patch lightly touches on the
interaction between requests and the GC threshold.

Release note: None
This patch makes below-Raft code not freak out if the field is not set
(i.e. it appears to be an empty timestamp). This is in order to leave
the door open for future versions to not send the field any more, or at
least send it only on some commands. Concretely, this patch allows the
proposer to declare that a particular request does not need to be checked
against the GC threshold below Raft (whether *any* request actually
needs to be checked below Raft is a question that we leave open).

Release note: None
This patch reworks the ReplicatedEvalResult.WriteTimestamp field
(*).  Before this patch, WriteTimestamp was always coming from
ba.WriteTimestamp(), which is either a transaction's write timestamp or,
for non-txn requests, the batch's read timestamp or, for non-MVCC
requests, some random clock value. Below Raft, the field is used for
updating the followers' clocks, and also to check the request against
the GC threshold. This patch sets the WriteTimestamp differently for
IntentWrite requests than other requests:
- for regular writes, the field remains ba.WriteTimestamp()
- for other proposals, the field is a clock reading on the proposer

Some requests (e.g. LeaseTransfers) need a clock signal to travel with
their proposal, and now they get it (see cockroachdb#62569).

[*] An alternative of splitting the field into two was considered, but it's
hard to do now because of backwards compatibility. It can be done in the
next release, though, because now all the uses of the WriteTimestamp
field tolerate it being empty.

Fixes cockroachdb#62569

Release note: None
In cockroachdb#62655 we see that there appears to be something wrong with the Raft
closed timestamp. That issue shows an assertion failure about a command
trying to write below a timestamp that was promised to have been closed
by a previous command. This patch includes a little bit more info in
that assertion (the current lease) and adds another two assertions:
- an assertion that the closed timestamp itself does not regress. This
assertion already existed in stageTrivialReplicatedEvalResult(), but
that comes after the failure we've seen. The point of copying this
assertion up is to ascertain whether we're screwing up by regressing the
closed timestamp, or whether a particular request/command is at fault
for not obeying the closed timestamp.
- an assertion against closed ts regression when copying the replica
state from a staging batch to the replica.

All these assertions can be disabled with an env var if things go bad.

Fixes cockroachdb#62765

Release note: None
This patch adds historical information to the assertion against closed
timestamp regressions. We've seen that assertion fire in cockroachdb#61981.
The replica now maintains info about what command last bumped the
ClosedTimestamp.

Release note: None
This patch makes some types survive log redacting. In particular, it
ensures that the assertions added in previous patches don't get too
redacted.

Release note: None
@andreimatei andreimatei force-pushed the replicatedEvalResult.Timestamp branch from 07fd1ca to 3f38eb2 Compare April 21, 2021 19:04
Contributor Author

@andreimatei andreimatei left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @bdarnell, and @tbg)


pkg/kv/kvserver/replica_application_state_machine.go, line 1358 at r30 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do we need to clear this when the previous request was not a lease request?

done, good catch. I've added a wipe of the whole struct in record().
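The fix described — wiping the whole struct at the top of record() so no field survives from the previous command — can be sketched as follows. This is a simplified stand-in: field and parameter names are illustrative, not the PR's actual closedTimestampSetterInfo.

```go
package main

import "fmt"

// closedTimestampSetterInfo is a simplified stand-in for the struct in the PR;
// fields are illustrative (leaseReq stands in for *roachpb.RequestLeaseRequest).
type closedTimestampSetterInfo struct {
	leaseReq     *string
	split, merge bool
}

// record resets the struct before filling it in, so state from the previous
// command (e.g. a stale leaseReq) can never leak into the new record.
func (i *closedTimestampSetterInfo) record(isLease bool, req string, split, merge bool) {
	*i = closedTimestampSetterInfo{} // the wipe the review asked for
	if isLease {
		i.leaseReq = &req
	}
	i.split, i.merge = split, merge
}

func main() {
	var info closedTimestampSetterInfo
	info.record(true, "lease", false, false) // first command: a lease request
	info.record(false, "", false, true)      // next command: a merge
	fmt.Println(info.leaseReq == nil, info.merge)
}
```

Zeroing the whole struct in one assignment is also cheap and robust against new fields being added later, since they are cleared automatically.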


pkg/kv/kvserver/replica_application_state_machine.go, line 1361 at r30 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Same thing here. Should we be clearing these fields at some point?

done

Contributor

@nvb nvb left a comment


:lgtm:

Reviewed 4 of 9 files at r32, 3 of 4 files at r34, 1 of 2 files at r35, 1 of 2 files at r36, 7 of 7 files at r37.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @bdarnell, and @tbg)

Contributor Author

@andreimatei andreimatei left a comment


bors r+

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @bdarnell, and @tbg)

@craig
Contributor

craig bot commented Apr 21, 2021

Build failed:

Contributor Author

@andreimatei andreimatei left a comment


curl: (7) Failed to connect to register.cockroachdb.com port 443: Connection timed out

bors r+

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @bdarnell, and @tbg)

@craig
Contributor

craig bot commented Apr 22, 2021

Build succeeded:

@craig craig bot merged commit bb86f86 into cockroachdb:master Apr 22, 2021
@andreimatei andreimatei deleted the replicatedEvalResult.Timestamp branch April 22, 2021 03:16