
spanconfig: introduce spanconfig.Reconciler #71994

Merged
craig[bot] merged 3 commits into cockroachdb:master from
irfansharif:211026.sql-reconciler
Dec 15, 2021

Conversation

@irfansharif
Contributor

@irfansharif irfansharif commented Oct 26, 2021

spanconfig: introduce spanconfig.Reconciler

Reconciler is responsible for reconciling a tenant's zone configs (SQL
construct) with the cluster's span configs (KV construct). It's the
central engine for the span configs infrastructure; a single Reconciler
instance is active for every tenant in the system.

type Reconciler interface {
  // Reconcile starts the incremental reconciliation process from
  // the given checkpoint. If it does not find MVCC history going
  // far back enough[1], it falls back to a scan of all
  // descriptors and zone configs before being able to do more
  // incremental work. The provided callback is invoked with
  // timestamps that can be safely checkpointed. A future
  // Reconciliation attempt can make use of this timestamp to
  // reduce the amount of necessary work (provided the MVCC
  // history is still available).
  //
  // [1]: It's possible for system.{zones,descriptor} to have been
  //      GC-ed away; think suspended tenants.
  Reconcile(
    ctx context.Context,
    checkpoint hlc.Timestamp,
    callback func(checkpoint hlc.Timestamp) error,
  ) error
}

Let's walk through what it does. At a high level, we maintain an
in-memory data structure that's up-to-date with the contents of KV
(at least the subset of spans we have access to, i.e. the keyspace
carved out for our tenant ID). We watch for changes to SQL state
(descriptors, zone configs), translate the SQL updates to the flattened
span+config form, "diff" the updates against our data structure to see
if there are any changes we need to inform KV of. If so, we do, and
ensure that our data structure is kept up-to-date. We continue watching
for future updates and repeat as necessary.
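
To make this loop concrete, here's a minimal sketch. Everything in it
(span, config, update, reconcile) is an illustrative stand-in and not the
actual package API; in particular, real spans overlap and are diffed
through an interval-backed store, which the flat map below glosses over:

package main

import "fmt"

// Illustrative stand-ins for roachpb.Span and roachpb.SpanConfig.
type span struct{ start, end string }
type config struct{ numReplicas int }

// update is a flattened span+config pair; deleted marks a span whose
// config should be removed.
type update struct {
  sp      span
  conf    config
  deleted bool
}

// reconcile diffs SQL-derived updates against an in-memory view of KV
// state, returning only the updates KV needs to hear about and keeping
// the view up-to-date as it goes.
func reconcile(kvView map[span]config, updates []update) (toApply []update) {
  for _, u := range updates {
    existing, ok := kvView[u.sp]
    switch {
    case u.deleted && ok:
      delete(kvView, u.sp)
      toApply = append(toApply, u)
    case !u.deleted && (!ok || existing != u.conf):
      kvView[u.sp] = u.conf
      toApply = append(toApply, u)
    }
    // Otherwise KV already agrees; nothing to inform it of.
  }
  return toApply
}

func main() {
  kvView := map[span]config{
    {"/Table/50", "/Table/51"}: {numReplicas: 3},
  }
  updates := []update{
    {sp: span{"/Table/50", "/Table/51"}, conf: config{numReplicas: 5}}, // changed
    {sp: span{"/Table/51", "/Table/52"}, deleted: true},                // already absent: no-op
  }
  fmt.Println(reconcile(kvView, updates)) // only the changed entry
}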

There's only a single instance of the Reconciler running for a given
tenant at a given point in time (mutual exclusion/leasing is provided by
the jobs subsystem). We needn't worry about contending writers, or the
KV state being changed from underneath us. What we do have to worry
about, however, is suspended tenants not reconciling while
suspended. It's possible for a suspended tenant's SQL state to be GC-ed
away at older MVCC timestamps; when watching for changes, we could fail
to observe tables/indexes/partitions getting deleted. Left as is, this
would result in us never issuing corresponding deletion requests for
the dropped span configs -- we'd be leaving orphaned span configs lying
around (taking up storage space and creating pointless empty ranges). A
"full reconciliation pass" is our attempt to find all these extraneous
entries in KV and to delete them.

We can use our span config data structure here too, one that's
pre-populated with the contents of KV. We translate the entire SQL state
into constituent spans and configs, diff against our data structure to
generate KV updates that we then apply. We follow this with clearing out
all these spans in our data structure, leaving behind all extraneous
entries to be found in KV -- entries we can then simply issue deletes
for.
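
Sketched as a toy under the same caveat (a flat map standing in for the
interval-based spanconfig.Store; a reading aid, not the implementation):

package main

import "fmt"

type span struct{ start, end string }
type config struct{ numReplicas int }

// fullReconcile diffs the complete SQL-derived state against a snapshot
// of KV's span configs. Entries the SQL state never touched are the
// orphans we issue deletes for.
func fullReconcile(kvContents, sqlState map[span]config) (upserts, deletes []span) {
  remaining := make(map[span]config, len(kvContents))
  for sp, conf := range kvContents {
    remaining[sp] = conf
  }
  for sp, want := range sqlState {
    if have, ok := remaining[sp]; !ok || have != want {
      upserts = append(upserts, sp) // KV disagrees; update it
    }
    delete(remaining, sp) // clear from our view; it's accounted for
  }
  for sp := range remaining {
    deletes = append(deletes, sp) // extraneous entry left behind in KV
  }
  return upserts, deletes
}

func main() {
  kv := map[span]config{
    {"/Tenant/10/Table/50", "/Tenant/10/Table/51"}: {3},
    {"/Tenant/10/Table/51", "/Tenant/10/Table/52"}: {3}, // dropped while suspended
  }
  sql := map[span]config{
    {"/Tenant/10/Table/50", "/Tenant/10/Table/51"}: {3},
  }
  upserts, deletes := fullReconcile(kv, sql)
  fmt.Println(upserts, deletes) // no upserts; one orphan to delete
}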

----

server,kvaccessor: record span configs during tenant creation/gc

For newly created tenants, we want to ensure hard splits on tenant
boundaries. The source of truth for split points in the span configs
subsystem is the contents of system.span_configurations. To ensure
hard splits, we insert a single key record at the tenant prefix. In a
future commit we'll introduce the spanconfig.Reconciler process, which
runs per tenant and governs all config entries under each tenant's
purview. This has implications for this initial record we're talking
about (this record might get cleaned up for e.g.); we'll explore it in
tests for the Reconciler itself.

Creating a single key record is easy enough -- we could've written
directly to system.span_configurations. When a tenant is GC-ed
however, we need to clear out all tenant state transactionally. To that
end we plumb in a txn-scoped KVAccessor into the planner where
crdb_internal.destroy_tenant is executed. This lets us easily delete
all abandoned tenant span config records.
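
A toy rendering of that flow -- txn stands in for *kv.Txn and a map for
system.span_configurations; the WithTxn shape mirrors the pattern this PR
converges on but isn't the verbatim interface:

package main

import (
  "context"
  "fmt"
  "strings"
)

// Illustrative stand-ins: txn for *kv.Txn, and a map keyed by start key
// for rows in system.span_configurations.
type txn struct{ ops []string }

type accessor struct {
  records map[string]string // start key -> config
  txn     *txn              // nil means non-transactional
}

// WithTxn returns an accessor scoped to the given transaction.
func (a accessor) WithTxn(t *txn) accessor { a.txn = t; return a }

// deleteTenantRecords clears all span config records under the tenant's
// prefix; run inside the destroy-tenant txn, this is atomic with the
// rest of the tenant-state cleanup.
func (a accessor) deleteTenantRecords(_ context.Context, tenantPrefix string) {
  for start := range a.records {
    if strings.HasPrefix(start, tenantPrefix) {
      delete(a.records, start)
      a.txn.ops = append(a.txn.ops, "DEL "+start)
    }
  }
}

func main() {
  acc := accessor{records: map[string]string{
    "/Tenant/10":          "seed record", // installed at tenant creation
    "/Tenant/10/Table/50": "5 replicas",
    "/Tenant/11":          "seed record", // another tenant; untouched
  }}
  destroyTxn := &txn{}
  acc.WithTxn(destroyTxn).deleteTenantRecords(context.Background(), "/Tenant/10")
  fmt.Println(destroyTxn.ops, acc.records)
}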

Note: We get rid of spanconfig.experimental_kvaccessor.enabled. Access
to spanconfigs infrastructure is already sufficiently gated through
COCKROACH_EXPERIMENTAL_SPAN_CONFIGS. Now that
crdb_internal.create_tenant attempts to write through the KVAccessor,
it's cumbersome to have to enable the setting manually in every
multi-tenant test (increasingly the default) that exercises some part of
the span configs infrastructure.

This commit introduces the need for a migration -- for existing clusters
with secondary tenants, when upgrading we need to install this initial
record at the tenant prefix for all extant tenants (and make sure to
continue doing so for subsequent newly created tenants). This is to
preserve the hard-split-on-tenant-boundary invariant we wish to provide.
It's possible for an upgraded multi-tenant cluster to have dormant sql
pods that have never reconciled. If KV switches over to the span configs
subsystem, splitting only on the contents of
system.span_configurations, we'll fail to split on all tenant
boundaries. We'll introduce this migration in a future PR (before
enabling span configs by default).
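
The migration's core loop, reduced to a toy (the tenant ID list and
records map are hypothetical stand-ins; the real migration reads the
tenants table and writes through the KVAccessor):

package main

import "fmt"

// seedTenantSpanConfigs installs a record at each extant tenant's prefix
// iff none exists -- migrations must be idempotent, and a tenant may have
// already reconciled (or been created post-gate) and seeded itself.
func seedTenantSpanConfigs(tenantIDs []uint64, records map[string]string) {
  for _, id := range tenantIDs {
    prefix := fmt.Sprintf("/Tenant/%d", id)
    if _, ok := records[prefix]; ok {
      continue // already has span config state
    }
    records[prefix] = "seed" // hard split point at the tenant boundary
  }
}

func main() {
  records := map[string]string{"/Tenant/10": "seed"}
  seedTenantSpanConfigs([]uint64{10, 11, 12}, records)
  fmt.Println(records) // tenants 11 and 12 get seeded; 10 is left alone
}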

Release note: None


Only the last two commits here are of interest (first one is from #73531).

@cockroach-teamcity
Member

This change is Reviewable

@irfansharif irfansharif force-pushed the 211026.sql-reconciler branch from 56f5bf4 to 0392096 on November 1, 2021 18:04
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Nov 25, 2021
We first introduced spanconfig.StoreWriter in cockroachdb#70287. Here we extend the
interface to accept a batch of updates instead of just one.

    type StoreWriter interface {
        Apply(ctx context.Context, dryrun bool, updates ...Update) (
            deleted []roachpb.Span, added []roachpb.SpanConfigEntry,
        )
    }

The implementation is subtle -- we're not processing one update at a
time. The semantics we're after is applying a batch of updates
atomically on a consistent snapshot of the underlying store. This comes
up in the upcoming spanconfig.Reconciler (cockroachdb#71994) -- there, following a
zone config/descriptor change, we want to update KV state in a single
request/RTT instead of an RTT per descendant table. The intended
usage is better captured in the aforementioned PR; let's now talk
about what this entails for the data structure.

To apply a single update, we want to find all overlapping spans and
clear out just the intersections. If the update is adding a new span
config, we'll also want to add the corresponding store entry after. We
do this by deleting all overlapping spans in their entirety and
re-adding the non-overlapping segments. Pseudo-code:

  for entry in store.overlapping(update.span):
      union, intersection = union(update.span, entry), intersection(update.span, entry)
      pre  = span{union.start_key, intersection.start_key}
      post = span{intersection.end_key, union.end_key}

      delete {span=entry.span, conf=entry.conf}
      if entry.contains(update.span.start_key):
          # First entry overlapping with update.
          add {span=pre, conf=entry.conf} if non-empty
      if entry.contains(update.span.end_key):
          # Last entry overlapping with update.
          add {span=post, conf=entry.conf} if non-empty

  if adding:
      add {span=update.span, conf=update.conf} # add ourselves

When extending to a set of updates, things are more involved. Let's
assume that the updates are non-overlapping and sorted by start key. As
before, we want to delete overlapping entries in their entirety and
re-add the non-overlapping segments. With multiple updates, it's
possible that a segment being re-added will overlap another update.  If
processing one update at a time in sorted order, we want to only re-add
the gap between the consecutive updates.

    keyspace         a  b  c  d  e  f  g  h  i  j
    existing state      [--------X--------)
    updates          [--A--)           [--B--)

When processing [a,c):A, after deleting [b,h):X, it would be incorrect
to re-add [c,h):X since we're also looking to apply [g,i):B. Instead of
re-adding the trailing segment right away, we carry it forward and
process it when iterating over the second, possibly overlapping update.
In our example, when iterating over [g,i):B we can subtract the overlap
from [c,h):X and only re-add [c,g):X.

It's also possible for the segment to extend past the second update. In
the example below, when processing [d,f):B and having [b,h):X carried
over, we want to re-add [c,d):X and carry forward [f,h):X to the update
after (i.e. [g,i):C).

    keyspace         a  b  c  d  e  f  g  h  i  j
    existing state      [--------X--------)
    updates          [--A--)  [--B--)  [--C--)

One final note: we're iterating through the updates without actually
applying any mutations. Going back to our first example, when processing
[g,i):B, retrieving the set of overlapping spans would (again) retrieve
[b,h):X -- an entry we've already encountered when processing [a,c):A.
Re-adding non-overlapping segments naively would re-add [b,g):X -- an
entry that overlaps with our last update [a,c):A. When retrieving
overlapping entries, we need to exclude any that overlap with the
segment that was carried over. Pseudo-code:

  carry-over = <empty>
  for update in updates:
      carried-over, carry-over = carry-over, <empty>
      if update.overlap(carried-over):
          # Fill in the gap between consecutive updates.
          add {span=span{carried-over.start_key, update.start_key}, conf=carried-over.conf}
          # Consider the trailing span after update; carry it forward if non-empty.
          carry-over = {span=span{update.end_key, carried-over.end_key}, conf=carried-over.conf}
      else:
          add {span=carried-over.span, conf=carried-over.conf} if non-empty

      for entry in store.overlapping(update.span):
          if entry.overlap(carried-over):
              continue # already processed

          union, intersection = union(update.span, entry), intersection(update.span, entry)
          pre  = span{union.start_key, intersection.start_key}
          post = span{intersection.end_key, union.end_key}

          delete {span=entry.span, conf=entry.conf}
          if entry.contains(update.span.start_key):
              # First entry overlapping with update.
              add {span=pre, conf=entry.conf} if non-empty
          if entry.contains(update.span.end_key):
              # Last entry overlapping with update.
              carry-over = {span=post, conf=entry.conf}

      if adding:
          add {span=update.span, conf=update.conf} # add ourselves

  add {span=carry-over.span, conf=carry-over.conf} if non-empty
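
As a sanity check on the pseudo-code, here's a brute-force reference
model over a discrete toy keyspace -- per-key application that the
interval algorithm must agree with on the first example above:

package main

import "fmt"

// Brute-force reference model: expand spans over a tiny discrete
// keyspace and apply updates per key. Useful for checking what the
// interval algorithm should produce; not the actual Store.
func main() {
  keys := "abcdefghij"
  state := map[byte]byte{} // key -> config; absent means unset

  apply := func(start, end, conf byte) {
    for k := start; k < end; k++ {
      state[k] = conf
    }
  }

  apply('b', 'h', 'X') // existing state: [b,h):X
  apply('a', 'c', 'A') // update [a,c):A
  apply('g', 'i', 'B') // update [g,i):B

  // Read back coalesced entries; expect [a,c):A, [c,g):X, [g,i):B.
  for i := 0; i < len(keys); {
    conf, ok := state[keys[i]]
    if !ok {
      i++
      continue
    }
    j := i
    for j < len(keys) && state[keys[j]] == conf {
      j++
    }
    fmt.Printf("[%c,%c):%c\n", keys[i], keys[j-1]+1, conf)
    i = j
  }
}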

We've extended the randomized testing suite to generate batches of
updates at a time. We've also added a few illustrated datadriven tests.

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Nov 25, 2021
The only mutable state across concurrent WatchForSQLUpdate calls was the
internal buffer, which does not need to hang off the surrounding struct.
This obviates the need for the factory pattern we were using -- callers
can set up multiple SQLWatchers concurrently as is (see
TestSQLWatcherMultiple).

This commit simply hides the factory under the package boundary; in a
future commit we'll shed it altogether. This has the benefit of reducing
the number of symbols in pkg/spanconfig and making it symmetric with the
other spanconfig dependencies typically found together (KVAccessor,
SQLTranslator). It's also ever-so-slightly less verbose to use in the
upcoming spanconfig.Reconciler (cockroachdb#71994).

Release note: None
@irfansharif irfansharif force-pushed the 211026.sql-reconciler branch 16 times, most recently from 0eda1a2 to e5e0da5 on November 29, 2021 20:49
@irfansharif
Contributor Author

irfansharif commented Dec 6, 2021

Rebased on top of the recent slew of PRs that have landed, prepending another that tripped up CI: #73531.

Contributor

@ajwerner ajwerner left a comment


Here's an initial high-level pass. Cool testing framework.

GetSpanConfigEntriesForWithTxn(
ctx context.Context,
spans []roachpb.Span,
txn *kv.Txn,
Contributor


Did you consider changing this around so the interface was the same but you curried the transaction in?

Something like:

type TxnKVAccessor interface{
    WithTxn(*kv.Txn) KVAccessor
}

// There's only single instance of the Reconciler running for a given tenant at
// a given point it time (mutual exclusion/leasing is provided by the jobs
// subsystem). We needn't worry about contending writers, or the KV state being
// changed from underneath us. What we do have to worry about, however, is
Contributor


This is something I've been worried about; it led to #73399

func (r *Reconciler) Reconcile(
ctx context.Context, checkpoint hlc.Timestamp, callback func(checkpoint hlc.Timestamp) error,
) error {
// TODO(irfansharif): Check system.{zones,descriptors} for last GC timestamp
Contributor


Why not just try to start the rangefeed and if you hit an error go and do the full reconciliation? In theory you need to deal with that case. Is there something that makes that harder now that #73086 has merged?

// checkpoint. For changes to, say, RANGE DEFAULT, the RPC request proto is
// proportional to the number of schema objects.
func (r *Reconciler) Reconcile(
ctx context.Context, checkpoint hlc.Timestamp, callback func(checkpoint hlc.Timestamp) error,
Contributor


My sense is this code could be a bit more decomposed if you didn't maintain all of the reconciliation state in closed over variables. The Reconciler to me is more of a factory that holds on to dependencies. It is itself stateless. Consider pulling out a struct that can hold the state so that various callbacks and what not can be methods.

Contributor

@nvb nvb left a comment


Reviewed 11 of 41 files at r14, 69 of 69 files at r16, 62 of 69 files at r17, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @irfansharif, and @nvanbenschoten)


pkg/server/server.go, line 627 at r16 (raw file):

	}

	spanConfig := struct {

nit: var spanConfig struct { ... }


pkg/spanconfig/spanconfig.go, line 56 at r17 (raw file):

Previously, ajwerner wrote…

Did you consider changing this around so the interface was the same but you curried the transaction in?

Something like:

type TxnKVAccessor interface{
    WithTxn(*kv.Txn) KVAccessor
}

Once you do that, you may even be able to get rid of this separate interface. You could imagine:

type KVAccessor interface {
    ...
    WithTxn(*kv.Txn) KVAccessor
}

pkg/spanconfig/spanconfig.go, line 185 at r17 (raw file):

	Reconcile(
		ctx context.Context,
		checkpoint hlc.Timestamp,

Is "checkpoint" the right name for this parameter? It doesn't really tell me much about what the checkpoint represents and which values it can hold for the use of the Reconciler to be correct. Would "lastCheckpoint" be a better name? Or "prevCheckpoint"? Or just "startFrom"?


pkg/spanconfig/spanconfigkvaccessor/kvaccessor.go, line 84 at r16 (raw file):

// GetSpanConfigEntriesFor.
func (k *KVAccessor) GetSpanConfigEntriesForWithTxn(
	ctx context.Context, spans []roachpb.Span, txn *kv.Txn,

txn should be named something that indicates that it may be nil and that it can't be used multiple times with any expectation of transactionality.


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 84 at r17 (raw file):

// up-to-date. We continue watching for future updates and repeat as necessary.
//
// There's only single instance of the Reconciler running for a given tenant at

"only a single instance"


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 85 at r17 (raw file):

//
// There's only single instance of the Reconciler running for a given tenant at
// a given point it time (mutual exclusion/leasing is provided by the jobs

Leasing provides best-effort mutual exclusion, not real mutual exclusion. I think it would be worthwhile to have a discussion about what can and can't go wrong if you do have two of these running at the same time. Nothing good has to happen, but we'll want to convince ourselves that nothing bad can happen either.


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 95 at r17 (raw file):

// leaving orphaned span configs lying around (taking up storage space and
// creating pointless empty ranges). A "full reconciliation pass" is our attempt
// to find all these extraneous entries in KV and to delete them.

Did you have a chance to follow up on our discussion about a potential cooperative handshake before suspending a tenant pod?


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 117 at r17 (raw file):

Previously, ajwerner wrote…

Why not just try to start the rangefeed and if you hit an error go and do the full reconciliation? In theory you need to deal with that case. Is there something that makes that harder now that #73086 has merged?

My reading of this is that we aren't currently tracking the last checkpoint (i.e. checkpoint is always passed in as empty), so we know the rangefeed will initially fail. But I agree with what you said and we might as well structure this the way we want it to look to begin with.


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 187 at r17 (raw file):

		// Make a copy of the current view of the world (i.e. KVAccessor
		// contents) for incremental maintenance below.

It wasn't immediately clear to me why we needed to hold on to both stores, especially because we were mutating initialStore. We should add to this comment and explain the algorithm we're using here. Or add it above somewhere. It's non-trivial and would benefit from being distilled into formal notation.


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 194 at r17 (raw file):

		for _, update := range updates {
			initialStore.Apply(ctx, false /* dryrun */, spanconfig.Update{
				Span: update.Span,

nit: consider a comment like Config: roachpb.SpanConfig{}, // delete or some other way to re-iterate that these applications are both deleting overlapping entries.


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 223 at r17 (raw file):

	}

	// TODO(irfansharif): Hook into #73086, which can help bubble up retryable

Related to Andrew's comment above, this seems like an important TODO, because falling back to a full reconciliation and then restarting the rangefeed implies a restructuring of the logic in this function (or the caller of this function?). It also may imply that the reconciler itself will need to remember checkpoints. So it feels like the kind of thing we should address now, even if it's not hooked up entirely yet.


pkg/spanconfig/spanconfigreconciler/testdata/basic, line 0 at r17 (raw file):
Did you mean to add this empty file?


pkg/spanconfig/spanconfigtestutils/recorder.go, line 76 at r17 (raw file):

	for _, u := range toUpsert {
		r.mu.mutations = append(r.mu.mutations, mutation{
			update:   spanconfig.Update{Span: u.Span, Config: u.Config},

The typing here is strange. I found it odd the first time I saw it and it still stands out to me. Why are spanconfig.Update and roachpb.SpanConfigEntry different types? They have exactly the same structure, don't they?

Beyond redundant code, I think there's a deeper point here, which is that there's an expectation that an "upsert" operation's input structure (in any context, not just here) is identical to the structure of the data that it's operating on. Upsert isn't an update, it's an overwrite. So the mismatch between the types keeps looking like a red flag.

EDIT: if you're doing this because you think it's clearer to decouple these concepts, consider a type alias instead of a completely different type.


pkg/spanconfig/spanconfigtestutils/recorder.go, line 91 at r17 (raw file):

	defer r.mu.Unlock()

	var output strings.Builder

nit: move this down above the loop that uses it.


pkg/sql/drop_table.go, line 134 at r17 (raw file):

			return err
		}

stray line?


pkg/sql/tenant.go, line 140 at r16 (raw file):

	// migrates over.
	//
	// [1]: It doesn't really matter what span is inserted here as long as it

Could you talk about why this just needs to land on the tenant's start key and doesn't need to extend all the way to its end key? That's because we don't need a split at the tenant's end key?


pkg/sql/tenant.go, line 145 at r16 (raw file):

	// reconciliation process.

	// TODO(irfansharif): What should this initial default be? Could be this

Were you planning on addressing this question in this commit?


pkg/sql/tenant.go, line 261 at r16 (raw file):

	}

	// Create initial splits for the new tenant. This is performed

How does the logic you're adding now interact with this logic? Are they redundant?

Collaborator

@arulajmani arulajmani left a comment


It's really exciting to see all this come together and compose so well! I want to spend a bit more time reading through all your tests here, but I'll publish my first pass comments in the meantime.

Reviewed 11 of 41 files at r14, 69 of 69 files at r16, 34 of 69 files at r17, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @irfansharif, and @nvanbenschoten)


-- commits, line 74 at r17:
nit: s/of the/of/g


-- commits, line 75 at r17:
Is this at least or always the case?


-- commits, line 83 at r17:
nit: s/only/only a/


pkg/spanconfig/spanconfig.go, line 56 at r17 (raw file):

Previously, ajwerner wrote…

Did you consider changing this around so the interface was the same but you curried the transaction in?

Something like:

type TxnKVAccessor interface{
    WithTxn(*kv.Txn) KVAccessor
}

Or alternatively, did you consider passing in a WithTxn option to the KVAccessor methods instead of defining a new interface?


pkg/spanconfig/spanconfig.go, line 167 at r17 (raw file):

	// Apply applies a batch of non-overlapping updates atomically[1] and
	// returns (i) the existing spans that were deleted, and (ii) the entries
	// that were newly added to make room for the batch. The deleted list can

Did you delete this because you didn't end up using this as described in the end?


pkg/spanconfig/spanconfig.go, line 236 at r17 (raw file):

	//  Store*   | [--- A ----)[--- D ----)[-- B --)         [-- C -)[--- E ---)
	//
	// [1]: Unless dryrun is true. We'll still generate the same {deleted,added}

Seems like we aren't making use of dryrun anywhere yet -- did we expect to use this and never ended up doing so? If memory serves me right I think we were hoping to make use of this in the Reconciler, but looks like we don't need it. Should we get rid of it?


pkg/spanconfig/spanconfigjob/job.go, line 36 at r17 (raw file):

	if err := rc.Reconcile(ctx, hlc.Timestamp{}, func(checkpoint hlc.Timestamp) error {
		// TODO(irfansharif): Stash this checkpoint somewhere and use it when

Is there an issue for checkpointing yet?


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 221 at r17 (raw file):

	if err := callback(fullTranslationTS); err != nil {
		return err
	}

Consider pulling out FullReconciliation into its own method?


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 245 at r17 (raw file):

						if update.DescriptorType == catalog.Table {
							desc, err := descsCol.GetImmutableDescriptorByID(ctx, txn, update.ID, tree.CommonLookupFlags{
								Required:       true, // we want to find out error out for missing descriptors

"find out error out"


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 287 at r17 (raw file):

				spanConfigUpdates = append(spanConfigUpdates, spanconfig.Update{
					Span: tableSpan,
				})

nit: add a comment indicating no config == deleted?

Contributor Author

@irfansharif irfansharif left a comment


TFTRs! I decomposed the Reconciler code and added a bit more commentary. Do we think this PR is merge-able? (totally fine if it isn't -- just want to see if it could land before the weekend). For some reason I can't see @arulajmani's comments on Reviewable, so I'm not completely sure what lines are being referenced for a few of them. I tried to address everything that I could figure out, though. Responding here:

Seems like we aren't making use of dryrun anywhere [...] Should we get rid of it?

Good idea, will follow up in another PR.

[re: comments on spanconfig.StoreWriter] Did you delete this because you didn't end up using this as described in the end?

No, the reconciler looked exactly as previously described in that comment block (it came in handy, I just blindly followed the pseudocode we spelled out there, ha). That comment block was just repurposed elsewhere near the Reconciler itself.

Is there an issue for checkpointing yet?

It's added to #67679 for now, I'll break out that list into sub-issues to tackle over December/January as quick wins.

Consider pulling out FullReconciliation into its own method?

Done (though not exported).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)


pkg/spanconfig/spanconfig.go, line 56 at r17 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Once you do that, you may even be able to get rid of this separate interface. You could imagine:

type KVAccessor interface {
    ...
    WithTxn(*kv.Txn) KVAccessor
}

Done, went with the unified interface. It does however add a wart -- this interface doesn't really apply to the Connector (just panics in the implementation). I'm happy to change it out to the usual factory pattern if you'd like.


pkg/spanconfig/spanconfig.go, line 185 at r17 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is "checkpoint" the right name for this parameter? It doesn't really tell me much about what the checkpoint represents and which values it can hold for the use of the Reconciler to be correct. Would "lastCheckpoint" be a better name? Or "prevCheckpoint"? Or just "startFrom"?

Done.


pkg/spanconfig/spanconfigkvaccessor/kvaccessor.go, line 84 at r16 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

txn should be named something that indicates that it may be nil and that it can't be used multiple times with any expectation of transactionality.

This method's gone, and added an assert + commentary. What were you thinking for the name? Something like {optional,Nullable}Txn?


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 85 at r17 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Leasing provides best-effort mutual exclusion, not real mutual exclusion. I think it would be worthwhile to have a discussion about what can and can't go wrong if you do have two of these running at the same time. Nothing good has to happen, but we'll want to convince ourselves that nothing bad can happen either.

Actually, the spanconfig.Manager here provides mutual exclusion. Before starting the reconciliation job, it waits for any existing ones to have died down. Updated the comment to say as much.

// createAndStartJobIfNoneExists creates span config reconciliation job iff it
// hasn't been created already and notifies the jobs registry to adopt it.
// Returns a boolean indicating if the job was created.
func (m *Manager) createAndStartJobIfNoneExists(ctx context.Context) (bool, error) {


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 87 at r17 (raw file):

Previously, ajwerner wrote…

This is something I've been worried about; it led to #73399

Cool issue! I've added a footnote to this comment block, could you check if I've understood it correctly?


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 95 at r17 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Did you have a chance to follow up on our discussion about a potential cooperative handshake before suspending a tenant pod?

Not yet, I still have to think through it more. Perhaps we can continue discussing in our pod/1:1s. There's a footnote here referring to #73399, which I think will play some part in a cooperative pod suspension story, but I'm not sure what a more holistic approach would look like. Certainly portions of the Reconciler would benefit from stronger invariants -- just don't know what that looks like. Were you thinking we'd want to do something here short-term?


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 115 at r17 (raw file):

Previously, ajwerner wrote…

My sense is this code could be a bit more decomposed if you didn't maintain all of the reconciliation state in closed over variables. The Reconciler to me is more of a factory that holds on to dependencies. It is itself stateless. Consider pulling out a struct that can hold the state so that various callbacks and what not can be methods.

Done -- I think it should be much more understandable now, PTAL.


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 117 at r17 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

My reading of this is that we aren't currently tracking the last checkpoint (i.e. checkpoint is always passed in as empty), so we know the rangefeed will initially fail. But I agree with what you said and we might as well structure this the way we want it to look to begin with.

#73086 was motivated by just this. If ok with you two, I want to add this as a follow-on along with actual persisting of the checkpoint timestamp in the reconciliation job. This code was typed out before #73086 had landed. After decomposing the Reconciler code like suggested above, it should be easy enough to do.


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 187 at r17 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

It wasn't immediately clear to me why we needed to hold on to both stores, especially because we were mutating initialStore. We should add to this comment and explain the algorithm we're using here. Or add it above somewhere. It's non-trivial and would benefit from being distilled into formal notation.

Done. A bunch of this was refactored as well, PTAL.


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 194 at r17 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: consider a comment like Config: roachpb.SpanConfig{}, // delete or some other way to re-iterate that these applications are both deleting overlapping entries.

Done.


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 223 at r17 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Related to Andrew's comment above, this seems like an important TODO, because falling back to a full reconciliation and then restarting the rangefeed implies a restructuring of the logic in this function (or the caller of this function?). It also may imply that the reconciler itself will need to remember checkpoints. So it feels like the kind of thing we should address now, even if it's not hooked up entirely yet.

See thread above -- right now these errors would bubble up to the reconciliation job, which would fail. The manager would detect the failure soon-ish and start another job -- so we're in the clear, but taking a long time to retry when it's safe to do so. Seeing as how we'll want to add code + testing for correct handling at the job level, and I'm planning to come back and add actual job checkpointing soon, I'd like to defer this work until later. I've moved this TODO to the job itself to make it clearer.


pkg/spanconfig/spanconfigtestutils/recorder.go, line 76 at r17 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

The typing here is strange. I found it odd the first time I saw it and it still stands out to me. Why are spanconfig.Update and roachpb.SpanConfigEntry different types? They have exactly the same structure, don't they?

Beyond redundant code, I think there's a deeper point here, which is that there's an expectation that an "upsert" operation's input structure (in any context, not just here) is identical to the structure of the data that it's operating on. Upsert isn't an update, it's an overwrite. So the mismatch between the types keeps looking like a red flag.

EDIT: if you're doing this because you think it's clearer to decouple these concepts, consider a type alias instead of a completely different type.

Did you mean a type definition? i.e. type Update roachpb.SpanConfigEntry. A type alias would implicitly cast roachpb.SpanConfigEntry to Update and vice-versa, and we'd have to move Update into pkg/roachpb to define additional methods on the type (.Addition(), .Deletion()).

At some point this spanconfig.Update type was called spanconfig.Operation -- perhaps that's what we should go back to? I was thinking of "update" less as "updating/upserting a value" and more "applying an update/operation" (this type shows up only on StoreWriter.Apply).

I'm not fully following the suggestion but I'm happy to make the change you want -- I agree there's a lot of redundancy and I'm sure I'm holding the types wrong.


pkg/sql/tenant.go, line 140 at r16 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Could you talk about why this just needs to land on the tenant's start key and doesn't need to extend all the way to its end key? That's because we don't need a split at the tenant's end key?

It's because splits are only a function of the start keys, but yes, worth spelling out -- done.


pkg/sql/tenant.go, line 145 at r16 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Were you planning on addressing this question in this commit?

No, not really -- I think this is fine seeing as how these configs will just disappear as soon as the pod starts up for the first time. You could imagine plumbing in whatever the value for RANGE DEFAULT is at the time, but I don't think it will really give us much. Also, it wouldn't track the host's RANGE DEFAULT if it changes (doing that feels even less justified).


pkg/sql/tenant.go, line 261 at r16 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

How does the logic you're adding now interact with this logic? Are they redundant?

Yup, they're redundant. This can be removed once we get rid of the system config span -- added a TODO.


pkg/spanconfig/spanconfigreconciler/testdata/basic, line 0 at r17 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Did you mean to add this empty file?

Woops, removed.

@irfansharif
Contributor Author

I folded in the migration for seed tenant records into this PR instead (originally #73623).

Contributor

@nvb nvb left a comment


Reviewed 19 of 58 files at r18, 29 of 29 files at r20, 27 of 27 files at r21, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @irfansharif, and @nvanbenschoten)


pkg/migration/migrations/seed_tenant_span_configs.go, line 59 at r20 (raw file):

			// keyspace; elsewhere this ensures that we split on the tenant
			// boundary. Look towards CreateTenantRecord for more details.
			tenantSpanConfig := d.SpanConfig.Default

Do you think it would be worth avoiding the duplicate logic between this SpanConfigEntry and the one constructed in pkg/sql/tenant.go?


pkg/spanconfig/spanconfig.go, line 48 at r20 (raw file):

	// WithTxn returns a KVAccessor that's scoped to the given transaction
	// (allowed to be nil).

Leave a note about what happens if someone tries to chain calls to this like kvaccessor.WithTxn(txn1).WithTxn(txn2).


pkg/spanconfig/spanconfig.go, line 48 at r20 (raw file):

	// WithTxn returns a KVAccessor that's scoped to the given transaction
	// (allowed to be nil).

Can we expand on what it means for a KVAccessor to be "scoped to a given transaction"?


pkg/spanconfig/spanconfigkvaccessor/kvaccessor.go, line 84 at r16 (raw file):

Previously, irfansharif (irfan sharif) wrote…

This method's gone, and added an assert + commentary. What were you thinking for the name? Something like {optional,Nullable}Txn?

Yes, optionalTxn or something like that would be an improvement. Something that makes it clear from the InternalExecutor callsites that repeat calls might not be transactional.


pkg/spanconfig/spanconfigkvaccessor/kvaccessor.go, line 132 at r20 (raw file):

// WithTxn is part of the KVAccessor interface.
func (k *KVAccessor) WithTxn(txn *kv.Txn) spanconfig.KVAccessor {
	return &KVAccessor{

Consider extracting out some shared, private constructor to avoid duplication with New. This seems like a recipe for forgetting a field in one of these two places.
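
For concreteness, one shape that could take (placeholder fields, not the actual struct):

package main

import "fmt"

type kvTxn struct{ name string } // stand-in for *kv.Txn

type KVAccessor struct {
  db          string // placeholder dependencies
  settings    string
  optionalTxn *kvTxn
}

// newKVAccessor is the single shared constructor; both New and WithTxn
// route through it, so a newly added field can't be forgotten in one of
// the two places.
func newKVAccessor(db, settings string, txn *kvTxn) *KVAccessor {
  return &KVAccessor{db: db, settings: settings, optionalTxn: txn}
}

func New(db, settings string) *KVAccessor { return newKVAccessor(db, settings, nil /* txn */) }

func (k *KVAccessor) WithTxn(txn *kvTxn) *KVAccessor {
  return newKVAccessor(k.db, k.settings, txn)
}

func main() {
  acc := New("db", "settings").WithTxn(&kvTxn{name: "destroy-tenant"})
  fmt.Println(acc.optionalTxn.name)
}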


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 85 at r17 (raw file):

Before starting the reconciliation job, it waits for any existing ones to have died down

How exactly does it do this? It's waiting for a lease to expire somewhere, right? If so, then it's not true mutual exclusion (e.g. https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html). So we need to ensure that we're guaranteeing safety properties in the case where we have two of these jobs running, even if we don't ensure liveness.


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 117 at r17 (raw file):

Previously, irfansharif (irfan sharif) wrote…

#73086 was motivated by just this. If ok with you two, I want to add this as a follow-on along with actual persisting of the checkpoint timestamp in the reconciliation job. This code was typed out before #73086 had landed. After decomposing the Reconciler code like suggested above, it should be easy enough to do.

Thanks. I think this is factored well enough now that we can do that change in a separate commit. Mind opening a short issue to track that work though?


pkg/spanconfig/spanconfigtestutils/recorder.go, line 76 at r17 (raw file):

Did you mean a type definition?

Yes, good catch. type Update roachpb.SpanConfigEntry is what I was thinking. That allows you to define methods on a new type, without implicit casting, but also ensures that the two types have the same structure.

It also allows you to perform casts like up := spanconfig.Update(u) instead of the more error-prone and verbose field assignment. You can imagine this being combined with a func Delete(s roachpb.Span) Update { ... } function in the spanconfig package to make this code much clearer.


pkg/sql/tenant.go, line 261 at r16 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Yup, they're redundant. This can be removed once we get rid of the system config span -- added a TODO.

I haven't thought through this carefully, but there may be something to synchronously performing the split, at least at the tenant's start key, to guarantee that the tenant's range is already split off from its neighbor by the time that it's allowed to start operating.


pkg/sql/tenant.go, line 127 at r20 (raw file):

	}

	if !execCfg.Settings.Version.IsActive(ctx, clusterversion.SeedTenantSpanConfigs) {

Is this check racy with the migration? What happens if the migration runs on the first node to hear about the new cluster version, then a new tenant is created on the last node to hear about the new cluster version. Will the tenant be left without a spanconfig?

Contributor Author

@irfansharif irfansharif left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, and @nvanbenschoten)


pkg/migration/migrations/seed_tenant_span_configs.go, line 59 at r20 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do you think it would be worth avoiding the duplicate logic between this SpanConfigEntry and the one constructed in pkg/sql/tenant.go?

I think a bit of copying is fine, it's only a few lines and this migration code is ephemeral.


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 85 at r17 (raw file):

It's waiting for a lease to expire somewhere, right?

No, it's transactionally checking against the job table to make sure that no job exists, and then (and only then) transactionally creating the new job. I'm reading the code snippet linked above and the library method linked below, though I may be misunderstanding something. Does our use of txns against the jobs table not give us the true mutual exclusion we're wanting?

func RunningJobExists(


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 117 at r17 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Thanks. I think this is factored well enough now that we can do that change in a separate commit. Mind opening a short issue to track that work though?

Filed #73694.


pkg/spanconfig/spanconfigtestutils/recorder.go, line 76 at r17 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Did you mean a type definition?

Yes, good catch. type Update roachpb.SpanConfigEntry is what I was thinking. That allows you to define methods on a new type, without implicit casting, but also ensures that the two types have the same structure.

It also allows you to perform casts like up := spanconfig.Update(u) instead of the more error-prone and verbose field assignment. You can imagine this being combined with a func Delete(s roachpb.Span) Update { ... } function in the spanconfig package to make this code much clearer.

Done (in a stand-alone commit). Called this constructor "Deletion" instead, so it reads store.Apply(..., spanconfig.Deletion(sp)).
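
Roughly, with simplified stand-in protos (the shapes here are a sketch;
the Addition counterpart is assumed for symmetry):

package main

import "fmt"

// Simplified stand-ins for the roachpb protos.
type Span struct{ Start, End string }
type SpanConfig struct{ NumReplicas int32 }
type SpanConfigEntry struct {
  Span   Span
  Config SpanConfig
}

// Update is a type definition (not an alias) over the entry proto: same
// structure, but it admits its own methods and constructors.
type Update SpanConfigEntry

// Deletion returns an Update that removes whatever config is found over
// the given span; the empty config doubles as the deletion sentinel.
func Deletion(sp Span) Update { return Update{Span: sp} }

// Addition returns an Update that installs the given config over the
// span. (Assumed counterpart, included for symmetry.)
func Addition(sp Span, conf SpanConfig) Update {
  return Update{Span: sp, Config: conf}
}

func (u Update) IsDeletion() bool { return u.Config == (SpanConfig{}) }

func main() {
  del := Deletion(Span{"/Table/51", "/Table/52"})
  add := Addition(Span{"/Table/50", "/Table/51"}, SpanConfig{NumReplicas: 5})
  fmt.Println(del.IsDeletion(), add.IsDeletion()) // true false
}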


pkg/sql/tenant.go, line 261 at r16 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I haven't thought through this carefully, but there may be something to synchronously performing the split, at least at the tenant's start key, to guarantee that the tenant's range is already split off from its neighbor by the time that it's allowed to start operating.

That's a good point, I guess the hour long split expiration should sufficiently cover all async-propagation delays of this split, which is what we were doing with the gossip delays as well. Removed the TODO and updated the comment to also refer to the new subsystem.


pkg/sql/tenant.go, line 127 at r20 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is this check racy with the migration? What happens if the migration runs on the first node to hear about the new cluster version, then a new tenant is created on the last node to hear about the new cluster version. Will the tenant be left without a spanconfig?

Good catch, split the migration across two versions. The first will start allowing newly created tenants to install fresh span configs, the second will install configs for tenants that don't have any. Added appropriate tests, etc.

Contributor

@nvb nvb left a comment


Reviewed 9 of 35 files at r22, 27 of 27 files at r23, 8 of 8 files at r24, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @irfansharif, and @nvanbenschoten)


pkg/migration/migrations/seed_tenant_span_configs.go, line 82 at r23 (raw file):

				// This tenant already has span config entries. It was either
				// already migrated (migrations need to be idempotent) or it was
				// created after PreSeedTenantSpanConfigs was activated. THere's

s/THere's/There's/


pkg/spanconfig/spanconfig.go, line 48 at r20 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Leave a note about what happens if someone tries to chain calls to this like kvaccessor.WithTxn(txn1).WithTxn(txn2).

Is this worth supporting? Could doing so mask very subtle bugs? Consider a log.Fatal in this case instead.


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 85 at r17 (raw file):

Previously, irfansharif (irfan sharif) wrote…

It's waiting for a lease to expire somewhere, right?

No, it's transactionally checking against the job table to make sure that no job exists, and then (and only then) transactionally creating the new job. I'm reading the code snippet linked above and the library method linked below, though I may be misunderstanding something. Does our use of txns against the jobs table not give us the true mutual exclusion we're wanting?

func RunningJobExists(

That depends on what removes jobs from the table. If a node begins running a job and then crashes or otherwise becomes unresponsive, how do we recover and ensure that someone else can adopt the job? I assume there's a time-based lease at some level here or none of this would be fault-tolerant.

Maybe that's what the sqlliveness table stores. I see reference to a crdb_internal.sql_liveness_is_alive builtin function, which is checking a lease expiration. server.sqlliveness.ttl and server.sqlliveness.heartbeat imply a 40 second lease that's extended every 5 seconds.

We don't need to bottom out on this in this PR, but every new job that assumes mutual exclusion should be having these discussions at some point.


pkg/spanconfig/spanconfigtestutils/recorder.go, line 76 at r17 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Done (in a stand-alone commit). Called this constructor "Deletion" instead, so it reads store.Apply(..., spanconfig.Deletion(sp)).

Thanks, I think this turned out well. It's much easier to read now.


pkg/sql/tenant.go, line 127 at r20 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Good catch, split the migration across two versions. The first will start allowing newly created tenants to install fresh span configs, the second will install configs for tenants that don't have any. Added appropriate tests, etc.

Is that sufficient to avoid the race? If these are minor versions, do we have a guarantee that the fast node that runs both migrations isn't two minor versions ahead of the old node?

What's the hazard with removing this version check entirely? Is it that the spanconfig table might not exist? Is so, maybe we should be using transactions to avoid the race. Could we transactionally check whether the table exists yet, and perform the insert if it does? And if it doesn't, we know that the migration to seed the table hasn't run yet, so we rely on that.


pkg/sql/tenant.go, line 128 at r23 (raw file):

	if !execCfg.Settings.Version.IsActive(ctx, clusterversion.PreSeedTenantSpanConfigs) {
		return err

Should this be return nil?


pkg/sql/tenant.go, line 440 at r23 (raw file):

		scKVAccessor := execCfg.SpanConfigKVAccessor.WithTxn(txn)
		entries, err := scKVAccessor.GetSpanConfigEntriesFor(ctx, []roachpb.Span{tenantSpan})

What will happen here if the spanconfigs table hasn't been created yet?

Contributor Author

@irfansharif irfansharif left a comment


PTAL -- should've addressed everything.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, and @nvanbenschoten)


pkg/spanconfig/spanconfig.go, line 48 at r20 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is this worth supporting? Could doing so mask very subtle bugs? Consider a log.Fatal in this case instead.

Done.


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 85 at r17 (raw file):

I assume there's a time-based lease at some level here or none of this would be fault-tolerant.

// Periodically check if the span config reconciliation job exists and start
// it if it doesn't.
timer := timeutil.NewTimer()
defer timer.Stop()
triggerJobCheck()
for {
    timer.Reset(checkReconciliationJobInterval.Get(&m.settings.SV))

It's simpler, the span config manager (runs for every sql pod) simply polls the jobs table periodically to make sure that there's at least one job that's running, and if not, creating it. Perhaps @arulajmani or @ajwerner can correct me if I'm wrong, but if we're holding pkg/jobs right in the spanconfig manager, and creating jobs in the way described above, I believe we already have guaranteed mutual exclusion. If not, I'd love to learn how -- it's a guarantee we're relying on in the Reconciler.


pkg/spanconfig/spanconfigtestutils/recorder.go, line 76 at r17 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Thanks, I think this turned out well. It's much easier to read now.

Thanks for the suggestion! I agree.


pkg/sql/tenant.go, line 127 at r20 (raw file):

Is that sufficient to avoid the race? If these are minor versions, do we have a guarantee that the fast node that runs both migrations isn't two minor versions ahead of the old node?

I believe it's sufficient. The migrations infrastructure provides two guarantees:

  • All cluster versions, and their associated migrations, are stepped through in order
  • Before bumping the cluster version X, all nodes will have already observed cluster version bump X-1. This includes newly added nodes, etc. See here.

These guarantees let you string together arbitrarily complex migrations.

Is it that the spanconfig table might not exist?

Given the SpanConfigurationsTable migration precedes both these migrations, the table is guaranteed to exist.

What's the hazard with removing this version check entirely?

I don't think there's much of a hazard -- the system table already exists (was baked into 21.2). That said, I think it's also fine to gate new functionality (writing of these records) with an explicit cluster version instead of allowing it to happen unconditionally if create_tenant is executed against a 22.1 node in a mixed-version cluster. It makes it slightly easier to reason about.


pkg/sql/tenant.go, line 128 at r23 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should this be return nil?

Done.


pkg/sql/tenant.go, line 440 at r23 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

What will happen here if the spanconfigs table hasn't been created yet?

That's not possible, see comment above about the guarantees provided by the migrations infrastructure.

Collaborator

@arulajmani arulajmani left a comment


Reviewed 11 of 41 files at r14, 69 of 69 files at r16, 34 of 69 files at r17, 13 of 58 files at r18, 35 of 35 files at r22, 1 of 8 files at r24.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @irfansharif, and @nvanbenschoten)


pkg/spanconfig/spanconfig.go, line 56 at r17 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Done, went with the unified interface. It does however add a wart -- this interface doesn't really apply to the Connector (just panics in the implementation). I'm happy to change it out to the usual factory pattern if you'd like.

Mild preference to switching this to using the usual factory pattern. Like you mentioned, there's the wart with the Connector and it'll also sidestep Nathan's concern elsewhere about chaining WithTxns.


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 121 at r24 (raw file):

// checkpoint. For changes to, say, RANGE DEFAULT, the RPC request proto is
// proportional to the number of schema objects.
func (r *Reconciler) Reconcile(

Did you consider having fullReconciler and incrementalReconciler implement the Reconciler interface and have the contents of the job do what this function does instead?


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 360 at r24 (raw file):

			}

			toDelete, toUpsert := r.storeWithKVContents.Apply(ctx, false /* dryrun */, updates...)

This means that the incrementalReconciler is single use, right? For example, you could've applied updates to the store but the RPC below could fail. I don't think this is a problem with the way things are structured currently, but maybe consider making this more explicit by either passing in the store to this function or making a factory for this struct.

@arulajmani
Collaborator

@irfansharif I'm not sure why, but reviewable isn't being great to me on this PR. Even for this second batch of comments looks like the line numbers are getting lost in the UI but you can see them in the comment. I don't have that many, but sorry for the trouble here.

Contributor Author

@irfansharif irfansharif left a comment


TFTR!

reviewable isn't being great to me on this PR [..] sorry for the trouble here

Nothing to be sorry about! Thanks for the review. Perhaps the native github reviews work better?

Mild preference to switching this to using the usual factory pattern.

I also like the factory pattern better, is it alright to do it in a follow-on? I even had it in an earlier revision; it's a very small change. I keep catching painful (only because of slow UI build times, so not actually that painful) skews because of the new cluster versions and I'd like to start enabling this stuff on master before everyone's gone for the holidays.

Did you consider having fullReconciler and incrementalReconciler implement the Reconciler interface and have the contents of the job do what this function does instead?

Not really, I think I prefer the current breakdown, especially with it being as decomposed as it is into the various stages. Part of the reason is that we're using the full reconciliation process to also cheaply construct a spanconfig.Store with KVAccessor contents. I also like keeping the job as dumb as it is, focused only on retries, and keeping more business logic here where it's easier to test/inject knobs/etc. See below.

This means that the incrementalReconciler is single use, right? For example, you could've applied updates to the store but the RPC below could fail.

Even the fullReconciler is. This is what originally motivated the dryrun option, thinking I'd do diffs := store.apply(..., dryrun=true); issue-rpcs(diffs); store.apply(..., dryrun=false), but it felt unnecessary in the end. In the happy case things will succeed. In the unhappy case, it's more than fine to retry at a higher level (span config job/manager), and preferable IMO because it keeps the retry/recovery logic out of the reconciliation code and at the level where it's going to exist in some form anyway (if the timestamps being watched are already GC-ed, for e.g., the jobs need to start reconciliation all over).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, and @nvanbenschoten)

Contributor

@nvb nvb left a comment


Reviewed 2 of 11 files at r25, 8 of 8 files at r26, 6 of 42 files at r27, 30 of 33 files at r28, 8 of 8 files at r29, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @irfansharif, and @nvanbenschoten)


pkg/spanconfig/spanconfigkvaccessor/kvaccessor.go, line 66 at r29 (raw file):

func (k *KVAccessor) WithTxn(ctx context.Context, txn *kv.Txn) spanconfig.KVAccessor {
	if k.optionalTxn != nil {
		log.Fatalf(ctx, "KVAccessor already scoped to txn (was .WithTxn(...) chained multiple times?")

nit: missing closing paren.


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 85 at r17 (raw file):

Previously, irfansharif (irfan sharif) wrote…

I assume there's a time-based lease at some level here or none of this would be fault-tolerant.

// Periodically check if the span config reconciliation job exists and start
// it if it doesn't.
timer := timeutil.NewTimer()
defer timer.Stop()
triggerJobCheck()
for {
timer.Reset(checkReconciliationJobInterval.Get(&m.settings.SV))

It's simpler: the span config manager (which runs on every SQL pod) simply polls the jobs table periodically to make sure that there's at least one such job running, and if not, creates it. Perhaps @arulajmani or @ajwerner can correct me if I'm wrong, but if we're holding pkg/jobs right in the spanconfig manager, and creating jobs in the way described above, I believe we already have guaranteed mutual exclusion. If not, I'd love to learn how -- it's a guarantee we're relying on in the Reconciler.

I think it might be helpful to ask the inverse question. What happens if the node running the job fails? Someone else must take over, right? Similarly, what happens if the node running the job hits a minute-long GC pause?


pkg/sql/tenant.go, line 127 at r20 (raw file):

Before bumping the cluster version X, all nodes will have already observed cluster version bump X-1. This includes newly added nodes, etc. See here.

Got it, thanks for the link. I was under the impression that this was only true of major version numbers, but that must be an outdated understanding.


pkg/sql/tenant.go, line 440 at r23 (raw file):

Previously, irfansharif (irfan sharif) wrote…

That's not possible, see comment above about the guarantees provided by the migrations infrastructure.

But we don't have a version gate in this code path. I must be missing something.

Contributor

@ajwerner ajwerner left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @irfansharif, and @nvanbenschoten)


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 85 at r17 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I think it might be helpful to ask the inverse question. What happens if the node running the job fails? Someone else must take over, right? Similarly, what happens if the node running the job hits a minute-long GC pause?

The only true claim of mutual exclusion we have is when we update the job checkpoint, and even then, only if we properly use the job object passed into the resumer. It does seem possible that under the right sort of scenario, the current job may fail or the current pod may lose its lease and then a new pod may start executing. Currently none of the job leasing code considers synchrony assumptions.

Here are the safeguards:

  • When a job instance goes to update its status, it ensures that the current lease is the lease under which it was resumed.
  • When an underlying sqlliveness session is observed to be expired by another instance, that session's record will be deleted.
  • Job registry objects will periodically revoke leases from jobs which are leased by sessions which are known to not be live.

The above mechanisms provide mutual exclusion for updates to the record itself.

Now, there's also the fact that sql pods will kill themselves if they think that they are expired. Sadly, that doesn't actually provide us with any real protection. Firstly, the process may not detect that it is expired until another node has already removed it; session expiration occurs in the MVCC time domain and not the clock time domain. Secondly, I don't think we're that principled about detecting the expiration promptly while making an outstanding request to extend the session. Lastly, the process of receiving from a timer and acting on it is not instantaneous.

So, if we want to make deep claims about mutual exclusion we'll need something deeper than what we've got.
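
To make the first safeguard concrete: it's effectively a compare-and-set against the claiming session. The sketch below is illustrative only -- it is not the actual pkg/jobs code, and the table/column names are assumptions:

    import (
        "context"
        "database/sql"
        "errors"
    )

    // updateJobStatus commits a status change only if the job record is still
    // claimed by the session under which the job was resumed.
    func updateJobStatus(
        ctx context.Context, tx *sql.Tx, jobID int64, claimSessionID []byte, newStatus string,
    ) error {
        res, err := tx.ExecContext(ctx,
            `UPDATE system.jobs SET status = $1 WHERE id = $2 AND claim_session_id = $3`,
            newStatus, jobID, claimSessionID,
        )
        if err != nil {
            return err
        }
        n, err := res.RowsAffected()
        if err != nil {
            return err
        }
        if n == 0 {
            // Our claim was revoked; another instance may have taken over.
            return errors.New("job lease lost")
        }
        return nil
    }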


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 87 at r17 (raw file):

Quoted 4 lines of code…
// [1]: #73399 proposes a new KV request type that would let us more easily
//      ascertain when a tenant's sql statements have been reconciled in KV,
//      which we could then use to only suspend pods after all outstanding work
//      ("reconciliation") has been done.

I may be misreading your note. The intention of the issue was not really to more easily ascertain when a tenant's sql statements had been reconciled, but rather to more rapidly trigger reconciliation. The reconciliation state itself will be represented by a single row in KV, so the caller can easily either watch that row or poll it to wait for reconciliation. The more pressing desire is to make reconciliation happen promptly after a write which causes changes. If we could do that, then it'd be reasonable to wait for reconciliation before returning to the client.

Contributor Author

@irfansharif irfansharif left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, and @nvanbenschoten)


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 85 at r17 (raw file):

Previously, ajwerner wrote…

The only true claim of mutual exclusion we have is when we update the job checkpoint, and even then, only if we properly use the job object passed into the resumer. It does seem possible that under the right sort of scenario, the current job may fail or the current pod may lose its lease and then a new pod may start executing. Currently none of the job leasing code considers synchrony assumptions.

Here are the safeguards:

  • When a job instance goes to update its status, it ensures that the current lease is the lease under which it was resumed.
  • When an underlying sqlliveness session is observed to be expired by another instance, that session's record will be deleted.
  • Job registry objects will periodically revoke leases from jobs which are leased by sessions which are known to not be live.

The above mechanisms provide mutual exclusion for updates to the record itself.

Now, there's also the fact that sql pods will kill themselves if they think that they are expired. Sadly, that doesn't actually provide us with any real protection. Firstly, the process may not detect that it is expired until another node has already removed it; session expiration occurs in the MVCC time domain and not the clock time domain. Secondly, I don't think we're that principled about detecting the expiration promptly while making an outstanding request to extend the session. Lastly, the process of receiving from a timer and acting on it is not instantaneous.

So, if we want to make deep claims about mutual exclusion we'll need something deeper than what we've got.

Oh! I was under the mistaken assumption that job resumption entailed touching the job record itself transactionally, and that lease expiration did too (don't ask me how; I see now that it doesn't make sense). Filed #73789 to make sure it's addressed pre-22.1 (+cc @arulajmani).


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 87 at r17 (raw file):

Previously, ajwerner wrote…
// [1]: #73399 proposes a new KV request type that would let us more easily
//      ascertain when a tenant's sql statements have been reconciled in KV,
//      which we could then use to only suspend pods after all outstanding work
//      ("reconciliation") has been done.

I may be misreading your note. The intention of the issue was not really to more easily ascertain when a tenant's sql statements had been reconciled, but rather to more rapidly trigger reconciliation. The reconciliation state itself will be represented by a single row in KV, so the caller can easily either watch that row or poll it to wait for reconciliation. The more pressing desire is to make reconciliation happen promptly after a write which causes changes. If we could do that, then it'd be reasonable to wait for reconciliation before returning to the client.

Got it, updated. Elsewhere @nvanbenschoten mentioned possibly wanting to make pod suspension a more explicit protocol in the database. A request like the one proposed in #73399 could help us there too: we could temporarily block all new SQL txns right as we're about to suspend and ensure we've reconciled, thus suspending only pods with no outstanding work.


pkg/sql/tenant.go, line 127 at r20 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Before bumping the cluster version X, all nodes will have already observed cluster version bump X-1. This includes newly added nodes, etc. See here.

Got it, thanks for the link. I was under the impression that this was only true of major version numbers, but that must be an outdated understanding.

Yea, it's something we codified with the long-running migrations project.


pkg/sql/tenant.go, line 440 at r23 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

But we don't have a version gate in this code path. I must be missing something.

Doh, didn't notice that was for the GCTenantRecord code path -- added.

@irfansharif
Contributor Author

(Bump.)

Contributor

@nvb nvb left a comment


:lgtm:

Reviewed 30 of 33 files at r31, 8 of 8 files at r32, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 85 at r17 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Oh! I was under the mistaken assumption that job resumption entailed touching the job record itself transactionally, and that lease expiration did too (don't ask me how; I see now that it doesn't make sense). Filed #73789 to make sure it's addressed pre-22.1 (+cc @arulajmani).

Thanks for filing. Let's continue the discussion over in #73789 so it doesn't hold up this PR any longer.

Collaborator

@arulajmani arulajmani left a comment


I also like the factory pattern better; is it alright to do it in a follow-on? I even had it in an earlier revision, and it's a very small change. I keep catching painful (only because of slow UI build times, so not actually that painful) skews because of the new cluster versions, and I'd like to start enabling this stuff on master before everyone's gone for the holidays.

Sounds good, fine by me!

Even the fullReconciler is. This is what originally motivated the dryrun option, thinking I'd do diffs := store.apply(..., dryrun=true); issue-rpcs(diffs); store.apply(..., dryrun=false), but it felt unnecessary in the end. In the happy case things will succeed. In the unhappy case, it's more than fine to retry at a higher level (span config job/manager), and preferable IMO because it keeps the retry/recovery logic out of the reconciliation code and at the level where it's going to exist in some form anyway (if the timestamps being watched are already GC-ed, for example, the job needs to start reconciliation all over).

I'm fine with retrying at a higher level, but we should add a comment on the {incremental,full}Reconcilers that these things are meant to be single use. Better yet, maybe we can be defensive about this in code by either using a factory to construct these or checking some state to ensure {incremental,full}Reconciler.reconcile is only ever called once.
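
For illustration, here's one lightweight way such a guard could look -- a sketch assuming a simple use-once flag is acceptable, not the actual implementation:

    import (
        "context"
        "errors"
        "sync/atomic"
    )

    type fullReconciler struct {
        used int32 // set atomically; guards against reuse
        // ... reconciliation state ...
    }

    func (f *fullReconciler) reconcile(ctx context.Context) error {
        if !atomic.CompareAndSwapInt32(&f.used, 0, 1) {
            return errors.New("fullReconciler is single-use; construct a new one")
        }
        // ... the actual full reconciliation pass ...
        return nil
    }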

For newly created tenants, we want to ensure hard splits on tenant
boundaries. The source of truth for split points in the span configs
subsystem is the contents of system.span_configurations. To ensure
hard splits, we insert a single key record at the tenant prefix. In a
future commit we'll introduce the spanconfig.Reconciler process, which
runs per tenant and governs all config entries under each tenant's
purview. This has implications for the initial record we're talking
about (it might get cleaned up, for example); we'll explore this in
tests for the Reconciler itself.

Creating a single key record is easy enough -- we could've written
directly to system.span_configurations. When a tenant is GC-ed
however, we need to clear out all tenant state transactionally. To that
end we plumb in a txn-scoped KVAccessor into the planner where
crdb_internal.destroy_tenant is executed. This lets us easily delete
all abandoned tenant span config records.
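
To sketch what that plumbing enables (the WithTxn method appears in this review; the UpdateSpanConfigEntries call and overall shape below are illustrative assumptions, not the exact code):

    import (
        "context"

        "github.com/cockroachdb/cockroach/pkg/kv"
        "github.com/cockroachdb/cockroach/pkg/roachpb"
        "github.com/cockroachdb/cockroach/pkg/spanconfig"
    )

    // destroyTenantSpanConfigs deletes all span config records under the
    // tenant's keyspace in the same transaction that destroys the tenant,
    // so the records can't be orphaned if the txn aborts midway.
    func destroyTenantSpanConfigs(
        ctx context.Context, txn *kv.Txn, accessor spanconfig.KVAccessor, tenantSpan roachpb.Span,
    ) error {
        scoped := accessor.WithTxn(ctx, txn)
        return scoped.UpdateSpanConfigEntries(ctx, []roachpb.Span{tenantSpan}, nil /* toUpsert */)
    }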

Note: we get rid of spanconfig.experimental_kvaccessor.enabled. Access
to the span configs infrastructure is already sufficiently gated through
the env var. Now that crdb_internal.create_tenant attempts to write
through the KVAccessor, it's cumbersome to have to enable the setting
manually in every multi-tenant test (increasingly the default) that
enables some part of the span configs infrastructure.

---

This commit also needs a migration -- for existing clusters with
secondary tenants, when upgrading we need to install this initial record
at the tenant prefix for all extant tenants (and make sure to continue
doing so for subsequent newly created tenants). This is to preserve the
hard-split-on-tenant-boundary invariant we wish to provide. It's
possible for an upgraded multi-tenant cluster to have dormant sql pods
that have never reconciled. If KV switches over to the span configs
subsystem, splitting only on the contents of system.span_configurations,
we'll fail to split on all tenant boundaries. To this end we introduce
clusterversion.SeedTenantSpanConfigs, which allows us to seed span
config data for secondary tenants. The associated migration seeds
entries for existing tenants.

Release note: None
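
Conceptually, the migration's seeding step looks something like the sketch below; the upsertRecord helper is hypothetical, and only the general shape (a single-key span at each tenant's prefix) is meant literally:

    import (
        "context"

        "github.com/cockroachdb/cockroach/pkg/keys"
        "github.com/cockroachdb/cockroach/pkg/roachpb"
    )

    // seedTenantSpanConfigs installs a single-key record at each extant
    // tenant's prefix so that once KV splits only on the contents of
    // system.span_configurations, it still splits on tenant boundaries.
    func seedTenantSpanConfigs(
        ctx context.Context, tenantIDs []roachpb.TenantID,
        upsertRecord func(context.Context, roachpb.Span) error, // hypothetical helper
    ) error {
        for _, id := range tenantIDs {
            prefix := keys.MakeTenantPrefix(id)
            span := roachpb.Span{Key: prefix, EndKey: prefix.Next()}
            // The record's presence, not its contents, is what guarantees
            // the hard split at the tenant boundary.
            if err := upsertRecord(ctx, span); err != nil {
                return err
            }
        }
        return nil
    }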
By using a type definition instead of a stand-alone type, we can reduce
the amount of code needed to convert from a SpanConfigEntry to an Update
(something we frequently do). It also makes the conversion less
error-prone. While here, we introduce two helpful constructors for the
two kinds of updates we're typically interested in -- additions and
deletions.

Release note: None
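
A minimal sketch of the shape this commit message describes -- the definitions follow the message above, but the exact signatures are assumptions:

    import "github.com/cockroachdb/cockroach/pkg/roachpb"

    // Update is a type definition over (not a struct wrapping)
    // roachpb.SpanConfigEntry, so converting between the two is an explicit,
    // zero-cost conversion rather than a field-by-field copy.
    type Update roachpb.SpanConfigEntry

    // Addition returns an update that upserts the given config over the span.
    func Addition(span roachpb.Span, conf roachpb.SpanConfig) Update {
        return Update{Span: span, Config: conf}
    }

    // Deletion returns an update that deletes whatever config is set over the
    // span; an empty config denotes deletion.
    func Deletion(span roachpb.Span) Update {
        return Update{Span: span}
    }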
@irfansharif
Contributor Author

but we should add a comment on the {incremental,full}Reconcilers that these things are meant to be single use.

Done. Thanks for all the reviews, this code was fun/exciting to write and iterate on -- looking forward to it flailing! I'm going to close #67679 and break out the remaining small items into individual quick-win issues.

bors r+

@craig
Contributor

craig bot commented Dec 15, 2021

Build succeeded:
