Skip to content

ptreconcile,server: rework ptreconciler for multi-tenant#75688

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
adityamaru:rework-reconciler
Feb 1, 2022
Merged

ptreconcile,server: rework ptreconciler for multi-tenant#75688
craig[bot] merged 1 commit intocockroachdb:masterfrom
adityamaru:rework-reconciler

Conversation

@adityamaru
Copy link
Copy Markdown
Contributor

The ptreconciler is responisble for periodically scanning the
system.pts_records table, and checking if there are any stale
records to be released, based on callbacks registerd on server
startup. Previously, the ptreconciler used the meta1leaseholder
to ensure that there was only one instance of it running in the
cluster. Additionally, it was reliant on the ptcache to iterate
over the records when checking whether they were stale.

In the multi-tenant version of the protected timestamp subsystem,
the SQL pod running the reconciler cannot use the meta1leaseholder
to determine whether or not it should run the reconciliation loop.
To get around this, we move the Start of the ptreconciler to the
Resume hook of the auto span config job. We are guaranteed via the
spanconfig manager, that there will always be atmost one instance of
this job in a cluster. Further, this is a forever running job, and so
we can tie the execution of the ptreconciler to the lifetime of the
spanconfig job resumer. Additionally, since we will be doing away
with the ptcache, we switchover to doing a full table scan every time
the reconciliation loop is run. While not ideal, this is not alarming
since we have a conservative limit on the total size of all records
that can be stored in the table, and reconciliation only runs once
every 5mins by default. Additionally, we do not expect many
concurrent BACKUP/CDC jobs to exist in the cluster at a given point in
time.

This change also refactors some of the server and tenant code to plumb
a ptreconciler to the ExecutorConfig, for use by the auto span config
job. We move the relevant job+schedule tests into a ccl pacakge to allow
testing from within a secondary tenant.

Informs: #73727

Release note: None

@adityamaru adityamaru requested review from a team as code owners January 29, 2022 03:01
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@shermanCRL
Copy link
Copy Markdown
Contributor

(Suggestion, can the explanation in this commit message be a comment closer to the code?)

Copy link
Copy Markdown
Contributor

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @ajwerner, and @arulajmani)


pkg/kv/kvserver/protectedts/ptreconcile/reconciler.go, line 34 at r1 (raw file):

// When set to zero - disables the report generation.
var ReconcileInterval = settings.RegisterDurationSetting(
	settings.TenantWritable,

why would this be tenant writable? Are we okay w/ tenant disabling this? Or setting it to 1ns?

@adityamaru
Copy link
Copy Markdown
Contributor Author

why would this be tenant writable? Are we okay w/ tenant disabling this? Or setting it to 1ns?

Hmm, good point. Every reconciliation will query the jobs/schedules table given a particular record, so setting this to 1ns might not be ideal? OTOH the transaction is very short lived in that it's a point query, and then a release of the record if at all. My hunch is that we don't really see this being a problem. Jobs are, or at least we hope they are more robust than before, and they clean up properly after themselves.

I was actually ignorant of these classes until you brought it up, so going by:

//  - Control settings relevant to tenant-specific internal implementation
//    should be TenantReadOnly.
//
//  - When in doubt, the first choice to consider should be TenantReadOnly.

my vote is we flip it to TenantReadOnly.

Copy link
Copy Markdown
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 20 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @miretskiy)


pkg/kv/kvserver/protectedts/ptreconcile/reconciler.go, line 34 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

why would this be tenant writable? Are we okay w/ tenant disabling this? Or setting it to 1ns?

I agree with tenant read-only


pkg/sql/exec_util.go, line 1203 at r1 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

This is carry forward from PTS v1, so I'm unsure of what the reasoning was when that was written. @ajwerner might remember, but my guess would be some sort of dep cycle.

Just didn't feel moved to abstract it into an interface because it was just a thing getting kicked off by the server that had no methods anybody used. Now that there are methods. Honestly, what are your thoughts on constructing this in the job rather than plumbing it down here? I think that upholds the same principle. I feel like for the spanconfig stuff, the plumbing is only justified by the weight of the dependency injection it'd take to construct such a thing in the job.

@adityamaru
Copy link
Copy Markdown
Contributor Author

(Suggestion, can the explanation in this commit message be a comment closer to the code?)

Is there any part in particular you suggest? I thought the interesting part was the fact that we use the auto span config job as our singleton for which I already have a short blurb here.

https://github.com/cockroachdb/cockroach/pull/75688/files#diff-4d440c27aeac8809854348841a6c37672ff7beb80308498c4ef35b668022e8e7R49

@adityamaru
Copy link
Copy Markdown
Contributor Author

Honestly, what are your thoughts on constructing this in the job

Good point, I think this is doable since execCtx has everything we need. The only piece I'm unsure about is how to add the corresponding reconciler metric struct, if we start init'ing the reconciler in the span config job Resume hook. We could plumb the metrics registry down, but is that any better than plumbing the one-time init'ed reconciler?

Copy link
Copy Markdown
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a sufficiently legit reason to keep constructing it in the server. Consider making an interface for the thing in protectedts?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @miretskiy)

Comment on lines +144 to +155
// Metrics encapsulates the metrics exported by the reconciler.
type Metrics struct {
ReconcilationRuns *metric.Counter
RecordsProcessed *metric.Counter
RecordsRemoved *metric.Counter
ReconciliationErrors *metric.Counter
}

var _ metric.Struct = (*Metrics)(nil)

// MetricStruct makes Metrics a metric.Struct.
func (m *Metrics) MetricStruct() {}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part doesn't need to move, the server still holds on to the concrete struct.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, moved it back and exported the constructor instead.

// configured StatusFunc.
StartReconciler(ctx context.Context, stopper *stop.Stopper) error
// Metrics returns the metrics exported by the reconciler.
Metrics() *Metrics
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need to export this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we want to subsume the Reconciler interface in the Provider interface, the initialization of the reconciler needs to happen https://github.com/cockroachdb/cockroach/pull/75688/files#diff-47ad9f8fa32a4e686099e2e89dad4beca96772073e042d746b771bc58ffb9f01R55. So we don't have a handle to the concrete type of the reconciler in server/tenant.go to grab its metrics struct. I removed this method and exported the constructor of the Metrics struct instead. This is then plumbed into ptprovider.New().

Copy link
Copy Markdown
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more round and I'll be happy

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @ajwerner, @arulajmani, and @miretskiy)


pkg/kv/kvserver/protectedts/protectedts.go, line 165 at r2 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

Since we want to subsume the Reconciler interface in the Provider interface, the initialization of the reconciler needs to happen https://github.com/cockroachdb/cockroach/pull/75688/files#diff-47ad9f8fa32a4e686099e2e89dad4beca96772073e042d746b771bc58ffb9f01R55. So we don't have a handle to the concrete type of the reconciler in server/tenant.go to grab its metrics struct. I removed this method and exported the constructor of the Metrics struct instead. This is then plumbed into ptprovider.New().

This is getting nit-picky, but I'd rather you just gave the Provider a Metrics() method which returned metric.MetricStruct as opposed to plumbing the metrics into the provider. That gives you a seamless way to add more metrics later in the provider without touching server code by potentially nesting the reconciler metric struct in some other metric struct later.

@adityamaru
Copy link
Copy Markdown
Contributor Author

as opposed to plumbing the metrics into the provider

I like it, thanks for the suggestion, done.

@adityamaru adityamaru requested a review from ajwerner January 31, 2022 19:48
Copy link
Copy Markdown
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @ajwerner, @arulajmani, and @miretskiy)


pkg/kv/kvserver/protectedts/ptprovider/provider.go, line 47 at r4 (raw file):

	protectedts.Cache
	protectedts.Reconciler
	metric.Struct

Heh one more, instead of this, can we have a method Metrics() that returns the metric.Struct and delegates to ptreconcile.Metrics()


pkg/kv/kvserver/protectedts/ptreconcile/metrics.go, line 27 at r4 (raw file):

func makeMetrics() *Metrics {
	return &Metrics{

I think you can now revert this


pkg/kv/kvserver/protectedts/ptreconcile/reconciler.go, line 86 at r4 (raw file):

// Metrics returns the reconciler's metrics.
func (r *Reconciler) Metrics() *Metrics {
	return r.metrics

just take the address like it had before

@adityamaru
Copy link
Copy Markdown
Contributor Author

adityamaru commented Jan 31, 2022

can we have a method Metrics() that returns the metric.Struct and delegates to ptreconcile.Metrics()

Hmm do you mean something like:

func (p *provider) Metrics() metric.Struct {
	return p.Reconciler.Metrics()
}

That would need us to expose Metrics in the Reconciler interface right?

@ajwerner
Copy link
Copy Markdown
Contributor

ajwerner commented Jan 31, 2022

That would need us to expose Metrics in the Reconciler interface right?

In my head the provider was embedding the implementations as opposed to the interfaces. I'm indifferent when it comes to copying the metric struct out or not. Throwing Metrics on the Reconciler interface is fine. Storing it, like you have, is also fine.

:lgtm:

The ptreconciler is responisble for periodically scanning the
`system.pts_records` table, and checking if there are any stale
records to be released, based on callbacks registerd on server
startup. Previously, the ptreconciler used the meta1leaseholder
to ensure that there was only one instance of it running in the
cluster. Additionally, it was reliant on the ptcache to iterate
over the records when checking whether they were stale.

In the multi-tenant version of the protected timestamp subsystem,
the SQL pod running the reconciler cannot use the meta1leaseholder
to determine whether or not it should run the reconciliation loop.
To get around this, we move the `Start` of the ptreconciler to the
Resume hook of the auto span config job. We are guaranteed via the
spanconfig manager, that there will always be atmost one instance of
this job in a cluster. Further, this is a forever running job, and so
we can tie the execution of the ptreconciler to the lifetime of the
spanconfig job resumer. Additionally, since we will be doing away
with the ptcache, we switchover to doing a full table scan every time
the reconciliation loop is run. While not ideal, this is not alarming
since we have a conservative limit on the total size of all records
that can be stored in the table, and reconciliation only runs once
every 5mins by default. Additionally, we do not expect many
concurrent BACKUP/CDC jobs to exist in the cluster at a given point in
time.

This change also refactors some of the server and tenant code to plumb
a ptreconciler to the ExecutorConfig, for use by the auto span config
job. We move the relevant job+schedule tests into a ccl pacakge to allow
testing from within a secondary tenant.

Informs: cockroachdb#73727

Release note: None
@adityamaru
Copy link
Copy Markdown
Contributor Author

I've seen this bazel timeout on other PRs, pretty sure it is unrelated. Started an internal thread to discuss, but going ahead with a merge here.

TFTR!
bors r=ajwerner,miretskiy

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Feb 1, 2022

Build failed (retrying...):

@adityamaru
Copy link
Copy Markdown
Contributor Author

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Feb 1, 2022

Already running a review

@adityamaru
Copy link
Copy Markdown
Contributor Author

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Feb 1, 2022

Already running a review

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Feb 1, 2022

Build failed (retrying...):

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Feb 1, 2022

Build succeeded:

@craig craig bot merged commit 65db9cf into cockroachdb:master Feb 1, 2022
@adityamaru adityamaru deleted the rework-reconciler branch February 1, 2022 13:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants