Skip to content

gcjob: teach GC job to respect protected timestamps for tables/indexes#78475

Merged
craig[bot] merged 2 commits intocockroachdb:masterfrom
arulajmani:protectedts-gcjob-rpcs
Mar 29, 2022
Merged

gcjob: teach GC job to respect protected timestamps for tables/indexes#78475
craig[bot] merged 2 commits intocockroachdb:masterfrom
arulajmani:protectedts-gcjob-rpcs

Conversation

@arulajmani
Copy link
Copy Markdown
Collaborator

The GC job is responsible for issuing clear ranges for dropped tables
and indexes. Previously, we weren't consulting the new PTS subsystem
when deciding if a table or an index's range could be cleared. This
patch fixes that.

We check if a table/index span is protected after the GC TTL expires.
We do so by issuing an RPC to KV to fetch all span configurations and
system span configurations that apply over the spans. For secondary
tenants, this also includes any system span configurations that the
host tenant may have set over its ranges.

Lastly, it's worth noting that we make no attempt to coordinate with
concurrent protected timestamp writers. This is okay, given that
protected timestamps make no claims about expired data.

Closes #77156

Release note: None

@arulajmani arulajmani requested review from a team as code owners March 25, 2022 04:47
@arulajmani arulajmani requested a review from a team March 25, 2022 04:47
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Contributor

@adityamaru adityamaru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Just some minor comments, will stamp after another round.

// GetAllSystemSpanConfigsThatApply implements the spanconfig.KVAccessor
// interface.
func (c *Connector) GetAllSystemSpanConfigsThatApply(
ctx context.Context, ID roachpb.TenantID,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/ID/id/g?

// context of secondary tenants as it allows them a view into system span config
// state that applies over their ranges that has been set by the system tenant.
//
// Tenant are only allowed to request system span configs that apply over their
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Tenants

// own ranges. System span configs set by the requesting tenant itself are also
// returned.
message GetAllSystemSpanConfigsThatApplyRequest {
// TenantID identifies over which all system span configurations that apply
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Are we missing a word here? TenantID identifies the tenant...

return nil
}

// authGetAllSystemSPanConfigsThatApply authorizes the provided tenant to invoke
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/authGetAllSystemSPanConfigsThatApply/authGetAllSystemSpanConfigsThatApply/g

User() security.SQLUsername
MigrationJobDeps() migration.JobDeps
SpanConfigReconciler() spanconfig.Reconciler
//SpanConfigKVAccessor() spanconfig.KVAccessor
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: detritus


testutils.RunTrueAndFalse(t, "secondary-tenant", func(t *testing.T, forSecondaryTenant bool) {
resetTestingKnob()
if !forSecondaryTenant {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leftover from some debugging?


// Next, we'll remove the host set system span config on our secondary
// tenant.
r1, err = spanconfig.MakeRecord(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be r2 with 28, 35. Though I see we're deleting the whole target below so maybe this is just not needed?


// Update PTS state that applies to the entire keyspace -- we only
// remove the timestamp before the drop time. We should expect GC to
// succeed on if we're not testing for the system tenant at this point.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a little at odds with the outline at the top of the test. Isn't the expectation that removing the entire keyspace protection policy below the drop time, affects both the system and secondary tenants? i.e. allows GC on both? I think thats what we are checking below bu the last line of this comment is a little misleading.

@adityamaru adityamaru self-requested a review March 25, 2022 12:42
@arulajmani arulajmani force-pushed the protectedts-gcjob-rpcs branch from 16057be to aa6cf47 Compare March 25, 2022 15:12
Copy link
Copy Markdown
Collaborator Author

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed all your comments, thanks for the quick review!

Looks like we're failing TestFullClusterBackup in CI. This seems to be because we've disabled the span config infra for that test (see: #75060), which means the KVAccessor we use to determine if tables can be GC-ed perennially errors out. As a result, we never GC, and fail the expectation that there are no running GC jobs in the test. I can shove in a testing knob to skip checking the new subsystem in the GC job. Are you on board with that?

Separately, @adityamaru, is this test something we want to de-flake and run with the new subsystem before the release goes out? Is this something Bulk is tracking?

Dismissed @adityamaru from 3 discussions.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @ajwerner, and @nvanbenschoten)


pkg/roachpb/span_config.proto, line 342 at r1 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

nit: Are we missing a word here? TenantID identifies the tenant...

Few words actually; fixed.


pkg/sql/gcjob/refresh_statuses.go, line 315 at r2 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

thank you 😋

Haha


pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 271 at r2 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

leftover from some debugging?

Indeed!


pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 303 at r2 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

missing require.NoError()?

Done.


pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 341 at r2 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

This seems a little at odds with the outline at the top of the test. Isn't the expectation that removing the entire keyspace protection policy below the drop time, affects both the system and secondary tenants? i.e. allows GC on both? I think thats what we are checking below bu the last line of this comment is a little misleading.

Oops, wrongly worded comment. It should've said "we should expect GC to succeed if we're testing for the system tenant at this point", not the other way around. Fixed.


pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 356 at r2 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

can we check that GC has still not succeeded if we are on a secondary tenant?

Done.


pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 359 at r2 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

should this be r2 with 28, 35. Though I see we're deleting the whole target below so maybe this is just not needed?

I think I was just meant to delete this. I initially had grand plans for this integration test, but as you can see, it's super heavy weight.

@arulajmani arulajmani force-pushed the protectedts-gcjob-rpcs branch from aa6cf47 to 3870184 Compare March 25, 2022 16:18
@arulajmani arulajmani requested a review from a team as a code owner March 25, 2022 16:18
@adityamaru
Copy link
Copy Markdown
Contributor

adityamaru commented Mar 25, 2022

Is this something Bulk is tracking?

My impression was that this would be picked up when addressing the fallout from enabling the spanconfig infra, so afaik Bulk didn't really think we owned this. I had filed #70173 a while ago and I believe it still needs to be addressed. The two may be related.

Are you on board with that?

Onboard with it to unblock this yep, maybe a TODO saying that we should get rid of the knob at some point.

Copy link
Copy Markdown
Contributor

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 8 of 15 files at r1, 18 of 18 files at r3, 10 of 10 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @ajwerner, and @arulajmani)


pkg/roachpb/span_config.proto, line 336 at r3 (raw file):

// configs that apply to a given tenant. This is only ever meaningful in the
// context of secondary tenants as it allows them a view into system span config
// state that applies over their ranges that has been set by the system tenant.

This might be a good place to remind readers what "apply" means in this context.


pkg/sql/gcjob/refresh_statuses.go, line 177 at r4 (raw file):

		if err != nil {
			log.Infof(ctx, "error checking protection status %v", err)
			// Retry again later?

Was this a comment that you intended to address in this PR? Also, should ew use a more severe log level, like Warning or Error?


pkg/sql/gcjob/gc_protected_timestamp_test.go, line 72 at r4 (raw file):

		setup         func(t *testing.T, kvAccessor *manualKVAccessor, cache *manualCache)
		isProtected   bool
		droppedAtTime int64

nit: the ordering and naming of fields in this struct doesn't make it very clear what's the input and what's the expected output. Consider switching around droppedAtTime and isProtected and renaming the latter to expectIsProtected.

@arulajmani arulajmani force-pushed the protectedts-gcjob-rpcs branch from 3870184 to 9266774 Compare March 28, 2022 03:58
@arulajmani arulajmani requested a review from a team March 28, 2022 03:58
Copy link
Copy Markdown
Collaborator Author

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out I hadn't switched out the deprecatedIsProtected codepath for indexes -- I fixed that and added an index variant to the integration test I had for tables. PTAL!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @ajwerner, and @nvanbenschoten)


pkg/roachpb/span_config.proto, line 336 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This might be a good place to remind readers what "apply" means in this context.

Added.


pkg/sql/gcjob/refresh_statuses.go, line 177 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Was this a comment that you intended to address in this PR? Also, should ew use a more severe log level, like Warning or Error?

Nice catch! There's not much we need to do here, other than return something sane, as the GC job will automatically be retried above in the call stack. I added a comment explaining what's happening and changed the log level to be Error.


pkg/sql/gcjob/gc_protected_timestamp_test.go, line 72 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: the ordering and naming of fields in this struct doesn't make it very clear what's the input and what's the expected output. Consider switching around droppedAtTime and isProtected and renaming the latter to expectIsProtected.

Done.

@arulajmani arulajmani requested a review from nvb March 28, 2022 04:00
@arulajmani arulajmani force-pushed the protectedts-gcjob-rpcs branch 2 times, most recently from 7b86d96 to 36752d1 Compare March 28, 2022 04:37
This patch adds the `GetAllSystemSpanConfigsThatApply` RPC to the
`Connector` and has the `spanconfig.KVAccessor` implement it. This RPC
returns all system span configs that apply to a given tenant. This
is only ever meaninful in the case of secondary tenants. It serves as a
means for secondary tenants to get a view into the system span config
state that applies over their ranges set by the system tenant.

We'll make use of this RPC in an upcoming patch. We'll use it to make
GC decisions when GC-ing dropped tables/indexes in the GC job.

Release note: None
The GC job is responsible for issuing clear ranges for dropped tables
and indexes. Previously, we weren't consulting the new PTS subsystem
when deciding if a table or an index's range could be cleared. This
patch fixes that.

We check if a table/index span is protected after the GC TTL expires.
We do so by issuing an RPC to KV to fetch all span configurations and
system span configurations that apply over the spans. For secondary
tenants, this also includes any system span configurations that the
host tenant may have set over its ranges.

Lastly, it's worth noting that we make no attempt to coordinate with
concurrent protected timestamp writers. This is okay, given that
protected timestamps make no claims about expired data.

Closes cockroachdb#77156

Release note: None
@arulajmani arulajmani force-pushed the protectedts-gcjob-rpcs branch from 36752d1 to 61380d3 Compare March 28, 2022 15:56
@arulajmani
Copy link
Copy Markdown
Collaborator Author

TFTRs!

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Mar 29, 2022

Build failed:

@arulajmani
Copy link
Copy Markdown
Collaborator Author

Flaked, we go again.

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Mar 29, 2022

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

gcjob: teach the GC job to respect protected timestamps on tables/indexes

4 participants