gcjob: teach GC job to respect protected timestamps for tables/indexes#78475
gcjob: teach GC job to respect protected timestamps for tables/indexes#78475craig[bot] merged 2 commits intocockroachdb:masterfrom
Conversation
494bf21 to
af0529d
Compare
af0529d to
16057be
Compare
adityamaru
left a comment
There was a problem hiding this comment.
Nice! Just some minor comments, will stamp after another round.
| // GetAllSystemSpanConfigsThatApply implements the spanconfig.KVAccessor | ||
| // interface. | ||
| func (c *Connector) GetAllSystemSpanConfigsThatApply( | ||
| ctx context.Context, ID roachpb.TenantID, |
pkg/roachpb/span_config.proto
Outdated
| // context of secondary tenants as it allows them a view into system span config | ||
| // state that applies over their ranges that has been set by the system tenant. | ||
| // | ||
| // Tenant are only allowed to request system span configs that apply over their |
pkg/roachpb/span_config.proto
Outdated
| // own ranges. System span configs set by the requesting tenant itself are also | ||
| // returned. | ||
| message GetAllSystemSpanConfigsThatApplyRequest { | ||
| // TenantID identifies over which all system span configurations that apply |
There was a problem hiding this comment.
nit: Are we missing a word here? TenantID identifies the tenant...
pkg/rpc/auth_tenant.go
Outdated
| return nil | ||
| } | ||
|
|
||
| // authGetAllSystemSPanConfigsThatApply authorizes the provided tenant to invoke |
There was a problem hiding this comment.
nit: s/authGetAllSystemSPanConfigsThatApply/authGetAllSystemSpanConfigsThatApply/g
pkg/sql/job_exec_context.go
Outdated
| User() security.SQLUsername | ||
| MigrationJobDeps() migration.JobDeps | ||
| SpanConfigReconciler() spanconfig.Reconciler | ||
| //SpanConfigKVAccessor() spanconfig.KVAccessor |
|
|
||
| testutils.RunTrueAndFalse(t, "secondary-tenant", func(t *testing.T, forSecondaryTenant bool) { | ||
| resetTestingKnob() | ||
| if !forSecondaryTenant { |
There was a problem hiding this comment.
leftover from some debugging?
|
|
||
| // Next, we'll remove the host set system span config on our secondary | ||
| // tenant. | ||
| r1, err = spanconfig.MakeRecord( |
There was a problem hiding this comment.
should this be r2 with 28, 35. Though I see we're deleting the whole target below so maybe this is just not needed?
|
|
||
| // Update PTS state that applies to the entire keyspace -- we only | ||
| // remove the timestamp before the drop time. We should expect GC to | ||
| // succeed on if we're not testing for the system tenant at this point. |
There was a problem hiding this comment.
This seems a little at odds with the outline at the top of the test. Isn't the expectation that removing the entire keyspace protection policy below the drop time, affects both the system and secondary tenants? i.e. allows GC on both? I think thats what we are checking below bu the last line of this comment is a little misleading.
16057be to
aa6cf47
Compare
arulajmani
left a comment
There was a problem hiding this comment.
Addressed all your comments, thanks for the quick review!
Looks like we're failing TestFullClusterBackup in CI. This seems to be because we've disabled the span config infra for that test (see: #75060), which means the KVAccessor we use to determine if tables can be GC-ed perennially errors out. As a result, we never GC, and fail the expectation that there are no running GC jobs in the test. I can shove in a testing knob to skip checking the new subsystem in the GC job. Are you on board with that?
Separately, @adityamaru, is this test something we want to de-flake and run with the new subsystem before the release goes out? Is this something Bulk is tracking?
Dismissed @adityamaru from 3 discussions.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @ajwerner, and @nvanbenschoten)
pkg/roachpb/span_config.proto, line 342 at r1 (raw file):
Previously, adityamaru (Aditya Maru) wrote…
nit: Are we missing a word here? TenantID identifies the tenant...
Few words actually; fixed.
pkg/sql/gcjob/refresh_statuses.go, line 315 at r2 (raw file):
Previously, adityamaru (Aditya Maru) wrote…
thank you 😋
Haha
pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 271 at r2 (raw file):
Previously, adityamaru (Aditya Maru) wrote…
leftover from some debugging?
Indeed!
pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 303 at r2 (raw file):
Previously, adityamaru (Aditya Maru) wrote…
missing require.NoError()?
Done.
pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 341 at r2 (raw file):
Previously, adityamaru (Aditya Maru) wrote…
This seems a little at odds with the outline at the top of the test. Isn't the expectation that removing the entire keyspace protection policy below the drop time, affects both the system and secondary tenants? i.e. allows GC on both? I think thats what we are checking below bu the last line of this comment is a little misleading.
Oops, wrongly worded comment. It should've said "we should expect GC to succeed if we're testing for the system tenant at this point", not the other way around. Fixed.
pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 356 at r2 (raw file):
Previously, adityamaru (Aditya Maru) wrote…
can we check that GC has still not succeeded if we are on a secondary tenant?
Done.
pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 359 at r2 (raw file):
Previously, adityamaru (Aditya Maru) wrote…
should this be r2 with 28, 35. Though I see we're deleting the whole target below so maybe this is just not needed?
I think I was just meant to delete this. I initially had grand plans for this integration test, but as you can see, it's super heavy weight.
aa6cf47 to
3870184
Compare
My impression was that this would be picked up when addressing the fallout from enabling the spanconfig infra, so afaik Bulk didn't really think we owned this. I had filed #70173 a while ago and I believe it still needs to be addressed. The two may be related.
Onboard with it to unblock this yep, maybe a TODO saying that we should get rid of the knob at some point. |
nvb
left a comment
There was a problem hiding this comment.
Reviewed 8 of 15 files at r1, 18 of 18 files at r3, 10 of 10 files at r4, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @ajwerner, and @arulajmani)
pkg/roachpb/span_config.proto, line 336 at r3 (raw file):
// configs that apply to a given tenant. This is only ever meaningful in the // context of secondary tenants as it allows them a view into system span config // state that applies over their ranges that has been set by the system tenant.
This might be a good place to remind readers what "apply" means in this context.
pkg/sql/gcjob/refresh_statuses.go, line 177 at r4 (raw file):
if err != nil { log.Infof(ctx, "error checking protection status %v", err) // Retry again later?
Was this a comment that you intended to address in this PR? Also, should ew use a more severe log level, like Warning or Error?
pkg/sql/gcjob/gc_protected_timestamp_test.go, line 72 at r4 (raw file):
setup func(t *testing.T, kvAccessor *manualKVAccessor, cache *manualCache) isProtected bool droppedAtTime int64
nit: the ordering and naming of fields in this struct doesn't make it very clear what's the input and what's the expected output. Consider switching around droppedAtTime and isProtected and renaming the latter to expectIsProtected.
3870184 to
9266774
Compare
arulajmani
left a comment
There was a problem hiding this comment.
Turns out I hadn't switched out the deprecatedIsProtected codepath for indexes -- I fixed that and added an index variant to the integration test I had for tables. PTAL!
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @ajwerner, and @nvanbenschoten)
pkg/roachpb/span_config.proto, line 336 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
This might be a good place to remind readers what "apply" means in this context.
Added.
pkg/sql/gcjob/refresh_statuses.go, line 177 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Was this a comment that you intended to address in this PR? Also, should ew use a more severe
loglevel, like Warning or Error?
Nice catch! There's not much we need to do here, other than return something sane, as the GC job will automatically be retried above in the call stack. I added a comment explaining what's happening and changed the log level to be Error.
pkg/sql/gcjob/gc_protected_timestamp_test.go, line 72 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: the ordering and naming of fields in this struct doesn't make it very clear what's the input and what's the expected output. Consider switching around
droppedAtTimeandisProtectedand renaming the latter toexpectIsProtected.
Done.
7b86d96 to
36752d1
Compare
This patch adds the `GetAllSystemSpanConfigsThatApply` RPC to the `Connector` and has the `spanconfig.KVAccessor` implement it. This RPC returns all system span configs that apply to a given tenant. This is only ever meaninful in the case of secondary tenants. It serves as a means for secondary tenants to get a view into the system span config state that applies over their ranges set by the system tenant. We'll make use of this RPC in an upcoming patch. We'll use it to make GC decisions when GC-ing dropped tables/indexes in the GC job. Release note: None
The GC job is responsible for issuing clear ranges for dropped tables and indexes. Previously, we weren't consulting the new PTS subsystem when deciding if a table or an index's range could be cleared. This patch fixes that. We check if a table/index span is protected after the GC TTL expires. We do so by issuing an RPC to KV to fetch all span configurations and system span configurations that apply over the spans. For secondary tenants, this also includes any system span configurations that the host tenant may have set over its ranges. Lastly, it's worth noting that we make no attempt to coordinate with concurrent protected timestamp writers. This is okay, given that protected timestamps make no claims about expired data. Closes cockroachdb#77156 Release note: None
36752d1 to
61380d3
Compare
|
TFTRs! bors r+ |
|
Build failed: |
|
Flaked, we go again. bors r+ |
|
Build succeeded: |
The GC job is responsible for issuing clear ranges for dropped tables
and indexes. Previously, we weren't consulting the new PTS subsystem
when deciding if a table or an index's range could be cleared. This
patch fixes that.
We check if a table/index span is protected after the GC TTL expires.
We do so by issuing an RPC to KV to fetch all span configurations and
system span configurations that apply over the spans. For secondary
tenants, this also includes any system span configurations that the
host tenant may have set over its ranges.
Lastly, it's worth noting that we make no attempt to coordinate with
concurrent protected timestamp writers. This is okay, given that
protected timestamps make no claims about expired data.
Closes #77156
Release note: None