spanconfig,kvserver: teach the GC job to respect tenant PTS records#77726
spanconfig,kvserver: teach the GC job to respect tenant PTS records#77726craig[bot] merged 2 commits intocockroachdb:masterfrom
Conversation
|
First commit is from #77478 which im going to try to merge ASAP. |
695606d to
fb8c12e
Compare
396343d to
91563a5
Compare
|
friendly ping @arulajmani @ajwerner |
91563a5 to
a6d123a
Compare
arulajmani
left a comment
There was a problem hiding this comment.
Sorry for taking so long to review this. Breather week and what not. Mostly looks good!
Reviewed 1 of 1 files at r1, 5 of 16 files at r2, 16 of 16 files at r3, 15 of 16 files at r4, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @ajwerner, and @arulajmani)
-- commits, line 12 at r4:
nit: to respect protected timestamp records set by the system tenant that are protecting...
Also, could you add some words around why this is sane?
pkg/kv/kvserver/protectedts/ptutil/testutils.go, line 11 at r4 (raw file):
// licenses/APL.txt. package ptutil
Can we put this file in package protectedts?
pkg/kv/kvserver/protectedts/ptutil/testutils.go, line 29 at r4 (raw file):
// ensures a protection at the given protectionTimestamp exists for all the // supplied spans. func VerifyProtectionTimestampExistsOnSpans(
s/VerifyProtecteionTimestampExistsOnSpans/TestingVerifyProtectionTimestampExistsOnSpans/g
pkg/sql/gcjob/refresh_statuses.go, line 320 at r4 (raw file):
return nil }); err != nil { return isProtected, err
nit: return false, nil?
pkg/sql/gcjob/refresh_statuses.go, line 342 at r4 (raw file):
cfg := execCfg.SystemConfig.GetSystemConfig() zoneCfg, err := cfg.GetZoneConfigForObject(keys.MakeSQLCodec(roachpb.MakeTenantID(tenID)), 0) if err != nil {
Do we need to make this change that removes the fallback logic in this patch?
pkg/sql/gcjob/refresh_statuses.go, line 359 at r4 (raw file):
if isProtected { log.Infof(ctx, "GC TTL for dropped tenant %d has expired, but protected timestamp"+
nit: missing space?
pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 132 at r4 (raw file):
} func TestGCTenantJobWaitsForProtectedTimestamps(t *testing.T) {
nit: could you add a comment?
pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 135 at r4 (raw file):
defer leaktest.AfterTest(t)() defer log.Scope(t).Close(t) gcjob.SetSmallMaxGCIntervalForTest()
nit: defer gcjob.SetSmall...
pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 146 at r4 (raw file):
sqlDB := sqlutils.MakeSQLRunner(sqlDBRaw) sqlDB.Exec(t, "SET CLUSTER SETTING kv.closed_timestamp.target_duration = '100ms'")
Why do we need this for this test?
pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 163 at r4 (raw file):
} checkGCBlockedByPTS := func(sj *jobs.StartableJob, tenID uint64) {
Let's pass in a testing.T in here considering this is called from more than 1 subtests?
pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 170 at r4 (raw file):
log.WithFlattenedSensitiveData) if err != nil { t.Fatal(err)
nit: can we just require.NoError?
pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 172 at r4 (raw file):
t.Fatal(err) } if len(entries) > 0 {
I may be missing something, but shouldn't this be len(entries) == 0? Did you instead mean to wrap this reading from the log file and checking for pattern inside of a SucceedsSoon with a len(entries)==0 check?
pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 263 at r4 (raw file):
sqlDB.Exec(t, "ALTER RANGE tenants CONFIGURE ZONE USING gc.ttlseconds = 1;") ten, conn11 := serverutils.StartTenant(t, srv,
nit: should we call this con10 (referencing the tenant ID) instead?
pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 323 at r4 (raw file):
// TestGCTenant is lightweight test that tests the branching logic in Resume // depending on if the job is GC for tenant or tables/indexes. func TestGCResumer(t *testing.T) {
Did this test need to move to this file? IIUC, it doesn't need to be in the CCL package, does it? If so, can we please move it back.
pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 412 at r4 (raw file):
// TestGCTenant tests the `gcTenant` method responsible for clearing the // tenant's data. func TestGCTenant(t *testing.T) {
Should we move this test to the non-CCL package file? Separately, is this test buying us much?
adityamaru
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @ajwerner, and @arulajmani)
pkg/kv/kvserver/protectedts/ptutil/testutils.go, line 11 at r4 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Can we put this file in package
protectedts?
There was a dependency cycle that motivated me to put this in its own package.
pkg/kv/kvserver/protectedts/ptutil/testutils.go, line 29 at r4 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
s/VerifyProtecteionTimestampExistsOnSpans/TestingVerifyProtectionTimestampExistsOnSpans/g
Done.
pkg/sql/gcjob/refresh_statuses.go, line 342 at r4 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Do we need to make this change that removes the fallback logic in this patch?
good catch, this wasn't intended. Reverted the change.
Code quote:
log.Errorf(ctx, "zone config for tenants range: err = %+v", err)pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 150 at r4 (raw file):
execCfg := srv.ExecutorConfig().(sql.ExecutorConfig) jobRegistry := execCfg.JobRegistry defer srv.Stopper().Stop(ctx)
doesn't this speed up the test by speeding up reconciliation?
pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 163 at r4 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Let's pass in a
testing.Tin here considering this is called from more than 1 subtests?
Done.
pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 170 at r4 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: can we just require.NoError?
Changed to a succeeds soon so returning an error instead
pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 172 at r4 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
I may be missing something, but shouldn't this be
len(entries) == 0? Did you instead mean to wrap this reading from the log file and checking for pattern inside of aSucceedsSoonwith alen(entries)==0check?
Hmmm how did this happen, that's exactly what I thought I was doing lol. Fixed!
pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 263 at r4 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: should we call this con10 (referencing the tenant ID) instead?
Yup, got lost in the shuffle of tenant IDs.
pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 323 at r4 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Did this test need to move to this file? IIUC, it doesn't need to be in the CCL package, does it? If so, can we please move it back.
I found it slightly annoying that tests for GC tenant and job were all over the place, but yeah I agree this doesn't mean that it should lie in CCL. I've moved them to the gc_job_test.go file.
pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 412 at r4 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Should we move this test to the non-CCL package file? Separately, is this test buying us much?
Moved to gc_job_test.go, I think it's a nice unit test for gcTenant. The other tests seem to be more to do with the GC job evaluating a tenant GC-ability.
a6d123a to
abb2c7a
Compare
arulajmani
left a comment
There was a problem hiding this comment.
Reviewed 16 of 16 files at r5, 16 of 16 files at r6, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @adityamaru, @ajwerner, and @arulajmani)
pkg/kv/kvserver/protectedts/ptutil/testutils.go, line 11 at r4 (raw file):
Previously, adityamaru (Aditya Maru) wrote…
There was a dependency cycle that motivated me to put this in its own package.
Ack.
pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 150 at r4 (raw file):
Previously, adityamaru (Aditya Maru) wrote…
doesn't this speed up the test by speeding up reconciliation?
I don't think we're relying on reconciliation in this test, are we?
pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 412 at r4 (raw file):
Previously, adityamaru (Aditya Maru) wrote…
Moved to gc_job_test.go, I think it's a nice unit test for
gcTenant. The other tests seem to be more to do with the GC job evaluating a tenant GC-ability.
We can keep it if you like it!
pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 169 at r6 (raw file):
checkGCBlockedByPTS := func(t *testing.T, sj *jobs.StartableJob, tenID uint64) { testutils.SucceedsSoon(t, func() error { log.Flush()
Should this log.Flush be outside the succeeds soon block?
adityamaru
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @adityamaru, @ajwerner, and @arulajmani)
pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 150 at r4 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
I don't think we're relying on reconciliation in this test, are we?
The writing of pts records and the changes to range default GC TTL would be through reconciliation no?
pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 169 at r6 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Should this
log.Flushbe outside the succeeds soon block?
hmm the reason I had this inside the closure was that if we haven't seen a log entry in this iteration of the succeeds soon, then we want to flush all the logs to file and check again in the next iteration. TestRotateCerts also seems to do this but I'm not sure if this is wrong.
|
TFTR! If you have any follow-up comments I'll pick them up as a fast follow. bors r=arulajmani |
|
bors r=arulajmani |
|
Already running a review |
Release note: None Release justification: fixes for high-priority bugs in existing functionality
This change teaches the GC job responsible for ClearRange'ing a tenants data to respect protected timestamp records written by the system tenant covering the secondary tenant's keyspace. Specifically, once the tenants row data has expired, the GC job now checks for protected timestamp records that have been written by the system tenant targetting that particular tenant, or targetting its entire keyspace i.e. all secondary tenants. If the record is protecting a time less than the drop time of the tenant then the GC will be delayed. This is important because a system tenant may be backing up the secondary tenant or all secondary tenants while the secondary tenant has been dropped. We don't want the GC job to ClearRange the tenant's data until the backup that wrote the PTS record on the tenant's keyspace has completed (backed up the tenant data) and released the record. This change also consolidates all tenant GC related tests into one file. The only new test is `TestGCTenantJobWaitsForProtectedTimestamps`. Release note (enterprise change): Tenant GC job will now wait for protected timestamp records targetting the tenant, and with a protect time less than the tenant's drop time. Release justification: high-priority changes in existing functionality
abb2c7a to
e4f862c
Compare
|
Canceled. |
arulajmani
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @adityamaru, @ajwerner, and @arulajmani)
pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 150 at r4 (raw file):
Previously, adityamaru (Aditya Maru) wrote…
The writing of pts records and the changes to range default GC TTL would be through reconciliation no?
Yup, but the test doesn't read them from KV.
pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 169 at r6 (raw file):
Previously, adityamaru (Aditya Maru) wrote…
hmm the reason I had this inside the closure was that if we haven't seen a log entry in this iteration of the succeeds soon, then we want to flush all the logs to file and check again in the next iteration.
TestRotateCertsalso seems to do this but I'm not sure if this is wrong.
Never mind me, I was reading it as a "truncate" and not a flush. I think what you have is correct.
|
blathers backport 22.1 |
|
bors r=arulajmani |
|
Build succeeded: |
|
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating backport branch refs/heads/blathers/backport-release-22.1-77726: POST https://api.github.com/repos/cockroachdb/cockroach/git/refs: 422 Reference already exists [] Backport to branch 22.1.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
This change teaches the GC job responsible for ClearRange'ing a tenants
data to respect protected timestamp records that are protecting a tenants
data. Specifically, once the tenants row data has expired, the GC job now
checks for protected timestamp records that have been written by the system
tenant targetting that particular tenant, or targetting its entire keyspace
i.e. all secondary tenants. If the record is protecting a time less than the
drop time of the tenant then the GC will be delayed.
This change also consolidates all tenant GC related tests into one file. The only
new test is
TestGCTenantJobWaitsForProtectedTimestamps.Release note (enterprise change): Tenant GC job will now wait for protected
timestamp records targetting the tenant, and with a protect time less than
the tenant's drop time.
Release justification: high-priority changes in existing functionality
Fixes: #77239