Skip to content

spanconfig,kvserver: teach the GC job to respect tenant PTS records#77726

Merged
craig[bot] merged 2 commits intocockroachdb:masterfrom
adityamaru:gc-tenant-pts
Mar 24, 2022
Merged

spanconfig,kvserver: teach the GC job to respect tenant PTS records#77726
craig[bot] merged 2 commits intocockroachdb:masterfrom
adityamaru:gc-tenant-pts

Conversation

@adityamaru
Copy link
Copy Markdown
Contributor

@adityamaru adityamaru commented Mar 13, 2022

This change teaches the GC job responsible for ClearRange'ing a tenants
data to respect protected timestamp records that are protecting a tenants
data. Specifically, once the tenants row data has expired, the GC job now
checks for protected timestamp records that have been written by the system
tenant targetting that particular tenant, or targetting its entire keyspace
i.e. all secondary tenants. If the record is protecting a time less than the
drop time of the tenant then the GC will be delayed.

This change also consolidates all tenant GC related tests into one file. The only
new test is TestGCTenantJobWaitsForProtectedTimestamps.

Release note (enterprise change): Tenant GC job will now wait for protected
timestamp records targetting the tenant, and with a protect time less than
the tenant's drop time.

Release justification: high-priority changes in existing functionality

Fixes: #77239

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@adityamaru
Copy link
Copy Markdown
Contributor Author

First commit is from #77478 which im going to try to merge ASAP.

@adityamaru adityamaru marked this pull request as ready for review March 14, 2022 17:00
@adityamaru adityamaru requested review from a team as code owners March 14, 2022 17:00
@adityamaru adityamaru requested review from a team and removed request for a team March 14, 2022 17:00
@adityamaru adityamaru force-pushed the gc-tenant-pts branch 2 times, most recently from 396343d to 91563a5 Compare March 15, 2022 10:31
@adityamaru
Copy link
Copy Markdown
Contributor Author

friendly ping @arulajmani @ajwerner

Copy link
Copy Markdown
Collaborator

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for taking so long to review this. Breather week and what not. Mostly looks good!

Reviewed 1 of 1 files at r1, 5 of 16 files at r2, 16 of 16 files at r3, 15 of 16 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @ajwerner, and @arulajmani)


-- commits, line 12 at r4:
nit: to respect protected timestamp records set by the system tenant that are protecting...

Also, could you add some words around why this is sane?


pkg/kv/kvserver/protectedts/ptutil/testutils.go, line 11 at r4 (raw file):

// licenses/APL.txt.

package ptutil

Can we put this file in package protectedts?


pkg/kv/kvserver/protectedts/ptutil/testutils.go, line 29 at r4 (raw file):

// ensures a protection at the given protectionTimestamp exists for all the
// supplied spans.
func VerifyProtectionTimestampExistsOnSpans(

s/VerifyProtecteionTimestampExistsOnSpans/TestingVerifyProtectionTimestampExistsOnSpans/g


pkg/sql/gcjob/refresh_statuses.go, line 320 at r4 (raw file):

		return nil
	}); err != nil {
		return isProtected, err

nit: return false, nil?


pkg/sql/gcjob/refresh_statuses.go, line 342 at r4 (raw file):

	cfg := execCfg.SystemConfig.GetSystemConfig()
	zoneCfg, err := cfg.GetZoneConfigForObject(keys.MakeSQLCodec(roachpb.MakeTenantID(tenID)), 0)
	if err != nil {

Do we need to make this change that removes the fallback logic in this patch?


pkg/sql/gcjob/refresh_statuses.go, line 359 at r4 (raw file):

		if isProtected {
			log.Infof(ctx, "GC TTL for dropped tenant %d has expired, but protected timestamp"+

nit: missing space?


pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 132 at r4 (raw file):

}

func TestGCTenantJobWaitsForProtectedTimestamps(t *testing.T) {

nit: could you add a comment?


pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 135 at r4 (raw file):

	defer leaktest.AfterTest(t)()
	defer log.Scope(t).Close(t)
	gcjob.SetSmallMaxGCIntervalForTest()

nit: defer gcjob.SetSmall...


pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 146 at r4 (raw file):

	sqlDB := sqlutils.MakeSQLRunner(sqlDBRaw)

	sqlDB.Exec(t, "SET CLUSTER SETTING kv.closed_timestamp.target_duration = '100ms'")

Why do we need this for this test?


pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 163 at r4 (raw file):

	}

	checkGCBlockedByPTS := func(sj *jobs.StartableJob, tenID uint64) {

Let's pass in a testing.T in here considering this is called from more than 1 subtests?


pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 170 at r4 (raw file):

			log.WithFlattenedSensitiveData)
		if err != nil {
			t.Fatal(err)

nit: can we just require.NoError?


pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 172 at r4 (raw file):

			t.Fatal(err)
		}
		if len(entries) > 0 {

I may be missing something, but shouldn't this be len(entries) == 0? Did you instead mean to wrap this reading from the log file and checking for pattern inside of a SucceedsSoon with a len(entries)==0 check?


pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 263 at r4 (raw file):

		sqlDB.Exec(t, "ALTER RANGE tenants CONFIGURE ZONE USING gc.ttlseconds = 1;")

		ten, conn11 := serverutils.StartTenant(t, srv,

nit: should we call this con10 (referencing the tenant ID) instead?


pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 323 at r4 (raw file):

// TestGCTenant is lightweight test that tests the branching logic in Resume
// depending on if the job is GC for tenant or tables/indexes.
func TestGCResumer(t *testing.T) {

Did this test need to move to this file? IIUC, it doesn't need to be in the CCL package, does it? If so, can we please move it back.


pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 412 at r4 (raw file):

// TestGCTenant tests the `gcTenant` method responsible for clearing the
// tenant's data.
func TestGCTenant(t *testing.T) {

Should we move this test to the non-CCL package file? Separately, is this test buying us much?

Copy link
Copy Markdown
Contributor Author

@adityamaru adityamaru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @ajwerner, and @arulajmani)


pkg/kv/kvserver/protectedts/ptutil/testutils.go, line 11 at r4 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Can we put this file in package protectedts?

There was a dependency cycle that motivated me to put this in its own package.


pkg/kv/kvserver/protectedts/ptutil/testutils.go, line 29 at r4 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

s/VerifyProtecteionTimestampExistsOnSpans/TestingVerifyProtectionTimestampExistsOnSpans/g

Done.


pkg/sql/gcjob/refresh_statuses.go, line 342 at r4 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Do we need to make this change that removes the fallback logic in this patch?

good catch, this wasn't intended. Reverted the change.

Code quote:

log.Errorf(ctx, "zone config for tenants range: err = %+v", err)

pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 150 at r4 (raw file):

	execCfg := srv.ExecutorConfig().(sql.ExecutorConfig)
	jobRegistry := execCfg.JobRegistry
	defer srv.Stopper().Stop(ctx)

doesn't this speed up the test by speeding up reconciliation?


pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 163 at r4 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Let's pass in a testing.T in here considering this is called from more than 1 subtests?

Done.


pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 170 at r4 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

nit: can we just require.NoError?

Changed to a succeeds soon so returning an error instead


pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 172 at r4 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

I may be missing something, but shouldn't this be len(entries) == 0? Did you instead mean to wrap this reading from the log file and checking for pattern inside of a SucceedsSoon with a len(entries)==0 check?

Hmmm how did this happen, that's exactly what I thought I was doing lol. Fixed!


pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 263 at r4 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

nit: should we call this con10 (referencing the tenant ID) instead?

Yup, got lost in the shuffle of tenant IDs.


pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 323 at r4 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Did this test need to move to this file? IIUC, it doesn't need to be in the CCL package, does it? If so, can we please move it back.

I found it slightly annoying that tests for GC tenant and job were all over the place, but yeah I agree this doesn't mean that it should lie in CCL. I've moved them to the gc_job_test.go file.


pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 412 at r4 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Should we move this test to the non-CCL package file? Separately, is this test buying us much?

Moved to gc_job_test.go, I think it's a nice unit test for gcTenant. The other tests seem to be more to do with the GC job evaluating a tenant GC-ability.

Copy link
Copy Markdown
Collaborator

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 16 of 16 files at r5, 16 of 16 files at r6, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @adityamaru, @ajwerner, and @arulajmani)


pkg/kv/kvserver/protectedts/ptutil/testutils.go, line 11 at r4 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

There was a dependency cycle that motivated me to put this in its own package.

Ack.


pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 150 at r4 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

doesn't this speed up the test by speeding up reconciliation?

I don't think we're relying on reconciliation in this test, are we?


pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 412 at r4 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

Moved to gc_job_test.go, I think it's a nice unit test for gcTenant. The other tests seem to be more to do with the GC job evaluating a tenant GC-ability.

We can keep it if you like it!


pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 169 at r6 (raw file):

	checkGCBlockedByPTS := func(t *testing.T, sj *jobs.StartableJob, tenID uint64) {
		testutils.SucceedsSoon(t, func() error {
			log.Flush()

Should this log.Flush be outside the succeeds soon block?

Copy link
Copy Markdown
Contributor Author

@adityamaru adityamaru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @adityamaru, @ajwerner, and @arulajmani)


pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 150 at r4 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

I don't think we're relying on reconciliation in this test, are we?

The writing of pts records and the changes to range default GC TTL would be through reconciliation no?


pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 169 at r6 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Should this log.Flush be outside the succeeds soon block?

hmm the reason I had this inside the closure was that if we haven't seen a log entry in this iteration of the succeeds soon, then we want to flush all the logs to file and check again in the next iteration. TestRotateCerts also seems to do this but I'm not sure if this is wrong.

@adityamaru
Copy link
Copy Markdown
Contributor Author

TFTR! If you have any follow-up comments I'll pick them up as a fast follow.

bors r=arulajmani

@adityamaru
Copy link
Copy Markdown
Contributor Author

bors r=arulajmani

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Mar 22, 2022

Already running a review

Release note: None

Release justification: fixes for high-priority bugs in existing functionality
This change teaches the GC job responsible for ClearRange'ing a tenants
data to respect protected timestamp records written by the system tenant
covering the secondary tenant's keyspace. Specifically, once the tenants
row data has expired, the GC job now checks for protected timestamp records
that have been written by the system tenant targetting that particular
tenant, or targetting its entire keyspace i.e. all secondary tenants.
If the record is protecting a time less than the drop time of the tenant
then the GC will be delayed.

This is important because a system tenant may be backing up the secondary tenant
or all secondary tenants while the secondary tenant has been dropped. We don't want
the GC job to ClearRange the tenant's data until the backup that wrote the PTS record
on the tenant's keyspace has completed (backed up the tenant data) and released the record.

This change also consolidates all tenant GC related tests into one file. The only
new test is `TestGCTenantJobWaitsForProtectedTimestamps`.

Release note (enterprise change): Tenant GC job will now wait for protected
timestamp records targetting the tenant, and with a protect time less than
the tenant's drop time.

Release justification: high-priority changes in existing functionality
@craig
Copy link
Copy Markdown
Contributor

craig bot commented Mar 22, 2022

Canceled.

Copy link
Copy Markdown
Collaborator

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @adityamaru, @ajwerner, and @arulajmani)


pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 150 at r4 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

The writing of pts records and the changes to range default GC TTL would be through reconciliation no?

Yup, but the test doesn't read them from KV.


pkg/ccl/testccl/sqlccl/tenant_gc_test.go, line 169 at r6 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

hmm the reason I had this inside the closure was that if we haven't seen a log entry in this iteration of the succeeds soon, then we want to flush all the logs to file and check again in the next iteration. TestRotateCerts also seems to do this but I'm not sure if this is wrong.

Never mind me, I was reading it as a "truncate" and not a flush. I think what you have is correct.

@adityamaru
Copy link
Copy Markdown
Contributor Author

blathers backport 22.1

@adityamaru
Copy link
Copy Markdown
Contributor Author

bors r=arulajmani

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Mar 24, 2022

Build succeeded:

@craig craig bot merged commit 86af7ad into cockroachdb:master Mar 24, 2022
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Mar 24, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating backport branch refs/heads/blathers/backport-release-22.1-77726: POST https://api.github.com/repos/cockroachdb/cockroach/git/refs: 422 Reference already exists []

Backport to branch 22.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

gcjob: consult protected timestamps when deciding to GC tenants

3 participants