Skip to content

sql: clean up protected timestamps automatically on errors#91135

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
fqazi:protectedTSFixes
Nov 11, 2022
Merged

sql: clean up protected timestamps automatically on errors#91135
craig[bot] merged 1 commit intocockroachdb:masterfrom
fqazi:protectedTSFixes

Conversation

@fqazi
Copy link
Copy Markdown
Collaborator

@fqazi fqazi commented Nov 2, 2022

Fixes: #90774

Previously, if a cockroach node was to crash, we would rely on the deleted protected timestamp clean up logic to remove ones installed for validation. This change allows, the clean up to happen during job fail / cancel, so that no delays occurs dealing with these protected timestamps.

Additionally, the entire logic to support adding these protected timestamps is now done in a dependency injection friendly manner,

Release note: None

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@fqazi fqazi force-pushed the protectedTSFixes branch 5 times, most recently from 9fd131e to c1eb34d Compare November 3, 2022 03:59
@fqazi fqazi marked this pull request as ready for review November 3, 2022 17:04
@fqazi fqazi requested review from a team as code owners November 3, 2022 17:04
@fqazi fqazi requested review from a team and benbardin and removed request for a team November 3, 2022 17:04
Copy link
Copy Markdown
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is close. My feedback is cosmetic.

Reviewed 2 of 15 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @benbardin and @fqazi)


pkg/jobs/jobspb/jobs.proto line 566 at r1 (raw file):

  // corresponding to this job.
  bytes protected_timestamp_record = 7 [
    (gogoproto.customname) = "ProtectedTimestampRecord",

nit: why the custom name?


pkg/jobs/jobspb/jobs.proto line 773 at r1 (raw file):

  // corresponding to this job.
  bytes protected_timestamp_record = 11 [
    (gogoproto.customname) = "ProtectedTimestampRecord",

nit: why the custom name?


pkg/jobs/jobsprotectedts/jobs_protected_ts_provider.go line 34 at r1 (raw file):

// timedProtectTimeStampGCPct wait a percentage of the GC time before
// creating a protected timestamp record.
const timedProtectTimeStampGCPct = 80

nit: make this constant .8 as opposed to 80 to avoid any chance of integer division bugs in the type system.


pkg/jobs/jobsprotectedts/jobs_protected_ts_provider.go line 94 at r1 (raw file):

// Protect implements scexec.ProtectedTimestampForJobProvider.
func (p *protectedTimestampForJobProvider) Protect(

is there a way we can make this take the *jobs.Job object itself? That provides additional safety.


pkg/jobs/jobsprotectedts/jobs_protected_ts_provider.go line 99 at r1 (raw file):

	tableDesc catalog.TableDescriptor,
	readAsOf hlc.Timestamp,
) scexec.ProtectedTimestampForJobCleaner {

nit: I think it's more elegant to just rename protectedTimestampForJobRecord to Manager (which will read jobprotectedts.Manager) and then have that thing implicitly implement the scexec interfaces rather than tightly coupling this package to scexec.


pkg/jobs/jobsprotectedts/jobs_protected_ts_provider.go line 99 at r1 (raw file):

	tableDesc catalog.TableDescriptor,
	readAsOf hlc.Timestamp,
) scexec.ProtectedTimestampForJobCleaner {

I think it's be totally fine to just have this thing return a function as opposed to an interface.


pkg/jobs/jobsprotectedts/jobs_protected_ts_provider.go line 105 at r1 (raw file):

		jobID:                    int64(jobID),
	}
	jr.waitGrp.GoCtx(func(ctx context.Context) error {

How do you feel about refactoring this a bit into methods for waiting for the time to expire and then a method to do the protection? The logic is a bit too composed right now for my taste.

Copy link
Copy Markdown
Collaborator Author

@fqazi fqazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @benbardin)


pkg/jobs/jobspb/jobs.proto line 773 at r1 (raw file):

Previously, ajwerner wrote…

nit: why the custom name?

Done.

The other ones in the jobs.proto package did the same, but its not needed.


pkg/jobs/jobsprotectedts/jobs_protected_ts_provider.go line 94 at r1 (raw file):

Previously, ajwerner wrote…

is there a way we can make this take the *jobs.Job object itself? That provides additional safety.

Done.

Involved pushing the current job in the declarative world, which I think is fine.


pkg/jobs/jobsprotectedts/jobs_protected_ts_provider.go line 105 at r1 (raw file):

Previously, ajwerner wrote…

How do you feel about refactoring this a bit into methods for waiting for the time to expire and then a method to do the protection? The logic is a bit too composed right now for my taste.

Done.

I modified it into two parts, a method called ProtectBeforeGC and Protect. The former will install it before a GC occurs on a timer, and the latter instantly installs it.

@fqazi fqazi force-pushed the protectedTSFixes branch 4 times, most recently from eb6e51d to 3166b44 Compare November 8, 2022 23:23
@fqazi fqazi requested a review from ajwerner November 9, 2022 00:08
Copy link
Copy Markdown
Collaborator

@msbutler msbutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a few questions, mostly for my own edification. I may use this infra in future work.

// transaction for a specific table, once a certain percentage of the GC time has
// elapsed. Returns a Cleaner function to cancel installation or remove the
// protected timestamp.
func (p *Manager) ProtectBeforeGC(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this function supposed to guarantee that it will issue a PTS before the next GC job on the table? Either in this function or in the commit message, could you explain the purpose of this function? I.e. why would the caller use this as oppose to the Protect function?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I clarified this with comments on ProtectBeforeGC:
//This method can be preferred in scenarios where the cost of installing
// a protected timestamp is more expensive relative to the typical length of an
// operation (for example multi-region)

and in the commit message:

A new jobsprotectedts.Manager object is introduced, which
can install timestamps synchronous or before the GC TTL
occurs. The asynchronous version ProtectedBeforeGC is
suited for validation where the cost of the protected
timestamp validation could be costly relative to validating
indexes. Whereas, the synchronous version is better suited
for cases where we know its an expensive operation and
a protected timestamp will always be needed.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this makes thing clearer! One last q, that I think could be clarified.

This function does not guarantee that a PTS will be written before the next gc run, right? It only guarantees that a PT will be issued in time.Duration(float64(time.Duration(zoneCfg.GC.TTLSeconds)*time.Second) * timedProtectTimeStampGCPct) seconds. I.e. if a GC is about to run, the PTS will get issued after this next GC job. IIUC, then perhaps this function should be renamed.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point and both of them have the property to update existing records inside them which needs to be documented too. I'm tempted to call it TryToProtectBeforeGC, and document that it's best effort bases. For validation, this is done early so it should be pretty effective unless the GC TTL is crazy small.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added for the TryToProtectBeforeGC:
// The approach here is taken is heuristic
// and can be considered a best-effort basis since the GC TTL could change or
// the caller may not invoke this early enough in the transaction.

Added for Protect:
// If an existing record is found, it will be updated with
// a new timestamp

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sweet! this looks good to me, though I'll wait for andrew to give this a stamp. The DR refactors look good too.

func (p *Manager) Protect(
ctx context.Context,
jobID jobspb.JobID,
tableDesc catalog.TableDescriptor,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side note: i might refactor this function to pass a pts Target instead of a tableDesc, so DR could use this infra in Restore and Import.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely, should try and get this adopted in other areas to streamline things, and the descriptor in this case was because it was natural on the schema change side.

@fqazi fqazi requested a review from msbutler November 9, 2022 20:00
Copy link
Copy Markdown
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: this is going to be handy. Let's let this bake on master for a bit before backporting it to 22.2

Reviewed 4 of 15 files at r1, 13 of 16 files at r2, 3 of 3 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @benbardin, @fqazi, and @msbutler)


pkg/jobs/jobsprotectedts/jobs_protected_ts_manager.go line 70 at r3 (raw file):

		return v.ProtectedTimestampRecord
	default:
		panic("not supported")

nit: panic an assertion failed error so it doesn't get redacted?


pkg/jobs/jobsprotectedts/jobs_protected_ts_manager.go line 74 at r3 (raw file):

}

// NewManager creates a new protected timestamp provider

nit: Manager?


pkg/jobs/jobsprotectedts/jobs_protected_ts_manager.go line 124 at r3 (raw file):

			return err
		}
		waitBeforeProtectedTS := time.Duration(float64(time.Duration(zoneCfg.GC.TTLSeconds)*time.Second) *

nit: time.Duration has a Seconds() method which returns a float64


pkg/jobs/jobsprotectedts/jobs_protected_ts_manager.go line 157 at r3 (raw file):

// transaction for a specific table immediately in a synchronous
// manner. If an existing record is found, it will be updated with
// a new timestamp. Returns a Cleaner  function to remove the protected timestamp,

uber-nit: extra space after Cleaner

Copy link
Copy Markdown
Collaborator Author

@fqazi fqazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @benbardin, and @msbutler)


pkg/jobs/jobsprotectedts/jobs_protected_ts_manager.go line 96 at r2 (raw file):

Previously, msbutler (Michael Butler) wrote…

sweet! this looks good to me, though I'll wait for andrew to give this a stamp. The DR refactors look good too.

Done.


pkg/jobs/jobsprotectedts/jobs_protected_ts_manager.go line 157 at r3 (raw file):

Previously, ajwerner wrote…

uber-nit: extra space after Cleaner

Done.

@fqazi
Copy link
Copy Markdown
Collaborator Author

fqazi commented Nov 11, 2022

@ajwerner @msbutler TFTR!

@fqazi
Copy link
Copy Markdown
Collaborator Author

fqazi commented Nov 11, 2022

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Nov 11, 2022

Build failed:

@fqazi
Copy link
Copy Markdown
Collaborator Author

fqazi commented Nov 11, 2022

bors r-

@fqazi
Copy link
Copy Markdown
Collaborator Author

fqazi commented Nov 11, 2022

bors r+

Fixes: cockroachdb#90774

Previously, if a cockroach node was to crash, we would
rely on the deleted protected timestamp clean up logic
to remove ones installed for validation. This change allows,
the clean up to happen during job fail / cancel, so that
no delays occurs dealing with these protected timestamps.

A new jobsprotectedts.Manager object is introduced, which
can install timestamps synchronous or before the GC TTL
occurs. The asynchronous version ProtectedBeforeGC is
suited for validation where the cost of the protected
timestamp validation could be costly relative to validating
indexes. Whereas, the synchronous version is better suited
for cases where we know its an expensive operation and
a protected timestamp will always be needed.

Additionally, the entire logic to support adding these
protected timestamps is now done in a dependency injection
friendly manner,

Release note: None
@craig
Copy link
Copy Markdown
Contributor

craig bot commented Nov 11, 2022

Canceled.

@fqazi
Copy link
Copy Markdown
Collaborator Author

fqazi commented Nov 11, 2022

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Nov 11, 2022

Build succeeded:

@craig craig bot merged commit 334774b into cockroachdb:master Nov 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

sql: clean up protected timestamp installation logic for backfills

4 participants