jobs: replace old node liveness job leases #51087
craig[bot] merged 17 commits into cockroachdb:master
Conversation
cd56a3b to a1deb9e
ajwerner
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @spaskob)
pkg/sql/sqlliveness/slinstance/slinstance.go, line 50 at r8 (raw file):
case <-l.stopper.ShouldStop():
    return
case <-time.After(l.hb):
use timeutil.NewTimer
pkg/sql/sqlliveness/slinstance/slinstance.go, line 57 at r8 (raw file):
}
// Extend the current live session.
if err := l.db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
Ideally you wouldn't be holding the mutex during the entire transaction.
pkg/sql/sqlliveness/slinstance/slinstance.go, line 65 at r8 (raw file):
sqlbase.InternalExecutorSessionDataOverride{User: security.RootUser},
`UPDATE system.sqlliveness SET expiration = $1 WHERE expiration > $2 AND session_id = $3 RETURNING session_id`,
I'm not sure that this expiration > $2 condition is necessary. That matters when removing expired sessions, but I'm less certain it matters when heartbeating. If a session expires in the woods and nobody is around, did it expire? :p Furthermore, if you avoid using that predicate you never need to delete the session. My feeling is that making the deletion of the session the property for its failure is a nice simplification.
But in all seriousness, commentary on the properties of the sessions somewhere seems good.
pkg/sql/sqlliveness/slinstance/slinstance.go, line 84 at r8 (raw file):
}
} else {
    l.mu.s.Exp = exp
I don't think you want to assign here because the function may get retried. I get that you're holding the mutex the whole time but I also am not super pleased about that.
pkg/sql/sqlliveness/slinstance/slinstance.go, line 88 at r8 (raw file):
return nil
}); err != nil {
    log.Error(ctx, err.Error())
Do you need to clear the session if you fail?
pkg/sql/sqlliveness/slinstance/slinstance.go, line 117 at r8 (raw file):
defer l.mu.Unlock()
now := l.db.Clock().Now()
if l.mu.s != nil && l.mu.s.Expiration().WallTime > now.WallTime {
I'm skeptical of this condition.
pkg/sql/sqlliveness/slinstance/slinstance.go, line 123 at r8 (raw file):
id := uuid.MakeV4().GetBytes()
exp := l.db.Clock().Now().Add(l.d.Nanoseconds(), 1)
if err := l.db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
my sense is that this lazy session creation makes the locking harder to follow. I can't think of why we'd have an Instance and not want it to have a session. I think you could simplify all of this by just having the loop create or update the session.
pkg/sql/sqlliveness/slsession/slsession.go, line 11 at r8 (raw file):
// licenses/APL.txt.
package slsession
This doesn't need to be in a subpackage, it should be an unexported struct in the slinstance package.
pkg/sql/sqlliveness/sltorage/slstorage.go, line 55 at r8 (raw file):
// If the expiration is corrupted it's better to treat session as expired.
} else {
    if s.db.Clock().Now().WallTime < exp.WallTime {
I again think that just using the expiration is tenuous because then we have to worry about clock synchronization. My feeling is that you should only return false for a session if there is no row. That, of course, leaves open the question of who deletes expired sessions (the below TODO).
spaskob
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @spaskob)
pkg/sql/sqlliveness/slinstance/slinstance.go, line 117 at r8 (raw file):
Previously, ajwerner wrote…
I'm skeptical of this condition.
So you prefer Session to be a simple getter for the session, with the heartbeat loop doing all the work?
pkg/sql/sqlliveness/slinstance/slinstance.go, line 123 at r8 (raw file):
Previously, ajwerner wrote…
my sense is that this lazy session creation makes the locking harder to follow. I can't think of why we'd have an Instance and not want it to have a session. I think you could simplify all of this by just having the loop create or update the session.
That sounds good. How would you deal with cases where you create an instance and it starts the heartbeat loop, but inserting the new session takes some time? Meanwhile you try to get a session with a call to Session; does the caller have to retry?
pkg/sql/sqlliveness/sltorage/slstorage.go, line 55 at r8 (raw file):
Previously, ajwerner wrote…
I again think that just using the expiration is tenuous because then we have to worry about clock synchronization. My feeling is that you should only return false for a session if there is no row. That, of course, leaves open the question of who deletes expired sessions (the below TODO).
That is tricky though, isn't it? If a node goes offline, another node has to declare the session expired; otherwise, until the offline node comes back, the claimed resources will not be released. Should we use some time offset constant that is guaranteed to be >= the max time skew?
ajwerner
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @spaskob)
pkg/sql/sqlliveness/slinstance/slinstance.go, line 117 at r8 (raw file):
Previously, spaskob (Spas Bojanov) wrote…
So you prefer Session to be a simple getter for the session, with the heartbeat loop doing all the work?
Yes though the comment was more regarding the in-memory check of the expiration
pkg/sql/sqlliveness/slinstance/slinstance.go, line 123 at r8 (raw file):
Previously, spaskob (Spas Bojanov) wrote…
That sounds good. How would you deal with cases where you create an instance and it starts the heartbeat loop, but inserting the new session takes some time? Meanwhile you try to get a session with a call to Session; does the caller have to retry?
The semantics of whether this function fails or blocks should be clear.
I might make both variants. To implement blocking I'd use a channel which is initialized at construction time and then atomically swapped out when the session fails.
pkg/sql/sqlliveness/sltorage/slstorage.go, line 55 at r8 (raw file):
Previously, spaskob (Spas Bojanov) wrote…
That is tricky though, isn't it? If a node goes offline, another node has to declare the session expired, otherwise until the off-line node comes back the claimed resources will not be released. Should we use some time offset constant that is guaranteed to be >= the max time skew?
My hope is that if other nodes are alive they will be deleting sessions. I think we need something to periodically delete expired sessions.
I really quite strongly do not want to get into the business of thinking about clock skew and we can avoid it with the deletion techniques.
spaskob
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @spaskob)
pkg/sql/sqlliveness/sltorage/slstorage.go, line 55 at r8 (raw file):
Previously, ajwerner wrote…
My hope is that if other nodes are alive they will be deleting sessions. I think we need something to periodically delete expired sessions.
I really quite strongly do not want to get into the business of thinking about clock skew and we can avoid it with the deletion techniques.
I see your idea: upon creation the Storage object will start a daemon that deletes other nodes' expired sessions without worrying about clock skew. The point is that the session will be deleted, and the SQL instance that owns the session, when serving calls to Session, will either have to look up whether the session id is still in system.sqlliveness or rely on the heartbeating loop to detect that the session was deleted. In the second case, there will be an interval (during the pause of the heartbeat loop) in which the session will be deleted but still not considered expired by the owning SQL instance. This does not sound good, so we may have to query system.sqlliveness on each call to Session and use the blocking idea with channels that you described in your comment above. WDYT?
ajwerner
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @spaskob)
pkg/sql/sqlliveness/sltorage/slstorage.go, line 55 at r8 (raw file):
or it will have to rely on the heartbeating loop to detect that the session was deleted.
This is the whole discussion at the end of the RFC. If a user of a session wants to guarantee that the session is still live there are a variety of options.
- The view of the session will carry the latest expiration which can be used to bound a transaction commit timestamp
- The protocol can be that you have mutual exclusion so long as your claim exists and you transactionally verify the claim exists (and only remove claims of sessions which you know to no longer exist).
- You can observe your session transactionally.
This all ends up not being a problem because transactions operate in a shared time domain.
This does not sound good, so we may have to query system.sqlliveness on each call to Session and use the blocking idea using channels that you described in your comment above. WDYT?
I do not think we need to query sessions except when heartbeating.
spaskob
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)
pkg/sql/sqlliveness/slinstance/slinstance.go, line 50 at r8 (raw file):
Previously, ajwerner wrote…
use timeutil.NewTimer
Done.
pkg/sql/sqlliveness/slinstance/slinstance.go, line 57 at r8 (raw file):
Previously, ajwerner wrote…
Ideally you wouldn't be holding the mutex during the entire transaction.
Done.
pkg/sql/sqlliveness/slinstance/slinstance.go, line 65 at r8 (raw file):
Previously, ajwerner wrote…
I'm not sure that this expiration > $2 condition is necessary. That matters when removing expired sessions, but I'm less certain it matters when heartbeating. If a session expires in the woods and nobody is around, did it expire? :p Furthermore, if you avoid using that predicate you never need to delete the session. My feeling is that making the deletion of the session the property for its failure is a nice simplification. But in all seriousness, commentary on the properties of the sessions somewhere seems good.
Done.
Good idea!
pkg/sql/sqlliveness/slinstance/slinstance.go, line 84 at r8 (raw file):
Previously, ajwerner wrote…
I don't think you want to assign here because the function may get retried. I get that you're holding the mutex the whole time but I also am not super pleased about that.
Done.
pkg/sql/sqlliveness/slinstance/slinstance.go, line 88 at r8 (raw file):
Previously, ajwerner wrote…
Do you need to clear the session if you fail?
That's a good question, and more generally: how should we treat failures in the heartbeat loop? We should probably raise a poison flag and kill the SQL instance; otherwise we may get blocked in a call to Session waiting on the heartbeat to extend or create a new session.
pkg/sql/sqlliveness/slsession/slsession.go, line 11 at r8 (raw file):
Previously, ajwerner wrote…
This doesn't need to be in a subpackage, it should be an unexported struct in the slinstance package.
Done.
pkg/sql/sqlliveness/sltorage/slstorage.go, line 55 at r8 (raw file):
Previously, ajwerner wrote…
or it will have to rely on the heartbeating loop to detect that the session was deleted.
This is the whole discussion at the end of the RFC. If a user of a session wants to guarantee that the session is still live there are a variety of options.
- The view of the session will carry the latest expiration which can be used to bound a transaction commit timestamp
- The protocol can be that you have mutual exclusion so long as your claim exists and you transactionally verify the claim exists (and only remove claims of sessions which you know to no longer exist
- You can observe your session transactionally.
This all ends up not being a problem because transactions operate in a shared time domain.
This does not sound good, so we may have to query system.sqlliveness on each call to Session and use the blocking idea using channels that you described in your comment above. WDYT?
I do not think we need to query sessions except when heartbeating.
Done.
spaskob
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)
pkg/sql/sqlliveness/sltorage/slstorage.go, line 55 at r8 (raw file):
Previously, spaskob (Spas Bojanov) wrote…
Done.
Should we have a separate daemon that deletes expired sessions?
ajwerner
left a comment
Nice, this is looking really good! This still needs the daemon to actually delete the expired sessions, right? Some unit testing around the sqlliveness subpackages would be good. As a minor comment, it might be nice to rename the deprecated methods and fields in the jobs registry, as I found it sort of unclear which methods had to do with the old system and which had to do with the new system.
Reviewed 7 of 14 files at r1, 6 of 13 files at r4, 1 of 2 files at r5, 2 of 6 files at r6, 2 of 6 files at r8, 10 of 11 files at r9.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @spaskob)
pkg/jobs/adopt.go, line 53 at r9 (raw file):
rows, err := r.ex.QueryEx(
    ctx, "select-running/reverting-jobs", nil,
    sqlbase.InternalExecutorSessionDataOverride{User: security.RootUser}, `
do we want this to be the NodeUser instead? I'm not super pleased that the RootUser has write access to the jobs table. Seems like a recipe for disaster.
pkg/jobs/adopt.go, line 100 at r9 (raw file):
row, err := r.ex.QueryRowEx(
    ctx, "get-job-row", nil,
    sqlbase.InternalExecutorSessionDataOverride{User: security.RootUser}, `
same NodeUser comment.
pkg/jobs/adopt.go, line 109 at r9 (raw file):
}
if row == nil {
    return errors.Errorf("job %d: epoch has changed, no longer eligible for resuming", jobID)
nit: consider returning the value of crdb_internal.sql_liveness_is_alive(claim_session_id) rather than filtering on it to disambiguate the row actually getting deleted from underneath you from the case where the session is not live.
pkg/jobs/registry.go, line 215 at r9 (raw file):
adoptionCh: make(chan struct{}),
sqlInstance: slinstance.NewSqlInstance(
    stopper, clock, db, ex, DefaultAdoptInterval+10*time.Duration(time.Second), time.Second,
nit: I'd take this as a dependency and then construct it in the server.
pkg/jobs/registry.go, line 499 at r9 (raw file):
// TODO(spaskob): think about this invariant, can this ever happen?
// Probably not, so we should fail hard in this case.
panic("old job!")
consider making this a log.Fatal and adding some more surrounding information about the job type.
pkg/jobs/registry.go, line 622 at r9 (raw file):
return
}
log.Infof(ctx, "Registry live claim (instance_id: %s, sid: %s).\n",
nit: lose the .\n and consider moving this to VEventf(ctx, 1
pkg/sql/sqlbase/system.go, line 398 at r9 (raw file):
keys.StatementDiagnosticsTableID: privilege.ReadWriteData,
keys.ScheduledJobsTableID: privilege.ReadWriteData,
keys.SqllivenessID: privilege.ReadWriteData,
I think we want this to be privilege.ReadData. I don't really want root mucking around in here.
pkg/sql/sqlliveness/slinstance/slinstance.go, line 11 at r9 (raw file):
// licenses/APL.txt.
package slinstance
nit: package level comment
pkg/sql/sqlliveness/slinstance/slinstance.go, line 34 at r9 (raw file):
Sid sqlliveness.SessionID
Exp hlc.Timestamp
nit: no reason to export these
pkg/sql/sqlliveness/slinstance/slinstance.go, line 43 at r9 (raw file):
func (s *session) Expiration() hlc.Timestamp { return s.Exp }

type SQLInstance struct {
nit: comment
pkg/sql/sqlliveness/slinstance/slinstance.go, line 83 at r9 (raw file):
func (l *SQLInstance) createSession(ctx context.Context) (*session, error) {
    id := uuid.MakeV4().GetBytes()
    exp := l.clock.Now().Add(l.ttl.Nanoseconds(), 1)
I don't think you need to add 1 to the logical field. Having a non-zero logical timestamp is pretty rare.
pkg/sql/sqlliveness/slinstance/slinstance.go, line 150 at r9 (raw file):
if err := l.extendSession(ctx); err != nil {
    l.clearSession()
    t.Reset(0)
My sense is that we want some sort of exponential backoff here or perhaps in extendSession
pkg/sql/sqlliveness/slinstance/slinstance.go, line 187 at r9 (raw file):
}
select {
This should also listen on ctx.Done()
pkg/sql/sqlliveness/sltorage/slstorage.go, line 55 at r8 (raw file):
Previously, spaskob (Spas Bojanov) wrote…
Should we have a separate daemon that deletes expired sessions?
Yes, I think we should. It can live in a separate package potentially.
3aa1572 to 24a17f6
spaskob
left a comment
Responded to feedback - will add tests as an additional change after we finalize the main discussion.
There are 2 open questions left:
- Where to store the settings for heartbeat interval and expiration ttl?
- How to get a stopper for the slstorage object that is needed for the is_alive builtin?
Reviewed 1 of 12 files at r10.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @spaskob)
pkg/jobs/adopt.go, line 53 at r9 (raw file):
Previously, ajwerner wrote…
do we want this to be the NodeUser instead? I'm not super pleased that the RootUser has write access to the jobs table. Seems like a recipe for disaster.
Done.
pkg/jobs/registry.go, line 499 at r9 (raw file):
Previously, ajwerner wrote…
consider making this a log.Fatal and adding some more surrounding information about the job type.
Done.
pkg/sql/sqlbase/system.go, line 398 at r9 (raw file):
Previously, ajwerner wrote…
I think we want this to be privilege.ReadData. I don't really want root mucking around in here.
The tests start timing out when I do that. Before I dig far, what is the rationale? We are supposed to write to and delete from this table, correct?
pkg/sql/sqlliveness/slinstance/slinstance.go, line 83 at r9 (raw file):
Previously, ajwerner wrote…
I don't think you need to add 1 to the logical field. Having a non-zero logical timestamp is pretty rare.
Done.
pkg/sql/sqlliveness/slinstance/slinstance.go, line 150 at r9 (raw file):
Previously, ajwerner wrote…
My sense is that we want some sort of exponential backoff here or perhaps in extendSession
Done.
pkg/sql/sqlliveness/slinstance/slinstance.go, line 187 at r9 (raw file):
Previously, ajwerner wrote…
This should also listen on ctx.Done()
Done.
pkg/sql/sqlliveness/sltorage/slstorage.go, line 55 at r8 (raw file):
Previously, ajwerner wrote…
Yes, I think we should. It can live in a separate package potentially.
Code added - what's the reasoning behind it being in a separate package?
Also I will need some help - I am not sure how to get a stopper in https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/conn_executor.go#L2010 which I will need to pass to the slstorage object.
ajwerner
left a comment
- How about creating some cluster settings?
- We've been discussing offline.
Another thought occurs to me: one thing that's been an observability problem with jobs is understanding how many times a job has been adopted. How do you feel about adding an integer column next to the claim in the jobs table which we increment each time we write a new claim into the row?
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @spaskob)
Added cluster settings for
spaskob
left a comment
Replaced all constants with cluster settings. Ready for review.
As for the adoption counter idea: if we decide to, we can do it in a follow-up PR. Currently we can deduce this counter from the logs, and we also store all resume errors in the job payload; the number of resume errors should be a good approximation.
Reviewed 3 of 12 files at r10, 1 of 6 files at r11.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @spaskob)
ajwerner
left a comment
Continuing to get better. Still looking for some unit testing on the implementations of the various sqlliveness interfaces. One area I'd like to focus on a bit before this merges, where you could ease some concerns I have, is the upgrade path: making sure we handle it gracefully with regard to jobs that are currently being run by the registry when the version upgrade flips.
Reviewed 1 of 11 files at r9, 2 of 13 files at r14, 3 of 6 files at r15, 1 of 2 files at r16, 4 of 6 files at r17, 5 of 5 files at r18.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @spaskob)
pkg/jobs/registry.go, line 476 at r18 (raw file):
defer r.mu.Unlock()
if _, alreadyRegistered := r.mu.adoptedJobs[*j.ID()]; alreadyRegistered {
    log.Fatalf(ctx, "job %d: was just created but foun din registered adopted jobs", *j.ID())
nit foun din
pkg/jobs/registry_external_test.go, line 87 at r18 (raw file):
jobs.LeniencySetting.Override(&s.ClusterSettings().SV, 0)
const cancelInterval = time.Duration(math.MaxInt64)
const adoptInterval = time.Nanosecond
under stressrace this can make things spin very fast. Consider time.Millisecond.
pkg/sql/sqlbase/system.go, line 398 at r9 (raw file):
Previously, spaskob (Spas Bojanov) wrote…
The tests start timing out when I do that, before I dig far - what is the rationale - we are supposed to write and delete from this table correct?
Well, node should have read-write access with ReadData, just root shouldn't. My guess is that something is writing to this table using the root user instead of the node user if it's timing out.
pkg/sql/sqlliveness/slinstance/slinstance.go, line 170 at r18 (raw file):
everySecond := log.Every(time.Second)
if s, _ := l.getSessionOrBlockCh(); s == nil {
    for r := retry.StartWithCtx(ctx, opts); r.Next(); {
retry has a feature where you can pass it a stopper that should make this cleaner. See the Closer field https://godoc.org/github.com/cockroachdb/cockroach/pkg/util/retry#Options
pkg/sql/sqlliveness/slinstance/slinstance.go, line 191 at r18 (raw file):
for r := retry.StartWithCtx(ctx, opts); r.Next(); {
    select {
    // Finish immediately in tests instead of waiting for the retry.;w
nit: ;w
pkg/sql/sqlliveness/slstorage/slstorage.go, line 88 at r18 (raw file):
for {
    t := timeutil.NewTimer()
    t.Reset(0)
I think you want this above the loop
pkg/sql/sqlliveness/slstorage/slstorage.go, line 96 at r18 (raw file):
case <-t.C:
    t.Read = true
    t.Reset(s.gcInterval())
generally I prefer resetting it after the operation completes. If the delete-sessions query gets slow this might spin continuously, which I don't think we want.
pkg/sql/sqlliveness/sltorage/slstorage.go, line 55 at r8 (raw file):
Previously, spaskob (Spas Bojanov) wrote…
Code added - what's the reasoning behind it being in a separate package?
Also I will need some help - I am not sure how to get a stopper in https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/conn_executor.go#L2010 which I will need to pass to the slstorage object.
It just feels orthogonal in functionality to the storage. I'm fine either way.
spaskob
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @spaskob)
pkg/jobs/registry_external_test.go, line 87 at r18 (raw file):
Previously, ajwerner wrote…
under stressrace this can make things spin very fast. Consider time.Millisecond.
Done.
pkg/sql/sqlbase/system.go, line 398 at r9 (raw file):
Previously, ajwerner wrote…
Well, node should have read-write access with ReadData, just root shouldn't. My guess is that something is writing to this table using the root user instead of the node user if it's timing out.
Most probably; the root user is used everywhere in the jobs package. I propose to create an issue to fix this and do it in a follow-up PR for the whole package.
pkg/sql/sqlliveness/slinstance/slinstance.go, line 170 at r18 (raw file):
Previously, ajwerner wrote…
retry has a feature where you can pass it a stopper that should make this cleaner. See the Closer field https://godoc.org/github.com/cockroachdb/cockroach/pkg/util/retry#Options
Done.
pkg/sql/sqlliveness/slstorage/slstorage.go, line 88 at r18 (raw file):
Previously, ajwerner wrote…
I think you want this above the loop
Done.
pkg/sql/sqlliveness/slstorage/slstorage.go, line 96 at r18 (raw file):
Previously, ajwerner wrote…
generally I prefer resetting it after the operation completes. If the delete-sessions query get slow this might spin continuously, which I don't think we want.
Done.
ajwerner
left a comment
I think you need to rebase/regenerate the cluster versions stuff.
Reviewed 5 of 12 files at r19.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @spaskob)
out, line 1 at r19 (raw file):
Running make with -j8
junk
pkg/jobs/deprecated.go, line 57 at r19 (raw file):
}

func (r *Registry) maybeAdoptJob(
nit: prefix with deprecated
pkg/jobs/deprecated.go, line 290 at r19 (raw file):
}

func (r *Registry) newLease() *jobspb.Lease {
nit: prefix with deprecated
pkg/jobs/deprecated.go, line 300 at r19 (raw file):
}

func (r *Registry) cancelAllLocked(ctx context.Context) {
nit: prefix with deprecated
pkg/jobs/deprecated.go, line 312 at r19 (raw file):
// killed and that no one else tries to resume it. This essentially works as a
// barrier that only one function can cross and try to resume the job.
func (r *Registry) register(jobID int64, cancel func()) error {
nit: prefix with deprecated
pkg/sql/sqlliveness/sqlliveness.go, line 63 at r19 (raw file):
Insert(context.Context, Session) error
Update(context.Context, Session) error
Commentary on the semantics. In particular, I suspect you're going to want a concrete error or boolean value from Update to indicate that the session has been removed so that you can stop trying to update it.
pkg/sql/sqlliveness/slinstance/slinstance.go, line 141 at r19 (raw file):
}
everySecond := log.Every(time.Second)
if s, _ := l.getSessionOrBlockCh(); s == nil {
Why not put the retry loops into createSession and extendSession? Feels more fitting there.
pkg/sql/sqlliveness/slinstance/slinstance.go, line 144 at r19 (raw file):
for r := retry.StartWithCtx(ctx, opts); r.Next(); {
    newSession, err := l.createSession(ctx)
    if err != nil {
if ctx.Err() != nil don't log and probably just return.
pkg/sql/sqlliveness/slinstance/slinstance.go, line 151 at r19 (raw file):
}
if log.V(2) {
    log.Infof(ctx, "Created new SQL liveness session %s", newSession.ID())
I suspect this should not have a V gate.
pkg/sql/sqlliveness/slinstance/slinstance.go, line 160 at r19 (raw file):
var err error
for r := retry.StartWithCtx(ctx, opts); r.Next(); {
    if err = l.extendSession(ctx); err != nil {
This seems too simplistic. In the case that somebody has removed your claim, you'll never succeed, and you should definitely go back around to create a new session. In the case that you don't know, you want to retry. In the case your context has been canceled, you want to return, which I guess is already the case though rather implicitly.
I'm not sure I understand the reason to have a max number of retries to extend. My sense is you should try to extend until you know that it won't ever succeed.
pkg/sql/sqlliveness/slinstance/slinstance.go, line 162 at r19 (raw file):
if err = l.extendSession(ctx); err != nil {
    if log.V(2) {
        log.Infof(ctx, "Could not extend SQL liveness session %s", s.ID())
I don't think this one needs a vmodule gate, it's a pretty unexpected thing. Maybe indicate which attempt number we're on.
pkg/sql/sqlliveness/slinstance/slinstance.go, line 168 at r19 (raw file):
break
}
if err != nil {
If ctx.Err() != nil don't log and probably just return.
pkg/sql/sqlliveness/slinstance/slinstance.go, line 179 at r19 (raw file):
}
}
t.Reset(l.hb())
This floating reset is hard to reason about given all the code above it. If more of the code were moved into helpers it might be clearer that this is the happy path.
pkg/sql/sqlliveness/slstorage/slstorage.go, line 107 at r19 (raw file):
return err
})
t.Reset(s.gcInterval())
sometimes to make it more explicit I lift this to just above the select and do:
if t.Read {
t.Reset(...)
}
d4d81a3 to e5111a3
Touches cockroachdb#47892. Add `claim_session_id` and `claim_instance_id` columns in a new column family. These two values specify the current claim by a job registry on this job row. The current live claims will live in a new table `system.sqlliveness` along with an expiration deadline which will need to be extended to keep the claims alive. Add SQL migration code to migrate old definitions to the new one and to create the new system table. Release note: none.
Release note: none.
See https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20200615_sql_liveness.md for background on the sql liveness implementation and on using sqlliveness to implement a new job leasing mechanism. The biggest difficulty was to ensure the correct migration of existing clusters with existing running or paused jobs. These need to be handled by the old job adoption and leasing implementation, while newly created jobs need to be handled by the new job leasing and adoption. There is a new job adoption function `claimAndProcessJobs` implemented using the new adoption API in `adopt.go`. The old adoption code is preserved and it will be triggered if and only if the cluster has not been finalized. This code will be deleted in release `20.2`. All jobs inherited from the previous version or created before the upgrade is finalized will be handled by the old code. When the cluster is finalized the new adoption code will kick in for newly created jobs. Old jobs that have not finished yet will be detected as having expired claims (column `claim_session_id` will have value NULL), will be canceled, and will eventually be resumed by the new adoption code. Release note: none.
We introduce a new SQL builtin function `crdb_internal.sql_liveness_is_alive` that returns true if the provided sqlliveness session id has not expired. Release note: none.
Release note: none.
Release note: none.
Release note: none.
Release note: none.
This test checks that jobs created before a backup are preserved by restore, but we are introducing a new column `claim_session_id` which is a `uuid` and needs to be excluded from the check that job rows are identical for the same job id. Also refactored the test to produce a more helpful error message via `require.Equal`. Release note: none.
The SQL instance is normally started at startup, but if an old cluster is upgraded it won't start, and when the cluster is finalized the new adoption logic will kick in without a running SQL instance, which will cause adoption to stall. Release note: none.
This was supposed to be deprecated in 20.2, but because there is some chance of 19.2 jobs not finishing in 20.1 clusters, we will deprecate in 21.0. Release note: none.
This caching, as discussed in the RFC proved to be critical when there are a large number of jobs to avoid a quadratic number of lookups. Having a large number of jobs can occur relatively commonly. Release note: None
Also augmented the test a bit. Release note: None
Release note: None
This commit does a few things: 1) Drops the `SQL` from `SQLInstance` as it merely introduced stutter in the name. 2) Made the implementation of SessionID into a string as these things are immutable and we would like to use them to key maps. 3) Split up the storage interface as only the reading portion was used externally. The use of the `Session` interface in the writing portion ended up just being cruft. 4) Add a `Provider` interface which represents the subsystem as a whole exported outwards and an implementation in `slprovider`. 5) Adds sanity to the starting of the sqlliveness subsystem. Before this commit, the various components were started multiple times in different places. This commit moves the starting to the sqlserver Start method. Release note: None
Release note: None
bors r+
Build succeeded
Touches #47892.
For full details see the sqlliveness RFC.
Add claim_session_id and claim_instance_id columns in a new column family.
These two values specify the current claim by a job registry on this job row.
The current live claims will live in a new table system.sqlliveness along with
an expiration deadline which will need to be extended to keep the claims alive.
Add sql migration code to migrate old definitions to the new one and for the new
system table.
The new job adoption logic is implemented in claimAndProcessJobs in
registry.go. It gets a live epoch, potentially steals jobs from other
registries and lays claim on them and finally it processes all claimed
jobs.
The main steps of claimAndProcessJobs are implemented in adopt.go,
and they deliberately circumvent the flawed update.go jobs API, opting
for direct interaction with the jobs table. I think their implementation
illustrates the simplicity of this approach.
The jobs currently claimed and resumed on the node are kept in a new map,
adoptedJobs, which tracks their epoch at the time of the claim
so that they can be easily canceled if the epoch is bumped.
Migration to 20.2 probably needs some more work but is also sketched here:
In mixed-version cluster we use the legacy adoption mechanism as
before.
Once the cluster is detected to be finalized, the old adoption does
not get triggered any more.
The new adoption kicks in, kills all legacy jobs and starts adopting
using the new epoch based mechanism.
Release note: none.