
jobs: replace old node liveness job leases#51087

Merged
craig[bot] merged 17 commits into cockroachdb:master from spaskob:sql-liveness
Aug 12, 2020

Conversation

@spaskob
Contributor

@spaskob spaskob commented Jul 7, 2020

Touches #47892.

For full details see the sqlliveness RFC.

Add claim_session_id and claim_instance_id columns in a new column family.
These two values specify the current claim by a job registry on this job row.
The current live claims will live in a new table system.sqlliveness along with
an expiration deadline which will need to be extended to keep the claims alive.

Add SQL migration code to migrate the old definitions to the new one and to
create the new system table.

The new job adoption logic is implemented in claimAndProcessJobs in
registry.go. It gets a live epoch, potentially steals jobs from other
registries by laying claim to them, and finally processes all claimed
jobs.

The main steps of claimAndProcessJobs are implemented in adopt.go
and deliberately circumvent the flawed update.go jobs API in favor
of direct interaction with the jobs table. I think their implementation
illustrates the simplicity of this approach.

The jobs currently claimed and resumed on this node are kept in a new map,
adoptedJobs, which records each job's epoch at the time of the claim
so that the jobs can be easily canceled if the epoch is bumped.
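For illustration, here is a toy sketch of that epoch-tracking map (the names and shapes are assumed, not the actual registry code): each claimed job remembers the epoch it was adopted at, so an epoch bump can cancel exactly the stale claims.

```go
package main

import "fmt"

// adoptedJob tracks the epoch under which a job was claimed, plus a
// cancel func to stop its resumer. (Hypothetical stand-in for the real
// registry bookkeeping.)
type adoptedJob struct {
	epoch  int64
	cancel func()
}

// registry is a toy stand-in for jobs.Registry.
type registry struct {
	adoptedJobs map[int64]*adoptedJob
}

// claim records a job as adopted at the given epoch.
func (r *registry) claim(jobID, epoch int64, cancel func()) {
	r.adoptedJobs[jobID] = &adoptedJob{epoch: epoch, cancel: cancel}
}

// onEpochBump cancels and forgets every job claimed under an older
// epoch, returning how many were canceled.
func (r *registry) onEpochBump(newEpoch int64) int {
	canceled := 0
	for id, aj := range r.adoptedJobs {
		if aj.epoch < newEpoch {
			aj.cancel()
			delete(r.adoptedJobs, id)
			canceled++
		}
	}
	return canceled
}

func main() {
	r := &registry{adoptedJobs: map[int64]*adoptedJob{}}
	r.claim(1, 1, func() { fmt.Println("job 1 canceled") })
	r.claim(2, 2, func() { fmt.Println("job 2 canceled") })
	fmt.Println("canceled:", r.onEpochBump(2))
}
```

Because the epoch is captured per claim, a bump only disturbs jobs adopted under older epochs; jobs claimed at the current epoch keep running.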

Migration to 20.2 probably needs some more work but is also sketched here:

  • In a mixed-version cluster we use the legacy adoption mechanism as before.
  • Once the cluster is detected to be finalized, the old adoption does not get triggered any more.
  • The new adoption kicks in, kills all legacy jobs, and starts adopting using the new epoch-based mechanism.
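The hand-off above might be sketched like this (all names here are hypothetical; in the real code the gate would be a cluster-version check):

```go
package main

import "fmt"

// adoptionLoop picks the legacy or epoch-based adoption path based on a
// version gate. isFinalized stands in for a cluster-version check; the
// three funcs stand in for the legacy adoption, legacy-job cancellation,
// and epoch-based adoption steps.
func adoptionLoop(isFinalized func() bool, legacyAdopt, cancelLegacy, epochAdopt func()) {
	if !isFinalized() {
		// Mixed-version cluster: keep the old lease-based adoption.
		legacyAdopt()
		return
	}
	// Finalized cluster: stop legacy jobs, then adopt via epochs.
	cancelLegacy()
	epochAdopt()
}

func main() {
	finalized := false
	run := func() {
		adoptionLoop(
			func() bool { return finalized },
			func() { fmt.Println("legacy adoption") },
			func() { fmt.Println("canceling legacy jobs") },
			func() { fmt.Println("epoch-based adoption") },
		)
	}
	run()            // mixed-version: legacy path
	finalized = true // upgrade finalizes
	run()            // new path: cancel legacy jobs, adopt via epochs
}
```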

Release note: none.

@spaskob spaskob requested a review from ajwerner July 7, 2020 20:01
@cockroach-teamcity
Member

This change is Reviewable

@spaskob spaskob force-pushed the sql-liveness branch 3 times, most recently from cd56a3b to a1deb9e Compare July 9, 2020 13:23
@spaskob spaskob changed the title sqlbase: new columns in system.jobs and sqlliveness table jobs: replace old node liveness job leases Jul 9, 2020
Contributor

@ajwerner ajwerner left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @spaskob)


pkg/sql/sqlliveness/slinstance/slinstance.go, line 50 at r8 (raw file):

		case <-l.stopper.ShouldStop():
			return
		case <-time.After(l.hb):

use timeutil.NewTimer


pkg/sql/sqlliveness/slinstance/slinstance.go, line 57 at r8 (raw file):

			}
			// Extend the current live session.
			if err := l.db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {

Ideally you wouldn't be holding the mutex during the entire transaction.


pkg/sql/sqlliveness/slinstance/slinstance.go, line 65 at r8 (raw file):

					sqlbase.InternalExecutorSessionDataOverride{User: security.RootUser}, `
UPDATE system.sqlliveness SET expiration = $1
WHERE expiration > $2 AND session_id = $3 RETURNING session_id`,

I'm not sure that this expiration > $2 condition is necessary. That matters when removing expired sessions, but I'm less certain it matters when heartbeating. If a session expires in the woods and nobody is around, did it expire? :p Furthermore, if you avoid using that predicate you never need to delete the session. My feeling is that making the deletion of the session the criterion for its failure is a nice simplification.

But in all seriousness, commentary on the properties of the sessions somewhere seems good.


pkg/sql/sqlliveness/slinstance/slinstance.go, line 84 at r8 (raw file):

					}
				} else {
					l.mu.s.Exp = exp

I don't think you want to assign here because the function may get retried. I get that you're holding the mutex the whole time but I also am not super pleased about that.


pkg/sql/sqlliveness/slinstance/slinstance.go, line 88 at r8 (raw file):

				return nil
			}); err != nil {
				log.Error(ctx, err.Error())

Do you need to clear the session if you fail?


pkg/sql/sqlliveness/slinstance/slinstance.go, line 117 at r8 (raw file):

	defer l.mu.Unlock()
	now := l.db.Clock().Now()
	if l.mu.s != nil && l.mu.s.Expiration().WallTime > now.WallTime {

I'm skeptical of this condition.


pkg/sql/sqlliveness/slinstance/slinstance.go, line 123 at r8 (raw file):

	id := uuid.MakeV4().GetBytes()
	exp := l.db.Clock().Now().Add(l.d.Nanoseconds(), 1)
	if err := l.db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {

my sense is that this lazy session creation makes the locking harder to follow. I can't think of why we'd have an Instance and not want it to have a session. I think you could simplify all of this by just having the loop create or update the session.
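The suggested shape, sketched with assumed helper names (not the eventual slinstance code): the loop owns the session, creating one if absent and extending it otherwise, so Session never has to create anything lazily.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// heartbeatStep performs one loop iteration: create a session if there
// is none, otherwise extend the existing one. Returns which action ran.
func heartbeatStep(haveSession bool, create, extend func() error) (action string, err error) {
	if !haveSession {
		return "create", create()
	}
	return "extend", extend()
}

// heartbeatLoop owns the session lifecycle, so a Session getter can stay
// a simple read and no caller ever creates a session lazily.
func heartbeatLoop(ctx context.Context, hb time.Duration,
	haveSession func() bool, create, extend func() error) {
	t := time.NewTicker(hb)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			if action, err := heartbeatStep(haveSession(), create, extend); err != nil {
				fmt.Printf("heartbeat %s failed: %v\n", action, err)
			}
		}
	}
}

func main() {
	// First iteration with no session: the loop creates one.
	action, _ := heartbeatStep(false,
		func() error { fmt.Println("creating session"); return nil },
		func() error { return nil })
	fmt.Println("first iteration action:", action)
}
```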


pkg/sql/sqlliveness/slsession/slsession.go, line 11 at r8 (raw file):

// licenses/APL.txt.

package slsession

This doesn't need to be in a subpackage, it should be an unexported struct in the slinstance package.


pkg/sql/sqlliveness/sltorage/slstorage.go, line 55 at r8 (raw file):

		// If the expiration is corrupted it's better to treat session as expired.
	} else {
		if s.db.Clock().Now().WallTime < exp.WallTime {

I again think that just using the expiration is tenuous because then we have to worry about clock synchronization. My feeling is that you should only return false for a session if there is no row. That, of course, leaves open the question of who deletes expired sessions (the below TODO).

Contributor Author

@spaskob spaskob left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @spaskob)


pkg/sql/sqlliveness/slinstance/slinstance.go, line 117 at r8 (raw file):

Previously, ajwerner wrote…

I'm skeptical of this condition.

So you'd prefer Session to be a simple getter for the session, and have the heartbeat loop do all the work?


pkg/sql/sqlliveness/slinstance/slinstance.go, line 123 at r8 (raw file):

Previously, ajwerner wrote…

my sense is that this lazy session creation makes the locking harder to follow. I can't think of why we'd have an Instance and not want it to have a session. I think you could simplify all of this by just having the loop create or update the session.

That sounds good. How would you deal with the case where you create an instance and it starts the heartbeat loop, but inserting the new session takes some time? Meanwhile, if you try to get a session with a call to Session, does the caller have to retry?


pkg/sql/sqlliveness/sltorage/slstorage.go, line 55 at r8 (raw file):

Previously, ajwerner wrote…

I again think that just using the expiration is tenuous because then we have to worry about clock synchronization. My feeling is that you should only return false for a session if there is no row. That, of course, leaves open the question of who deletes expired sessions (the below TODO).

That is tricky though, isn't it? If a node goes offline, another node has to declare the session expired, otherwise until the off-line node comes back the claimed resources will not be released. Should we use some time offset constant that is guaranteed to be >= the max time skew?

Contributor

@ajwerner ajwerner left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @spaskob)


pkg/sql/sqlliveness/slinstance/slinstance.go, line 117 at r8 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

So you'd prefer Session to be a simple getter for the session, and have the heartbeat loop do all the work?

Yes though the comment was more regarding the in-memory check of the expiration


pkg/sql/sqlliveness/slinstance/slinstance.go, line 123 at r8 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

That sounds good. How would you deal with the case where you create an instance and it starts the heartbeat loop, but inserting the new session takes some time? Meanwhile, if you try to get a session with a call to Session, does the caller have to retry?

The semantics of whether this function fails or blocks should be clear.

I might make both variants. To implement blocking I'd use a channel that is first initialized at construction time and then atomically swapped out when the session fails.
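One way to sketch the blocking variant described here (assumed shape, not the eventual implementation): a holder keeps a channel that is closed once a session exists and swapped for a fresh one when the session fails, so waiters block across failures.

```go
package main

import (
	"fmt"
	"sync"
)

// sessionHolder blocks readers until a session is available.
type sessionHolder struct {
	mu      sync.Mutex
	session string        // empty means no live session
	ready   chan struct{} // closed once session is set
}

func newSessionHolder() *sessionHolder {
	return &sessionHolder{ready: make(chan struct{})}
}

// set publishes a new session and unblocks waiters. It must only be
// called when no session is set (i.e. at start or after clear).
func (h *sessionHolder) set(s string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.session = s
	close(h.ready)
}

// clear drops the failed session and swaps in a fresh channel so that
// future callers block until the heartbeat loop creates a replacement.
func (h *sessionHolder) clear() {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.session = ""
	h.ready = make(chan struct{})
}

// wait returns the current session, blocking until one exists.
func (h *sessionHolder) wait() string {
	for {
		h.mu.Lock()
		s, ready := h.session, h.ready
		h.mu.Unlock()
		if s != "" {
			return s
		}
		<-ready
	}
}

func main() {
	h := newSessionHolder()
	done := make(chan string)
	go func() { done <- h.wait() }() // blocks until a session exists
	h.set("session-1")
	fmt.Println("got", <-done)
}
```

The atomic swap is the key move: wait re-reads both the session and the channel under the mutex on every pass, so a session that fails between the read and the receive is caught on the next loop iteration.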


pkg/sql/sqlliveness/sltorage/slstorage.go, line 55 at r8 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

That is tricky though, isn't it? If a node goes offline, another node has to declare the session expired, otherwise until the off-line node comes back the claimed resources will not be released. Should we use some time offset constant that is guaranteed to be >= the max time skew?

My hope is that if other nodes are alive they will be deleting sessions. I think we need something to periodically delete expired sessions.

I really quite strongly do not want to get into the business of thinking about clock skew and we can avoid it with the deletion techniques.

Contributor Author

@spaskob spaskob left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @spaskob)


pkg/sql/sqlliveness/sltorage/slstorage.go, line 55 at r8 (raw file):

Previously, ajwerner wrote…

My hope is that if other nodes are alive they will be deleting sessions. I think we need something to periodically delete expired sessions.

I really quite strongly do not want to get into the business of thinking about clock skew and we can avoid it with the deletion techniques.

I see your idea: the Storage object will, upon creation, start a daemon that deletes other nodes' expired sessions without worrying about clock skew. The point is that the session will be deleted, and the SQL instance that owns it, when serving calls to Session, will either have to look up whether the session id is still in system.sqlliveness or rely on the heartbeat loop to detect that the session was deleted. In the second case there will be an interval (during the pause of the heartbeat loop) in which the session has been deleted but is still not considered expired by the owning SQL instance. This does not sound good, so we may have to query system.sqlliveness on each call to Session and use the blocking idea with channels that you described in your comment above. WDYT?

Contributor

@ajwerner ajwerner left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @spaskob)


pkg/sql/sqlliveness/sltorage/slstorage.go, line 55 at r8 (raw file):

or it will have to rely on the heartbeat loop to detect that the session was deleted.

This is the whole discussion at the end of the RFC. If a user of a session wants to guarantee that the session is still live, there are a variety of options.

  • The view of the session will carry the latest expiration which can be used to bound a transaction commit timestamp
  • The protocol can be that you have mutual exclusion so long as your claim exists and you transactionally verify the claim exists (and only remove claims of sessions which you know to no longer exist).
  • You can observe your session transactionally.

This all ends up not being a problem because transactions operate in a shared time domain.

This does not sound good, so we may have to query system.sqlliveness on each call to Session and use the blocking idea with channels that you described in your comment above. WDYT?

I do not think we need to query sessions except when heartbeating.

Contributor Author

@spaskob spaskob left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)


pkg/sql/sqlliveness/slinstance/slinstance.go, line 50 at r8 (raw file):

Previously, ajwerner wrote…

use timeutil.NewTimer

Done.


pkg/sql/sqlliveness/slinstance/slinstance.go, line 57 at r8 (raw file):

Previously, ajwerner wrote…

Ideally you wouldn't be holding the mutex during the entire transaction.

Done.


pkg/sql/sqlliveness/slinstance/slinstance.go, line 65 at r8 (raw file):

Previously, ajwerner wrote…

I'm not sure that this expiration > $2 condition is necessary. That matters when removing expired sessions, but I'm less certain it matters when heartbeating. If a session expires in the woods and nobody is around, did it expire? :p Furthermore, if you avoid using that predicate you never need to delete the session. My feeling is that making the deletion of the session the criterion for its failure is a nice simplification.

But in all seriousness, commentary on the properties of the sessions somewhere seems good.

Done.
Good idea!


pkg/sql/sqlliveness/slinstance/slinstance.go, line 84 at r8 (raw file):

Previously, ajwerner wrote…

I don't think you want to assign here because the function may get retried. I get that you're holding the mutex the whole time but I also am not super pleased about that.

Done.


pkg/sql/sqlliveness/slinstance/slinstance.go, line 88 at r8 (raw file):

Previously, ajwerner wrote…

Do you need to clear the session if you fail?

That's a good question, and more generally: how should we treat failures in the heartbeat loop? We should probably raise a poison flag and kill the SQL instance; otherwise we may get blocked in a call to Session waiting on the heartbeat to extend or create a new session.


pkg/sql/sqlliveness/slsession/slsession.go, line 11 at r8 (raw file):

Previously, ajwerner wrote…

This doesn't need to be in a subpackage, it should be an unexported struct in the slinstance package.

Done.


pkg/sql/sqlliveness/sltorage/slstorage.go, line 55 at r8 (raw file):

Previously, ajwerner wrote…

or it will have to rely on the heartbeat loop to detect that the session was deleted.

This is the whole discussion at the end of the RFC. If a user of a session wants to guarantee that the session is still live, there are a variety of options.

  • The view of the session will carry the latest expiration which can be used to bound a transaction commit timestamp
  • The protocol can be that you have mutual exclusion so long as your claim exists and you transactionally verify the claim exists (and only remove claims of sessions which you know to no longer exist).
  • You can observe your session transactionally.

This all ends up not being a problem because transactions operate in a shared time domain.

This does not sound good, so we may have to query system.sqlliveness on each call to Session and use the blocking idea with channels that you described in your comment above. WDYT?

I do not think we need to query sessions except when heartbeating.

Done.

Contributor Author

@spaskob spaskob left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)


pkg/sql/sqlliveness/sltorage/slstorage.go, line 55 at r8 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

Done.

Should we have a separate daemon that deletes expired sessions?

Contributor

@ajwerner ajwerner left a comment


Nice, this is looking really good! This still needs the daemon to actually delete the expired sessions, right? Some unit testing around the sqlliveness subpackages would be good. As a minor comment, it might be nice to rename the deprecated methods and fields in the jobs registry, as I found it somewhat unclear which methods had to do with the old system and which with the new one.

Reviewed 7 of 14 files at r1, 6 of 13 files at r4, 1 of 2 files at r5, 2 of 6 files at r6, 2 of 6 files at r8, 10 of 11 files at r9.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @spaskob)


pkg/jobs/adopt.go, line 53 at r9 (raw file):

	rows, err := r.ex.QueryEx(
		ctx, "select-running/reverting-jobs", nil,
		sqlbase.InternalExecutorSessionDataOverride{User: security.RootUser}, `

do we want this to be the NodeUser instead? I'm not super pleased that the RootUser has write access to the jobs table. Seems like a recipe for disaster.


pkg/jobs/adopt.go, line 100 at r9 (raw file):

	row, err := r.ex.QueryRowEx(
		ctx, "get-job-row", nil,
		sqlbase.InternalExecutorSessionDataOverride{User: security.RootUser}, `

same NodeUser comment.


pkg/jobs/adopt.go, line 109 at r9 (raw file):

	}
	if row == nil {
		return errors.Errorf("job %d: epoch has changed, no longer eligible for resuming", jobID)

nit: consider returning the value of crdb_internal.sql_liveness_is_alive(claim_session_id) rather than filtering on it to disambiguate the row actually getting deleted from underneath you from the case where the session is not live.


pkg/jobs/registry.go, line 215 at r9 (raw file):

		adoptionCh:          make(chan struct{}),
		sqlInstance: slinstance.NewSqlInstance(
			stopper, clock, db, ex, DefaultAdoptInterval+10*time.Duration(time.Second), time.Second,

nit: I'd take this as a dependency and then construct it in the server.


pkg/jobs/registry.go, line 499 at r9 (raw file):

			// TODO(spaskob): think about this invariant, can this ever happen?
			// Probably not, so we should fail hard in this case.
			panic("old job!")

consider making this a log.Fatal and adding some more surrounding information about the job type.


pkg/jobs/registry.go, line 622 at r9 (raw file):

			return
		}
		log.Infof(ctx, "Registry live claim (instance_id: %s, sid: %s).\n",

nit: lose the .\n and consider moving this to log.VEventf(ctx, 1, ...)


pkg/sql/sqlbase/system.go, line 398 at r9 (raw file):

	keys.StatementDiagnosticsTableID:          privilege.ReadWriteData,
	keys.ScheduledJobsTableID:                 privilege.ReadWriteData,
	keys.SqllivenessID:                        privilege.ReadWriteData,

I think we want this to be privilege.ReadData. I don't really want root mucking around in here.


pkg/sql/sqlliveness/slinstance/slinstance.go, line 11 at r9 (raw file):

// licenses/APL.txt.

package slinstance

nit: package level comment


pkg/sql/sqlliveness/slinstance/slinstance.go, line 34 at r9 (raw file):

	Sid sqlliveness.SessionID
	Exp hlc.Timestamp

nit: no reason to export these


pkg/sql/sqlliveness/slinstance/slinstance.go, line 43 at r9 (raw file):

func (s *session) Expiration() hlc.Timestamp { return s.Exp }

type SQLInstance struct {

nit: comment


pkg/sql/sqlliveness/slinstance/slinstance.go, line 83 at r9 (raw file):

func (l *SQLInstance) createSession(ctx context.Context) (*session, error) {
	id := uuid.MakeV4().GetBytes()
	exp := l.clock.Now().Add(l.ttl.Nanoseconds(), 1)

I don't think you need to add 1 to the logical field. Having a non-zero logical timestamp is pretty rare.


pkg/sql/sqlliveness/slinstance/slinstance.go, line 150 at r9 (raw file):

				if err := l.extendSession(ctx); err != nil {
					l.clearSession()
					t.Reset(0)

My sense is that we want some sort of exponential backoff here or perhaps in extendSession


pkg/sql/sqlliveness/slinstance/slinstance.go, line 187 at r9 (raw file):

		}

		select {

This should also listen on ctx.Done()


pkg/sql/sqlliveness/sltorage/slstorage.go, line 55 at r8 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

Should we have a separate daemon that deletes expired sessions?

Yes, I think we should. It can live in a separate package potentially.

@spaskob spaskob force-pushed the sql-liveness branch 3 times, most recently from 3aa1572 to 24a17f6 Compare July 29, 2020 12:44
Contributor Author

@spaskob spaskob left a comment


Responded to feedback - will add tests as an additional change after we finalize the main discussion.
There are 2 open questions left:

  1. Where to store the settings for heartbeat interval and expiration ttl?
  2. How to get a stopper for the slstorage object that is needed for the is_alive builtin?

Reviewed 1 of 12 files at r10.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @spaskob)


pkg/jobs/adopt.go, line 53 at r9 (raw file):

Previously, ajwerner wrote…

do we want this to be the NodeUser instead? I'm not super pleased that the RootUser has write access to the jobs table. Seems like a recipe for disaster.

done


pkg/jobs/registry.go, line 499 at r9 (raw file):

Previously, ajwerner wrote…

consider making this a log.Fatal and adding some more surrounding information about the job type.

Done.


pkg/sql/sqlbase/system.go, line 398 at r9 (raw file):

Previously, ajwerner wrote…

I think we want this to be privilege.ReadData. I don't really want root mucking around in here.

The tests start timing out when I do that. Before I dig further: what is the rationale? We are supposed to write to and delete from this table, correct?


pkg/sql/sqlliveness/slinstance/slinstance.go, line 83 at r9 (raw file):

Previously, ajwerner wrote…

I don't think you need to add 1 to the logical field. Having a non-zero logical timestamp is pretty rare.

Done.


pkg/sql/sqlliveness/slinstance/slinstance.go, line 150 at r9 (raw file):

Previously, ajwerner wrote…

My sense is that we want some sort of exponential backoff here or perhaps in extendSession

Done.


pkg/sql/sqlliveness/slinstance/slinstance.go, line 187 at r9 (raw file):

Previously, ajwerner wrote…

This should also listen on ctx.Done()

Done.


pkg/sql/sqlliveness/sltorage/slstorage.go, line 55 at r8 (raw file):

Previously, ajwerner wrote…

Yes, I think we should. It can live in a separate package potentially.

Code added - what's the reasoning behind it being in a separate package?
Also I will need some help - I am not sure how to get a stopper in https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/conn_executor.go#L2010 which I will need to pass to the slstorage object.

Contributor

@ajwerner ajwerner left a comment


  1. How about creating some cluster settings?
  2. We've been discussing this offline.

Another thought occurs to me, one thing that's been an observability problem with jobs is understanding how many times that a job has been adopted. How do you feel about adding an integer column next to the claim in the jobs table which we increment each time we write a new claim into the row?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @spaskob)

@spaskob
Copy link
Copy Markdown
Contributor Author

spaskob commented Jul 29, 2020

Added cluster settings for slstorage and threaded the slstorage through EvalCtx. If it looks good, I will do the same for slinstance. LMK.

Contributor Author

@spaskob spaskob left a comment


Replaced all constants with cluster settings. Ready for review.
As for the adoption counter idea: if we decide to, we can do it in a follow-up PR. Currently we can deduce this counter from the logs, and we also store all resume errors in the job payload; the number of resume errors should be a good approximation.

Reviewed 3 of 12 files at r10, 1 of 6 files at r11.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @spaskob)

Contributor

@ajwerner ajwerner left a comment


Continuing to get better. Still looking for some unit testing on the implementations of the various sqlliveness interfaces. One area I'd like to focus on before this merges, where you could ease some concerns of mine, is the upgrade path: making sure we gracefully handle jobs that are currently being run by the registry when the version upgrade flips.

Reviewed 1 of 11 files at r9, 2 of 13 files at r14, 3 of 6 files at r15, 1 of 2 files at r16, 4 of 6 files at r17, 5 of 5 files at r18.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @spaskob)


pkg/jobs/registry.go, line 476 at r18 (raw file):

		defer r.mu.Unlock()
		if _, alreadyRegistered := r.mu.adoptedJobs[*j.ID()]; alreadyRegistered {
			log.Fatalf(ctx, "job %d: was just created but foun din registered adopted jobs", *j.ID())

nit: foun din


pkg/jobs/registry_external_test.go, line 87 at r18 (raw file):

	jobs.LeniencySetting.Override(&s.ClusterSettings().SV, 0)
	const cancelInterval = time.Duration(math.MaxInt64)
	const adoptInterval = time.Nanosecond

under stressrace this can make things spin very fast. Consider time.Millisecond.


pkg/sql/sqlbase/system.go, line 398 at r9 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

The tests start timing out when I do that, before I dig far - what is the rationale - we are supposed to write and delete from this table correct?

Well, node should have read-write access with ReadData; just root shouldn't. My guess is that something is writing to this table using the root user instead of the node user, if it's timing out.


pkg/sql/sqlliveness/slinstance/slinstance.go, line 170 at r18 (raw file):

			everySecond := log.Every(time.Second)
			if s, _ := l.getSessionOrBlockCh(); s == nil {
				for r := retry.StartWithCtx(ctx, opts); r.Next(); {

retry has a feature where you can pass it a stopper that should make this cleaner. See the Closer field https://godoc.org/github.com/cockroachdb/cockroach/pkg/util/retry#Options


pkg/sql/sqlliveness/slinstance/slinstance.go, line 191 at r18 (raw file):

				for r := retry.StartWithCtx(ctx, opts); r.Next(); {
					select {
					// Finish immediately in tests instead of waiting for the retry.;w

nit: ;w


pkg/sql/sqlliveness/slstorage/slstorage.go, line 88 at r18 (raw file):

	for {
		t := timeutil.NewTimer()
		t.Reset(0)

I think you want this above the loop


pkg/sql/sqlliveness/slstorage/slstorage.go, line 96 at r18 (raw file):

		case <-t.C:
			t.Read = true
			t.Reset(s.gcInterval())

generally I prefer resetting it after the operation completes. If the delete-sessions query gets slow this might spin continuously, which I don't think we want.


pkg/sql/sqlliveness/sltorage/slstorage.go, line 55 at r8 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

Code added - what's the reasoning behind it being in a separate package?
Also I will need some help - I am not sure how to get a stopper in https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/conn_executor.go#L2010 which I will need to pass to the slstorage object.

It just feels orthogonal in functionality to the storage. I'm fine either way.

Contributor Author

@spaskob spaskob left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @spaskob)


pkg/jobs/registry_external_test.go, line 87 at r18 (raw file):

Previously, ajwerner wrote…

under stressrace this can make things spin very fast. Consider time.Millisecond.

Done.


pkg/sql/sqlbase/system.go, line 398 at r9 (raw file):

Previously, ajwerner wrote…

Well, node should have read-write access with ReadData; just root shouldn't. My guess is that something is writing to this table using the root user instead of the node user, if it's timing out.

Most probably; the root user is used everywhere in the jobs package. I propose creating an issue to fix this for the whole package in a follow-up PR.


pkg/sql/sqlliveness/slinstance/slinstance.go, line 170 at r18 (raw file):

Previously, ajwerner wrote…

retry has a feature where you can pass it a stopper that should make this cleaner. See the Closer field https://godoc.org/github.com/cockroachdb/cockroach/pkg/util/retry#Options

Done.


pkg/sql/sqlliveness/slstorage/slstorage.go, line 88 at r18 (raw file):

Previously, ajwerner wrote…

I think you want this above the loop

Done.


pkg/sql/sqlliveness/slstorage/slstorage.go, line 96 at r18 (raw file):

Previously, ajwerner wrote…

generally I prefer resetting it after the operation completes. If the delete-sessions query gets slow this might spin continuously, which I don't think we want.

Done.

Contributor

@ajwerner ajwerner left a comment


I think you need to rebase/regenerate the cluster versions stuff.

Reviewed 5 of 12 files at r19.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @spaskob)


out, line 1 at r19 (raw file):

Running make with -j8

junk


pkg/jobs/deprecated.go, line 57 at r19 (raw file):

}

func (r *Registry) maybeAdoptJob(

nit: prefix with deprecated


pkg/jobs/deprecated.go, line 290 at r19 (raw file):

}

func (r *Registry) newLease() *jobspb.Lease {

nit: prefix with deprecated


pkg/jobs/deprecated.go, line 300 at r19 (raw file):

}

func (r *Registry) cancelAllLocked(ctx context.Context) {

nit: prefix with deprecated


pkg/jobs/deprecated.go, line 312 at r19 (raw file):

// killed and that no one else tries to resume it. This essentially works as a
// barrier that only one function can cross and try to resume the job.
func (r *Registry) register(jobID int64, cancel func()) error {

nit: prefix with deprecated


pkg/sql/sqlliveness/sqlliveness.go, line 63 at r19 (raw file):

	Insert(context.Context, Session) error
	Update(context.Context, Session) error

Commentary on the semantics. In particular, I suspect you're going to want a concrete error or boolean value from Update to indicate that the session has been removed so that you can stop trying to update it.


pkg/sql/sqlliveness/slinstance/slinstance.go, line 141 at r19 (raw file):

			}
			everySecond := log.Every(time.Second)
			if s, _ := l.getSessionOrBlockCh(); s == nil {

Why not put the retry loops into createSession and extendSession? It feels more fitting there.


pkg/sql/sqlliveness/slinstance/slinstance.go, line 144 at r19 (raw file):

				for r := retry.StartWithCtx(ctx, opts); r.Next(); {
					newSession, err := l.createSession(ctx)
					if err != nil {

if ctx.Err() != nil don't log and probably just return.


pkg/sql/sqlliveness/slinstance/slinstance.go, line 151 at r19 (raw file):

					}
					if log.V(2) {
						log.Infof(ctx, "Created new SQL liveness session %s", newSession.ID())

I suspect this should not have a V gate.


pkg/sql/sqlliveness/slinstance/slinstance.go, line 160 at r19 (raw file):

				var err error
				for r := retry.StartWithCtx(ctx, opts); r.Next(); {
					if err = l.extendSession(ctx); err != nil {

This seems too simplistic. In the case that somebody has removed your claim, you'll never succeed, and you should definitely go back around and create a new session. In the case where you don't know, you want to retry. In the case where your context has been canceled, you want to return, which I guess is already the case, though rather implicitly.

I'm not sure I understand the reason to have a max number of retries to extend. My sense is you should try to extend until you know that it won't ever succeed.


pkg/sql/sqlliveness/slinstance/slinstance.go, line 162 at r19 (raw file):

					if err = l.extendSession(ctx); err != nil {
						if log.V(2) {
							log.Infof(ctx, "Could not extend SQL liveness session %s", s.ID())

I don't think this one needs a vmodule gate; it's a pretty unexpected thing. Maybe indicate which attempt number we're on.


pkg/sql/sqlliveness/slinstance/slinstance.go, line 168 at r19 (raw file):

					break
				}
				if err != nil {

If ctx.Err() != nil don't log and probably just return.


pkg/sql/sqlliveness/slinstance/slinstance.go, line 179 at r19 (raw file):

				}
			}
			t.Reset(l.hb())

This floating reset is hard to reason about given all the code above it. If more of the code were moved into helpers it might be clearer that this is the happy path.


pkg/sql/sqlliveness/slstorage/slstorage.go, line 107 at r19 (raw file):

				return err
			})
			t.Reset(s.gcInterval())

Sometimes, to make it more explicit, I lift this to just above the select and do:

if t.Read {
    t.Reset(...)
}

@ajwerner force-pushed the sql-liveness branch 5 times, most recently from d4d81a3 to e5111a3 on August 11, 2020 17:11
Spas Bojanov and others added 17 commits August 11, 2020 17:12
Touches cockroachdb#47892.

Add `claim_session_id` and `claim_instance_id` columns in a new column
family.  These two values specify the current claim by a job registry on
this job row.  The current live claims will live in a new table
`system.sqlliveness` along with an expiration deadline which will need
to be extended to keep the claims alive.

Add sql migration code to migrate old definitions to the new one and for
the new system table.

Release note: none.
Release note: none.
See https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20200615_sql_liveness.md
for background on sql liveness implementation.

Regarding using sqlliveness to implement a new job leasing mechanism:
the biggest difficulty was ensuring the correct migration of existing
clusters with running or paused jobs. These need to be handled by the
old job adoption and leasing implementation, while newly created jobs
need to be handled by the new job leasing and adoption.

There is a new job adoption function `claimAndProcessJobs`, implemented
using the new adoption API in `adopt.go`. The old adoption code is
preserved and will be triggered if and only if the cluster has not
been finalized; this code will be deleted in release `20.2`. All jobs
inherited from the previous version or created before the upgrade is
finalized will be handled by the old code.

When the cluster is finalized, the new adoption code will kick in for
newly created jobs. Old jobs that have not finished yet will be detected
as having expired claims (the `claim_session_id` column will be NULL),
canceled, and eventually resumed by the new adoption code.

Release note: none.
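The migration logic in that commit message can be sketched as a version-gated adoption loop: before finalization only the legacy lease-based path runs; afterwards the claim-based path takes over, treating rows with a NULL `claim_session_id` as expired claims to be re-claimed. Everything here (`registry`, `adoptionLoop`, the action strings) is an illustrative model of the described flow, not the actual code in `registry.go`.

```go
package main

import "fmt"

// registry models only the piece of state this sketch needs: a check for
// whether the version upgrade has been finalized.
type registry struct {
	finalized func() bool
}

type job struct {
	id             int64
	claimSessionID string // "" models a NULL claim_session_id column
}

// adoptionLoop returns the actions taken on one pass: legacy adoption in a
// mixed-version cluster, otherwise claiming of jobs whose claim has expired
// (legacy jobs show up here because their claim_session_id is NULL),
// followed by resumption of claimed jobs.
func (r *registry) adoptionLoop(jobs []job) []string {
	var actions []string
	if !r.finalized() {
		actions = append(actions, "legacy: maybeAdoptJobs")
		return actions
	}
	for _, j := range jobs {
		if j.claimSessionID == "" {
			actions = append(actions, fmt.Sprintf("claim job %d (expired/legacy claim)", j.id))
		}
	}
	actions = append(actions, "resume claimed jobs")
	return actions
}

func main() {
	jobs := []job{{id: 1, claimSessionID: ""}, {id: 2, claimSessionID: "s1"}}

	mixed := &registry{finalized: func() bool { return false }}
	fmt.Println(mixed.adoptionLoop(jobs)) // [legacy: maybeAdoptJobs]

	upgraded := &registry{finalized: func() bool { return true }}
	fmt.Println(upgraded.adoptionLoop(jobs))
}
```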
We introduce a new SQL builtin function
`crdb_internal.sql_liveness_is_alive` that returns
true if the provided sqlliveness session id has not
expired.

Release note: none.
Release note: none.
This test checks that jobs created before a backup are
preserved by restore, but we are introducing a
new column `claim_session_id`, which is a `uuid`
and needs to be excluded from the check that
job rows are identical for the same job id.

Also refactored the test to produce a more
helpful error message via `require.Equal`.

Release note: none.
The SQL instance is normally started at server startup, but if an
old cluster is upgraded it won't be started, and when the cluster
is finalized the new adoption logic will kick in without a running
SQL instance, which will cause adoption to stall.

Release note: none.
This was supposed to be deprecated in 20.2,
but because there is some chance of 19.2
jobs not finishing in 20.1 clusters, we will
deprecate it in 21.0.

Release note: none.
This caching, as discussed in the RFC, proved to be critical when there
are a large number of jobs, in order to avoid a quadratic number of
lookups. Having a large number of jobs can occur relatively commonly.

Release note: None
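The caching idea can be sketched as a memoized liveness check: with N jobs claimed by far fewer distinct sessions, caching per-session answers turns O(jobs) storage lookups into O(sessions). The types and names below (`livenessCache`, `IsAlive`) are illustrative, and the sketch deliberately omits expiry of cached entries, which a real cache would need since a live session can later die.

```go
package main

import "fmt"

// livenessCache memoizes IsAlive answers per session ID; `store` stands in
// for system.sqlliveness and `lookups` counts simulated storage round trips.
type livenessCache struct {
	lookups int
	store   map[string]bool
	cache   map[string]bool
}

func (c *livenessCache) IsAlive(sessionID string) bool {
	if alive, ok := c.cache[sessionID]; ok {
		return alive
	}
	c.lookups++ // simulated trip to the sqlliveness table
	alive := c.store[sessionID]
	c.cache[sessionID] = alive
	return alive
}

func main() {
	c := &livenessCache{
		store: map[string]bool{"s1": true, "s2": false},
		cache: map[string]bool{},
	}
	// 1000 jobs claimed by just two sessions: without the cache this would
	// be 1000 storage lookups; with it, only 2.
	for i := 0; i < 1000; i++ {
		sid := "s1"
		if i%2 == 1 {
			sid = "s2"
		}
		_ = c.IsAlive(sid)
	}
	fmt.Println("storage lookups:", c.lookups) // storage lookups: 2
}
```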
Also augmented the test a bit.

Release note: None
This commit does a few things:

1) Drops the `SQL` from `SQLInstance` as it merely introduced stutter in the
   name.

2) Makes the implementation of SessionID a string, as these things are
   immutable and we would like to use them to key maps.

3) Splits up the storage interface, as only the reading portion was used
   externally. The use of the `Session` interface in the writing portion
   ended up just being cruft.

4) Adds a `Provider` interface which represents the subsystem as a whole
   exported outwards, and an implementation in `slprovider`.

5) Adds sanity to the starting of the sqlliveness subsystem. Before this
   commit, the various components were started multiple times in different
   places. This commit moves the starting to the sqlserver Start method.

Release note: None
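Point 2 above pays off directly in the registry: a string-backed `SessionID` is comparable, so it can key maps and be compared against an expired session. The `adoptedJob` shape below is an illustrative sketch of how such a map might be used to cancel jobs when their claiming session dies, not the registry's actual representation.

```go
package main

import "fmt"

// SessionID being a plain string type (rather than an interface or byte
// slice) makes it comparable, so it can key maps and compare with ==.
type SessionID string

// adoptedJob sketches the bookkeeping for a job resumed under a claim.
type adoptedJob struct {
	session SessionID
	cancel  func()
}

func main() {
	adoptedJobs := map[int64]*adoptedJob{}
	canceled := false
	adoptedJobs[42] = &adoptedJob{
		session: SessionID("abc"),
		cancel:  func() { canceled = true },
	}

	// When a session expires, cancel every job claimed under it.
	expired := SessionID("abc")
	for id, aj := range adoptedJobs {
		if aj.session == expired {
			aj.cancel()
			delete(adoptedJobs, id) // deleting during range is safe in Go
		}
	}
	fmt.Println(canceled, len(adoptedJobs)) // true 0
}
```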
@spaskob
Contributor Author

spaskob commented Aug 12, 2020

bors r+

@craig
Contributor

craig bot commented Aug 12, 2020

Build succeeded:
