
sql: schema change job resumer #44391

Merged
thoszhang merged 3 commits into cockroachdb:master from
thoszhang:schema-change-job-resumer-rebased
Mar 6, 2020

Conversation

@thoszhang

@thoszhang thoszhang commented Jan 27, 2020

jobs: allow jobs to create internal executors

This PR adds a SessionBoundInternalExecutorFactory to the job
registry, and exposes the ability to create session bound internal
executors in jobs via a method MakeSessionBoundInternalExecutor() on
Job.

This is intended for use by the schema change job resumer, which stores
state on the internal executors it creates so that it can override the
tables in the TableCollection for each internal validation query.

Release note: None
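As a rough sketch of the shape this API takes, the registry holds a factory that jobs invoke on demand. All type and method bodies below are simplified, illustrative stand-ins for the real CockroachDB types, not the actual signatures:

```go
package main

import "fmt"

// Hypothetical stand-ins for the real session and executor types.
type SessionData struct{ User string }
type InternalExecutor struct{ sd *SessionData }

// SessionBoundInternalExecutorFactory builds an executor bound to the
// given session data, letting each job carry per-session state.
type SessionBoundInternalExecutorFactory func(sd *SessionData) *InternalExecutor

type Registry struct {
	factory SessionBoundInternalExecutorFactory
}

// A setter (rather than a constructor argument) lets the factory be
// injected late, sidestepping a package dependency cycle.
func (r *Registry) SetSessionBoundInternalExecutorFactory(f SessionBoundInternalExecutorFactory) {
	r.factory = f
}

type Job struct{ registry *Registry }

// MakeSessionBoundInternalExecutor is the method a resumer would call to
// get an executor on which it can stash state (e.g. overridden tables).
func (j *Job) MakeSessionBoundInternalExecutor(sd *SessionData) *InternalExecutor {
	return j.registry.factory(sd)
}

func main() {
	reg := &Registry{}
	reg.SetSessionBoundInternalExecutorFactory(func(sd *SessionData) *InternalExecutor {
		return &InternalExecutor{sd: sd}
	})
	job := &Job{registry: reg}
	ie := job.MakeSessionBoundInternalExecutor(&SessionData{User: "root"})
	fmt.Println(ie.sd.User) // the executor is bound to this session's data
}
```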


jobs: randomize job order in adoption loop

Previously, the job adoption loop would pick a job to adopt by
attempting to adopt jobs in decreasing order by created timestamp. On a
single-node cluster, this could cause a deadlock where a schema change
job is never adopted because it's stuck behind a newer schema change job
with a higher mutation ID, which itself is stuck in a retry loop due to
not being first in line in the mutations list.

This PR randomizes the order in which jobs are up for attempted
adoption when triggered by the timer. In the case where we attempt to
adopt a job upon being notified of a queued job, we continue to try to
adopt the newest job.

Release note (general change): Non-running jobs are now considered for
adoption in randomized order instead of in a deterministic order by
creation time, to avoid potential deadlocks when schema change jobs need
to execute in a specific order. This is a preemptive change, not a bug
fix, but it affects all jobs.


sql, jobs: add schema change job resumer

[Commit message is WIP; what follows for now is a general description for reviewers.]

This commit adds a SchemaChangeResumer for the schema change job, which creates and runs a SchemaChanger to process outstanding schema changes for a given table ID and mutation ID (with the exception of dropping a database). A schema change job is now queued to run after a transaction that initiates a schema change before results are returned to the client. Schema changes are now managed like other jobs and can be paused, cancelled, etc.

Specifically:

  • SchemaChangers are now created solely within jobs. SchemaChanger.exec() has been split into a main schema change phase (for Resume) and a rollback phase (for OnFailOrCancel), and most of the code that updates job state (including making a new rollback job) has been removed from the schema changer.
  • The SchemaChangeManager, which used to handle async schema changes, and the SchemaChangers queued in extraTxnState are all removed. TableDescriptor.Lease is no longer used.
  • We now queue a job for every schema change, even ones with no mutation, for every table affected (including, e.g., referenced tables for foreign keys), so we can wait for old leases to expire. The API for writing an updated table descriptor (used in, e.g., the execution of ALTER TABLE) has been refactored to accommodate this. Creating jobs for dropping tables (including views, etc.) has been unified with creating jobs for other schema changes.
  • The schema change tests have been refactored and (partially or entirely) skipped or removed in a lot of cases. Testing knobs only relevant to the old way of running schema changes have been removed. Tests are skipped for now either because this PR doesn't include any way to GC tables/indexes (see below) or because they were too complex and tightly coupled to the previous implementation and there seemed to be diminishing returns in getting them to work right now.

Known issues (which I think we can defer to follow-up PRs):

  • GC for tables and indexes after gc.ttlseconds have elapsed will be implemented in a separate PR. At that point, some of the skipped tests can be revisited.
  • We now start or update schema change jobs for every schema change and for every affected table, and some of these jobs are only spawned due to CASCADE, or because a referenced table is updated, etc. I put in some basic placeholder names for now since I'm not sure what the best approach is from a UI perspective.
  • When running a test with an artificially low cluster version, one of the cluster migrations gets stuck because it involves a schema change, but it can't run because the job registry isn't initialized at that point. I don't think this is a problem that can actually happen, and the point of the tests is solely to test cluster version-related behavior, but I have skipped the tests and need to un-skip them.
  • I lowered the retry interval to 0.01s for SHOW JOBS WHEN COMPLETE to reduce the delay for short schema changes and get the tests to run as quickly as possible on CI, but this is almost certainly too low. The tests time out when the interval is 0.25s (at some point they didn't, but I think the recently introduced random job ordering is also slowing things down). We need a way to retry more frequently in the beginning and increase the backoff over time. Update 3/4/20: This is addressed by jobs: configure more frequent retries for polling in Registry.Run()  #45677.
  • A few of the more complex schema change tests still need to be ported over to the new way of running schema changes. I think the most undertested area is concurrent schema changes not in the same transaction.
  • The more I think about it, the more I think that schema change jobs without mutations (i.e., everything that's not long-running) should not be cancellable.

@cockroach-teamcity
Member

This change is Reviewable

@thoszhang thoszhang force-pushed the schema-change-job-resumer-rebased branch from 18dca1d to 6a02833 Compare January 27, 2020 04:05
@thoszhang thoszhang marked this pull request as ready for review January 27, 2020 04:05
@thoszhang thoszhang force-pushed the schema-change-job-resumer-rebased branch 5 times, most recently from 3a9d1d8 to 59a342b Compare January 27, 2020 21:24
@thoszhang thoszhang added the do-not-merge bors won't merge a PR with this label. label Jan 27, 2020
@thoszhang thoszhang force-pushed the schema-change-job-resumer-rebased branch from 59a342b to 8e77ef1 Compare January 27, 2020 23:19
@thoszhang thoszhang requested review from dt and spaskob January 27, 2020 23:20
Contributor

@spaskob spaskob left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @lucy-zhang, and @spaskob)


pkg/jobs/registry.go, line 265 at r1 (raw file):

StatusFailed

or StatusCanceled perhaps?


pkg/jobs/registry.go, line 338 at r1 (raw file):

// Registry.Start has been called will not have any effect.
// TODO (lucy): change this back
var DefaultAdoptInterval = 100 * time.Millisecond

Will you change this back before merging?
This may be too fast, since this code polls the jobs table.

@thoszhang thoszhang force-pushed the schema-change-job-resumer-rebased branch 2 times, most recently from a29620f to 8a48e81 Compare February 3, 2020 19:52
@thoszhang thoszhang force-pushed the schema-change-job-resumer-rebased branch 3 times, most recently from dddd41b to 4a78615 Compare February 18, 2020 19:18
@thoszhang thoszhang force-pushed the schema-change-job-resumer-rebased branch 12 times, most recently from 588204c to b35604e Compare February 25, 2020 00:26
@thoszhang thoszhang requested a review from ajwerner March 3, 2020 20:05
@thoszhang
Author

RFAL

@thoszhang
Author

FYI, I became dissatisfied with my attempt to improve the internal executor API (some progress is in #45617), so I decided to put that aside and stick with the approach of giving the job registry an internal executor factory for now. One advantage of this is that it requires fewer changes to the schema change internals. I think we can live with it for 20.1.

@thoszhang thoszhang force-pushed the schema-change-job-resumer-rebased branch from 1362420 to d064916 Compare March 3, 2020 21:04
Contributor

@spaskob spaskob left a comment


Thank you for the detailed description of the commits!
It's a big PR, so I did a first pass and added some preliminary comments. I plan to do more after seeing comments from the others. I don't have knowledge of most of the components apart from jobs, so I wonder what an effective way is to get this PR through while still making sure it is well reviewed.

Reviewed 56 of 59 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @dt, @lucy-zhang, and @pbardea)


pkg/jobs/registry.go, line 890 at r2 (raw file):

			if err != nil {
				// TODO (lucy): This needs to distinguish between assertion errors in
				// the job registry and assertion errors elsewhere, and only fail on the

What do you mean by elsewhere? The Resume function?


pkg/jobs/registry_test.go, line 188 at r2 (raw file):

	t.Skip("")
	// TODO (lucy): This test probably shouldn't continue to exist in its current
	// form if GCMutations will cease to be used. Refactor or get rid of it.

Is it possible to fix the test in this PR? The GC code has been a source of production issues.


pkg/sql/schema_changer.go, line 678 at r2 (raw file):

			}

			// if dropJobID != 0 {

Is this a leftover from testing, or does it need a TODO to document it?


pkg/sql/schema_changer.go, line 769 at r2 (raw file):

	if err := sc.initJobRunningStatus(ctx); err != nil {
		if log.V(2) {
			log.Infof(ctx, "Failed to update job %d running status: %+v", *sc.job.ID(), err)

Please use format:
"job %d: ..."
We should put some job specific logging utilities in place at some point...


pkg/sql/schema_changer.go, line 1792 at r2 (raw file):

				log.Infof(
					ctx,
					"descriptor %d not found for schema change job %d processing mutation %d;"+

job %d: ...
This is the format I have consistently used for jobs.


pkg/sql/schema_changer.go, line 1804 at r2 (raw file):

				return err
			} else {
				return jobs.NewRetryJobError(err.Error())

Is it possible for a job to be stuck in always returning retriable errors?


pkg/sql/schema_changer.go, line 1848 at r2 (raw file):

	}
	if details.TableID == sqlbase.InvalidID {
		return errors.AssertionFailedf("job %d has no database ID or table ID", r.job.ID())

job %d: ...


pkg/sql/schema_changer.go, line 1879 at r2 (raw file):

			log.Infof(
				ctx,
				"descriptor %d not found for rollback of schema change job %d processing mutation %d;"+

ditto


pkg/sql/schema_changer.go, line 1891 at r2 (raw file):

			// Always retry when we get an error. Otherwise we risk leaving the in-
			// progress schema change state on the table descriptor forever.
			return jobs.NewRetryJobError(rollbackErr.Error())

similarly here can we get stuck in infinitely retrying a job in this case?


pkg/sql/table.go, line 857 at r2 (raw file):

	var spanList []jobspb.ResumeSpanList
	if job != nil {

nit: since this predicate or its negation is used later in the code, consider a bool:
jobExists := job != nil


pkg/sql/table.go, line 934 at r2 (raw file):

			}
		}
		log.Infof(ctx, "updated existing schema change job for table %d, mutation %d",

add the job id as well, I am trying to have a consistent format of job related messages:
"job %d: info about this job"


pkg/sql/delegate/show_jobs.go, line 51 at r2 (raw file):

	// TODO (lucy): Ideally we should retry a few times with a small backoff to
	// quickly pick up the schema change jobs that run quickly, before increasing

The jobs infrastructure is not a good fit for sub-minute executions. For future releases we should re-think what part of the job infrastructure can be abstracted away to support small fast operations that can be queued but don't need support for resuming (i.e., they can be retried if failed).

Contributor

@ajwerner ajwerner left a comment


very preliminary first pass, will need more, just wanted to publish the comments I had.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @dt, @lucy-zhang, and @pbardea)


pkg/jobs/registry.go, line 199 at r2 (raw file):

}

// SetSessionBoundInternalExecutorFactory sets the

nit: can you note that this isn't on the constructor to bypass a circular dependency?


pkg/server/server.go, line 836 at r2 (raw file):

	// SessionBoundInternalExecutorFactory.
	s.distSQLServer.ServerConfig.SessionBoundInternalExecutorFactory =
		func(

these factory functions are the same, can you just make it a variable?


pkg/sql/conn_executor.go, line 2131 at r2 (raw file):

				// error is the schema change error.
				// TODO (lucy): I'm not sure the above is true. What about DROP TABLE
				// with multiple tables?

Or DROP DATABASE


pkg/sql/schema_changer.go, line 1492 at r2 (raw file):

		}
	}
	if err := sc.job.WithTxn(txn).SetDetails(

:( every time I see this API used I get so sad. It needs to go with serious urgency.


pkg/sql/schema_changer.go, line 1788 at r2 (raw file):

			},
		}
		if err := sc.exec(ctx); err != nil {

nit: I think this block deserves an NB that gives some discussion of the types of errors and their handling. I also might do this as a switch; I don't think the linter likes returns in if { ... } else if { ... } chains.


pkg/sql/schema_changer.go, line 1833 at r2 (raw file):

// OnTerminal is part of the jobs.Resumer interface.
func (r schemaChangeResumer) OnTerminal(

Given the whole thing with queued jobs behind this one needing to wait for an adoption and the latency involved there, maybe it'd be a good idea to poke the adoption loop here if there is another job. Of course the complexity there is knowing that there is another job queued up behind this job. There's some opportunities to do that but it's a little awkward.

I hate the arbitrary interface to these callbacks.

@thoszhang thoszhang force-pushed the schema-change-job-resumer-rebased branch 2 times, most recently from 528ae03 to 95caf3e Compare March 4, 2020 18:17
optional int64 expiration_time = 2 [(gogoproto.nullable) = false];
}
optional SchemaChangeLease lease = 15;
optional SchemaChangeLease lease = 15 [deprecated = true];
Contributor


can be a followup, but I think we usually also rename the field to DeprecatedFoo since the annotation is only shown in some editors.

// Find the job.
var jobID int64
if len(tableDesc.MutationJobs) > 0 {
// TODO (lucy): We need to get rid of MutationJobs. This is the only place
Contributor


Ugh, yeah, I'd forgotten this bit of code, and how unfortunate its use of jobs was. I've been wanting to rework this and how we checkpoint/load spans constantly in the backfiller and this is yet another bit of motivation.

return err
}
}
if err := sc.ExtendLease(ctx, lease); err != nil {
Contributor


2¢: I think it might be good to include checks of ctx.Err() where we used to check for lease loss, since job-lease loss would be communicated that way. We'll probably find it anyway when we go to do an RPC, but it'd be nice to have some explicit checks at least as often as we used to check the lease, I think.

return err
}

if err := sc.ExtendLease(ctx, lease); err != nil {
Contributor


ditto

@thoszhang thoszhang force-pushed the schema-change-job-resumer-rebased branch from 95caf3e to 082b9c2 Compare March 5, 2020 06:11
@thoszhang thoszhang force-pushed the schema-change-job-resumer-rebased branch from 082b9c2 to 71af92a Compare March 5, 2020 21:26
Author

@thoszhang thoszhang left a comment


This is now temporarily rebased on top of #45765.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @dt, @pbardea, and @spaskob)


pkg/jobs/registry.go, line 199 at r2 (raw file):

Previously, ajwerner wrote…

nit: can you note that this isn't on the constructor to bypass a circular dependency?

Done.


pkg/jobs/registry.go, line 890 at r2 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

What do you mean by elsewhere? The Resume function?

Yeah, that's what I meant. I've updated the comment.


pkg/jobs/registry_test.go, line 188 at r2 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

Is it possible to fix the test in this PR? The GC code has been a source of production issues.

Probably, I'll work on it.


pkg/server/server.go, line 836 at r2 (raw file):

Previously, ajwerner wrote…

these factory functions are the same, can you just make it a variable?

Done.


pkg/sql/conn_executor.go, line 2131 at r2 (raw file):

Previously, ajwerner wrote…

Or DROP DATABASE

DROP DATABASE is a special case since we only queue a single job, so whatever single error we get from that is what we'll return to the client. DROP TABLE can drop multiple tables each in their own jobs, though, which is what I was concerned about. In Registry.Run() we're returning the first error we get from any queued job, so you'd only be able to see other errors by looking them up in the jobs table.

But, actually, there's more to it. I'd been assuming that a single schema change statement can only produce a single job, with the sole exception of DROP TABLE. But this isn't true, because in order to reproduce the existing SchemaChanger behavior, we're now queuing a job for every updated table, which includes, e.g., the referenced table for a new foreign key, just so that job can wait for old table leases to expire on that table.

The more I think about it, the more I think we should just be waiting for old table leases to expire somewhere else. There's no good reason why it has to happen in the schema changer itself. We could just iterate through all modified tables in the TableCollection in the connExecutor after we finish waiting for all queued jobs and wait until all their old leases expire. Spas points out that these very small jobs that don't need cancel/resume/etc. shouldn't really be jobs in the normal sense at all, which I think is true.


pkg/sql/schema_changer.go, line 678 at r2 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

Is this a leftover from testing or needs a TODO to document?

I meant to get rid of the lines commented out. But we actually don't need a job update here at all, since the GC job is what should be setting the status to Status_WAIT_FOR_GC_INTERVAL; after draining names for a dropped table, the job itself is done. I've removed the job update entirely.


pkg/sql/schema_changer.go, line 769 at r2 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

Please use format:
"job %d: ..."
We should put some job specific logging utilities in place at some point...

Addressed in #45728. I took out the job %d here.


pkg/sql/schema_changer.go, line 1492 at r2 (raw file):

Previously, ajwerner wrote…

:( every time I see this API used I get so sad. It needs to go with serious urgency.

I agree


pkg/sql/schema_changer.go, line 1788 at r2 (raw file):

Previously, ajwerner wrote…

nit: I think this block deserves an NB that gives some discussion of the types of errors and their handling. I also might do this as a switch; I don't think the linter likes returns in if { ... } else if { ... } chains.

Done.


pkg/sql/schema_changer.go, line 1792 at r2 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

job %d: ....
This is the format I have consistently used for jobs.

Same as above


pkg/sql/schema_changer.go, line 1804 at r2 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

Is it possible for a job to be stuck in always returning retriable errors?

Not normally, but a bug somewhere that causes endless retriable errors could cause this (which was also true before). Overall, I'm trying to make the error handling logic match the old behavior, where any outstanding mutations would cause a schema changer to attempt to run no matter what. The goal is for table descriptors to never get out of sync with jobs and to avoid orphaned mutations, by ensuring that (for schema changes with mutations) there's always a job as long as the mutation is still there. I think this is still an improvement over the previous implementation, since the difference now is that jobs can at least be paused or cancelled if we're looping Resume().

Always retrying in OnFailOrCancel() is the more risky decision, since we can't cancel the job in that state, and we'd be retrying even on permanent errors (which is also the pre-existing behavior). But even in that case we'd be able to at least pause the job.

I think this change constitutes a strict improvement over the previous behavior, in that we can at least always pause the job if it's looping. Arguably, we should give up on retrying after some number of attempts and just leave the bad state to be cleaned up manually, but I'd rather leave that for a later PR (and we may want to make that change in the job registry for all jobs, not in this job resumer).
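The policy described in this thread — in Resume, retry only errors explicitly marked retriable, but during rollback retry everything to avoid stranding in-progress state on the descriptor — could be sketched roughly as follows. The names here (retryJobError, shouldRetry) are simplified stand-ins for jobs.NewRetryJobError and the registry's actual error handling:

```go
package main

import (
	"errors"
	"fmt"
)

// retryJobError marks an error the registry should retry rather than
// fail the job on; a stand-in for the real jobs.NewRetryJobError.
type retryJobError struct{ msg string }

func (e *retryJobError) Error() string { return e.msg }

var errPermanent = errors.New("permanent failure")

// shouldRetry sketches the policy: rollback (OnFailOrCancel) retries on
// any error, while Resume retries only errors wrapped as retryJobError.
func shouldRetry(err error, inRollback bool) bool {
	if inRollback {
		return true
	}
	var r *retryJobError
	return errors.As(err, &r)
}

func main() {
	fmt.Println(shouldRetry(errPermanent, false))               // false: Resume fails out
	fmt.Println(shouldRetry(&retryJobError{"desc busy"}, false)) // true: marked retriable
	fmt.Println(shouldRetry(errPermanent, true))                // true: rollback always retries
}
```

The trade-off discussed above is visible in the last case: a buggy rollback can loop forever, but the job remains pausable, which is still better than leaving a half-applied mutation on the descriptor.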


pkg/sql/schema_changer.go, line 1833 at r2 (raw file):

Previously, ajwerner wrote…

Given the whole thing with queued jobs behind this one needing to wait for an adoption and the latency involved there, maybe it'd be a good idea to poke the adoption loop here if there is another job. Of course the complexity there is knowing that there is another job queued up behind this job. There's some opportunities to do that but it's a little awkward.

I hate the arbitrary interface to these callbacks.

I think that's a good idea, especially for primary key changes where we queue up another job with a mutation after the first one finishes. I'm going to file this away as a potential improvement to follow (and as part of thinking more about how to improve the situation with #45150). (FWIW, OnTerminal is sort of deprecated in favor of just using both Resume and OnFailOrCancel, and I think it's slated to go away after 20.1.)


pkg/sql/schema_changer.go, line 1848 at r2 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

job %d: ...

Same as above


pkg/sql/schema_changer.go, line 1879 at r2 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

ditto

Same as above


pkg/sql/schema_changer.go, line 1891 at r2 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

similarly here can we get stuck in infinitely retrying a job in this case?

See above comment for Resume().


pkg/sql/table.go, line 857 at r2 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

nit: since this predicate or its negation is used later in the code, consider a bool:
jobExists := job != nil

Done.


pkg/sql/table.go, line 934 at r2 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

add the job id as well, I am trying to have a consistent format of job related messages:
"job %d: info about this job"

Done.


pkg/sql/delegate/show_jobs.go, line 51 at r2 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

The jobs infrastructure is not a good fit for sub-minute executions. For future releases we should re-think what part of the job infrastructure can be abstracted away to support small fast operations that can be queued but don't need support for resuming (i.e., they can be retried if failed).

Yeah, I agree with this.

One schema change-specific issue is that many (though not all) of the short jobs that we're queuing are doing nothing except waiting for old table leases to expire, which we used to queue a schema changer for, but probably didn't need to be in the schema changer at all. We should consider waiting for old table leases to expire as a separate step in the connExecutor after all the queued jobs run, which would allow us to stop queuing jobs for schema changes that require nothing beyond a write to the table descriptor.


pkg/sql/rowexec/backfiller.go, line 271 at r5 (raw file):

Previously, dt (David Taylor) wrote…

Ugh, yeah, I'd forgotten this bit of code, and how unfortunate its use of jobs was. I've been wanting to rework this and how we checkpoint/load spans constantly in the backfiller and this is yet another bit of motivation.

I'm hoping that even if we don't change the way this code works, we can at least give the backfiller a job ID or something.


pkg/sql/sqlbase/structured.proto, line 640 at r5 (raw file):

Previously, dt (David Taylor) wrote…

can be a followup, but I think we usually also rename the field to DeprecatedFoo since the annotation is only shown in some editors.

Hmm, I saw something about how the JSON protobuf encoding uses field names in the admin UI or something and didn't want to risk changing the name. Maybe that doesn't apply since we're not using the field anymore.

Contributor

@ajwerner ajwerner left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @dt, @pbardea, and @spaskob)


pkg/sql/conn_executor.go, line 2131 at r2 (raw file):

Previously, lucy-zhang (Lucy Zhang) wrote…

DROP DATABASE is a special case since we only queue a single job, so whatever single error we get from that is what we'll return to the client. DROP TABLE can drop multiple tables each in their own jobs, though, which is what I was concerned about. In Registry.Run() we're returning the first error we get from any queued job, so you'd only be able to see other errors by looking them up in the jobs table.

But, actually, there's more to it. I'd been assuming that a single schema change statement can only produce a single job, with the sole exception of DROP TABLE. But this isn't true, because in order to reproduce the existing SchemaChanger behavior, we're now queuing a job for every updated table, which includes, e.g., the referenced table for a new foreign key, just so that job can wait for old table leases to expire on that table.

The more I think about it, the more I think we should just be waiting for old table leases to expire somewhere else. There's no good reason why it has to happen in the schema changer itself. We could just iterate through all modified tables in the TableCollection in the connExecutor after we finish waiting for all queued jobs and wait until all their old leases expire. Spas points out that these very small jobs that don't need cancel/resume/etc. shouldn't really be jobs in the normal sense at all, which I think is true.

I was just having this very conversation with @pbardea. The interactions between waiting for leases to drain, waiting for leases to drain on versions of tables which don't even exist yet, and waiting for jobs is very hard for me to reason about. This waiting is a key component of the "correctness" of our DDLs.

For some jobs it's also critical to wait for multiple steps through multiple versions. For other changes we don't want to do much waiting at all.

For example, when dropping a database, we want to wait for there to be no leases on any of its tables. When dropping an index, we just need to wait for there to be no leases on that version that had that index as public. When we do the drop as a "backfill" we definitely don't need to wait for that backfill to finish (though if that's the behavior, I'm not so upset about it). I see there's a boolean threaded through decisions to control the queueing of jobs which I surmise is related to this waiting.

The fact that this waiting relates to correctness has me worried about how we reason about it. Is there a way we can logically centralize the waiting, or at least clarify which changes lead to which waiting?


pkg/sql/schema_changer.go, line 1804 at r2 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

Is it possible for a job to be stuck in always returning retriable errors?

I'm actually okay with that so long as the errors really are retriable. Much better than having to think about the case where we fail out due to retriable errors.

All that being said, should there be some mechanism to do some backoff if we're spinning?

Please add some metrics or at least a TODO about metrics. I could see the metrics existing in the jobs layer.


pkg/sql/schema_changer.go, line 1891 at r2 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

similarly here can we get stuck in infinitely retrying a job in this case?

I'd much rather retry forever than do something incorrect. This should potentially be louder.

Contributor

@ajwerner ajwerner left a comment


A bunch of people have now had some eyes on this. The commentary has been a mix of theoretical and nit-picky without a lot of highly actionable concrete feedback. The thing with this PR is it's scary for what it changes and deletes much more than for what it adds. We're learning a lot less leaving this in review than having it merged.

The question in my mind is, how much pain is there going to be post merge? Are there known issues with testing w.r.t. either failure or timings? If the answer is no then I'm emphatically in favor of merging this thing first and then spending time understanding the implications once it's merged. Maybe Sunday is better than today.

If there are items you'd like to do before pressing the button, it might be helpful to reviewers if you could list them out. If there's anything I or anybody can do to help, let us know.

I'd rather have this in and building confidence as the rest of the company shifts to testing and staring at the database a bunch than to have it sit in several more rounds of review. I feel that way even with a substantial list of follow up and testing to be added later so long as the merge doesn't disrupt everybody.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @dt, @pbardea, and @spaskob)

@dt
Contributor

dt commented Mar 6, 2020

to echo what @ajwerner said, I think another point in favor of merging sooner is we (reviewers) could help take some of the workload in some cases by just doing the suggested nits/cleanups/etc as followups

@spaskob
Contributor

spaskob commented Mar 6, 2020

Sounds good to me, I will do one look now and will approve unless I find something concrete.

Copy link
Copy Markdown
Contributor

@spaskob spaskob left a comment


Reviewed 9 of 55 files at r5, 2 of 15 files at r6.

@thoszhang thoszhang force-pushed the schema-change-job-resumer-rebased branch 2 times, most recently from adaa258 to f9ac5a2 Compare March 6, 2020 20:48
@thoszhang thoszhang force-pushed the schema-change-job-resumer-rebased branch from f9ac5a2 to 497f15e Compare March 6, 2020 22:07
Author

@thoszhang thoszhang left a comment


What I'm concerned about now is that the logic tests appear to be flaky. TC reports that TestParallel/subquery_retry_multinode has timed out twice in the last few iterations of this PR (and a few times before that, though the code has diverged). I don't get any failures when stressing this test in particular on my laptop. I'm not sure whether this is a case of all the logic tests just taking longer, or a new nondeterministic deadlock introduced somewhere, or something else, but I'd rather not merge in this state. I'm still looking into it.



pkg/sql/conn_executor.go, line 2131 at r2 (raw file):

I see there's a boolean threaded through decisions to control the queueing of jobs which I surmise is related to this waiting.

queueJob is actually controlling whether we're queuing individual jobs to drop tables (which is to say, drain names for them; GC happens separately), or whether we don't queue individual jobs because the tables are being dropped as part of CREATE DATABASE. I added a comment to clarify in drop_database.go.

As for the issue being discussed, the places where we currently wait for leases to drain (i.e., call WaitForOneVersion) are:

  1. PublishMultiple/Publish (so, internal waiting while advancing the schema change state machine)
  2. in waitToUpdateLeases in a defer at the end of executing every schema change, to ensure that all old leases have expired for the table being operated on before we return results to the client
  3. a bunch of somewhat ad-hoc calls to waitToUpdateLeases throughout the schema changer, usually with the same goal as in (2) but for other tables (e.g., a referenced table in a FK constraint)

(The job registry only waits for the job to finish, so anything related to waiting for leases to drain currently has to go in the job.)

My original thought was to have the schema change job be responsible for all "internal" waiting within the job and for something else to be responsible for waiting for leases to drain on all updated table descriptors after a job. But most of the complications seem to be with schema changes that update multiple table descriptors (FKs, interleaved indexes, truncate, etc.). I don't really have a good proposal right now.


pkg/sql/schema_changer.go, line 1804 at r2 (raw file):

All that being said, should there be some mechanism to do some backoff if we're spinning?

Probably. I added some TODOs in the job registry.


pkg/sql/schema_changer.go, line 1891 at r2 (raw file):

Previously, ajwerner wrote…

I'd much rather retry forever than do something incorrect. This should potentially be louder.

I elaborated slightly in the comment.
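The "retry forever rather than do something incorrect, but be louder" idea discussed here can be sketched as follows. This is a hand-wavy illustration, not the schema changer's actual code: the function, sentinel error, and threshold are all invented, and real code would use structured logging rather than Printf.

```go
// Sketch of endless retry with escalation: retriable errors never fail
// the operation, but past a threshold we start warning loudly so an
// operator notices the spinning.
package main

import (
	"errors"
	"fmt"
)

// errRetriable is a stand-in sentinel for "safe to retry" failures.
var errRetriable = errors.New("descriptor lease not yet drained")

// runWithEndlessRetry retries op until it succeeds, never giving up on
// retriable errors, and warns on every attempt past loudAfter. It
// returns the attempt number that succeeded.
func runWithEndlessRetry(op func(attempt int) error, loudAfter int) int {
	for attempt := 1; ; attempt++ {
		err := op(attempt)
		if err == nil {
			return attempt
		}
		if !errors.Is(err, errRetriable) {
			// Only non-retriable errors escape the loop without success.
			panic(err)
		}
		if attempt >= loudAfter {
			fmt.Printf("WARNING: still retrying after %d attempts: %v\n", attempt, err)
		}
	}
}

func main() {
	attempts := runWithEndlessRetry(func(attempt int) error {
		if attempt < 4 {
			return errRetriable
		}
		return nil
	}, 2)
	fmt.Println("succeeded after", attempts, "attempts")
}
```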

@ajwerner
Contributor

ajwerner commented Mar 6, 2020

TestParallel/subquery_retry_multinode is #38257 (comment), it's not this so you don't need to dig deeper.

@thoszhang
Author

Oh, thanks, I overlooked the newest activity on that issue. I'm going to merge this now, TFTRs

@thoszhang thoszhang merged commit 4364d1a into cockroachdb:master Mar 6, 2020
@thoszhang thoszhang deleted the schema-change-job-resumer-rebased branch March 6, 2020 23:16