backupccl: add missing context cancel checks to restore #96302

craig[bot] merged 1 commit into cockroachdb:master
Conversation
🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf. It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?
The failure in CI is the data race being fixed in #95808. |
pkg/ccl/backupccl/restore_job.go
Outdated
```diff
-	for progress := range progCh {
-		mu.Lock()
-		var progDetails backuppb.RestoreProgress
-		if err := pbtypes.UnmarshalAny(&progress.ProgressDetails, &progDetails); err != nil {
-			log.Errorf(ctx, "unable to unmarshal restore progress details: %+v", err)
-		}
+	for {
+		select {
+		case <-ctx.Done():
+			return ctx.Err()
+		case progress, ok := <-progCh:
+			if !ok {
+				return nil
+			}
```
Was this update needed for the bug fix as well?
My only concern here is that previously this goroutine exited by virtue of the writer calling close when it was done. This has a couple of nice properties: (1) the writer knows that the reader will always be there, and (2) we have a single exit condition in this goroutine that only depends on the writer correctly exiting.
Now, if the context gets cancelled, nothing is reading from progCh. We do a blocking write to progCh in the metadata callback writer in the distsql receiver goroutine. I have no proof that this is a problem; it just smells like a potential area for a problem, so it might be good to double check.
Personally, when it is possible to arrange for a single exit condition, I think we should lean in that direction.
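To make the concern concrete, here is a minimal standalone sketch (not the actual restore code; the names and values are illustrative) of the writer-closes pattern the comment favors. The reader's `range` loop has a single exit condition that depends only on the writer closing the channel, so the writer can always complete its sends:

```go
package main

import "fmt"

// collectProgress drains progCh until the writer closes it -- a single
// exit condition that depends only on the writer correctly finishing.
func collectProgress(progCh <-chan int) []int {
	var got []int
	for p := range progCh {
		got = append(got, p)
	}
	return got
}

func main() {
	progCh := make(chan int)

	// Writer: owns the channel and closes it when done, so the reader
	// is guaranteed to be there for every send and then exit cleanly.
	go func() {
		defer close(progCh)
		for i := 0; i < 3; i++ {
			progCh <- i
		}
	}()

	fmt.Println(collectProgress(progCh)) // [0 1 2]
}
```

If the reader instead selected on `ctx.Done()` and returned early, the writer's unbuffered send could block forever unless the writer also watched the context, which is exactly the hazard described above.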
This change wasn't required for the fix, I think I incorrectly convinced myself that this was another place we had overlooked a ctx cancellation check. I'm convinced by your comment so I'll revert this diff, thanks!
Force-pushed from 733b00f to 1d882cd
```diff
 	if idx >= mu.ceiling {
 		for i := mu.ceiling; i <= idx; i++ {
-			importSpan := <-importSpanCh
+			importSpan, ok := <-importSpanCh
```
Note to others who were momentarily confused by importSpanCh being read both here and in spanCountTask: this variable is defined, captured by the func, and then redefined, so it is actually two different channels.
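The shadowing described in that note can be reproduced in a small sketch (hypothetical simplification, not the actual restore code): a channel variable is declared, captured by a closure, and then a new variable with the same name is declared with `:=` in an inner scope, so the closure and the later code talk to two different channels:

```go
package main

import "fmt"

// shadowDemo shows how a closure can capture one channel while later code
// uses a second channel of the same name. Returns what each channel received.
func shadowDemo() (outer, inner string) {
	importSpanCh := make(chan string, 1)            // channel #1
	send := func(s string) { importSpanCh <- s }    // closure captures channel #1

	{
		importSpanCh := make(chan string, 1) // shadows: a brand-new channel #2
		send("via closure")                  // still lands on channel #1
		importSpanCh <- "direct"             // lands on channel #2
		inner = <-importSpanCh               // receives from channel #2
	}
	outer = <-importSpanCh // receives from channel #1
	return outer, inner
}

func main() {
	o, i := shadowDemo()
	fmt.Println(o, i) // via closure direct
}
```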
In cockroachdb#95257 we saw a restore grind to a halt 2 hours into a 5 hour roachtest. The stacks indicated that we may have seen a context cancellation that was not being respected by the goroutine running `generateAndSendImportSpans`. This resulted in the `generative_split_and_scatter_processor` getting stuck writing to a channel nobody was reading from (https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/backupccl/restore_span_covering.go#L516) since the other goroutines in the processor had seen the ctx cancellation and exited.

A side effect of the generative processor not shutting down was that the downstream restore data processors would also hang on their call to `input.Next()`, as they would not receive a row or a meta from the generative processor signalling them to shut down.

This fix adds a ctx cancellation check to the goroutine described above, thereby allowing a graceful teardown of the flow. This fix also adds the JobID to the generative processor spec so that logs on remote nodes are correctly tagged with the JobID, making for easier debugging.

Informs: cockroachdb#95257

Release note (bug fix): fixes a bug where a restore flow could hang indefinitely in the face of a context cancellation, manifesting as a stuck restore job.
Force-pushed from 1d882cd to d7a6974
bors r=stevendanna

Build succeeded.