backupccl: add missing context cancel checks to restore#96302

Merged
craig[bot] merged 1 commit into cockroachdb:master from adityamaru:restore-ctx-bug
Feb 2, 2023

Conversation

@adityamaru
Contributor

In #95257 we saw a restore grind to a halt two hours into a five-hour roachtest. The stacks indicated that a context cancellation was not being respected by the goroutine running `generateAndSendImportSpans`. As a result, the `generative_split_and_scatter_processor` got stuck writing to a channel nobody was reading from (https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/backupccl/restore_span_covering.go#L516), since the other goroutines in the processor had seen the ctx cancellation and exited. A side effect of the generative processor not shutting down was that the downstream restore data processors would also hang on their call to `input.Next()`, as they would never receive a row or a meta from the generative processor signaling them to shut down. This fix adds a ctx cancellation check to the goroutine described above, thereby allowing a graceful teardown of the flow.

Informs: #95257

Release note (bug fix): fixes a bug where a restore flow could hang indefinitely in the face of a context cancellation, manifesting as a stuck restore job.

@adityamaru adityamaru requested review from dt and rhu713 January 31, 2023 20:34
@adityamaru adityamaru requested review from a team as code owners January 31, 2023 20:34
@blathers-crl

blathers-crl bot commented Jan 31, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Member

This change is Reviewable

@adityamaru
Contributor Author

The failure in CI is the data race being fixed in #95808.

Comment on lines +385 to +406

```diff
-	for progress := range progCh {
-		mu.Lock()
-		var progDetails backuppb.RestoreProgress
-		if err := pbtypes.UnmarshalAny(&progress.ProgressDetails, &progDetails); err != nil {
-			log.Errorf(ctx, "unable to unmarshal restore progress details: %+v", err)
-		}
+	for {
+		select {
+		case <-ctx.Done():
+			return ctx.Err()
+		case progress, ok := <-progCh:
+			if !ok {
+				return nil
+			}
```
Collaborator

@stevendanna stevendanna Feb 1, 2023


Was this update needed for the bug fix as well?

My only concern here is that previously this goroutine exited by virtue of the writer calling close when it was done. This has a couple of nice properties: (1) the writer knows that the reader will always be there, and (2) we have a single exit condition in this goroutine that only depends on the writer correctly exiting.

Now, if the context gets cancelled nothing is reading from progCh. We do a blocking write to progCh in the metadata callback writer in the distsql receiver goroutine. I have no proof that this is a problem, it just smells like a potential area for a problem so might be good to double check.

Personally, when it is possible to arrange for a single exit condition, I think we should lean in that direction.
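The single-exit-condition pattern the reviewer prefers can be sketched as follows. This is an illustrative toy, not the restore code (`runPipeline` and the channel names are made up): the reader loops until the writer closes the channel, so the reader has exactly one way to stop, and the writer can rely on the reader always being there.

```go
package main

import "fmt"

// runPipeline demonstrates the writer-closes pattern: the reader's only exit
// is the writer's close, so there is a single, well-defined shutdown path.
func runPipeline() []int {
	progCh := make(chan int)

	go func() {
		defer close(progCh) // the writer's close is the reader's only exit signal
		for i := 0; i < 3; i++ {
			progCh <- i
		}
	}()

	var got []int
	for p := range progCh { // exits exactly when the writer closes progCh
		got = append(got, p)
	}
	return got
}

func main() {
	fmt.Println(runPipeline()) // prints "[0 1 2]"
}
```

By contrast, once the reader can also bail out on `ctx.Done()`, the writer's unconditional blocking send to `progCh` becomes a potential hang, which is exactly the concern raised in the comment above.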

Contributor Author


This change wasn't required for the fix, I think I incorrectly convinced myself that this was another place we had overlooked a ctx cancellation check. I'm convinced by your comment so I'll revert this diff, thanks!

@adityamaru adityamaru force-pushed the restore-ctx-bug branch 2 times, most recently from 733b00f to 1d882cd on February 1, 2023 14:48
```diff
 	if idx >= mu.ceiling {
 		for i := mu.ceiling; i <= idx; i++ {
-			importSpan := <-importSpanCh
+			importSpan, ok := <-importSpanCh
```
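The one-character change in the diff above switches to Go's comma-ok receive form. A minimal sketch of why that matters (a toy, not the restore code): a plain receive from a closed channel silently yields the zero value, while the comma-ok form lets the reader detect that the channel has been closed and drained.

```go
package main

import "fmt"

// recvOK wraps a comma-ok receive: ok is true if the value came from a send,
// and false if it is the zero value from a closed, drained channel.
func recvOK(ch <-chan int) (int, bool) {
	v, ok := <-ch
	return v, ok
}

func main() {
	ch := make(chan int, 1)
	ch <- 7
	close(ch)

	fmt.Println(recvOK(ch)) // prints "7 true": a real value sent before close
	fmt.Println(recvOK(ch)) // prints "0 false": channel closed and drained
}
```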
Collaborator


Note to others who were momentarily confused by importSpanCh being read both here and in spanCountTask: this variable is defined, captured by the func, and then redefined, so it is actually two different channels.
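The situation described in the note above can be reproduced in a few lines. This is a toy illustration, not the restore code (`shadowDemo` is made up): a channel variable is captured by a closure, then a later `:=` in an inner scope declares a new variable with the same name, so the closure and the later code refer to two different channels.

```go
package main

import "fmt"

// shadowDemo shows that a `:=` redeclaration in an inner scope creates a
// distinct variable, leaving the closure bound to the original channel.
func shadowDemo() (fromClosure, fromInner int) {
	importSpanCh := make(chan int, 1)
	read := func() int { return <-importSpanCh } // captures the outer channel
	importSpanCh <- 1

	{
		importSpanCh := make(chan int, 1) // ':=' declares a NEW, distinct channel
		importSpanCh <- 2
		fromInner = <-importSpanCh // reads the inner channel
	}

	fromClosure = read() // the closure still reads the outer channel
	return fromClosure, fromInner
}

func main() {
	fmt.Println(shadowDemo()) // prints "1 2"
}
```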

In cockroachdb#95257 we saw a restore grind to a halt two hours into a five-hour roachtest. The stacks indicated that a context cancellation was not being respected by the goroutine running `generateAndSendImportSpans`. As a result, the `generative_split_and_scatter_processor` got stuck writing to a channel nobody was reading from (https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/backupccl/restore_span_covering.go#L516), since the other goroutines in the processor had seen the ctx cancellation and exited. A side effect of the generative processor not shutting down was that the downstream restore data processors would also hang on their call to `input.Next()`, as they would never receive a row or a meta from the generative processor signaling them to shut down. This fix adds a ctx cancellation check to the goroutine described above, thereby allowing a graceful teardown of the flow.

This fix also adds the JobID to the generative processor spec so that logs on remote nodes are correctly tagged with the JobID, making for easier debugging.

Informs: cockroachdb#95257

Release note (bug fix): fixes a bug where a restore flow could
hang indefinitely in the face of a context cancellation, manifesting
as a stuck restore job.
@adityamaru
Contributor Author

bors r=stevendanna

@craig
Contributor

craig bot commented Feb 2, 2023

Build succeeded:
