backupccl: add missing context cancel checks to restore #96302

craig[bot] merged 1 commit into cockroachdb:master
Conversation
🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf. It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?
The failure in CI is the data race being fixed in #95808. |
pkg/ccl/backupccl/restore_job.go
Outdated
```diff
-	for progress := range progCh {
-		mu.Lock()
-		var progDetails backuppb.RestoreProgress
-		if err := pbtypes.UnmarshalAny(&progress.ProgressDetails, &progDetails); err != nil {
-			log.Errorf(ctx, "unable to unmarshal restore progress details: %+v", err)
-		}
+	for {
+		select {
+		case <-ctx.Done():
+			return ctx.Err()
+		case progress, ok := <-progCh:
+			if !ok {
+				return nil
+			}
```
Was this update needed for the bug fix as well?
My only concern here is that previously this goroutine exited by virtue of the writer calling close when it was done. This has a couple of nice properties: (1) the writer knows that the reader will always be there, and (2) we have a single exit condition in this goroutine that only depends on the writer correctly exiting.
Now, if the context gets cancelled, nothing is reading from progCh. We do a blocking write to progCh in the metadata callback writer in the distsql receiver goroutine. I have no proof that this is a problem; it just smells like a potential area for a problem, so it might be good to double check.
Personally, when it is possible to arrange for a single exit condition, I think we should lean in that direction.
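To make the concern concrete, here is a minimal standalone sketch (not the actual restore code; the names and values are illustrative) of the writer-closes pattern the comment favors. The reader's `range` loop has a single exit condition that depends only on the writer closing the channel, so the writer can always complete its sends:

```go
package main

import "fmt"

// collectProgress drains progCh until the writer closes it -- a single
// exit condition that depends only on the writer correctly finishing.
func collectProgress(progCh <-chan int) []int {
	var got []int
	for p := range progCh {
		got = append(got, p)
	}
	return got
}

func main() {
	progCh := make(chan int)

	// Writer: owns the channel and closes it when done, so the reader
	// is guaranteed to be there for every send and then exit cleanly.
	go func() {
		defer close(progCh)
		for i := 0; i < 3; i++ {
			progCh <- i
		}
	}()

	fmt.Println(collectProgress(progCh)) // [0 1 2]
}
```

If the reader instead selected on `ctx.Done()` and returned early, the writer's unbuffered send could block forever unless the writer also watched the context, which is exactly the hazard described above.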
This change wasn't required for the fix, I think I incorrectly convinced myself that this was another place we had overlooked a ctx cancellation check. I'm convinced by your comment so I'll revert this diff, thanks!
Force-pushed from 733b00f to 1d882cd
```diff
 	if idx >= mu.ceiling {
 		for i := mu.ceiling; i <= idx; i++ {
-			importSpan := <-importSpanCh
+			importSpan, ok := <-importSpanCh
```
Note to others who were momentarily confused by importSpanCh being read both here and in spanCountTask: this variable is defined, captured by the func, and then redefined, so it is actually two different channels.
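The shadowing described in that note can be reproduced in a small sketch (hypothetical simplification, not the actual restore code): a channel variable is declared, captured by a closure, and then a new variable with the same name is declared with `:=` in an inner scope, so the closure and the later code talk to two different channels:

```go
package main

import "fmt"

// shadowDemo shows how a closure can capture one channel while later code
// uses a second channel of the same name. Returns what each channel received.
func shadowDemo() (outer, inner string) {
	importSpanCh := make(chan string, 1)            // channel #1
	send := func(s string) { importSpanCh <- s }    // closure captures channel #1

	{
		importSpanCh := make(chan string, 1) // shadows: a brand-new channel #2
		send("via closure")                  // still lands on channel #1
		importSpanCh <- "direct"             // lands on channel #2
		inner = <-importSpanCh               // receives from channel #2
	}
	outer = <-importSpanCh // receives from channel #1
	return outer, inner
}

func main() {
	o, i := shadowDemo()
	fmt.Println(o, i) // via closure direct
}
```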
In cockroachdb#95257 we saw a restore grind to a halt 2 hours into a 5 hour roachtest. The stacks indicated that we may have seen a context cancellation that was not being respected by the goroutine running `generateAndSendImportSpans`. This resulted in the `generative_split_and_scatter_processor` getting stuck writing to a channel nobody was reading from (https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/backupccl/restore_span_covering.go#L516) since the other goroutines in the processor had seen the ctx cancellation and exited.

A side effect of the generative processor not shutting down was that the downstream restore data processors would also hang on their call to `input.Next()`, as they would not receive a row or a meta from the generative processor signalling them to shut down.

This fix adds a ctx cancellation check to the goroutine described above, thereby allowing a graceful teardown of the flow. This fix also adds the JobID to the generative processor spec so that logs on remote nodes are correctly tagged with the JobID, making for easier debugging.

Informs: cockroachdb#95257

Release note (bug fix): fixes a bug where a restore flow could hang indefinitely in the face of a context cancellation, manifesting as a stuck restore job.
Force-pushed from 1d882cd to d7a6974
bors r=stevendanna

Build succeeded.