
streamingccl: add retry into stream ingestion job#85432

Merged
craig[bot] merged 1 commit into cockroachdb:master from gh-casper:running-status on Sep 1, 2022

Conversation

@gh-casper
Contributor

@gh-casper gh-casper commented Aug 1, 2022

This PR adds a job-level retry mechanism to the
stream ingestion job. All errors are retryable by
default unless marked as a permanent job error.

It also records a running status in the job progress
as the job reaches various stages.

Release note: None
Release justification: Cat 4

Closes: #83450
Closes: #82509

@gh-casper gh-casper requested a review from a team as a code owner August 1, 2022 21:59
@cockroach-teamcity
Member

This change is Reviewable

}
ingestWithClient := func() error {
streamID := streaming.StreamID(details.StreamID)
updateRunningStatus(ctx, ingestionJob, fmt.Sprintf("connecting to the producer job %d", streamID))
stevendanna (Collaborator)

We might want to have just one running status that covers both the connecting and planning. Updating the job status isn't free, so let's keep it to the highest value ones.

log.Infof(ctx,
"starting to revert to the specified cutover timestamp for stream ingestion job %d",
ingestionJob.ID())
updateRunningStatus(ctx, ingestionJob, "starting to cut over to the given timestamp")
stevendanna (Collaborator)

should we include the timestamp in this message?

}

log.Infof(ctx, "activating destination tenant %d", details.NewTenantID)
updateRunningStatus(ctx, ingestionJob, "activating destination tenant")
stevendanna (Collaborator)
I think we can elide this one. I think it is reasonable that "cutover" includes the tenant activation.

updateRunningStatus(ctx, ingestionJob, "running the SQL flow for the stream ingestion job")
if err = distStreamIngest(ctx, execCtx, sqlInstanceIDs, ingestionJob.ID(), planCtx, dsp,
streamIngestionSpecs, streamIngestionFrontierSpec); err != nil {
fmt.Println("ctx after running ingestion flow: ", ctx)
stevendanna (Collaborator)
stray println

Comment on lines +297 to +314
ro := retry.Options{
	InitialBackoff: 3 * time.Second,
	Multiplier:     2,
	MaxBackoff:     10 * time.Second,
	MaxRetries:     5,
}
stevendanna (Collaborator)
One issue here is that a failure that happened 2 days ago feels like it should be irrelevant to a failure that happens today, but because nothing resets our retry counter, that failure matters forever in terms of both the backoff and the max count.

I wonder, is there a reason to ever stop retrying? What's the advantage of going into a paused state vs retrying? I suppose it means we stop forcing the source cluster to do work.

Comment on lines +308 to +319
if retryCount != 0 {
status = fmt.Sprintf("retrying stream ingestion in the %d round with previous error: %s", retryCount, err)
}
updateRunningStatus(ctx, ingestionJob, status)
err = ingest(ctx, execCtx, ingestionJob)
stevendanna (Collaborator)
These statuses are going to be pretty rapidly overwritten by the statuses we set in ingest(). Perhaps we should just set the status after we get the error, so that the user knows that we've gotten an error and are waiting to retry.

Contributor Author

@gh-casper gh-casper left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy, @samiskin, and @stevendanna)


pkg/ccl/streamingccl/streamingest/stream_ingestion_job.go line 207 at r1 (raw file):

Previously, stevendanna (Steven Danna) wrote…

We might want to have just one running status that covers both the connecting and planning. Updating the job status isn't free, so let's keep it to the highest value ones.

Done.


pkg/ccl/streamingccl/streamingest/stream_ingestion_job.go line 260 at r1 (raw file):

Previously, stevendanna (Steven Danna) wrote…

stray println

Done.

Code quote:

fmt.Println

pkg/ccl/streamingccl/streamingest/stream_ingestion_job.go line 271 at r1 (raw file):

Previously, stevendanna (Steven Danna) wrote…

should we include the timestamp in this message?

Done. Added in maybeRevertToCutoverTimestamp.


pkg/ccl/streamingccl/streamingest/stream_ingestion_job.go line 277 at r1 (raw file):

Previously, stevendanna (Steven Danna) wrote…

I think we can elide this one. I think it is reasonable that "cutover" includes the tenant activation.

Done.

Code quote:

log.Infof(ctx, "activating destination tenant %d", details.NewTenantID)

pkg/ccl/streamingccl/streamingest/stream_ingestion_job.go line 302 at r1 (raw file):

Previously, stevendanna (Steven Danna) wrote…

One issue here is that a failure that happened 2 days ago feels like it should be irrelevant to a failure that happens today, but because nothing resets our retry counter, that failure matters forever in terms of both the backoff and the max count.

I wonder, is there a reason to ever stop retrying? What's the advantage of going into a paused state vs retrying? I suppose it means we stop forcing the source cluster to do work.

If the destination keeps failing, I think it's reasonable to just let it pause; it makes it easier for an operator to fix something in a paused state. I can make the retries last longer.

I think it would also make sense later to let the producer side tell the ingestion side when the producer job will expire, and surface that in the running status or after pausing.


pkg/ccl/streamingccl/streamingest/stream_ingestion_job.go line 312 at r1 (raw file):

Previously, stevendanna (Steven Danna) wrote…

These statuses are going to be pretty rapidly overwritten by the status' we set in ingest(). Perhaps we should just set the status after we get the err so that the user knows that we've gotten an error and are waiting to retry.

Done.


ctx := context.Background()
ingestErrCh := make(chan error, 1)
ingestionTimes := 0
samiskin (Contributor)

ingestionStarts could be more clear here

Comment on lines +323 to +324
updateRunningStatus(ctx, ingestionJob,
fmt.Sprintf("stream ingestion waits for retrying after error %s", err))
samiskin (Contributor)

Assuming updateRunningStatus doesn't log anything, it'd be nice to log every time we're about to retry.

Comment on lines +306 to +314
ro := retry.Options{
	InitialBackoff: 3 * time.Second,
	Multiplier:     2,
	MaxBackoff:     1 * time.Minute,
	MaxRetries:     60,
}
Contributor

@samiskin samiskin commented Aug 24, 2022

Steven's earlier comment seems like it'd still apply here, where this is expected to be an incredibly long running job and persisting consequences of previous eventually-successful retries is probably not great.

If you'd rather address that in a different issue/pr I'd at least put a comment here highlighting the concern.

Comment on lines +781 to +813
_, alternateSrcTenantConn := serverutils.StartTenant(t, c.srcCluster.Server(1), base.TestTenantArgs{TenantID: c.args.srcTenantID, DisableCreateTenant: true, SkipTenantCheck: true})
_, alternateSrcTenantConn := serverutils.StartTenant(t, c.srcCluster.Server(1),
base.TestTenantArgs{TenantID: c.args.srcTenantID, DisableCreateTenant: true, SkipTenantCheck: true})
samiskin (Contributor)

This part we can just leave SkipTenantCheck being unset, since the source tenant should exist and should be active and we don't care why this fails if that isn't the case. The purpose of the other place with SkipTenantCheck was to try to connect to an inactive destination tenant and verify that even without a testserver's TenantCheck it still fails.

This PR adds a job-level retry mechanism to the
stream ingestion job. All errors are retryable by
default unless marked as a permanent job error.

It also records a running status in the job progress
as the job reaches various stages.

Release note: None

Release justification: low risk, high benefit
changes to existing functionality
Contributor Author

@gh-casper gh-casper left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy, @samiskin, and @stevendanna)


pkg/ccl/streamingccl/streamingest/stream_ingestion_job.go line 311 at r2 (raw file):

Previously, samiskin (Shiranka Miskin) wrote…

Steven's earlier comment seems like it'd still apply here, where this is expected to be an incredibly long running job and persisting consequences of previous eventually-successful retries is probably not great.

If you'd rather address that in a different issue/pr I'd at least put a comment here highlighting the concern.

I think what he meant in the previous comment is that if we update the status at the beginning of the retry, it gets overridden very quickly by the following update. But if we update the status after a retry fails, the status can be shown for quite a while before the next retry (the max backoff is 1 minute).


pkg/ccl/streamingccl/streamingest/stream_ingestion_job.go line 324 at r2 (raw file):

Previously, samiskin (Shiranka Miskin) wrote…

Assuming updateRunningStatus doesn't log anything, it'd be nice to log every time we're about to retry.

Done.


pkg/ccl/streamingccl/streamingest/stream_replication_e2e_test.go line 531 at r2 (raw file):

Previously, samiskin (Shiranka Miskin) wrote…

ingestionStarts could be more clear here

Done.


pkg/ccl/streamingccl/streamingest/stream_replication_e2e_test.go line 813 at r2 (raw file):

Previously, samiskin (Shiranka Miskin) wrote…

This part we can just leave SkipTenantCheck being unset, since the source tenant should exist and should be active and we don't care why this fails if that isn't the case. The purpose of the other place with SkipTenantCheck was to try to connect to an inactive destination tenant and verify that even without a testserver's TenantCheck it still fails.

Done.

Code quote:

alternateSrcSysSQL

@gh-casper
Contributor Author

bors r+

@craig
Contributor

craig bot commented Sep 1, 2022

Build succeeded:

@craig craig bot merged commit b316a5e into cockroachdb:master Sep 1, 2022