Skip to content

sql/backfill: Implement retry mechanism during index backfill#135563

Merged
craig[bot] merged 1 commit into
cockroachdb:masterfrom
spilchen:gh-130939/241118/0820/final
Nov 20, 2024
Merged

sql/backfill: Implement retry mechanism during index backfill#135563
craig[bot] merged 1 commit into
cockroachdb:masterfrom
spilchen:gh-130939/241118/0820/final

Conversation

@spilchen

@spilchen spilchen commented Nov 18, 2024

Copy link
Copy Markdown
Contributor

During an index backfill, if the bulko.index_backfill.batch_size setting is not appropriately configured for a table's size or definition, it may consume all available memory before writing the new index entries. This change introduces a retry mechanism to handle out-of-memory scenarios.

Upon encountering memory issues, the batch size is halved on each retry, coupled with an exponential backoff. This backoff period allows the consumer of the index entries to free up memory. The retry mechanism reuses the same bound memory account, making it critical to accurately track memory usage, even during failed attempts. This ensures proper accounting and frees memory consumed during the failed operation.

Epic: CRDB-37796
Closes #130939
Closes #132048
Release note (bug fix): The schema changer's backfill process now includes a retry mechanism that reduces the batch size when memory issues occur. This improves the likelihood of operation success without requiring manual adjustment of the bulko.index_backfill.batch_size parameter.

@spilchen spilchen requested a review from a team November 18, 2024 12:56
@spilchen spilchen self-assigned this Nov 18, 2024
@spilchen spilchen requested review from a team as code owners November 18, 2024 12:56
@spilchen spilchen requested review from DrewKimball, nameisbhaskar and srosenberg and removed request for a team November 18, 2024 12:56
@cockroach-teamcity

Copy link
Copy Markdown
Member

This change is Reviewable

@fqazi fqazi left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 6 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball, @nameisbhaskar, @spilchen, and @srosenberg)


pkg/sql/rowexec/indexbackfiller.go line 442 at r1 (raw file):

			// attempt. We will allocate new memory on the next iteration.
			if memUsedBuildingBatch > 0 {
				entries = nil

Should we assert entries are nil?


pkg/sql/rowexec/indexbackfiller.go line 472 at r1 (raw file):

}

type batchRetry struct {

Nit: Would it be better to call his indexBatchRetry since its in the rowexec namespace.

@spilchen spilchen force-pushed the gh-130939/241118/0820/final branch from c8e5197 to 67aff66 Compare November 18, 2024 16:43

@spilchen spilchen left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball, @fqazi, @nameisbhaskar, and @srosenberg)


pkg/sql/rowexec/indexbackfiller.go line 442 at r1 (raw file):

Previously, fqazi (Faizan Qazi) wrote…

Should we assert entries are nil?

Technically, in case of errors, the entries slice might still be non-nil. Since this handles the cleanup of any allocated memory in the bound account, I thought it made sense to clear out the entries slice here as well. If you'd prefer, I can update BuildIndexEntriesChunk to always return a nil entries slice when an error occurs.


pkg/sql/rowexec/indexbackfiller.go line 472 at r1 (raw file):

Previously, fqazi (Faizan Qazi) wrote…

Nit: Would it be better to call his indexBatchRetry since its in the rowexec namespace.

Done

@rafiss rafiss left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball, @fqazi, @nameisbhaskar, @spilchen, and @srosenberg)


pkg/sql/backfill/backfill.go line 808 at r2 (raw file):

	chunkSize int64,
	traceKV bool,
) (entries []rowenc.IndexEntry, resumeKey roachpb.Key, memUsedPerChunk int64, err error) {

nit: for error return parameters, I find it more clear to name the parameter retErr, so it's distinguished from any of the error variables that are declared in the function.


pkg/sql/backfill/backfill.go line 819 at r2 (raw file):

		err = errors.Wrap(err,
			"failed to initialize empty buffer to store the index entries of all rows in the chunk")
		return

nit: i think we should avoid naked returns, especially for larger functions like this. it makes it easy for us to start returning the wrong thing if certain lines or blocks of code are moved.

i was surprised our own coding guidelines don't have this, but it is in the Tour of Go, which we recommend: https://go.dev/tour/basics/7

@fqazi fqazi left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball, @nameisbhaskar, @spilchen, and @srosenberg)


pkg/sql/rowexec/indexbackfiller.go line 442 at r1 (raw file):

Previously, spilchen wrote…

Technically, in case of errors, the entries slice might still be non-nil. Since this handles the cleanup of any allocated memory in the bound account, I thought it made sense to clear out the entries slice here as well. If you'd prefer, I can update BuildIndexEntriesChunk to always return a nil entries slice when an error occurs.

That makes sense, I'm guessing the bound account is incremented last

@spilchen spilchen force-pushed the gh-130939/241118/0820/final branch from 67aff66 to e59827d Compare November 18, 2024 21:25

@spilchen spilchen left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball, @fqazi, @nameisbhaskar, @rafiss, and @srosenberg)


pkg/sql/backfill/backfill.go line 808 at r2 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

nit: for error return parameters, I find it more clear to name the parameter retErr, so it's distinguished from any of the error variables that are declared in the function.

I backed this part out when I removed the naked returns. So, it's back to what it was before.


pkg/sql/backfill/backfill.go line 819 at r2 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

nit: i think we should avoid naked returns, especially for larger functions like this. it makes it easy for us to start returning the wrong thing if certain lines or blocks of code are moved.

i was surprised our own coding guidelines don't have this, but it is in the Tour of Go, which we recommend: https://go.dev/tour/basics/7

Great advice! I removed the naked returns and reinstated much of the existing error handling. For every instance where BuildIndexEntriesChunk returns an error, I ensured that the memory usage in the bound account is also returned. This allows us to properly track and free the allocated memory during retries.


pkg/sql/rowexec/indexbackfiller.go line 442 at r1 (raw file):

Previously, fqazi (Faizan Qazi) wrote…

That makes sense, I'm guessing the bound account is incremented last

The counter is incremented as index entries are being built. However, if we encounter an out-of-memory error while building some of the entries, we need to ensure the bound account is freed for the memory allocated during the failed attempt.

I ended up reworking the code, and it made sense to include the assert as part of the changes. This should address your original concern.

@fqazi fqazi requested a review from rafiss November 18, 2024 22:23

@fqazi fqazi left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 2 files at r2, 3 of 3 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball, @nameisbhaskar, @rafiss, and @srosenberg)

@rafiss rafiss left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the retry logic here is very cleanly implemented; nice job! my comments are mostly for minor adjustments.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball, @fqazi, @nameisbhaskar, @spilchen, and @srosenberg)


pkg/sql/rowexec/indexbackfiller.go line 476 at r3 (raw file):

type indexBatchRetry struct {
	nextChunkSize       *int64

nit: why did this need to be a pointer? it seemed to me that we could keep it a value


pkg/sql/rowexec/indexbackfiller.go line 477 at r3 (raw file):

type indexBatchRetry struct {
	nextChunkSize       *int64
	builder             func(ctx context.Context, txn isql.Txn) error

super nit: a name like buildIndexChunk might make the purpose of this function more clear.


pkg/sql/rowexec/indexbackfiller.go line 493 at r3 (raw file):

				// Callback to clear out any state acquired in the last attempt
				if resetErr := b.resetForNextAttempt(ctx); resetErr != nil {
					return errors.CombineErrors(resetErr, err)

nit: maybe this should be CombineErrors(err, resetErr), since the reset error is secondary to that other error.


pkg/sql/rowexec/indexbackfiller.go line 498 at r3 (raw file):

				if !r.Next() {
					// If we have exhausted all retries, fail with the out of memory error.
					return err

nit: should we return a wrapped error, like perhaps:

errors.Wrapf(err, "failed after %d retries", ...)

pkg/sql/rowexec/indexbackfiller.go line 503 at r3 (raw file):

				log.Infof(ctx,
					"out of memory while building index entries; retrying with batch size %d. Silencing error: %s",
					redact.SafeInt(*b.nextChunkSize), redact.SafeString(err.Error()))

The nextChunkSize is an int, which is automatically considered safe for non-redaction, so SafeInt is not needed.

For the error, we can't assume that err.Error() is non-sensitive. it's possible that it is a wrapped error, whose error text could contain application data. For that reason, each error object is responsible for implementing errors.SafeFormatter in order to teach the formatter how to redact and unredact as it should. To format it here, we'd use log.Infof("...%v", ..., err).

The page here has a overview of these concepts in a lot greater detail, and is a good reference. Happy to discuss this more synchronously if you think it would help; or perhaps we could review this in a team tech discussion? https://cockroachlabs.atlassian.net/wiki/spaces/CRDB/pages/1824817806/Log+and+error+redactability

During an index backfill, if the bulko.index_backfill.batch_size setting
is not appropriately configured for a table's size or definition, it may
consume all available memory before writing the new index entries. This
change introduces a retry mechanism to handle out-of-memory scenarios.

Upon encountering memory issues, the batch size is halved on each retry,
coupled with an exponential backoff. This backoff period allows the
consumer of the index entries to free up memory. The retry mechanism
reuses the same bound memory account, making it critical to accurately
track memory usage, even during failed attempts. This ensures proper
accounting and frees memory consumed during the failed operation.

Epic: CRDB-37796
Closes cockroachdb#130939, cockroachdb#132048
Release note (bug fix): The schema changer's backfill process now includes
a retry mechanism that reduces the batch size when memory issues occur.
This improves the likelihood of operation success without requiring
manual adjustment of the bulko.index_backfill.batch_size parameter.
@spilchen spilchen force-pushed the gh-130939/241118/0820/final branch from e59827d to 2bf463b Compare November 19, 2024 12:54
@rafiss rafiss requested a review from fqazi November 19, 2024 20:44

@rafiss rafiss left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! nice improvement for sure

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @fqazi, @nameisbhaskar, @spilchen, and @srosenberg)

@spilchen

Copy link
Copy Markdown
Contributor Author

TFTR!

bors r+

@craig

craig Bot commented Nov 20, 2024

Copy link
Copy Markdown
Contributor

Build failed:

@spilchen

Copy link
Copy Markdown
Contributor Author

bors retry

@craig

craig Bot commented Nov 20, 2024

Copy link
Copy Markdown
Contributor

@craig craig Bot merged commit 354444d into cockroachdb:master Nov 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

roachtest: perturbation/full/backfill failed sql: index backfill fails to complete with default settings

4 participants