sql/backfill: Implement retry mechanism during index backfill#135563
Conversation
fqazi
left a comment
There was a problem hiding this comment.
Reviewed 6 of 6 files at r1, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball, @nameisbhaskar, @spilchen, and @srosenberg)
pkg/sql/rowexec/indexbackfiller.go line 442 at r1 (raw file):
// attempt. We will allocate new memory on the next iteration. if memUsedBuildingBatch > 0 { entries = nil
Should we assert entries are nil?
pkg/sql/rowexec/indexbackfiller.go line 472 at r1 (raw file):
} type batchRetry struct {
Nit: Would it be better to call his indexBatchRetry since its in the rowexec namespace.
c8e5197 to
67aff66
Compare
spilchen
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball, @fqazi, @nameisbhaskar, and @srosenberg)
pkg/sql/rowexec/indexbackfiller.go line 442 at r1 (raw file):
Previously, fqazi (Faizan Qazi) wrote…
Should we assert entries are nil?
Technically, in case of errors, the entries slice might still be non-nil. Since this handles the cleanup of any allocated memory in the bound account, I thought it made sense to clear out the entries slice here as well. If you'd prefer, I can update BuildIndexEntriesChunk to always return a nil entries slice when an error occurs.
pkg/sql/rowexec/indexbackfiller.go line 472 at r1 (raw file):
Previously, fqazi (Faizan Qazi) wrote…
Nit: Would it be better to call his indexBatchRetry since its in the rowexec namespace.
Done
rafiss
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball, @fqazi, @nameisbhaskar, @spilchen, and @srosenberg)
pkg/sql/backfill/backfill.go line 808 at r2 (raw file):
chunkSize int64, traceKV bool, ) (entries []rowenc.IndexEntry, resumeKey roachpb.Key, memUsedPerChunk int64, err error) {
nit: for error return parameters, I find it more clear to name the parameter retErr, so it's distinguished from any of the error variables that are declared in the function.
pkg/sql/backfill/backfill.go line 819 at r2 (raw file):
err = errors.Wrap(err, "failed to initialize empty buffer to store the index entries of all rows in the chunk") return
nit: i think we should avoid naked returns, especially for larger functions like this. it makes it easy for us to start returning the wrong thing if certain lines or blocks of code are moved.
i was surprised our own coding guidelines don't have this, but it is in the Tour of Go, which we recommend: https://go.dev/tour/basics/7
fqazi
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball, @nameisbhaskar, @spilchen, and @srosenberg)
pkg/sql/rowexec/indexbackfiller.go line 442 at r1 (raw file):
Previously, spilchen wrote…
Technically, in case of errors, the
entriesslice might still be non-nil. Since this handles the cleanup of any allocated memory in the bound account, I thought it made sense to clear out theentriesslice here as well. If you'd prefer, I can updateBuildIndexEntriesChunkto always return a nilentriesslice when an error occurs.
That makes sense, I'm guessing the bound account is incremented last
67aff66 to
e59827d
Compare
spilchen
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball, @fqazi, @nameisbhaskar, @rafiss, and @srosenberg)
pkg/sql/backfill/backfill.go line 808 at r2 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
nit: for error return parameters, I find it more clear to name the parameter
retErr, so it's distinguished from any of the error variables that are declared in the function.
I backed this part out when I removed the naked returns. So, it's back to what it was before.
pkg/sql/backfill/backfill.go line 819 at r2 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
nit: i think we should avoid naked returns, especially for larger functions like this. it makes it easy for us to start returning the wrong thing if certain lines or blocks of code are moved.
i was surprised our own coding guidelines don't have this, but it is in the Tour of Go, which we recommend: https://go.dev/tour/basics/7
Great advice! I removed the naked returns and reinstated much of the existing error handling. For every instance where BuildIndexEntriesChunk returns an error, I ensured that the memory usage in the bound account is also returned. This allows us to properly track and free the allocated memory during retries.
pkg/sql/rowexec/indexbackfiller.go line 442 at r1 (raw file):
Previously, fqazi (Faizan Qazi) wrote…
That makes sense, I'm guessing the bound account is incremented last
The counter is incremented as index entries are being built. However, if we encounter an out-of-memory error while building some of the entries, we need to ensure the bound account is freed for the memory allocated during the failed attempt.
I ended up reworking the code, and it made sense to include the assert as part of the changes. This should address your original concern.
fqazi
left a comment
There was a problem hiding this comment.
Reviewed 2 of 2 files at r2, 3 of 3 files at r3, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball, @nameisbhaskar, @rafiss, and @srosenberg)
rafiss
left a comment
There was a problem hiding this comment.
the retry logic here is very cleanly implemented; nice job! my comments are mostly for minor adjustments.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball, @fqazi, @nameisbhaskar, @spilchen, and @srosenberg)
pkg/sql/rowexec/indexbackfiller.go line 476 at r3 (raw file):
type indexBatchRetry struct { nextChunkSize *int64
nit: why did this need to be a pointer? it seemed to me that we could keep it a value
pkg/sql/rowexec/indexbackfiller.go line 477 at r3 (raw file):
type indexBatchRetry struct { nextChunkSize *int64 builder func(ctx context.Context, txn isql.Txn) error
super nit: a name like buildIndexChunk might make the purpose of this function more clear.
pkg/sql/rowexec/indexbackfiller.go line 493 at r3 (raw file):
// Callback to clear out any state acquired in the last attempt if resetErr := b.resetForNextAttempt(ctx); resetErr != nil { return errors.CombineErrors(resetErr, err)
nit: maybe this should be CombineErrors(err, resetErr), since the reset error is secondary to that other error.
pkg/sql/rowexec/indexbackfiller.go line 498 at r3 (raw file):
if !r.Next() { // If we have exhausted all retries, fail with the out of memory error. return err
nit: should we return a wrapped error, like perhaps:
errors.Wrapf(err, "failed after %d retries", ...)
pkg/sql/rowexec/indexbackfiller.go line 503 at r3 (raw file):
log.Infof(ctx, "out of memory while building index entries; retrying with batch size %d. Silencing error: %s", redact.SafeInt(*b.nextChunkSize), redact.SafeString(err.Error()))
The nextChunkSize is an int, which is automatically considered safe for non-redaction, so SafeInt is not needed.
For the error, we can't assume that err.Error() is non-sensitive. it's possible that it is a wrapped error, whose error text could contain application data. For that reason, each error object is responsible for implementing errors.SafeFormatter in order to teach the formatter how to redact and unredact as it should. To format it here, we'd use log.Infof("...%v", ..., err).
The page here has a overview of these concepts in a lot greater detail, and is a good reference. Happy to discuss this more synchronously if you think it would help; or perhaps we could review this in a team tech discussion? https://cockroachlabs.atlassian.net/wiki/spaces/CRDB/pages/1824817806/Log+and+error+redactability
During an index backfill, if the bulko.index_backfill.batch_size setting is not appropriately configured for a table's size or definition, it may consume all available memory before writing the new index entries. This change introduces a retry mechanism to handle out-of-memory scenarios. Upon encountering memory issues, the batch size is halved on each retry, coupled with an exponential backoff. This backoff period allows the consumer of the index entries to free up memory. The retry mechanism reuses the same bound memory account, making it critical to accurately track memory usage, even during failed attempts. This ensures proper accounting and frees memory consumed during the failed operation. Epic: CRDB-37796 Closes cockroachdb#130939, cockroachdb#132048 Release note (bug fix): The schema changer's backfill process now includes a retry mechanism that reduces the batch size when memory issues occur. This improves the likelihood of operation success without requiring manual adjustment of the bulko.index_backfill.batch_size parameter.
e59827d to
2bf463b
Compare
rafiss
left a comment
There was a problem hiding this comment.
lgtm! nice improvement for sure
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @fqazi, @nameisbhaskar, @spilchen, and @srosenberg)
|
TFTR! bors r+ |
|
Build failed: |
|
bors retry |
During an index backfill, if the bulko.index_backfill.batch_size setting is not appropriately configured for a table's size or definition, it may consume all available memory before writing the new index entries. This change introduces a retry mechanism to handle out-of-memory scenarios.
Upon encountering memory issues, the batch size is halved on each retry, coupled with an exponential backoff. This backoff period allows the consumer of the index entries to free up memory. The retry mechanism reuses the same bound memory account, making it critical to accurately track memory usage, even during failed attempts. This ensures proper accounting and frees memory consumed during the failed operation.
Epic: CRDB-37796
Closes #130939
Closes #132048
Release note (bug fix): The schema changer's backfill process now includes a retry mechanism that reduces the batch size when memory issues occur. This improves the likelihood of operation success without requiring manual adjustment of the bulko.index_backfill.batch_size parameter.