Skip to content

kv: don't commit-wait on each columnBackfiller chunk#63412

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
nvb:nvanbenschoten/backfillerCommitWait
Apr 15, 2021
Merged

kv: don't commit-wait on each columnBackfiller chunk#63412
craig[bot] merged 1 commit intocockroachdb:masterfrom
nvb:nvanbenschoten/backfillerCommitWait

Conversation

@nvb
Copy link
Copy Markdown
Contributor

@nvb nvb commented Apr 9, 2021

Fixes #61444.

Previously, each transaction issued by columnBackfiller.runChunk would perform a
commit-wait of about 800ms when writing to a GLOBAL table. This adds up,
as the ColumnBackfiller can issue many small transactions. The commit-wait step
is meant to ensure linearizability with future reads of the values written by
the column backfiller. However, there's no need to pay this cost on each chunk -
we only need reads after the backfiller is completely finished to be guaranteed
to observe its writes. So we only need to pay the commit-wait cost once.

This commit accomplishes this by exposing a new (*kv.Txn).DeferCommitWait API,
which allows well-informed callers to take responsibility for performing the
commit-wait for a committed transaction.

The commit then begins using this DeferCommitWait API in columnBackfiller.
The columnBackfiller defers the commit-wait stage for each chunk that it runs
and accumulates a collection of commit-wait handles to run before flushing. In
doing so, it can coalesce the wait across all chunks, which dramatically reduces
the total time we spend waiting for consistency when backfilling a column on
GLOBAL tables.

This speedup is reflected in TestAlterTableLocalityRegionalByRowError, which
drops from a runtime of 419s to 48s.

I'm mildly concerned about a potential memory blowup of holding references to
previously committed TxnCoordSender objects through these commit-wait handle
functions, so I've conservatively capped the number of functions that we'll
retain at any given time to 128. Is there a roachtest I can run to validate
that this isn't a problem?

@nvb nvb requested review from ajwerner, andreimatei and dt April 9, 2021 21:24
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Contributor

@dt dt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

// Each function retains a reference to its corresponding TxnCoordSender, so
// we need to be careful not to accumulate an unbounded number of these
// functions.
maxCommitWaitFns = 128
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not-even-nit: I have a knee-jerk reaction to consts vs cluster settings just for how many times I've had something OOMing or whatever and wished we had a setting to get us out of a bind, but this seems fine unless you think we might want to tune it up/down.

// commit-wait function to ensure that any dependent reads on the rows we just
// backfilled observe the new column.
func (cb *columnBackfiller) runCommitWait(ctx context.Context) {
for i, fn := range cb.commitWaitFns {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we veventf in here before the loop, if we anticipate we might hang in the loop for any length of time?

Copy link
Copy Markdown
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 7 of 8 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @dt, and @nvanbenschoten)


pkg/sql/rowexec/columnbackfiller.go, line 31 at r1 (raw file):

Previously, dt (David Taylor) wrote…

not-even-nit: I have a knee-jerk reaction to consts vs cluster settings just for how many times I've had something OOMing or whatever and wished we had a setting to get us out of a bind, but this seems fine unless you think we might want to tune it up/down.

+1 always add a cluster setting. They cost so little.

Fixes cockroachdb#61444.

Previously, each transaction issued by `columnBackfiller.runChunk` would perform a
commit-wait of about 800ms when writing to a GLOBAL table. This adds up,
as the `ColumnBackfiller` can issue many small transactions. The commit-wait step
is meant to ensure linearizability with future reads of the values written by
the column backfiller. However, there's no need to pay this cost on each chunk -
we only need reads after the backfiller is completely finished to be guaranteed
to observe its writes. So we only need to pay the commit-wait cost once.

This commit accomplishes this by exposing a new `(*kv.Txn).DeferCommitWait` API,
which allows well-informed callers to take responsibility for performing the
commit-wait for a committed transaction.

The commit then begins using this `DeferCommitWait` API in `columnBackfiller`.
The `columnBackfiller` defers the commit-wait stage for each chunk that it runs
and accumulates a collection of commit-wait handles to run before flushing. In
doing so, it can coalesce the wait across all chunks, which dramatically reduces
the total time we spend waiting for consistency when backfilling a column on
GLOBAL tables.

This speedup is reflected in `TestAlterTableLocalityRegionalByRowError`, which
drops from a runtime of 419s to 48s.

I'm mildly concerned about a potential memory blowup of holding references to
previously committed TxnCoordSender objects through these commit-wait handle
functions, so I've conservatively capped the number of functions that we'll
retain at any given time to 128. Is there a roachtest I can run to validate
that this isn't a problem?
@nvb nvb force-pushed the nvanbenschoten/backfillerCommitWait branch from 7b80f3b to 6eac5d9 Compare April 15, 2021 18:39
Copy link
Copy Markdown
Contributor Author

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @andreimatei, and @dt)


pkg/sql/rowexec/columnbackfiller.go, line 31 at r1 (raw file):

Previously, ajwerner wrote…

+1 always add a cluster setting. They cost so little.

Good idea. Done.


pkg/sql/rowexec/columnbackfiller.go, line 146 at r1 (raw file):

Previously, dt (David Taylor) wrote…

should we veventf in here before the loop, if we anticipate we might hang in the loop for any length of time?

There are already a few veventf calls in these functions if they do actually need to wait, so I figured we'd be better off with just deferring to that logging.

@nvb
Copy link
Copy Markdown
Contributor Author

nvb commented Apr 15, 2021

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Apr 15, 2021

Build succeeded:

@craig craig bot merged commit d4faf50 into cockroachdb:master Apr 15, 2021
@nvb nvb deleted the nvanbenschoten/backfillerCommitWait branch April 19, 2021 23:36
@mikeCRL
Copy link
Copy Markdown
Contributor

mikeCRL commented Apr 28, 2021

@nvanbenschoten Should this be included in the release notes? If so, can you please suggest a summary? Thanks!

@mikeCRL
Copy link
Copy Markdown
Contributor

mikeCRL commented Apr 28, 2021

@nvanbenschoten Chatted with @ajwerner. Given that this only enhances new-in-v21.1 functionality, he suggested it does not need an RN. I agree. Feel free to disagree if you see this in time, otherwise no worries and I'll leave this out for the beta 5 release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

kv: don't commit-wait on each columnBackfiller chunk

5 participants