
kv: don't hold latches while rate limiting ExportRequests #66338

Merged
craig[bot] merged 1 commit into cockroachdb:master from nvb:nvanbenschoten/exportRateLimit
Jun 11, 2021

Conversation

@nvb
Contributor

@nvb nvb commented Jun 10, 2021

This commit moves ExportRequest rate limiting up above evaluation and
outside of latching. This ensures that if an ExportRequest is
rate-limited, it does not inadvertently block others while waiting.

Interestingly, we were already careful about this for AddSSTableRequests
because it is easier for them to block others while holding latches. In
the case of read-only requests like ExportRequest, blocking others is
less common. However, it is still possible. Notably, MVCC write requests
with lower timestamps than the read will still block. Additionally,
non-MVCC requests like range splits will block. In one customer
investigation, we found that an export request was holding latches and
blocking a non-MVCC request (a range split) which was transitively
blocking all write traffic to the range and all read traffic to the RHS
of the range. We believe that the stall lasted for so long (seconds to
minutes) because the ExportRequest was throttled while holding its
latches.

I did notice that some of this code was just touched by #66092, so any
potential backport here may be a little involved.

Release note (bug fix): Backups no longer risk the possibility of
blocking conflicting writes while being rate limited by the
kv.bulk_io_write.concurrent_export_requests concurrency limit.

/cc. @cockroachdb/kv

@nvb nvb requested review from a team, aayushshah15, andreimatei and dt June 10, 2021 22:15
@cockroach-teamcity
Member

This change is Reviewable

@nvb nvb force-pushed the nvanbenschoten/exportRateLimit branch from 2d32435 to f2b537b on June 10, 2021 22:50
Contributor

@andreimatei andreimatei left a comment

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15 and @nvanbenschoten)


pkg/util/limit/limiter.go, line 28 at r1 (raw file):

}

// Reservation is a allocation from a limiter which should be released once the

"an allocation"


pkg/util/limit/limiter.go, line 30 at r1 (raw file):

// Reservation is a allocation from a limiter which should be released once the
// limited task has been completed.
type Reservation interface {

Is this new interface necessary? Can't clients work on quota pool allocs?

Contributor

@aayushshah15 aayushshah15 left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvserver/store_send.go, line 295 at r1 (raw file):

		// Limit the number of concurrent Export requests, as these often scan an
		// entire Range at a time and place significant read load on a Store.
		return s.limiters.ConcurrentExportRequests.Begin(ctx)

Do you think it's worth emitting a trace event or a log message (like above) of some sort?

@nvb nvb force-pushed the nvanbenschoten/exportRateLimit branch from f2b537b to cbc0810 on June 11, 2021 19:35
Contributor Author

@nvb nvb left a comment

TFTRs!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15 and @andreimatei)


pkg/kv/kvserver/store_send.go, line 295 at r1 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

Do you think it's worth emitting a trace event or a log message (like above) of some sort?

Yes, done.


pkg/util/limit/limiter.go, line 28 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

"an allocation"

Done.


pkg/util/limit/limiter.go, line 30 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Is this new interface necessary? Can't clients work on quota pool allocs?

We don't need it, but I found it strange that the limit package would be returning a *quotapool.IntAlloc. Hiding this behind an interface is free and seemed like a good idea.
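The design choice described here (returning a small interface rather than the quota pool's concrete type) can be sketched like so; `intAlloc` and `begin` are illustrative stand-ins for `*quotapool.IntAlloc` and the limiter's acquire method, not the real API:

```go
package main

import "fmt"

// intAlloc stands in for the quota pool's concrete allocation type
// (*quotapool.IntAlloc in the real code); names here are illustrative.
type intAlloc struct{ n int64 }

// Release returns the allocation's quota to the pool. Here it just
// records that it ran.
func (a *intAlloc) Release() { a.n = 0 }

// Reservation is the small interface the limit package can return
// instead of exposing the quota pool's concrete type to its clients.
type Reservation interface {
	Release()
}

// begin hands the concrete alloc back behind the interface, so callers
// depend only on Release and the quota pool stays an implementation
// detail of the limit package.
func begin(n int64) Reservation {
	return &intAlloc{n: n}
}

func main() {
	res := begin(1)
	res.Release()
	fmt.Println("reservation released")
}
```

Because Go interfaces are satisfied implicitly, wrapping the concrete type this way costs nothing at the call sites while keeping the package boundary clean.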

@nvb
Contributor Author

nvb commented Jun 11, 2021

bors r+

@craig
Contributor

craig bot commented Jun 11, 2021

Build succeeded.
