
kvserver: slow_replication_threshold := 1m #76146

Merged
craig[bot] merged 1 commit into cockroachdb:master from tbg:breakers-on
Mar 22, 2022

Conversation

@tbg
Member

@tbg tbg commented Feb 7, 2022

Fixes #74705.

Release note: None

@cockroach-teamcity
Member

This change is Reviewable

@tbg
Member Author

tbg commented Feb 9, 2022

Running roachtest nightly off this branch to see if anything obvious transpires. https://teamcity.cockroachdb.com/viewQueued.html?itemId=4338356&tab=queuedBuildOverviewTab

@tbg
Member Author

tbg commented Feb 10, 2022

There's definitely some fallout. Even if I remove all of the (lost quorum: true) hits (where some nodes are non-live, so the breakers firing is fair game), I'm left with quite a few failures across multiple tests:

https://gist.github.com/tbg/e122a96d29aa6cf91adaeadcbffebab7

It makes sense for there to be some fallout. We tend to heavily overload clusters in some of the more heavyweight tests, and now any proposal that takes >15s to replicate can trip the breaker. The clients at the other end of the error are not going to retry breaker errors, so jobs and workloads that would previously just have waited for a long time now fail.

Without having investigated some of these failures in more depth (which I'll do), I wonder if outright tripping the breaker on a slow request is too aggressive. Perhaps instead we should "tickle" the breaker, letting the probe run, and the probe then determines whether to trip the breaker. I don't expect that to solve the problems here (since replication serializes all commands, a probe on top of a slow command will also be slow), but it is perhaps the "right" way to set things up to give fewer false positives in general, especially when we're looking to add triggers as discussed in #74712.

In the interim I'll experiment with a 2- or even 5-minute breaker timeout. I'd rather get the breakers turned on in some capacity than get sucked into an endless string of investigations into why proposals are taking longer than X in situation Y. It looks like I was way too optimistic with the initial choice of 15s.


Here are the tests that saw at least one such error that made it to artifacts (so presumably there were more of these errors in runs where they didn't fail the test, for example tpccbench):

$ rg 'unable to serve request' | grep -Eo '.*run_1' | sort | uniq
import/tpcc/warehouses=4000/geo/run_1
network/authentication/nodes=4/run_1
pgjdbc/run_1
restore2TB/nodes=6/cpus=8/pd-volume=2500GB/run_1
schemachange/random-load/run_1
tlp/run_1
tpcc/multiregion/survive=region/chaos=true/run_1

Going to look into each of them.

🐌 import/tpcc/warehouses=4000/geo

The breaker tripped during the import, failing the job, so we're seeing an attempt at a revert, which also fails on the breaker:

| Error: importing fixture: importing table history: pq: reverting execution from '2022-02-09 20:30:59.19108' to '2022-02-09 20:31:00.409991' on 1 failed: clearing data for table 109: clear range /Table/109 - /Table/109/PrefixEnd: raft status: {"id":"3","term":6,"vote":"0","commit":19,"lead":"2","raftState":"StateFollower","applied":19,"progress":{},"leadtransferee":"0"}: replica (n8,s8):3 unable to serve request to r234:/Table/1{09/1/3521/"\xe1d\xee\a\xf7\x8cH\x00\x80\x00\x00\x00\x06L%6"-10} [(n5,s5):1, (n3,s3):2, (n8,s8):3, next=4, gen=25, sticky=1644442184.367330914,0]

Asked the CDC team (internally) whether there's anything with jobs that I have to be careful about (will critical steps bail if they see a breaker error or retry? They have to retry). I think all is well but want expert confirmation.

Anyway, digging a bit more into the logs, just confirming:

teamcity-4338356-1644424858-47-n8cpu16-geo-0008> W220209 20:45:29.630388 280 kv/kvserver/pkg/kv/kvserver/replica_raft.go:1197 ⋮ [n8,s8,r2334/1:‹/Table/114/1/{649/6/…-1000/1…}›,raft] 1768 have been waiting 16.80s for slow proposal AddSSTable [‹/Table/114/1/649/6/-1134/4/0›,‹/Table/114/1/649/10/-2/14/0/NULL›)
teamcity-4338356-1644424858-47-n8cpu16-geo-0008> I220209 20:45:29.631054 47836 kv/kvserver/pkg/kv/kvserver/replica_circuit_breaker.go:258 ⋮ [n8,s8,r2334/1:‹/Table/114/1/{649/6/…-1000/1…}›] 1769 breaker: tripped with error: raft status: {"id":"1","term":6,"vote":"0","commit":13,"lead":"2","raftState":"StateFollower","applied":13,"progress":{},"leadtransferee":"0"}: replica (n8,s8):1 unable to serve request to r2334:‹/Table/114/1/{649/6/-1134/4-1000/1/-3001/1}› [(n8,s8):1, (n3,s3):2, (n2,s2):3, next=4, gen=103, sticky=1644443071.776777050,0]

This slowness has nothing to do with the breakers, but now it trips them. It's just a run-of-the-mill import, so presumably it'll affect production customers as well. So we definitely can't expect to get away with a 15s timeout. I wonder if we should consider special-casing bulk-related commands; I don't see how they would benefit from the breakers anyway. A job that hangs for a few minutes during an outage would be preferable to one that does extra work trying to revert what it already did, gets stuck there, and ultimately starts from scratch again. I'm starting an (internal) thread with bulk-io to discuss.

tpcc/multiregion/survive=region/chaos=true

This one is interesting. The test fails:

02:53:13 test_impl.go:312: test failure: tpcc.go:267,tpcc.go:577,test_runner.go:779: error at from 2022-02-10T02:12:15Z, to 2022-02-10T02:16:55Z on metric workload_tpcc_newOrder_error_total{instance="35.231.233.81:2121"}: expected <=200 errors, found from 840528.000000, to 846012.00

✅ network/authentication/nodes=4

This test intentionally partitions the network, and the test passed; we see a breaker error only during artifacts collection (while trying to run consistency checks). This means the breaker did what it was supposed to do: turn a hanging request into a fail-fast. Hooray!

// runNetworkAuthentication creates a network black hole to the leaseholder
// of system.users, and then validates that the time required to create
// new connections to the cluster afterwards remains under a reasonable limit.

✅ pgjdbc, schemachange/random-load, tlp, restore2TB/nodes=6/cpus=8/pd-volume=2500GB

These tests matched only accidentally (there's a copy of the test runner log in their directories, which happens to contain output produced by a different test).

@tbg tbg changed the title kvserver: slow_replication_threshold := 15s (base.SlowRequestThreshold) kvserver: slow_replication_threshold := TODO Feb 10, 2022
@tbg
Member Author

tbg commented Feb 10, 2022

5m timeout run just kicked off here: https://teamcity.cockroachdb.com/viewLog.html?buildId=4346831&

edit: no breakers tripped except in follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum where it is expected.

tbg added a commit to tbg/cockroach that referenced this pull request Feb 17, 2022
While running the nightly roachtest suite with circuit breakers enabled
(cockroachdb#76146) I observed undesirable interactions between circuit breakers
and bulk operations.

Bulk operations tend to overload the cluster. When this happens, circuit
breakers may fire "spuriously" (at the very least if liveness is lost
across multiple nodes for some time), which returns hard errors to the
bulk job. The job will then roll back any work it has done, which may
also fail due to the circuit breaker (leaving the job in limbo).

For long-running bulk jobs, it seems desirable to bypass the circuit
breaker entirely. When unavailability occurs, the job will not be able
to make progress, but it should be left to the operator whether to
cancel it or not; after all, once the outage resolves the job may
be able to resume normally.

This PR adds machinery that allows request types to bypass circuit
breakers. Concretely, any batch that contains any of the following
request types bypasses them:

- Export
- AddSSTable
- RevertRange
- ClearRange
- GC
- Probe

Touches cockroachdb#33007.

Release note: None
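The bypass rule described in the commit message above can be sketched as a simple membership check over the requests in a batch. This Go sketch uses plain strings for request types, a simplification of CockroachDB's actual request representation:

```go
package main

import "fmt"

// breakerExempt lists the request types that, per the commit message above,
// exempt a whole batch from the circuit breaker.
var breakerExempt = map[string]bool{
	"Export":      true,
	"AddSSTable":  true,
	"RevertRange": true,
	"ClearRange":  true,
	"GC":          true,
	"Probe":       true,
}

// bypassesBreaker reports whether a batch contains at least one exempt
// request type; if so, the whole batch skips the breaker check.
func bypassesBreaker(batch []string) bool {
	for _, req := range batch {
		if breakerExempt[req] {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(bypassesBreaker([]string{"Put", "AddSSTable"})) // bulk batch: bypasses
	fmt.Println(bypassesBreaker([]string{"Put", "Get"}))        // ordinary batch: does not
}
```

The design choice here is coarse on purpose: a bulk job would rather hang through an outage (and resume afterwards) than receive a hard breaker error mid-job and get stuck rolling back its own work.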
tbg added a commit to tbg/cockroach that referenced this pull request Feb 17, 2022
`RangeFeed`s are long-running operations with a high fixed cost
(catch-up scan). They can fall behind for various reasons, replicas
becoming unavailable being one of them. However, when a replica's
circuit breaker trips, this should not abort any pending (or new, for
that matter) RangeFeeds.

Add a test that asserts that this is the case.

Touches cockroachdb#33007.
Touches cockroachdb#76146.

Release note: None
craig bot pushed a commit that referenced this pull request Feb 17, 2022
76397: server: flush SQL stats during drain r=knz,Azhng a=cameronnunez

Fixes #72045.
Fixes #74413.

Previously, SQL stats would be lost when a node drains. Now a drain
triggers a flush of the SQL stats into the statement statistics
system table while the SQL layer is being drained.

Release note (cli change): a drain of node now ensures that
SQL statistics are not lost during the process; they are now
preserved in the statement statistics system table.

76697: opt: plan lookup anti-, semi-, and leftjoins on indexes with virtual columns r=mgartner a=mgartner

Fixes #75873

Release note: None

76732: kvserver: test that RangeFeed unaffected by Replica circuit breaker r=erikgrinaker a=tbg

`RangeFeed`s are long-running operations with a high fixed cost
(catch-up scan). They can fall behind for various reasons, replicas
becoming unavailable being one of them. However, when a replica's
circuit breaker trips, this should not abort any pending (or new, for
that matter) RangeFeeds.

Add a test that asserts that this is the case.

Touches #33007.
Touches #76146.

Release note: None


Co-authored-by: Cameron Nunez <cameron@cockroachlabs.com>
Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
tbg added a commit to tbg/cockroach that referenced this pull request Feb 21, 2022
kvserver: exempt certain requests from circuit breakers
tbg added a commit to tbg/cockroach that referenced this pull request Feb 21, 2022
kvserver: exempt certain requests from circuit breakers
@tbg
Member Author

tbg commented Feb 21, 2022

Doing another run, this time again with the aggressive 15s timeout, and #76673 cherry-picked, just to see what shakes out.
I'm only running import/tpcc/warehouses=4000/geo and tpcc/multiregion/survive=region/chaos=true, each three times.

https://teamcity.cockroachdb.com/viewLog.html?buildId=4424614

craig bot pushed a commit that referenced this pull request Feb 21, 2022
76673: kvserver: exempt certain requests from circuit breakers r=erikgrinaker a=tbg

While running the nightly roachtest suite with circuit breakers enabled
(#76146) I observed undesirable interactions between circuit breakers
and bulk operations.

Bulk operations tend to overload the cluster. When this happens, circuit
breakers may fire "spuriously" (at the very least if liveness is lost
across multiple nodes for some time), which returns hard errors to the
bulk job. The job will then roll back any work it has done, which may
also fail due to the circuit breaker (leaving the job in limbo).

For long-running bulk jobs, it seems desirable to bypass the circuit
breaker entirely. When unavailability occurs, the job will not be able
to make progress, but it should be left to the operator whether to
cancel it or not; after all, once the outage resolves the job may
be able to resume normally.

This PR adds machinery that allows request types to bypass circuit
breakers. Concretely, any batch that contains any of the following
request types bypasses them:

- Export
- AddSSTable
- RevertRange
- ClearRange
- GC
- Probe

Touches #33007.

Release note: None

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
@tbg
Member Author

tbg commented Feb 22, 2022

Ok that didn't work, #76881. Need to try this again but for now rebasing and holding off.

maryliag pushed a commit to maryliag/cockroach that referenced this pull request Feb 28, 2022
kvserver: exempt certain requests from circuit breakers
RajivTS pushed a commit to RajivTS/cockroach that referenced this pull request Mar 6, 2022
kvserver: test that RangeFeed unaffected by Replica circuit breaker
RajivTS pushed a commit to RajivTS/cockroach that referenced this pull request Mar 6, 2022
kvserver: exempt certain requests from circuit breakers
@tbg
Member Author

tbg commented Mar 17, 2022

Now that circuit breakers have been rewritten to affect only the replication layer (i.e. they don't prevent reads that could succeed), bulk requests bypass the breakers, and we've branched off release-21.2, this should be good to merge.

I conservatively bumped the circuit breaker threshold to 1m, but if nothing shakes out on master we might want to lower it a bit for the 22.2 release.
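A rough Go sketch of the rewritten behavior, where only the replication (write) path consults the breaker; the function name and boolean flags are illustrative, not the actual implementation:

```go
package main

import (
	"errors"
	"fmt"
)

// serve sketches the rewritten behavior: only requests that go through the
// replication layer (writes) consult the breaker, while reads that a
// replica can serve locally succeed even with the breaker tripped.
func serve(isWrite, breakerTripped bool) error {
	if isWrite && breakerTripped {
		return errors.New("unable to serve request: breaker tripped")
	}
	return nil // reads (and writes while healthy) go through
}

func main() {
	fmt.Println("read while tripped: ", serve(false, true))
	fmt.Println("write while tripped:", serve(true, true))
}
```

This narrows the blast radius of a tripped breaker: followers can keep serving follower reads and leaseholders can keep serving reads under a valid lease, while only replication-dependent traffic fails fast.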

@tbg tbg marked this pull request as ready for review March 17, 2022 15:21
@cockroach-teamcity
Member

cockroach-teamcity commented Mar 17, 2022

CLA assistant check
All committers have signed the CLA.


@blathers-crl blathers-crl bot added the O-community Originated from the community label Mar 17, 2022
@blathers-crl blathers-crl bot added the X-blathers-triaged blathers was able to find an owner label Mar 17, 2022
@tbg tbg requested a review from a team as a code owner March 17, 2022 19:10
@tbg tbg requested a review from erikgrinaker March 17, 2022 19:11
@erikgrinaker
Contributor

What's with the CI failure? Looks like the run timed out after 90 minutes. Should probably look into that.

@tbg
Member Author

tbg commented Mar 21, 2022

Good question. From the artifacts I can see that the last output was this:

? github.com/cockroachdb/cockroach/pkg/sql/lexbase/allkeywords [no test files]

at 19:56:19, then nothing until the test gets killed 40 minutes later. But we can see that what's running is TestLogic:

[screenshot omitted]

(see the logTestLogic directory) and its last timestamps match up with when the test was killed:

[screenshot omitted]

Unfortunately there's nothing else in these logs. It's just lots of under-replicated ranges. No slow requests (and thus also circuit breakers shouldn't be involved at all). The log file starts at 19:48:07.

So it looks like TestLogic timed out, albeit without leaving any trace as to what it was even doing at the time. 😢
Maybe something for @cockroachdb/sql-queries to look into improving.

I'm going to restart CI and see what I get.

@tbg tbg changed the title kvserver: slow_replication_threshold := TODO kvserver: slow_replication_threshold := 1m Mar 21, 2022
@tbg
Member Author

tbg commented Mar 21, 2022

Passed now. @erikgrinaker going to wait for your sign-off.

@tbg
Member Author

tbg commented Mar 21, 2022

bors r=erikgrinaker

@craig
Contributor

craig bot commented Mar 21, 2022

Build failed:

@tbg
Member Author

tbg commented Mar 22, 2022

bors r=erikgrinaker

The LogicTest timeouts were discussed here and seem to be a general problem right now.

@craig
Contributor

craig bot commented Mar 22, 2022

Build succeeded:


Labels

O-community Originated from the community X-blathers-triaged blathers was able to find an owner

Projects

None yet

Development

Successfully merging this pull request may close these issues.

kvserver: default replica circuit breakers to on

3 participants