
kvserver: slow_replication_threshold := 1m #76146

Merged
craig[bot] merged 1 commit into cockroachdb:master from tbg:breakers-on
Mar 22, 2022

Conversation

@tbg
Member

@tbg tbg commented Feb 7, 2022

Fixes #74705.

Release note: None

@cockroach-teamcity
Member

This change is Reviewable

@tbg
Member Author

tbg commented Feb 9, 2022

Running roachtest nightly off this branch to see if anything obvious transpires. https://teamcity.cockroachdb.com/viewQueued.html?itemId=4338356&tab=queuedBuildOverviewTab

@tbg
Member Author

tbg commented Feb 10, 2022

There's definitely some fallout. Even if I remove all of the (lost quorum: true) hits (where some nodes are non-live, so the breakers firing is fair game), I'm left with quite a few failures across multiple tests:

https://gist.github.com/tbg/e122a96d29aa6cf91adaeadcbffebab7

It makes sense for there to be some fallout. We tend to heavily overload clusters in some of the more heavyweight tests, and now any proposal that takes >15s to replicate can trip the breaker. The clients at the other end of the error are not going to retry breaker errors, so jobs and workloads that would previously just have waited for a long time now fail.

Without having investigated some of these failures in more depth (which I'll do), I wonder if outright tripping the breaker on a slow request is too aggressive. Perhaps instead we should "tickle" the breaker, letting the probe run, and the probe then determines whether to trip the breaker. I don't expect that to solve the problems here (since replication serializes all commands, a probe on top of a slow command will also be slow), but it is perhaps the "right" way to set things up to give fewer false positives in general, especially when we're looking to add triggers as discussed in #74712.

In the interim I'll experiment with a 2- or even 5-minute breaker timeout. I'd rather get the breakers turned on in some capacity than get sucked into an endless string of investigations into why proposals are taking longer than X in situation Y. It looks like I was way too optimistic with the initial choice of 15s.


Here are the tests that saw at least one such error that made it to artifacts (so presumably there were more of these errors in runs where they didn't fail the test, for example tpccbench):

$ rg 'unable to serve request' | grep -Eo '.*run_1' | sort | uniq
import/tpcc/warehouses=4000/geo/run_1
network/authentication/nodes=4/run_1
pgjdbc/run_1
restore2TB/nodes=6/cpus=8/pd-volume=2500GB/run_1
schemachange/random-load/run_1
tlp/run_1
tpcc/multiregion/survive=region/chaos=true/run_1

Going to look into each of them.

🐌 import/tpcc/warehouses=4000/geo

The breaker tripped during the import, failing the job, so we're seeing an attempt at a revert, which also fails on the breaker:

| Error: importing fixture: importing table history: pq: reverting execution from '2022-02-09 20:30:59.19108' to '2022-02-09 20:31:00.409991' on 1 failed: clearing data for table 109: clear range /Table/109 - /Table/109/PrefixEnd: raft status: {"id":"3","term":6,"vote":"0","commit":19,"lead":"2","raftState":"StateFollower","applied":19,"progress":{},"leadtransferee":"0"}: replica (n8,s8):3 unable to serve request to r234:/Table/1{09/1/3521/"\xe1d\xee\a\xf7\x8cH\x00\x80\x00\x00\x00\x06L%6"-10} [(n5,s5):1, (n3,s3):2, (n8,s8):3, next=4, gen=25, sticky=1644442184.367330914,0]

Asked the CDC team (internally) whether there's anything with jobs that I have to be careful about (will critical steps bail if they see a breaker error or retry? They have to retry). I think all is well but want expert confirmation.

Anyway, digging a bit more into the logs, just confirming:

teamcity-4338356-1644424858-47-n8cpu16-geo-0008> W220209 20:45:29.630388 280 kv/kvserver/pkg/kv/kvserver/replica_raft.go:1197 ⋮ [n8,s8,r2334/1:‹/Table/114/1/{649/6/…-1000/1…}›,raft] 1768 have been waiting 16.80s for slow proposal AddSSTable [‹/Table/114/1/649/6/-1134/4/0›,‹/Table/114/1/649/10/-2/14/0/NULL›)
teamcity-4338356-1644424858-47-n8cpu16-geo-0008> I220209 20:45:29.631054 47836 kv/kvserver/pkg/kv/kvserver/replica_circuit_breaker.go:258 ⋮ [n8,s8,r2334/1:‹/Table/114/1/{649/6/…-1000/1…}›] 1769 breaker: tripped with error: raft status: {"id":"1","term":6,"vote":"0","commit":13,"lead":"2","raftState":"StateFollower","applied":13,"progress":{},"leadtransferee":"0"}: replica (n8,s8):1 unable to serve request to r2334:‹/Table/114/1/{649/6/-1134/4-1000/1/-3001/1}› [(n8,s8):1, (n3,s3):2, (n2,s2):3, next=4, gen=103, sticky=1644443071.776777050,0]

This slowness has nothing to do with the breakers, but now it trips them. It's just a run-of-the-mill import, so presumably it'll affect production customers as well. So we definitely can't expect to get away with a 15s timeout. I wonder if we should consider special-casing bulk-related commands; I don't see how they would benefit from the breakers anyway. A job that hangs for a few minutes during an outage would be preferable to one that does extra work trying to revert what it already did, gets stuck there, and ultimately starts from scratch again. I'm starting an (internal) thread with bulk-io to discuss.

tpcc/multiregion/survive=region/chaos=true

This one is interesting. The test fails:

02:53:13 test_impl.go:312: test failure: tpcc.go:267,tpcc.go:577,test_runner.go:779: error at from 2022-02-10T02:12:15Z, to 2022-02-10T02:16:55Z on metric workload_tpcc_newOrder_error_total{instance="35.231.233.81:2121"}: expected <=200 errors, found from 840528.000000, to 846012.00

✅ network/authentication/nodes=4

This test intentionally partitions the network, and the test passed; we see a breaker error only during artifacts collection (while trying to run consistency checks). This means the breaker did what it was supposed to do: turn a hanging request into a fail-fast. Hooray!

// runNetworkAuthentication creates a network black hole to the leaseholder
// of system.users, and then validates that the time required to create
// new connections to the cluster afterwards remains under a reasonable limit.

✅ pgjdbc, schemachange/random-load, tlp, restore2TB/nodes=6/cpus=8/pd-volume=2500GB

These tests matched only accidentally (there's a copy of the test runner log in their directories, which happens to contain output produced by a different test).

@tbg tbg changed the title kvserver: slow_replication_threshold := 15s (base.SlowRequestThreshold) kvserver: slow_replication_threshold := TODO Feb 10, 2022
@tbg
Member Author

tbg commented Feb 10, 2022

5m timeout run just kicked off here: https://teamcity.cockroachdb.com/viewLog.html?buildId=4346831&

edit: no breakers tripped except in follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum where it is expected.

tbg added a commit to tbg/cockroach that referenced this pull request Feb 17, 2022
While running the nightly roachtest suite with circuit breakers enabled
(cockroachdb#76146) I observed undesirable interactions between circuit breakers
and bulk operations.

Bulk operations tend to overload the cluster. When this happens, circuit
breakers may fire "spuriously" (at the very least if liveness is lost
across multiple nodes for some time), which returns hard errors to the
bulk job. The job will then roll back any work it has done, which may
also fail due to the circuit breaker (leaving the job in limbo).

For long-running bulk jobs, it seems desirable to bypass the circuit
breaker entirely. When unavailability occurs, the job will not be able
to make progress, but it should be left to the operator whether to
cancel it or not; after all, once the outage resolves the job may
be able to resume normally.

This PR adds machinery that allows request types to bypass circuit
breakers. Concretely, any batch that contains any of the following
request types bypasses them:

- Export
- AddSSTable
- RevertRange
- ClearRange
- GC
- Probe

Touches cockroachdb#33007.

Release note: None
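The bypass rule described in the commit message above can be sketched as a simple membership check over the requests in a batch. This Go sketch uses plain strings for request types, a simplification of CockroachDB's actual request representation:

```go
package main

import "fmt"

// breakerExempt lists the request types that, per the commit message above,
// exempt a whole batch from the circuit breaker.
var breakerExempt = map[string]bool{
	"Export":      true,
	"AddSSTable":  true,
	"RevertRange": true,
	"ClearRange":  true,
	"GC":          true,
	"Probe":       true,
}

// bypassesBreaker reports whether a batch contains at least one exempt
// request type; if so, the whole batch skips the breaker check.
func bypassesBreaker(batch []string) bool {
	for _, req := range batch {
		if breakerExempt[req] {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(bypassesBreaker([]string{"Put", "AddSSTable"})) // bulk batch: bypasses
	fmt.Println(bypassesBreaker([]string{"Put", "Get"}))        // ordinary batch: does not
}
```

The design choice here is coarse on purpose: a bulk job would rather hang through an outage (and resume afterwards) than receive a hard breaker error mid-job and get stuck rolling back its own work.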
tbg added a commit to tbg/cockroach that referenced this pull request Feb 17, 2022
`RangeFeed`s are long-running operations with a high fixed cost
(catch-up scan). They can fall behind for various reasons, replicas
becoming unavailable being one of them. However, when a replica's
circuit breaker trips, this should not abort any pending (or new, for
that matter) RangeFeeds.

Add a test that asserts that this is the case.

Touches cockroachdb#33007.
Touches cockroachdb#76146.

Release note: None
craig bot pushed a commit that referenced this pull request Feb 17, 2022
76397: server: flush SQL stats during drain r=knz,Azhng a=cameronnunez

Fixes #72045.
Fixes #74413.

Previously, SQL stats would be lost when a node drains. Now a drain
triggers a flush of the SQL stats into the statement statistics
system table while the SQL layer is being drained.

Release note (cli change): a drain of node now ensures that
SQL statistics are not lost during the process; they are now
preserved in the statement statistics system table.

76697: opt: plan lookup anti-, semi-, and leftjoins on indexes with virtual columns r=mgartner a=mgartner

Fixes #75873

Release note: None

76732: kvserver: test that RangeFeed unaffected by Replica circuit breaker r=erikgrinaker a=tbg

`RangeFeed`s are long-running operations with a high fixed cost
(catch-up scan). They can fall behind for various reasons, replicas
becoming unavailable being one of them. However, when a replica's
circuit breaker trips, this should not abort any pending (or new, for
that matter) RangeFeeds.

Add a test that asserts that this is the case.

Touches #33007.
Touches #76146.

Release note: None


Co-authored-by: Cameron Nunez <cameron@cockroachlabs.com>
Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
tbg added a commit to tbg/cockroach that referenced this pull request Feb 21, 2022
kvserver: exempt certain requests from circuit breakers
tbg added a commit to tbg/cockroach that referenced this pull request Feb 21, 2022
kvserver: exempt certain requests from circuit breakers
@tbg
Member Author

tbg commented Feb 21, 2022

Doing another run, this time again with the aggressive 15s timeout, and #76673 cherry-picked, just to see what shakes out.
I'm only running import/tpcc/warehouses=4000/geo and tpcc/multiregion/survive=region/chaos=true, each three times.

https://teamcity.cockroachdb.com/viewLog.html?buildId=4424614

craig bot pushed a commit that referenced this pull request Feb 21, 2022
76673: kvserver: exempt certain requests from circuit breakers r=erikgrinaker a=tbg

While running the nightly roachtest suite with circuit breakers enabled
(#76146) I observed undesirable interactions between circuit breakers
and bulk operations.

Bulk operations tend to overload the cluster. When this happens, circuit
breakers may fire "spuriously" (at the very least if liveness is lost
across multiple nodes for some time), which returns hard errors to the
bulk job. The job will then roll back any work it has done, which may
also fail due to the circuit breaker (leaving the job in limbo).

For long-running bulk jobs, it seems desirable to bypass the circuit
breaker entirely. When unavailability occurs, the job will not be able
to make progress, but it should be left to the operator whether to
cancel it or not; after all, once the outage resolves the job may
be able to resume normally.

This PR adds machinery that allows request types to bypass circuit
breakers. Concretely, any batch that contains any of the following
request types bypasses them:

- Export
- AddSSTable
- RevertRange
- ClearRange
- GC
- Probe

Touches #33007.

Release note: None

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
@tbg
Member Author

tbg commented Feb 22, 2022

Ok that didn't work, #76881. Need to try this again but for now rebasing and holding off.

maryliag pushed a commit to maryliag/cockroach that referenced this pull request Feb 28, 2022
kvserver: exempt certain requests from circuit breakers
RajivTS pushed a commit to RajivTS/cockroach that referenced this pull request Mar 6, 2022
kvserver: test that RangeFeed unaffected by Replica circuit breaker
RajivTS pushed a commit to RajivTS/cockroach that referenced this pull request Mar 6, 2022
kvserver: exempt certain requests from circuit breakers
@tbg
Member Author

tbg commented Mar 17, 2022

Now that circuit breakers have been rewritten to affect only the replication layer (i.e. they don't prevent reads that could succeed), bulk requests bypass the breakers, and we've branched off release-21.2, this should be good to merge.

I conservatively bumped the circuit breaker threshold to 1m, but if nothing shakes out on master we might want to lower it a bit for the 22.2 release.
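A rough Go sketch of the rewritten behavior, where only the replication (write) path consults the breaker; the function name and boolean flags are illustrative, not the actual implementation:

```go
package main

import (
	"errors"
	"fmt"
)

// serve sketches the rewritten behavior: only requests that go through the
// replication layer (writes) consult the breaker, while reads that a
// replica can serve locally succeed even with the breaker tripped.
func serve(isWrite, breakerTripped bool) error {
	if isWrite && breakerTripped {
		return errors.New("unable to serve request: breaker tripped")
	}
	return nil // reads (and writes while healthy) go through
}

func main() {
	fmt.Println("read while tripped: ", serve(false, true))
	fmt.Println("write while tripped:", serve(true, true))
}
```

This narrows the blast radius of a tripped breaker: followers can keep serving follower reads and leaseholders can keep serving reads under a valid lease, while only replication-dependent traffic fails fast.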

@tbg tbg marked this pull request as ready for review March 17, 2022 15:21
@cockroach-teamcity
Member

cockroach-teamcity commented Mar 17, 2022

CLA assistant check
All committers have signed the CLA.


@blathers-crl blathers-crl bot added the O-community Originated from the community label Mar 17, 2022
@blathers-crl blathers-crl bot added the X-blathers-triaged blathers was able to find an owner label Mar 17, 2022
@tbg tbg requested a review from a team as a code owner March 17, 2022 19:10
@tbg tbg requested a review from erikgrinaker March 17, 2022 19:11
@erikgrinaker
Contributor

What's with the CI failure? Looks like the run timed out after 90 minutes. Should probably look into that.

@tbg
Member Author

tbg commented Mar 21, 2022

Good question. From the artifacts I can see that the last output was this:

? github.com/cockroachdb/cockroach/pkg/sql/lexbase/allkeywords [no test files]

at 19:56:19, then nothing until the test gets killed 40 minutes later. But we can see that what's running is TestLogic:

[screenshot omitted]

(see the logTestLogic directory) and its last timestamps match up with when the test was killed:

[screenshot omitted]

Unfortunately there's nothing else in these logs. It's just lots of under-replicated ranges. No slow requests (and thus also circuit breakers shouldn't be involved at all). The log file starts at 19:48:07.

So it looks like TestLogic timed out, albeit without leaving any trace as to what it was even doing at the time. 😢
Maybe something for @cockroachdb/sql-queries to look into improving.

I'm going to restart CI and see what I get.

@tbg tbg changed the title kvserver: slow_replication_threshold := TODO kvserver: slow_replication_threshold := 1m Mar 21, 2022
@tbg
Member Author

tbg commented Mar 21, 2022

Passed now. @erikgrinaker going to wait for your sign-off.

@tbg
Member Author

tbg commented Mar 21, 2022

bors r=erikgrinaker

@craig
Contributor

craig bot commented Mar 21, 2022

Build failed:

@tbg
Member Author

tbg commented Mar 22, 2022

bors r=erikgrinaker

The LogicTest timeouts were discussed here and seem to be a general problem right now.

@craig
Contributor

craig bot commented Mar 22, 2022

Build succeeded:


Labels

O-community Originated from the community X-blathers-triaged blathers was able to find an owner

Projects

None yet

Development

Successfully merging this pull request may close these issues.

kvserver: default replica circuit breakers to on

3 participants