kvserver: slow_replication_threshold := 1m #76146
craig[bot] merged 1 commit into cockroachdb:master
Conversation
Running roachtest nightly off this branch to see if anything obvious transpires. https://teamcity.cockroachdb.com/viewQueued.html?itemId=4338356&tab=queuedBuildOverviewTab
There's definitely some fallout. Even if I remove all of the https://gist.github.com/tbg/e122a96d29aa6cf91adaeadcbffebab7

It makes sense for there to be some fallout. We tend to overload clusters quite a bit in some of the more heavyweight tests, and now any proposal that takes >15s to replicate can trigger the breaker. The clients at the other end of the error are not going to retry the breaker errors, so jobs and workloads fail that would previously just have waited for a long time.

Without having investigated some of these failures in more depth (which I'll do), I wonder whether outright tripping the breaker on a slow request is too aggressive. Perhaps instead we should "tickle" the breaker, letting the probe run, and the probe then determines whether to trip the breaker. I don't expect that to solve the problems here (since replication serializes all commands, if a command is slow then a probe on top of it will also be slow), but it is perhaps the "right" way to set things up to give fewer false positives in general, especially when we're looking to add triggers as discussed in #74712.

In the interim I'll experiment with a 2- or even 5-minute breaker timeout. I'd rather get the breakers turned on in some capacity than get sucked into an endless string of investigations into why proposals are taking longer than X in situation Y. It looks like I was way too optimistic with the initial choice of 15s.

Here are the tests that saw at least one such error (that made it to artifacts, so presumably there were more of these errors in places where they may not have failed the test - for example

Going to look into each of them.

🐌 import/tpcc/warehouses=4000/geo

The breaker tripped during the import, failing the job, so we're seeing an attempt at a revert, which also fails on the breaker:
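As an aside, the "tickle the breaker" idea mentioned above could look roughly like the following. This is an illustrative Go sketch only, assuming invented `breaker`, `onSlowProposal`, and `probe` names; it is not CockroachDB's actual breaker API. The point is that a slow proposal merely launches a probe, and only a failed probe trips the breaker.

```go
package main

import "fmt"

// breaker is a toy replica circuit breaker (hypothetical, for illustration).
type breaker struct {
	tripped bool
	probing bool
}

// onSlowProposal is called when a proposal has been pending longer than the
// slow-replication threshold. Instead of tripping the breaker outright, it
// "tickles" it: a probe is started if one isn't already running.
func (b *breaker) onSlowProposal() {
	if b.tripped || b.probing {
		return
	}
	b.probing = true
}

// probe runs the (hypothetical) health check and trips the breaker only if
// the check fails, reducing false positives from transiently slow proposals.
func (b *breaker) probe(replicaHealthy func() bool) {
	if !b.probing {
		return
	}
	b.probing = false
	if !replicaHealthy() {
		b.tripped = true
	}
}

func main() {
	b := &breaker{}
	b.onSlowProposal()                    // slow proposal observed: start probe
	b.probe(func() bool { return true })  // probe succeeds: breaker stays closed
	fmt.Println("tripped after healthy probe:", b.tripped)

	b.onSlowProposal()
	b.probe(func() bool { return false }) // probe fails: breaker trips
	fmt.Println("tripped after failed probe:", b.tripped)
}
```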
Asked the CDC team (internally) whether there's anything with jobs that I have to be careful about (will critical steps bail if they see a breaker error, or retry? They have to retry). I think all is well, but want expert confirmation. Anyway, digging a bit more into the logs, just confirming:
This slowness has nothing to do with the breakers, but now it trips them. It's just a run-of-the-mill import, so presumably it'll affect production customers as well. So we definitely can't expect to get away with a 15s timeout. I wonder if we should consider special-casing bulk-related commands; I don't see how they would benefit from the breakers anyway. A job that hangs for a few minutes during an outage would be preferable to one that then does extra work trying to revert what it already did, getting stuck there, and ultimately starting from scratch again. I'm starting an (internal) thread with bulk-io to discuss.

tpcc/multiregion/survive=region/chaos=true

This one is interesting. The test fails:
✅ network/authentication/nodes=4

This test intentionally partitions the network, and the test passed, but we see a breaker error during artifacts collection (while trying to run consistency checks). This means it did what it was supposed to do: turn a hanging request into a fail-fast. Hooray!

cockroach/pkg/cmd/roachtest/tests/network.go Lines 110 to 112 in 21a286f

✅ pgjdbc, schemachange/random-load, tlp, restore2TB/nodes=6/cpus=8/pd-volume=2500GB

These tests matched only accidentally (there's a copy of the test runner log in that directory, which happens to contain this output produced by a different test).
5m timeout run just kicked off here: https://teamcity.cockroachdb.com/viewLog.html?buildId=4346831&

edit: no breakers tripped except in
While running the nightly roachtest suite with circuit breakers enabled (cockroachdb#76146) I observed undesirable interactions between circuit breakers and bulk operations. Bulk operations tend to overload the cluster. When this happens, circuit breakers may fire "spuriously" (at the very least if liveness is lost across multiple nodes for some time), which returns hard errors to the bulk job. The job will then roll back any work it has done, which may also fail due to the circuit breaker (leaving the job in limbo). For long-running bulk jobs, it seems desirable to bypass the circuit breaker entirely. When unavailability occurs, the job will not be able to make progress, but it should be left to the operator whether to cancel it or not; after all, once the outage resolves the job may be able to resume normally. This PR adds machinery that allows request types to bypass circuit breakers. Concretely, any batch that contains any of the following request types bypasses them:

- Export
- AddSSTable
- RevertRange
- ClearRange
- GC
- Probe

Touches cockroachdb#33007.

Release note: None
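A minimal sketch of what such a bypass check could look like, assuming a hypothetical `requestKind` enumeration; the real implementation operates on CockroachDB's `roachpb` batch types, which are not reproduced here:

```go
package main

import "fmt"

// requestKind is a hypothetical stand-in for CockroachDB's request types.
type requestKind string

const (
	export      requestKind = "Export"
	addSSTable  requestKind = "AddSSTable"
	revertRange requestKind = "RevertRange"
	clearRange  requestKind = "ClearRange"
	gc          requestKind = "GC"
	probe       requestKind = "Probe"
	put         requestKind = "Put" // a normal write, for contrast
)

// bypassesBreaker lists the request types from the PR description that are
// exempt from the replica circuit breaker.
var bypassesBreaker = map[requestKind]bool{
	export: true, addSSTable: true, revertRange: true,
	clearRange: true, gc: true, probe: true,
}

// batchBypassesBreaker reports whether a batch contains any request of a
// kind that should skip the circuit breaker.
func batchBypassesBreaker(batch []requestKind) bool {
	for _, r := range batch {
		if bypassesBreaker[r] {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(batchBypassesBreaker([]requestKind{put}))             // false
	fmt.Println(batchBypassesBreaker([]requestKind{put, addSSTable})) // true
}
```

Note the "any request in the batch" semantics: one exempt request type is enough to exempt the whole batch, which keeps mixed bulk batches from failing fast halfway through.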
`RangeFeed`s are long-running operations with a high fixed cost (catch-up scan). They can fall behind for various reasons, replicas becoming unavailable being one of them. However, when a replica's circuit breaker trips, this should not abort any pending (or new, for that matter) RangeFeeds. Add a test that asserts that this is the case. Touches cockroachdb#33007. Touches cockroachdb#76146. Release note: None
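The behavior under test can be illustrated with a toy model: writes (proposals) fail fast once the breaker trips, while a rangefeed-style consumer keeps draining already-replicated events. All types here are invented for the example and do not mirror CockroachDB's actual `RangeFeed` implementation.

```go
package main

import "fmt"

// replica is a toy model: a breaker flag plus a buffered stream of
// already-replicated events (hypothetical, for illustration).
type replica struct {
	breakerTripped bool
	events         chan string
}

// propose fails fast when the breaker is tripped, since new writes would
// need to go through the (unavailable) replication path.
func (r *replica) propose(ev string) error {
	if r.breakerTripped {
		return fmt.Errorf("breaker tripped")
	}
	r.events <- ev
	return nil
}

// rangefeed drains n events regardless of breaker state: consuming
// already-replicated events does not require new replication.
func (r *replica) rangefeed(n int) []string {
	out := make([]string, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, <-r.events)
	}
	return out
}

func main() {
	r := &replica{events: make(chan string, 10)}
	_ = r.propose("a")
	_ = r.propose("b")
	r.breakerTripped = true    // trip the breaker after two writes
	err := r.propose("c")      // new writes now fail fast
	fmt.Println("write rejected:", err != nil)
	fmt.Println("rangefeed still delivers:", r.rangefeed(2))
}
```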
76397: server: flush SQL stats during drain r=knz,Azhng a=cameronnunez

Fixes #72045. Fixes #74413. Previously, SQL stats would be lost when a node drains. Now a drain triggers a flush of the SQL stats into the statement statistics system table while the SQL layer is being drained. Release note (cli change): a node drain now ensures that SQL statistics are not lost during the process; they are preserved in the statement statistics system table.

76697: opt: plan lookup anti-, semi-, and left joins on indexes with virtual columns r=mgartner a=mgartner

Fixes #75873. Release note: None

76732: kvserver: test that RangeFeed unaffected by Replica circuit breaker r=erikgrinaker a=tbg

`RangeFeed`s are long-running operations with a high fixed cost (catch-up scan). They can fall behind for various reasons, replicas becoming unavailable being one of them. However, when a replica's circuit breaker trips, this should not abort any pending (or new, for that matter) RangeFeeds. Add a test that asserts that this is the case. Touches #33007. Touches #76146. Release note: None

Co-authored-by: Cameron Nunez <cameron@cockroachlabs.com>
Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Doing another run, this time again with the aggressive 15s timeout, and #76673 cherry-picked, just to see what shakes out. https://teamcity.cockroachdb.com/viewLog.html?buildId=4424614
76673: kvserver: exempt certain requests from circuit breakers r=erikgrinaker a=tbg

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Ok, that didn't work: #76881. Need to try this again, but for now rebasing and holding off.
Now that circuit breakers have been rewritten to affect only the replication layer (i.e. they don't prevent reads that could succeed), bulk requests bypass the breakers, and we've branched off release-21.2, this should be good to merge. I conservatively bumped the circuit breaker threshold to 1m, but if nothing shakes out on master we might want to lower it a bit for the 22.2 release.
What's with the CI failure? Looks like the run timed out after 90 minutes. Should probably look into that.
Good question. From the artifacts I can see that the last output was this:
at 19:56:19, then nothing until the test gets killed 40 minutes later. But we can see that what's running is TestLogic. Unfortunately there's nothing else in these logs. It's just lots of under-replicated ranges. No slow requests (and thus circuit breakers also shouldn't be involved at all). The log file starts at 19:48:07. So it looks like TestLogic timed out, albeit without leaving any trace as to what it was even doing at the time. 😢 I'm going to restart CI and see what I get.
Fixes cockroachdb#74705. Release note: None
Passed now. @erikgrinaker going to wait for your sign-off. |
bors r=erikgrinaker |
Build failed: |
bors r=erikgrinaker

The LogicTest timeouts were discussed here and seem to be a general problem right now.
Build succeeded: |


Fixes #74705.
Release note: None